Leader and breakaway detection in racing sports videos

This paper addresses the important problem of leader detection in racing sports videos (e.g., cycling, boating and car racing events), as proper framing of the leader is a pivotal issue in racing sports cinematography, where the events have a linear spatial deployment. Over the last few years, as autonomous drone vision and cinematography emerged, new challenges appeared in drone vision. While, until recently, most computer vision methods addressed still-camera AV footage, drone sports cinematography typically employs moving cameras. In this paper, we solve the problem of leader detection in a group of similarly moving targets in sports videos, e.g., detecting the leader of a cyclist group and his/her breakaway during a cycling event. This is very useful in drone sports cinematography, as it is important that the drone camera automatically centers on such a leader. We demonstrate that the novel method described in this paper can effectively solve the problem of leader detection in sports videos.


I. INTRODUCTION
As drone sports cinematography evolves over time, there is a constant need for automation. Sports filming tasks, which previously required a human operator, have reached the stage where quick decisions are compulsory and automation seems to be inevitable. When filming a group of moving targets (e.g., athletes) moving in the same direction, in conventional human-operated cinematography, the operator usually targets the camera either on the gravitational center of the athlete group or on the leader of this group. In this sense, leader target detection would enable automatic leader framing by the drone cinematography camera.
The novel leader detection method proposed in this paper uses global optical flow in order to estimate the camera motion direction. The underlying assumption is that the drone already follows the athlete group, either from above or from a lateral position, according to the chosen drone cinematography mode [1]. A visual target (object) detector and tracker is employed for finding regions of interest (ROIs) of the targets (athletes) on the image plane. In the next step, the target ROI centers are projected on the optical flow unit direction vector and, lastly, the leading target (athlete) is detected. Moreover, the athlete winning order (1st, 2nd, 3rd, etc.), as well as their spatial distribution over time, can be determined, thus providing very useful information for computational racing sports coaching.
This work has received funding from the European Union's Horizon 2020 research and innovation programmes under grant agreements No 731667 (MULTIDRONE) and No 951911 (AI4Media).
Furthermore, in racing sports, a breakaway is the event where, starting from a spatially compact racer group, one athlete accelerates and quickly distances himself/herself from the rest of the athlete group. In this paper, we solve the breakaway detection problem as well, by introducing additional target breakaway detection metrics and constraints.

II. RELATED WORK
Although there have been numerous works that study athletic performance at an individual athlete level, primarily by modeling biomechanics [2]-[4], there has been no research on the kinematics and dynamics of racing sports involving a group of athletes using only visual information, at least to the authors' knowledge.

III. LEADER DETECTION
The first step to leading athlete (leader) detection is visual athlete detection on a video frame, producing athlete ROIs, as shown in Figure 1. Alternatively, semantic segmentation [5]-[7] can be utilized for better localization accuracy. Global motion, which is primarily due to the drone motion and the ensuing drone cinematography camera motion, is then estimated.
In such a framework, the target group motion direction can be estimated by assuming that the drone camera moves along a 3D line that is parallel to the 3D target trajectory, while keeping the camera axis almost perpendicular to the 3D drone motion direction and the group of racers inside the video frame, as shown in Figure 2. This means that the targets are almost static on the image plane, while there is global 2D motion in the opposite direction. Thus, detecting the global motion direction can help in determining the actual target group motion direction. The following drone cinematography camera motion types can meet these requirements: Lateral Tracking Shot (LTS) and Vertical Tracking Shot (VTS), according to the UAV shot type taxonomy recently defined and formalized in [1], [8] and [9]. This is not a major limitation, as LTS and VTS are the primary drone cinematography motion types used when the director wants to frame the leader. The gimbal points at the center of gravity of the athlete group. The drone camera focal length should be fixed. Furthermore, the video framing should be of long shot type [1], so that many targets appear in the video frame.
Video frames can be heavily subsampled, e.g., to 144 × 256 pixel images, as this resolution is sufficient for our goals. The optical flow estimation algorithm of [10] has proven to be reliable and fast, as it takes approximately 7.4 ms for optical flow estimation on such subsampled video frames. The 2D target group motion unit direction vector estimate $-\mathbf{v}$, shown in Figure 1, is essentially the reverse of the 2D global motion unit direction vector estimate $\mathbf{v}$, which can be found by, e.g., averaging the estimated optical flow vectors. To avoid excessive computational cost, only a few optical flow vectors at chosen locations are used for the mean optical flow vector calculation. To prevent outliers from affecting $\mathbf{v}$, zero motion vectors at target locations are removed, as targets appear static on the image plane. The Local Outlier Factor algorithm [11] is then used to detect and remove other possible optical flow outliers. Another possibility is to use either a vector median [12] or angular statistics [13] to find $\mathbf{v}$.
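The averaging step described above can be illustrated with a minimal sketch. It assumes the optical flow has already been sampled into a list of `(dx, dy)` pairs; the function name `estimate_group_direction` and the `min_magnitude` cutoff are hypothetical, and the Local Outlier Factor stage of [11] is deliberately left out, so this is not the full pipeline of the paper.

```python
import math

def estimate_group_direction(flow_vectors, min_magnitude=1e-3):
    """Return the unit vector -v (target group direction) from sampled flow.

    flow_vectors: iterable of (dx, dy) optical flow samples (assumed input).
    min_magnitude: hypothetical cutoff below which a vector counts as zero.
    """
    # Drop (near-)zero vectors, which the text attributes to targets that
    # appear static on the image plane.
    moving = [(dx, dy) for dx, dy in flow_vectors
              if math.hypot(dx, dy) > min_magnitude]
    if not moving:
        raise ValueError("no non-zero optical flow vectors")

    # Mean optical flow vector over the remaining samples.
    mean_dx = sum(dx for dx, _ in moving) / len(moving)
    mean_dy = sum(dy for _, dy in moving) / len(moving)

    norm = math.hypot(mean_dx, mean_dy)
    if norm == 0.0:
        raise ValueError("mean flow is zero; direction undefined")

    # v is the global motion unit vector; the targets move along -v.
    v = (mean_dx / norm, mean_dy / norm)
    return (-v[0], -v[1])
```

For instance, a background that drifts to the left on the image plane yields a target group direction pointing to the right. A vector median or angular statistics, as mentioned in the text, could replace the plain mean when outliers remain.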
The ordering of the targets/athletes and leader detection can be easily achieved by projecting the target ROI centers on $-\mathbf{v}$, as depicted in Figure 1. Let $c_{(1)}, c_{(2)}, \dots, c_{(n)}$ be the targets of the set $C$ in their ordered rank [12] along the motion direction, $\mathbf{o}_{(1)}, \mathbf{o}_{(2)}, \dots, \mathbf{o}_{(n)}$ their corresponding coordinates and $p_{(1)}, p_{(2)}, \dots, p_{(n)}$ their corresponding projections on $-\mathbf{v}$. Then $c_{(n)}$ and $c_{(1)}$ denote the leader and the laggard, respectively. During breakaway, the distance of the leader location $\mathbf{o}_{(n)}$ from the mean target group location $\mathbf{m} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{o}_{(i)}$ and/or from the second target $\mathbf{o}_{(n-1)}$ increases sharply over time, thus indicating breakaway. Therefore, we propose the following metrics that can indicate the breakaway occurrence.
Fig. 2: Proper UAV cinematography camera motion with respect to the motion of the targets.
The distance to variance ratio has two variations. The first one is the ratio of the squared distance between the first target location $\mathbf{o}_{(n)}$ and the mean $\tilde{\mathbf{m}}$ of the coordinates of the target ensemble $C - \{c_{(n)}\}$, divided by its variance:

$$ r_1 = \frac{\|\mathbf{o}_{(n)} - \tilde{\mathbf{m}}\|_2^2}{\sum_{\mathbf{o} \in \tilde{O}} \|\mathbf{o} - \tilde{\mathbf{m}}\|_2^2} \qquad (1) $$

where $\tilde{O}$ contains the target coordinates corresponding to $C - \{c_{(n)}\}$. The second one is the ratio of the squared distance between the leader target location $\mathbf{o}_{(n)}$ and the second target location $\mathbf{o}_{(n-1)}$, divided by the variance of the target ensemble $\tilde{O}$:

$$ r_2 = \frac{\|\mathbf{o}_{(n)} - \mathbf{o}_{(n-1)}\|_2^2}{\sum_{\mathbf{o} \in \tilde{O}} \|\mathbf{o} - \tilde{\mathbf{m}}\|_2^2} \qquad (2) $$

However, these metrics have a limitation. They assume that the spatial target location variance $\sum_{\mathbf{o} \in \tilde{O}} \|\mathbf{o} - \tilde{\mathbf{m}}\|_2^2$ is almost static. This is not always true, since targets can move independently, hence altering $r_1$ and $r_2$ drastically without an actual breakaway event occurrence. Additionally, laggards affect the target group position variance, hence $r_1$ and $r_2$ as well. Finally, both $r_1$ and $r_2$ depend on $\tilde{\mathbf{m}}$, whose estimation can become unreliable if targets start disappearing from the video frame, due to a large target group spatial spread or due to changes in camera focal length.
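The ranking by projection and the two distance to variance ratios can be sketched as follows. This is only an illustration under stated assumptions: ROI centers are plain `(x, y)` tuples, the variance term is taken as the unnormalized sum of squared deviations from the mean of the non-leader targets, and the helper names (`rank_targets`, `distance_to_variance_ratios`) are hypothetical.

```python
def project(point, direction):
    """Scalar projection of a 2D point on a unit direction vector."""
    return point[0] * direction[0] + point[1] * direction[1]

def rank_targets(centers, direction):
    """Sort ROI centers along the group direction -v.

    The last element is the leader c_(n), the first one the laggard c_(1).
    """
    return sorted(centers, key=lambda c: project(c, direction))

def sq_dist(a, b):
    """Squared Euclidean distance between two 2D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def distance_to_variance_ratios(ordered):
    """Return (r1, r2) for targets already ordered along -v.

    Assumption: the variance is the unnormalized sum of squared deviations
    of the non-leader targets from their mean.
    """
    if len(ordered) < 3:
        raise ValueError("at least three targets are needed")
    leader, second = ordered[-1], ordered[-2]
    rest = ordered[:-1]  # coordinates of C - {c_(n)}
    mean = (sum(o[0] for o in rest) / len(rest),
            sum(o[1] for o in rest) / len(rest))
    spread = sum(sq_dist(o, mean) for o in rest)
    if spread == 0:
        raise ValueError("non-leader targets coincide; ratios undefined")
    r1 = sq_dist(leader, mean) / spread
    r2 = sq_dist(leader, second) / spread
    return r1, r2
```

With four targets at x = 0, 1, 2 and 10 on a horizontal track, the target at x = 10 is ranked last (leader) and both ratios become large, which is the behavior the metrics are meant to capture.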
The distance to distance ratio was devised as a solution to the target disappearance problem, as well as to counter the effects of the laggards, as both the leader and the laggard positions are essentially outliers of the target position probability density functions (pdfs). The first variation is the ratio of the squared distance between the leader and the second target, divided by the median absolute deviation (MAD) in $\tilde{O}$ [12]:

$$ r_3 = \frac{\|\mathbf{o}_{(n)} - \mathbf{o}_{(n-1)}\|_2^2}{\mathrm{MAD}(\tilde{O})}, \qquad \mathrm{MAD}(\tilde{O}) = \underset{\mathbf{o} \in \tilde{O}}{\mathrm{med}} \, \|\mathbf{o} - \mathrm{med}_2(\tilde{O})\|_2 \qquad (3) $$

where $\mathrm{med}_2(\cdot)$ can be any 2D median operator, e.g., a marginal or a vector median [12], [14].
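A minimal sketch of the first distance to distance ratio is given below. It assumes a marginal (per-coordinate) median as the 2D median operator and takes the MAD as the median Euclidean distance of the non-leader targets from that median; both choices, as well as the function names, are illustrative assumptions rather than the exact formulation of the paper.

```python
import math
from statistics import median

def marginal_median(points):
    """Marginal 2D median: the median of each coordinate taken separately."""
    return (median(p[0] for p in points), median(p[1] for p in points))

def ratio_r3(ordered):
    """Squared leader-to-second distance divided by the MAD of the rest.

    ordered: ROI centers sorted along -v, leader last (assumed input).
    """
    if len(ordered) < 3:
        raise ValueError("at least three targets are needed")
    leader, second = ordered[-1], ordered[-2]
    rest = ordered[:-1]  # non-leader targets
    center = marginal_median(rest)
    # Median absolute deviation around the 2D median (assumed definition).
    mad = median(math.dist(o, center) for o in rest)
    if mad == 0:
        raise ValueError("MAD is zero; ratio undefined")
    gap_sq = math.dist(leader, second) ** 2
    return gap_sq / mad
```

Because the median ignores extreme values, a single laggard far behind the group changes the denominator much less than it would change a variance.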
The second variation is the ratio of the squared distance $\|\mathbf{o}_{(n)} - \mathbf{o}_{(n-1)}\|_2^2$, divided by the mean distance between each target and its nearest neighbor:

$$ r_4 = \frac{\|\mathbf{o}_{(n)} - \mathbf{o}_{(n-1)}\|_2^2}{\frac{1}{n} \sum_{i=1}^{n} \min_{j \neq i} \|\mathbf{o}_{(i)} - \mathbf{o}_{(j)}\|_2} \qquad (4) $$

Both $r_3$ and $r_4$ assume that the target ROI center variance along the direction perpendicular to $\mathbf{v}$ is small. If not, the same analysis can be performed on the projections $p_{(i)}$. In this case, the $\mathrm{med}_2$ operator in (3) is reduced to the classic 1D median operator [12]. All $r_i$, $i = 1, \dots, 4$, vary with time $t$. As leader breakaway is a temporal event, their differentiation indicates breakaway: $\frac{dr_i}{dt} > T$, $i = 1, \dots, 4$. The optimal threshold $T$ can be found by estimating the pdfs $p_{r_i}(r_i)$, $i = 1, \dots, 4$, in presence/absence of breakaway and by choosing a threshold minimizing the detection error [15].
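The nearest-neighbor ratio and the derivative test can be sketched as below. The mean nearest-neighbor distance is taken over all targets and the time derivative is approximated by a one-frame finite difference; both are assumptions made for illustration, and the function names are hypothetical.

```python
import math

def ratio_r4(ordered):
    """Squared leader-to-second distance over the mean nearest-neighbor distance.

    ordered: ROI centers sorted along -v, leader last (assumed input).
    """
    if len(ordered) < 2:
        raise ValueError("at least two targets are needed")
    leader, second = ordered[-1], ordered[-2]
    nearest = []
    for i, a in enumerate(ordered):
        # Distance from target i to its closest other target.
        nearest.append(min(math.dist(a, b)
                           for j, b in enumerate(ordered) if j != i))
    mean_nn = sum(nearest) / len(nearest)
    if mean_nn == 0:
        raise ValueError("all targets coincide; ratio undefined")
    return math.dist(leader, second) ** 2 / mean_nn

def breakaway_frames(ratio_series, threshold):
    """Frame indices where the finite-difference derivative exceeds T."""
    return [t for t in range(1, len(ratio_series))
            if ratio_series[t] - ratio_series[t - 1] > threshold]
```

As a usage example, a ratio series that stays near 1 and then jumps to 5 at frame 3 is flagged only at that frame for a threshold of 2.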

V. EXPERIMENTAL RESULTS
Leader and breakaway detection performance was evaluated on a video dataset of cycling races provided by Radiotelevisione Italiana (RAI). From the whole dataset, a number of relatively short video clips were extracted, so that the drone camera motion conforms to the detection requirements set out in the previous sections. The cyclist detector of [16] was employed for target detection. The chosen video clips were converted to grayscale and their spatial resolution was reduced to 144 × 256 pixels.
Since the leader detection algorithm uses bounding boxes that have already been produced by a cyclist detector, the quality of the overall results depends strongly on the target detection performance. If target detection performance is good [16], as in Figure 3, the leader detection algorithm yields almost perfectly accurate results. Using a dataset of 1571 bicycle racing video frame pairs, the leader detection algorithm achieved a leader detection accuracy of 97.2%, while in detecting the second best it achieved 95.6% accuracy. It can run in real time, as only around 24 ms are needed to process a single video frame pair.
As cyclist breakaway videos are rather scarce, at least to the authors' knowledge, the event was simulated. Figures 5(a) and 5(b) are taken from a CGI video simulating a cycling race, shot with the VTS camera motion type (CMT) [1], that includes three breakaway events. Figure 4 depicts the derivatives $\frac{dr_i}{dt}$, $i = 1, \dots, 4$, respectively, after median filtering with kernel size 7. As seen in this figure, a derivative surpassing a certain threshold can successfully indicate the time instances (around video frames no. 250, 600 and 1000) when the breakaways take place. It is clearly seen that $\frac{dr_3}{dt}$ and particularly $\frac{dr_4}{dt}$ have much less background noise in the absence of a breakaway, as manifested in Table I. We estimate the pdfs $p_{r_3}(r_3)$ and $p_{r_4}(r_4)$ in presence/absence of breakaway using Kernel Density Estimation [17] with a Gaussian kernel and then determine the optimal thresholds that minimize the detection error, as seen in Figure 6.
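The median filtering and the threshold selection described above can be sketched as follows. This is an illustration under several assumptions that are not stated in the text: the filter window is truncated at the sequence borders, the KDE bandwidth is supplied by the caller, both classes have equal priors, and the threshold is searched on a uniform grid. All function names are hypothetical.

```python
import math
from statistics import median

def median_filter(signal, kernel_size=7):
    """Sliding median; the window is truncated at the borders (assumption)."""
    half = kernel_size // 2
    return [median(signal[max(0, t - half): t + half + 1])
            for t in range(len(signal))]

def kde_cdf(samples, x, bandwidth):
    """CDF at x of a Gaussian-kernel density estimate built from samples."""
    scale = bandwidth * math.sqrt(2.0)
    return sum(0.5 * (1.0 + math.erf((x - s) / scale))
               for s in samples) / len(samples)

def min_error_threshold(absent, present, bandwidth, steps=400):
    """Grid search for the threshold T minimizing the detection error.

    absent / present: lists of metric values without / with breakaway.
    Error = 0.5 * (false alarm rate + miss rate), i.e. equal priors (assumption).
    """
    lo, hi = min(absent + present), max(absent + present)
    best_t, best_err = lo, float("inf")
    for k in range(steps + 1):
        t = lo + (hi - lo) * k / steps
        false_alarm = 1.0 - kde_cdf(absent, t, bandwidth)  # absent above T
        miss = kde_cdf(present, t, bandwidth)              # present below T
        err = 0.5 * (false_alarm + miss)
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err
```

When the two classes are well separated, as the text reports for the derivatives of $r_3$ and $r_4$, the selected threshold falls in the gap between them and the estimated error is close to zero.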
Fig. 5: (a) Video frame at a time instance when the cyclist group is compact. (b) Video frame at a time instance when a breakaway is detected.

VI. CONCLUSIONS
Although the results show that the leader and breakaway detection algorithm can yield satisfactory results, there are certain limitations that must be taken into consideration. First and foremost, the leader detection algorithm applies only in scenarios where the targets move on a relatively straight line. Turns produce complex optical flow behavior, such that the motion direction cannot be estimated by a single averaging. Secondly, as stated in a previous section, the algorithm works well only with certain camera motion types (CMTs), notably VTS and LTS [1], [8], [9], where the camera axis is perpendicular to the targets' trajectory and the UAV and target trajectories are parallel to one another.