Sports2D: Compute 2D human pose and angles from a video or a webcam
Description
Depth perspective effects, which make the farther limb look smaller, are now compensated. They represent a 1-2% coordinate error at a 10 m camera distance, and more if the camera is closer.
The pixel-to-meter conversion was improved. It now takes into account:
- the pixel-to-meter scale
- the camera horizon angle
- the floor height
- the perspective effects, with an additional depth parameter (new in this release)
Added a flexible configuration argument for the user to choose which depth information to use (see the sketch after this list). Either:
- Distance from camera to the lane: distance_m
- Focal length in pixels: distance_m = f_px * H/h
- Field of view in degrees or radians: f_px = max(W,H)/2 / tan(fov/2), then distance_m = f_px * H/h
- Calibration file: distance_m = K[0,0] * H/h
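As a rough illustration, the sketch below reduces each of these options to a camera-to-person distance. The function and argument names (to_distance_m, height_m, height_px, img_w, img_h) are made up for this example; this is not the actual Sports2D API.

```python
import numpy as np

def to_distance_m(value, unit, height_m, height_px, img_w, img_h, K=None):
    """Reduce any supported depth specification to a camera-to-person
    distance in meters (illustrative helper, not the Sports2D API)."""
    if unit == 'distance_m':                 # distance given directly
        return value
    if unit == 'f_px':                       # focal length in pixels
        return value * height_m / height_px  # Equation (2): distance_m = f_px * H/h
    if unit in ('fov_deg', 'fov_rad'):       # field of view -> focal length -> distance
        fov = np.radians(value) if unit == 'fov_deg' else value
        f_px = max(img_w, img_h) / 2 / np.tan(fov / 2)
        return f_px * height_m / height_px
    if unit == 'from_calib':                 # K[0,0] is the focal length in pixels
        return K[0, 0] * height_m / height_px
    raise ValueError(f'Unknown perspective_unit: {unit}')
```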
In the same way, the camera horizon angle and the floor height can be specified:
- manually
- from a calibration file
- automatically from gait
```toml
[px_to_meters_conversion] # Config.toml
# Compensate for perspective effects, which make the farther limb look smaller. 1-2% coordinate error at 10 m, less if the camera is further away
perspective_value = 10 # Either camera-to-person distance (m), focal length (px), field of view (degrees or radians), or '' if perspective_unit=='from_calib'
perspective_unit = 'distance_m' # 'distance_m', 'f_px', 'fov_deg', 'fov_rad', or 'from_calib'

# Compensate for camera horizon
floor_angle = 'auto' # float, 'from_kinematics', 'from_calib', or 'auto'. 'auto' is equivalent to 'from_kinematics', ie the angle calculated from foot contacts. 'from_calib' calculates it from a toml calibration file. Use a float to specify it manually, in degrees
xy_origin = ['auto'] # [px_x,px_y], ['from_kinematics'], ['from_calib'], or ['auto']. BETWEEN BRACKETS! ['auto'] is equivalent to ['from_kinematics'], ie the origin estimated at the first foot contact, with the direction of motion as the X direction. ['from_calib'] calculates it from a calibration file. Use [px_x,px_y] to specify it manually, in pixels (px_y points downwards)

# Optional calibration file
calib_file = '' # Calibration file in the Pose2Sim toml format, or '' if not available
```
Note: If the user does not want perspective effects to be taken into account, they can set the distance to a very large value, e.g. perspective_value = 10000 with perspective_unit = 'distance_m'.
Full Changelog: https://github.com/davidpagnon/Sports2D/compare/v0.8.24...v0.8.25
More about pixel-to-meter conversion
Pixel-to-meter scale
Let's start with the pinhole camera model.
The intercept theorem tells us:
distance_m / f_px = Y / y (1)
With:
- distance_m: the distance between the camera origin and the athlete in meters
- f_px: the focal length (the distance between the camera origin and the sensor), converted from mm to pixels
- Y: the coordinate of a point in the scene, in meters
- y: the coordinate of that point on the camera sensor, in pixels
<img width="703" height="318" alt="image" src="https://github.com/user-attachments/assets/b51a71b7-1eed-4ac8-8f1c-7629a457c3d2" />
<br><br>
A particular case of this relation involves the height of the athlete:
distance_m / f_px = H / h (2)
With:
- H: the height of the athlete in meters
- h: the height of the athlete on the camera sensor in pixels
Now, the image coordinates are generally not taken from the center of the image / sensor, but from its top left corner (see image), which means that:
x = u - cu (3)
y = - (v - cv)
With:
- u,v: the image coordinates
- cu, cv: the coordinates of the principal point of the sensor, approximated as the image center
<img width="583" height="357" alt="image" src="https://github.com/user-attachments/assets/deb76c49-450a-4bca-8ff0-4370434dc0e9" />
<br><br>
So we end up with all these relations:
distance_m / f_px = H / h = X/(u-cu) = -Y/(v-cv) (4)
And the simplest case is solved:
X = H / h * (u-cu) (5)
Y = - H / h * (v-cv)
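For concreteness, here is Equation (5) as a minimal Python sketch (the function and argument names are illustrative):

```python
def px_to_m_simple(u, v, cu, cv, height_m, height_px):
    """Equation (5): pixel-to-meter conversion in the simplest case
    (level camera, origin at the image center)."""
    scale = height_m / height_px      # H/h, meters per pixel
    X = scale * (u - cu)
    Y = -scale * (v - cv)             # v points down, Y points up
    return X, Y
```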
<table> <thead style="text-align: left;"> <tr> <th scope="col" style="text-align: left;">Person height calculation:</th> </tr> </thead> <tbody> <tr> <td scope="row" style="text-align: left;">I calculate the height in pixels from the following distances: <code>height = (rfoot+lfoot)/2 + (rshank+lshank)/2 + (rfemur+lfemur)/2 + (rback+lback)/2 + head</code>, with:<br> <ul> <li>Foot: distance from heel to ankle (or 10 cm if the pose does not provide any heel point) </li> <li>Shank: distance from ankle to knee </li> <li>Femur: distance from knee to hip </li> <li>Back: distance from hip to shoulder </li> <li>Head: distance from midshoulder to top head point (or distance from midshoulder to nose*1.33 if the pose model does not provide any top head point) </li> </ul> Not all frames are good, therefore: <ul> <li>I first remove the 20% fastest frames (potential outliers), the frames where speed is close to zero (person might be out of frame), and the frames where the hip and knee angles are below 45° (coordinates are imprecise when the person is crouching). </li> <li>And I take the trimmed mean over the remaining frames, after removing the 20% most extreme values. </li> </ul> </td> </tr> </table>
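The frame filtering described above could be sketched as follows, assuming per-frame height, speed, and angle arrays are already available. The names and the near-zero speed threshold are illustrative, not the actual Sports2D internals.

```python
import numpy as np
from scipy import stats

def robust_height_px(heights_px, speeds_px, hip_knee_angles_deg, min_speed=0.5):
    """Robust per-video height estimate: drop unreliable frames, then take a
    trimmed mean over the rest. min_speed is an assumed near-zero threshold."""
    heights = np.asarray(heights_px, float)
    speeds = np.asarray(speeds_px, float)
    angles = np.asarray(hip_knee_angles_deg, float)
    keep = speeds < np.quantile(speeds, 0.8)    # drop the 20% fastest frames
    keep &= speeds > min_speed                  # drop near-zero speeds (person may be out of frame)
    keep &= angles >= 45                        # drop crouching frames (imprecise coordinates)
    return stats.trim_mean(heights[keep], 0.1)  # drop the 20% most extreme values (10% per tail)
```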
Compensation for the camera horizon
The camera is not always set perfectly horizontally, and we may want to compensate for it. After evaluating the angle from gait kinematics or from a calibration file, we can change the coordinate system:
xang = x*cos(ang) + y*sin(ang) (6)
yang = y*cos(ang) - x*sin(ang)
With ang the camera horizon angle.
Reinjecting this in the previous formula gives:
X = H / h * ((u-cu)*cos(ang) + (v-cv)*sin(ang)) (7)
Y = - H / h * ((v-cv)*cos(ang) - (u-cu)*sin(ang))
Moreover, we want the floor to be situated at Y = 0 so that feet are in contact with the floor. Instead of considering that the pixel origin is at the center of the image, we determine (cx,cy) as the intersection between the left border of the image and the floor line, determined from kinematics. We can simply replace (cu,cv) by (cx,cy) in the previous formula:
X = H / h * ((u-cx)*cos(ang) + (v-cy)*sin(ang)) (8)
Y = - H / h * ((v-cy)*cos(ang) - (u-cx)*sin(ang))
<img width="820" height="510" alt="image" src="https://github.com/user-attachments/assets/53d5bd22-b57d-4cee-9208-979fdf56f4df" />
<br><br>
<table> <thead style="text-align: left;"> <tr> <th scope="col" style="text-align: left;">Determination of the camera horizon and floor height from kinematics:</th> </tr> </thead> <tbody> <tr> <td scope="row" style="text-align: left;">The floor line (origin and angle) is estimated from the line that fits foot ground contacts. <br>Ground contacts are estimated as the coordinates where the feet's horizontal velocities are close to zero (default: 7 px/s). Points with low confidence are removed (default: 0.3). The output of the fit is (slope, intercept). We obtain: <ul> <li>ang = -arctan(slope) </li> <li>origin = (0,intercept) = (cx,cy), ie the intersection between the left border and the floor line </li> <li>gait_direction: left-to-right if ang>0, and right-to-left otherwise. </li> </ul> <img width="1173" height="349" alt="image" src="https://github.com/user-attachments/assets/61f340bc-a7ce-4efe-86ef-acf086427728" /> </td> </tr> </table>
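A minimal sketch of this floor-line fit, assuming foot trajectories, speeds, and confidences have already been extracted (variable names are illustrative):

```python
import numpy as np

def fit_floor_line(foot_xy, foot_speed_x, foot_conf,
                   speed_thresh=7.0, conf_thresh=0.3):
    """Fit the floor line through estimated foot ground contacts.
    foot_xy: (N,2) pixel coordinates; foot_speed_x: horizontal speeds in px/s."""
    contacts = (np.abs(foot_speed_x) < speed_thresh) & (foot_conf > conf_thresh)
    xs, ys = foot_xy[contacts, 0], foot_xy[contacts, 1]
    slope, intercept = np.polyfit(xs, ys, 1)   # least-squares line through contacts
    ang = -np.arctan(slope)                    # camera horizon angle
    origin = (0.0, intercept)                  # (cx, cy): floor line at the left border
    direction = 'left-to-right' if ang > 0 else 'right-to-left'
    return ang, origin, direction
```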
Compensation for depth perspective
The person's left and right limbs are not situated at the same depth. Due to perspective, the further limb can look smaller, especially if the camera is close to the athlete. We can compensate for this effect.
We can extract from Equation (4):
distance_m / f_px = -Y/(v-cv) (9)
Adding in the depth offset, we get:
(distance_m + depth_offset) / f_px = -Y/(v-cv) (10)
With depth_offset the offset of the joint with regard to the body midline, in meters.
This equation can be reorganized to separate the coordinates at the body midline from their offsets due to depth:
Y = - [ distance_m / f_px * (v-cv) + depth_offset / f_px * (v-cv) ] (11)
Now, there is a catch: we want the floor to be situated at Y=0, so (v-cv) should be replaced by (v-cy) in the first part of the equation. On the other hand, we want the depth offset to be zero at the center of the image (and larger further away from it), so the second part of the equation should not be changed (see image). So we obtain:
Y = - [ distance_m / f_px * (v-**cy**) + depth_offset / f_px * (v-cv) ] (12)
<img width="581" height="344" alt="image" src="https://github.com/user-attachments/assets/4dcabafd-9161-4355-9f2a-77435f73ea04" />
<br><br>
Equation (2) tells us that distance_m / f_px = H / h. Substituting it, and applying the same reasoning to the horizontal coordinate, we get:
X = H/h * [ (u-cx) + depth_offset / distance_m * (u-cu) ] (13)
Y = - H/h * [ (v-cy) + depth_offset / distance_m * (v-cv) ]
Finally, when taking the angles into consideration, the final formula becomes longer, although not more complex:
<table> <thead style="text-align: left;"> <tr> <th style="text-align: left;">Final formula:</th> </tr> </thead> <tbody> <tr> <td scope="row" style="text-align: left;">X = H/h * [ (14)<br> ( (u-cx) + depth_offset / distance_m * (u-cu) ) * cos(ang) + <br> ( (v-cy) + depth_offset / distance_m * (v-cv) ) * sin(ang) <br> ] <br> Y = - H/h * [<br> ( (v-cy) + depth_offset / distance_m * (v-cv) ) * cos(ang) -<br> ( (u-cx) + depth_offset / distance_m * (u-cu) ) * sin(ang) <br> ] </td> </tr> </table>
With:
- X: the horizontal coordinate of a point in meters
- Y: the vertical coordinate of a point in meters, pointing upwards
- u: the horizontal image coordinate of a point in pixels
- v: the vertical image coordinate of a point in pixels, pointing downwards
- cx: the horizontal coordinate of the left image border, ie 0 px
- cy: the intersection between the left border and the floor line, in pixels
- cu: the horizontal center of the image in pixels, ie width/2
- cv: the vertical center of the image in pixels, ie height/2
- H: the height of the athlete in meters
- h: the height of the athlete on the camera sensor in pixels
- depth_offset: the offset of the joint with regard to the body midline, in meters
- ang: the camera horizon angle
- distance_m: the distance between the camera origin and the athlete in meters
- f_px: the focal length (the distance between the camera origin and the sensor), converted from mm to pixels
<img width="839" height="827" alt="image" src="https://github.com/user-attachments/assets/40789d36-9926-43cd-83da-42d6341139bb" />
<br><br>
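Putting it together, the final formula can be written as a short Python function; the names mirror the notation list above (illustrative, not the Sports2D API):

```python
import numpy as np

def px_to_m(u, v, height_m, height_px, cx, cy, cu, cv,
            ang, depth_offset, distance_m):
    """Equation (14): pixel-to-meter conversion with floor origin,
    camera horizon, and depth perspective compensation."""
    scale = height_m / height_px                          # H/h
    du = (u - cx) + depth_offset / distance_m * (u - cu)  # horizontal term
    dv = (v - cy) + depth_offset / distance_m * (v - cv)  # vertical term
    X = scale * (du * np.cos(ang) + dv * np.sin(ang))
    Y = -scale * (dv * np.cos(ang) - du * np.sin(ang))
    return X, Y
```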
Note: I used canonical values for the joint-to-body-midline distances, which range from about 10 cm (knees) to 20 cm (shoulders). This is a rough estimate, since depths obviously evolve during the sprint, but with 2D pose estimation it is the best we can get.
Corollary: If the camera is at a 10 m distance, the depth_offset/distance_m ratio is 1 to 2%. The (v-cv) factor ensures that the offset vanishes at the center of the image.
Calibration file loading and generation
Regardless of the user's choice, a calibration file is generated:
- The intrinsic matrix is written as:
  [f_px, 0, cu]
  [0, f_px, cv]
  [0, 0, 1]
- The rotation (stored as a Rodrigues vector) is calculated from the matrix:
  [cos(ang), 0, sin(ang)]
  [0, 1, 0]
  [-sin(ang), 0, cos(ang)]
- The translation vector is calculated as:
  [-H/h*(cx-cu), H/h*f_px, H/h*(cy-cv)]
- The distortions are assumed to be nonexistent
Note that the actual formulas for rotation and translation are a bit more complex, as their vectors need to be converted back and forth between the world and the camera points of view.
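Ignoring that world-to-camera conversion, a simplified sketch of the generated parameters could look like this (assuming OpenCV for the Rodrigues conversion; names are illustrative):

```python
import numpy as np
import cv2

def build_calibration(f_px, cu, cv, cx, cy, ang, height_m, height_px):
    """Simplified calibration parameters, as described above."""
    K = np.array([[f_px, 0,    cu],
                  [0,    f_px, cv],
                  [0,    0,    1]], float)                  # intrinsic matrix
    R = np.array([[np.cos(ang),  0, np.sin(ang)],
                  [0,            1, 0          ],
                  [-np.sin(ang), 0, np.cos(ang)]], float)   # rotation matrix
    rvec, _ = cv2.Rodrigues(R)                              # Rodrigues rotation vector
    scale = height_m / height_px                            # H/h
    T = np.array([-scale*(cx - cu), scale*f_px, scale*(cy - cv)])
    dist = np.zeros(4)                                      # distortions assumed nonexistent
    return K, rvec, T, dist
```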
Incidentally, producing a calibration file allows for a better visualization, with an overlay of the OpenSim skeleton.
<img width="851" height="229" alt="image" src="https://github.com/user-attachments/assets/3a352b0d-973e-48ed-a8ec-182e3abfaaff" />
Files
| Name | Size |
|---|---|
| davidpagnon/Sports2D-v0.8.25.zip (md5:54e4493c1cda62c8a71afefce713a327) | 9.9 MB |
Additional details
Related works
- Is supplement to: https://github.com/davidpagnon/Sports2D/tree/v0.8.25 (Software)
Software
- Repository URL: https://github.com/davidpagnon/Sports2D