00000nmm##2200000uu#4500 3932973 doi 10.5281/zenodo.3932973 oai:zenodo.org:3932973 Peng, Boya Stanford University Luo, Zelun Stanford University Alahi, Alexandre Stanford University Yeung, Serena (orcid)0000-0003-0529-0628 Stanford University Fei-Fei, Li (orcid)0000-0002-7481-0810 Stanford University ITOP Dataset Haque, Albert (orcid)0000-0001-6769-6370 Stanford University arxiv:arXiv:1603.07076 info:eu-repo/semantics/openAccess Creative Commons Attribution 4.0 International https://creativecommons.org/licenses/by/4.0/legalcode cc-by-4.0 spdx depth sensor human pose estimation computer vision 3D vision Summary The ITOP dataset (Invariant Top View) contains 100K depth images from side and top views of a person in a scene. For each image, the location of 15 human body parts are labeled with 3-dimensional (x,y,z) coordinates, relative to the sensor's position. Read the full paper for more context [<a href="https://arxiv.org/pdf/1603.07076.pdf">pdf</a>]. Getting Started Download then decompress the h5.gz file. <pre><code class="language-bash">gunzip ITOP_side_test_depth_map.h5.gz</code></pre> Using Python and <a href="https://www.h5py.org/">h5py</a> (pip install h5py or conda install h5py), we can load the contents: <pre><code class="language-python">import h5py import numpy as np f = h5py.File('ITOP_side_test_depth_map.h5', 'r') data, ids = f.get('data'), f.get('id') data, ids = np.asarray(data), np.asarray(ids) print(data.shape, ids.shape) # (10501, 240, 320) (10501,)</code></pre> Note: For any of the *_images.h5.gz files, the underlying file is a tar file and not a h5 file. Please rename the file extension from h5.gz to tar.gz before opening. The following commands will work: <pre><code class="language-bash">mv ITOP_side_test_images.h5.gz ITOP_side_test_images.tar.gz tar xf ITOP_side_test_images.tar.gz</code></pre> Metadata File sizes for images, depth maps, point clouds, and labels refer to the uncompressed size. <pre><code>+-------+--------+---------+---------+----------+------------+--------------+---------+ | View | Split | Frames | People | Images | Depth Map | Point Cloud | Labels | +-------+--------+---------+---------+----------+------------+--------------+---------+ | Side | Train | 39,795 | 16 | 1.1 GiB | 5.7 GiB | 18 GiB | 2.9 GiB | | Side | Test | 10,501 | 4 | 276 MiB | 1.6 GiB | 4.6 GiB | 771 MiB | | Top | Train | 39,795 | 16 | 974 MiB | 5.7 GiB | 18 GiB | 2.9 GiB | | Top | Test | 10,501 | 4 | 261 MiB | 1.6 GiB | 4.6 GiB | 771 MiB | +-------+--------+---------+---------+----------+------------+--------------+---------+</code></pre> Data Schema Each file contains several HDF5 datasets at the root level. Dimensions, attributes, and data types are listed below. The key refers to the (HDF5) dataset name. Let \(n\) denote the number of images. Transformation To convert from point clouds to a \(240 \times 320\) image, the following transformations were used. Let \(x_{\textrm{img}}\) and \(y_{\textrm{img}}\) denote the \((x,y)\) coordinate in the image plane. Using the raw point cloud \((x,y,z)\) real world coordinates, we compute the depth map as follows: \(x_{\textrm{img}} = \frac{x}{Cz} + 160\) and \(y_{\textrm{img}} = -\frac{y}{Cz} + 120\) where \(C\approx 3.50×10^{−3} = 0.0035\) is the intrinsic camera calibration parameter. This results in the depth map: \((x_{\textrm{img}}, y_{\textrm{img}}, z)\). Joint ID (Index) Mapping <pre><code>joint_id_to_name = { 0: 'Head', 8: 'Torso', 1: 'Neck', 9: 'R Hip', 2: 'R Shoulder', 10: 'L Hip', 3: 'L Shoulder', 11: 'R Knee', 4: 'R Elbow', 12: 'L Knee', 5: 'L Elbow', 13: 'R Foot', 6: 'R Hand', 14: 'L Foot', 7: 'L Hand', }</code></pre> Depth Maps <ul> <li>Key: id <ul> <li>Dimensions: \((n,)\)</li> <li>Data Type: uint8</li> <li>Description: Frame identifier in the form XX_YYYYY where XX is the person's ID number and YYYYY is the frame number.</li> </ul> </li> <li>Key: data <ul> <li>Dimensions: \((n,240,320)\)</li> <li>Data Type: float16</li> <li>Description: Depth map (i.e. mesh) corresponding to a single frame. Depth values are in real world meters (m).</li> </ul> </li> </ul> Point Clouds <ul> <li>Key: id <ul> <li>Dimensions: \((n,)\)</li> <li>Data Type: uint8</li> <li>Description: Frame identifier in the form XX_YYYYY where XX is the person's ID number and YYYYY is the frame number.</li> </ul> </li> <li>Key: data <ul> <li>Dimensions: \((n,76800,3)\)</li> <li>Data Type: float16</li> <li>Description: Point cloud containing 76,800 points (240x320). Each point is represented by a 3D tuple measured in real world meters (m).</li> </ul> </li> </ul> Labels <ul> <li>Key: id <ul> <li>Dimensions: \((n,)\)</li> <li>Data Type: uint8</li> <li>Description: Frame identifier in the form XX_YYYYY where XX is the person's ID number and YYYYY is the frame number.</li> </ul> </li> <li>Key: is_valid <ul> <li>Dimensions: \((n,)\)</li> <li>Data Type: uint8</li> <li>Description: Flag corresponding to the result of the human labeling effort. This is a boolean value (represented by an integer) where a one (1) denotes clean, human-approved data. A zero (0) denotes noisy human body part labels. If is_valid is equal to zero, you should not use any of the provided human joint locations for the particular frame.</li> </ul> </li> <li>Key: visible_joints <ul> <li>Dimensions: \((n,15)\)</li> <li>Data Type: int16</li> <li>Description: Binary mask indicating if each human joint is visible or occluded. This is denoted by \(\alpha\) in the paper. If \(\alpha_j=1\) then the \(j^{th}\) joint is visible (i.e. not occluded). Otherwise, if \(\alpha_j = 0\) then the \(j^{th}\) joint is occluded.</li> </ul> </li> <li>Key: image_coordinates <ul> <li>Dimensions: \((n,15,2)\)</li> <li>Data Type: int16</li> <li>Description: Two-dimensional \((x,y)\) points corresponding to the location of each joint in the depth image or depth map.</li> </ul> </li> <li>Key: real_world_coordinates <ul> <li>Dimensions: \((n,15,3)\)</li> <li>Data Type: float16</li> <li>Description: Three-dimensional \((x,y,z)\) points corresponding to the location of each joint in real world meters (m).</li> </ul> </li> <li>Key: segmentation <ul> <li>Dimensions: \((n,240,320)\)</li> <li>Data Type: int8</li> <li>Description: Pixel-wise assignment of body part labels. The background class (i.e. no body part) is denoted by −1.</li> </ul> </li> </ul> Citation If you would like to cite our work, please use the following. Haque A, Peng B, Luo Z, Alahi A, Yeung S, Fei-Fei L. (2016). Towards Viewpoint Invariant 3D Human Pose Estimation. European Conference on Computer Vision. Amsterdam, Netherlands. Springer. <pre>@inproceedings{haque2016viewpoint, title={Towards Viewpoint Invariant 3D Human Pose Estimation}, author={Haque, Albert and Peng, Boya and Luo, Zelun and Alahi, Alexandre and Yeung, Serena and Fei-Fei, Li}, booktitle = {European Conference on Computer Vision}, month = {October}, year = {2016} }</pre> <ul> </ul> eng Zenodo 2016-10-08 info:eu-repo/semantics/other 20200731062425.0 3699135 md5:7205b0ba47f76892742ded774754d7a1 https://zenodo.org/records/3932973/files/ITOP_side_test_labels.h5.gz 2061701631 md5:3f5227d6f260011b19f325fffde08a65 https://zenodo.org/records/3932973/files/ITOP_side_test_point_cloud.h5.gz 257980348 md5:1803c50e44746dca7ccf03c2d46c466e https://zenodo.org/records/3932973/files/ITOP_side_test_images.h5.gz 245104261 md5:65f431c9f7540db6118d99bc9bae7576 https://zenodo.org/records/3932973/files/ITOP_side_test_depth_map.h5.gz 926228035 md5:80736f716b0e83f7cc73ec85bb13effc https://zenodo.org/records/3932973/files/ITOP_side_train_depth_map.h5.gz 1010377751 md5:e325ed23ed962f86594b70f17c048a30 https://zenodo.org/records/3932973/files/ITOP_side_train_images.h5.gz 16833112 md5:e62a67678d5cddc13e07cfdd1eb0a176 https://zenodo.org/records/3932973/files/ITOP_side_train_labels.h5.gz 7840345186 md5:6ca457e8471e7514222624e937e11a9c https://zenodo.org/records/3932973/files/ITOP_side_train_point_cloud.h5.gz 245493889 md5:d8ad31ecbbcd13ee5e1f02874c0cb3d0 https://zenodo.org/records/3932973/files/ITOP_top_test_depth_map.h5.gz 246678932 md5:21f702e3ce0e5602340957e6cae6148a https://zenodo.org/records/3932973/files/ITOP_top_test_images.h5.gz 17461 md5:5d6c045333e9f520c24d335f57e0422e https://zenodo.org/records/3932973/files/sample_top_labeled.jpg 18689 md5:0afbd5971faee803d14969e4c2a71267 https://zenodo.org/records/3932973/files/sample_top.jpg 22911 md5:25aaef40a70ad75f452438824a2bb71f https://zenodo.org/records/3932973/files/sample_front_labeled.jpg 9280299 md5:6a9c5d7845dc7fdf6d168ee4dd356afd https://zenodo.org/records/3932973/files/ITOP_top_test_labels.h5.gz 20450 md5:86d7be54b61841fe22b27949fffc042d https://zenodo.org/records/3932973/files/sample_front.jpg 7620649272 md5:f5fd64240296be0bfff5318beca19884 https://zenodo.org/records/3932973/files/ITOP_top_train_point_cloud.h5.gz 32165804 md5:95776e7beeb9a769bef25eb336afb5bd https://zenodo.org/records/3932973/files/ITOP_top_train_labels.h5.gz 923855225 md5:6e2daf5be0f0bf6eddf611913e718417 https://zenodo.org/records/3932973/files/ITOP_top_train_images.h5.gz 2020245383 md5:3ac977488864e27ac13e8cf17d03f8c7 https://zenodo.org/records/3932973/files/ITOP_top_test_point_cloud.h5.gz 917859800 md5:159a8694f653f5b639252de84469f7b9 https://zenodo.org/records/3932973/files/ITOP_top_train_depth_map.h5.gz open arXiv:1603.07076 Cites arxiv 10.5281/zenodo.3932972 isVersionOf doi