Dataset Open Access
ITOP Dataset

Haque, Albert; Peng, Boya; Luo, Zelun; Alahi, Alexandre; Yeung, Serena; Fei-Fei, Li (Stanford University)
DOI: 10.5281/zenodo.3932973 | Record URL: https://zenodo.org/record/3932973
Publisher: Zenodo | Publication date: October 8, 2016 | Version 1.0
License: Creative Commons Attribution 4.0 International (CC BY 4.0), Open Access
Keywords: depth sensor, human pose estimation, computer vision, 3D vision
Related identifiers: cites arXiv:1603.07076; is a version of 10.5281/zenodo.3932972

**Summary**

The ITOP (Invariant Top View) dataset contains 100K depth images of a person in a scene, captured from side and top views. For each image, the locations of 15 human body parts are labeled with 3-dimensional (x, y, z) coordinates relative to the sensor's position.
Read the full paper for more context ([pdf](https://arxiv.org/pdf/1603.07076.pdf)).

**Getting Started**

Download and then decompress the h5.gz file:

```bash
gunzip ITOP_side_test_depth_map.h5.gz
```

Using Python and [h5py](https://www.h5py.org/) (`pip install h5py` or `conda install h5py`), we can load the contents:

```python
import h5py
import numpy as np

f = h5py.File('ITOP_side_test_depth_map.h5', 'r')
data, ids = f.get('data'), f.get('id')
data, ids = np.asarray(data), np.asarray(ids)
print(data.shape, ids.shape)  # (10501, 240, 320) (10501,)
```

**Note:** For any of the *\*_images.h5.gz* files, the underlying file is a tar archive, not an h5 file. Rename the file extension from *h5.gz* to *tar.gz* before opening. The following commands will work:

```bash
mv ITOP_side_test_images.h5.gz ITOP_side_test_images.tar.gz
tar xf ITOP_side_test_images.tar.gz
```

**Metadata**

File sizes for images, depth maps, point clouds, and labels refer to the uncompressed size.

| View | Split | Frames | People | Images  | Depth Map | Point Cloud | Labels  |
|------|-------|--------|--------|---------|-----------|-------------|---------|
| Side | Train | 39,795 | 16     | 1.1 GiB | 5.7 GiB   | 18 GiB      | 2.9 GiB |
| Side | Test  | 10,501 | 4      | 276 MiB | 1.6 GiB   | 4.6 GiB     | 771 MiB |
| Top  | Train | 39,795 | 16     | 974 MiB | 5.7 GiB   | 18 GiB      | 2.9 GiB |
| Top  | Test  | 10,501 | 4      | 261 MiB | 1.6 GiB   | 4.6 GiB     | 771 MiB |

**Data Schema**

Each file contains several HDF5 datasets at the root level. Dimensions, attributes, and data types are listed below. The key refers to the (HDF5) dataset name. Let \(n\) denote the number of images.

**Transformation**

To convert from point clouds to a \(240 \times 320\) image, the following transformations were used. Let \(x_{\textrm{img}}\) and \(y_{\textrm{img}}\) denote the \((x, y)\) coordinate in the image plane. Using the raw point cloud \((x, y, z)\) real-world coordinates, we compute the depth map as follows: \(x_{\textrm{img}} = \frac{x}{Cz} + 160\) and \(y_{\textrm{img}} = -\frac{y}{Cz} + 120\), where \(C \approx 3.50 \times 10^{-3} = 0.0035\) is the intrinsic camera calibration parameter. This results in the depth map \((x_{\textrm{img}}, y_{\textrm{img}}, z)\).
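As a rough illustration of this projection, the sketch below rasterizes a single point cloud frame into a \(240 \times 320\) depth map. The function name, rounding, and bounds handling are our own additions, not part of the dataset tooling:

```python
import numpy as np

C = 0.0035  # intrinsic calibration parameter quoted above


def point_cloud_to_depth_map(points, height=240, width=320):
    """Rasterize one (76800, 3) real-world point cloud into a depth map (meters)."""
    # Keep only points in front of the sensor to avoid dividing by zero.
    valid = points[:, 2] > 0
    x, y, z = points[valid, 0], points[valid, 1], points[valid, 2]
    x_img = np.round(x / (C * z) + 160).astype(int)
    y_img = np.round(-y / (C * z) + 120).astype(int)
    depth_map = np.zeros((height, width), dtype=np.float32)
    inside = (x_img >= 0) & (x_img < width) & (y_img >= 0) & (y_img < height)
    depth_map[y_img[inside], x_img[inside]] = z[inside]
    return depth_map
```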
**Joint ID (Index) Mapping**

```python
joint_id_to_name = {
    0: 'Head',        8: 'Torso',
    1: 'Neck',        9: 'R Hip',
    2: 'R Shoulder', 10: 'L Hip',
    3: 'L Shoulder', 11: 'R Knee',
    4: 'R Elbow',    12: 'L Knee',
    5: 'L Elbow',    13: 'R Foot',
    6: 'R Hand',     14: 'L Foot',
    7: 'L Hand',
}
```

**Depth Maps**

- *Key:* id
  - *Dimensions:* \((n,)\)
  - *Data Type:* uint8
  - *Description:* Frame identifier in the form XX_YYYYY, where XX is the person's ID number and YYYYY is the frame number.
- *Key:* data
  - *Dimensions:* \((n, 240, 320)\)
  - *Data Type:* float16
  - *Description:* Depth map (i.e. mesh) corresponding to a single frame. Depth values are in real-world meters (m).

**Point Clouds**

- *Key:* id
  - *Dimensions:* \((n,)\)
  - *Data Type:* uint8
  - *Description:* Frame identifier in the form XX_YYYYY, where XX is the person's ID number and YYYYY is the frame number.
- *Key:* data
  - *Dimensions:* \((n, 76800, 3)\)
  - *Data Type:* float16
  - *Description:* Point cloud containing 76,800 points (240 × 320). Each point is represented by a 3D tuple measured in real-world meters (m).
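As a quick consistency check, a sketch like the following loads the depth-map and point-cloud files for the same split and confirms that their shapes and frame identifiers line up. The point-cloud file name is assumed to follow the naming pattern of the depth-map file above:

```python
import h5py
import numpy as np

# File names follow the ITOP_<view>_<split>_<content>.h5 pattern used above;
# the point-cloud name is an assumption, not taken from the record.
depth_f = h5py.File('ITOP_side_test_depth_map.h5', 'r')
cloud_f = h5py.File('ITOP_side_test_point_cloud.h5', 'r')

depth = np.asarray(depth_f['data'])    # (n, 240, 320), float16, meters
clouds = np.asarray(cloud_f['data'])   # (n, 76800, 3), float16, meters

assert depth.shape[0] == clouds.shape[0]
assert np.array_equal(np.asarray(depth_f['id']), np.asarray(cloud_f['id']))
```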
**Labels**

- *Key:* id
  - *Dimensions:* \((n,)\)
  - *Data Type:* uint8
  - *Description:* Frame identifier in the form XX_YYYYY, where XX is the person's ID number and YYYYY is the frame number.
- *Key:* is_valid
  - *Dimensions:* \((n,)\)
  - *Data Type:* uint8
  - *Description:* Flag corresponding to the result of the human labeling effort. This is a boolean value (represented by an integer) where a one (1) denotes clean, human-approved data and a zero (0) denotes noisy human body part labels. If is_valid is equal to zero, you should not use any of the provided human joint locations for that frame.
- *Key:* visible_joints
  - *Dimensions:* \((n, 15)\)
  - *Data Type:* int16
  - *Description:* Binary mask indicating whether each human joint is visible or occluded, denoted by \(\alpha\) in the paper. If \(\alpha_j = 1\), the \(j^{th}\) joint is visible (i.e. not occluded); if \(\alpha_j = 0\), the \(j^{th}\) joint is occluded.
- *Key:* image_coordinates
  - *Dimensions:* \((n, 15, 2)\)
  - *Data Type:* int16
  - *Description:* Two-dimensional \((x, y)\) points corresponding to the location of each joint in the depth image or depth map.
- *Key:* real_world_coordinates
  - *Dimensions:* \((n, 15, 3)\)
  - *Data Type:* float16
  - *Description:* Three-dimensional \((x, y, z)\) points corresponding to the location of each joint in real-world meters (m).
- *Key:* segmentation
  - *Dimensions:* \((n, 240, 320)\)
  - *Data Type:* int8
  - *Description:* Pixel-wise assignment of body part labels. The background class (i.e. no body part) is denoted by −1.

A short sketch showing how these label fields fit together is given after the citation below.

**Citation**

If you would like to cite our work, please use the following.

**Haque A, Peng B, Luo Z, Alahi A, Yeung S, Fei-Fei L. (2016). Towards Viewpoint Invariant 3D Human Pose Estimation. European Conference on Computer Vision. Amsterdam, Netherlands. Springer.**

```
@inproceedings{haque2016viewpoint,
  title     = {Towards Viewpoint Invariant 3D Human Pose Estimation},
  author    = {Haque, Albert and Peng, Boya and Luo, Zelun and Alahi, Alexandre and Yeung, Serena and Fei-Fei, Li},
  booktitle = {European Conference on Computer Vision},
  month     = {October},
  year      = {2016}
}
```
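As referenced above, here is a minimal sketch of how the label fields might be used together. The labels file name is assumed to follow the naming pattern of the other files in the record:

```python
import h5py
import numpy as np

joint_id_to_name = {
    0: 'Head', 1: 'Neck', 2: 'R Shoulder', 3: 'L Shoulder', 4: 'R Elbow',
    5: 'L Elbow', 6: 'R Hand', 7: 'L Hand', 8: 'Torso', 9: 'R Hip',
    10: 'L Hip', 11: 'R Knee', 12: 'L Knee', 13: 'R Foot', 14: 'L Foot',
}

# File name assumed from the ITOP_<view>_<split>_<content>.h5 pattern.
labels = h5py.File('ITOP_side_test_labels.h5', 'r')
is_valid = np.asarray(labels['is_valid']).astype(bool)
visible = np.asarray(labels['visible_joints'])             # (n, 15)
img_xy = np.asarray(labels['image_coordinates'])           # (n, 15, 2)
world_xyz = np.asarray(labels['real_world_coordinates'])   # (n, 15, 3)

# Only frames with clean, human-approved labels (is_valid == 1) should be used.
frame = np.flatnonzero(is_valid)[0]
for j, name in joint_id_to_name.items():
    state = 'visible' if visible[frame, j] == 1 else 'occluded'
    print(f"{name:12s} {state:8s} image={img_xy[frame, j]} world={world_xyz[frame, j]}")
```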
|                  | All versions | This version |
|------------------|--------------|--------------|
| Views            | 1,592        | 1,592        |
| Downloads        | 6,614        | 6,614        |
| Data volume      | 18.8 TB      | 18.8 TB      |
| Unique views     | 1,320        | 1,320        |
| Unique downloads | 1,012        | 1,012        |