CHOC-NOCS (trained machine learning model - parameters)
Description
The model is based on the Normalised Object Coordinate Space (NOCS). This NOCS model was trained at the Centre for Intelligent Sensing (CIS), Queen Mary University of London, U.K., for the task of category-level 6D pose estimation of real hand-occluded containers (e.g., food boxes, drinking glasses, and cups), using a combination of data from the CORSMAL Hand-Occluded Containers (CHOC) dataset, the COCO dataset (1,700 images from each of the categories 'person', 'cup', and 'wine glass'), and the Open-Images-v6 dataset (1,700 images from the category 'box', excluding images with people).
Model date
v1.1.0: 25 January 2023
v1.0.0: 28 October 2022
Note: this is the date the model was trained.
Model type
Vision model based on a convolutional neural network with multiple branches that perform 2D detection and classification, 2D segmentation, and 3D normalised object coordinates (NOCS) prediction from an RGB image. The outputs of the model are used to estimate the object position and orientation in 3D (or 6D object pose) and the object size as a post-processing step. The 6D object pose can be recovered up to a scale factor with a Perspective-n-Point algorithm (e.g., EPnP) directly from the 3D normalised object coordinates and the corresponding pixels. The metric pose and size of the object can be recovered using priors, e.g., from the training set. If a corresponding depth image is available, the 6D object pose can be recovered at metric scale with a similarity transformation between the predicted 3D normalised object coordinates and the corresponding 3D points in the camera coordinate system (back-projected from the depth image) using Umeyama's algorithm.
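As an illustration of this post-processing, the sketch below recovers a scale-ambiguous pose with OpenCV's EPnP solver from NOCS-pixel correspondences, and a metric pose with Umeyama's similarity transform when depth-derived 3D points are available. This is a minimal sketch, not the code shipped with the model; the variable names (nocs_points, pixels, camera_K, depth_points) are placeholders.

```python
import numpy as np
import cv2

def pose_up_to_scale(nocs_points, pixels, camera_K):
    """Scale-ambiguous pose with EPnP from NOCS-pixel correspondences.

    nocs_points: (N, 3) predicted normalised object coordinates.
    pixels:      (N, 2) pixel locations of those predictions.
    camera_K:    (3, 3) camera intrinsic matrix.
    """
    ok, rvec, tvec = cv2.solvePnP(
        nocs_points.astype(np.float64), pixels.astype(np.float64),
        camera_K, None, flags=cv2.SOLVEPNP_EPNP)
    R, _ = cv2.Rodrigues(rvec)   # rotation matrix from rotation vector
    return ok, R, tvec           # translation is valid only up to a scale factor

def umeyama(src, dst):
    """Similarity transform (s, R, t) such that dst ~ s * R @ src + t."""
    n = src.shape[0]
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    cov = dst_c.T @ src_c / n
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:   # avoid a reflection
        S[2, 2] = -1.0
    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / n
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t
```

With a depth image, `s, R, t = umeyama(nocs_points, depth_points)` (where `depth_points` are the 3D points back-projected from depth at the same pixels) gives the metric 6D pose, and the scale `s` relates the extent of the normalised coordinates to the metric object size.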
Model version
v1.1.0
The version 1.1.0 was re-trained on the updated mixed-reality set, in which the annotated NOCS maps were corrected to account for the rotation of the forearm towards the optical axis (this rotation had previously not been applied when generating the NOCS maps).
The version 1.0.0 of the model was used in the pre-print A mixed-reality dataset for category-level 6D pose and size estimation of hand-occluded containers.
Training details
The model was trained with a batch size of 1 and with a strategy consisting of 3 stages, for a total of 300,000 iterations. The ResNet-101 backbone was initialised with pre-trained ImageNet weights; the remaining weights were randomly initialised. The first stage samples 130,000 images with a learning rate of 0.001, keeping the ResNet-101 backbone fixed and training only the heads (branches). The second stage samples 40,000 images with a learning rate of 0.0001, training the ResNet-101 backbone from block layer 4 onwards. The third stage samples 130,000 images with a learning rate of 0.00001, training all the layers of the model. At each iteration, an image is sampled with an 80% chance from the set of CHOC images, a 13.33% chance from the set of COCO images, and a 6.67% chance from the set of Open-Images-v6 images.
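As a minimal sketch of the per-iteration source sampling described above (the loader functions and source names are placeholders, not the actual training code):

```python
import numpy as np

rng = np.random.default_rng(0)
SOURCES = ["choc", "coco", "open_images_v6"]
PROBS = [0.80, 0.1333, 0.0667]   # 80% / 13.33% / 6.67%, summing to 1

def sample_training_image(loaders):
    """Pick the source dataset for one iteration, then draw one image from it.

    loaders: dict mapping a source name to a function returning one
             (image, annotations) pair, e.g. {"choc": load_choc_sample, ...}.
    """
    source = rng.choice(SOURCES, p=PROBS)
    return loaders[source]()
```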
Paper or other resources
Pre-print: A mixed-reality dataset for category-level 6D pose and size estimation of hand-occluded containers (arXiv:2211.10470). Code: CHOC_NOCS repository. Dataset: CORSMAL Hand-Occluded Containers (CHOC), DOI 10.5281/zenodo.5085800.
Enquiries
For enquiries, questions, or comments, please contact corsmal-challenge@qmul.ac.uk or a.xompero@qmul.ac.uk.
Citation
Plain:
X. Weber, A. Xompero, A. Cavallaro, "A mixed-reality dataset for category-level 6D pose and size estimation of hand-occluded containers", arxiv:2211.10470 [cs.CV], 2022.
Bibtex:
@misc{Weber2022ArXiv, author = {Weber, Xavier and Xompero, Alessio and Cavallaro, Andrea}, title = {A mixed-reality dataset for category-level 6D pose and size estimation of hand-occluded containers}, year = {2022}, eprint={2211.10470}, archivePrefix={arXiv}, primaryClass={cs.CV}, copyright = {Creative Commons Attribution Share Alike 4.0 International}, doi = {10.48550/ARXIV.2211.10470} }
License
Creative Commons Attribution 4.0 International
Intended use
The primary intended users of this model are academic researchers, scholars, and practitioners working in the fields of computer vision and robotics.
Primary intended use cases:
- 6D object pose estimation for robotic tasks and applications (grasping, manipulation, and human-to-robot handovers), augmented and mixed reality, and immersive online gaming.
- Baseline (RGB and/or RGB-D) method for 6D object pose estimation on data containing food boxes, cups, or drinking glasses.
- Baseline method for integration into a robotic arm system, to predict the 6D poses of the objects in the camera view of the robot. This can be used as input to a robotic control system to grasp the object of interest.
Out-of-scope use cases:
- Any safety-critical application that requires a high degree of accuracy and/or real-time performance.
Factors
The model was trained on the CHOC dataset, which includes human forearms and hands rendered with textures from the SURREAL dataset. Note that these textures vary widely in skin tone. Backgrounds include both indoor and outdoor settings. Training images have been supplemented with images from COCO and Open-Images-v6.
Metrics
Mean Average Precision (mAP) was used to measure the performance of the model: for 3D object detection, mAP was computed at thresholds on the Intersection over Union (Jaccard index) between the predicted and ground-truth 3D bounding boxes; for 6D object pose estimation, mAP was computed when the rotation error and the translation error were below given thresholds, respectively.
Other metrics that could be used to evaluate this model include recall, precision, accuracy, and F1-score.
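For reference, a minimal sketch of the two pose-error terms thresholded above (geodesic rotation error and Euclidean translation error); the function names and the example thresholds in the final comment are illustrative, not taken from the evaluation code:

```python
import numpy as np

def rotation_error_deg(R_pred, R_gt):
    """Geodesic distance between two 3x3 rotation matrices, in degrees."""
    cos_angle = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    cos_angle = np.clip(cos_angle, -1.0, 1.0)   # guard against numerical drift
    return np.degrees(np.arccos(cos_angle))

def translation_error(t_pred, t_gt):
    """Euclidean distance between the two translation vectors."""
    return np.linalg.norm(np.asarray(t_pred) - np.asarray(t_gt))

# A pose is counted as correct at thresholds (theta, d) when both errors fall below
# them, e.g. rotation_error_deg(Rp, Rg) < 10 and translation_error(tp, tg) < 0.05 m.
```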
Evaluation data
The model has been evaluated on two subsets of the CHOC dataset. The first is the test set of the mixed-reality CHOC data, containing 17,280 images; these images contain 6 objects not seen during training (note: only a single object per image). The second is the real set of the CHOC dataset, containing 3,951 images selected from the CORSMAL Containers Manipulation (CCM) dataset, with 15 unique objects. Since the CCM dataset does not provide 6D pose annotations, the poses were manually annotated.
Training data
The model was trained on the mixed-reality training set of the CHOC dataset, on images from COCO 2017 (1,700 images randomly sampled from each of the categories 'person', 'cup', and 'wine glass'), and on images from Open-Images-v6 (1,700 images randomly sampled from the category 'box', excluding images that contain humans).
Quantitative Analyses
Results and quantitative analyses are provided in the accompanying pre-print.
Ethical Considerations
The person category is not used to extract sensitive data about people, but only to obtain a more accurate segmentation under occlusion.
Caveats and Recommendations
- This NOCS model was trained to predict NOCS maps as 28x28 patches, which are later upsampled via bilinear interpolation to the size of the original image crops. Because these 3D points are bilinearly interpolated, the upsampled point cloud might no longer resemble the estimated object shape. A possible solution is to follow the bilinear upsampling with a convolutional layer.
- Because the NOCS estimation is treated as a classification task (i.e., 32 bins per dimension), the estimated point cloud is quantised and therefore not smooth (see the decoding sketch after this list).
- The model produces a notable number of false positive detections across different settings and backgrounds. A recommendation is to use a more recent object instance segmentation model, such as Mask2Former (as opposed to Mask R-CNN).
- The model has been trained on only a subset of the mixed-reality CHOC images (namely the training set). A recommendation is to train on the whole CHOC set, or to perform cross-fold validation with shuffling of the CHOC synthetic containers.
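A minimal sketch of how the per-dimension classification output can be decoded into the quantised NOCS coordinates mentioned above, assuming the 32 bin centres are mapped uniformly onto [0, 1] (the exact mapping used in the released code may differ):

```python
import numpy as np

NUM_BINS = 32

def decode_nocs_bins(bin_logits):
    """Decode per-pixel, per-dimension bin scores into quantised NOCS coordinates.

    bin_logits: (H, W, 3, NUM_BINS) classification scores for the x, y, z dimensions.
    Returns:    (H, W, 3) coordinates in [0, 1], quantised to the bin centres.
    """
    bin_idx = np.argmax(bin_logits, axis=-1)   # winning bin per pixel and dimension
    return (bin_idx + 0.5) / NUM_BINS          # map each bin index to its centre
```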
How to use
The model can be run using the software and instructions provided in the accompanying repository CHOC_NOCS. A demo to try the model on RGB or RGB-D images is also provided.
Files
choc_nocs.zip (266.5 MB, md5:61cf4e3ec589c2b5650ef0b0dfc08414)
Additional details
Related works
- Is supplemented by:
  - Preprint: 10.48550/arXiv.2211.10470 (DOI)
- References:
  - Dataset: 10.5281/zenodo.5085800 (DOI)
Funding
- UK Research and Innovation
  - CORSMAL: Collaborative object recognition, shared manipulation and learning (EP/S031715/1)