Published September 30, 2025 | Version v1
Preprint | Open Access

Learning Egocentric In-Hand Object Segmentation through Weak Supervision from Human Narrations

  • 1. Consiglio Nazionale delle Ricerche Area della Ricerca di Pisa
  • 2. University of Catania
  • 3. Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo"
  • 4. Istituto di Scienza e Tecnologie dell'Informazione Alessandro Faedo Consiglio Nazionale delle Ricerche
  • 5. Università degli Studi di Catania
  • 6. National Research Council

Description

Under review. Preprint version.

Pixel-level recognition of objects manipulated by the user from egocentric images enables key applications spanning assistive technologies, industrial safety, and activity monitoring. However, progress in this area is currently hindered by the scarcity of annotated datasets, as existing approaches rely on costly manual labels. In this paper, we propose to learn human-object interaction detection by leveraging narrations -- natural-language descriptions of the actions performed by the camera wearer, which contain clues about the manipulated objects (e.g., "I am pouring vegetables from the chopping board to the pan"). Narrations provide a form of weak supervision that is cheap to acquire and readily available in state-of-the-art egocentric datasets. We introduce Narration-Supervised in-Hand Object Segmentation (NS-iHOS), a novel task in which models learn to segment in-hand objects from natural-language narrations, which are not employed at inference time. We showcase the potential of the task by proposing Weakly-Supervised In-hand Object Segmentation from Human Narrations (WISH), an end-to-end model that distills knowledge from narrations to learn plausible hand-object associations and enable in-hand object segmentation without narrations at test time. We benchmark WISH against baselines based on open-vocabulary object detectors and vision-language models, showing the superiority of its design. Experiments on EPIC-Kitchens and Ego4D show that WISH surpasses all baselines, recovering more than 50% of the performance of fully supervised methods without employing fine-grained pixel-wise annotations.
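
To give a concrete sense of the weak-supervision signal described above, the following is a minimal, hypothetical Python sketch of how candidate object words might be mined from a narration to serve as noisy labels. The stopword list and the "-ing" filter are illustrative assumptions only; this is not the WISH model, which, per the abstract, learns hand-object associations end-to-end from narrations rather than relying on such a heuristic.

import re

# Illustrative stopword list (assumption for this sketch; not exhaustive).
STOPWORDS = {
    "i", "am", "the", "a", "an", "to", "from", "with", "on", "in",
    "and", "of", "my", "it", "up", "down", "into",
}

def candidate_objects(narration: str) -> list[str]:
    """Extract candidate object words from a narration.

    Crude heuristic: lowercase, keep alphabetic tokens, drop stopwords
    and "-ing" forms (usually the verb). A real system would rely on a
    part-of-speech tagger or a vision-language model instead.
    """
    tokens = re.findall(r"[a-z]+", narration.lower())
    return [t for t in tokens if t not in STOPWORDS and not t.endswith("ing")]

if __name__ == "__main__":
    narration = "I am pouring vegetables from the chopping board to the pan"
    print(candidate_objects(narration))
    # prints: ['vegetables', 'board', 'pan']

Candidates mined this way are noisy (e.g., "chopping board" is lost here); the point of the task is that such cheap, imperfect textual clues can still supervise pixel-level segmentation when narrations are abundant.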

Files

2509.26004v1.pdf (1.8 MB)
md5:1ed7dbee2b7ebd0185452318f96a044f

Additional details

Funding

European Commission
SUN - Social and hUman ceNtered XR (grant 101092612)

Software