EPFL-Smart-Kitchen-30 Collected data
Published May 28, 2025 | Version V1 | Open Dataset

Authors/Creators
- Bonnetto, Andy (Data collector) 1
- Qi, Haozhe (Data collector) 1
- Leong, Franklin (Data collector) 1
- Tashkovska, Matea (Project member)
- Rad, Mahdi (Project member) 2
- Shokur, Solaiman (Project member) 3
- Hummel, Friedhelm (Project manager) 1
- Micera, Silvestro (Project manager) 4, 1
- Pollefeys, Marc (Project manager) 5, 6
- Mathis, Alexander (Project leader) 1
Description
# The EPFL-Smart-Kitchen-30
École Polytechnique Fédérale de Lausanne (EPFL)
> ⚠️ 3D pose and action annotations can be found at https://zenodo.org/records/15551913
Understanding behavior requires datasets that capture humans while carrying out complex tasks. The kitchen is an excellent environment for assessing human motor and cognitive function, as many complex actions are naturally exhibited in kitchens, from chopping to cleaning. Here, we introduce the EPFL-Smart-Kitchen-30 dataset, collected in a noninvasive motion capture platform inside a kitchen environment. Nine static RGB-D cameras, inertial measurement units (IMUs) and one head-mounted HoloLens 2 headset were used to capture 3D hand, body, and eye movements. The EPFL-Smart-Kitchen-30 dataset is a multi-view action dataset with synchronized exocentric and egocentric video, depth, IMUs, eye gaze, and body and hand kinematics, spanning 29.7 hours of 16 subjects cooking four different recipes. Action sequences were densely annotated with 33.78 action segments per minute. Leveraging this multi-modal dataset, we propose four benchmarks to advance behavior understanding and modeling through
1) a vision-language benchmark,
2) a semantic text-to-motion generation benchmark,
3) a multi-modal action recognition benchmark,
4) a pose-based action segmentation benchmark.
## General information
* **Authors**: Andy Bonnetto 1, Haozhe Qi 1, Franklin Leong 1, Matea Tashkovska 1, Mahdi Rad 3, Solaiman Shokur 1,3, Friedhelm Hummel 1,4,5, Silvestro Micera 1,3, Marc Pollefeys 2,6, Alexander Mathis 1
* **Affiliation**: 1 École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, 2 Microsoft, 3 Scuola Superiore Sant’Anna, Pisa, 4 Swiss Federal Institute of Technology Valais (EPFL Valais), Clinique Romande de Réadaptation, Sion, 5 University of Geneva Medical School, Geneva, 6 Eidgenössische Technische Hochschule (ETH), Zürich
* **Date of collection**: 05.2023 - 01.2024 (MM.YYYY - MM.YYYY)
* **Geolocation data**: Campus Biotech, Genève, Switzerland
* **Associated publication URL**: https://arxiv.org/abs/2506.01608
* **Funding**: Our work was funded by EPFL and Microsoft Swiss Joint Research Center and a Boehringer Ingelheim Fonds PhD stipend (H.Q.). We are grateful to the Brain Mind Institute for providing funds for the cameras and to the Neuro-X Institute for providing funds to annotate data.
## Dataset availability
* **License**: This dataset is released under the non-commercial [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/legalcode) license.
* **Citation**: Please cite the associated publication when using our data.
* **Repository URL**: https://github.com/amathislab/EPFL-Smart-Kitchen
* **Repository DOI**: 10.5281/zenodo.15535461
* **Dataset version**: v1
## Data and files overview
* **Data preparation**: unzip `Public_release.zip`
* **Repository structure**:
```
Public_release_videos
├── train
| ├── YH2002 (participant)
| | ├── 2023_12_04_10_15_23 (session)
| | | ├── IMUs
| | | | ├── 2023-12-4 12h10.csv
| | | | ├── knife.csv
| | | | └── spatula.csv
| | | ├── meta_data
| | | | ├── camera_matrix.json
| | | | ├── holo_data_wpose.csv
| | | | └── timestamps.txt
| | | ├── videos
| | | | ├── hololens.mp4
| | | | ├── output0.mp4
| | | | └── ...
| | | ├── videos_depth
| | | | └── output0.txt
| | | └── timestamps_depth
| | | ├── output0.txt
| | | └── ...
| | └── ...
| └── ...
└── test
└── ...
lemonade_benchmark.csv
```
* `train` and `test`: Contain the train and test data for the action recognition, action segmentation and full-body motion generation tasks. These folders are organized by participant and session. Each session contains five modalities:
* **IMUs**: IMU data, currently not synchronized with the videos.
* **meta_data**: Calibration information and timestamps of the HoloLens recordings. The file `holo_data_wpose.csv` stores hand pose estimates from the HoloLens and eye gaze measurements.
* **videos**: RGB videos recorded from each exocentric camera (output0, ...) and the egocentric view (hololens).
* **videos_depth**: Depth videos recorded from each exocentric view.
* **timestamps_depth**: Timestamps of the depth recording for each exocentric view.
> We refer the reader to the associated publication for details about data processing and tasks description.
### Naming conventions
* Exocentric camera names are the following: output0, Aoutput0, Aoutput1, Aoutput2, Aoutput3, Boutput0, Boutput1, Boutput2, Boutput3.
* Participants are identified by YH followed by a random identifier; sessions are named by the date and time of recording (see the loading sketch below).
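With these conventions, the unzipped release can be enumerated programmatically. Below is a minimal Python sketch, not part of the release, that walks `Public_release_videos` and reports which of the five modality folders each session contains; the root path is an assumption and should point to wherever the archive was extracted.

```python
from pathlib import Path

# Assumed location of the unzipped release; adjust to where Public_release.zip was extracted.
ROOT = Path("Public_release_videos")

MODALITIES = ["IMUs", "meta_data", "videos", "videos_depth", "timestamps_depth"]

def iter_sessions(split):
    """Yield (participant, session, session_path) for one split ('train' or 'test')."""
    for participant_dir in sorted((ROOT / split).iterdir()):
        if not participant_dir.is_dir():
            continue
        for session_dir in sorted(participant_dir.iterdir()):
            if session_dir.is_dir():
                yield participant_dir.name, session_dir.name, session_dir

for participant, session, path in iter_sessions("train"):
    present = [m for m in MODALITIES if (path / m).is_dir()]
    print(f"{participant}/{session}: {', '.join(present)}")
```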
### File characteristics
* imu_sensors (`DATE.csv`): IMU recordings are stored in CSV files with the following fields:
* Date: Precise sampling time
* sensor0X : accelerometer value in the X axis for the first sensor
* sensor0Y : accelerometer value in the Y axis for the first sensor
* sensor0Z : accelerometer value in the Z axis for the first sensor
* ...
* sensor5Z : accelerometer value in the Z axis for the sixth sensor
* `knife.csv` and `spatula.csv`: IMU recordings for the knife and spatula accelerometers, with the following fields:
* Date: Precise sampling time
* AccX: accelerometer value in the X axis
* AccY: accelerometer value in the Y axis
* AccZ: accelerometer value in the Z axis
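As a quick sanity check, the IMU CSVs described above can be read with pandas. This is a hedged sketch: the example session path is illustrative, and the body-worn IMU file is located simply by excluding `knife.csv` and `spatula.csv` from the `IMUs` folder, since its exact name encodes the recording date.

```python
from pathlib import Path

import pandas as pd

# Hypothetical session path; adapt to your local copy of the unzipped release.
session = Path("Public_release_videos/train/YH2002/2023_12_04_10_15_23")
imu_dir = session / "IMUs"

# Body-worn IMU file: a "Date" column plus sensor0X ... sensor5Z accelerometer columns.
# Its file name encodes the recording date, so we take the CSV that is not a tool file.
body_file = next(f for f in sorted(imu_dir.glob("*.csv"))
                 if f.name not in ("knife.csv", "spatula.csv"))
body_imu = pd.read_csv(body_file)

# Tool-mounted accelerometers: Date, AccX, AccY, AccZ.
knife = pd.read_csv(imu_dir / "knife.csv")

print(body_imu.columns.tolist())          # expected: ['Date', 'sensor0X', ..., 'sensor5Z']
print(knife[["AccX", "AccY", "AccZ"]].head())
```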
* `camera_matrix.json`: Contains calibration information for the cameras, with the following fields:
* hololens:
* K: Intrinsic parameters of the camera
* dist: Distortion coefficients
* output0:
* word2cam: Transformation matrix (4x4) from world coordinates to camera coordinates
* K: Intrinsic parameters of the RGB camera
* dist: Distortion coefficients of the RGB camera
* world2depth: Transformation matrix from world coordinates to depth camera coordinates
* depth_K: Intrinsic parameters of the depth camera
* depth_dist: Distortion coefficients of the depth camera
* ...
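Given these fields, the calibration file can be used to project world-coordinate points into an exocentric view. The sketch below is illustrative only: it assumes `K` is a 3x3 intrinsic matrix and `word2cam` a 4x4 homogeneous transform (field names copied as documented above), uses a made-up session path and world point, and ignores lens distortion.

```python
import json

import numpy as np

# Hypothetical path to one session's calibration file; adapt as needed.
calib_path = "Public_release_videos/train/YH2002/2023_12_04_10_15_23/meta_data/camera_matrix.json"
with open(calib_path) as f:
    calib = json.load(f)

cam = calib["output0"]
world2cam = np.asarray(cam["word2cam"])  # 4x4 world-to-camera transform (field name as documented)
K = np.asarray(cam["K"])                 # assumed 3x3 intrinsic matrix

# Project an arbitrary world point into the output0 image plane; distortion is ignored here.
point_world = np.array([0.5, 0.2, 1.0, 1.0])      # homogeneous world coordinates (illustrative)
point_cam = world2cam @ point_world               # camera coordinates
uv = K @ (point_cam[:3] / point_cam[2])           # pinhole projection
print("pixel coordinates:", uv[:2])
```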
* `holo_data_wpose.csv`: File with the HoloLens recordings, with the following fields:
* world2holo: Transformation matrix (4x4) from world coordinates to HoloLens coordinates at each timestep
* eyes: Eye gaze information as a 7-value vector: the concatenation of the Cartesian coordinates of the gaze origin, the gaze direction vector, and 1.
* holorights/hololefts: 3D hand pose estimation from the HoloLens 2 (26 keypoints) with confidence values.
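The exact serialization of the matrix- and vector-valued cells in `holo_data_wpose.csv` is not specified here, so the following sketch makes an assumption: each cell holds a flat list of numbers that can be recovered with a simple regular expression. The helper `parse_numbers`, the session path, and the column access pattern are all hypothetical; adapt them after inspecting the file.

```python
import re

import numpy as np
import pandas as pd

def parse_numbers(cell):
    """Extract all floats from a CSV cell, whatever the exact delimiter or bracket style."""
    return np.array([float(x) for x in re.findall(r"-?\d+\.?\d*(?:[eE][+-]?\d+)?", str(cell))])

# Hypothetical session path; adapt as needed.
holo = pd.read_csv(
    "Public_release_videos/train/YH2002/2023_12_04_10_15_23/meta_data/holo_data_wpose.csv"
)

# eyes: 7 values = gaze origin (3), gaze direction (3), and a trailing 1.
eyes = parse_numbers(holo.loc[0, "eyes"])
gaze_origin, gaze_direction = eyes[:3], eyes[3:6]

# world2holo: 16 values forming a 4x4 world-to-HoloLens transform.
world2holo = parse_numbers(holo.loc[0, "world2holo"]).reshape(4, 4)
print("gaze origin:", gaze_origin, "gaze direction:", gaze_direction)
```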
* All videos are in .mp4 format
## Methodological information
**Benchmark evaluation code**: Will be available soon
> We refer the reader to the associated publication for details about data processing and tasks description.
## Acknowledgements
Our work was funded by EPFL and Microsoft Swiss Joint Research Center and a Boehringer Ingelheim Fonds PhD stipend (H.Q.). We are grateful to the Brain Mind Institute at EPFL for providing funds for the cameras and to the Neuro-X Institute at EPFL for providing funds to annotate data.
## Change log (DD.MM.YYYY)
[03.06.2025]: First data release!
Files
- lemonade_benchmark.csv

Related works
- Is continued by: Dataset 10.5281/zenodo.15551913 (DOI)
- Is published in: Preprint arXiv:2506.01608 (arXiv)

Funding
- Swiss National Science Foundation: Joint behavior and neural data modeling for naturalistic behavior (grant 10000950)