ODIN2026 Challenge: Multimodal Text Report Generation for Oral and Dental Image Analysis
Authors/Creators
- Bolelli, Federico (1)
- Ben-Hamadou, Achraf (2)
- Lumetti, Luca (3)
- Pujades Rocamora, Sergi (4, 5, 6, 7, 8)
- van Nistelrooij, Niels (9)
- Marchesini, Kevin (3)
- Cremonini, Francesca (10)
- Di Bartolomeo, Mattia
- Fix, Lucas (11)
- Morelli, Nicola (3)
- Rekik, Ahmed (2)
- Neifar, Nour (2)
- Abida, Ons (2)
- Smaoui, Oussama (11)
- Xi, Ton (9)
- Vinayahalingam, Shankeeth (9)
- Lombardo, Luca (10)
- Anesi, Alexandre (3)
- Grana, Costantino (3)
Description
Three-dimensional imaging is now routine in dentistry and maxillofacial surgery, where cone-beam computed tomography (CBCT) and intraoral scanning (IOS) support diagnosis, treatment planning, and follow-up across workflows such as implantology, tooth extraction, and orthodontics. CBCT captures internal dental and craniofacial anatomy (e.g., bone quality and quantity, and proximity to critical structures), while IOS provides highly accurate surface geometry of the crowns and gingiva. However, despite the increasing availability of rich 3D (and complementary 2D) data, structured, high-quality clinical reports are still produced largely by hand. This process is time-consuming for clinicians and prone to inter-observer variability, and it creates a clear bottleneck for scalable, consistent clinical decision support.
From a technical perspective, the community has made substantial progress on foundational subtasks such as 3D segmentation, landmark detection, and multimodal registration, with some approaches nearing full automation in constrained settings. Yet these advances have not translated into standardized, end-to-end benchmarks that directly evaluate systems turning multimodal oral imaging into clinically meaningful textual descriptions. Compared with the growing body of work on 2D image-to-report generation, multimodal 3D-to-text reporting remains underexplored, largely because of the added complexity of learning and summarizing 3D spatial relationships across volumes, meshes, and photographs.
ODIN 2026 addresses this gap by introducing a dedicated benchmark for multimodal-to-text report generation, building on the ToothFairy and 3DTeethLand/Seg challenge series. The challenge targets two clinically intertwined scenarios: (1) maxillofacial and surgical-planning report generation from CBCT (ToothFairy4), and (2) orthodontic report generation from IOS meshes and intraoral photographs (Bite2Text). By design, both tasks use multi-center training data and a hidden test set from an independent center, so that robustness to domain shift, an essential requirement for real-world deployment, is rigorously assessed.
The envisioned impact is twofold. Biomedically, ODIN 2026 aims to accelerate and standardize documentation by enabling automatic draft reports that improve consistency across centers, reduce clinician workload, and support less experienced practitioners by systematically surfacing critical findings and risks. Technically, the challenge aims to catalyze research in multimodal 3D-to-text learning by benchmarking models that must jointly reason over high-dimensional volumes, surface meshes, 2D photographs, and clinical language while producing template-constrained outputs suitable for clinical use. By releasing training resources and evaluation protocols, and by combining automatic text metrics with expert review, ODIN 2026 is designed to drive clinically grounded innovation at the intersection of dental imaging, multimodal learning, and language generation.
Files
| Name | Size | MD5 checksum |
|---|---|---|
| 293-ODIN2026_Challenges_-_Multimodal_Text_Report_Generation_for_2026-04-22T16-36-59.pdf | 164.6 kB | e9a80625577bc7574d9c523fed5a59d0 |