Published April 18, 2024 | Version v1

AMOS-MM: Abdominal Multimodal Analysis Challenge: Structured description of the challenge design

  • 1. The University of Hong Kong
  • 2. Stanford University

Description

Medical image analysis technology has played a significant role in enhancing the diagnostic and therapeutic capabilities of healthcare professionals, particularly in abdominal CT image analysis. AI-driven localization and detection algorithms have improved both the efficiency of medical workflows and the accuracy of diagnoses. Clinicians can analyse key indicators in images, such as the location and size of tumours, enabling them to develop appropriate treatment plans for patients and to assess disease prognosis accurately.


Current models, while showing significant progress, are still limited to processing image data alone. They do not effectively integrate other important clinical information, such as textual records and patient history, which limits a comprehensive understanding of each case. Large-scale language-vision models (LLVMs) [0,1], by contrast, have proven effective at tasks beyond those for which they were specifically designed, and their rapid development in recent years means that this lack of flexibility and adaptability is no longer an inherent limitation. These models address a variety of real-world problems by demonstrating zero-shot generalization through natural-language queries. However, while recent LLVMs have made significant progress in processing complex, open-ended visual queries, they still fall short in understanding and interacting with biomedical images, partly due to a lack of high-quality datasets and a lack of adaptation to the medical domain. The AMOS-MM: Abdominal Multimodal Analysis Challenge was launched to address this gap: building a unified knowledge framework that bridges vision and language in the domain of abdominal CT scans. Our high-quality image-text dataset integrates natural language processing and computer vision technologies to enable deep understanding of medical images and natural-language responses to diverse clinical queries.


The AMOS-MM Challenge is an extension of the AMOS22 Challenge (https://zenodo.org/records/6361922), which aimed to promote accurate segmentation of 15 different organs under multimodal conditions. Building on this, we introduce two vision-language tasks: medical report generation and medical visual question answering (VQA). The medical report generation task requires participants to automatically extract key medical information from CT images and generate comprehensive yet easy-to-understand medical reports, focusing on the impression section. The medical VQA task requires models to integrate external knowledge and answer clinical questions based on CT images.


Building on the foundation of AMOS22, we have expanded the dataset to 2300 scans (training/validation/testing), with each case accompanied by comprehensive clinical narratives in both English and Chinese and by tailored VQA (visual question answering) queries. To the best of our knowledge, this is the first publicly available 3D CT multimodal vision-text dataset, and it aims to push the boundaries of multimodal medical image analysis.

In summary, AMOS-MM offers the following remarkable features:

  • The first 3D CT multimodal image-text database: Containing 2300 cases, it supports voxel-based medical reporting and VQA tasks.
  • Innovative dual tasks:
    • Automated medical reporting: Provides 2300 image-text pairs of medical reports in bilingual English and Chinese versions. This task specifically targets the synthesis of the 'impression' section of medical reports, a crucial part that summarises key findings and diagnostic interpretations. The quality of the generated reports is assessed using BLEU [2], ROUGE-L [3] and METEOR [4] scores, ensuring a comprehensive assessment of linguistic accuracy and coherence, which are essential for reliable medical documentation (see the scoring sketch after this list).
    • Medical visual question answering: Based on the same 2300 cases, closed-ended, multiple-choice questions are provided, divided into eight key skill areas under two categories: 'Imaging Perception' and 'Clinical Reasoning'. The Imaging Perception category includes Anatomical Identification, Abnormality Detection, Sign Recognition and Quantitative Assessment, focusing on the fundamental aspects of medical image analysis. The Clinical Reasoning category includes Disease Diagnosis, Etiological Analysis, Therapeutic Suggestion and Risk Assessment, addressing higher-level clinical insight and decision making based on imaging. This scope is designed to cover the essential aspects of medical image analysis and clinical application, with the accuracy of responses as the primary evaluation metric.
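
To make the evaluation concrete, the following is a minimal, unofficial scoring sketch in Python. It assumes the third-party nltk and rouge-score packages, whitespace tokenisation, and single-letter answer choices; the function names (report_scores, vqa_accuracy) are illustrative only, and the official challenge scoring script may differ in tokenisation and metric settings.

    # Unofficial scoring sketch; assumes `pip install nltk rouge-score` and that
    # NLTK's 'wordnet' corpus has been downloaded (required by METEOR).
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
    from nltk.translate.meteor_score import meteor_score
    from rouge_score import rouge_scorer

    def report_scores(reference: str, prediction: str) -> dict:
        """Score one generated 'impression' against its reference report."""
        ref_tok, pred_tok = reference.split(), prediction.split()
        bleu = sentence_bleu([ref_tok], pred_tok,
                             smoothing_function=SmoothingFunction().method1)
        rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(
            reference, prediction)["rougeL"].fmeasure
        meteor = meteor_score([ref_tok], pred_tok)
        return {"BLEU": bleu, "ROUGE-L": rouge_l, "METEOR": meteor}

    def vqa_accuracy(gold, predicted):
        """Accuracy over closed-ended, multiple-choice answers (e.g. 'A'-'D')."""
        return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

    print(report_scores("no evidence of focal hepatic lesion",
                        "no focal hepatic lesion is seen"))
    print(vqa_accuracy(["A", "C", "B"], ["A", "C", "D"]))   # -> 0.666...

This snippet only illustrates the per-case metrics named above; an actual evaluation would aggregate corpus-level scores over the full test set.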


[0] OpenAI. (2023). ChatGPT (Mar 14 version) [Large language model]. https://chat.openai.com/chat
[1] Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2023). Visual instruction tuning. arXiv preprint arXiv:2304.08485.
[2] Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: a method for automatic evaluation of machine translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics.
[3] Lin, C.-Y. (2004). ROUGE: A package for automatic evaluation of summaries. Text Summarization Branches Out.
[4] Lavie, A., & Denkowski, M. J. (2009). The METEOR metric for automatic evaluation of machine translation. Machine Translation, 23, 105–115.

Files

AMOS-MM_ Abdominal Multimodal Analysis Challenge.pdf
