MUSK: A Vision-Language Foundation Model for Precision Oncology
Description
Abstract: Clinical decision-making is driven by multimodal data, including clinical notes and pathologic characteristics. Artificial intelligence approaches that can effectively integrate multimodal data hold significant promise for advancing clinical care. However, the scarcity of well-annotated multimodal datasets in the clinical setting hinders the development of useful models. Here, we develop Multimodal transformer with Unified maSK modeling (MUSK), a vision-language foundation model designed to leverage large-scale, unlabeled, unpaired image-text data. MUSK is pre-trained on 50 million pathology images and 1 billion pathology-related text tokens using unified masked modeling. MUSK is further pre-trained on 1 million pathology image-text pairs to align vision and language features efficiently. After pretraining, MUSK is applied to a wide range of downstream tasks involving pathology images and/or text with minimal or no further fine-tuning. MUSK achieves superior performance across 21 patch-level and slide-level benchmarks, including image-to-text and text-to-image retrieval, visual question answering, and image classification. Importantly, MUSK shows promising performance in outcome prediction, including melanoma relapse prediction, pan-cancer prognosis prediction, and immunotherapy response prediction in lung and gastro-esophageal cancers. MUSK effectively combines complementary information from pathology images and clinical reports and can potentially improve diagnosis and precision cancer therapy.
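The abstract describes a second pretraining stage that aligns vision and language features on paired data. The sketch below is a minimal, hypothetical illustration of that kind of image-text alignment objective (a symmetric contrastive loss, as in CLIP-style training); it is not the authors' implementation, and all names and dimensions here are assumptions for illustration only.

```python
# Hypothetical sketch of image-text contrastive alignment, NOT the MUSK codebase:
# embeddings from matched image-text pairs are pulled together, mismatched
# pairs in the batch are pushed apart.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(image_emb: torch.Tensor,
                               text_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    image_emb, text_emb: (batch, dim) features from the vision and language
    branches; row i of each tensor is assumed to come from the same pair.
    """
    # L2-normalize so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # (batch, batch) similarity matrix; the diagonal holds matched pairs.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Average the image-to-text and text-to-image cross-entropy terms.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Random features standing in for pathology image/report embeddings.
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(contrastive_alignment_loss(img, txt).item())
```

Once aligned this way, the same embeddings support the zero-shot image-to-text and text-to-image retrieval benchmarks the abstract reports, since retrieval reduces to ranking by the cosine similarities above.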
Additional details
Dates
- Submitted: 2024-08-13