Large-Vocabulary Segmentation for Medical Images with Text Prompts
1Shanghai Jiao Tong University | 2Shanghai AI Laboratory
Abstract
This paper aims to build a model that can Segment Anything in 3D medical images, driven
by medical terminologies as Text prompts, termed SAT. Our main contributions are three-fold: (i)
We construct the first multimodal knowledge tree on human anatomy, including 6502 anatomical
terminologies; we then build the largest and most comprehensive segmentation dataset for training,
collecting over 22K 3D scans from 72 datasets, across 497 classes, with careful standardization
on both image and label space; (ii) We propose to inject medical knowledge into a text encoder via
contrastive learning and formulate a large-vocabulary segmentation model that can be prompted by
medical terminologies in text form. (iii) We train SAT-Nano (110M parameters) and SAT-Pro (447M
parameters). SAT-Pro achieves comparable performance to 72 nnU-Nets—the strongest specialist models
trained on each dataset (over 2.2B parameters combined)—over 497 categories. Compared with the
interactive approach MedSAM, SAT-Pro consistently outperforms across all 7 human body regions with
+7.1% average Dice Similarity Coefficient (DSC) improvement, while showing enhanced scalability and
robustness. On 2 external (cross-center) datasets, SAT-Pro achieves higher performance than all baselines
(+3.7% average DSC), demonstrating superior generalization ability.
Datasets
Domain Knowledge
We construct a knowledge tree based on multiple medical knowledge sources.
It encompasses thousands of anatomical concepts and definitions throughout the human body.
The concepts are linked via relations, and some are additionally mapped to segmentations on atlas images,
illustrating visual characteristics that can hardly be described by text alone.
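To make the structure concrete, below is a minimal sketch of how a node in such a knowledge tree could be represented; the schema (field names, example values, file path) is illustrative, not the paper's actual data format.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ConceptNode:
    """One anatomical concept in the knowledge tree (illustrative schema)."""
    name: str                          # e.g. "left kidney"
    definition: str                    # textual definition from a knowledge source
    parent: Optional[str] = None       # relation to a broader concept
    children: list[str] = field(default_factory=list)
    atlas_mask: Optional[str] = None   # optional segmentation on an atlas image

# A concept linked by relation and grounded on an atlas segmentation
node = ConceptNode(
    name="left kidney",
    definition="The kidney located on the left side of the retroperitoneum.",
    parent="kidney",
    atlas_mask="atlas/left_kidney.nii.gz",  # hypothetical path
)
```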
Segmentation Datasets
To equip our universal segmentation model with the ability to handle segmentation targets across various modalities and anatomical regions,
we collect and integrate 72 diverse, publicly available medical segmentation datasets,
totaling 22,186 CT and MRI scans with 302,033 segmentation annotations,
covering 497 anatomical structures and lesions spanning 8 regions of the human body: Brain, Head and Neck, Upper Limb, Thorax, Spine, Abdomen, Pelvis, and Lower Limb.
The dataset collection is termed SAT-DS.
Its detailed composition is presented in the paper.
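As a hint of what such standardization involves, the sketch below normalizes scan intensities and maps dataset-specific label names onto unified terminology; the window values, label names, and function are hypothetical, not the exact SAT-DS pipeline.

```python
import numpy as np

# Hypothetical mapping from dataset-specific label names to unified terminology
LABEL_MAP = {"l_kidney": "left kidney", "LKidney": "left kidney"}

def standardize_intensity(image: np.ndarray, modality: str) -> np.ndarray:
    """Normalize intensities so scans from heterogeneous sources are comparable
    (resampling to a shared voxel spacing and orientation is omitted here)."""
    if modality == "CT":
        image = np.clip(image, -500, 1000)      # illustrative Hounsfield-unit window
    lo, hi = np.percentile(image, (0.5, 99.5))  # robust intensity range
    image = np.clip(image, lo, hi)
    return (image - image.mean()) / (image.std() + 1e-8)
```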
Method
Towards building our universal text-prompted segmentation model, i.e., SAT, we consider two main stages: multimodal knowledge injection (a) and universal segmentation training (b).
In the first stage, we pair the data in the constructed knowledge tree into text-text or text-atlas segmentation pairs
and use them for visual-language pre-training,
injecting rich multimodal medical domain knowledge into the visual and text encoders.
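A minimal sketch of this kind of contrastive objective, assuming a standard symmetric InfoNCE loss over matched embedding pairs (the actual training recipe is detailed in the paper):

```python
import torch
import torch.nn.functional as F

def info_nce(text_emb: torch.Tensor, pair_emb: torch.Tensor, tau: float = 0.07):
    """Symmetric contrastive loss between text embeddings and their paired
    embeddings (another text, or visual features of an atlas segmentation).
    Matched pairs share the same index along the batch dimension."""
    text_emb = F.normalize(text_emb, dim=-1)
    pair_emb = F.normalize(pair_emb, dim=-1)
    logits = text_emb @ pair_emb.t() / tau  # (B, B) cosine similarities
    targets = torch.arange(text_emb.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```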
In the second stage, we build a universal segmentation model prompted by text:
the pre-trained text encoder generates an embedding for any anatomical terminology, which serves as the text prompt for segmentation.
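At inference, the flow could look like the sketch below; text_encoder, vision_backbone, and mask_decoder are placeholder modules for illustration, not the released SAT interface.

```python
import torch

@torch.no_grad()
def segment_by_text(image: torch.Tensor, terms: list[str],
                    text_encoder, vision_backbone, mask_decoder) -> torch.Tensor:
    """Each terminology is encoded into a prompt embedding that queries the
    visual features of the scan to produce one binary mask per term."""
    prompts = text_encoder(terms)         # (K, C): one embedding per terminology
    feats = vision_backbone(image)        # (1, C, D, H, W): 3D visual features
    masks = mask_decoder(feats, prompts)  # (K, D, H, W): one logit map per term
    return masks.sigmoid() > 0.5

# e.g. segment_by_text(ct_volume, ["liver", "left kidney", "spleen"], ...)
```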
Results
A Generalist Model is Worth 72 Specialist Models
In the internal validation, we evaluate SAT-Nano and SAT-Pro on all 72 segmentation datasets in the collection.
As there is no established benchmark for evaluating universal segmentation models,
we randomly split each dataset in the collection into 80% for training and 20% for testing.
We first take nnU-Net, U-Mamba and SwinUNETR as representative specialist methods and strong baselines. We train a specialist model on each dataset,
resulting in 72 nnU-Net/U-Mamba/SwinUNETR models, each with a configuration optimized for its dataset.
Overall, SAT-Pro shows performance comparable to the combination of 72 nnU-Net models, at only about 1/5 of their combined model size.
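All comparisons are reported in Dice Similarity Coefficient (DSC). For reference, a minimal sketch of the metric on binary masks (the paper's exact evaluation protocol may differ in details such as averaging):

```python
import torch

def dice_score(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> float:
    """Dice Similarity Coefficient between two binary masks of the same shape."""
    pred, target = pred.bool(), target.bool()
    inter = (pred & target).sum().float()
    return (2 * inter / (pred.sum() + target.sum() + eps)).item()
```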
Text-prompted Segmentation Can Be User-Friendly
Driven by text prompts, SAT outlines a novel paradigm for segmentation foundation models, as opposed to previous interactive approaches that rely on spatial prompts.
This could save tremendous manual prompting effort in clinical applications. On performance, SAT-Pro consistently outperforms the
state-of-the-art interactive model MedSAM across 7 human body regions, while being robust to targets
with ambiguous spatial relationships.
For more experiment results and interesting findings, please refer to our paper.
Acknowledgements
Based on a template by Phillip Isola and Richard Zhang.