Large-Vocabulary Segmentation for Medical Images with Text Prompts
1Shanghai Jiao Tong University | 2Shanghai AI Laboratory
Abstract
This paper aims to build a model that can Segment Anything in 3D medical images, driven
by medical terminologies as Text prompts, termed SAT. Our main contributions are three-fold: (i)
We construct the first multimodal knowledge tree on human anatomy, including 6502 anatomical
terminologies; we then build the largest and most comprehensive segmentation dataset for training,
collecting over 22K 3D scans from 72 datasets, across 497 classes, with careful standardization
on both image and label space; (ii) We propose to inject medical knowledge into a text encoder via
contrastive learning and formulate a large-vocabulary segmentation model that can be prompted by
medical terminologies in text form. (iii) We train SAT-Nano (110M parameters) and SAT-Pro (447M
parameters). SAT-Pro achieves comparable performance to 72 nnU-Nets—the strongest specialist models
trained on each dataset (over 2.2B parameters combined)—over 497 categories. Compared with the
interactive approach MedSAM, SAT-Pro consistently outperforms across all 7 human body regions with
+7.1% average Dice Similarity Coefficient (DSC) improvement, while showing enhanced scalability and
robustness. On 2 external (cross-center) datasets, SAT-Pro achieves higher performance than all baselines
(+3.7% average DSC), demonstrating superior generalization ability.
Datasets
Domain Knowledge
We construct a knowledge tree based on multiple medical knowledge sources.
It encompasses thousands of anatomical concepts and definitions throughout the human body.
The concepts are linked via relations, and some are additionally mapped to segmentations on atlas images,
illustrating visual characteristics that can hardly be described by text alone.
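To make the structure concrete, below is a minimal sketch of how a node in such a knowledge tree could be represented; the schema (field names, example values, file path) is illustrative, not the paper's actual data format.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ConceptNode:
    """One anatomical concept in the knowledge tree (illustrative schema)."""
    name: str                          # e.g. "left kidney"
    definition: str                    # textual definition from a knowledge source
    parent: Optional[str] = None       # relation to a broader concept
    children: list[str] = field(default_factory=list)
    atlas_mask: Optional[str] = None   # optional segmentation on an atlas image

# A concept linked by relation and grounded on an atlas segmentation
node = ConceptNode(
    name="left kidney",
    definition="The kidney located on the left side of the retroperitoneum.",
    parent="kidney",
    atlas_mask="atlas/left_kidney.nii.gz",  # hypothetical path
)
```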
Segmentation Datasets
To equip our universal segmentation model with the ability to handle segmentation targets across various modalities and anatomical regions,
we collect and integrate 72 diverse, publicly available medical segmentation datasets,
totaling 22,186 CT and MRI scans with 302,033 segmentation annotations,
covering 497 anatomical structures and lesions spanning 8 regions of the human body: Brain, Head and Neck, Upper Limb, Thorax, Spine, Abdomen, Pelvis, and Lower Limb.
The dataset collection is termed SAT-DS.
Its detailed composition is presented in the paper.
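As a hint of what such standardization involves, the sketch below normalizes scan intensities and maps dataset-specific label names onto unified terminology; the window values, label names, and function are hypothetical, not the exact SAT-DS pipeline.

```python
import numpy as np

# Hypothetical mapping from dataset-specific label names to unified terminology
LABEL_MAP = {"l_kidney": "left kidney", "LKidney": "left kidney"}

def standardize_intensity(image: np.ndarray, modality: str) -> np.ndarray:
    """Normalize intensities so scans from heterogeneous sources are comparable
    (resampling to a shared voxel spacing and orientation is omitted here)."""
    if modality == "CT":
        image = np.clip(image, -500, 1000)      # illustrative Hounsfield-unit window
    lo, hi = np.percentile(image, (0.5, 99.5))  # robust intensity range
    image = np.clip(image, lo, hi)
    return (image - image.mean()) / (image.std() + 1e-8)
```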
Method
Towards building our universal text-prompted segmentation model, i.e., SAT, we consider two main stages: multimodal knowledge injection (a) and universal segmentation training (b).
In the first stage, we pair the data in the constructed knowledge tree into text-text or text-atlas segmentation pairs
and use them for visual-language pre-training,
injecting rich multimodal medical domain knowledge into the visual and text encoders.
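A minimal sketch of this kind of contrastive objective, assuming a standard symmetric InfoNCE loss over matched embedding pairs (the actual training recipe is detailed in the paper):

```python
import torch
import torch.nn.functional as F

def info_nce(text_emb: torch.Tensor, pair_emb: torch.Tensor, tau: float = 0.07):
    """Symmetric contrastive loss between text embeddings and their paired
    embeddings (another text, or visual features of an atlas segmentation).
    Matched pairs share the same index along the batch dimension."""
    text_emb = F.normalize(text_emb, dim=-1)
    pair_emb = F.normalize(pair_emb, dim=-1)
    logits = text_emb @ pair_emb.t() / tau  # (B, B) cosine similarities
    targets = torch.arange(text_emb.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```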
In the second stage, we build a universal segmentation model prompted by text:
the pre-trained text encoder generates an embedding for any anatomical terminology, which serves as the text prompt for segmentation.
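At inference, the flow could look like the sketch below; text_encoder, vision_backbone, and mask_decoder are placeholder modules for illustration, not the released SAT interface.

```python
import torch

@torch.no_grad()
def segment_by_text(image: torch.Tensor, terms: list[str],
                    text_encoder, vision_backbone, mask_decoder) -> torch.Tensor:
    """Each terminology is encoded into a prompt embedding that queries the
    visual features of the scan to produce one binary mask per term."""
    prompts = text_encoder(terms)         # (K, C): one embedding per terminology
    feats = vision_backbone(image)        # (1, C, D, H, W): 3D visual features
    masks = mask_decoder(feats, prompts)  # (K, D, H, W): one logit map per term
    return masks.sigmoid() > 0.5

# e.g. segment_by_text(ct_volume, ["liver", "left kidney", "spleen"], ...)
```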
Results
A Generalist Model is Worth 72 Specialist Models
In the internal validation, we evaluate SAT-Nano and SAT-Pro on all 72 segmentation datasets in the collection.
As there is no established benchmark for evaluating universal segmentation models,
we randomly split each dataset in the collection into 80% for training and 20% for testing.
We first take nnU-Net, U-Mamba and SwinUNETR as representative specialist methods and strong baselines. We train a specialist model on each dataset,
resulting in 72 nnU-Net/U-Mamba/SwinUNETR models, each with a configuration optimized for its dataset.
Overall, SAT-Pro shows performance comparable to the combination of 72 nnU-Net models, at only about 1/5 of their combined model size.
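All comparisons are reported in Dice Similarity Coefficient (DSC). For reference, a minimal sketch of the metric on binary masks (the paper's exact evaluation protocol may differ in details such as averaging):

```python
import torch

def dice_score(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> float:
    """Dice Similarity Coefficient between two binary masks of the same shape."""
    pred, target = pred.bool(), target.bool()
    inter = (pred & target).sum().float()
    return (2 * inter / (pred.sum() + target.sum() + eps)).item()
```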
Text-prompted Segmentation Can Be User-Friendly
Driven by text prompts, SAT outlines a novel paradigm for segmentation foundation models, as opposed to previous interactive approaches that rely on spatial prompts.
This could save tremendous manual prompting effort in clinical applications. On performance, SAT-Pro consistently outperforms the
state-of-the-art interactive model MedSAM across 7 human body regions, while being robust to targets
with ambiguous spatial relationships.
For more experiment results and interesting findings, please refer to our paper.
Acknowledgements
Based on a template by Phillip Isola and Richard Zhang.