Large-Vocabulary Segmentation for Medical Images with Text Prompts

npj Digital Medicine


Ziheng Zhao1,2
Yao Zhang2
Chaoyi Wu1,2
Xiaoman Zhang1,2
Xiao Zhou2

Ya Zhang1,2
Yanfeng Wang1,2
Weidi Xie1,2

1Shanghai Jiao Tong University
2Shanghai AI Laboratory

Code [GitHub]

Data [Data]

Paper [arXiv]

Cite [BibTeX]


Abstract

This paper aims to build a model that can Segment Anything in 3D medical images, driven by medical terminologies as Text prompts, termed SAT. Our main contributions are three-fold: (i) We construct the first multimodal knowledge tree on human anatomy, covering 6,502 anatomical terminologies, and build the largest and most comprehensive segmentation dataset for training, collecting over 22K 3D scans from 72 datasets across 497 classes, with careful standardization of both image and label space; (ii) We propose to inject medical knowledge into a text encoder via contrastive learning and formulate a large-vocabulary segmentation model that can be prompted by medical terminologies in text form; (iii) We train SAT-Nano (110M parameters) and SAT-Pro (447M parameters). Over 497 categories, SAT-Pro achieves performance comparable to 72 nnU-Nets, the strongest specialist models trained per dataset (over 2.2B parameters combined). Compared with the interactive approach MedSAM, SAT-Pro consistently outperforms across all 7 human body regions with a +7.1% average Dice Similarity Coefficient (DSC) improvement, while showing better scalability and robustness. On 2 external (cross-center) datasets, SAT-Pro achieves higher performance than all baselines (+3.7% average DSC), demonstrating superior generalization ability.



Datasets

Domain Knowledge

We construct a knowledge tree from multiple medical knowledge sources, encompassing thousands of anatomical concepts and definitions throughout the human body. The concepts are linked via relations, and some are further mapped to segmentations on atlas images, capturing visual features that can hardly be described by text alone.
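To make the structure concrete, the sketch below shows one way a node in such a knowledge tree could be represented; the field names and example values are our illustration, not the released schema.

```python
# A minimal sketch of one node in the multimodal knowledge tree.
# All field names here are illustrative assumptions, not the paper's schema.
from dataclasses import dataclass, field

@dataclass
class AnatomyConcept:
    name: str                 # canonical terminology, e.g. "left kidney"
    definition: str           # textual definition from the knowledge sources
    parents: list = field(default_factory=list)      # relations to other concepts
    atlas_masks: list = field(default_factory=list)  # atlas segmentations, if mapped

# Example node: linked to parent concepts and to an atlas segmentation that
# demonstrates visual features hard to capture in text alone.
left_kidney = AnatomyConcept(
    name="left kidney",
    definition="The left of the paired retroperitoneal organs that filter "
               "blood and produce urine.",
    parents=["kidney", "abdomen"],
    atlas_masks=["atlas/ct_abdomen_0001/left_kidney.nii.gz"],  # hypothetical path
)
```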

Segmentation Datasets

To equip our universal segmentation model to handle segmentation targets across various modalities and anatomical regions, we collect and integrate 72 diverse publicly available medical segmentation datasets, totaling 22,186 CT and MRI scans with 302,033 segmentation annotations, covering 497 anatomical structures and lesions spanning 7 regions of the human body: Brain, Head and Neck, Upper Limb, Thorax, Abdomen, Pelvis, and Lower Limb. We term this dataset collection SAT-DS. The detailed composition of the dataset is presented in the paper.
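As a rough illustration of what the standardization buys, a unified sample might look like the following; the keys and file layout are assumptions for exposition, not the released SAT-DS format.

```python
# A hypothetical standardized SAT-DS sample after unifying image and label
# space across the 72 source datasets. Keys and paths are illustrative only.
sample = {
    "image": "dataset_xx/case_0001/image.nii.gz",  # 3D CT or MRI scan
    "modality": "CT",
    "region": "Abdomen",
    # Each annotated target carries its standardized terminology, so the same
    # anatomical structure shares one label name across all source datasets.
    "labels": {
        "liver": "dataset_xx/case_0001/liver.nii.gz",
        "left kidney": "dataset_xx/case_0001/left_kidney.nii.gz",
    },
}
```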



Method

To build our text-prompted universal segmentation model, SAT, we consider two main stages: multimodal knowledge injection (a) and universal segmentation training (b). In the first stage, we pair the data in the constructed knowledge tree into text-text or text-atlas-segmentation pairs and use them for visual-language pre-training, injecting rich multimodal medical domain knowledge into the visual and text encoders. In the second stage, we build a universal segmentation model prompted by text: the pretrained text encoder generates an embedding for any anatomical terminology, which serves as the text prompt for segmentation.
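The toy PyTorch sketch below illustrates the core idea of stage (b): terminology embeddings from the pretrained text encoder act as class queries against per-voxel visual features. The stub encoders, shapes, and dot-product prompting scheme are our simplifications for exposition, not the paper's exact architecture.

```python
# Minimal sketch of text-prompted segmentation at inference time.
# Encoders are stand-in stubs; the real model differs in detail.
import torch
import torch.nn as nn

DIM = 32  # toy feature dimension

class StubTextEncoder(nn.Module):
    """Stands in for the knowledge-injected text encoder of stage (a)."""
    def forward(self, terminologies):
        # One embedding per terminology; a real encoder would tokenize the text.
        return torch.randn(len(terminologies), DIM)

class StubVisualEncoder(nn.Module):
    """Stands in for a 3D backbone producing per-voxel features."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv3d(1, DIM, kernel_size=3, padding=1)
    def forward(self, volume):
        return self.conv(volume)  # (B, DIM, D, H, W)

class TextPromptedSegmenter(nn.Module):
    def __init__(self, text_encoder, visual_encoder):
        super().__init__()
        self.text_encoder = text_encoder
        self.visual_encoder = visual_encoder
        self.proj = nn.Linear(DIM, DIM)  # align prompt and visual spaces

    def forward(self, volume, terminologies):
        feats = self.visual_encoder(volume)          # (B, DIM, D, H, W)
        prompts = self.text_encoder(terminologies)   # (N, DIM)
        queries = self.proj(prompts)
        # One logit map per requested class: dot product between each prompt
        # embedding and every voxel feature.
        return torch.einsum("nc,bcdhw->bndhw", queries, feats)

model = TextPromptedSegmenter(StubTextEncoder(), StubVisualEncoder())
logits = model(torch.randn(1, 1, 16, 16, 16), ["liver", "left kidney"])
print(logits.shape)  # torch.Size([1, 2, 16, 16, 16])
```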



Results

A Generalist Model is Worth 72 Specialist Models

In the internal validation, we evaluate SAT-Nano and SAT-Pro on all 72 segmentation datasets in the collection. As there is no established benchmark for evaluating universal segmentation models, we randomly split each dataset into 80% for training and 20% for testing. We first take nnU-Net, U-Mamba, and SwinUNETR as representative specialist methods and strong baselines, training one model per dataset, which yields 72 nnU-Net/U-Mamba/SwinUNETR models, each with a configuration optimized for its dataset. Overall, SAT-Pro shows performance comparable to the combination of 72 nnU-Net models, with only 1/5 of their total model size.
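All comparisons are reported in Dice Similarity Coefficient (DSC); for reference, a standard per-class implementation on binary masks is:

```python
# Dice Similarity Coefficient: DSC = 2|P ∩ G| / (|P| + |G|).
import numpy as np

def dice(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
    """DSC between a predicted and a ground-truth binary 3D mask."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + eps)
```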

Text-prompted Segmentation Can Be User-Friendly

Driven by text prompts, SAT outlines a novel paradigm for segmentation foundation models, as opposed to previous interactive approaches that rely on spatial prompts. This could save tremendous manual effort on prompting in clinical applications. On performance, SAT-Pro consistently outperforms the state-of-the-art interactive model MedSAM across 7 human body regions, while remaining robust on targets with ambiguous spatial relationships.
For more experimental results and interesting findings, please refer to our paper.

Acknowledgements

Based on a template by Phillip Isola and Richard Zhang.