One Model to Rule them All: Towards Universal Segmentation for Medical Images with Text Prompts


Ziheng Zhao1,2
Yao Zhang1,2
Chaoyi Wu1,2
Xiaoman Zhang1,2

Ya Zhang1,2
Yanfeng Wang1,2
Weidi Xie1,2

1CMIC, Shanghai Jiao Tong University
2Shanghai AI Laboratory

Code [GitHub]

Paper [arXiv]

Cite [BibTeX]


Abstract

In this study, we focus on building a model that can Segment Anything in medical scenarios, driven by Text prompts, termed SAT. Our main contributions are threefold: (i) on data construction, we combine multiple knowledge sources to construct a multimodal medical knowledge tree; we then build a large-scale segmentation dataset for training by collecting over 11K 3D medical image scans from 31 segmentation datasets, with careful standardization of both the visual scans and the label space; (ii) on model training, we formulate a universal segmentation model that can be prompted with medical terminology in text form. We present a knowledge-enhanced representation learning framework and a series of strategies for training effectively on a combination of many datasets; (iii) on model evaluation, we train SAT-Nano, with only 107M parameters, to segment all 31 datasets with text prompts, covering 362 categories. We thoroughly evaluate the model from three aspects: averaged by body region, averaged by class, and averaged by dataset, demonstrating performance comparable to 36 specialist nnU-Nets, i.e., nnU-Net models trained on each dataset/subset, totaling around 1000M parameters for the 31 datasets. We will release the code and models of SAT-Nano. Moreover, we will offer SAT-Ultra in the near future, trained at a larger model size on more diverse datasets.



Datasets

Domain Knowledge

As shown in the following figure, we construct a knowledge tree based on multiple medical knowledge sources, including e-Anatomy, UMLS, and abundant segmentation datasets. It encompasses thousands of anatomical concepts throughout the human body. These concepts are linked via relations and further extended with definitions that describe their characteristics. Additionally, some are mapped to segmentations or grounding locations on atlas images, demonstrating visual features that can hardly be described by text alone.
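To make this concrete, below is a minimal sketch of how one node of such a knowledge tree might be represented. The schema and field names (name, definition, relations, atlas_mask) are illustrative assumptions, not the paper's actual data format.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ConceptNode:
    """One anatomical concept in the multimodal knowledge tree (illustrative schema)."""
    name: str                         # canonical terminology, e.g. "left kidney"
    definition: str                   # textual definition describing its characteristics
    relations: dict = field(default_factory=dict)   # links to other concepts
    atlas_mask: Optional[str] = None  # optional segmentation on an atlas image

# A toy node; sources such as e-Anatomy and UMLS would populate thousands of these.
kidney = ConceptNode(
    name="left kidney",
    definition="A bean-shaped retroperitoneal organ that filters blood and produces urine.",
    relations={"part_of": "urinary system", "located_in": "abdomen"},
    atlas_mask="atlas/abdomen_ct/left_kidney.nii.gz",  # hypothetical path
)
```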

Segmentation Datasets

To equip our universal segmentation model with the ability to handle segmentation targets across various modalities and anatomical regions, we collect and integrate 31 diverse, publicly available medical segmentation datasets, totaling 11,462 CT and MRI scans and 142,254 segmentation annotations, covering 362 anatomical structures and lesions spanning 8 regions of the human body: Brain, Head and Neck, Upper Limb, Thorax, Spine, Abdomen, Pelvis, and Lower Limb. The dataset collection is termed SAT-DS and listed in the following table. More details are presented in the paper.
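As a hypothetical illustration of the label-space standardization mentioned above, the sketch below remaps dataset-specific class indices to a shared terminology, so that the same anatomical structure receives one unified name across datasets; the dataset names and index mappings here are invented for the example.

```python
# Hypothetical label-space standardization: each dataset's local class indices
# are remapped to unified terminology, so differently named or differently
# indexed labels across datasets collapse into one shared class.
LABEL_MAP = {
    "dataset_a": {1: "spleen", 2: "right kidney", 3: "left kidney"},
    "dataset_b": {1: "spleen", 2: "left kidney", 4: "right kidney"},
}

def standardize(dataset_name: str, local_index: int) -> str:
    """Translate a dataset-specific label index to the shared terminology."""
    return LABEL_MAP[dataset_name][local_index]

# The same structure gets one name regardless of its local index.
assert standardize("dataset_a", 3) == standardize("dataset_b", 2) == "left kidney"
```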



Method

To build our universal segmentation model prompted by text, i.e., SAT, we consider two main stages: multimodal knowledge injection (a) and universal segmentation training (b). In the first stage, we pair the data in the constructed knowledge tree into text-text or text-atlas segmentation pairs and use them for visual-language pre-training, injecting rich multimodal medical domain knowledge into the visual and text encoders. In the second stage, we build a universal segmentation model prompted by text: the pre-trained text encoder generates a neural embedding for any anatomical terminology, which serves as the text prompt for segmentation and provides guidance from the injected knowledge.
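A minimal PyTorch sketch of this text-prompt mechanism is shown below, assuming a per-voxel feature backbone and a simple dot-product mask head; the module interfaces are illustrative assumptions, not the exact SAT architecture.

```python
import torch
import torch.nn as nn

class TextPromptedSegmenter(nn.Module):
    """Illustrative sketch: a knowledge-injected text encoder turns a
    terminology string into a query embedding that prompts the mask head."""

    def __init__(self, text_encoder, visual_backbone, dim=256):
        super().__init__()
        self.text_encoder = text_encoder        # pre-trained in the knowledge-injection stage
        self.visual_backbone = visual_backbone  # e.g. a 3D U-Net producing per-voxel features
        self.proj = nn.Linear(dim, dim)         # projects the prompt into the feature space

    def forward(self, image, terminology: str):
        feats = self.visual_backbone(image)                 # (B, C, D, H, W) voxel features
        query = self.proj(self.text_encoder(terminology))   # (C,) text-prompt embedding
        # Dot product between the prompt and every voxel feature yields mask logits.
        logits = torch.einsum("bcdhw,c->bdhw", feats, query)
        return logits.sigmoid()                             # per-voxel probability of the target
```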



Results

We evaluate SAT-Nano on all 31 segmentation datasets in the collection. As there is no existing benchmark for evaluating universal segmentation models, we randomly split each dataset into 80% for training and 20% for testing. We take nnU-Net as a strong baseline and train one nnU-Net model on each dataset/subset, resulting in 36 nnU-Net models, each customized to its own data.
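The three aggregation levels reported below (R1 to R3) reduce the same per-case scores in different ways; the following sketch, using a hypothetical record format, shows how region-, class-, and dataset-wise averages can be computed from per-class Dice scores.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical result records: one Dice score per (dataset, region, class) pair.
results = [
    {"dataset": "dataset_a", "region": "Abdomen", "cls": "liver",  "dice": 0.95},
    {"dataset": "dataset_a", "region": "Abdomen", "cls": "spleen", "dice": 0.93},
    {"dataset": "dataset_b", "region": "Brain",   "cls": "tumor",  "dice": 0.85},
]

def average_by(records, key):
    """Group the records by the given key and average Dice within each group."""
    groups = defaultdict(list)
    for r in records:
        groups[r[key]].append(r["dice"])
    return {k: mean(v) for k, v in groups.items()}

print(average_by(results, "region"))   # region-wise evaluation (R1)
print(average_by(results, "cls"))      # class-wise evaluation (R2)
print(average_by(results, "dataset"))  # dataset-wise evaluation (R3)
```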

R1: Region-Wise Evaluation

Comparison with the 36 nnU-Net models, aggregated by body region, along with a comparison of model size.

R2: Class-Wise Evaluation

Comparison with the 36 nnU-Net models, aggregated by class.

R3: Dataset-Wise Evaluation

Comparison with the 36 nnU-Net models, aggregated by dataset.

Acknowledgements

Based on a template by Phillip Isola and Richard Zhang.