STPNet: Scale-aware Text Prompt Network for Medical Image Segmentation

In IEEE Transactions on Image Processing (a publication of the IEEE Signal Processing Society) by Dandan Shan, Zihan Li, Yunxiang Li, Qingde Li, Jie Tian, Qingqi Hong

TLDR

  • STPNet is a novel vision-language approach that leverages multi-scale textual descriptions to enhance medical image segmentation, outperforming state-of-the-art methods and offering potential benefits for medical image analysis.

Abstract

Accurate segmentation of lesions plays a critical role in medical image analysis and diagnosis. Traditional segmentation approaches that rely solely on visual features often struggle with the inherent uncertainty in lesion distribution and size. To address these issues, we propose STPNet, a Scale-aware Text Prompt Network that leverages vision-language modeling to enhance medical image segmentation. Our approach utilizes multi-scale textual descriptions to guide lesion localization and employs retrieval-segmentation joint learning to bridge the semantic gap between visual and linguistic modalities. Crucially, STPNet retrieves relevant textual information from a specialized medical text repository during training, eliminating the need for text input during inference while retaining the benefits of cross-modal learning. We evaluate STPNet on three datasets: COVID-Xray, COVID-CT, and Kvasir-SEG. Experimental results show that our vision-language approach outperforms state-of-the-art segmentation methods, demonstrating the effectiveness of incorporating textual semantic knowledge into medical image analysis. The code is publicly available at https://github.com/HUANGLIZI/STPNet.

Overview

  • The study proposes STPNet, a Scale-aware Text Prompt Network that leverages vision-language modeling to enhance medical image segmentation.
  • STPNet utilizes multi-scale textual descriptions to guide lesion localization and employs retrieval-segmentation joint learning to bridge the semantic gap between visual and linguistic modalities.
  • The primary objective is to develop a vision-language approach that can outperform traditional segmentation methods and eliminate the need for text input during inference.
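The retrieval-based prompting idea above can be sketched as follows: during training, an image feature queries a stored bank of text embeddings for the most similar description, and the retrieved embedding is fused with the visual feature before segmentation; at inference, no external text input is needed because the repository lives inside the model. This is a minimal illustrative sketch only; the function names, feature shapes, and cosine-similarity retrieval are assumptions for demonstration, not STPNet's actual implementation.

```python
import numpy as np

def retrieve_text_embedding(image_feat, text_bank):
    """Return the repository text embedding most similar to the image feature.

    Uses cosine similarity between the image feature and every stored
    text embedding (rows of text_bank). Hypothetical helper, not STPNet's API.
    """
    sims = text_bank @ image_feat / (
        np.linalg.norm(text_bank, axis=1) * np.linalg.norm(image_feat) + 1e-8
    )
    return text_bank[np.argmax(sims)]

def fuse(image_feat, text_feat):
    """Toy cross-modal fusion: concatenate visual and retrieved text features."""
    return np.concatenate([image_feat, text_feat])

rng = np.random.default_rng(0)
text_bank = rng.standard_normal((8, 16))  # 8 stored multi-scale descriptions
image_feat = rng.standard_normal(16)      # feature from the image encoder

text_feat = retrieve_text_embedding(image_feat, text_bank)
fused = fuse(image_feat, text_feat)       # would feed the segmentation decoder
print(fused.shape)  # (32,)
```

In the real model the retrieval and fusion would operate at multiple encoder scales (matching the "multi-scale textual descriptions" in the paper), but the key point this sketch captures is that text enters only via an internal lookup, so inference needs no text input.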

Comparative Analysis & Findings

  • Experimental results show that STPNet outperforms state-of-the-art segmentation methods on three datasets: COVID-Xray, COVID-CT, and Kvasir-SEG.
  • The study highlights the effectiveness of incorporating textual semantic knowledge into medical image analysis.
  • The evaluation demonstrates that STPNet's vision-language approach can produce accurate lesion segmentation results.

Implications and Future Directions

  • The proposed method has significant implications for improving medical image analysis and diagnosis, particularly in applications where lesion distribution and size uncertainties are inherent.
  • Future studies can build upon this work to explore novel vision-language approaches for other medical image analysis tasks, such as segmentation of other organs or tissues.
  • The publicly available code can facilitate further research and development of advanced medical image analysis techniques.