Debesh Jha, Nikhil Kumar Tomar, Vanshali Sharma, Quoc-Huy Trinh, Koushik Biswas, Hongyi Pan, Ritika K. Jha, Gorkem Durak, Alexander Hann, Jonas Varkey, Hang Viet Dao, Long Van Dao, Binh Phuc Nguyen, Khanh Cong Pham, Quang Trung Tran, Nikolaos Papachrysos, Brandon Rieders, Peter Thelin Schmidt, Enrik Geissler, Tyler Berzin, Pål Halvorsen, Michael A. Riegler, Thomas de Lange, Ulas Bagci

D. Jha, N. Tomar, V. Sharma, Q.H. Trinh, K. Biswas, H. Pan, R. K. Jha, G. Durak, U. Bagci are with Machine & Hybrid Intelligence Lab, Department of Radiology, Northwestern University, Chicago, USA. (corresponding email: debesh.jha@northwestern.edu)
T. de Lange, Jonas Varkey, and Nikolaos Papachrysos are with Region Västra Götaland, Sahlgrenska University Hospital & The University of Gothenburg, Gothenburg, Sweden.
T. Berzin and E. Geissler are with Harvard University.
A. Hann is with University Hospital Würzburg, Germany.
Hang Viet Dao and Long Van Dao are with Internal Medicine Faculty, Hanoi Medical University, Hanoi, Vietnam, Endoscopic Centre, Hanoi Medical University hospital, Hanoi, Vietnam & The Institute of Gastroenterology and Hepatology, Hanoi, Vietnam.
Binh Phuc Nguyen is with The Institute of Gastroenterology and Hepatology, Hanoi, Vietnam.
Khanh Cong Pham is with Department of Endoscopy, Ho Chi Minh University Medical Center, Ho Chi Minh, Vietnam.
Brandon Rieders is with Valley Stream, New York, United States.
Peter Thelin Schmidt is with Karolinska University Hospital, Sweden.
Abstract
Colonoscopy is the primary method for the examination, detection, and removal of polyps. Regular screening helps detect and prevent colorectal cancer at an early, curable stage. However, challenges such as variation in endoscopists’ skills, the quality of bowel preparation, and the complex nature of the large intestine lead to a high polyp miss-rate. These missed polyps can develop into cancer later on, which underscores the importance of improving detection methods. A computer-aided diagnosis system can support physicians by helping to detect overlooked polyps. However, one of the important challenges in developing novel deep learning models for automatic polyp detection and segmentation is the lack of publicly available, large, diverse, multi-center datasets. To address this gap, we introduce PolypDB, a large-scale publicly available dataset that contains 3934 still polyp images and their corresponding ground truth from real colonoscopy videos, intended for the design of efficient polyp detection and segmentation architectures. The dataset has been developed and verified by a team of 10 gastroenterologists. PolypDB comprises images from five modalities, Blue Light Imaging (BLI), Flexible Imaging Color Enhancement (FICE), Linked Color Imaging (LCI), Narrow Band Imaging (NBI), and White Light Imaging (WLI), and from three medical centers in Norway, Sweden, and Vietnam. We therefore split the dataset by modality and by medical center for modality-wise and center-wise analysis. We provide a benchmark on each modality using eight popular segmentation methods and six standard polyp detection methods. Furthermore, we provide a center-wise benchmark under federated learning settings. Our dataset is public and can be downloaded at https://osf.io/pr7ms/. More information about the dataset, the segmentation, detection, and federated learning benchmarks, and the train-test split can be found at https://github.com/DebeshJha/PolypDB.
I Introduction
Colorectal cancer (CRC) has the third highest incidence of all cancers and is the second most common cause of cancer-related death worldwide. In 2020, approximately 1.9 million new cases of CRC were detected, causing approximately 935,000 deaths[1]. The relative five-year survival rate for persons younger than 64 years is 68.8%[2]. Colonoscopy is the gold standard for detecting CRC and for removing precancerous lesions such as polyps and very early CRCs. However, colonoscopy is an operator-dependent procedure, which causes substantial variation in polyp detection[3]. Smaller colon polyps, whether diminutive (≤5 mm) or small (6 to 9 mm), are often missed by endoscopists. The adenoma miss-rate is reported to be 20%–24%[4], and some missed polyps later develop into CRC, called post-colonoscopy CRC or interval cancer[5]. Computer-aided detection (CADe) systems for polyp detection have been commercially available for a couple of years and have been shown to increase the adenoma detection rate (ADR). However, these systems only mark polyps with a bounding box; they do not help the endoscopist delineate the polyp or confirm its complete resection, which is essential to avoid recurrence and potential post-colonoscopy CRC.
Dataset | Findings | Size | Availability |
Kvasir-SEG[6] | Polyps | 1000 images† | open academic |
HyperKvasir[7] | GI findings and polyps | 110,079 images and 374 videos | open academic |
Kvasir-Capsule[8] | GI findings and polyps⋄ | 4,741,504 images | open academic |
CVC-VideoClinicDB[9] | Polyps | 11,954 images† | by request∙ |
ASU-Mayo polyp database[10] | Polyps | 18,781 images† | by request∙ |
BKAI-IGH[11] | Polyps | 1000 images† | open academic |
PolypGen[12] | Polyps | 1531 images† and 2000 video frames | open academic |
PolypDB (Ours) | Polyps | 3934 polyp images from 3 centers | open academic |
† Contains ground truth segmentation masks. ⋄ Video capsule endoscopy. ∙ Not available anymore.
From this perspective, an exact delineation of the polyp would be very helpful. Accurate polyp segmentation is, however, a challenging task, for four reasons. (i) Polyps change their characteristics over time as they develop. (ii) The shape, size, color, and appearance of polyps may be very similar to the surrounding mucosa. In some cases, the camouflage is strong enough to trick endoscopists, and even state-of-the-art (SOTA) deep learning algorithms produce false positives on such examples. (iii) The imaging device introduces artifacts such as blurriness, flares, and poor lighting conditions, for example, objects too close to the camera, under- or over-lit scenes, low resolution of capsule endoscopes, overexposure, reflection from bright spots, and low-contrast areas. All of these can affect the colonoscopy procedure and limit accurate polyp segmentation. (iv) The presence of surgical instruments and intestinal residue can also hinder accurate polyp segmentation[13].
It is important to improve the diagnostic performance of the colonoscopy procedure. Bridging the gap between expert and non-expert endoscopists in detecting and diagnosing colon polyps is one of the significant challenges in colonoscopy[14, 15]. An automatic polyp segmentation algorithm can highlight the potential presence of polyps with pixel-level accuracy and could help endoscopists. Most existing methods perform reasonably well on large adenomas (), which are easy to segment, while overlooking small, diminutive, and even large flat serrated lesions, the main cause of right-sided post-colonoscopy colorectal cancer[16]. These types of polyps are challenging to detect even for experienced endoscopists[17]. Moreover, the performance of these methods is typically reported on a single-center dataset. Although they might obtain high performance on that particular test set, they suffer from large performance gaps when tested on out-of-sample datasets, or on datasets collected from different cohort populations and hospitals and captured using different types of scopes and modalities, leading to generalization failure. Training algorithms on multi-center datasets can improve the generalizability and robustness of the network.
The main motivation of our work is to develop and publicly release a large-scale multi-center polyp segmentation and detection dataset that supports the development of advanced computer-aided diagnosis (CAD) systems that are robust, generalizable, and suitable for integration into clinical settings. Our dataset consists of a diverse set of images, representative of different populations, together with annotations that are useful for the performance evaluation and comparison of different deep learning (DL) based algorithms. The multi-center dataset draws on a variety of sources, imaging modalities (blue laser imaging (BLI), flexible spectral imaging color enhancement (FICE), white light imaging (WLI), linked color imaging (LCI), and narrow band imaging (NBI)), populations (Norway, Vietnam, Sweden), acquisition systems (Fujinon, Olympus), and imaging conditions, and was captured by multi-national experts, which makes it well suited to early polyp diagnosis. Furthermore, we exploit this multi-center dataset to develop novel techniques for polyp detection and segmentation. The main contributions of this work are as follows:
1. PolypDB: We develop and publicly release a multi-center, multi-modality polyp segmentation and detection dataset that consists of 3934 polyp images with pixel-precise ground truth and bounding box annotations, collected from medical centers in Norway, Sweden, and Vietnam.
2. First-ever open-access multi-modality dataset: PolypDB covers five distinct modalities, namely BLI, FICE, LCI, NBI, and WLI. This is the first open-access dataset to feature five distinct modalities along with gastroenterologist-verified ground truth.
3. Baseline benchmark: We evaluated PolypDB on each modality using eight segmentation methods, six object detection methods, and six federated learning approaches, establishing a robust baseline benchmark.
II PolypDB dataset details
II-A Study design
The PolypDB dataset is a collection of colonoscopy examination images from three medical centers in Norway, Sweden, and Vietnam. Figure 3 presents example images from BLI, FICE, LCI, NBI, and WLI along with their corresponding bounding box ground truth and color-coded segmentation masks. The main motivation behind the development of the dataset is the pressing need for early detection and diagnosis of CRC precursors in order to reduce CRC incidence. Although some publicly available datasets already exist (please refer to Table I), there is no modality-wise dataset to date, and open-access multi-center datasets remain limited. Therefore, a new multi-modality and multi-center dataset is highly relevant. This diversity is crucial because of regional and demographic disparities in CRC incidence rates. Collecting multi-center data increases diversity and broadens population representation. Additionally, a multi-center dataset allows for the inclusion of different types of equipment and imaging protocols, which can improve the robustness and generalizability of CAD systems and lead to better patient outcomes.
II-B Dataset acquisition: Inclusion and exclusion criteria of the images
II-B1 Inclusion criteria
The inclusion criteria of the polyp frames are as follows:
- Images with polyp(s) in WLI mode and FICE mode.
- Images with a minimum resolution of pixels.
- Polyp’s boundary must be clear and well-defined.
- Boston Bowel Preparation Score (BBPS) .
- Images captured in magnification mode.
- Images with poor quality.
II-B2 Exclusion criteria
The exclusion criteria for the polyp frames are as follows:
- Already resected (removed) polyps, where the dead polyp and resection site are visible.
- Polyps resected and transported in a net.
- Non-colon polyps (e.g., stomach or small bowel).
- Polyps injected with blue dye and with a snare around the polyp neck.
- Resection sites covered in blood, where residual polyps are unclear.
- Images where it is unclear whether the finding is a polyp or stool remnants.
- Normal anatomical structure.
- Images in magnification mode.
- Images with poor quality.
- Blurry or shaky images.
- Images that are too dark or contain flare.
- Images with much liquid (faeces, blood) or mucus.
- Images with already resected polyps or resection sites.
- Images containing endoscopic tools such as caps, injection needles, snares, biopsy forceps, and clips.
II-C Dataset collection and construction
II-C1 Baerum Hospital, Vestre Viken Hospital Trust, Bærum, Norway (Center 1)
The polyp images were collected and verified by experienced gastroenterologists from Vestre Viken Health Trust in Norway. Some images were collected from the unlabelled class of the HyperKvasir dataset[7]. There are 99,417 endoscopic frames in the HyperKvasir dataset. We identified 3000 WLI polyp frames, labelled them, and sent them to our gastroenterologists. Of these 3000 images, only 2588 were incorporated into our dataset; the others were excluded based on the exclusion criteria.
Additionally, we selected 136 NBI images from the unlabelled class of HyperKvasir. We curated the ground truth for both the WLI and NBI images, which was verified by a team of expert gastroenterologists. By labelling these data, we make use of unlabelled frames that had never been explored for the development of new tools.
II-C2 Karolinska University Hospital, Stockholm, Sweden (Center 2)
The images were collected and verified by an experienced gastroenterologist (10+ years of experience) from Karolinska University Hospital in Sweden. Although we received images of the entire GI tract from this center, the number of polyp images was relatively limited. After applying the exclusion criteria, we selected only 30 WLI polyp images and 10 NBI polyp images from Karolinska. All of these images were fully anonymized in accordance with GDPR requirements.
II-C3 Hanoi Medical University (HMU) & Institute of Gastroenterology and Hepatology (IGH), Hanoi, Vietnam (Center 3)
The BKAI dataset consists of 1200 endoscopic images with polyps in four light modes: WLI, LCI, BLI, and FICE. The data acquisition procedure is identical at both institutions, and they examine the same cohort population. Because both are also located in the same city, we treat them as a single center in this study. Of the 1200 images, 600 were obtained from HMU and the other 600 from IGH. Specifically, BKAI consists of 1000 WLI polyp images, 60 LCI, 70 FICE, and 70 BLI images. These images were labelled and annotated by three expert endoscopists in Vietnam, each with more than 10 years of experience. We provide both bounding box information and pixel-precise annotations for all of these images so that the dataset can be used for both object detection and segmentation tasks. We also organized the dataset center-wise and modality-wise to facilitate research toward specific objectives in multiple directions.
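To illustrate how a bounding box annotation relates to a pixel-precise mask, the following minimal sketch derives the tightest axis-aligned box from a binary mask. It is an assumption-laden illustration only: the source does not state how the released boxes were produced, the function name `mask_to_box` is hypothetical, and the sketch draws a single box around all foreground pixels even when a frame contains several polyps.

```python
from typing import List, Optional, Tuple

# (x_min, y_min, x_max, y_max) in inclusive pixel indices (an assumed convention).
Box = Tuple[int, int, int, int]


def mask_to_box(mask: List[List[int]]) -> Optional[Box]:
    """Return the tightest box around all non-zero pixels, or None for an empty mask."""
    # Row indices that contain at least one foreground pixel.
    rows = [y for y, row in enumerate(mask) if any(row)]
    # Column indices of every foreground pixel.
    cols = [x for row in mask for x, value in enumerate(row) if value]
    if not rows:
        return None
    return (min(cols), min(rows), max(cols), max(rows))


if __name__ == "__main__":
    toy_mask = [
        [0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 0, 0],
        [0, 0, 0, 0],
    ]
    print(mask_to_box(toy_mask))
```

For frames with multiple polyps, a per-polyp box would require a connected-component step before this function is applied.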
II-D Annotation strategies and quality assurance
A team of 10 gastroenterologists (most of them with over 10 years of experience in colonoscopy) and one experienced senior research associate were involved in data annotation, sorting, and review of annotation quality. The annotations were performed by the senior research associate, who has extensive experience in data curation and development, using an online annotation tool called Labelbox (https://labelbox.com/). All images were uploaded to Labelbox, each frame was labelled by marking the region of interest (the area covered by the polyp), and the ground truth for each sample was created. Each annotation was cross-verified by at least two senior gastroenterologists. Furthermore, we assigned an independent reviewer (a senior gastroenterologist) to review all 3934 images. All of the images were annotated by one researcher using a Wacom Cintiq tablet to minimize heterogeneity in the manual delineation process.
During the review process, the gastroenterologists marked whether each frame showed colon polyps and should be included in the dataset. They then checked whether the annotation for each polyp in a frame was “correct” and clinically acceptable. Finally, the non-polyp images were removed, and incorrect annotations were adjusted. For the modality-wise organization, we provide “images” and “corresponding groundtruth masks” in the segmentation folder and “images” and “corresponding bounding box information” in the detection folder for each modality. Each image and its corresponding ground truth share the same filename.
For the center-wise data organization, we divide the dataset into three centers: Simula, Karolinska, and BKAI. Each center has images, segmentation ground truth, and bounding box information useful for segmentation and polyp detection tasks. All the images are encoded using JPEG compression. Our datasets can be easily downloaded for research and educational purposes. We also encourage the use of our dataset for industrial applications. However, prior consent is required.
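Because each image and its ground truth share a filename, pairing them is a matter of matching names across folders. The sketch below shows one way to do this; the folder names in the example (`WLI/images`, `WLI/masks`) and the function `pair_by_stem` are hypothetical, and matching on the filename stem (ignoring the extension) is an assumption made so that masks stored in a different format would still pair.

```python
from pathlib import PurePosixPath
from typing import Iterable, List, Tuple


def pair_by_stem(image_paths: Iterable[str], mask_paths: Iterable[str]) -> List[Tuple[str, str]]:
    """Pair each image with the mask whose filename stem matches; unmatched images are skipped."""
    # Index masks by filename without directory or extension.
    masks_by_stem = {PurePosixPath(path).stem: path for path in mask_paths}
    pairs = []
    for image in sorted(image_paths):
        stem = PurePosixPath(image).stem
        if stem in masks_by_stem:
            pairs.append((image, masks_by_stem[stem]))
    return pairs


if __name__ == "__main__":
    images = ["WLI/images/b.jpg", "WLI/images/a.jpg"]  # hypothetical layout
    masks = ["WLI/masks/a.jpg"]
    print(pair_by_stem(images, masks))
```

Skipping unmatched images silently is a design choice for brevity; a data loader used for training would more likely raise an error so that missing annotations are noticed.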
Centers | System info. | Ethical approval | Patient consenting type |
Bærum Hospital, Vestre Viken Hospital Trust, Bærum, Norway | Olympus Evis Exera III, CF 190 | Exempted† | Written informed consent |
Karolinska University Hospital, Stockholm, Sweden | Olympus Evis Exera III, CF 190 | Not required‡ | Written informed consent |
Hanoi Medical University (HMU), Hanoi Vietnam | Fujinon system | Not required | Not required‡ |
Institute of Gastroenterology and Hepatology (IGH), Hanoi, Vietnam | Fujinon system | Not required‡ | Not required‡ |
† Approved by the data inspector. No further ethical approval was required as it did not interfere with patient treatment.
‡ Fully anonymized; no further ethical approval was required.
II-E Ethical and privacy aspects of the data
PolypDB was collected from three different medical centers. Each center handled the legal, ethical, and privacy aspects of its own data, completing all or two of the steps listed below before providing the data. Additionally, we believe that releasing these datasets will help technological development, for example, the development of robust CAD systems for polyps, and that the potential benefit is high compared to the potential risk. Therefore, we make this dataset public after carefully considering ethical and privacy issues. Table II summarizes the ethical and legal processes fulfilled by each center, along with the endoscopy equipment and recorders used for data collection.
1. Informed consent from the patient was obtained when required. Approval from the institution was always obtained. The consent process also covered the purpose of the study and how the data would be used.
2. Review and approval of the collected data by a data inspector, institutional review board, or local medical ethics committee, depending on the country’s regulations.
3. De-identification of the colonoscopy frames prior to release, following the laws and regulations on data privacy and protection in the respective country.
III Experiments and Results
III-A Dataset and implementation details
III-A1 Dataset
We conducted experiments in two different settings: (i) modality-wise and (ii) center-wise. For the modality-wise setting, we have 3558 WLI polyp images, 146 NBI images, 60 LCI images, 70 BLI images, and 70 FICE images. For the center-wise setting, we experiment only with WLI images because WLI is common to all three centers. Although there are 136 NBI polyp images in Center 1 and 10 in Center 2, we exclude NBI from the center-wise experiment because of the small number of images in both centers.
III-A2 Implementation details
All the methods are implemented using the PyTorch 1.9[18] framework and run on an NVIDIA GeForce RTX 3090 system. We used 80% of the dataset for training, 10% for validation, and the remaining 10% for testing. For polyp segmentation, we first resized all the images into pixels. Next, we applied minimal data augmentation, comprising random rotation, horizontal flipping, vertical flipping, and coarse dropout. A combination of binary cross-entropy loss and dice loss was selected as the loss function with the Adam optimizer, and a learning rate of was set. We trained all the models with the same set of hyperparameters for 200 epochs (empirically set) with a batch size of 12. Early stopping and ReduceLROnPlateau were used to prevent the models from overfitting.
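The combined binary cross-entropy and dice loss mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it is written in dependency-free Python over flattened per-pixel probabilities rather than PyTorch tensors, the two terms are summed with equal weight (the source does not give the weighting), and the smoothing constant and function names are hypothetical.

```python
import math
from typing import Sequence


def bce_loss(probs: Sequence[float], targets: Sequence[int], eps: float = 1e-7) -> float:
    """Mean binary cross-entropy over pixels; probabilities are clamped to avoid log(0)."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(probs)


def dice_loss(probs: Sequence[float], targets: Sequence[int], smooth: float = 1.0) -> float:
    """One minus the soft Dice coefficient; `smooth` (assumed value) avoids division by zero."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + smooth) / (denominator + smooth)


def bce_dice_loss(probs: Sequence[float], targets: Sequence[int]) -> float:
    """Equal-weight sum of the two terms (the weighting is an assumption)."""
    return bce_loss(probs, targets) + dice_loss(probs, targets)


if __name__ == "__main__":
    truth = [1, 0, 1]
    print(bce_dice_loss([0.9, 0.1, 0.8], truth))  # confident, mostly correct prediction
    print(bce_dice_loss([0.5, 0.5, 0.5], truth))  # uninformative prediction, larger loss
```

The cross-entropy term penalizes each pixel independently, while the dice term measures region overlap and is less sensitive to the foreground and background imbalance that is typical of small polyps.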
For polyp detection, we employed different hyperparameters tailored to optimize the performance of the detection algorithms. We first resized all the images to and used a simple data augmentation strategy, which includes random flipping, random rotation, random blur, mixup, mosaic, and cutmix. All YOLO models were trained using a uniform set of hyperparameters, with the AdamW optimizer applied at a learning rate of and a batch size of .
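The 80%/10%/10% partition described in the implementation details could be reproduced along the following lines. This is only a sketch: the seed, the function name `split_80_10_10`, and the rule that rounding leftovers go to the test set are assumptions, and the split actually used for the benchmark is the one published in the project repository.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_80_10_10(items: Sequence[T], seed: int = 42) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle deterministically, then cut into train/validation/test partitions."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # local RNG so global random state is untouched
    n_train = len(shuffled) * 8 // 10      # integer arithmetic avoids float rounding
    n_val = len(shuffled) // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # any rounding remainder lands in the test set
    return train, val, test


if __name__ == "__main__":
    # Modality sizes as reported for the modality-wise setting.
    for name, count in [("WLI", 3558), ("NBI", 146), ("LCI", 60), ("BLI", 70), ("FICE", 70)]:
        train, val, test = split_80_10_10(range(count))
        print(name, len(train), len(val), len(test))
```

For the smaller modalities the validation and test partitions contain only a handful of images (for example, 7 each for a 70-image modality under this rule), which is worth keeping in mind when comparing scores across modalities.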
III-B Results
Method | Backbone | mIoU | mDSC | Recall | Precision | F2 |
Dataset: PolypDB (BLI) | ||||||
U-Net[19] | - | 0.1822 | 0.2855 | 0.6862 | 0.2180 | 0.3962 |
DeepLabV3+[20] | ResNet50[21] | 0.6055 | 0.7293 | 0.8462 | 0.7146 | 0.7751 |
PraNet[22] | Res2Net50[23] | 0.6581 | 0.7831 | 0.8876 | 0.7390 | 0.8348 |
CaraNet[24] | Res2Net101[23] | 0.5853 | 0.7237 | 0.6895 | 0.8052 | 0.6978 |
TGANet[25] | ResNet50[21] | 0.5217 | 0.6520 | 0.8108 | 0.6344 | 0.7076 |
PVT-CASCADE[26] | PVTv2-B2[27] | 0.6737 | 0.7873 | 0.8750 | 0.7748 | 0.8205 |
DuAT[28] | PVTv2-B2[27] | 0.6979 | 0.8048 | 0.9082 | 0.7647 | 0.8501 |
SSFormer-L[29] | MiT-PLD-B4 | 0.6750 | 0.7848 | 0.8436 | 0.7708 | 0.8091 |
Dataset: PolypDB (FICE) | ||||||
U-Net[19] | - | 0.1384 | 0.2021 | 0.5600 | 0.1425 | 0.2840 |
DeepLabV3+[20] | ResNet50[21] | 0.6129 | 0.6759 | 0.6653 | 0.9441 | 0.6668 |
PraNet[22] | Res2Net50[23] | 0.6013 | 0.6513 | 0.6559 | 0.7984 | 0.6530 |
CaraNet[24] | Res2Net101[23] | 0.5694 | 0.6286 | 0.6082 | 0.8135 | 0.6146 |
TGANet[25] | ResNet50[21] | 0.5922 | 0.6898 | 0.7086 | 0.7279 | 0.6960 |
PVT-CASCADE[26] | PVTv2-B2[27] | 0.7209 | 0.7799 | 0.8110 | 0.7588 | 0.7971 |
DuAT[28] | PVTv2-B2[27] | 0.5589 | 0.6746 | 0.9082 | 0.5867 | 0.7729 |
SSFormer-L[29] | MiT-PLD-B4 | 0.7607 | 0.8300 | 0.8713 | 0.8013 | 0.8526 |
Dataset: PolypDB (LCI) | ||||||
U-Net[19] | - | 0.3513 | 0.4712 | 0.5526 | 0.7644 | 0.4955 |
DeepLabV3+[20] | ResNet50[21] | 0.8066 | 0.8898 | 0.8694 | 0.9294 | 0.8758 |
PraNet[22] | Res2Net50[23] | 0.7936 | 0.8825 | 0.8890 | 0.8992 | 0.8834 |
CaraNet[24] | Res2Net101[23] | 0.7600 | 0.8576 | 0.8335 | 0.9190 | 0.8398 |
TGANet[25] | ResNet50[21] | 0.8358 | 0.9061 | 0.8816 | 0.9474 | 0.8899 |
PVT-CASCADE[26] | PVTv2-B2[27] | 0.8344 | 0.9065 | 0.9074 | 0.9205 | 0.9056 |
DuAT[28] | PVTv2-B2[27] | 0.8551 | 0.9194 | 0.9200 | 0.9247 | 0.9191 |
SSFormer-L[29] | MiT-PLD-B4 | 0.8567 | 0.9207 | 0.9057 | 0.9466 | 0.9106 |
Dataset: PolypDB (NBI) | ||||||
U-Net[19] | - | 0.2161 | 0.2986 | 0.6472 | 0.2622 | 0.3905 |
DeepLabV3+[20] | ResNet50[21] | 0.6881 | 0.7733 | 0.8279 | 0.8511 | 0.7939 |
PraNet[22] | Res2Net50[23] | 0.6749 | 0.7473 | 0.7816 | 0.8836 | 0.7618 |
CaraNet[24] | Res2Net101[23] | 0.7249 | 0.8090 | 0.8312 | 0.8781 | 0.8194 |
TGANet[25] | ResNet50[21] | 0.7317 | 0.8402 | 0.8368 | 0.8645 | 0.8354 |
PVT-CASCADE[26] | PVTv2-B2[27] | 0.7769 | 0.8586 | 0.9385 | 0.8320 | 0.8941 |
DuAT[28] | PVTv2-B2[27] | 0.7494 | 0.8260 | 0.8662 | 0.8741 | 0.8476 |
SSFormer-L[29] | MiT-PLD-B4 | 0.7608 | 0.8432 | 0.9089 | 0.8462 | 0.8664 |
Dataset: PolypDB (WLI) | ||||||
U-Net[19] | - | 0.7452 | 0.8250 | 0.8275 | 0.8936 | 0.8203 |
DeepLabV3+[20] | ResNet50[21] | 0.8650 | 0.9168 | 0.9183 | 0.9380 | 0.9157 |
PraNet[22] | Res2Net50[23] | 0.8570 | 0.9089 | 0.9046 | 0.9460 | 0.9042 |
CaraNet[24] | Res2Net101[23] | 0.8582 | 0.9128 | 0.9149 | 0.9322 | 0.9114 |
TGANet[25] | ResNet50[21] | 0.8536 | 0.9088 | 0.9165 | 0.9284 | 0.9104 |
PVT-CASCADE[26] | PVTv2-B2[27] | 0.8731 | 0.9219 | 0.9268 | 0.9372 | 0.9227 |
DuAT[28] | PVTv2-B2[27] | 0.8695 | 0.9197 | 0.9170 | 0.9437 | 0.9168 |
SSFormer-L[29] | MiT-PLD-B4 | 0.8821 | 0.9294 | 0.9314 | 0.9438 | 0.9288 |
Method | MAP50 | MAP50-95 | MAP75 | P | R |
Dataset: PolypDB (BLI) | |||||
YOLOv8[30] | 0.659 | 0.502 | 0.559 | 1 | 0.318 |
YOLOv10[31] | 0.534 | 0.416 | 0.485 | 0.84 | 0.5 |
YOLOv9[32] | 0.688 | 0.558 | 0.638 | 0.846 | 0.5 |
YOLOv7[33] | 0.398 | 0.321 | 0.362 | 0.818 | 0.409 |
YOLOv6[34] | 0.594 | 0.418 | 0.438 | ||
YOLOv5[35] | 0.618 | 0.499 | 0.534 | 0.899 | 0.404 |
Dataset: PolypDB (FICE) | |||||
YOLOv8[30] | 0.759 | 0.667 | 0.759 | 0.981 | 0.625 |
YOLOv10[31] | 0.887 | 0.752 | 0.875 | 1 | 0.853 |
YOLOv9[32] | 0.856 | 0.711 | 0.737 | 0.937 | 0.75 |
YOLOv7[33] | 0.734 | 0.642 | 0.734 | 0.856 | 0.75 |
YOLOv6[34] | 0.761 | 0.608 | 0.658 | ||
YOLOv5[35] | 0.781 | 0.674 | 0.781 | 0.901 | 0.625 |
Dataset: PolypDB (LCI) | |||||
YOLOv8[30] | 0.833 | 0.771 | 0.833 | 1 | 0.667 |
YOLOv10[31] | 0.995 | 0.831 | 0.995 | 1 | 0.854 |
YOLOv9[32] | 0.972 | 0.878 | 0.972 | 0.857 | 1 |
YOLOv7[33] | 0.754 | 0.581 | 0.754 | 0.833 | 0.833 |
YOLOv6[34] | 0.832 | 0.778 | 0.832 | ||
YOLOv5[35] | 0.833 | 0.687 | 0.833 | 1 | 0.667 |
Dataset: PolypDB (NBI) | |||||
YOLOv8[30] | 0.659 | 0.502 | 0.559 | 1 | 0.318 |
YOLOv10[31] | 0.534 | 0.416 | 0.485 | 0.84 | 0.5 |
YOLOv9[32] | 0.688 | 0.558 | 0.638 | 0.846 | 0.5 |
YOLOv7[33] | 0.398 | 0.321 | 0.362 | 0.818 | 0.409 |
YOLOv6[34] | 0.594 | 0.418 | 0.438 | ||
YOLOv5[35] | 0.618 | 0.499 | 0.534 | 0.899 | 0.404 |
Dataset: PolypDB (WLI) | |||||
YOLOv8[30] | 0.913 | 0.766 | 0.868 | 0.883 | 0.88 |
YOLOv10[31] | 0.555 | 0.391 | 0.434 | 0.603 | 0.525 |
YOLOv9[32] | 0.912 | 0.757 | 0.836 | 0.899 | 0.856 |
YOLOv7[33] | 0.902 | 0.71 | 0.807 | 0.925 | 0.872 |
YOLOv6[34] | 0.925 | 0.744 | 0.831 | ||
YOLOv5[35] | 0.916 | 0.766 | 0.852 | 0.918 | 0.872 |
To evaluate segmentation performance on the dataset, we employed several established segmentation methods: U-Net[19], DeepLabV3+[20], PraNet[22], CaraNet[24], TGANet[25], PVT-CASCADE[26], DuAT[28], and SSFormer-L[29]. To ensure a fair comparison, identical hyperparameters were applied across all models. To evaluate detection performance, we used established algorithms: YOLOv10[31], YOLOv9[32], YOLOv8[30], YOLOv7[33], YOLOv6[34], and YOLOv5[35]. Below, we present both segmentation and detection performance for the modality-wise and center-wise data.
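The segmentation columns reported in the tables (mIoU, mDSC, recall, precision, and F2) can all be computed from per-pixel true positives, false positives, and false negatives. The sketch below uses the standard definitions, with F2 = 5PR / (4P + R); the function names are hypothetical, and averaging the per-image scores to obtain the "mean" metrics, as well as returning 0 when a denominator is empty, are assumptions rather than details stated in the source.

```python
from typing import Dict, Iterable, Sequence, Tuple


def _ratio(numerator: float, denominator: float) -> float:
    """Safe division: an empty denominator yields 0.0 (assumed convention)."""
    return numerator / denominator if denominator else 0.0


def mask_metrics(pred: Sequence[int], truth: Sequence[int]) -> Dict[str, float]:
    """Per-image metrics for flattened binary masks (1 = polyp, 0 = background)."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)
    return {
        "iou": _ratio(tp, tp + fp + fn),
        "dice": _ratio(2 * tp, 2 * tp + fp + fn),
        "recall": recall,
        "precision": precision,
        # F-beta with beta = 2 weights recall more heavily than precision.
        "f2": _ratio(5 * precision * recall, 4 * precision + recall),
    }


def mean_metrics(pairs: Iterable[Tuple[Sequence[int], Sequence[int]]]) -> Dict[str, float]:
    """Average each metric over images (assumed meaning of the 'm' prefix)."""
    per_image = [mask_metrics(pred, truth) for pred, truth in pairs]
    return {key: sum(m[key] for m in per_image) / len(per_image) for key in per_image[0]}


if __name__ == "__main__":
    print(mask_metrics([1, 1, 0, 0], [1, 0, 1, 0]))
```

Because F2 weights recall above precision, a model that over-segments slightly can score higher on F2 than a more conservative model with the same Dice score, which matters when missed polyps are the costlier error.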
III-C Segmentation results on each modality of the dataset
III-C1 Results on BLI
In the BLI dataset, DuAT emerged as the top-performing model, achieving the highest mIoU of 0.6979 and mDSC of 0.8048. DuAT also excelled in recall with a leading score of 0.9082 and maintained a strong precision of 0.7647, resulting in the best F2 score of 0.8501. SSFormer-L followed with the second-highest mIoU of 0.6750, trailing by 2.29 percentage points. PVT-CASCADE and SSFormer-L were close to each other in mDSC, scoring 0.7873 and 0.7848, respectively. PraNet secured the second-best scores in recall (0.8876) and F2 (0.8348). Overall, DuAT demonstrated superior performance, with no other model consistently securing second place across multiple metrics.
III-C2 Results on FICE
SSFormer-L dominated the FICE modality, achieving the highest mIoU of 0.7607 and mDSC of 0.8300, along with an impressive F2 score of 0.8526. Its recall score of 0.8713 was the second-best, while its precision score of 0.8013 remained competitive. PVT-CASCADE also performed well, with an mIoU of 0.7209 and mDSC of 0.7799. DuAT excelled in recall, achieving the highest score of 0.9082 for this modality, but its lower precision score of 0.5867 impacted its overall performance. Although DeepLabV3+ achieved the highest precision score of 0.9441, it did not lead in other metrics.
III-C3 Results on LCI
For the LCI dataset, SSFormer-L once again led the performance metrics, achieving an mIoU of 0.8567 and an mDSC of 0.9207. It attained a high precision score of 0.9466 and an F2 score of 0.9106, making it the top choice for LCI segmentation. DuAT also performed exceptionally well, with an mIoU of 0.8551 and mDSC of 0.9194, leading in recall with a score of 0.9200 and delivering a strong precision score of 0.9247. PVT-CASCADE closely followed, showing balanced results across all metrics, particularly in recall (0.9074) and precision (0.9205). While TGANet achieved a marginally higher precision of 0.9474, its slightly lower recall and mIoU scores prevented it from outperforming SSFormer-L and DuAT.
III-C4 Results on NBI
In the NBI dataset, segmentation models exhibited varying performance levels. PVT-CASCADE, based on PVTv2-B2, demonstrated superior performance with an mIoU of 0.7769, mDSC of 0.8586, and recall of 0.9385, highlighting its efficacy in polyp identification. It also achieved the best F2 score of 0.8941. SSFormer-L followed with the second-best performance, achieving an mIoU of 0.7608 and mDSC of 0.8432, alongside a strong recall of 0.9089, which was 2.96 percentage points lower than that of PVT-CASCADE. PraNet secured the highest precision score at 0.8836. The DuAT model also delivered competitive results, notably in recall (0.8662) and precision (0.8741), although it did not surpass the overall performance of PVT-CASCADE.
III-C5 Results on WLI
The WLI modality results were highly competitive, with SSFormer-L standing out as the top performer, achieving the best mIoU of 0.8821 and mDSC of 0.9294. SSFormer-L also led in recall with a score of 0.9314 and secured the second-best precision score of 0.9438, resulting in an F2 score of 0.9288. PVT-CASCADE followed closely with an mIoU of 0.8731 and mDSC of 0.9219, demonstrating consistent performance with a recall of 0.9268 and precision of 0.9372. Although the performance gap between SSFormer-L and PVT-CASCADE was minimal, SSFormer-L’s slight edge in multiple metrics made it the best choice for WLI segmentation. The DuAT model also delivered strong results, with an mIoU of 0.8695 and mDSC of 0.9197, showcasing competitive recall and precision scores.
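The per-modality comparisons above can be summarized with a short script. The sketch below uses only the mIoU values quoted in the text of this section (not the full results table), so the "runner-up" it reports is the runner-up among the quoted values. Gaps are expressed in percentage points, i.e. the absolute difference multiplied by 100. The dictionary layout and the helper name `top_two` are illustrative assumptions.

```python
# mIoU values quoted in the modality-wise segmentation results above.
MIOU = {
    "BLI": {"DuAT": 0.6979, "SSFormer-L": 0.6750},
    "FICE": {"SSFormer-L": 0.7607, "PVT-CASCADE": 0.7209},
    "LCI": {"SSFormer-L": 0.8567, "DuAT": 0.8551},
    "NBI": {"PVT-CASCADE": 0.7769, "SSFormer-L": 0.7608},
    "WLI": {"SSFormer-L": 0.8821, "PVT-CASCADE": 0.8731, "DuAT": 0.8695},
}


def top_two(scores):
    """Return (best model, runner-up, gap in percentage points)."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (best, best_val), (second, second_val) = ranked[0], ranked[1]
    gap_pp = round((best_val - second_val) * 100, 2)
    return best, second, gap_pp


for modality, scores in MIOU.items():
    best, second, gap_pp = top_two(scores)
    print(f"{modality}: {best} leads {second} by {gap_pp} percentage points")
```

For BLI this yields a gap of 2.29 percentage points between DuAT and SSFormer-L, while for LCI the gap between SSFormer-L and DuAT is only 0.16 percentage points.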
III-D Detection results on each modality of the dataset
III-D1 Results on BLI
In the BLI dataset, YOLOv9 stood out with the best MAP50, MAP50-95, and MAP75 scores of 0.688, 0.558, and 0.638, respectively. YOLOv8 achieved a perfect precision of 1 but lagged behind in recall. YOLOv10 and YOLOv9 tied for the best recall of 0.5, meaning that even the strongest detectors missed half of the polyps in this modality.
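The detection scores in this section depend on how predicted boxes are matched to annotated boxes. The sketch below assumes the usual COCO-style convention, in which MAP50 and MAP75 use IoU thresholds of 0.50 and 0.75 and MAP50-95 averages over thresholds from 0.50 to 0.95 in steps of 0.05. It computes only precision and recall at a single threshold with greedy matching; a full average-precision computation additionally sweeps the confidence threshold. All names are illustrative and this is not the evaluation code used in this study.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(preds, truths, iou_thr=0.5):
    """Greedy one-to-one matching of (box, confidence) predictions to
    ground-truth boxes at one IoU threshold, highest confidence first."""
    matched = set()
    tp = 0
    for box, _score in sorted(preds, key=lambda p: p[1], reverse=True):
        best_iou, best_idx = 0.0, None
        for i, gt in enumerate(truths):
            if i in matched:
                continue
            iou = box_iou(box, gt)
            if iou > best_iou:
                best_iou, best_idx = iou, i
        if best_idx is not None and best_iou >= iou_thr:
            matched.add(best_idx)
            tp += 1
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(truths) if truths else 0.0
    return precision, recall


# IoU thresholds assumed for a MAP50-95 style average: 0.50, 0.55, ..., 0.95.
COCO_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]

truths = [(10, 10, 50, 50)]
preds = [((12, 12, 50, 50), 0.9), ((100, 100, 120, 120), 0.4)]
print(precision_recall(preds, truths, 0.50))  # (0.5, 1.0)
print(precision_recall(preds, truths, 0.95))  # (0.0, 0.0)
```

The same mechanics explain how a detector can reach a precision of 1 with a recall of 0.5: every box it predicts matches a polyp, but half of the annotated polyps receive no matching box.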
III-D2 Results on FICE
The FICE dataset results showed YOLOv10 outperforming other methods with the best MAP50, MAP50-95, and MAP75 scores of 0.887, 0.752, and 0.875, respectively. Additionally, YOLOv10 excelled in precision and recall, achieving scores of 1 and 0.853, respectively, making it the most robust model for this modality.
III-D3 Results on LCI
In the LCI dataset, YOLOv10 achieved the highest MAP50 and MAP75 scores of 0.995, while YOLOv9 achieved the best MAP50-95 score of 0.878. YOLOv10 also led in precision with a score of 1, and YOLOv9 achieved the best recall score of 1, highlighting its effectiveness in identifying true positive cases.
III-D4 Results on NBI
For the NBI dataset, YOLOv9 delivered the best results with a MAP50 score of 0.688, a MAP50-95 score of 0.558, and a MAP75 score of 0.638. Precision was led by YOLOv8 with a perfect score of 1, while recall was highest for YOLOv10 and YOLOv9, both at 0.5, which leaves considerable room for improvement in this modality.
III-D5 Results on WLI
For the WLI dataset, YOLOv6 achieved the best MAP50 score of 0.925, while YOLOv8 and YOLOv5 shared the best MAP50-95 score of 0.766. The best MAP75 score of 0.868 was achieved by YOLOv8, which also demonstrated strong precision with a score of 0.883 and the highest recall of 0.88, closely followed by YOLOv5 and YOLOv7, both of which exhibited robust overall performance.
IV Discussion
The quantitative results across the diverse datasets and modalities in PolypDB highlight the effectiveness of contemporary segmentation models, especially those utilizing advanced backbone architectures like PVTv2 and MiT-PLD. The variation in performance observed across modalities—NBI, WLI, BLI, FICE, and LCI—emphasizes the complex challenges inherent in polyp segmentation, where selecting the appropriate model architecture is crucial for attaining superior performance.
IV-A Impact of multi-modality and multi-center data
One of the key strengths of PolypDB is its inclusion of data from five distinct imaging modalities—BLI, FICE, LCI, NBI, and WLI—collected from three different medical centers across Norway, Sweden, and Vietnam. This diversity is crucial in ensuring that models trained on PolypDB can generalize well across different clinical environments and patient populations. The inclusion of multi-center data helps mitigate the risk of overfitting to a specific type of imaging or patient demographic, a common challenge in medical image analysis. As our results demonstrate, models trained and evaluated on this dataset show consistent performance across different modalities, suggesting that PolypDB can serve as a valuable resource for developing more universal and robust polyp detection and segmentation models.
IV-B Superior performance of PVT-CASCADE and SSFormer-L
In this study, PVT-CASCADE and SSFormer-L consistently demonstrated top-tier performance across several metrics, including mIoU, mDSC, recall, and F2 scores. PVT-CASCADE performed particularly well in the NBI and LCI modalities, with mIoU of 0.7769 and 0.8344 and mDSC of 0.8586 and 0.9065, respectively; its NBI scores were the best among all models. This can be attributed to the powerful feature extraction capabilities of the PVTv2-B2 backbone, which effectively captures both global and local contextual information necessary for accurate polyp segmentation. The superior recall (0.9385) and F2 score (0.8941) observed in the NBI dataset further confirm PVT-CASCADE’s robustness in detecting subtle polyp structures, which is crucial in clinical settings where missing even a single polyp can have significant consequences. SSFormer-L, on the other hand, showed exceptional consistency across all modalities, particularly excelling in the WLI and LCI datasets. Its highest mIoU (0.8821) and mDSC (0.9294) in the WLI dataset, combined with its balanced performance in recall (0.9314) and precision (0.9438), underline its effectiveness in segmenting polyps with high accuracy. The integration of the MiT-PLD-B4 backbone likely contributes to this, as it enables the model to capture multi-scale information and maintain robustness across diverse visual features found in endoscopic images.
IV-C Implications for clinical applications
The clinical implications of this work are substantial. By providing a publicly accessible, large-scale dataset, PolypDB enables researchers to develop more accurate and generalizable CAD systems that can assist gastroenterologists in detecting and segmenting polyps with higher precision. This can lead to a reduction in polyp miss rates, which is critical in preventing colorectal cancer. Moreover, the modality-specific benchmarks provided in this study offer guidance on selecting the most appropriate models for different imaging modalities, potentially improving the overall quality of colonoscopy procedures.
IV-D Limitations and future directions
While PolypDB represents a significant step forward, there are limitations that warrant further exploration. For example, the dataset, while large and diverse, may still not capture the full range of variability encountered in global clinical practice. Future work could involve expanding the dataset to include more centers, imaging systems, and patient demographics. Additionally, while the current study provides a strong benchmark for segmentation and detection, there is room for improvement in addressing specific challenges, such as detecting diminutive and flat polyps, which remain difficult even for state-of-the-art models.
Despite the promising results, there are several areas for future exploration. The current models, while effective, could benefit from further refinement to enhance their robustness across all modalities. Integrating multi-modal learning, where models are trained simultaneously on multiple modalities, could improve performance by allowing the models to leverage complementary information from different imaging techniques. Additionally, incorporating real-time processing capabilities will be crucial for practical deployment in clinical settings. While this study focuses on segmentation accuracy, future work should also consider these models’ computational efficiency and speed to ensure they can be seamlessly integrated into endoscopic procedures. Finally, expanding the datasets used for training and validation, including more diverse patient populations and a wider range of polyp types, will help to ensure that these models generalize well across different clinical environments.
V Conclusion
In this paper, we introduced PolypDB, a multi-center and multi-modality polyp segmentation and detection dataset designed to advance polyp detection and segmentation in colonoscopy. PolypDB, comprising 3,934 polyp images from diverse imaging modalities and multiple medical centers, addresses the critical need for robust and generalizable data in developing CAD systems. The dataset’s diversity, in terms of both imaging modalities and geographical locations, helps models trained on PolypDB perform effectively across a wide range of clinical settings, thereby enhancing their applicability in real-world scenarios. Our extensive benchmarking of SOTA segmentation and detection models demonstrated that models like SSFormer-L and YOLOv10 achieved superior performance across multiple modalities, establishing strong baselines for future research. While our study highlights the significant potential of PolypDB, it also opens up avenues for future research, including expanding the dataset’s diversity and exploring novel model architectures to address the remaining challenges in polyp detection, particularly with diminutive and flat polyps. In future research, we aim to develop a comprehensive video dataset that captures the dynamic aspects of polyp detection and segmentation during real-time colonoscopy procedures. This dataset will feature complete video sequences from multiple centers and modalities, offering richer temporal and contextual information.
Acknowledgment
This project is supported by NIH funding: R01-CA246704, R01-CA240639, U01-DK127384-02S1, and U01-CA268808.
References
- [1] H. Sung, J. Ferlay, R. L. Siegel, M. Laversanne, I. Soerjomataram, A. Jemal, and F. Bray, “Global Cancer Statistics 2020: GLOBOCAN Estimates of Incidence and Mortality Worldwide for 36 Cancers in 185 Countries,” CA: a cancer journal for clinicians, vol. 71, no. 3, pp. 209–249, 2021.
- [2] K. R. Yabroff, A. Mariotto, F. Tangka, J. Zhao, F. Islami, H. Sung, R. L. Sherman, S. J. Henley, A. Jemal, and E. M. Ward, “Annual Report to the Nation on the Status of Cancer, Part 2: Patient Economic Burden Associated With Cancer Care,” JNCI: Journal of the National Cancer Institute, vol. 113, no. 12, pp. 1670–1682, 2021.
- [3] J. T. Hetzel, C. S. Huang, J. A. Coukos, K. Omstead, S. R. Cerda, S. Yang, M. J. O’Brien, and F. A. Farraye, “Variation in the detection of serrated polyps in an average risk colorectal cancer screening cohort,” Official journal of the American College of Gastroenterology | ACG, vol. 105, no. 12, pp. 2656–2664, 2010.
- [4] A. Leufkens, M. Van Oijen, F. Vleggaar, and P. Siersema, “Factors influencing the miss rate of polyps in a back-to-back colonoscopy study,” Endoscopy, vol. 44, no. 05, pp. 470–475, 2012.
- [5] M. D. Rutter, I. Beintaris, R. Valori, H. M. Chiu, D. A. Corley, M. Cuatrecasas, E. Dekker, A. Forsberg, J. Gore-Booth, U. Haug et al., “World endoscopy organization consensus statements on post-colonoscopy and post-imaging colorectal cancer,” Gastroenterology, vol. 155, no. 3, pp. 909–925, 2018.
- [6] D. Jha, P. H. Smedsrud, M. A. Riegler, P. Halvorsen, T. de Lange, D. Johansen, and H. D. Johansen, “Kvasir-SEG: A Segmented Polyp Dataset,” in Proceedings of the International Conference on Multimedia Modeling (MMM), 2020, pp. 451–462.
- [7] H. Borgli, V. Thambawita, P. H. Smedsrud, S. Hicks, D. Jha, S. L. Eskeland, K. R. Randel, K. Pogorelov, M. Lux, D. T. D. Nguyen et al., “Hyperkvasir, a comprehensive multi-class image and video dataset for gastrointestinal endoscopy,” Scientific Data, vol. 7, no. 1, pp. 1–14, 2020.
- [8] P. H. Smedsrud, V. Thambawita, S. A. Hicks, H. Gjestang, O. O. Nedrejord, E. Næss, H. Borgli, D. Jha, T. J. D. Berstad, S. L. Eskeland et al., “Kvasir-capsule, a video capsule endoscopy dataset,” Scientific Data, vol. 8, no. 1, pp. 1–10, 2021.
- [9] J. Bernal and H. Aymeric, “MICCAI endoscopic vision challenge polyp detection and segmentation,” 2017.
- [10] N. Tajbakhsh, S. R. Gurudu, and J. Liang, “Automated polyp detection in colonoscopy videos using shape and context information,” IEEE transactions on medical imaging, vol. 35, no. 2, pp. 630–644, 2015.
- [11] P. Ngoc Lan, N. S. An, D. V. Hang, D. V. Long, T. Q. Trung, N. T. Thuy, and D. V. Sang, “Neounet: Towards accurate colon polyp segmentation and neoplasm detection,” in Proceedings of the International Symposium on Visual Computing, 2021, pp. 15–28.
- [12] S. Ali, D. Jha, N. Ghatwary, S. Realdon, R. Cannizzaro, O. E. Salem, D. Lamarque, C. Daul, M. A. Riegler, K. V. Anonsen et al., “Polypgen: A multi-center polyp detection and segmentation dataset for generalisability assessment,” Scientific Report, 2023.
- [13] D. Jha, P. H. Smedsrud, D. Johansen, T. de Lange, H. D. Johansen, P. Halvorsen, and M. A. Riegler, “A Comprehensive Study on Colorectal Polyp Segmentation With ResUNet++, Conditional Random Field and Test-Time augmentation,” IEEE Journal of Biomedical and Health Informatics, vol. 25, no. 6, pp. 2029–2040, 2021.
- [14] U. Ladabaum, A. Fioritto, A. Mitani, M. Desai, J. P. Kim, D. K. Rex, T. Imperiale, and N. Gunaratnam, “Real-time optical biopsy of colon polyps with narrow band imaging in community practice does not yet meet key thresholds for clinical decisions,” Gastroenterology, vol. 144, no. 1, pp. 81–91, 2013.
- [15] C. J. Rees, P. T. Rajasekhar, A. Wilson, H. Close, M. D. Rutter, B. P. Saunders, J. E. East, R. Maier, M. Moorghen, U. Muhammad et al., “Narrow band imaging optical diagnosis of small colorectal polyps in routine clinical practice: the detect inspect characterise resect and discard 2 (discard 2) study,” Gut, vol. 66, no. 5, pp. 887–895, 2017.
- [16] D. E. van Toledo, J. E. IJspeert, P. M. Bossuyt, A. G. Bleijenberg, M. E. van Leerdam, M. van der Vlugt, I. Lansdorp-Vogelaar, M. C. Spaander, and E. Dekker, “Serrated polyp detection and risk of interval post-colonoscopy colorectal cancer: a population-based study,” The Lancet Gastroenterology & Hepatology, vol. 7, no. 8, pp. 747–754, 2022.
- [17] J. C. Van Rijn, J. B. Reitsma, J. Stoker, P. M. Bossuyt, S. J. Van Deventer, and E. Dekker, “Polyp miss rate determined by tandem colonoscopy: a systematic review,” Official journal of the American College of Gastroenterology | ACG, vol. 101, no. 2, pp. 343–350, 2006.
- [18] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “PyTorch: An Imperative Style, High-Performance Deep Learning Library,” in Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), vol. 32, 2019.
- [19] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional Networks for Biomedical Image Segmentation,” in Proceedings of the International Conference on Medical image computing and computer-assisted intervention (MICCAI), 2015, pp. 234–241.
- [20] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 801–818.
- [21] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), 2016, pp. 770–778.
- [22] D.-P. Fan, G.-P. Ji, T. Zhou, G. Chen, H. Fu, J. Shen, and L. Shao, “PraNet: Parallel Reverse Attention Network for Polyp Segmentation,” in Proceedings of the International conference on medical image computing and computer-assisted intervention (MICCAI), 2020, pp. 263–273.
- [23] S.-H. Gao, M.-M. Cheng, K. Zhao, X.-Y. Zhang, M.-H. Yang, and P. Torr, “Res2net: A new multi-scale backbone architecture,” IEEE transactions on pattern analysis and machine intelligence, vol. 43, no. 2, pp. 652–662, 2019.
- [24] A. Lou, S. Guan, H. Ko, and M. H. Loew, “Caranet: Context axial reverse attention network for segmentation of small medical objects,” in Medical Imaging 2022: Image Processing, vol. 12032, 2022, pp. 81–92.
- [25] N. K. Tomar, D. Jha, U. Bagci, and S. Ali, “TGANet: text-guided attention for improved polyp segmentation,” in Proceedings of the Medical Image Computing and Computer Assisted Intervention (MICCAI), 2022, pp. 151–160.
- [26] M. M. Rahman and R. Marculescu, “Medical image segmentation via cascaded attention decoding,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 6222–6231.
- [27] W. Wang, E. Xie, X. Li, D.-P. Fan, K. Song, D. Liang, T. Lu, P. Luo, and L. Shao, “Pvt v2: Improved baselines with pyramid vision transformer,” Computational Visual Media, vol. 8, no. 3, pp. 415–424, 2022.
- [28] F. Tang, Z. Xu, Q. Huang, J. Wang, X. Hou, J. Su, and J. Liu, “Duat: Dual-aggregation transformer network for medical image segmentation,” in Proceedings of the Chinese Conference on Pattern Recognition and Computer Vision (PRCV), 2023, pp. 343–356.
- [29] W. Shi, J. Xu, and P. Gao, “Ssformer: A lightweight transformer for semantic segmentation,” in Proceedings of the IEEE 24th International Workshop on Multimedia Signal Processing (MMSP), 2022, pp. 1–5.
- [30] Ultralytics, “Yolov8 - ultralytics,” https://github.com/ultralytics/ultralytics, 2023, accessed: [Date].
- [31] A. Wang, H. Chen, L. Liu, K. Chen, Z. Lin, J. Han, and G. Ding, “Yolov10: Real-time end-to-end object detection,” arXiv preprint arXiv:2405.14458, 2024.
- [32] C.-Y. Wang, I.-H. Yeh, and H.-Y. M. Liao, “Yolov9: Learning what you want to learn using programmable gradient information,” arXiv preprint arXiv:2402.13616, 2024.
- [33] C.-Y. Wang, A. Bochkovskiy, and H.-Y. M. Liao, “Yolov7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 7464–7475.
- [34] C. Li, L. Li, H. Jiang, K. Weng, Y. Geng, L. Li, Z. Ke, Q. Li, M. Cheng, W. Nie et al., “Yolov6: A single-stage object detection framework for industrial applications,” arXiv preprint arXiv:2209.02976, 2022.
- [35] G. Jocher, “Yolov5 by ultralytics,” 2020. [Online]. Available: https://github.com/ultralytics/yolov5