QUBIQ: Uncertainty Quantification for Biomedical Image Segmentation Challenge

Hongwei Bran Li, Fernando Navarro, Ivan Ezhov, Amirhossein Bayat, Dhritiman Das, Florian Kofler, Suprosanna Shit, Diana Waldmannstetter, Johannes C. Paetzold, Xiaobin Hu, Benedikt Wiestler, Lucas Zimmer, Tamaz Amiranashvili, Chinmay Prabhakar, Christoph Berger, Jonas Weidner, Michelle Alonso-Basanta, Arif Rashid, Ujjwal Baid, Wesam Adel, Deniz Alis, Bhakti Baheti, Yingbin Bai, Ishaan Bhat, Sabri Can Cetindag, Wenting Chen, Li Cheng, Prasad Dutande, Lara Dular, Mustafa A. Elattar, Ming Feng, Shengbo Gao, Henkjan Huisman, Weifeng Hu, Shubham Innani, Wei Ji, Davood Karimi, Hugo J. Kuijf, Jin Tae Kwak, Hoang Long Le, Xiang Li, Huiyan Lin, Tongliang Liu, Jun Ma, Kai Ma, Ting Ma, Ilkay Oksuz, Robbie Holland, Arlindo L. Oliveira, Jimut Bahan Pal, Xuan Pei, Maoying Qiao, Anindo Saha, Raghavendra Selvan, Linlin Shen, Joao Lourenco Silva, Ziga Spiclin, Sanjay Talbar, Dadong Wang, Wei Wang, Xiong Wang, Yin Wang, Ruiling Xi, Kele Xu, Yanwu Yang, Mert Yergin, Shuang Yu, Lingxi Zeng, YingLin Zhang, Jiachen Zhao, Yefeng Zheng, Martin Zukovec, Richard Do, Anton Becker, Amber Simpson, Ender Konukoglu, Andras Jakab, Spyridon Bakas, Leo Joskowicz, Bjoern Menze

Department of Informatics, Technical University of Munich, Germany
Athinoula A. Martinos Center for Biomedical Imaging, Massachusetts General Hospital, Harvard Medical School, USA
Department of Quantitative Biomedicine, University of Zurich, Switzerland
University Children’s Hospital Zurich, University of Zurich, Switzerland
Department of Radiation Oncology and Radiation Therapy, Klinikum rechts der Isar, Technical University of Munich, Germany
Department of Information Technology and Electrical Engineering, ETH-Zurich, Switzerland
Department of Radiology, Memorial Sloan Kettering Cancer Center in New York City, USA
Department of Biomedical and Molecular Sciences, Queen’s University, Canada
TranslaTUM - Central Institute for Translational Cancer Research, Technical University of Munich, Germany
McGovern Institute, Massachusetts Institute of Technology, USA
Institute for Diagnostic and Interventional Radiology, University Hospital Zurich, Switzerland
BioMedIA, Imperial College London, United Kingdom
Department of Radiation Oncology, University of Pennsylvania, PA, USA
University of Pennsylvania, PA, USA
Department of Radiation Oncology, Winship Cancer Institute of Emory University, Georgia, USA
Nile University, Cairo, Egypt
Department of Medical Sciences, Acibadem University, Istanbul, Turkey
Shri Guru Gobind Singhji Institute of Engineering and Technology, Nanded, Maharashtra, India
Trustworthy Machine Learning Lab, University of Sydney, Australia
Image Sciences Institute, University Medical Center Utrecht, The Netherlands
Computer Engineering Department, Istanbul Technical University, Istanbul, Turkey
School of Computer Science, Shenzhen University, Shenzhen, China
University of Alberta, Canada
University of Ljubljana, Faculty of Electrical Engineering, Ljubljana, Slovenia
Tongji University, Shanghai, China
OPPO Research Institute, Shanghai, China
School of Biological and Medical Engineering, Beihang University, Beijing, China
Harvard Medical School, Boston, USA
School of Electrical Engineering, Korea University, Seoul, Korea
Department of Computer Science and Engineering, Sejong University, Seoul, Korea
Harbin Institute of Technology, China
Southern University of Science and Technology, China
Department of Electronic and Information Engineering, Harbin Institute of Technology at Shenzhen, China
Peng Cheng Lab, Shenzhen, China
Advanced Innovation Center for Human Brain Protection, Capital Medical University, Beijing, China
National Clinical Research Center for Geriatric Disorders, Xuanwu Hospital Capital Medical University, Beijing, China
Instituto Superior Tecnico / INESC-ID, Portugal
Department of Computer Science, Ramakrishna Mission Vivekananda Educational and Research Institute, India
Australian Catholic University, Australia
Commonwealth Scientific and Industrial Research Organisation (CSIRO), Australia
University of Copenhagen, Denmark
National Key Lab of Parallel and Distributed Processing, Changsha, China
National University of Defense Technology, Changsha, China
Hevi AI, Istanbul, Turkey
Department of Computer Science and Engineering, Hong Kong University of Science and Technology, China
Tencent Healthcare (Shenzhen) Co., Ltd, China
Department of Mathematics, Nanjing University of Science and Technology, China
Diagnostic Image Analysis Group, Radboud University Medical Center, Nijmegen, The Netherlands
The Rachel and Selim Benin School of Computer Science and Engineering, The Hebrew University of Jerusalem, Israel
Helmholtz AI, Helmholtz Zentrum München, Germany
TranslaTUM - Central Institute for Translational Cancer Research, Technical University of Munich, Germany
Department of Diagnostic and Interventional Neuroradiology, School of Medicine, Klinikum rechts der Isar, Technical University of Munich, Germany

Abstract

Uncertainty in medical image segmentation tasks, especially inter-rater variability arising from differences in interpretations and annotations by various experts, presents a significant challenge in achieving consistent and reliable image segmentation. This variability not only reflects the inherent complexity and subjective nature of medical image interpretation but also directly impacts the development and evaluation of automated segmentation algorithms. Accurately modeling and quantifying this variability is essential for enhancing the robustness and clinical applicability of these algorithms. We report the set-up and summarize the benchmark results of the Quantification of Uncertainties in Biomedical Image Quantification Challenge (QUBIQ), which was organized in conjunction with the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI) in 2020 and 2021. The challenge focuses on uncertainty quantification for medical image segmentation and takes into account the omnipresence of inter-rater variability in imaging datasets. The large collection of images with multi-rater annotations covers various modalities, such as MRI and CT; various organs, such as the brain, prostate, kidney, and pancreas; and both 2D and 3D image dimensions. A total of 24 teams submitted different solutions to the problem, combining various baseline models, Bayesian neural networks, and ensemble model techniques. The results indicate the importance of ensemble models, as well as the need for further research into efficient uncertainty quantification methods for 3D segmentation tasks.


1 Introduction

Background

The segmentation of anatomical structures and pathologies in medical images is frequently subject to substantial inter-rater variability (Lazarus et al., 2006; Watadani et al., 2013), which in turn significantly affects downstream supervised-learning tasks and clinical decision-making. This variability is especially consequential in medical imaging, where manual annotations are often limited and costly to acquire (Kofler et al., 2023). A notable example is the segmentation of liver lesions in CT scans, which is difficult even for experienced experts because lesion location, contrast, and size vary among patients (Joskowicz et al., 2019). Manual delineations vary widely across observers for a broad spectrum of structures and pathologies, as shown in Figure 1. Involving only two or three observers may therefore be inadequate to capture the full breadth of potential variability in the outlines of the targeted structures. This variability, which is intrinsic to the biological problem, the imaging modality, and the expertise of the annotators, has not yet been adequately addressed in the design of computerized algorithms for medical image quantification (Kofler et al., 2021b).

Figure 1: Visualisation of the multi-rater segmentation masks on brain and prostate MRI scans and their derived uncertainty map.
Uncertainty quantification

Current methods for modeling uncertainty in predicted image segmentations primarily stem from three sources: general statistical model considerations; ensemble approaches that resample the training data and aggregate multiple segmentation results; and systematic modifications to the predictive algorithm, as in Monte Carlo (MC) dropout. Yet the exact delineation of segmented structures within an image inherently carries uncertainty, which is both task-specific and dependent on the dataset. Importantly, this uncertainty can be estimated directly from annotations made by multiple human experts. To our knowledge, there are currently no datasets available specifically for evaluating the accuracy of probabilistic model predictions against such multi-expert ground truths. Furthermore, there is a lack of consensus on which uncertainty quantification procedures yield realistic estimates and which do not.
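As an illustration of how an uncertainty map can be derived from multi-rater annotations, the following minimal sketch averages binary rater masks into a soft label and computes a per-pixel binary entropy as a disagreement measure. The function name `rater_agreement_maps`, the use of entropy, and the toy masks are assumptions made for this example; the source does not specify how the uncertainty map in Figure 1 was computed.

```python
import numpy as np


def rater_agreement_maps(masks):
    """Return (soft_label, entropy) for binary masks of shape (R, ...)."""
    masks = np.asarray(masks, dtype=float)
    # Fraction of raters that marked each pixel as foreground.
    soft_label = masks.mean(axis=0)
    # Binary entropy in bits; clipping avoids log(0) where raters are unanimous.
    p = np.clip(soft_label, 1e-12, 1.0 - 1e-12)
    entropy = -(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p))
    # Force exact zeros where all raters agree.
    entropy[(soft_label == 0.0) | (soft_label == 1.0)] = 0.0
    return soft_label, entropy


# Hypothetical example: three raters annotating a 2x2 image.
masks = np.array([
    [[1, 1], [0, 0]],
    [[1, 0], [0, 0]],
    [[1, 1], [1, 0]],
])
soft_label, entropy = rater_agreement_maps(masks)
```

Pixels on which all raters agree receive zero entropy, while pixels with split votes receive higher values, so the map highlights the contested boundary regions.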

Objective

The primary goal of the challenge is to establish a benchmark for algorithms that generate uncertainty estimates (such as probability scores and variability regions) in medical imaging segmentation tasks. The focus is to compare these algorithmic outputs against the uncertainties ascribed by human annotators in the local delineation of structures across various biomedical imaging segmentation tasks. These tasks include, but are not limited to, the segmentation of lesions (such as brain, pancreas, or prostate tumors) and anatomical structures (like brain, kidney, prostate, and pancreas). Multiple expert annotations have been gathered for several CT and MR image datasets to quantify boundary delineation variability.

Contributions

In an effort to assess the latest methods in uncertainty quantification for medical image segmentation, we organized the Quantification of Uncertainties in Biomedical Image Quantification Challenge (QUBIQ) at MICCAI 2020 and MICCAI 2021. This paper makes three major contributions to the field. First, we introduce a new, publicly available multi-rater, multi-center, multi-modality dataset that includes both 2D and 3D segmentation tasks. Second, we present the setup and summarize the findings of the QUBIQ uncertainty quantification benchmarks held at the two challenge editions. Third, we review, evaluate, rank, and analyze the state-of-the-art algorithms that emerged from these benchmarks.

2 Prior Work on Approaches and Datasets

2.1 Prior work.

A body of literature models uncertainty and inter-rater variability in biomedical image segmentation (Lê et al., 2016; Sabuncu et al., 2010; Kwon et al., 2020; Roy et al., 2019; Ilg et al., 2018). Some prior methods extract uncertainty estimates directly from trained models, either by augmenting the input image (Wang et al., 2019) or by generating multiple potential segmentations using MC dropout (Nair et al., 2020). Others build ensembles, either by generating multiple parallel predictions (Ilg et al., 2018) or by running multiple models in parallel (Calisto & Lai-Yuen, 2020). Kofler et al. (2021a) extend this further by building an ensemble of multiple approaches from the literature together with a system that alerts the user when segmentation agreement within the ensemble is low. In contrast, other methods explicitly model inter-rater uncertainty. In the Probabilistic U-Net, Kohl et al. (2018) use variational inference to learn a prior distribution of variability, from which they sample plausible segmentations, while Baumgartner et al. (2019) extend this to a hierarchical model capable of modeling uncertainty at different levels of abstraction within the U-Net architecture. Monteiro et al. (2020) explicitly model uncertainty by learning a low-rank pixel-wise covariance matrix.

2.2 Publicly available datasets.

Table 1 lists publicly available datasets for the uncertainty quantification task. Most of the datasets feature multi-rater labeling. Each focuses on a particular pathological or healthy anatomy segmentation task and therefore contains either 2D or 3D images from a single modality, CT or MRI. The QUBIQ challenge offers a dataset composed of multiple tasks that covers both image dimensions and both imaging modalities.

| Dataset | Modality | Target | 2D | 3D | #Images | Multi-rater |
|---|---|---|---|---|---|---|
| LIDC-IDRI (Armato III et al., 2011) | CT | Lung nodule | | | 1,018 | |
| MICCAI-2012 (Litjens et al., 2012) | MRI | Prostate | | | 48 | |
| ISBI-2015 (Styner et al., 2008) | MRI | MS lesion | | | 21 | |
| BraTS (Mehta et al., 2020) | MRI | Brain tumor | | | 335 | |
| QUBIQ | CT, MRI | Six tasks | | | ** | |

Table 1: Overview of publicly available medical datasets for uncertainty quantification in image segmentation tasks. (to be updated)

3 QUBIQ challenge

3.1 QUBIQ datasets

3.1.1 Dataset creation.

For the adult glioma segmentation task, we employ three label sets. The first is the original label from the BraTS adult glioma segmentation challenge (Bakas et al., 2019). The other two are algorithm-based labels obtained with the BraTS Toolkit (Kofler et al., 2020). To generate these, we first compute five algorithmic glioma segmentations (Isensee et al., 2019; McKinley et al., 2019; Feng et al., 2020; Zhao et al., 2019; McKinley et al., 2020). We then fuse them in two ways, using basic majority voting and SIMPLE fusion (Langerak et al., 2010).
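The majority-voting step can be sketched as follows. This is a minimal illustration for binary masks, not the BraTS Toolkit implementation: the function name `majority_vote`, the strict-majority rule, and the toy inputs are assumptions, and SIMPLE fusion (which iteratively reweights and discards candidate segmentations) is not shown.

```python
import numpy as np


def majority_vote(segmentations):
    """Fuse binary segmentations of shape (N, ...) by strict per-voxel majority."""
    segs = np.asarray(segmentations).astype(bool)
    votes = segs.sum(axis=0)
    # A voxel is foreground if more than half of the inputs mark it.
    return 2 * votes > segs.shape[0]


# Hypothetical example: five candidate segmentations of a 1x4 image.
candidates = np.array([
    [[1, 1, 0, 0]],
    [[1, 1, 1, 0]],
    [[1, 0, 1, 0]],
    [[1, 1, 0, 0]],
    [[1, 0, 0, 1]],
])
fused = majority_vote(candidates)
```

With an odd number of candidates, as in the five-algorithm setting described above, a strict majority never produces ties.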

| Dataset | Modality | 2020 | 2021 | 2D | 3D | #Images | #Tasks | Source (to be verified) |
|---|---|---|---|---|---|---|---|---|
| Prostate segmentation | MRI | ✓ | ✓ | | | 55 | 2 | ETH Zürich |
| Brain growth segmentation | MRI | ✓ | ✓ | | | 39 | 1 | University of Zürich |
| Brain tumor segmentation | multimodal MRI | ✓ | ✓ | | | 32 | 3 | University of Pennsylvania |
| Kidney segmentation | CT | ✓ | ✓ | | | 24 | 1 | Technical University of Munich |
| Pancreas segmentation | CT | – | ✓ | | | 38 | 1 | University of Pennsylvania |
| Pancreatic lesion segmentation | CT | – | ✓ | | | 21 | 1 | University of Pennsylvania |

Table 2: Overview of QUBIQ datasets and the sub-tasks

3.2 Evaluation metrics and ranking

For the evaluation, each participant had to segment the given binary structures and predict the distribution of the experts’ labels by returning a single mask with continuous values between 0 and 1 that is meant to reproduce the average of the experts’ segmentations.

The continuous ground truth labels are obtained by averaging the annotations of multiple experts. Predictions and continuous ground truth labels are compared by binarizing both at predefined thresholds and computing the volumetric overlap of the resulting binary volumes with the Dice score. To this end, both the ground truth and the prediction are binarized at nine probability levels (0.1, 0.2, …, 0.8, 0.9), and the Dice scores for all thresholds are averaged.
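A minimal sketch of this threshold-averaged Dice, assuming NumPy arrays with values in [0, 1], is given below. The function names, the use of `>=` for binarization, and the convention that two empty masks score 1.0 are assumptions for illustration; the official evaluation code may handle these cases differently.

```python
import numpy as np


def dice(a, b):
    """Dice overlap of two binary arrays."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        # Assumption: two empty masks count as perfect agreement.
        return 1.0
    return 2.0 * np.logical_and(a, b).sum() / denom


def threshold_averaged_dice(pred, gt, thresholds=np.arange(1, 10) / 10):
    """Average Dice over binarizations of pred and gt at 0.1, 0.2, ..., 0.9."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    return float(np.mean([dice(pred >= t, gt >= t) for t in thresholds]))


# Hypothetical example: the prediction misses a pixel that half the raters marked.
gt = np.array([[1.0, 0.5], [0.0, 0.0]])
pred = np.array([[1.0, 0.0], [0.0, 0.0]])
score = threshold_averaged_dice(pred, gt)
```

In this toy case the prediction matches the ground truth only at thresholds above 0.5, so the score drops below 1 even though the confidently labeled pixel is correct.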

The Q-Dice, a staged Dice score, is used to quantify the quality of the predicted probability map $p$ against the ground truth $y$ in $L$ discrete probability levels, formulated as:

$$
T_{L}(p,l)=\begin{dcases}\mathbbm{1}\left\{\frac{l}{L}\leq p<\frac{l+1}{L}\right\}, & \text{if } 0\leq l<L-1\\ \mathbbm{1}\left\{\frac{l}{L}\leq p\leq\frac{l+1}{L}\right\}, & \text{if } l=L-1\end{dcases} \tag{1}
$$

Compared to the original Dice score, the Q-Dice quantifies uncertainty by comparing the prediction and ground truth maps at different confidence levels. Since experts agree on most parts of the annotations in most cases, the variance across the Q-Dice scores at different levels shows how well the prediction models the uncertainty at the borders of the structure of interest.
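The level indicator $T_L$ from Eq. (1) can be sketched in a few lines. The text defines only the indicator, so the way per-level overlaps are combined below (an unweighted mean of per-level Dice scores that skips levels empty in both maps) is an assumption for illustration, as are the function names and the default $L=10$.

```python
import numpy as np


def level_indicator(p, l, L):
    """T_L(p, l): True where p falls into the l-th of L probability bins."""
    p = np.asarray(p, dtype=float)
    lower, upper = l / L, (l + 1) / L
    if l < L - 1:
        return (p >= lower) & (p < upper)
    # The last bin is closed on the right so that p = 1 is included.
    return (p >= lower) & (p <= upper)


def staged_dice(pred, gt, L=10):
    """Assumed aggregation: mean Dice between per-level indicator maps."""
    scores = []
    for l in range(L):
        a = level_indicator(pred, l, L)
        b = level_indicator(gt, l, L)
        denom = a.sum() + b.sum()
        if denom == 0:
            continue  # Assumption: skip levels that are empty in both maps.
        scores.append(2.0 * np.logical_and(a, b).sum() / denom)
    return float(np.mean(scores)) if scores else 1.0


# Hypothetical probability values, including both interval end points.
probs = np.array([0.0, 0.05, 0.1, 0.55, 0.95, 1.0])
membership = sum(level_indicator(probs, l, 10).astype(int) for l in range(10))
```

Because the bins are half-open except for the last one, every probability in [0, 1] belongs to exactly one level, which is what `membership` checks in the example.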

3.3 Challenge events

The QUBIQ challenge was organized within the MICCAI conference using the Grand Challenge platform. Tables 5, 6, and 7 describe the algorithms submitted to the two iterations of the QUBIQ challenge (QUBIQ 2020 and QUBIQ 2021). Fig. 2 quantitatively compares the algorithms over the two iterations of the challenge.

3.4 Results

In Tables 3 and 4, we show the leaderboard for both iterations of the challenge.

| Team | Brain-growth | Brain-tumor Task 1 | Brain-tumor Task 2 | Brain-tumor Task 3 | Kidney | Prostate Task 1 | Prostate Task 2 | Average Ranking | Average Dice |
|---|---|---|---|---|---|---|---|---|---|
| Jun_Ma | 0.921 | 0.936 | 0.809 | 0.822 | 0.310 | 0.970 | 0.918 | 7.857 | 0.812 |
| Yanwu_Yang | 0.893 | 0.917 | 0.699 | 0.836 | 0.825 | 0.937 | 0.878 | 7.143 | 0.855 |
| Macaroon | 0.878 | 0.848 | 0.528 | 0.690 | 0.238 | 0.937 | 0.890 | 5 | 0.715 |
| Raghavendra_Selvan 2 | 0.885 | 0.899 | 0.617 | 0.682 | 0.695 | 0.883 | 0.800 | 4.857 | 0.780 |
| Raghavendra_Selvan | 0.907 | 0.874 | 0.602 | 0.690 | 0.639 | 0.858 | 0.780 | 4.714 | 0.764 |
| Wei_Ji | 0.900 | 0.755 | 0.323 | 0.605 | 0.915 | 0.941 | 0.845 | 4.714 | 0.755 |
| Xiang_Li | 0.865 | 0.931 | 0.513 | 0.556 | 0.903 | 0.914 | 0.872 | 4.714 | 0.793 |
| Ujjwal_Baid | 0.840 | 0.782 | 0.406 | 0.568 | 0.956 | 0.891 | 0.702 | 3.143 | 0.735 |
| Maykol_Campos | 0.849 | 0.799 | 0.522 | 0.613 | 0.805 | 0.838 | 0.630 | 2.857 | 0.722 |
| anysys99 | 0.818 | 0.893 | 0.485 | 0.724 | - | 0.890 | 0.804 | - | - |
| Davood_Karimi | 0.874 | 0.900 | 0.452 | - | 0.785 | 0.947 | 0.897 | - | - |

Table 3: Results of QUBIQ 2020, ordered according to the ranking score. The top three performing teams are highlighted in blue. Note that only teams participating in all tasks are considered for the overall ranking.
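The Average Dice column of Table 3 appears to be the unweighted mean of the seven per-task scores. The short check below recomputes it for three rows copied from the table; this reading of the column is an inference from the numbers rather than a stated definition, and deviations in the third decimal are expected because the per-task values are themselves rounded.

```python
# Per-task Dice scores copied from Table 3 (seven tasks per team).
per_task = {
    "Jun_Ma": [0.921, 0.936, 0.809, 0.822, 0.310, 0.970, 0.918],
    "Yanwu_Yang": [0.893, 0.917, 0.699, 0.836, 0.825, 0.937, 0.878],
    "Macaroon": [0.878, 0.848, 0.528, 0.690, 0.238, 0.937, 0.890],
}
# Average Dice as reported in the last column of Table 3.
reported = {"Jun_Ma": 0.812, "Yanwu_Yang": 0.855, "Macaroon": 0.715}

# Unweighted mean over tasks and its absolute difference to the reported value.
recomputed = {team: sum(s) / len(s) for team, s in per_task.items()}
deviation = {team: abs(recomputed[team] - reported[team]) for team in per_task}
```

The same check makes it clear why a single weak task, such as the kidney score of 0.310 for Jun_Ma, pulls the average down noticeably.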
| Team | Brain-growth | Brain-tumor Task 1 | Brain-tumor Task 2 | Brain-tumor Task 3 | Kidney | Prostate Task 1 | Prostate Task 2 | Pancreas | Pancreatic Lesion | Average Ranking | Average Dice |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Peng-Cheng_Shi | 0.929 | 0.938 | 0.819 | 0.847 | 0.954 | 0.969 | 0.920 | 0.550 | 0.272 | 11.111 | 0.800 |
| Yingbin_Bai | 0.915 | 0.928 | 0.793 | 0.815 | 0.940 | 0.968 | 0.920 | 0.579 | 0.205 | 9.333 | 0.785 |
| Lara_Dular | 0.928 | 0.938 | 0.820 | 0.899 | 0.467 | 0.958 | 0.909 | 0.499 | 0.283 | 8.778 | 0.745 |
| Lawrence_Schobs | 0.300 | 0.939 | 0.780 | 0.798 | 0.503 | 0.969 | 0.915 | 0.683 | 0.330 | 8.556 | 0.691 |
| Sabrican_Cetindag | 0.928 | 0.932 | 0.769 | 0.883 | 0.839 | 0.964 | 0.922 | 0.409 | 0.231 | 8.444 | 0.764 |
| Yucong_Chen | 0.916 | 0.927 | 0.775 | 0.840 | 0.952 | 0.952 | 0.907 | 0.572 | 0.130 | 7.889 | 0.775 |
| Hoang_Long_Le | 0.912 | 0.899 | 0.680 | 0.754 | 0.706 | 0.971 | 0.927 | 0.575 | 0.246 | 7.444 | 0.741 |
| Dewen_Zeng | 0.927 | 0.940 | 0.695 | 0.835 | 0.894 | 0.947 | 0.911 | 0.423 | 0.126 | 7.111 | 0.744 |
| Joao_Lourenco_Silva | 0.931 | 0.929 | 0.750 | 0.797 | 0.511 | 0.968 | 0.920 | 0.075 | 0.068 | 6.333 | 0.661 |
| Anindo_Saha | 0.892 | 0.917 | 0.695 | 0.740 | 0.950 | 0.936 | 0.859 | 0.546 | 0.194 | 5.222 | 0.748 |
| Wang_Xiong | 0.893 | 0.905 | 0.589 | 0.784 | 0.930 | 0.916 | 0.862 | 0.557 | 0.204 | 5.222 | 0.738 |
| Ishaan_Rajesh | 0.892 | 0.919 | 0.638 | 0.704 | 0.858 | 0.861 | 0.799 | 0.316 | 0.122 | 3.111 | 0.679 |
| Stephan_Huschauer | 0.719 | 0.865 | 0.525 | 0.551 | 0.856 | 0.911 | 0.842 | 0.423 | 0.118 | 2.444 | 0.646 |
| Jiachen_Zhao | 0.873 | 0.844 | 0.547 | 0.787 | 0.835 | 0.931 | 0.884 | - | - | - | - |
| Jimut_Bahan_Pal | 0.869 | 0.842 | 0.456 | 0.690 | 0.769 | 0.833 | 0.781 | - | - | - | - |
| Mohammad_Eslami | 0.848 | 0.404 | 0.377 | 0.236 | 0.716 | 0.883 | 0.816 | - | - | - | - |
| Shengbo_Gao | 0.802 | 0.885 | 0.627 | 0.661 | 0.910 | - | - | 0.557 | 0.130 | - | - |
| Xiaofeng_Liu | 0.800 | - | - | - | - | - | - | - | - | - | - |
| Timothy_S | 0.780 | - | - | - | - | - | - | - | - | - | - |

Table 4: Results of QUBIQ 2021, ordered according to the ranking score. The top three performing teams are highlighted in blue. Note that only teams participating in all tasks are considered for the overall ranking.
Figure 2: Pictorial evaluation of QUBIQ challenge for different tasks over two years. Observe that for every task, the top-performing methods produce higher scores in 2021 than in 2020. Also, the methods in 2021 are more competitive than in 2020.

4 Conclusion

In this paper, we report on the results of the QUBIQ challenge (Quantification of Uncertainties in Biomedical Image Quantification Challenge), which was organized in conjunction with the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI). Quantifying uncertainty in medical imaging is paramount for image analysis, as inter-rater variability is omnipresent in imaging datasets. Such quantification could reduce barriers to adopting learnable algorithms into clinical practice. With the QUBIQ challenge, we aim to fill a gap among medical imaging challenges, which are dominated by competitions in deterministic segmentation that ignore the importance of uncertainty prediction.

Acknowledgement

The research is supported through the SFB 824, subproject B12, as well as by Deutsche Forschungsgemeinschaft (DFG) via TUM International Graduate School of Science and Engineering (IGSSE), GSC 81. We acknowledge support by the Helmut Horten Foundation and by the Translational Brain Imaging Training Network (TRABIT) under the EU ‘Horizon 2020’ research & innovation program (Grant agreement ID: 765148). Research reported in this publication was partly supported by the National Institutes of Health (NIH) under award numbers NIH/NCI:U01CA242871 and NIH/NINDS:R01NS042645.

Table 5: Details of the participating teams’ methods in QUBIQ-Challenge-2020.
| Lead Author & Team Members | Method, Architecture & Modifications | Data Augmentation | Loss Function | Pre-processing | Label Processing | Ensemble strategy |
|---|---|---|---|---|---|---|
| Jun Ma | Multiple 2D U-Nets (one per annotator). | None | Cross-entropy & Dice loss | None | None | Averaging |
| Davood Karimi | 2D U-Net with additional connections between coarse and fine feature layers in the encoder. Dynamic loss weighting for harder classes. Multi-task training approach. | None | 1-Dice similarity loss | Zero mean, unit variance standardization. | Averaging annotations | None |
| Ming Feng; Kele Xu, Yin Wang | 2D U-Net trained with ground truth & predictions binarized at different levels, averaging Dice score of each prediction. | Random scaling [0.9, 1.1] | Weighted cross-entropy & Dice loss | Resizing to 256 × 256 (brain tumor), 512 × 512 (kidney & prostate). Normalization to [0, 255]. | Averaging annotations | None |
| Raghavendra Selvan | Multi-channel U-Net, one channel for each rater. Based on concept of Normalizing Flows. | None | Planar Flow & Dice loss | None | None | None |
| Ujjwal Baid; Prasad Dutande, Shubham Innani, Bhakti Baheti, Sanjay Talbar | ResNet34 based encoder-decoder. Different annotations included as individual copies in training set. | Rotation, flip & scaling | None | Resizing to 256 × 256 (brain), 512 × 512 (kidney), 640 × 640 (prostate). | None | None |
| Wesam Adel; Mustafa A. Elattar | 2D U-Net trained with averaged annotations and as a regression problem. | None | Weighted KL-divergence | Resizing brain to 256 × 256 with rotation and elastic deformation. | None | None |
| Xiang Li | U-Net with attention, 4x downsampling for kidney and 5x for others. | None | Weighted cross-entropy | Cropping to 128 × 128 (kidney), 416 × 416 (prostate) | Averaging annotations | None |
| Yanwu Yang; Ting Ma | 2D U-Net with multiple branches. Instance Norm instead of Batch Norm. One model per annotation integrated using auxiliary loss. | None | Cross-entropy & Dice loss | MRI: z-score normalization. CT: centering on ROI and rescaling to [0, 1] | None | Averaging |
| Wei Ji; Wenting Chen, Shuang Yu, Kai Ma, Li Cheng, Linlin Shen, Yefeng Zheng | U-Net with ResNet-34 encoder. One output channel per label and one model per annotation integrated using auxiliary loss. | None | Cross-entropy | Resizing to 512 × 512 | Both fused final & individual labels. Combining labels via averaging, random sampling & label sampling. | Weighted average |

Table 6: Details of the participating teams’ methods in QUBIQ-Challenge-2021 (part 1).
| Lead Author & Team Members | Method, Architecture & Modifications | Data Augmentation | Loss Function | Pre-processing | Label Processing | Ensemble strategy |
|---|---|---|---|---|---|---|
| Anindo Saha; Henkjan Huisman | Probabilistic U-Net with MC Dropout. | Gaussian noise, horizontal flip, rotation, translation & scaling | KL-Divergence & Dice loss | z-Score normalization, centre cropping to 512 × 512 (kidney), 256 × 256 (brain), 640 × 640 (prostate) | None | Deep ensemble (averaging) |
| Hoang Long Le; Jin Tae Kwak | DeepLabv3 & EfficientNet (latter for classifying pancreas existence). 9 binary ground truths from thresholding. | Gaussian noise, horizontal & vertical flip, rotation, shift, scaling, blur, random brightness | Dice loss | Normalization to [0, 255] | Averaging annotations | Multiplying binary segmentation map with threshold value & taking per pixel maximum. |
| Ishaan Bhat; Hugo J. Kuijf | Probabilistic U-Net with MC Dropout. | Random flip, rotation, brightness & contrast | Cross-entropy & KL-Divergence | z-Score normalization, resizing to 256 × 256 (brain), 512 × 512 (kidney), 512 × 512 (pancreas) | None | Deep ensemble (averaging) |
| Jiachen Zhao | U-Net | Random flip | Dice loss | Resizing to 256 × 256 | Averaging annotations | None |
| Jimut Bahan Pal | Multiple U-Nets (one per annotation). | None | Focal Tversky | None | None | None |
| João Lourenço Silva; Arlindo L. Oliveira | U-Net with EfficientNet-B0 encoder. | Rotation, horizontal & vertical flip, translation & zoom | Cross-entropy | None | Averaging annotations | None |
| Lawrence Schobs | Multiple nnU-Nets (one per annotator, 2D for pancreas, 3D for other). | Gaussian noise, rotation, scaling, mirroring & inhomogeneity | Dice loss | Resizing prostate images to 640 × 640. Image sampling and normalization. | Averaging annotations | None |
| Martin Žukovec; Lara Dular, Žiga Špiclin | nnU-Net. Multi-task training approach with labels 0–N (N annotators + background). | Gaussian noise, rotation, scaling, mirroring & inhomogeneity | Dice loss | Image sampling and normalization | Addition of segmentations | None |
| Sabri Can Cetindag; Mert Yergin, Deniz Alis, Ilkay Oksuz | nnU-Net, training one U-Net per annotator, then adding segmentation map output as extra channels. | None | Cross-entropy & Dice loss | None | Average of annotations for stage 2 | None |
| Xiong Wang; Shengbo Gao, Weifeng Hu, Xuan Pei | 2D: U-Net (one per annotator, obtained via label fusion), MaskNet, and DeepLab V3+. 3D: SegResNet. | None | Weighted focal & Dice loss | None | Label fusion for training individual models | Weighted combination |
| YingLin Zhang; Wei Wang, Ruiling Xi, Lingxi Zeng, Huiyan Lin | UNet, UNet++, and TransUNet. | Gaussian noise, horizontal & vertical flip, rotation, translation, zoom, brightness, sharpness, contrast, blur & elastic deformation | Multi-level Dice loss | Center-cropping images to ROI. Discarding images with unclear ROI | Majority voting, cumulative division & even division of segmentations | Weighted combination |
| Yingbin Bai; Maoying Qiao, Dadong Wang, Tongliang Liu | UNet++ with EfficientNet-B7 encoder. | None | Multi-level Dice loss | None | Weighted combination of individual segmentation maps | None |

Table 7: Details of the participating teams’ methods in QUBIQ-Challenge-2021 (part 2).
| Lead Author & Team Members | Method, Architecture & Modifications | Data Augmentation | Loss Function | Pre-processing | Label Processing | Ensemble strategy |
|---|---|---|---|---|---|---|
| Dewen Zeng; Yukun Ding, Yiyu Shi | 2D U-Net with multiple loss functions (one each for individual annotators, aggregated label & multi-scale threshold). | None | Cross-entropy | z-Score normalization, resizing | Averaging annotations | None |
| Yanwu Yang; Xutao Guo, Yiwei Pan, Pengcheng Shi, Haiyan Lv, Ting Ma | 2D U-Net with one decoder per annotator and Layer Norm and skip-connections. | None | Cross-entropy & Dice loss on individual decoders. A cross loss between different decoders & an auxiliary loss between average prediction & average ground truth | z-Score normalization, resizing | Averaging different labels | Deep ensemble (averaging) |
| Stephan Huschauer | High-Resolution Network (HRNet) with stem layers replaced by 2D wavelet scattering transformation. | None | None | Resizing to 512 × 512 | Averaging annotations | Deep ensemble (averaging) |
| Xiaofeng Liu; Fangxu Xing, Georges El Fakhri, Jonghye Woo | Variational Inference encoding multi-annotator variability with a latent variable model. | None | Cross-entropy & L2 reconstruction loss | None | Averaging annotations | Deep ensemble (averaging) |
| Yucong Chen; Guanqi He, Zhitong Gao, Xuming He | 2D U-Net with multiple decoders, one for each annotator. | Random cropping (training), sliding window (inference) | None | None | None | Deep ensemble (averaging) |
| Mohammad Eslami; Farzin Soleymani, Anirudh Ashok, Bernd Bischl, Mina Rezaei | Uncertainty-aware progressive GAN. Encoder modeled using 2D U-Net & Patch Discriminators from Pix2Pix | None | Multi-stage GAN loss and Soft Dice loss | Intensity values normalized between 0-255 | Averaging annotations | None |
| Timothy Sum Hon Mun; Simon J Doran, Paul Huang, Christina Messiou, Matthew D Blackledge | 2D U-Net with Monte Carlo dropout. | None | Dice loss | None | None | None |

References

  • Armato III et al. (2011) Armato III, S. G., McLennan, G., Bidaut, L., McNitt-Gray, M. F., Meyer, C. R., Reeves, A. P., Zhao, B., Aberle, D. R., Henschke, C. I., Hoffman, E. A. et al. (2011). The lung image database consortium (lidc) and image database resource initiative (idri): a completed reference database of lung nodules on ct scans. Medical physics, 38, 915–931.
  • Bakas et al. (2019) Bakas, S., Reyes, M., Jakab, A., Bauer, S., Rempfler, M., Crimi, A., Shinohara, R. T., Berger, C., Ha, S. M., Rozycki, M., Prastawa, M., Alberts, E., Lipkova, J., Freymann, J., Kirby, J., Bilello, M., Fathallah-Shaykh, H., Wiest, R., Kirschke, J., Wiestler, B., Colen, R., Kotrotsou, A., Lamontagne, P., Marcus, D., Milchenko, M., Nazeri, A., Weber, M.-A., Mahajan, A., Baid, U., Gerstner, E., Kwon, D., Acharya, G., Agarwal, M., Alam, M., Albiol, A., Albiol, A., Albiol, F. J., Alex, V., Allinson, N., Amorim, P. H. A., Amrutkar, A., Anand, G., Andermatt, S., Arbel, T., Arbelaez, P., Avery, A., Azmat, M., B., P., Bai, W., Banerjee, S., Barth, B., Batchelder, T., Batmanghelich, K., Battistella, E., Beers, A., Belyaev, M., Bendszus, M., Benson, E., Bernal, J., Bharath, H. N., Biros, G., Bisdas, S., Brown, J., Cabezas, M., Cao, S., Cardoso, J. M., Carver, E. N., Casamitjana, A., Castillo, L. S., Catà, M., Cattin, P., Cerigues, A., Chagas, V. S., Chandra, S., Chang, Y.-J., Chang, S., Chang, K., Chazalon, J., Chen, S., Chen, W., Chen, J. W., Chen, Z., Cheng, K., Choudhury, A. R., Chylla, R., Clérigues, A., Colleman, S., Colmeiro, R. G. R., Combalia, M., Costa, A., Cui, X., Dai, Z., Dai, L., Daza, L. A., Deutsch, E., Ding, C., Dong, C., Dong, S., Dudzik, W. et al. (2019). Identifying the best machine learning algorithms for brain tumor segmentation, progression assessment, and overall survival prediction in the brats challenge.
  • Baumgartner et al. (2019) Baumgartner, C. F., Tezcan, K. C., Chaitanya, K., Hötker, A. M., Muehlematter, U. J., Schawkat, K., Becker, A. S., Donati, O., & Konukoglu, E. (2019). Phiseg: Capturing uncertainty in medical image segmentation. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2019: 22nd International Conference, Shenzhen, China, October 13–17, 2019, Proceedings, Part II 22 (pp. 119–127). Springer.
  • Calisto & Lai-Yuen (2020) Calisto, M. B., & Lai-Yuen, S. K. (2020). Adaen-net: An ensemble of adaptive 2d–3d fully convolutional networks for medical image segmentation. Neural Networks, 126, 76–94.
  • Feng et al. (2020) Feng, X., Tustison, N. J., Patel, S. H., & Meyer, C. H. (2020). Brain tumor segmentation using an ensemble of 3d u-nets and overall survival prediction using radiomic features. Frontiers in computational neuroscience, 14, 25.
  • Ilg et al. (2018) Ilg, E., Cicek, O., Galesso, S., Klein, A., Makansi, O., Hutter, F., & Brox, T. (2018). Uncertainty estimates and multi-hypotheses networks for optical flow. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 652–667).
  • Isensee et al. (2019) Isensee, F., Kickingereder, P., Wick, W., Bendszus, M., & Maier-Hein, K. H. (2019). No new-net. In Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries: 4th International Workshop, BrainLes 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 16, 2018, Revised Selected Papers, Part II 4 (pp. 234–244). Springer.
  • Joskowicz et al. (2019) Joskowicz, L., Cohen, D., Caplan, N., & Sosna, J. (2019). Inter-observer variability of manual contour delineation of structures in ct. European radiology, 29, 1391–1399.
  • Kofler et al. (2020) Kofler, F., Berger, C., Waldmannstetter, D., Lipkova, J., Ezhov, I., Tetteh, G., Kirschke, J., Zimmer, C., Wiestler, B., & Menze, B. H. (2020). Brats toolkit: translating brats brain tumor segmentation algorithms into clinical and scientific practice. Frontiers in Neuroscience, 14, 125.
  • Kofler et al. (2021a) Kofler, F., Ezhov, I., Fidon, L., Pirkl, C. M., Paetzold, J. C., Burian, E., Pati, S., El Husseini, M., Navarro, F., Shit, S. et al. (2021a). Robust, primitive, and unsupervised quality estimation for segmentation ensembles. Frontiers in Neuroscience, 15, 752780.
  • Kofler et al. (2021b) Kofler, F., Ezhov, I., Isensee, F., Balsiger, F., Berger, C., Koerner, M., Paetzold, J., Li, H., Shit, S., McKinley, R. et al. (2021b). Are we using appropriate segmentation metrics? identifying correlates of human expert perception for cnn training beyond rolling the dice coefficient. arXiv preprint arXiv:2103.06205.
  • Kofler et al. (2023) Kofler, F., Wahle, J., Ezhov, I., Wagner, S. J., Al-Maskari, R., Gryska, E., Todorov, M., Bukas, C., Meissen, F., Peng, T. et al. (2023). Approaching peak ground truth. In 2023 IEEE 20th International Symposium on Biomedical Imaging (ISBI) (pp. 1–6). IEEE.
  • Kohl et al. (2018) Kohl, S., Romera-Paredes, B., Meyer, C., De Fauw, J., Ledsam, J. R., Maier-Hein, K., Eslami, S., Jimenez Rezende, D., & Ronneberger, O. (2018). A probabilistic u-net for segmentation of ambiguous images. Advances in neural information processing systems, 31.
  • Kwon et al. (2020) Kwon, Y., Won, J.-H., Kim, B. J., & Paik, M. C. (2020). Uncertainty quantification using bayesian neural networks in classification: Application to biomedical image segmentation. Computational Statistics & Data Analysis, 142, 106816.
  • Langerak et al. (2010) Langerak, T. R., van der Heide, U. A., Kotte, A. N., Viergever, M. A., Van Vulpen, M., & Pluim, J. P. (2010). Label fusion in atlas-based segmentation using a selective and iterative method for performance level estimation (simple). IEEE transactions on medical imaging, 29, 2000–2008.
  • Lazarus et al. (2006) Lazarus, E., Mainiero, M. B., Schepps, B., Koelliker, S. L., & Livingston, L. S. (2006). Bi-rads lexicon for us and mammography: interobserver variability and positive predictive value. Radiology, 239, 385–391.
  • Lê et al. (2016) Lê, M., Unkelbach, J., Ayache, N., & Delingette, H. (2016). Sampling image segmentations for uncertainty quantification. Medical image analysis, 34, 42–51.
  • Litjens et al. (2012) Litjens, G., Debats, O., van de Ven, W., Karssemeijer, N., & Huisman, H. (2012). A pattern recognition approach to zonal segmentation of the prostate on mri. In International Conference on Medical Image Computing and Computer-Assisted Intervention (pp. 413–420). Springer.
  • McKinley et al. (2019) McKinley, R., Meier, R., & Wiest, R. (2019). Ensembles of densely-connected cnns with label-uncertainty for brain tumor segmentation. In Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries: 4th International Workshop, BrainLes 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 16, 2018, Revised Selected Papers, Part II (pp. 456–465). Springer.
  • McKinley et al. (2020) McKinley, R., Rebsamen, M., Meier, R., & Wiest, R. (2020). Triplanar ensemble of 3d-to-2d cnns with label-uncertainty for brain tumor segmentation. In Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries: 5th International Workshop, BrainLes 2019, Held in Conjunction with MICCAI 2019, Shenzhen, China, October 17, 2019, Revised Selected Papers, Part I (pp. 379–387). Springer.
  • Mehta et al. (2020) Mehta, R., Filos, A., Gal, Y., & Arbel, T. (2020). Uncertainty evaluation metric for brain tumour segmentation. arXiv preprint arXiv:2005.14262.
  • Monteiro et al. (2020) Monteiro, M., Le Folgoc, L., Coelho de Castro, D., Pawlowski, N., Marques, B., Kamnitsas, K., van der Wilk, M., & Glocker, B. (2020). Stochastic segmentation networks: Modelling spatially correlated aleatoric uncertainty. Advances in Neural Information Processing Systems, 33, 12756–12767.
  • Nair et al. (2020) Nair, T., Precup, D., Arnold, D. L., & Arbel, T. (2020). Exploring uncertainty measures in deep networks for multiple sclerosis lesion detection and segmentation. Medical image analysis, 59, 101557.
  • Roy et al. (2019) Roy, A. G., Conjeti, S., Navab, N., Wachinger, C., & Alzheimer's Disease Neuroimaging Initiative (2019). Bayesian quicknat: Model uncertainty in deep whole-brain segmentation for structure-wise quality control. NeuroImage, 195, 11–22.
  • Sabuncu et al. (2010) Sabuncu, M. R., Yeo, B. T., Van Leemput, K., Fischl, B., & Golland, P. (2010). A generative model for image segmentation based on label fusion. IEEE transactions on medical imaging, 29, 1714–1729.
  • Styner et al. (2008) Styner, M., Lee, J., Chin, B., Chin, M., Commowick, O., Tran, H., Markovic-Plese, S., Jewells, V., & Warfield, S. (2008). 3d segmentation in the clinic: A grand challenge ii: Ms lesion segmentation. Midas Journal, 2008, 1–6.
  • Wang et al. (2019) Wang, G., Li, W., Aertsen, M., Deprest, J., Ourselin, S., & Vercauteren, T. (2019). Aleatoric uncertainty estimation with test-time augmentation for medical image segmentation with convolutional neural networks. Neurocomputing, 338, 34–45.
  • Watadani et al. (2013) Watadani, T., Sakai, F., Johkoh, T., Noma, S., Akira, M., Fujimoto, K., Bankier, A. A., Lee, K. S., Müller, N. L., Song, J.-W. et al. (2013). Interobserver variability in the ct assessment of honeycombing in the lungs. Radiology, 266, 936–944.
  • Zhao et al. (2019) Zhao, Y.-X., Zhang, Y.-M., Song, M., & Liu, C.-L. (2019). Multi-view semi-supervised 3d whole brain segmentation with a self-ensemble network. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2019: 22nd International Conference, Shenzhen, China, October 13–17, 2019, Proceedings, Part III (pp. 256–265). Springer.