QUBIQ: Uncertainty Quantification for Biomedical Image Segmentation Challenge

Hongwei Bran Li Fernando Navarro Ivan Ezhov Amirhossein Bayat Dhritiman Das Florian Kofler Suprosanna Shit Diana Waldmannstetter Johannes C. Paetzold Xiaobin Hu Benedikt Wiestler Lucas Zimmer Tamaz Amiranashvili Chinmay Prabhakar Christoph Berger Jonas Weidner Michelle Alonso-Basanta Arif Rashid Ujjwal Baid Wesam Adel Deniz Alis Bhakti Baheti Yingbin Bai Ishaan Bhat Sabri Can Cetindag Wenting Chen Li Cheng Prasad Dutande Lara Dular Mustafa A. Elattar Ming Feng Shengbo Gao Henkjan Huisman Weifeng Hu Shubham Innani Wei Ji Davood Karimi Hugo J. Kuijf Jin Tae Kwak Hoang Long Le Xiang Li Huiyan Lin Tongliang Liu Jun Ma Kai Ma Ting Ma Ilkay Oksuz Robbie Holland Arlindo L. Oliveira Jimut Bahan Pal Xuan Pei Maoying Qiao Anindo Saha Raghavendra Selvan Linlin Shen Joao Lourenco Silva Ziga Spiclin Sanjay Talbar Dadong Wang Wei Wang Xiong Wang Yin Wang Ruiling Xi Kele Xu Yanwu Yang Mert Yergin Shuang Yu Lingxi Zeng YingLin Zhang Jiachen Zhao Yefeng Zheng Martin Zukovec Richard Do Anton Becker Amber Simpson Ender Konukoglu Andras Jakab Spyridon Bakas Leo Joskowicz Bjoern Menze Department of Informatics, Technical University of Munich, Germany. Athinoula A. Martinos Center for Biomedical Imaging, Massachusetts General Hospital, Harvard Medical School, USA. Department of Quantitative Biomedicine, University of Zurich, Switzerland. University Children’s Hospital Zurich, University of Zurich, Switzerland. Department of Radiooncology and Radiation Therapy, Klinikum rechts der Isar, Technical University of Munich, Germany. Department of Information Technology and Electrical Engineering, ETH-Zurich, Switzerland. 
Department of Radiology, Memorial Sloan Kettering Cancer Center in New York City, USA Department of Biomedical and Molecular Sciences, Queen’s University, Canada TranslaTUM - Central Institute for Translational Cancer Research, Technical University of Munich, Germany McGovern Institute, Massachusetts Institute of Technology, USA Institute for Diagnostic and Interventional Radiology, University Hospital Zurich, Switzerland. BioMedIA, Imperial College London, United Kingdom. Department of Radiation Oncology, University of Pennsylvania, PA, USA University of Pennsylvania, PA, USA Department of Radiation Oncology, Winship Cancer Institute of Emory University, Georgia, USA Nile University, Cairo, Egypt Department of Medical Sciences, Acibadem University, Istanbul, Turkey Shri Guru Gobind Singhji Institute of Engineering and Technology, Nanded, Maharashtra, India Trustworthy Machine Learning Lab, University of Sydney, Australia Image Sciences Institute, University Medical Center Utrecht, The Netherlands Computer Engineering Department, Istanbul Technical University, Istanbul, Turkey School of Computer Science, Shenzhen University, Shenzhen, China University of Alberta, Canada University of Ljubljana, Faculty of Electrical Engineering, Ljubljana, Slovenia Tongji University, Shanghai, China OPPO Research Institute, Shanghai, China School of Biological and Medical Engineering, Beihang University, Beijing, China Harvard Medical School, Boston, USA School of Electrical Engineering, Korea University, Seoul, Korea Department of Computer Science and Engineering, Sejong University, Seoul, Korea Harbin Institute of Technology, China Southern University of Science and Technology, China Department of Electronic and Information Engineering, Harbin Institute of Technology at Shenzhen, China Peng Cheng Lab, Shenzhen, China Advanced Innovation Center for Human Brain Protection, Capital Medical University, Beijing, China National Clinical Research Center for Geriatric Disorders, Xuanwu 
Hospital Capital Medical University, Beijing, China Instituto Superior Tecnico / INESC-ID, Portugal Department of Computer Science, Ramakrishna Mission Vivekananda Educational and Research Institute, India Australian Catholic University, Australia Commonwealth Scientific and Industrial Research Organisation (CSIRO), Australia University of Copenhagen, Denmark National Key Lab of Parallel and Distributed Processing, Changsha, China National University of Defense Technology, Changsha, China Hevi AI, Istanbul, Turkey Department of Computer Science and Engineering, Hongkong University of Science and Technology, China Tencent Healthcare (Shenzhen) Co., Ltd, China Department of Mathematics, Nanjing University of Science and Technology, China Diagnostic Image Analysis Group, Radboud University Medical Center, Nijmegen, The Netherlands The Rachel and Selim Benin School of Computer Science and Engineering, The Hebrew University of Jerusalem, Israel Helmholtz AI, Helmholtz Zentrum München, Germany TranslaTUM - Central Institute for Translational Cancer Research, Technical University of Munich, Germany Department of Diagnostic and Interventional Neuroradiology, School of Medicine, Klinikum rechts der Isar, Technical University of Munich, Germany

Abstract

Uncertainty in medical image segmentation tasks, especially inter-rater variability arising from differences in interpretations and annotations by various experts, presents a significant challenge in achieving consistent and reliable image segmentation. This variability not only reflects the inherent complexity and subjective nature of medical image interpretation but also directly impacts the development and evaluation of automated segmentation algorithms. Accurately modeling and quantifying this variability is essential for enhancing the robustness and clinical applicability of these algorithms. We report the set-up and summarize the benchmark results of the Quantification of Uncertainties in Biomedical Image Quantification Challenge (QUBIQ), which was organized in conjunction with the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI) in 2020 and 2021. The challenge focuses on uncertainty quantification in medical image segmentation, taking into account the omnipresence of inter-rater variability in imaging datasets. The large collection of images with multi-rater annotations features various modalities such as MRI and CT; various organs such as the brain, prostate, kidney, and pancreas; and both 2D and 3D image dimensions. A total of 24 teams submitted different solutions to the problem, combining various baseline models, Bayesian neural networks, and ensemble model techniques. The obtained results indicate the importance of ensemble models, as well as the need for further research to develop efficient methods for uncertainty quantification in 3D segmentation tasks.


1 Introduction

Background

The segmentation of anatomical structures and pathologies in medical images frequently encounters substantial inter-rater variability (Lazarus et al., 2006; Watadani et al., 2013), which in turn significantly impacts downstream supervised-learning tasks and clinical decision-making processes. This variability becomes especially pronounced in the context of medical imaging, where manual annotations are often limited and costly to acquire (Kofler et al., 2023). A notable example of this challenge is the segmentation of liver lesions in CT scans, which is inherently complex even for experienced experts, due to the variability in lesion location, contrast, and size among different patients (Joskowicz et al., 2019). The variability in manual delineations is extensive across a wide spectrum of structures, pathologies, and observers, as shown in Figure 1. The involvement of only two or three observers may be inadequate to capture the full breadth of potential variability in the outlines of the targeted structures. This variability, intrinsic to the biological problem, the imaging modality, and the expertise of the annotators, has not yet been adequately addressed in the design of computerized algorithms for medical image quantification (Kofler et al., 2021b).

Figure 1: Visualisation of the multi-rater segmentation masks on brain and prostate MRI scans and their derived uncertainty map.

Uncertainty quantification

Current methods for modeling uncertainty in predicted image segmentations primarily stem from general statistical model considerations, ensemble approaches involving resampling of training datasets, and aggregating multiple segmentation results, or systematic modifications to the predictive algorithm, as seen in techniques like Monte Carlo (MC) dropout. Yet, the exact delineation of segmented structures within an image inherently carries uncertainty, which is both task-specific and dependent on the dataset. Importantly, this uncertainty can be directly extrapolated from annotations made by multiple human experts. To our knowledge, there are currently no datasets available specifically for evaluating the accuracy of probabilistic model predictions against such multi-expert ground truths. Furthermore, there is a lack of consensus on which uncertainty quantification procedures yield realistic estimates and which do not.
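As a minimal sketch of how uncertainty might be derived from multiple expert annotations, the snippet below averages binary rater masks into a soft label and computes a per-pixel binary entropy map. The function name `rater_agreement_maps`, the toy masks, and the choice of entropy as the uncertainty measure are illustrative assumptions, not the procedure used by the challenge organizers.

```python
import numpy as np

def rater_agreement_maps(masks, eps=1e-12):
    """Return (soft_label, entropy) from binary masks of shape (R, H, W).

    soft_label: fraction of raters that marked each pixel as foreground.
    entropy: binary entropy (in bits) of that fraction; an assumed,
             illustrative measure of inter-rater disagreement.
    """
    masks = np.asarray(masks, dtype=float)
    soft_label = masks.mean(axis=0)
    # Clip to avoid log(0) where all raters agree.
    p = np.clip(soft_label, eps, 1.0 - eps)
    entropy = -(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p))
    return soft_label, entropy

# Toy example: three hypothetical raters annotating a 2 x 3 image.
masks = np.array([
    [[0, 1, 1], [0, 1, 1]],
    [[0, 0, 1], [0, 1, 1]],
    [[0, 0, 1], [0, 0, 1]],
])
soft_label, entropy = rater_agreement_maps(masks)
print(soft_label)
print(entropy.round(3))
```

Pixels on which all raters agree get an entropy close to zero, while pixels marked by one or two of the three raters get the highest values.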

Objective

The primary goal of the challenge is to establish a benchmark for algorithms that generate uncertainty estimates (such as probability scores and variability regions) in medical imaging segmentation tasks. The focus is to compare these algorithmic outputs against the uncertainties ascribed by human annotators in the local delineation of structures across various biomedical imaging segmentation tasks. These tasks include, but are not limited to, the segmentation of lesions (such as brain, pancreas, or prostate tumors) and anatomical structures (like brain, kidney, prostate, and pancreas). Multiple expert annotations have been gathered for several CT and MR image datasets to quantify boundary delineation variability.

Contributions

In an effort to assess the latest methods in uncertainty quantification for medical image segmentation, we organized the Quantification of Uncertainties in Biomedical Image Quantification Challenge (QUBIQ) at MICCAI-2020 and MICCAI-2021. This paper makes three major contributions to this field. First, we introduce a new, publicly available multi-rater, multi-center, multi-modality dataset that includes both 2D and 3D segmentation tasks. Second, we present the setup and summarize the findings of our QUBIQ uncertainty quantification benchmarks held at two grand challenges. Finally, we review, evaluate, rank, and analyze the state-of-the-art algorithms that emerged from these benchmarks.

2 Prior Work on Approaches and Datasets

2.1 Prior work.

There is a body of literature that models uncertainty and inter-rater variability in biomedical image segmentation (Lê et al., 2016; Sabuncu et al., 2010; Kwon et al., 2020; Roy et al., 2019; Ilg et al., 2018). Some of the prior methods directly extract uncertainty estimates from trained models, either by augmenting the input image (Wang et al., 2019) or by generating multiple potential segmentations using MC dropout (Nair et al., 2020). Others build ensembles, either by generating multiple parallel predictions (Ilg et al., 2018) or by running multiple models in parallel (Calisto & Lai-Yuen, 2020). Kofler et al. (2021a) extend this further by creating an ensemble of multiple approaches from the literature, together with a system that alerts the user if there is low segmentation agreement within the ensemble. In contrast, others explicitly model inter-rater uncertainty. In the Probabilistic U-Net, Kohl et al. (2018) use variational inference to learn a prior distribution of variability, from which they sample plausible segmentations, while Baumgartner et al. (2019) extend this to a hierarchical model capable of modeling uncertainty at different levels of abstraction within the U-Net architecture. Monteiro et al. (2020) explicitly model uncertainty by learning a low-rank pixel-wise covariance matrix.

2.2 Publicly available datasets.

Table 1 lists available datasets for the uncertainty quantification task. Most of the datasets feature multi-rater labeling. Each focuses on a particular pathological or healthy anatomy segmentation task and therefore contains either 2D or 3D images from a single modality, CT or MRI. In contrast, the QUBIQ challenge offers a dataset composed of multiple tasks covering both image dimensions and both imaging modalities.

Dataset	Modality	Target	2D	3D	#Images	multi-rater
LIDC-IDRI (Armato III et al., 2011)	CT	Lung nodule	✓	✗	1,018	✓
MICCAI-2012 (Litjens et al., 2012)	MRI	Prostate	✓	✗	48	✓
ISBI-2015 (Styner et al., 2008)	MRI	MS lesion	✗	✓	21	✓
BraTS (Mehta et al., 2020)	MRI	brain tumor	✗	✓	335	✗
QUBIQ	CT,MRI	six tasks	✓	✓	**	✓

Table 1: Overview of publicly available medical datasets for uncertainty quantification in image segmentation tasks. (to be updated)

3 QUBIQ challenge

3.1 QUBIQ datasets

3.1.1 Dataset creation.

For the adult glioma segmentation task, we employ three label sets. The first label set is the original label from the BraTS adult glioma segmentation challenge (Bakas et al., 2019). Additionally, we use two algorithm-based labels obtained from BraTS Toolkit (Kofler et al., 2020). To generate these, we first produce five algorithmic glioma segmentations (Isensee et al., 2019; McKinley et al., 2019; Feng et al., 2020; Zhao et al., 2019; McKinley et al., 2020). Subsequently, we fuse these using basic majority voting and SIMPLE fusion (Langerak et al., 2010).
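The majority-voting step can be illustrated with a short sketch. It assumes each algorithm outputs a binary mask for a single label and that a voxel is kept when more than half of the algorithms mark it; the function name `majority_vote` and the toy inputs are hypothetical, and this is not the BraTS Toolkit implementation. SIMPLE fusion, which iteratively re-weights the candidate segmentations, is not sketched here.

```python
import numpy as np

def majority_vote(segmentations):
    """Fuse N binary segmentations of identical shape by strict majority."""
    stack = np.asarray(segmentations).astype(bool)
    votes = stack.sum(axis=0)            # number of algorithms voting foreground
    return votes * 2 > stack.shape[0]    # foreground if more than half agree

# Toy example: five hypothetical algorithm outputs on four voxels.
candidates = [
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [1, 0, 0, 0],
]
fused = majority_vote(candidates)
print(fused.astype(int))
```

With five candidates, a voxel needs at least three votes to be kept, so ties cannot occur.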

Dataset	Modality	[2020,2021]	2D	3D	#Images	#Tasks	Source
Prostate segmentation	MRI	[✓,✓]	✓	✗	55	2	ETH Zürich
Brain growth segmentation	MRI	[✓,✓]	✓	✗	39	1	University of Zürich
Brain tumor segmentation	multimodal MRI	[✓,✓]	✓	✗	32	3	University of Pennsylvania
Kidney segmentation	CT	[✓,✓]	✓	✗	24	1	Technical University of Munich
Pancreas segmentation	CT	[✗,✓]	✗	✓	38	1	University of Pennsylvania
Pancreatic lesion segmentation	CT	[✗,✓]	✗	✓	21	1	University of Pennsylvania

Table 2: Overview of QUBIQ datasets and the sub-tasks

3.2 Evaluation metrics and ranking

For the evaluation, each participant had to segment the given binary structures and predict the distribution of the experts’ labels by returning a single mask with continuous values between 0 and 1, intended to reproduce the average of the experts’ segmentations.

Predictions are compared with continuous ground truth labels, which are obtained by averaging the multiple experts’ annotations. To this end, both the ground truth and the prediction are binarized at nine probability levels (0.1, 0.2, …, 0.8, 0.9), and the volumetric overlap of the resulting binary volumes is calculated using the Dice score. The Dice scores for all thresholds are then averaged.
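The thresholded comparison can be sketched as follows. This is a minimal illustration under stated assumptions rather than the official evaluation script: binarization uses `>=`, the levels run from 0.1 to 0.9 in steps of 0.1, and two empty masks are counted as perfect agreement. The function names `dice` and `averaged_threshold_dice` are hypothetical.

```python
import numpy as np

def dice(a, b):
    """Dice overlap of two binary arrays."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0  # assumption: two empty masks agree perfectly
    return 2.0 * np.logical_and(a, b).sum() / denom

def averaged_threshold_dice(pred, target, levels=np.arange(1, 10) / 10):
    """Binarize prediction and soft ground truth at each level, average the Dice."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.mean([dice(pred >= t, target >= t) for t in levels]))

# Soft ground truth from three hypothetical raters on four pixels.
target = np.array([0.0, 1 / 3, 2 / 3, 1.0])
hard_pred = np.array([0.0, 0.0, 1.0, 1.0])   # a prediction that ignores uncertainty

print(averaged_threshold_dice(target, target))
print(averaged_threshold_dice(hard_pred, target))
```

In this toy case a prediction that matches the soft label exactly scores 1.0, while the hard mask loses overlap at the low and high levels, which is the behaviour the averaged score is meant to reward.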

The Q-Dice, a staged Dice score, is used to quantify the quality of the predicted probability map $p$ against the ground truth $y$ in $L$ discrete probability levels. It relies on the level indicator $T_{L}(p,l)$, which tests whether a probability value $p$ falls into level $l$:

$$T_{L}(p,l)=\begin{dcases}\mathbbm{1}\left\{\frac{l}{L}\leq p<\frac{l+1}{L}\right\}, & \text{if } 0\leq l<L-1\\ \mathbbm{1}\left\{\frac{l}{L}\leq p\leq\frac{l+1}{L}\right\}, & \text{if } l=L-1\end{dcases} \tag{1}$$

Compared to the original Dice score, the Q-Dice quantifies uncertainty by comparing the prediction and ground truth maps at different confidence levels. Since in most cases experts agree on most parts of the annotations, the variation of the Q-Dice across levels indicates how well the prediction models the uncertainty at the borders of the structure of interest.
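The level indicator $T_{L}(p,l)$ of Eq. (1) can be written as a small function. The sketch below is illustrative only: the name `level_indicator` and the toy values are assumptions, and it covers just the indicator, not the complete Q-Dice.

```python
import numpy as np

def level_indicator(p, l, L):
    """T_L(p, l): True where p lies in level l of L equal-width levels.

    Levels are half-open [l/L, (l+1)/L), except the last one, which is
    closed so that p = 1 is assigned to level L-1.
    """
    p = np.asarray(p, dtype=float)
    lower, upper = l / L, (l + 1) / L
    if l < L - 1:
        return (lower <= p) & (p < upper)
    return (lower <= p) & (p <= upper)

# Toy example with L = 4 levels.
p = np.array([0.0, 0.25, 0.6, 1.0])
membership = np.stack([level_indicator(p, l, 4) for l in range(4)])
print(membership.astype(int))
```

Because the last level is closed on the right, every probability in [0, 1] falls into exactly one level.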

3.3 Challenge events

The QUBIQ challenge was organized within the MICCAI conference using the Grand Challenge platform. Tables 5, 6, and 7 describe the participating algorithms across the two iterations of the QUBIQ challenge (QUBIQ2020 and QUBIQ2021). Fig. 2 quantitatively compares the algorithms over the two iterations of the challenge.

3.4 Results

In Tables 3 and 4, we show the leaderboard for both iterations of the challenge.

Team	Brain-growth	Brain-tumor Task 1	Brain-tumor Task 2	Brain-tumor Task 3	Kidney	Prostate Task 1	Prostate Task 2	Average Ranking	Average Dice
Jun_Ma	0.921	0.936	0.809	0.822	0.310	0.970	0.918	7.857	0.812
Yanwu_Yang	0.893	0.917	0.699	0.836	0.825	0.937	0.878	7.143	0.855
Macaroon	0.878	0.848	0.528	0.690	0.238	0.937	0.890	5	0.715
Raghavendra_Selvan 2	0.885	0.899	0.617	0.682	0.695	0.883	0.800	4.857	0.780
Raghavendra_Selvan	0.907	0.874	0.602	0.690	0.639	0.858	0.780	4.714	0.764
Wei_Ji	0.900	0.755	0.323	0.605	0.915	0.941	0.845	4.714	0.755
Xiang_Li	0.865	0.931	0.513	0.556	0.903	0.914	0.872	4.714	0.793
Ujjwal_Baid	0.840	0.782	0.406	0.568	0.956	0.891	0.702	3.143	0.735
Maykol_Campos	0.849	0.799	0.522	0.613	0.805	0.838	0.630	2.857	0.722
anysys99	0.818	0.893	0.485	0.724	-	0.890	0.804	-	-
Davood_Karimi	0.874	0.900	0.452	-	0.785	0.947	0.897	-	-

Table 3: Results of QUBIQ 2020, ordered by ranking score; the top three teams are listed first. Only teams participating in all tasks are considered for the overall ranking.
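As a quick arithmetic check, the Average Dice column of Table 3 appears to be the unweighted mean of the seven per-task scores. The snippet below recomputes it for two rows copied from the table. This reading is an assumption: for other rows the recomputed value can differ in the last digit, presumably because the displayed per-task scores are rounded. The derivation of the Average Ranking score is not described in the text and is not reproduced here.

```python
from statistics import mean

# Per-task scores copied from Table 3, in column order.
table3_rows = {
    "Jun_Ma": [0.921, 0.936, 0.809, 0.822, 0.310, 0.970, 0.918],
    "Yanwu_Yang": [0.893, 0.917, 0.699, 0.836, 0.825, 0.937, 0.878],
}

# Unweighted mean over the seven tasks, rounded to three decimals.
average_dice = {team: round(mean(scores), 3) for team, scores in table3_rows.items()}
print(average_dice)
```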

Team	Brain-growth	Brain-tumor Task 1	Brain-tumor Task 2	Brain-tumor Task 3	Kidney	Prostate Task 1	Prostate Task 2	Pancreas	Pancreatic Lesion	Average Ranking	Average Dice
Peng-Cheng_Shi	0.929	0.938	0.819	0.847	0.954	0.969	0.920	0.550	0.272	11.111	0.800
Yingbin_Bai	0.915	0.928	0.793	0.815	0.940	0.968	0.920	0.579	0.205	9.333	0.785
Lara_Dular	0.928	0.938	0.820	0.899	0.467	0.958	0.909	0.499	0.283	8.778	0.745
Lawrence_Schobs	0.300	0.939	0.780	0.798	0.503	0.969	0.915	0.683	0.330	8.556	0.691
Sabrican_Cetindag	0.928	0.932	0.769	0.883	0.839	0.964	0.922	0.409	0.231	8.444	0.764
Yucong_Chen	0.916	0.927	0.775	0.840	0.952	0.952	0.907	0.572	0.130	7.889	0.775
Hoang_Long_Le	0.912	0.899	0.680	0.754	0.706	0.971	0.927	0.575	0.246	7.444	0.741
Dewen_Zeng	0.927	0.940	0.695	0.835	0.894	0.947	0.911	0.423	0.126	7.111	0.744
Joao_Lourenco_Silva	0.931	0.929	0.750	0.797	0.511	0.968	0.920	0.075	0.068	6.333	0.661
Anindo_Saha	0.892	0.917	0.695	0.740	0.950	0.936	0.859	0.546	0.194	5.222	0.748
Wang_Xiong	0.893	0.905	0.589	0.784	0.930	0.916	0.862	0.557	0.204	5.222	0.738
Ishaan_Rajesh	0.892	0.919	0.638	0.704	0.858	0.861	0.799	0.316	0.122	3.111	0.679
Stephan_Huschauer	0.719	0.865	0.525	0.551	0.856	0.911	0.842	0.423	0.118	2.444	0.646
Jiachen_Zhao	0.873	0.844	0.547	0.787	0.835	0.931	0.884	-	-	-	-
Jimut_Bahan_Pal	0.869	0.842	0.456	0.690	0.769	0.833	0.781	-	-	-	-
Mohammad_Eslami	0.848	0.404	0.377	0.236	0.716	0.883	0.816	-	-	-	-
Shengbo_Gao	0.802	0.885	0.627	0.661	0.910	-	-	0.557	0.130	-	-
Xiaofeng_Liu	0.800	-	-	-	-	-	-	-	-	-	-
Timothy_S	0.780	-	-	-	-	-	-	-	-	-	-

Table 4: Results of QUBIQ 2021, ordered by ranking score; the top three teams are listed first. Only teams participating in all tasks are considered for the overall ranking.

4 Conclusion

In this paper, we report on the results of the QUBIQ challenge (Quantification of Uncertainties in Biomedical Image Quantification Challenge), which was organized in conjunction with the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI). Quantifying uncertainty in medical imaging is paramount for image analysis, as inter-rater variability is omnipresent in imaging datasets. Such quantification could reduce barriers to adopting learnable algorithms into clinical practice. With the QUBIQ challenge, we aim to fill a gap among medical imaging challenges, which are dominated by competitions in deterministic segmentation that ignore the importance of uncertainty prediction.

Acknowledgement

The research is supported through the SFB 824, subproject B12, as well as by Deutsche Forschungsgemeinschaft (DFG) via TUM International Graduate School of Science and Engineering (IGSSE), GSC 81. We acknowledge support by the Helmut Horten Foundation and by the Translational Brain Imaging Training Network (TRABIT) under the EU ‘Horizon 2020’ research & innovation program (Grant agreement ID: 765148). Research reported in this publication was partly supported by the National Institutes of Health (NIH) under award numbers NIH/NCI:U01CA242871 and NIH/NINDS:R01NS042645.

Table 5: Details of the participating teams’ methods in QUBIQ-Challenge-2020.

Lead Author & Team Members
Method, Architecture & Modifications
Data Augmentation
Loss Function
Pre-processing
Label Processing
Ensemble strategy

Jun Ma
Multiple 2D U-Nets (one per annotator).
None
Cross-entropy & Dice loss
None
Averaging

Davood Karimi
2D U-Net with additional connections between coarse and fine feature layers in the encoder. Dynamic loss weighting for harder classes. Multi-task training approach.
None
1-Dice similarity loss
Zero mean, unit variance standardization.
Averaging annotations
None

Ming Feng; Kele Xu, Yin Wang
2D U-Net trained with ground truth & predictions binarized at different levels, averaging Dice score of each prediction.
Random scaling [0.9,1.1]
Weighted cross-entropy & Dice loss
Resizing to 256 × 256 (brain tumor), 512 × 512 (kidney & prostate). Normalization to [0, 255].
Averaging annotations
None

Raghavendra Selvan
Multi-channel U-Net, one channel for each rater. Based on concept of Normalizing Flows.
None
Planar Flow & Dice loss
None

Ujjwal Baid; Prasad Dutande, Shubham Innani, Bhakti Baheti, Sanjay Talbar
ResNet34 based encoder-decoder. Different annotations included as individual copies in training set.
Rotation, flip & scaling
None
Resizing to 256 × 256 (brain), 512 × 512 (kidney), 640 × 640 (prostate).
None

Wesam Adel; Mustafa A. Elattar
2D U-Net trained with averaged annotations and as a regression problem.
None
Weighted KL-divergence
Resizing brain to 256 × 256 with rotation and elastic deformation.
None

Xiang Li
U-Net with attention, 4x downsampling for Kidney and 5x for others.
None
Weighted cross-entropy
Cropping to 128 × 128 (kidney), 416 × 416 (prostate)
Averaging annotations
None

Yanwu Yang; Ting Ma
2D U-Net with multiple branches. Instance Norm instead of Batch Norm. One model per annotation integrated using auxiliary loss.
None
Cross-entropy & Dice loss
MRI: z-score normalization. CT: centering on ROI and rescaling to [0,1]
None
Averaging

Wei Ji; Wenting Chen, Shuang Yu, Kai Ma, Li Cheng, Linlin Shen, Yefeng Zheng
U-Net with Resnet-34 encoder. One output channel per label and one model per annotation integrated using auxiliary loss.
None
Cross-entropy
Resizing to 512 × 512
Both fused final & individual labels. Combining labels via averaging, random sampling & label sampling.
Weighted average

Table 6: Details of the participating teams’ methods in QUBIQ-Challenge-2021 (part 1).

Lead Author & Team Members
Method, Architecture & Modifications
Data Augmentation
Loss Function
Pre-processing
Label Processing
Ensemble strategy

Anindo Saha; Henkjan Huisman
Probabilistic U-Net with MC Dropout.
Gaussian noise, horizontal flip, rotation, translation & scaling
KL-Divergence & Dice loss
z-Score normalization, centre cropping to 512 × 512 (kidney), 256 × 256 (brain), 640 × 640 (prostate)
None
Deep ensemble (averaging)

Hoang Long Le; Jin Tae Kwak
DeepLabv3 & EfficientNet (latter for classifying pancreas existence). 9 binary ground truths from thresholding.
Gaussian noise, horizontal & vertical flip, rotation, shift, scaling, blur, random brightness
Dice loss
Normalization to [0, 255]
Averaging annotations
Multiplying binary segmentation map with threshold value & taking per pixel maximum.

Ishaan Bhat; Hugo J. Kuijf
Probabilistic U-Net with MC Dropout.
Random flip, rotation, brightness & contrast
Cross-entropy & KL-Divergence
z-Score normalization, resizing to 256 × 256 (brain), 512 × 512 (kidney), 512 × 512 (pancreas)
None
Deep ensemble (averaging)

Jiachen Zhao
U-Net
Random flip
Dice loss
Resizing to 256 × 256
Averaging annotations
None

Jimut Bahan Pal
Multiple U-Nets (one per annotation).
None
Focal Tversky
None

João Lourenço Silva; Arlindo L. Oliveira
U-Net with EfficientNet-B0 encoder.
Rotation, horizontal & vertical flip, translation & zoom
Cross-entropy
None
Averaging annotations
None

Lawrence Schobs
Multiple nnU-Nets (one per annotator, 2D for pancreas, 3D for other).
Gaussian noise, rotation, scaling, mirroring & inhomogeneity
Dice loss
Resizing prostate images to 640 × 640. Image sampling and normalization.
Averaging annotations
None

Martin Žukovec; Lara Dular, Žiga Špiclin
nnU-Net. Multi-task training approach with labels 0–N (N annotators + background).
Gaussian noise, rotation, scaling, mirroring & inhomogeneity
Dice loss
Image sampling and normalization
Addition of segmentations
None

Sabri Can Cetindag; Mert Yergin, Deniz Alis, Ilkay Oksuz
nnU-Net, training one U-Net per annotator, then adding segmentation map output as extra channels.
None
Cross-entropy & Dice loss
None
Average of annotations for stage 2
None

Xiong Wang; Shengbo Gao, Weifeng Hu, Xuan Pei
2D: U-Net (one per annotator, obtained via label fusion), MaskNet, and DeepLab V3+. 3D: SegResNet.
None
Weighted focal & Dice loss
None
Label fusion for training individual models
Weighted combination

YingLin Zhang; Wei Wang, Ruiling Xi, Lingxi Zeng, Huiyan Lin
UNet, UNet++, and TransUNet.
Gaussian noise, horizontal & vertical flip, rotation, translation, zoom, brightness, sharpness, contrast, blur & elastic deformation
Multi-level Dice loss
Center-cropping images to ROI. Discarding images with unclear ROI
Majority voting, cumulative division & even division of segmentations
Weighted combination

Yingbin Bai; Maoying Qiao, Dadong Wang, Tongliang Liu
UNet++ with EfficientNet-B7 encoder.
None
Multi-level Dice loss
None
Weighted combination of individual segmentation maps
None
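One ensemble strategy in Table 6, from the entry of Hoang Long Le and Jin Tae Kwak, is described as multiplying each binary segmentation map by its threshold value and taking the per-pixel maximum. The sketch below is one possible reading of that description, not the team's code: the function name `fuse_threshold_maps` is hypothetical, and the binary maps are simulated by thresholding a toy soft label instead of coming from trained networks.

```python
import numpy as np

def fuse_threshold_maps(binary_maps, thresholds):
    """Combine one binary map per threshold into a single continuous map.

    Each binary map is scaled by its threshold value; the per-pixel
    maximum is the highest threshold at which the pixel is foreground.
    """
    binary_maps = np.asarray(binary_maps, dtype=float)      # shape (T, ...)
    thresholds = np.asarray(thresholds, dtype=float)
    # Reshape thresholds to (T, 1, ..., 1) so they broadcast over pixels.
    weights = thresholds.reshape((-1,) + (1,) * (binary_maps.ndim - 1))
    return (binary_maps * weights).max(axis=0)

# Toy stand-in for nine network outputs, one per threshold 0.1 to 0.9.
thresholds = np.arange(1, 10) / 10
soft = np.array([0.0, 0.35, 0.95])
binary_maps = np.stack([soft >= t for t in thresholds])

fused = fuse_threshold_maps(binary_maps, thresholds)
print(fused)
```

Under this reading the fused map takes values on the same 0.1 grid that is used to binarize predictions for evaluation.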

Table 7: Details of the participating teams’ methods in QUBIQ-Challenge-2021 (part 2).

Lead Author & Team Members
Method, Architecture & Modifications
Data Augmentation
Loss Function
Pre-processing
Label Processing
Ensemble strategy

Dewen Zeng; Yukun Ding, Yiyu Shi
2D U-Net with multiple loss functions (one each for individual annotators, aggregated label & multi-scale threshold).
None
Cross-entropy
z-Score normalization, resizing
Averaging annotations
None

Yanwu Yang; Xutao Guo, Yiwei Pan, Pengcheng Shi, Haiyan Lv, Ting Ma
2D U-Net with one decoder per annotator and Layer Norm and skip-connections.
None
Cross-entropy & Dice loss on individual decoders. A cross loss between different decoders & an auxiliary loss between average prediction & average ground truth
z-Score normalization, resizing
Averaging different labels
Deep ensemble (averaging)

Stephan Huschauer
High-Resolution Network (HRNet) with stem layers replaced by 2D wavelet scattering transformation.
None
Resizing to 512 × 512
Averaging annotations
Deep ensemble (averaging)

Xiaofeng Liu; Fangxu Xing, Georges El Fakhri, Jonghye Woo
Variational Inference encoding multi-annotator variability with a latent variable model.
None
Cross-entropy & L2 reconstruction loss
None
Averaging annotations
Deep ensemble (averaging)

Yucong Chen; Guanqi He, Zhitong Gao, Xuming He
2D U-Net with multiple decoders, one for each annotator.
Random cropping (training), sliding window (inference)
None
Deep ensemble (averaging)

Mohammad Eslami; Farzin Soleymani, Anirudh Ashok, Bernd Bischl, Mina Rezaei
Uncertainty-aware progressive GAN. Encoder modeled using 2D U-Net & Patch Discriminators from Pix2Pix
None
Multi-stage GAN loss and Soft Dice loss
Intensity values normalized between 0-255
Averaging annotations
None

Timothy Sum Hon Mun; Simon J Doran, Paul Huang, Christina Messiou, Matthew D Blackledge
2D U-Net with Monte Carlo dropout.
None
Dice loss
None

References

Armato III et al. (2011) Armato III, S. G., McLennan, G., Bidaut, L., McNitt-Gray, M. F., Meyer, C. R., Reeves, A. P., Zhao, B., Aberle, D. R., Henschke, C. I., Hoffman, E. A. et al. (2011). The lung image database consortium (lidc) and image database resource initiative (idri): a completed reference database of lung nodules on ct scans. Medical physics, 38, 915–931.
Bakas et al. (2019) Bakas, S., Reyes, M., Jakab, A., Bauer, S., Rempfler, M., Crimi, A., Shinohara, R. T., Berger, C., Ha, S. M., Rozycki, M., Prastawa, M., Alberts, E., Lipkova, J., Freymann, J., Kirby, J., Bilello, M., Fathallah-Shaykh, H., Wiest, R., Kirschke, J., Wiestler, B., Colen, R., Kotrotsou, A., Lamontagne, P., Marcus, D., Milchenko, M., Nazeri, A., Weber, M.-A., Mahajan, A., Baid, U., Gerstner, E., Kwon, D., Acharya, G., Agarwal, M., Alam, M., Albiol, A., Albiol, A., Albiol, F. J., Alex, V., Allinson, N., Amorim, P. H. A., Amrutkar, A., Anand, G., Andermatt, S., Arbel, T., Arbelaez, P., Avery, A., Azmat, M., B., P., Bai, W., Banerjee, S., Barth, B., Batchelder, T., Batmanghelich, K., Battistella, E., Beers, A., Belyaev, M., Bendszus, M., Benson, E., Bernal, J., Bharath, H. N., Biros, G., Bisdas, S., Brown, J., Cabezas, M., Cao, S., Cardoso, J. M., Carver, E. N., Casamitjana, A., Castillo, L. S., Catà, M., Cattin, P., Cerigues, A., Chagas, V. S., Chandra, S., Chang, Y.-J., Chang, S., Chang, K., Chazalon, J., Chen, S., Chen, W., Chen, J. W., Chen, Z., Cheng, K., Choudhury, A. R., Chylla, R., Clérigues, A., Colleman, S., Colmeiro, R. G. R., Combalia, M., Costa, A., Cui, X., Dai, Z., Dai, L., Daza, L. A., Deutsch, E., Ding, C., Dong, C., Dong, S., Dudzik, W. et al. (2019). Identifying the best machine learning algorithms for brain tumor segmentation, progression assessment, and overall survival prediction in the brats challenge.
Baumgartner et al. (2019) Baumgartner, C. F., Tezcan, K. C., Chaitanya, K., Hötker, A. M., Muehlematter, U. J., Schawkat, K., Becker, A. S., Donati, O., & Konukoglu, E. (2019). Phiseg: Capturing uncertainty in medical image segmentation. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2019: 22nd International Conference, Shenzhen, China, October 13–17, 2019, Proceedings, Part II 22 (pp. 119–127). Springer.
Calisto & Lai-Yuen (2020) Calisto, M. B., & Lai-Yuen, S. K. (2020). Adaen-net: An ensemble of adaptive 2d–3d fully convolutional networks for medical image segmentation. Neural Networks, 126, 76–94.
Feng et al. (2020) Feng, X., Tustison, N. J., Patel, S. H., & Meyer, C. H. (2020). Brain tumor segmentation using an ensemble of 3d u-nets and overall survival prediction using radiomic features. Frontiers in computational neuroscience, 14, 25.
Ilg et al. (2018) Ilg, E., Cicek, O., Galesso, S., Klein, A., Makansi, O., Hutter, F., & Brox, T. (2018). Uncertainty estimates and multi-hypotheses networks for optical flow. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 652–667).
Isensee et al. (2019) Isensee, F., Kickingereder, P., Wick, W., Bendszus, M., & Maier-Hein, K. H. (2019). No new-net. In Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries: 4th International Workshop, BrainLes 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 16, 2018, Revised Selected Papers, Part II 4 (pp. 234–244). Springer.
Joskowicz et al. (2019) Joskowicz, L., Cohen, D., Caplan, N., & Sosna, J. (2019). Inter-observer variability of manual contour delineation of structures in ct. European radiology, 29, 1391–1399.
Kofler et al. (2020) Kofler, F., Berger, C., Waldmannstetter, D., Lipkova, J., Ezhov, I., Tetteh, G., Kirschke, J., Zimmer, C., Wiestler, B., & Menze, B. H. (2020). Brats toolkit: translating brats brain tumor segmentation algorithms into clinical and scientific practice. Frontiers in neuroscience (p. 125).
Kofler et al. (2021a) Kofler, F., Ezhov, I., Fidon, L., Pirkl, C. M., Paetzold, J. C., Burian, E., Pati, S., El Husseini, M., Navarro, F., Shit, S. et al. (2021a). Robust, primitive, and unsupervised quality estimation for segmentation ensembles. Frontiers in Neuroscience, 15, 752780.
Kofler et al. (2021b) Kofler, F., Ezhov, I., Isensee, F., Balsiger, F., Berger, C., Koerner, M., Paetzold, J., Li, H., Shit, S., McKinley, R. et al. (2021b). Are we using appropriate segmentation metrics? Identifying correlates of human expert perception for CNN training beyond rolling the Dice coefficient. arXiv preprint arXiv:2103.06205.
Kofler et al. (2023) Kofler, F., Wahle, J., Ezhov, I., Wagner, S. J., Al-Maskari, R., Gryska, E., Todorov, M., Bukas, C., Meissen, F., Peng, T. et al. (2023). Approaching peak ground truth. In 2023 IEEE 20th International Symposium on Biomedical Imaging (ISBI) (pp. 1–6). IEEE.
Kohl et al. (2018) Kohl, S., Romera-Paredes, B., Meyer, C., De Fauw, J., Ledsam, J. R., Maier-Hein, K., Eslami, S., Jimenez Rezende, D., & Ronneberger, O. (2018). A probabilistic U-Net for segmentation of ambiguous images. Advances in neural information processing systems, 31.
Kwon et al. (2020) Kwon, Y., Won, J.-H., Kim, B. J., & Paik, M. C. (2020). Uncertainty quantification using Bayesian neural networks in classification: Application to biomedical image segmentation. Computational Statistics & Data Analysis, 142, 106816.
Langerak et al. (2010) Langerak, T. R., van der Heide, U. A., Kotte, A. N., Viergever, M. A., Van Vulpen, M., & Pluim, J. P. (2010). Label fusion in atlas-based segmentation using a selective and iterative method for performance level estimation (SIMPLE). IEEE transactions on medical imaging, 29, 2000–2008.
Lazarus et al. (2006) Lazarus, E., Mainiero, M. B., Schepps, B., Koelliker, S. L., & Livingston, L. S. (2006). BI-RADS lexicon for US and mammography: interobserver variability and positive predictive value. Radiology, 239, 385–391.
Lê et al. (2016) Lê, M., Unkelbach, J., Ayache, N., & Delingette, H. (2016). Sampling image segmentations for uncertainty quantification. Medical image analysis, 34, 42–51.
Litjens et al. (2012) Litjens, G., Debats, O., van de Ven, W., Karssemeijer, N., & Huisman, H. (2012). A pattern recognition approach to zonal segmentation of the prostate on MRI. In International Conference on Medical Image Computing and Computer-Assisted Intervention (pp. 413–420). Springer.
McKinley et al. (2019) McKinley, R., Meier, R., & Wiest, R. (2019). Ensembles of densely-connected CNNs with label-uncertainty for brain tumor segmentation. In Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries: 4th International Workshop, BrainLes 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 16, 2018, Revised Selected Papers, Part II 4 (pp. 456–465). Springer.
McKinley et al. (2020) McKinley, R., Rebsamen, M., Meier, R., & Wiest, R. (2020). Triplanar ensemble of 3D-to-2D CNNs with label-uncertainty for brain tumor segmentation. In Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries: 5th International Workshop, BrainLes 2019, Held in Conjunction with MICCAI 2019, Shenzhen, China, October 17, 2019, Revised Selected Papers, Part I 5 (pp. 379–387). Springer.
Mehta et al. (2020) Mehta, R., Filos, A., Gal, Y., & Arbel, T. (2020). Uncertainty evaluation metric for brain tumour segmentation. arXiv preprint arXiv:2005.14262.
Monteiro et al. (2020) Monteiro, M., Le Folgoc, L., Coelho de Castro, D., Pawlowski, N., Marques, B., Kamnitsas, K., van der Wilk, M., & Glocker, B. (2020). Stochastic segmentation networks: Modelling spatially correlated aleatoric uncertainty. Advances in Neural Information Processing Systems, 33, 12756–12767.
Nair et al. (2020) Nair, T., Precup, D., Arnold, D. L., & Arbel, T. (2020). Exploring uncertainty measures in deep networks for multiple sclerosis lesion detection and segmentation. Medical image analysis, 59, 101557.
Roy et al. (2019) Roy, A. G., Conjeti, S., Navab, N., Wachinger, C., & Alzheimer's Disease Neuroimaging Initiative (2019). Bayesian QuickNAT: Model uncertainty in deep whole-brain segmentation for structure-wise quality control. NeuroImage, 195, 11–22.
Sabuncu et al. (2010) Sabuncu, M. R., Yeo, B. T., Van Leemput, K., Fischl, B., & Golland, P. (2010). A generative model for image segmentation based on label fusion. IEEE transactions on medical imaging, 29, 1714–1729.
Styner et al. (2008) Styner, M., Lee, J., Chin, B., Chin, M., Commowick, O., Tran, H., Markovic-Plese, S., Jewells, V., & Warfield, S. (2008). 3D segmentation in the clinic: A grand challenge II: MS lesion segmentation. Midas Journal, 2008, 1–6.
Wang et al. (2019) Wang, G., Li, W., Aertsen, M., Deprest, J., Ourselin, S., & Vercauteren, T. (2019). Aleatoric uncertainty estimation with test-time augmentation for medical image segmentation with convolutional neural networks. Neurocomputing, 338, 34–45.
Watadani et al. (2013) Watadani, T., Sakai, F., Johkoh, T., Noma, S., Akira, M., Fujimoto, K., Bankier, A. A., Lee, K. S., Müller, N. L., Song, J.-W. et al. (2013). Interobserver variability in the CT assessment of honeycombing in the lungs. Radiology, 266, 936–944.
Zhao et al. (2019) Zhao, Y.-X., Zhang, Y.-M., Song, M., & Liu, C.-L. (2019). Multi-view semi-supervised 3D whole brain segmentation with a self-ensemble network. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2019: 22nd International Conference, Shenzhen, China, October 13–17, 2019, Proceedings, Part III 22 (pp. 256–265). Springer.