Image and Video Processing

See recent articles

Total of 53 entries

Showing up to 2000 entries per page: fewer | more | all

[1] arXiv:2407.09507 [pdf, html, other]: Title: Can Generative AI Replace Immunofluorescent Staining Processes? A Comparison Study of Synthetically Generated CellPainting Images from Brightfield

Xiaodan Xing, Siofra Murdoch, Chunling Tang, Giorgos Papanastasiou, Jan Cross-Zamirski, Yunzhe Guo, Xianglu Xiao, Carola-Bibiane Schönlieb, Yinhai Wang, Guang Yang

Subjects: Image and Video Processing (eess.IV)

Cell imaging assays utilizing fluorescence stains are essential for observing sub-cellular organelles and their responses to perturbations. Immunofluorescent staining process is routinely in labs, however the recent innovations in generative AI is challenging the idea of IF staining are required. This is especially true when the availability and cost of specific fluorescence dyes is a problem to some labs. Furthermore, staining process takes time and leads to inter-intra technician and hinders downstream image and data analysis, and the reusability of image data for other projects. Recent studies showed the use of generated synthetic immunofluorescence (IF) images from brightfield (BF) images using generative AI algorithms in the literature. Therefore, in this study, we benchmark and compare five models from three types of IF generation backbones, CNN, GAN, and diffusion models, using a publicly available dataset. This paper not only serves as a comparative study to determine the best-performing model but also proposes a comprehensive analysis pipeline for evaluating the efficacy of generators in IF image synthesis. We highlighted the potential of deep learning-based generators for IF image synthesis, while also discussed potential issues and future research directions. Although generative AI shows promise in simplifying cell phenotyping using only BF images with IF staining, further research and validations are needed to address the key challenges of model generalisability, batch effects, feature relevance and computational costs.
[2] arXiv:2407.09515 [pdf, html, other]: Title: Rethinking Knee Osteoarthritis Severity Grading: A Few Shot Self-Supervised Contrastive Learning Approach

Niamh Belton, Misgina Tsighe Hagos, Aonghus Lawlor, Kathleen M. Curran

Journal-ref: Thirty-seventh Conference on Neural Information Processing Systems Workshop on Medical Imaging meets NeurIPS 2023

Subjects: Image and Video Processing (eess.IV); Machine Learning (cs.LG)

Knee Osteoarthritis (OA) is a debilitating disease affecting over 250 million people worldwide. Currently, radiologists grade the severity of OA on an ordinal scale from zero to four using the Kellgren-Lawrence (KL) system. Recent studies have raised concern in relation to the subjectivity of the KL grading system, highlighting the requirement for an automated system, while also indicating that five ordinal classes may not be the most appropriate approach for assessing OA severity. This work presents preliminary results of an automated system with a continuous grading scale. This system, namely SS-FewSOME, uses self-supervised pre-training to learn robust representations of the features of healthy knee X-rays. It then assesses the OA severity by the X-rays' distance to the normal representation space. SS-FewSOME initially trains on only 'few' examples of healthy knee X-rays, thus reducing the barriers to clinical implementation by eliminating the need for large training sets and costly expert annotations that existing automated systems require. The work reports promising initial results, obtaining a positive Spearman Rank Correlation Coefficient of 0.43, having had access to only 30 ground truth labels at training time.
[3] arXiv:2407.09534 [pdf, html, other]: Title: DFS-based fast crack detection

Vsevolod Chernyshev, Vitalii Makogin, Duc Nguyen, Evgeny Spodarev

Subjects: Image and Video Processing (eess.IV)

In this paper, we propose an fast method for crack detection in 3D computed tomography (CT) images. Our approach combines the Maximal Hessian Entry filter and a Deep-First Search algorithm-based technique to strike a balance between computational complexity and accuracy. Experimental results demonstrate the effectiveness of our approach in detecting the crack structure with predefined misclassification probability.
[4] arXiv:2407.09540 [pdf, html, other]: Title: Prompting Whole Slide Image Based Genetic Biomarker Prediction

Ling Zhang, Boxiang Yun, Xingran Xie, Qingli Li, Xinxing Li, Yan Wang

Comments: 11 pages, 3 figures, MICCAI2024

Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Prediction of genetic biomarkers, e.g., microsatellite instability and BRAF in colorectal cancer is crucial for clinical decision making. In this paper, we propose a whole slide image (WSI) based genetic biomarker prediction method via prompting techniques. Our work aims at addressing the following challenges: (1) extracting foreground instances related to genetic biomarkers from gigapixel WSIs, and (2) the interaction among the fine-grained pathological components in WSIs.Specifically, we leverage large language models to generate medical prompts that serve as prior knowledge in extracting instances associated with genetic biomarkers. We adopt a coarse-to-fine approach to mine biomarker information within the tumor microenvironment. This involves extracting instances related to genetic biomarkers using coarse medical prior knowledge, grouping pathology instances into fine-grained pathological components and mining their interactions. Experimental results on two colorectal cancer datasets show the superiority of our method, achieving 91.49% in AUC for MSI classification. The analysis further shows the clinical interpretability of our method. Code is publicly available at this https URL.
[5] arXiv:2407.09543 [pdf, html, other]: Title: Neural Texture Block Compression

Shin Fujieda, Takahiro Harada

Subjects: Image and Video Processing (eess.IV); Graphics (cs.GR)

Block compression is a widely used technique to compress textures in real-time graphics applications, offering a reduction in storage size. However, their storage efficiency is constrained by the fixed compression ratio, which substantially increases storage size when hundreds of high-quality textures are required. In this paper, we propose a novel block texture compression method with neural networks, Neural Texture Block Compression (NTBC). NTBC learns the mapping from uncompressed textures to block-compressed textures, which allows for significantly reduced storage costs without any change in the shaders.Our experiments show that NTBC can achieve reasonable-quality results with up to about 70% less storage footprint, preserving real-time performance with a modest computational overhead at the texture loading phase in the graphics pipeline.
[6] arXiv:2407.09561 [pdf, html, other]: Title: Integrating Deep Learning in Cardiology: A Comprehensive Review of Atrial Fibrillation, Left Atrial Scar Segmentation, and the Frontiers of State-of-the-Art Techniques

Malitha Gunawardhana, Anuradha Kulathilaka, Jichao Zhao

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Atrial fibrillation (AFib) is the prominent cardiac arrhythmia in the world. It affects mostly the elderly population, with potential consequences such as stroke and heart failure in the absence of necessary treatments as soon as possible. The importance of atrial scarring in the development and progression of AFib has gained recognition, positioning late gadolinium-enhanced magnetic resonance imaging (LGE-MRI) as a crucial technique for the non-invasive evaluation of atrial scar tissue. This review delves into the recent progress in segmenting atrial scars using LGE-MRIs, emphasizing the importance of precise scar measurement in the treatment and management of AFib. Initially, it provides a detailed examination of AFib. Subsequently, it explores the application of deep learning in this domain. The review culminates in a discussion of the latest research advancements in atrial scar segmentation using deep learning methods. By offering a thorough analysis of current technologies and their impact on AFib management strategies, this review highlights the integral role of deep learning in enhancing atrial scar segmentation and its implications for future therapeutic approaches.
[7] arXiv:2407.09828 [pdf, html, other]: Title: Enhancing Semantic Segmentation with Adaptive Focal Loss: A Novel Approach

Md Rakibul Islam, Riad Hassan, Abdullah Nazib, Kien Nguyen, Clinton Fookes, Md Zahidul Islam

Comments: 15 pages, 4 figures

Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Deep learning has achieved outstanding accuracy in medical image segmentation, particularly for objects like organs or tumors with smooth boundaries or large sizes. Whereas, it encounters significant difficulties with objects that have zigzag boundaries or are small in size, leading to a notable decrease in segmentation effectiveness. In this context, using a loss function that incorporates smoothness and volume information into a model's predictions offers a promising solution to these shortcomings. In this work, we introduce an Adaptive Focal Loss (A-FL) function designed to mitigate class imbalance by down-weighting the loss for easy examples that results in up-weighting the loss for hard examples and giving greater emphasis to challenging examples, such as small and irregularly shaped objects. The proposed A-FL involves dynamically adjusting a focusing parameter based on an object's surface smoothness, size information, and adjusting the class balancing parameter based on the ratio of targeted area to total area in an image. We evaluated the performance of the A-FL using ResNet50-encoded U-Net architecture on the Picai 2022 and BraTS 2018 datasets. On the Picai 2022 dataset, the A-FL achieved an Intersection over Union (IoU) of 0.696 and a Dice Similarity Coefficient (DSC) of 0.769, outperforming the regular Focal Loss (FL) by 5.5% and 5.4% respectively. It also surpassed the best baseline Dice-Focal by 2.0% and 1.2%. On the BraTS 2018 dataset, A-FL achieved an IoU of 0.883 and a DSC of 0.931. The comparative studies show that the proposed A-FL function surpasses conventional methods, including Dice Loss, Focal Loss, and their hybrid variants, in IoU, DSC, Sensitivity, and Specificity metrics. This work highlights A-FL's potential to improve deep learning models for segmenting clinically significant regions in medical images, leading to more precise and reliable diagnostic tools.
[8] arXiv:2407.09918 [pdf, html, other]: Title: DiffRect: Latent Diffusion Label Rectification for Semi-supervised Medical Image Segmentation

Xinyu Liu, Wuyang Li, Yixuan Yuan

Comments: MICCAI 2024

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Semi-supervised medical image segmentation aims to leverage limited annotated data and rich unlabeled data to perform accurate segmentation. However, existing semi-supervised methods are highly dependent on the quality of self-generated pseudo labels, which are prone to incorrect supervision and confirmation bias. Meanwhile, they are insufficient in capturing the label distributions in latent space and suffer from limited generalization to unlabeled data. To address these issues, we propose a Latent Diffusion Label Rectification Model (DiffRect) for semi-supervised medical image segmentation. DiffRect first utilizes a Label Context Calibration Module (LCC) to calibrate the biased relationship between classes by learning the category-wise correlation in pseudo labels, then apply Latent Feature Rectification Module (LFR) on the latent space to formulate and align the pseudo label distributions of different levels via latent diffusion. It utilizes a denoising network to learn the coarse to fine and fine to precise consecutive distribution transportations. We evaluate DiffRect on three public datasets: ACDC, MS-CMRSEG 2019, and Decathlon Prostate. Experimental results demonstrate the effectiveness of DiffRect, e.g. it achieves 82.40\% Dice score on ACDC with only 1\% labeled scan available, outperforms the previous state-of-the-art by 4.60\% in Dice, and even rivals fully supervised performance. Code is released at \url{this https URL}.
[9] arXiv:2407.09999 [pdf, html, other]: Title: Pay Less On Clinical Images: Asymmetric Multi-Modal Fusion Method For Efficient Multi-Label Skin Lesion Classification

Peng Tang, Tobias Lasser

Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Existing multi-modal approaches primarily focus on enhancing multi-label skin lesion classification performance through advanced fusion modules, often neglecting the associated rise in parameters. In clinical settings, both clinical and dermoscopy images are captured for diagnosis; however, dermoscopy images exhibit more crucial visual features for multi-label skin lesion classification. Motivated by this observation, we introduce a novel asymmetric multi-modal fusion method in this paper for efficient multi-label skin lesion classification. Our fusion method incorporates two innovative schemes. Firstly, we validate the effectiveness of our asymmetric fusion structure. It employs a light and simple network for clinical images and a heavier, more complex one for dermoscopy images, resulting in significant parameter savings compared to the symmetric fusion structure using two identical networks for both modalities. Secondly, in contrast to previous approaches using mutual attention modules for interaction between image modalities, we propose an asymmetric attention module. This module solely leverages clinical image information to enhance dermoscopy image features, considering clinical images as supplementary information in our pipeline. We conduct the extensive experiments on the seven-point checklist dataset. Results demonstrate the generality of our proposed method for both networks and Transformer structures, showcasing its superiority over existing methods We will make our code publicly available.
[10] arXiv:2407.10157 [pdf, html, other]: Title: SACNet: A Spatially Adaptive Convolution Network for 2D Multi-organ Medical Segmentation

Lin Zhang, Wenbo Gao, Jie Yi, Yunyun Yang

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Multi-organ segmentation in medical image analysis is crucial for diagnosis and treatment planning. However, many factors complicate the task, including variability in different target categories and interference from complex backgrounds. In this paper, we utilize the knowledge of Deformable Convolution V3 (DCNv3) and multi-object segmentation to optimize our Spatially Adaptive Convolution Network (SACNet) in three aspects: feature extraction, model architecture, and loss constraint, simultaneously enhancing the perception of different segmentation targets. Firstly, we propose the Adaptive Receptive Field Module (ARFM), which combines DCNv3 with a series of customized block-level and architecture-level designs similar to transformers. This module can capture the unique features of different organs by adaptively adjusting the receptive field according to various targets. Secondly, we utilize ARFM as building blocks to construct the encoder-decoder of SACNet and partially share parameters between the encoder and decoder, making the network wider rather than deeper. This design achieves a shared lightweight decoder and a more parameter-efficient and effective framework. Lastly, we propose a novel continuity dynamic adjustment loss function, based on t-vMF dice loss and cross-entropy loss, to better balance easy and complex classes in segmentation. Experiments on 3D slice datasets from ACDC and Synapse demonstrate that SACNet delivers superior segmentation performance in multi-organ segmentation tasks compared to several existing methods.
[11] arXiv:2407.10251 [pdf, other]: Title: Deep Learning Algorithms for Early Diagnosis of Acute Lymphoblastic Leukemia

Dimitris Papaioannou, Ioannis Christou, Nikos Anagnou, Aristotelis Chatziioannou

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Acute lymphoblastic leukemia (ALL) is a form of blood cancer that affects the white blood cells. ALL constitutes approximately 25% of pediatric cancers. Early diagnosis and treatment of ALL are crucial for improving patient outcomes. The task of identifying immature leukemic blasts from normal cells under the microscope can prove challenging, since the images of a healthy and cancerous cell appear similar morphologically. In this study, we propose a binary image classification model to assist in the diagnostic process of ALL. Our model takes as input microscopic images of blood samples and outputs a binary prediction of whether the sample is normal or cancerous. Our dataset consists of 10661 images out of 118 subjects. Deep learning techniques on convolutional neural network architectures were used to achieve accurate classification results. Our proposed method achieved 94.3% accuracy and could be used as an assisting tool for hematologists trying to predict the likelihood of a patient developing ALL.
[12] arXiv:2407.10325 [pdf, html, other]: Title: Light Field Compression Based on Implicit Neural Representation

Henan Wang, Hanxin Zhu, Zhibo Chen

Comments: PCS2022

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Light field, as a new data representation format in multimedia, has the ability to capture both intensity and direction of light rays. However, the additional angular information also brings a large volume of data. Classical coding methods are not effective to describe the relationship between different views, leading to redundancy left. To address this problem, we propose a novel light field compression scheme based on implicit neural representation to reduce redundancies between views. We store the information of a light field image implicitly in an neural network and adopt model compression methods to further compress the implicit representation. Extensive experiments have demonstrated the effectiveness of our proposed method, which achieves comparable rate-distortion performance as well as superior perceptual quality over traditional methods.
[13] arXiv:2407.10336 [pdf, html, other]: Title: Thyroidiomics: An Automated Pipeline for Segmentation and Classification of Thyroid Pathologies from Scintigraphy Images

Maziar Sabouri, Shadab Ahamed, Azin Asadzadeh, Atlas Haddadi Avval, Soroush Bagheri, Mohsen Arabi, Seyed Rasoul Zakavi, Emran Askari, Ali Rasouli, Atena Aghaee, Mohaddese Sehati, Fereshteh Yousefirizi, Carlos Uribe, Ghasem Hajianfar, Habib Zaidi, Arman Rahmim

Comments: 7 pages, 4 figures, 2 tables

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Medical Physics (physics.med-ph)

The objective of this study was to develop an automated pipeline that enhances thyroid disease classification using thyroid scintigraphy images, aiming to decrease assessment time and increase diagnostic accuracy. Anterior thyroid scintigraphy images from 2,643 patients were collected and categorized into diffuse goiter (DG), multinodal goiter (MNG), and thyroiditis (TH) based on clinical reports, and then segmented by an expert. A ResUNet model was trained to perform auto-segmentation. Radiomic features were extracted from both physician (scenario 1) and ResUNet segmentations (scenario 2), followed by omitting highly correlated features using Spearman's correlation, and feature selection using Recursive Feature Elimination (RFE) with XGBoost as the core. All models were trained under leave-one-center-out cross-validation (LOCOCV) scheme, where nine instances of algorithms were iteratively trained and validated on data from eight centers and tested on the ninth for both scenarios separately. Segmentation performance was assessed using the Dice similarity coefficient (DSC), while classification performance was assessed using metrics, such as precision, recall, F1-score, accuracy, area under the Receiver Operating Characteristic (ROC AUC), and area under the precision-recall curve (PRC AUC). ResUNet achieved DSC values of 0.84$\pm$0.03, 0.71$\pm$0.06, and 0.86$\pm$0.02 for MNG, TH, and DG, respectively. Classification in scenario 1 achieved an accuracy of 0.76$\pm$0.04 and a ROC AUC of 0.92$\pm$0.02 while in scenario 2, classification yielded an accuracy of 0.74$\pm$0.05 and a ROC AUC of 0.90$\pm$0.02. The automated pipeline demonstrated comparable performance to physician segmentations on several classification metrics across different classes, effectively reducing assessment time while maintaining high diagnostic accuracy. Code available at: this https URL.
[14] arXiv:2407.10377 [pdf, other]: Title: Enhanced Self-supervised Learning for Multi-modality MRI Segmentation and Classification: A Novel Approach Avoiding Model Collapse

Linxuan Han, Sa Xiao, Zimeng Li, Haidong Li, Xiuchao Zhao, Fumin Guo, Yeqing Han, Xin Zhou

Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Multi-modality magnetic resonance imaging (MRI) can provide complementary information for computer-aided diagnosis. Traditional deep learning algorithms are suitable for identifying specific anatomical structures segmenting lesions and classifying diseases with magnetic resonance images. However, manual labels are limited due to high expense, which hinders further improvement of model accuracy. Self-supervised learning (SSL) can effectively learn feature representations from unlabeled data by pre-training and is demonstrated to be effective in natural image analysis. Most SSL methods ignore the similarity of multi-modality MRI, leading to model collapse. This limits the efficiency of pre-training, causing low accuracy in downstream segmentation and classification tasks. To solve this challenge, we establish and validate a multi-modality MRI masked autoencoder consisting of hybrid mask pattern (HMP) and pyramid barlow twin (PBT) module for SSL on multi-modality MRI analysis. The HMP concatenates three masking steps forcing the SSL to learn the semantic connections of multi-modality images by reconstructing the masking patches. We have proved that the proposed HMP can avoid model collapse. The PBT module exploits the pyramidal hierarchy of the network to construct barlow twin loss between masked and original views, aligning the semantic representations of image patches at different vision scales in latent space. Experiments on BraTS2023, PI-CAI, and lung gas MRI datasets further demonstrate the superiority of our framework over the state-of-the-art. The performance of the segmentation and classification is substantially enhanced, supporting the accurate detection of small lesion areas. The code is available at this https URL.
[15] arXiv:2407.10427 [pdf, html, other]: Title: Transformer for Multitemporal Hyperspectral Image Unmixing

Hang Li, Qiankun Dong, Xueshuo Xie, Xia Xu, Tao Li, Zhenwei Shi

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Multitemporal hyperspectral image unmixing (MTHU) holds significant importance in monitoring and analyzing the dynamic changes of surface. However, compared to single-temporal unmixing, the multitemporal approach demands comprehensive consideration of information across different phases, rendering it a greater challenge. To address this challenge, we propose the Multitemporal Hyperspectral Image Unmixing Transformer (MUFormer), an end-to-end unsupervised deep learning model. To effectively perform multitemporal hyperspectral image unmixing, we introduce two key modules: the Global Awareness Module (GAM) and the Change Enhancement Module (CEM). The Global Awareness Module computes self-attention across all phases, facilitating global weight allocation. On the other hand, the Change Enhancement Module dynamically learns local temporal changes by comparing endmember changes between adjacent phases. The synergy between these modules allows for capturing semantic information regarding endmember and abundance changes, thereby enhancing the effectiveness of multitemporal hyperspectral image unmixing. We conducted experiments on one real dataset and two synthetic datasets, demonstrating that our model significantly enhances the effect of multitemporal hyperspectral image unmixing.
[16] arXiv:2407.10537 [pdf, html, other]: Title: Segmentation of Prostate Tumour Volumes from PET Images is a Different Ball Game

Shrajan Bhandary, Dejan Kuhn, Zahra Babaiee, Tobias Fechter, Simon K.B. Spohn, Constantinos Zamboglou, Anca-Ligia Grosu, Radu Grosu

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Accurate segmentation of prostate tumours from PET images presents a formidable challenge in medical image analysis. Despite considerable work and improvement in delineating organs from CT and MR modalities, the existing standards do not transfer well and produce quality results in PET related tasks. Particularly, contemporary methods fail to accurately consider the intensity-based scaling applied by the physicians during manual annotation of tumour contours. In this paper, we observe that the prostate-localised uptake threshold ranges are beneficial for suppressing outliers. Therefore, we utilize the intensity threshold values, to implement a new custom-feature-clipping normalisation technique. We evaluate multiple, established U-Net variants under different normalisation schemes, using the nnU-Net framework. All models were trained and tested on multiple datasets, obtained with two radioactive tracers: [68-Ga]Ga-PSMA-11 and [18-F]PSMA-1007. Our results show that the U-Net models achieve much better performance when the PET scans are preprocessed with our novel clipping technique.
[17] arXiv:2407.10630 [pdf, other]: Title: Brain Tumor Classification From MRI Images Using Machine Learning

Vidhyapriya Ranganathan, Celshiya Udaiyar, Jaisree Jayanth, Meghaa P V, Srija B, Uthra S

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Brain tumor is a life-threatening problem and hampers the normal functioning of the human body. The average five-year relative survival rate for malignant brain tumors is 35.6 percent. For proper diagnosis and efficient treatment planning, it is necessary to detect the brain tumor in early stages. Due to advancement in medical imaging technology, the brain images are taken in different modalities. The ability to extract relevant characteristics from magnetic resonance imaging (MRI) scans is a crucial step for brain tumor classifiers. Several studies have proposed various strategies to extract relevant features from different modalities of MRI to predict the growth of abnormal tumors. Most techniques used conventional methods of image processing for feature extraction and machine learning for classification. More recently, the use of deep learning algorithms in medical imaging has resulted in significant improvements in the classification and diagnosis of brain tumors. Since tumors are located at different regions of the brain, localizing the tumor and classifying it to a particular category is a challenging task. The objective of this project is to develop a predictive system for brain tumor detection using machine learning(ensembling).
[18] arXiv:2407.10632 [pdf, html, other]: Title: Bidirectional Stereo Image Compression with Cross-Dimensional Entropy Model

Zhening Liu, Xinjie Zhang, Jiawei Shao, Zehong Lin, Jun Zhang

Comments: ECCV 2024

Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

With the rapid advancement of stereo vision technologies, stereo image compression has emerged as a crucial field that continues to draw significant attention. Previous approaches have primarily employed a unidirectional paradigm, where the compression of one view is dependent on the other, resulting in imbalanced compression. To address this issue, we introduce a symmetric bidirectional stereo image compression architecture, named BiSIC. Specifically, we propose a 3D convolution based codec backbone to capture local features and incorporate bidirectional attention blocks to exploit global features. Moreover, we design a novel cross-dimensional entropy model that integrates various conditioning factors, including the spatial context, channel context, and stereo dependency, to effectively estimate the distribution of latent representations for entropy coding. Extensive experiments demonstrate that our proposed BiSIC outperforms conventional image/video compression standards, as well as state-of-the-art learning-based methods, in terms of both PSNR and MS-SSIM.
[19] arXiv:2407.10667 [pdf, html, other]: Title: DUCPS: Deep Unfolding the Cauchy Proximal Splitting Algorithm for B-Lines Quantification in Lung Ultrasound Images

Tianqi Yang, Oktay Karakuş, Nantheera Anantrasirichai, Marco Allinovi, Alin Achim

Comments: 16 pages, 6 figures, IEEE TMI

Subjects: Image and Video Processing (eess.IV)

The identification of artefacts, particularly B-lines, in lung ultrasound (LUS), is crucial for assisting clinical diagnosis, prompting the development of innovative methodologies. While the Cauchy proximal splitting (CPS) algorithm has demonstrated effective performance in B-line detection, the process is slow and has limited generalization. This paper addresses these issues with a novel unsupervised deep unfolding network structure (DUCPS). The framework utilizes deep unfolding procedures to merge traditional model-based techniques with deep learning approaches. By unfolding the CPS algorithm into a deep network, DUCPS enables the parameters in the optimization algorithm to be learnable, thus enhancing generalization performance and facilitating rapid convergence. We conducted entirely unsupervised training using the Neighbor2Neighbor (N2N) and the Structural Similarity Index Measure (SSIM) losses. When combined with an improved line identification method proposed in this paper, state-of-the-art performance is achieved, with the recall and F2 score reaching 0.70 and 0.64, respectively. Notably, DUCPS significantly improves computational efficiency eliminating the need for extensive data labeling, representing a notable advancement over both traditional algorithms and existing deep learning approaches.
[20] arXiv:2407.10785 [pdf, html, other]: Title: Interpretability analysis on a pathology foundation model reveals biologically relevant embeddings across modalities

Nhat Le, Ciyue Shen, Chintan Shah, Blake Martin, Daniel Shenker, Harshith Padigela, Jennifer Hipp, Sean Grullon, John Abel, Harsha Vardhan Pokkalla, Dinkar Juyal

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Mechanistic interpretability has been explored in detail for large language models (LLMs). For the first time, we provide a preliminary investigation with similar interpretability methods for medical imaging. Specifically, we analyze the features from a ViT-Small encoder obtained from a pathology Foundation Model via application to two datasets: one dataset of pathology images, and one dataset of pathology images paired with spatial transcriptomics. We discover an interpretable representation of cell and tissue morphology, along with gene expression within the model embedding space. Our work paves the way for further exploration around interpretable feature dimensions and their utility for medical and clinical applications.
[21] arXiv:2407.10796 [pdf, html, other]: Title: Mammographic Breast Positioning Assessment via Deep Learning

Toygar Tanyel, Nurper Denizoglu, Mustafa Ege Seker, Deniz Alis, Esma Cerekci, Ercan Karaarslan, Erkin Aribal, Ilkay Oksuz

Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Breast cancer remains a leading cause of cancer-related deaths among women worldwide, with mammography screening as the most effective method for the early detection. Ensuring proper positioning in mammography is critical, as poor positioning can lead to diagnostic errors, increased patient stress, and higher costs due to recalls. Despite advancements in deep learning (DL) for breast cancer diagnostics, limited focus has been given to evaluating mammography positioning. This paper introduces a novel DL methodology to quantitatively assess mammogram positioning quality, specifically in mediolateral oblique (MLO) views using attention and coordinate convolution modules. Our method identifies key anatomical landmarks, such as the nipple and pectoralis muscle, and automatically draws a posterior nipple line (PNL), offering robust and inherently explainable alternative to well-known classification and regression-based approaches. We compare the performance of proposed methodology with various regression and classification-based models. The CoordAtt UNet model achieved the highest accuracy of 88.63% $\pm$ 2.84 and specificity of 90.25% $\pm$ 4.04, along with a noteworthy sensitivity of 86.04% $\pm$ 3.41. In landmark detection, the same model also recorded the lowest mean errors in key anatomical points and the smallest angular error of 2.42 degrees. Our results indicate that models incorporating attention mechanisms and CoordConv module increase the accuracy in classifying breast positioning quality and detecting anatomical landmarks. Furthermore, we make the labels and source codes available to the community to initiate an open research area for mammography, accessible at this https URL.
[22] arXiv:2407.10833 [pdf, html, other]: Title: MoE-DiffIR: Task-customized Diffusion Priors for Universal Compressed Image Restoration

Yulin Ren, Xin Li, Bingchen Li, Xingrui Wang, Mengxi Guo, Shijie Zhao, Li Zhang, Zhibo Chen

Comments: Accepted by ECCV 2024

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

We present MoE-DiffIR, an innovative universal compressed image restoration (CIR) method with task-customized diffusion priors. This intends to handle two pivotal challenges in the existing CIR methods: (i) lacking adaptability and universality for different image codecs, e.g., JPEG and WebP; (ii) poor texture generation capability, particularly at low bitrates. Specifically, our MoE-DiffIR develops the powerful mixture-of-experts (MoE) prompt module, where some basic prompts cooperate to excavate the task-customized diffusion priors from Stable Diffusion (SD) for each compression task. Moreover, the degradation-aware routing mechanism is proposed to enable the flexible assignment of basic prompts. To activate and reuse the cross-modality generation prior of SD, we design the visual-to-text adapter for MoE-DiffIR, which aims to adapt the embedding of low-quality images from the visual domain to the textual domain as the textual guidance for SD, enabling more consistent and reasonable texture generation. We also construct one comprehensive benchmark dataset for universal CIR, covering 21 types of degradations from 7 popular traditional and learned codecs. Extensive experiments on universal CIR have demonstrated the excellent robustness and texture restoration capability of our proposed MoE-DiffIR. The project can be found at this https URL.
[23] arXiv:2407.10856 [pdf, html, other]: Title: Physics-Inspired Generative Models in Medical Imaging: A Review

Dennis Hein, Afshin Bozorgpour, Dorit Merhof, Ge Wang

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)

Physics-inspired generative models, in particular diffusion and Poisson flow models, enhance Bayesian methods and promise great utilities in medical imaging. This review examines the transformative role of such generative methods. First, a variety of physics-inspired generative models, including Denoising Diffusion Probabilistic Models (DDPM), Score-based Diffusion Models, and Poisson Flow Generative Models (PFGM and PFGM++), are revisited, with an emphasis on their accuracy, robustness as well as acceleration. Then, major applications of physics-inspired generative models in medical imaging are presented, comprising image reconstruction, image generation, and image analysis. Finally, future research directions are brainstormed, including unification of physics-inspired generative models, integration with vision-language models (VLMs),and potential novel applications of generative models. Since the development of generative methods has been rapid, this review will hopefully give peers and learners a timely snapshot of this new family of physics-driven generative models and help capitalize their enormous potential for medical imaging.
[24] arXiv:2407.10888 [pdf, html, other]: Title: Leveraging Multimodal CycleGAN for the Generation of Anatomically Accurate Synthetic CT Scans from MRIs

Leonardo Crespi, Samuele Camnasio, Damiano Dei, Nicola Lambri, Pietro Mancosu, Marta Scorsetti, Daniele Loiacono

Comments: Currently submitted to: Scientific Reports

Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

In many clinical settings, the use of both Computed Tomography (CT) and Magnetic Resonance (MRI) is necessary to pursue a thorough understanding of the patient's anatomy and to plan a suitable therapeutical strategy; this is often the case in MRI-based radiotherapy, where CT is always necessary to prepare the dose delivery, as it provides the essential information about the radiation absorption properties of the tissues. Sometimes, MRI is preferred to contour the target volumes. However, this approach is often not the most efficient, as it is more expensive, time-consuming and, most importantly, stressful for the patients. To overcome this issue, in this work, we analyse the capabilities of different configurations of Deep Learning models to generate synthetic CT scans from MRI, leveraging the power of Generative Adversarial Networks (GANs) and, in particular, the CycleGAN architecture, capable of working in an unsupervised manner and without paired images, which were not available. Several CycleGAN models were trained unsupervised to generate CT scans from different MRI modalities with and without contrast agents. To overcome the problem of not having a ground truth, distribution-based metrics were used to assess the model's performance quantitatively, together with a qualitative evaluation where physicians were asked to differentiate between real and synthetic images to understand how realistic the generated images were. The results show how, depending on the input modalities, the models can have very different performances; however, models with the best quantitative results, according to the distribution-based metrics used, can generate very difficult images to distinguish from the real ones, even for physicians, demonstrating the approach's potential.
[25] arXiv:2407.10921 [pdf, html, other]: Title: A Dual-Attention Aware Deep Convolutional Neural Network for Early Alzheimer's Detection

Pandiyaraju V, Shravan Venkatraman, Abeshek A, Aravintakshan S A, Pavan Kumar S, Kannan A

Comments: 18 pages, 10 figures, 6 tables

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Alzheimer's disease (AD) represents the primary form of neurodegeneration, impacting millions of individuals each year and causing progressive cognitive decline. Accurately diagnosing and classifying AD using neuroimaging data presents ongoing challenges in medicine, necessitating advanced interventions that will enhance treatment measures. In this research, we introduce a dual attention enhanced deep learning (DL) framework for classifying AD from neuroimaging data. Combined spatial and self-attention mechanisms play a vital role in emphasizing focus on neurofibrillary tangles and amyloid plaques from the MRI images, which are difficult to discern with regular imaging techniques. Results demonstrate that our model yielded remarkable performance in comparison to existing state of the art (SOTA) convolutional neural networks (CNNs), with an accuracy of 99.1%. Moreover, it recorded remarkable metrics, with an F1-Score of 99.31%, a precision of 99.24%, and a recall of 99.5%. These results highlight the promise of cutting edge DL methods in medical diagnostics, contributing to highly reliable and more efficient healthcare solutions.
[26] arXiv:2407.10926 [pdf, html, other]: Title: In-Loop Filtering via Trained Look-Up Tables

Zhuoyuan Li, Jiacheng Li, Yao Li, Li Li, Dong Liu, Feng Wu

Comments: 11 pages, 6 figures

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

In-loop filtering (ILF) is a key technology for removing the artifacts in image/video coding standards. Recently, neural network-based in-loop filtering methods achieve remarkable coding gains beyond the capability of advanced video coding standards, which becomes a powerful coding tool candidate for future video coding standards. However, the utilization of deep neural networks brings heavy time and computational complexity, and high demands of high-performance hardware, which is challenging to apply to the general uses of coding scene. To address this limitation, inspired by explorations in image restoration, we propose an efficient and practical in-loop filtering scheme by adopting the Look-up Table (LUT). We train the DNN of in-loop filtering within a fixed filtering reference range, and cache the output values of the DNN into a LUT via traversing all possible inputs. At testing time in the coding process, the filtered pixel is generated by locating input pixels (to-be-filtered pixel with reference pixels) and interpolating cached filtered pixel values. To further enable the large filtering reference range with the limited storage cost of LUT, we introduce the enhanced indexing mechanism in the filtering process, and clipping/finetuning mechanism in the training. The proposed method is implemented into the Versatile Video Coding (VVC) reference software, VTM-11.0. Experimental results show that the ultrafast, very fast, and fast mode of the proposed method achieves on average 0.13%/0.34%/0.51%, and 0.10%/0.27%/0.39% BD-rate reduction, under the all intra (AI) and random access (RA) configurations. Especially, our method has friendly time and computational complexity, only 101%/102%-104%/108% time increase with 0.13-0.93 kMACs/pixel, and only 164-1148 KB storage cost for a single model. Our solution may shed light on the journey of practical neural network-based coding tool evolution.

[27] arXiv:2407.09520 (cross-list from cs.CV) [pdf, html, other]: Title: Exploring the Impact of Hand Pose and Shadow on Hand-washing Action Recognition

Shengtai Ju, Amy R. Reibman

Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

In the real world, camera-based application systems can face many challenges, including environmental factors and distribution shift. In this paper, we investigate how pose and shadow impact a classifier's performance, using the specific application of handwashing action recognition. To accomplish this, we generate synthetic data with desired variations to introduce controlled distribution shift. Using our synthetic dataset, we define a classifier's breakdown points to be where the system's performance starts to degrade sharply, and we show these are heavily impacted by pose and shadow conditions. In particular, heavier and larger shadows create earlier breakdown points. Also, it is intriguing to observe model accuracy drop to almost zero with bigger changes in pose. Moreover, we propose a simple mitigation strategy for pose-induced breakdown points by utilizing additional training data from non-canonical poses. Results show that the optimal choices of additional training poses are those with moderate deviations from the canonical poses with 50-60 degrees of rotation.
[28] arXiv:2407.09539 (cross-list from cs.CV) [pdf, html, other]: Title: Classification of Inkjet Printers based on Droplet Statistics

Patrick Takenaka, Manuel Eberhardinger, Daniel Grießhaber, Johannes Maucher

Comments: to be published in 2024 International Joint Conference on Neural Networks (IJCNN)

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Knowing the printer model used to print a given document may provide a crucial lead towards identifying counterfeits or conversely verifying the validity of a real document. Inkjet printers produce probabilistic droplet patterns that appear to be distinct for each printer model and as such we investigate the utilization of droplet characteristics including frequency domain features extracted from printed document scans for the classification of the underlying printer model. We collect and publish a dataset of high resolution document scans and show that our extracted features are informative enough to enable a neural network to distinguish not only the printer manufacturer, but also individual printer models.
[29] arXiv:2407.09544 (cross-list from cs.CL) [pdf, other]: Title: A Transformer-Based Multi-Stream Approach for Isolated Iranian Sign Language Recognition

Ali Ghadami, Alireza Taheri, Ali Meghdari

Comments: 17 pages, 10 figures

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Sign language is an essential means of communication for millions of people around the world and serves as their primary language. However, most communication tools are developed for spoken and written languages which can cause problems and difficulties for the deaf and hard of hearing community. By developing a sign language recognition system, we can bridge this communication gap and enable people who use sign language as their main form of expression to better communicate with people and their surroundings. This recognition system increases the quality of health services, improves public services, and creates equal opportunities for the deaf community. This research aims to recognize Iranian Sign Language words with the help of the latest deep learning tools such as transformers. The dataset used includes 101 Iranian Sign Language words frequently used in academic environments such as universities. The network used is a combination of early fusion and late fusion transformer encoder-based networks optimized with the help of genetic algorithm. The selected features to train this network include hands and lips key points, and the distance and angle between hands extracted from the sign videos. Also, in addition to the training model for the classes, the embedding vectors of words are used as multi-task learning to have smoother and more efficient training. This model was also tested on sentences generated from our word dataset using a windowing technique for sentence translation. Finally, the sign language training software that provides real-time feedback to users with the help of the developed model, which has 90.2% accuracy on test data, was introduced, and in a survey, the effectiveness and efficiency of this type of sign language learning software and the impact of feedback were investigated.
[30] arXiv:2407.09562 (cross-list from cs.CV) [pdf, html, other]: Title: Edge AI-Enabled Chicken Health Detection Based on Enhanced FCOS-Lite and Knowledge Distillation

Qiang Tong, Jinrui Wang, Wenshuang Yang, Songtao Wu, Wenqi Zhang, Chen Sun, Kuanhong Xu

Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

The utilization of AIoT technology has become a crucial trend in modern poultry management, offering the potential to optimize farming operations and reduce human workloads. This paper presents a real-time and compact edge-AI enabled detector designed to identify chickens and their healthy statuses using frames captured by a lightweight and intelligent camera equipped with an edge-AI enabled CMOS sensor. To ensure efficient deployment of the proposed compact detector within the memory-constrained edge-AI enabled CMOS sensor, we employ a FCOS-Lite detector leveraging MobileNet as the backbone. To mitigate the issue of reduced accuracy in compact edge-AI detectors without incurring additional inference costs, we propose a gradient weighting loss function as classification loss and introduce CIOU loss function as localization loss. Additionally, we propose a knowledge distillation scheme to transfer valuable information from a large teacher detector to the proposed FCOS-Lite detector, thereby enhancing its performance while preserving a compact model size. Experimental results demonstrate the proposed edge-AI enabled detector achieves commendable performance metrics, including a mean average precision (mAP) of 95.1$\%$ and an F1-score of 94.2$\%$, etc. Notably, the proposed detector can be efficiently deployed and operates at a speed exceeding 20 FPS on the edge-AI enabled CMOS sensor, achieved through int8 quantization. That meets practical demands for automated poultry health monitoring using lightweight intelligent cameras with low power consumption and minimal bandwidth costs.
[31] arXiv:2407.09788 (cross-list from cs.CV) [pdf, html, other]: Title: Explanation is All You Need in Distillation: Mitigating Bias and Shortcut Learning

Pedro R. A. S. Bassi, Andrea Cavalli, Sergio Decherchi

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Bias and spurious correlations in data can cause shortcut learning, undermining out-of-distribution (OOD) generalization in deep neural networks. Most methods require unbiased data during training (and/or hyper-parameter tuning) to counteract shortcut learning. Here, we propose the use of explanation distillation to hinder shortcut learning. The technique does not assume any access to unbiased data, and it allows an arbitrarily sized student network to learn the reasons behind the decisions of an unbiased teacher, such as a vision-language model or a network processing debiased images. We found that it is possible to train a neural network with explanation (e.g by Layer Relevance Propagation, LRP) distillation only, and that the technique leads to high resistance to shortcut learning, surpassing group-invariant learning, explanation background minimization, and alternative distillation techniques. In the COLOURED MNIST dataset, LRP distillation achieved 98.2% OOD accuracy, while deep feature distillation and IRM achieved 92.1% and 60.2%, respectively. In COCO-on-Places, the undesirable generalization gap between in-distribution and OOD accuracy is only of 4.4% for LRP distillation, while the other two techniques present gaps of 15.1% and 52.1%, respectively.
[32] arXiv:2407.09896 (cross-list from cs.CV) [pdf, html, other]: Title: Zero-Shot Image Compression with Diffusion-Based Posterior Sampling

Noam Elata, Tomer Michaeli, Michael Elad

Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Diffusion models dominate the field of image generation, however they have yet to make major breakthroughs in the field of image compression. Indeed, while pre-trained diffusion models have been successfully adapted to a wide variety of downstream tasks, existing work in diffusion-based image compression require task specific model training, which can be both cumbersome and limiting. This work addresses this gap by harnessing the image prior learned by existing pre-trained diffusion models for solving the task of lossy image compression. This enables the use of the wide variety of publicly-available models, and avoids the need for training or fine-tuning. Our method, PSC (Posterior Sampling-based Compression), utilizes zero-shot diffusion-based posterior samplers. It does so through a novel sequential process inspired by the active acquisition technique "Adasense" to accumulate informative measurements of the image. This strategy minimizes uncertainty in the reconstructed image and allows for construction of an image-adaptive transform coordinated between both the encoder and decoder. PSC offers a progressive compression scheme that is both practical and simple to implement. Despite minimal tuning, and a simple quantization and entropy coding, PSC achieves competitive results compared to established methods, paving the way for further exploration of pre-trained diffusion models and posterior samplers for image compression.
[33] arXiv:2407.09935 (cross-list from cs.CV) [pdf, html, other]: Title: LeRF: Learning Resampling Function for Adaptive and Efficient Image Interpolation

Jiacheng Li, Chang Chen, Fenglong Song, Youliang Yan, Zhiwei Xiong

Comments: Code: this https URL

Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Image and Video Processing (eess.IV)

Image resampling is a basic technique that is widely employed in daily applications, such as camera photo editing. Recent deep neural networks (DNNs) have made impressive progress in performance by introducing learned data priors. Still, these methods are not the perfect substitute for interpolation, due to the drawbacks in efficiency and versatility. In this work, we propose a novel method of Learning Resampling Function (termed LeRF), which takes advantage of both the structural priors learned by DNNs and the locally continuous assumption of interpolation. Specifically, LeRF assigns spatially varying resampling functions to input image pixels and learns to predict the hyper-parameters that determine the shapes of these resampling functions with a neural network. Based on the formulation of LeRF, we develop a family of models, including both efficiency-orientated and performance-orientated ones. To achieve interpolation-level efficiency, we adopt look-up tables (LUTs) to accelerate the inference of the learned neural network. Furthermore, we design a directional ensemble strategy and edge-sensitive indexing patterns to better capture local structures. On the other hand, to obtain DNN-level performance, we propose an extension of LeRF to enable it in cooperation with pre-trained upsampling models for cascaded resampling. Extensive experiments show that the efficiency-orientated version of LeRF runs as fast as interpolation, generalizes well to arbitrary transformations, and outperforms interpolation significantly, e.g., up to 3dB PSNR gain over Bicubic for x2 upsampling on Manga109. Besides, the performance-orientated version of LeRF reaches comparable performance with existing DNNs at much higher efficiency, e.g., less than 25% running time on a desktop GPU.
[34] arXiv:2407.09966 (cross-list from cs.CV) [pdf, html, other]: Title: Optimizing ROI Benefits Vehicle ReID in ITS

Mei Qiu, Lauren Ann Christopher, Lingxi Li, Stanley Chien, Yaobin Chen

Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Vehicle re-identification (ReID) is a computer vision task that matches the same vehicle across different cameras or viewpoints in a surveillance system. This is crucial for Intelligent Transportation Systems (ITS), where the effectiveness is influenced by the regions from which vehicle images are cropped. This study explores whether optimal vehicle detection regions, guided by detection confidence scores, can enhance feature matching and ReID tasks. Using our framework with multiple Regions of Interest (ROIs) and lane-wise vehicle counts, we employed YOLOv8 for detection and DeepSORT for tracking across twelve Indiana Highway videos, including two pairs of videos from non-overlapping cameras. Tracked vehicle images were cropped from inside and outside the ROIs at five-frame intervals. Features were extracted using pre-trained models: ResNet50, ResNeXt50, Vision Transformer, and Swin-Transformer. Feature consistency was assessed through cosine similarity, information entropy, and clustering variance. Results showed that features from images cropped inside ROIs had higher mean cosine similarity values compared to those involving one image inside and one outside the ROIs. The most significant difference was observed during night conditions (0.7842 inside vs. 0.5 outside the ROI with Swin-Transformer) and in cross-camera scenarios (0.75 inside-inside vs. 0.52 inside-outside the ROI with Vision Transformer). Information entropy and clustering variance further supported that features in ROIs are more consistent. These findings suggest that strategically selected ROIs can enhance tracking performance and ReID accuracy in ITS.
[35] arXiv:2407.10348 (cross-list from cs.CV) [pdf, html, other]: Title: Addressing Class Imbalance and Data Limitations in Advanced Node Semiconductor Defect Inspection: A Generative Approach for SEM Images

Bappaditya Dey, Vic De Ridder, Victor Blanco, Sandip Halder, Bartel Van Waeyenberge

Comments: 8 pages, 11 figures, to be presented at 2024 International Symposium ELMAR, and published by IEEE in the conference proceedings

Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Precision in identifying nanometer-scale device-killer defects is crucial in both semiconductor research and development as well as in production processes. The effectiveness of existing ML-based approaches in this context is largely limited by the scarcity of data, as the production of real semiconductor wafer data for training these models involves high financial and time costs. Moreover, the existing simulation methods fall short of replicating images with identical noise characteristics, surface roughness and stochastic variations at advanced nodes. We propose a method for generating synthetic semiconductor SEM images using a diffusion model within a limited data regime. In contrast to images generated through conventional simulation methods, SEM images generated through our proposed DL method closely resemble real SEM images, replicating their noise characteristics and surface roughness adaptively. Our main contributions, which are validated on three different real semiconductor datasets, are: i) proposing a patch-based generative framework utilizing DDPM to create SEM images with intended defect classes, addressing challenges related to class-imbalance and data insufficiency, ii) demonstrating generated synthetic images closely resemble real SEM images acquired from the tool, preserving all imaging conditions and metrology characteristics without any metadata supervision, iii) demonstrating a defect detector trained on generated defect dataset, either independently or combined with a limited real dataset, can achieve similar or improved performance on real wafer SEM images during validation/testing compared to exclusive training on a real defect dataset, iv) demonstrating the ability of the proposed approach to transfer defect types, critical dimensions, and imaging conditions from one specified CD/Pitch and metrology specifications to another, thereby highlighting its versatility.
[36] arXiv:2407.10512 (cross-list from q-bio.NC) [pdf, other]: Title: MARVEL: MR Fingerprinting with Additional micRoVascular Estimates using bidirectional LSTMs

Antoine Barrier (GIN), Thomas Coudert (GIN), Aurélien Delphin (IRMaGe), Benjamin Lemasson (GIN), Thomas Christen (GIN)

Subjects: Neurons and Cognition (q-bio.NC); Image and Video Processing (eess.IV)

The Magnetic Resonance Fingerprinting (MRF) approach aims to estimate multiple MR or physiological parameters simultaneously with a single fast acquisition sequence. Most of the MRF studies proposed so far have used simple MR sequence types to measure relaxation times (T1, T2). In that case, deep learning algorithms have been successfully used to speed up the reconstruction process. In theory, the MRF concept could be used with a variety of other MR sequence types and should be able to provide more information about the tissue microstructures. Yet, increasing the complexity of the numerical models often leads to prohibited simulation times, and estimating multiple parameters from one sequence implies new dictionary dimensions whose sizes become too large for standard computers and DL this http URL this paper, we propose to analyze the MRF signal coming from a complex balance Steady-state free precession (bSSFP) type sequence to simultaneously estimate relaxometry maps (T1, T2), Field maps (B1, B0) as well as microvascular properties such as the local Cerebral Blood Volume (CBV) or the averaged vessel Radius (R).To bypass the curse of dimensionality, we propose an efficient way to simulate the MR signal coming from numerical voxels containing realistic microvascular networks as well as a Bidirectional Long Short-Term Memory network used for the matching process.On top of standard MRF maps, our results on 3 human volunteers suggest that our approach can quickly produce high-quality quantitative maps of microvascular parameters that are otherwise obtained using longer dedicated sequences and intravenous injection of a contrast agent. This approach could be used for the management of multiple pathologies and could be tuned to provide other types of microstructural information.
[37] arXiv:2407.10559 (cross-list from cs.CV) [pdf, html, other]: Title: LIP-CAR: contrast agent reduction by a deep learned inverse problem

Davide Bianchi, Sonia Colombo Serra, Davide Evangelista, Pengpeng Luo, Elena Morotti, Giovanni Valbusa

Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Numerical Analysis (math.NA)

The adoption of contrast agents in medical imaging protocols is crucial for accurate and timely diagnosis. While highly effective and characterized by an excellent safety profile, the use of contrast agents has its limitation, including rare risk of allergic reactions, potential environmental impact and economic burdens on patients and healthcare systems. In this work, we address the contrast agent reduction (CAR) problem, which involves reducing the administered dosage of contrast agent while preserving the visual enhancement. The current literature on the CAR task is based on deep learning techniques within a fully image processing framework. These techniques digitally simulate high-dose images from images acquired with a low dose of contrast agent. We investigate the feasibility of a ``learned inverse problem'' (LIP) approach, as opposed to the end-to-end paradigm in the state-of-the-art literature.
Specifically, we learn the image-to-image operator that maps high-dose images to their corresponding low-dose counterparts, and we frame the CAR task as an inverse problem. We then solve this problem through a regularized optimization reformulation. Regularization methods are well-established mathematical techniques that offer robustness and explainability. Our approach combines these rigorous techniques with cutting-edge deep learning tools. Numerical experiments performed on pre-clinical medical images confirm the effectiveness of this strategy, showing improved stability and accuracy in the simulated high-dose images.
[38] arXiv:2407.10567 (cross-list from cs.CV) [pdf, html, other]: Title: PULPo: Probabilistic Unsupervised Laplacian Pyramid Registration

Leonard Siegert, Paul Fischer, Mattias P. Heinrich, Christian F. Baumgartner

Comments: Accepted as full paper to MICCAI 2024

Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Deformable image registration is fundamental to many medical imaging applications. Registration is an inherently ambiguous task often admitting many viable solutions. While neural network-based registration techniques enable fast and accurate registration, the majority of existing approaches are not able to estimate uncertainty. Here, we present PULPo, a method for probabilistic deformable registration capable of uncertainty quantification. PULPo probabilistically models the distribution of deformation fields on different hierarchical levels combining them using Laplacian pyramids. This allows our method to model global as well as local aspects of the deformation field. We evaluate our method on two widely used neuroimaging datasets and find that it achieves high registration performance as well as substantially better calibrated uncertainty quantification compared to the current state-of-the-art.
[39] arXiv:2407.10628 (cross-list from cond-mat.mtrl-sci) [pdf, other]: Title: Automated high-resolution backscattered-electron imaging at macroscopic scale

Zhiyuan Lang, Zunshuai Zhang, Lei Wang, Yuhan Liu, Weixiong Qian, Shenghua Zhou, Ying Jiang, Tongyi Zhang, Jiong Yang

Comments: 22 pages,12 figures

Subjects: Materials Science (cond-mat.mtrl-sci); Image and Video Processing (eess.IV)

Scanning electron microscopy (SEM) has been widely utilized in the field of materials science due to its significant advantages, such as large depth of field, wide field of view, and excellent stereoscopic imaging. However, at high magnification, the limited imaging range in SEM cannot cover all the possible inhomogeneous microstructures. In this research, we propose a novel approach for generating high-resolution SEM images across multiple scales, enabling a single image to capture physical dimensions at the centimeter level while preserving submicron-level details. We adopted the SEM imaging on the AlCoCrFeNi2.1 eutectic high entropy alloy (EHEA) as an example. SEM videos and image stitching are combined to fulfill this goal, and the video-extracted low-definition (LD) images are clarified by a well-trained denoising model. Furthermore, we segment the macroscopic image of the EHEA, and area of various microstructures are distinguished. Combining the segmentation results and hardness experiments, we found that the hardness is positively correlated with the content of body-centered cubic (BCC) phase, negatively correlated with the lamella width, and the relationship with the proportion of lamellar structures was not significant. Our work provides a feasible solution to generate macroscopic images based on SEMs for further analysis of the correlations between the microstructures and spatial distribution, and can be widely applied to other types of microscope.
[40] arXiv:2407.10762 (cross-list from cs.CV) [pdf, html, other]: Title: Domain Generalization for 6D Pose Estimation Through NeRF-based Image Synthesis

Antoine Legrand, Renaud Detry, Christophe De Vleeschouwer

Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

This work introduces a novel augmentation method that increases the diversity of a train set to improve the generalization abilities of a 6D pose estimation network. For this purpose, a Neural Radiance Field is trained from synthetic images and exploited to generate an augmented set. Our method enriches the initial set by enabling the synthesis of images with (i) unseen viewpoints, (ii) rich illumination conditions through appearance extrapolation, and (iii) randomized textures. We validate our augmentation method on the challenging use-case of spacecraft pose estimation and show that it significantly improves the pose estimation generalization capabilities. On the SPEED+ dataset, our method reduces the error on the pose by 50% on both target domains.

[41] arXiv:2207.13021 (replaced) [pdf, other]: Title: Topological Optimized Convolutional Visual Recurrent Network for Brain Tumor Segmentation and Classification

Dhananjay Joshi, Bhupesh Kumar Singh, Kapil Kumar Nagwanshi, Nitin S. Choubey

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

In today's world of health care, brain tumor detection has become common. However, the manual brain tumor classification approach is time-consuming. So Deep Convolutional Neural Network (DCNN) is used by many researchers in the medical field for making accurate diagnoses and aiding in the patient's treatment. The traditional techniques have problems such as overfitting and the inability to extract necessary features. To overcome these problems, we developed the Topological Data Analysis based Improved Persistent Homology (TDA-IPH) and Convolutional Transfer learning and Visual Recurrent learning with Elephant Herding Optimization hyper-parameter tuning (CTVR-EHO) models for brain tumor segmentation and classification. Initially, the Topological Data Analysis based Improved Persistent Homology is designed to segment the brain tumor image. Then, from the segmented image, features are extracted using TL via the AlexNet model and Bidirectional Visual Long Short-Term Memory (Bi-VLSTM). Next, elephant Herding Optimization (EHO) is used to tune the hyperparameters of both networks to get an optimal result. Finally, extracted features are concatenated and classified using the softmax activation layer. The simulation result of this proposed CTVR-EHO and TDA-IPH method is analyzed based on precision, accuracy, recall, loss, and F score metrics. When compared to other existing brain tumor segmentation and classification models, the proposed CTVR-EHO and TDA-IPH approaches show high accuracy (99.8%), high recall (99.23%), high precision (99.67%), and high F score (99.59%).
[42] arXiv:2309.14550 (replaced) [pdf, html, other]: Title: MEMO: Dataset and Methods for Robust Multimodal Retinal Image Registration with Large or Small Vessel Density Differences

Chiao-Yi Wang, Faranguisse Kakhi Sadrieh, Yi-Ting Shen, Shih-En Chen, Sarah Kim, Victoria Chen, Achyut Raghavendra, Dongyi Wang, Osamah Saeedi, Yang Tao

Comments: Biomedical Optics Express

Journal-ref: Biomed. Opt. Express 15, 3457-3479 (2024)

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

The measurement of retinal blood flow (RBF) in capillaries can provide a powerful biomarker for the early diagnosis and treatment of ocular diseases. However, no single modality can determine capillary flowrates with high precision. Combining erythrocyte-mediated angiography (EMA) with optical coherence tomography angiography (OCTA) has the potential to achieve this goal, as EMA can measure the absolute 2D RBF of retinal microvasculature and OCTA can provide the 3D structural images of capillaries. However, multimodal retinal image registration between these two modalities remains largely unexplored. To fill this gap, we establish MEMO, the first public multimodal EMA and OCTA retinal image dataset. A unique challenge in multimodal retinal image registration between these modalities is the relatively large difference in vessel density (VD). To address this challenge, we propose a segmentation-based deep-learning framework (VDD-Reg) and a new evaluation metric (MSD), which provide robust results despite differences in vessel density. VDD-Reg consists of a vessel segmentation module and a registration module. To train the vessel segmentation module, we further designed a two-stage semi-supervised learning framework (LVD-Seg) combining supervised and unsupervised losses. We demonstrate that VDD-Reg outperforms baseline methods quantitatively and qualitatively for cases of both small VD differences (using the CF-FA dataset) and large VD differences (using our MEMO dataset). Moreover, VDD-Reg requires as few as three annotated vessel segmentation masks to maintain its accuracy, demonstrating its feasibility.
[43] arXiv:2401.13403 (replaced) [pdf, other]: Title: SEDNet: Shallow Encoder-Decoder Network for Brain Tumor Segmentation

Chollette C. Olisah

Comments: 8 pages, 6 figures, 3 Tables

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Despite the advancement in computational modeling towards brain tumor segmentation, of which several models have been developed, it is evident from the computational complexity of existing models which are still at an all-time high, that performance and efficiency under clinical application scenarios are limited. Therefore, this paper proposes a shallow encoder and decoder network named SEDNet for brain tumor segmentation. The proposed network is adapted from the U-Net structure. Though brain tumors do not assume complex structures like the task the traditional U-Net was designed for, their variance in appearance, shape, and ambiguity of boundaries makes it a compelling complex task to solve. SEDNet architecture design is inspired by the localized nature of brain tumors in brain images, thus consists of sufficient hierarchical convolutional blocks in the encoding pathway capable of learning the intrinsic features of brain tumors in brain slices, and a decoding pathway with selective skip path sufficient for capturing miniature local-level spatial features alongside the global-level features of brain tumor. SEDNet with the integration of the proposed preprocessing algorithm and optimization function on the BraTS2020 set reserved for testing achieves impressive dice and Hausdorff scores of 0.9308, 0.9451, 0.9026, and 0.7040, 1.2866, 0.7762 for non-enhancing tumor core (NTC), peritumoral edema (ED), and enhancing tumor (ET), respectively. Furthermore, through transfer learning with initialized SEDNet pre-trained weights, termed SEDNetX, a performance increase is observed. The dice and Hausdorff scores recorded are 0.9336, 0.9478, 0.9061, 0.6983, 1.2691, and 0.7711 for NTC, ED, and ET, respectively. With about 1.3 million parameters and impressive performance in comparison to the state-of-the-art, SEDNet(X) is shown to be computationally efficient for real-time clinical diagnosis.
[44] arXiv:2402.01054 (replaced) [pdf, html, other]: Title: Unconditional Latent Diffusion Models Memorize Patient Imaging Data: Implications for Openly Sharing Synthetic Data

Salman Ul Hassan Dar, Marvin Seyfarth, Jannik Kahmann, Isabelle Ayx, Theano Papavassiliu, Stefan O. Schoenberg, Norbert Frey, Bettina Baeßler, Sebastian Foersch, Daniel Truhn, Jakob Nikolas Kather, Sandy Engelhardt

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

AI models present a wide range of applications in the field of medicine. However, achieving optimal performance requires access to extensive healthcare data, which is often not readily available. Furthermore, the imperative to preserve patient privacy restricts patient data sharing with third parties and even within institutes. Recently, generative AI models have been gaining traction for facilitating open-data sharing by proposing synthetic data as surrogates of real patient data. Despite the promise, these models are susceptible to patient data memorization, where models generate patient data copies instead of novel synthetic samples. Considering the importance of the problem, it has received little attention in the medical imaging community. To this end, we assess memorization in unconditional latent diffusion models. We train 2D and 3D latent diffusion models on CT, MR, and X-ray datasets for synthetic data generation. Afterwards, we detect the amount of training data memorized utilizing our self-supervised approach and further investigate various factors that can influence memorization. Our findings show a surprisingly high degree of patient data memorization across all datasets, with approximately 40.9% of patient data being memorized and 78.5% of synthetic samples identified as patient data copies on average in our experiments. Further analyses reveal that using augmentation strategies during training can reduce memorization while over-training the models can enhance it. Although increasing the dataset size does not reduce memorization and might even enhance it, it does lower the probability of a synthetic sample being a patient data copy. Collectively, our results emphasize the importance of carefully training generative models on private medical imaging datasets, and examining the synthetic data to ensure patient privacy before sharing it for medical research and applications.
[45] arXiv:2404.10700 (replaced) [pdf, html, other]: Title: Rawformer: Unpaired Raw-to-Raw Translation for Learnable Camera ISPs

Georgy Perevozchikov, Nancy Mehta, Mahmoud Afifi, Radu Timofte

Comments: Accepted by ECCV 2024

Journal-ref: https://eccv.ecva.net/Conferences/2024

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Modern smartphone camera quality heavily relies on the image signal processor (ISP) to enhance captured raw images, utilizing carefully designed modules to produce final output images encoded in a standard color space (e.g., sRGB). Neural-based end-to-end learnable ISPs offer promising advancements, potentially replacing traditional ISPs with their ability to adapt without requiring extensive tuning for each new camera model, as is often the case for nearly every module in traditional ISPs. However, the key challenge with the recent learning-based ISPs is the urge to collect large paired datasets for each distinct camera model due to the influence of intrinsic camera characteristics on the formation of input raw images. This paper tackles this challenge by introducing a novel method for unpaired learning of raw-to-raw translation across diverse cameras. Specifically, we propose Rawformer, an unsupervised Transformer-based encoder-decoder method for raw-to-raw translation. It accurately maps raw images captured by a certain camera to the target camera, facilitating the generalization of learnable ISPs to new unseen cameras. Our method demonstrates superior performance on real camera datasets, achieving higher accuracy compared to previous state-of-the-art techniques, and preserving a more robust correlation between the original and translated raw images. The codes and the pretrained models are available at this https URL.
[46] arXiv:2405.18267 (replaced) [pdf, html, other]: Title: CT-based brain ventricle segmentation via diffusion Schr\"odinger Bridge without target domain ground truths

Reihaneh Teimouri, Marta Kersten-Oertel, Yiming Xiao

Comments: Early acceptance at MICCAI2024

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Efficient and accurate brain ventricle segmentation from clinical CT scans is critical for emergency surgeries like ventriculostomy. With the challenges in poor soft tissue contrast and a scarcity of well-annotated databases for clinical brain CTs, we introduce a novel uncertainty-aware ventricle segmentation technique without the need of CT segmentation ground truths by leveraging diffusion-model-based domain adaptation. Specifically, our method employs the diffusion Schrödinger Bridge and an attention recurrent residual U-Net to capitalize on unpaired CT and MRI scans to derive automatic CT segmentation from those of the MRIs, which are more accessible. Importantly, we propose an end-to-end, joint training framework of image translation and segmentation tasks, and demonstrate its benefit over training individual tasks separately. By comparing the proposed method against similar setups using two different GAN models for domain adaptation (CycleGAN and CUT), we also reveal the advantage of diffusion models towards improved segmentation and image translation quality. With a Dice score of 0.78$\pm$0.27, our proposed method outperformed the compared methods, including SynSeg-Net, while providing intuitive uncertainty measures to further facilitate quality control of the automatic segmentation outcomes. The implementation of our proposed method is available at: this https URL.
[47] arXiv:2407.02844 (replaced) [pdf, html, other]: Title: Multi-Attention Integrated Deep Learning Frameworks for Enhanced Breast Cancer Segmentation and Identification

Pandiyaraju V, Shravan Venkatraman, Pavan Kumar S, Santhosh Malarvannan, Kannan A

Comments: 29 pages, 15 figures, 6 tables

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Breast cancer poses a profound threat to lives globally, claiming numerous lives each year. Therefore, timely detection is crucial for early intervention and improved chances of survival. Accurately diagnosing and classifying breast tumors using ultrasound images is a persistent challenge in medicine, demanding cutting-edge solutions for improved treatment strategies. This research introduces multiattention-enhanced deep learning (DL) frameworks designed for the classification and segmentation of breast cancer tumors from ultrasound images. A spatial channel attention mechanism is proposed for segmenting tumors from ultrasound images, utilizing a novel LinkNet DL framework with an InceptionResNet backbone. Following this, the paper proposes a deep convolutional neural network with an integrated multi-attention framework (DCNNIMAF) to classify the segmented tumor as benign, malignant, or normal. From experimental results, it is observed that the segmentation model has recorded an accuracy of 98.1%, with a minimal loss of 0.6%. It has also achieved high Intersection over Union (IoU) and Dice Coefficient scores of 96.9% and 97.2%, respectively. Similarly, the classification model has attained an accuracy of 99.2%, with a low loss of 0.31%. Furthermore, the classification framework has achieved outstanding F1-Score, precision, and recall values of 99.1%, 99.3%, and 99.1%, respectively. By offering a robust framework for early detection and accurate classification of breast cancer, this proposed work significantly advances the field of medical image analysis, potentially improving diagnostic precision and patient outcomes.
[48] arXiv:2407.05709 (replaced) [pdf, html, other]: Title: Heterogeneous window transformer for image denoising

Chunwei Tian, Menghua Zheng, Chia-Wen Lin, Zhiwu Li, David Zhang

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Deep networks can usually depend on extracting more structural information to improve denoising results. However, they may ignore correlation between pixels from an image to pursue better denoising performance. Window transformer can use long- and short-distance modeling to interact pixels to address mentioned problem. To make a tradeoff between distance modeling and denoising time, we propose a heterogeneous window transformer (HWformer) for image denoising. HWformer first designs heterogeneous global windows to capture global context information for improving denoising effects. To build a bridge between long and short-distance modeling, global windows are horizontally and vertically shifted to facilitate diversified information without increasing denoising time. To prevent the information loss phenomenon of independent patches, sparse idea is guided a feed-forward network to extract local information of neighboring patches. The proposed HWformer only takes 30% of popular Restormer in terms of denoising time.
[49] arXiv:2407.05767 (replaced) [pdf, html, other]: Title: Nonrigid Reconstruction of Freehand Ultrasound without a Tracker

Qi Li, Ziyi Shen, Qianye Yang, Dean C. Barratt, Matthew J. Clarkson, Tom Vercauteren, Yipeng Hu

Comments: Accepted at MICCAI 2024

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Reconstructing 2D freehand Ultrasound (US) frames into 3D space without using a tracker has recently seen advances with deep learning. Predicting good frame-to-frame rigid transformations is often accepted as the learning objective, especially when the ground-truth labels from spatial tracking devices are inherently rigid transformations. Motivated by a) the observed nonrigid deformation due to soft tissue motion during scanning, and b) the highly sensitive prediction of rigid transformation, this study investigates the methods and their benefits in predicting nonrigid transformations for reconstructing 3D US. We propose a novel co-optimisation algorithm for simultaneously estimating rigid transformations among US frames, supervised by ground-truth from a tracker, and a nonrigid deformation, optimised by a regularised registration network. We show that these two objectives can be either optimised using meta-learning or combined by weighting. A fast scattered data interpolation is also developed for enabling frequent reconstruction and registration of non-parallel US frames, during training. With a new data set containing over 357,000 frames in 720 scans, acquired from 60 subjects, the experiments demonstrate that, due to an expanded thus easier-to-optimise solution space, the generalisation is improved with the added deformation estimation, with respect to the rigid ground-truth. The global pixel reconstruction error (assessing accumulative prediction) is lowered from 18.48 to 16.51 mm, compared with baseline rigid-transformation-predicting methods. Using manually identified landmarks, the proposed co-optimisation also shows potentials in compensating nonrigid tissue motion at inference, which is not measurable by tracker-provided ground-truth. The code and data used in this paper are made publicly available at this https URL.
[50] arXiv:2407.06514 (replaced) [pdf, html, other]: Title: Asymmetric Mask Scheme for Self-Supervised Real Image Denoising

Xiangyu Liao, Tianheng Zheng, Jiayu Zhong, Pingping Zhang, Chao Ren

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

In recent years, self-supervised denoising methods have gained significant success and become critically important in the field of image restoration. Among them, the blind spot network based methods are the most typical type and have attracted the attentions of a large number of researchers. Although the introduction of blind spot operations can prevent identity mapping from noise to noise, it imposes stringent requirements on the receptive fields in the network design, thereby limiting overall performance. To address this challenge, we propose a single mask scheme for self-supervised denoising training, which eliminates the need for blind spot operation and thereby removes constraints on the network structure design. Furthermore, to achieve denoising across entire image during inference, we propose a multi-mask scheme. Our method, featuring the asymmetric mask scheme in training and inference, achieves state-of-the-art performance on existing real noisy image datasets. All the source code will be made available to the public.
[51] arXiv:2309.00470 (replaced) [pdf, html, other]: Title: Deep Joint Source-Channel Coding for Adaptive Image Transmission over MIMO Channels

Haotian Wu, Yulin Shao, Chenghong Bian, Krystian Mikolajczyk, Deniz Gündüz

Comments: arXiv admin note: text overlap with arXiv:2210.15347

Subjects: Information Theory (cs.IT); Image and Video Processing (eess.IV)

This paper introduces a vision transformer (ViT)-based deep joint source and channel coding (DeepJSCC) scheme for wireless image transmission over multiple-input multiple-output (MIMO) channels, denoted as DeepJSCC-MIMO. We consider DeepJSCC-MIMO for adaptive image transmission in both open-loop and closed-loop MIMO systems. The novel DeepJSCC-MIMO architecture surpasses the classical separation-based benchmarks with robustness to channel estimation errors and showcases remarkable flexibility in adapting to diverse channel conditions and antenna numbers without requiring retraining. Specifically, by harnessing the self-attention mechanism of ViT, DeepJSCC-MIMO intelligently learns feature mapping and power allocation strategies tailored to the unique characteristics of the source image and prevailing channel conditions. Extensive numerical experiments validate the significant improvements in transmission quality achieved by DeepJSCC-MIMO for both open-loop and closed-loop MIMO systems across a wide range of scenarios. Moreover, DeepJSCC-MIMO exhibits robustness to varying channel conditions, channel estimation errors, and different antenna numbers, making it an appealing solution for emerging semantic communication systems.
[52] arXiv:2403.00628 (replaced) [pdf, html, other]: Title: Region-Adaptive Transform with Segmentation Prior for Image Compression

Yuxi Liu, Wenhan Yang, Huihui Bai, Yunchao Wei, Yao Zhao

Comments: Accepted to ECCV 2024

Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Learned Image Compression (LIC) has shown remarkable progress in recent years. Existing works commonly employ CNN-based or self-attention-based modules as transform methods for compression. However, there is no prior research on neural transform that focuses on specific regions. In response, we introduce the class-agnostic segmentation masks (i.e. semantic masks without category labels) for extracting region-adaptive contextual information. Our proposed module, Region-Adaptive Transform, applies adaptive convolutions on different regions guided by the masks. Additionally, we introduce a plug-and-play module named Scale Affine Layer to incorporate rich contexts from various regions. While there have been prior image compression efforts that involve segmentation masks as additional intermediate inputs, our approach differs significantly from them. Our advantages lie in that, to avoid extra bitrate overhead, we treat these masks as privilege information, which is accessible during the model training stage but not required during the inference phase. To the best of our knowledge, we are the first to employ class-agnostic masks as privilege information and achieve superior performance in pixel-fidelity metrics, such as Peak Signal to Noise Ratio (PSNR). The experimental results demonstrate our improvement compared to previously well-performing methods, with about 8.2% bitrate saving compared to VTM-17.0. The source code is available at this https URL.
[53] arXiv:2403.19238 (replaced) [pdf, html, other]: Title: Taming Lookup Tables for Efficient Image Retouching

Sidi Yang, Binxiao Huang, Mingdeng Cao, Yatai Ji, Hanzhong Guo, Ngai Wong, Yujiu Yang

Comments: Accepted by ECCV2024

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)

The widespread use of high-definition screens in edge devices, such as end-user cameras, smartphones, and televisions, is spurring a significant demand for image enhancement. Existing enhancement models often optimize for high performance while falling short of reducing hardware inference time and power consumption, especially on edge devices with constrained computing and storage resources. To this end, we propose Image Color Enhancement Lookup Table (ICELUT) that adopts LUTs for extremely efficient edge inference, without any convolutional neural network (CNN). During training, we leverage pointwise (1x1) convolution to extract color information, alongside a split fully connected layer to incorporate global information. Both components are then seamlessly converted into LUTs for hardware-agnostic deployment. ICELUT achieves near-state-of-the-art performance and remarkably low power consumption. We observe that the pointwise network structure exhibits robust scalability, upkeeping the performance even with a heavily downsampled 32x32 input image. These enable ICELUT, the first-ever purely LUT-based image enhancer, to reach an unprecedented speed of 0.4ms on GPU and 7ms on CPU, at least one order faster than any CNN solution. Codes are available at this https URL.

Total of 53 entries

Showing up to 2000 entries per page: fewer | more | all

Image and Video Processing

New submissions for Tuesday, 16 July 2024 (showing 26 of 26 entries )

Cross submissions for Tuesday, 16 July 2024 (showing 14 of 14 entries )

Replacement submissions for Tuesday, 16 July 2024 (showing 13 of 13 entries )