-
Image registration based automated lesion correspondence pipeline for longitudinal CT data
Authors:
Subrata Mukherjee,
Thibaud Coroller,
Craig Wang,
Ravi K. Samala,
Tingting Hu,
Didem Gokcay,
Nicholas Petrick,
Berkman Sahiner,
Qian Cao
Abstract:
Patients diagnosed with metastatic breast cancer (mBC) typically undergo several radiographic assessments during their treatment. Because mBC often involves multiple metastatic lesions in different organs, it is imperative to accurately track and assess these lesions to gain a comprehensive understanding of the disease's response to treatment. Computerized analysis methods that rely on lesion-level tracking have often used manual matching of corresponding lesions, a time-consuming process that is prone to errors. This paper introduces an automated lesion correspondence algorithm designed to precisely track both target and non-target lesions in longitudinal data. Here we demonstrate the applicability of our algorithm on the anonymized data from two Phase III trials. The dataset contains imaging data of patients for different follow-up timepoints and the radiologist annotations for the patients enrolled in the trials. Target and non-target lesions are annotated by either one or two groups of radiologists. To facilitate accurate tracking, we have developed a registration-assisted lesion correspondence algorithm. The algorithm employs a sequential two-step pipeline: (a) first, an adaptive Hungarian algorithm is used to establish correspondence among lesions within a single volumetric image series that have been annotated by multiple radiologists at a specific timepoint; (b) second, after establishing correspondence and assigning unique names to the lesions, three-dimensional rigid registration is applied to various image series at the same timepoint. Registration is followed by ongoing lesion correspondence based on the adaptive Hungarian algorithm and updating lesion names for accurate tracking. Validation of our automated lesion correspondence algorithm is performed through triaxial plots based on axial, sagittal, and coronal views, confirming its efficacy in matching lesions.
Submitted 25 April, 2024;
originally announced April 2024.
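The correspondence step in the abstract above can be illustrated with a minimal sketch. It is not the authors' adaptive Hungarian algorithm: it finds the minimum-cost one-to-one assignment by brute force (the same optimum the Hungarian algorithm would return, but practical only for a handful of lesions). The centroid representation, the `max_distance` gate, and all names are assumptions made for illustration.

```python
from itertools import permutations
from math import dist


def match_lesions(centroids_a, centroids_b, max_distance=10.0):
    """Return {index_in_a: index_in_b} minimising the total centroid distance.

    Brute-force stand-in for the Hungarian algorithm: it reaches the same
    optimal assignment but enumerates every candidate pairing.
    """
    if len(centroids_a) > len(centroids_b):
        # Always assign the shorter list into the longer one, then flip back.
        swapped = match_lesions(centroids_b, centroids_a, max_distance)
        return {a: b for b, a in swapped.items()}

    best_cost, best_perm = float("inf"), ()
    for perm in permutations(range(len(centroids_b)), len(centroids_a)):
        cost = sum(dist(centroids_a[i], centroids_b[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm

    # Hypothetical gating rule: pairs farther apart than max_distance stay unmatched.
    return {
        i: j
        for i, j in enumerate(best_perm)
        if dist(centroids_a[i], centroids_b[j]) <= max_distance
    }


# Toy lesion centroids (x, y, z) from two hypothetical readers.
reader_1 = [(10.0, 20.0, 5.0), (40.0, 42.0, 12.0)]
reader_2 = [(41.0, 40.0, 12.0), (11.0, 19.0, 5.0), (90.0, 90.0, 30.0)]
matches = match_lesions(reader_1, reader_2)
```

Here `matches` pairs each lesion from the first reader with its nearest counterpart from the second, and the third lesion of `reader_2` remains unmatched. For realistic lesion counts a polynomial-time solver would replace the enumeration.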
-
CRNet: A Detail-Preserving Network for Unified Image Restoration and Enhancement Task
Authors:
Kangzhen Yang,
Tao Hu,
Kexin Dai,
Genggeng Chen,
Yu Cao,
Wei Dong,
Peng Wu,
Yanning Zhang,
Qingsen Yan
Abstract:
In real-world scenarios, images captured often suffer from blurring, noise, and other forms of image degradation, and due to sensor limitations, people usually can only obtain low dynamic range images. To achieve high-quality images, researchers have attempted various image restoration and enhancement operations on photographs, including denoising, deblurring, and high dynamic range imaging. However, merely performing a single type of image enhancement still cannot yield satisfactory images. In this paper, to deal with the challenge above, we propose the Composite Refinement Network (CRNet) to address this issue using multiple exposure images. By fully integrating information-rich multiple exposure inputs, CRNet can perform unified image restoration and enhancement. To improve the quality of image details, CRNet explicitly separates and strengthens high and low-frequency information through pooling layers, using specially designed Multi-Branch Blocks for effective fusion of these frequencies. To increase the receptive field and fully integrate input features, CRNet employs the High-Frequency Enhancement Module, which includes large kernel convolutions and an inverted bottleneck ConvFFN. Our model secured third place in the first track of the Bracketing Image Restoration and Enhancement Challenge, surpassing previous SOTA models in both testing metrics and visual quality.
Submitted 22 April, 2024;
originally announced April 2024.
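The idea of separating high- and low-frequency information through pooling, mentioned in the abstract above, can be sketched on a one-dimensional signal. This is only an illustration under simple assumptions (non-overlapping average pooling, nearest-neighbour upsampling, residual as the high-frequency part); it is not the CRNet architecture, and the function name is hypothetical.

```python
def split_frequencies(signal, window=2):
    """Split a 1-D signal into low- and high-frequency parts.

    Low frequency: average pooling over non-overlapping windows, repeated
    back to the original length. High frequency: what the pooling removed.
    """
    low = []
    for start in range(0, len(signal), window):
        chunk = signal[start:start + window]
        mean = sum(chunk) / len(chunk)
        low.extend([mean] * len(chunk))
    high = [s - l for s, l in zip(signal, low)]
    return low, high


signal = [1.0, 3.0, 2.0, 6.0, 5.0]
low, high = split_frequencies(signal)
```

By construction `low[i] + high[i]` reproduces `signal[i]`, so the two branches can be processed separately and recombined without losing information.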
-
Bracketing Image Restoration and Enhancement with High-Low Frequency Decomposition
Authors:
Genggeng Chen,
Kexin Dai,
Kangzhen Yang,
Tao Hu,
Xiangyu Chen,
Yongqing Yang,
Wei Dong,
Peng Wu,
Yanning Zhang,
Qingsen Yan
Abstract:
In real-world scenarios, due to a series of image degradations, obtaining high-quality, clear content photos is challenging. While significant progress has been made in synthesizing high-quality images, previous methods for image restoration and enhancement often overlooked the characteristics of different degradations. They applied the same structure to address various types of degradation, resulting in less-than-ideal restoration outcomes. Inspired by the notion that high/low frequency information is applicable to different degradations, we introduce HLNet, a Bracketing Image Restoration and Enhancement method based on high-low frequency decomposition. Specifically, we employ two modules for feature extraction: shared weight modules and non-shared weight modules. In the shared weight modules, we use SCConv to extract common features from different degradations. In the non-shared weight modules, we introduce the High-Low Frequency Decomposition Block (HLFDB), which employs different methods to handle high-low frequency information, enabling the model to address different degradations more effectively. Compared to other networks, our method takes into account the characteristics of different degradations, thus achieving higher-quality image restoration.
Submitted 24 April, 2024; v1 submitted 21 April, 2024;
originally announced April 2024.
-
A Robust Ensemble Algorithm for Ischemic Stroke Lesion Segmentation: Generalizability and Clinical Utility Beyond the ISLES Challenge
Authors:
Ezequiel de la Rosa,
Mauricio Reyes,
Sook-Lei Liew,
Alexandre Hutton,
Roland Wiest,
Johannes Kaesmacher,
Uta Hanning,
Arsany Hakim,
Richard Zubal,
Waldo Valenzuela,
David Robben,
Diana M. Sima,
Vincenzo Anania,
Arne Brys,
James A. Meakin,
Anne Mickan,
Gabriel Broocks,
Christian Heitkamp,
Shengbo Gao,
Kongming Liang,
Ziji Zhang,
Md Mahfuzur Rahman Siddiquee,
Andriy Myronenko,
Pooya Ashtari,
Sabine Van Huffel
, et al. (33 additional authors not shown)
Abstract:
Diffusion-weighted MRI (DWI) is essential for stroke diagnosis, treatment decisions, and prognosis. However, image and disease variability hinder the development of generalizable AI algorithms with clinical value. We address this gap by presenting a novel ensemble algorithm derived from the 2022 Ischemic Stroke Lesion Segmentation (ISLES) challenge. ISLES'22 provided 400 patient scans with ischemic stroke from various medical centers, facilitating the development of a wide range of cutting-edge segmentation algorithms by the research community. Through collaboration with leading teams, we combined top-performing algorithms into an ensemble model that overcomes the limitations of individual solutions. Our ensemble model achieved superior ischemic lesion detection and segmentation accuracy on our internal test set compared to individual algorithms. This accuracy generalized well across diverse image and disease variables. Furthermore, the model excelled in extracting clinical biomarkers. Notably, in a Turing-like test, neuroradiologists consistently preferred the algorithm's segmentations over manual expert efforts, highlighting increased comprehensiveness and precision. Validation using a real-world external dataset (N=1686) confirmed the model's generalizability. The algorithm's outputs also demonstrated strong correlations with clinical scores (admission NIHSS and 90-day mRS) on par with or exceeding expert-derived results, underlining its clinical relevance. This study offers two key findings. First, we present an ensemble algorithm (https://github.com/Tabrisrei/ISLES22_Ensemble) that detects and segments ischemic stroke lesions on DWI across diverse scenarios on par with expert (neuro)radiologists. Second, we show the potential for biomedical challenge outputs to extend beyond the challenge's initial objectives, demonstrating their real-world clinical applicability.
Submitted 3 April, 2024; v1 submitted 28 March, 2024;
originally announced March 2024.
-
Towards High-quality HDR Deghosting with Conditional Diffusion Models
Authors:
Qingsen Yan,
Tao Hu,
Yuan Sun,
Hao Tang,
Yu Zhu,
Wei Dong,
Luc Van Gool,
Yanning Zhang
Abstract:
High Dynamic Range (HDR) images can be recovered from several Low Dynamic Range (LDR) images by existing Deep Neural Network (DNN) techniques. Despite the remarkable progress, DNN-based methods still generate ghosting artifacts when LDR images have saturation and large motion, which hinders potential applications in real-world scenarios. To address this challenge, we formulate the HDR deghosting problem as an image generation problem that leverages LDR features as the diffusion model's condition, consisting of the feature condition generator and the noise predictor. The feature condition generator employs attention and a Domain Feature Alignment (DFA) layer to transform the intermediate features to avoid ghosting artifacts. With the learned features as conditions, the noise predictor leverages a stochastic iterative denoising process for diffusion models to generate an HDR image by steering the sampling process. Furthermore, to mitigate semantic confusion caused by the saturation problem of LDR images, we design a sliding window noise estimator to sample smooth noise in a patch-based manner. In addition, an image space loss is proposed to avoid the color distortion of the estimated HDR results. We empirically evaluate our model on benchmark datasets for HDR imaging. The results demonstrate that our approach achieves state-of-the-art performance and generalizes well to real-world images.
Submitted 1 November, 2023;
originally announced November 2023.
-
Corpus Synthesis for Zero-shot ASR domain Adaptation using Large Language Models
Authors:
Hsuan Su,
Ting-Yao Hu,
Hema Swetha Koppula,
Raviteja Vemulapalli,
Jen-Hao Rick Chang,
Karren Yang,
Gautam Varma Mantena,
Oncel Tuzel
Abstract:
While Automatic Speech Recognition (ASR) systems are widely used in many real-world applications, they often do not generalize well to new domains and need to be finetuned on data from these domains. However, target-domain data are often not readily available. In this paper, we propose a new strategy for adapting ASR models to new target domains without any text or speech from those domains. To accomplish this, we propose a novel data synthesis pipeline that uses a Large Language Model (LLM) to generate a target domain text corpus, and a state-of-the-art controllable speech synthesis model to generate the corresponding speech. We propose a simple yet effective in-context instruction finetuning strategy to increase the effectiveness of LLM in generating text corpora for new domains. Experiments on the SLURP dataset show that the proposed method achieves an average relative word error rate improvement of 28% on unseen target domains without any performance drop in source domains.
Submitted 18 September, 2023;
originally announced September 2023.
-
Formation Control for Moving Target Enclosing via Relative Localization
Authors:
Xueming Liu,
Kunda Liu,
Tianjiang Hu,
Qingrui Zhang
Abstract:
In this paper, we investigate the problem of controlling multiple unmanned aerial vehicles (UAVs) to enclose a moving target in a distributed fashion based on relative distance and self-displacement measurements. A relative localization technique is developed based on the recursive least squares estimation (RLSE) technique with a forgetting factor to estimate both the "UAV-UAV" and "UAV-target" relative positions. The formation enclosing motion is planned using a coupled oscillator model, which generates desired motion for UAVs to distribute evenly on a circle. The coupled-oscillator-based motion can also facilitate the exponential convergence of relative localization due to its persistent excitation nature. Based on the generation strategy of desired formation pattern and relative localization estimates, a cooperative formation tracking control scheme is proposed, which enables the formation geometric center to asymptotically converge to the moving target. The asymptotic convergence performance is analyzed theoretically for both the relative localization technique and the formation control algorithm. Numerical simulations are provided to show the efficiency of the proposed algorithm. Experiments with three quadrotors tracking one target are conducted to evaluate the proposed target enclosing method on real platforms.
Submitted 28 July, 2023;
originally announced July 2023.
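Recursive least squares with a forgetting factor, the estimator family named in the abstract above, can be sketched in its textbook scalar form. This is not the paper's relative localization scheme: the single-parameter model `y = phi * theta`, the initial values, and the function name are assumptions chosen to keep the example short.

```python
def rls_scalar(samples, forgetting=0.95, theta0=0.0, p0=1000.0):
    """Estimate theta in y = phi * theta from (phi, y) pairs.

    Standard scalar RLS update with forgetting factor `forgetting` in (0, 1];
    smaller values discount old samples faster.
    """
    theta, p = theta0, p0
    for phi, y in samples:
        gain = p * phi / (forgetting + phi * p * phi)  # update gain
        theta = theta + gain * (y - phi * theta)       # correct with prediction error
        p = (p - gain * phi * p) / forgetting          # inflate covariance to forget
    return theta


true_theta = 3.0
regressors = [1.0, 2.0, -1.5, 0.5, 3.0] * 4  # keeps changing, so the data stay informative
samples = [(phi, true_theta * phi) for phi in regressors]
estimate = rls_scalar(samples)
```

With noise-free samples the estimate approaches `true_theta` within a few updates. The varying regressor sequence plays the role that persistent excitation plays in the abstract: if `phi` were always zero, no update could recover the parameter.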
-
A ChatGPT Aided Explainable Framework for Zero-Shot Medical Image Diagnosis
Authors:
Jiaxiang Liu,
Tianxiang Hu,
Yan Zhang,
Xiaotang Gai,
Yang Feng,
Zuozhu Liu
Abstract:
Zero-shot medical image classification is a critical process in real-world scenarios where we have limited access to all possible diseases or large-scale annotated data. It involves computing similarity scores between a query medical image and possible disease categories to determine the diagnostic result. Recent advances in pretrained vision-language models (VLMs) such as CLIP have shown great performance for zero-shot natural image recognition and exhibit benefits in medical applications. However, an explainable zero-shot medical image recognition framework with promising performance is yet under development. In this paper, we propose a novel CLIP-based zero-shot medical image classification framework supplemented with ChatGPT for explainable diagnosis, mimicking the diagnostic process performed by human experts. The key idea is to query large language models (LLMs) with category names to automatically generate additional cues and knowledge, such as disease symptoms or descriptions other than a single category name, to help provide more accurate and explainable diagnosis in CLIP. We further design specific prompts to enhance the quality of generated texts by ChatGPT that describe visual medical features. Extensive results on one private dataset and four public datasets along with detailed analysis demonstrate the effectiveness and explainability of our training-free zero-shot diagnosis pipeline, corroborating the great potential of VLMs and LLMs for medical applications.
Submitted 4 July, 2023;
originally announced July 2023.
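The similarity-scoring step described in the abstract above (comparing a query image with candidate categories) can be sketched with toy vectors. No real CLIP model or ChatGPT call is involved: the embeddings, the category names, and the temperature value are all hypothetical placeholders standing in for encoder outputs.

```python
from math import exp, sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def zero_shot_scores(image_embedding, text_embeddings, temperature=0.07):
    """Softmax over temperature-scaled cosine similarities, one score per category."""
    logits = {
        name: cosine(image_embedding, emb) / temperature
        for name, emb in text_embeddings.items()
    }
    peak = max(logits.values())  # subtract the maximum for numerical stability
    weights = {name: exp(value - peak) for name, value in logits.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Hypothetical embeddings; in practice these would come from image and text encoders.
image = [0.9, 0.1, 0.2]
categories = {
    "category_a": [1.0, 0.0, 0.1],
    "category_b": [0.0, 1.0, 0.0],
    "category_c": [0.1, 0.2, 1.0],
}
scores = zero_shot_scores(image, categories)
prediction = max(scores, key=scores.get)
```

In a framework like the one described, each category would be represented by text richer than a bare name (for example, generated symptom descriptions), but the scoring step would keep this general shape.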
-
ToothSegNet: Image Degradation meets Tooth Segmentation in CBCT Images
Authors:
Jiaxiang Liu,
Tianxiang Hu,
Yang Feng,
Wanghui Ding,
Zuozhu Liu
Abstract:
In computer-assisted orthodontics, three-dimensional tooth models are required for many medical treatments. Tooth segmentation from cone-beam computed tomography (CBCT) images is a crucial step in constructing the models. However, CBCT image quality problems such as metal artifacts and blurring caused by shooting equipment and patients' dental conditions make the segmentation difficult. In this paper, we propose ToothSegNet, a new framework which acquaints the segmentation model with generated degraded images during training. ToothSegNet merges the information of high and low quality images from the designed degradation simulation module using channel-wise cross fusion to reduce the semantic gap between encoder and decoder, and also refines the shape of tooth prediction through a structural constraint loss. Experimental results suggest that ToothSegNet produces more precise segmentation and outperforms the state-of-the-art medical image segmentation methods.
Submitted 4 July, 2023;
originally announced July 2023.
-
Text is All You Need: Personalizing ASR Models using Controllable Speech Synthesis
Authors:
Karren Yang,
Ting-Yao Hu,
Jen-Hao Rick Chang,
Hema Swetha Koppula,
Oncel Tuzel
Abstract:
Adapting generic speech recognition models to specific individuals is a challenging problem due to the scarcity of personalized data. Recent works have proposed boosting the amount of training data using personalized text-to-speech synthesis. Here, we ask two fundamental questions about this strategy: when is synthetic data effective for personalization, and why is it effective in those cases? To address the first question, we adapt a state-of-the-art automatic speech recognition (ASR) model to target speakers from four benchmark datasets representative of different speaker types. We show that ASR personalization with synthetic data is effective in all cases, but particularly when (i) the target speaker is underrepresented in the global data, and (ii) the capacity of the global model is limited. To address the second question of why personalized synthetic data is effective, we use controllable speech synthesis to generate speech with varied styles and content. Surprisingly, we find that the text content of the synthetic data, rather than style, is important for speaker adaptation. These results lead us to propose a data selection strategy for ASR personalization based on speech content.
Submitted 26 March, 2023;
originally announced March 2023.
-
EASpace: Enhanced Action Space for Policy Transfer
Authors:
Zheng Zhang,
Qingrui Zhang,
Bo Zhu,
Xiaohan Wang,
Tianjiang Hu
Abstract:
Formulating expert policies as macro actions promises to alleviate the long-horizon issue via structured exploration and efficient credit assignment. However, traditional option-based multi-policy transfer methods suffer from inefficient exploration of macro action's length and insufficient exploitation of useful long-duration macro actions. In this paper, a novel algorithm named EASpace (Enhanced Action Space) is proposed, which formulates macro actions in an alternative form to accelerate the learning process using multiple available sub-optimal expert policies. Specifically, EASpace formulates each expert policy into multiple macro actions with different execution times. All the macro actions are then integrated into the primitive action space directly. An intrinsic reward, which is proportional to the execution time of macro actions, is introduced to encourage the exploitation of useful macro actions. The corresponding learning rule that is similar to Intra-option Q-learning is employed to improve the data efficiency. Theoretical analysis is presented to show the convergence of the proposed learning rule. The efficiency of EASpace is illustrated by a grid-based game and a multi-agent pursuit problem. The proposed algorithm is also implemented in physical systems to validate its effectiveness.
Submitted 24 July, 2023; v1 submitted 7 December, 2022;
originally announced December 2022.
-
Adaptive De-noising of Photoacoustic Signal and Image based on Modified Kalman Filter
Authors:
Tianqu Hu,
Zihao Huang,
Peng Ge,
Feng Gao,
Fei Gao
Abstract:
As a burgeoning medical imaging method based on hybrid fusion of light and ultrasound, photoacoustic imaging (PAI) has demonstrated high potential in various biomedical applications recently, especially in revealing the functional and molecular information to improve diagnostic accuracy. However, stemming from weak amplitude and unavoidable random noise, caused by limited laser power and severe attenuation in deep tissue imaging, PA signals are usually of low signal-to-noise ratio (SNR), and reconstructed PA images are of low quality. Although the conventional Kalman Filter (KF) can remove Gaussian noise in the time domain, its fixed model lacks adaptability under real-time estimation conditions. Moreover, KF-based de-noising algorithms have not been applied in PAI before. In this paper, we propose an adaptive Modified Kalman Filter (MKF) targeted at PAI de-noising by tuning the system noise matrix Q and the measurement noise matrix R in the conventional KF model. Additionally, in order to compensate for the signal skewing caused by KF, we cascade the backward part of the Rauch-Tung-Striebel smoother (BRTS), which also utilizes the newly determined Q. Finally, as a supplement, we add a commonly used differential filter to remove in-band reflection artifacts. Experimental results using phantom and ex vivo colorectal tissue are provided to prove the validity of the algorithm.
Submitted 18 November, 2022;
originally announced November 2022.
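The roles of Q and R in the abstract above can be seen in a minimal scalar Kalman filter. This is the conventional fixed-parameter filter, not the adaptive MKF or the BRTS smoother: the random-walk state model, the default values of `q` and `r`, and the test signal are assumptions made for illustration.

```python
def kalman_denoise(measurements, q=1e-4, r=0.25, x0=0.0, p0=1.0):
    """Scalar Kalman filter for a slowly varying signal (random-walk model).

    q: process (system) noise variance; larger values track changes faster.
    r: measurement noise variance; larger values smooth more aggressively.
    """
    x, p = x0, p0
    estimates = []
    for z in measurements:
        p = p + q                # predict: uncertainty grows by the process noise
        k = p / (p + r)          # Kalman gain balances prediction and measurement
        x = x + k * (z - x)      # update the estimate with the innovation
        p = (1.0 - k) * p        # update the estimate variance
        estimates.append(x)
    return estimates


# Constant signal of 1.0 with a deterministic alternating disturbance of +/-0.5.
noisy = [1.0 + (0.5 if k % 2 == 0 else -0.5) for k in range(50)]
smoothed = kalman_denoise(noisy)
```

Because `q` and `r` are fixed here, the smoothing strength never changes; an adaptive scheme of the kind the abstract describes would adjust them to the signal instead.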
-
I see what you hear: a vision-inspired method to localize words
Authors:
Mohammad Samragh,
Arnav Kundu,
Ting-Yao Hu,
Minsik Cho,
Aman Chadha,
Ashish Shrivastava,
Oncel Tuzel,
Devang Naik
Abstract:
This paper explores the possibility of using visual object detection techniques for word localization in speech data. Object detection has been thoroughly studied in the contemporary literature for visual data. Noting that an audio signal can be interpreted as a 1-dimensional image, object localization techniques can be fundamentally useful for word localization. Building upon this idea, we propose a lightweight solution for word detection and localization. We use bounding box regression for word localization, which enables our model to detect the occurrence, offset, and duration of keywords in a given audio stream. We experiment with LibriSpeech and train a model to localize 1000 words. Compared to existing work, our method reduces model size by 94%, and improves the F1 score by 6.5%.
Submitted 24 October, 2022;
originally announced October 2022.
-
Multi-robot Cooperative Pursuit via Potential Field-Enhanced Reinforcement Learning
Authors:
Zheng Zhang,
Xiaohan Wang,
Qingrui Zhang,
Tianjiang Hu
Abstract:
It is of great challenge, though promising, to coordinate collective robots for hunting an evader in a decentralized manner purely in light of local observations. In this paper, this challenge is addressed by a novel hybrid cooperative pursuit algorithm that combines reinforcement learning with the artificial potential field method. In the proposed algorithm, decentralized deep reinforcement learning is employed to learn cooperative pursuit policies that are adaptive to dynamic environments. The artificial potential field method is integrated into the learning process as predefined rules to improve the data efficiency and generalization ability. It is shown by numerical simulations that the proposed hybrid design outperforms the pursuit policies either learned from vanilla reinforcement learning or designed by the potential field method. Furthermore, experiments are conducted by transferring the learned pursuit policies into real-world mobile robots. Experimental results demonstrate the feasibility and potential of the proposed algorithm in learning multiple cooperative pursuit strategies.
Submitted 9 March, 2022;
originally announced March 2022.
-
A spectral-spatial fusion anomaly detection method for hyperspectral imagery
Authors:
Zengfu Hou,
Siyuan Cheng,
Ting Hu
Abstract:
In hyperspectral imagery, high-quality spectral signals convey subtle spectral differences to distinguish similar materials, thereby providing a unique advantage for anomaly detection. Hence fine spectra of anomalous pixels can be effectively screened out from heterogeneous background pixels. Since the same materials have similar characteristics in the spatial and spectral dimensions, detection performance can be significantly enhanced by combining spatial and spectral information. In this paper, a spectral-spatial fusion anomaly detection (SSFAD) method is proposed for hyperspectral imagery. First, original spectral signals are mapped to a local linear background space composed of median and mean with high confidence, where saliency weight and feature enhancement strategies are implemented to obtain an initial detection map in the spectral domain. Furthermore, to make full use of the similarity information of the local background around the testing pixel, a new detector is designed to extract the local similarity spatial features of patch images in the spatial domain. Finally, anomalies are detected by adaptively combining the spectral and spatial detection maps. The experimental results demonstrate that our proposed method achieves better detection performance than traditional methods.
Submitted 23 February, 2022;
originally announced February 2022.
-
DiriNet: A network to estimate the spatial and spectral degradation functions
Authors:
Ting Hu
Abstract:
The spatial and spectral degradation functions are critical to hyper- and multi-spectral image fusion. However, little work has been devoted to the estimation of the degradation functions. To learn the spatial response function and the point spread function from the image pairs to be fused, we propose a Dirichlet network, where both functions are properly constrained. Specifically, the spatial response function is constrained with positivity, while the Dirichlet distribution along with a total variation is imposed on the point spread function. To the best of our knowledge, this is the first time that a neural network and Dirichlet regularization have been investigated for estimating the degradation functions. Both image degradation and fusion experiments demonstrate the effectiveness and superiority of the proposed Dirichlet network.
Submitted 27 January, 2022;
originally announced January 2022.
-
KFWC: A Knowledge-Driven Deep Learning Model for Fine-grained Classification of Wet-AMD
Authors:
Haihong E,
Jiawen He,
Tianyi Hu,
Lifei Wang,
Lifei Yuan,
Ruru Zhang,
Meina Song
Abstract:
Automated diagnosis using deep neural networks can help ophthalmologists detect the blinding eye disease wet Age-related Macular Degeneration (AMD). Wet-AMD has two similar subtypes, Neovascular AMD and Polypoidal Choroidal Vessels (PCV). However, due to the difficulty in data collection and the similarity between images, most studies have only achieved the coarse-grained classification of wet-AMD rather than a finer-grained one of wet-AMD subtypes. To solve this issue, in this paper we propose a Knowledge-driven Fine-grained Wet-AMD Classification Model (KFWC) to classify fine-grained diseases with insufficient data. By introducing a priori knowledge of 10 lesion signs of the input images into the KFWC, we aim to accelerate the KFWC by means of multi-label classification pre-training, to locate the decisive image features in the fine-grained disease classification task, and therefore to achieve better classification. At the same time, the KFWC provides good interpretability and effectively alleviates the pressure of data collection and annotation in the field of fine-grained disease classification for wet-AMD. The experiments demonstrate the effectiveness of the KFWC, which reaches 99.71% in AU-ROC scores, and its considerable improvements over the data-driven model without knowledge and over ophthalmologists, with gains of 6.69% over the strongest baseline and 4.14% over ophthalmologists.
Submitted 23 December, 2021;
originally announced December 2021.
-
A Latent Encoder Coupled Generative Adversarial Network (LE-GAN) for Efficient Hyperspectral Image Super-resolution
Authors:
Yue Shi,
Liangxiu Han,
Lianghao Han,
Sheng Chang,
Tongle Hu,
Darren Dancey
Abstract:
Realistic hyperspectral image (HSI) super-resolution (SR) techniques aim to generate a high-resolution (HR) HSI with higher spectral and spatial fidelity from its low-resolution (LR) counterpart. The generative adversarial network (GAN) has proven to be an effective deep learning framework for image super-resolution. However, the optimisation process of existing GAN-based models frequently suffers from the problem of mode collapse, leading to a limited capacity for spectral-spatial invariant reconstruction. This may cause spectral-spatial distortion in the generated HSI, especially with a large upscaling factor. To alleviate the problem of mode collapse, this work proposes a novel GAN model coupled with a latent encoder (LE-GAN), which can map the generated spectral-spatial features from the image space to the latent space and produce a coupling component to regularise the generated samples. Essentially, we treat an HSI as a high-dimensional manifold embedded in a latent space. Thus, the optimisation of GAN models is converted to the problem of learning the distributions of high-resolution HSI samples in the latent space, making the distributions of the generated super-resolution HSIs closer to those of their original high-resolution counterparts. We have conducted experimental evaluations on the model performance of super-resolution and its capability in alleviating mode collapse. The proposed approach has been tested and validated on two real HSI datasets from different sensors (i.e. AVIRIS and UHD-185) for various upscaling factors and added noise levels, and compared with state-of-the-art super-resolution models (i.e. HyCoNet, LTTR, BAGAN, SRGAN, WGAN).
Submitted 16 November, 2021;
originally announced November 2021.
-
Synt++: Utilizing Imperfect Synthetic Data to Improve Speech Recognition
Authors:
Ting-Yao Hu,
Mohammadreza Armandpour,
Ashish Shrivastava,
Jen-Hao Rick Chang,
Hema Koppula,
Oncel Tuzel
Abstract:
With recent advances in speech synthesis, synthetic data is becoming a viable alternative to real data for training speech recognition models. However, machine learning with synthetic data is not trivial due to the gap between the synthetic and the real data distributions. Synthetic datasets may contain artifacts that do not exist in real data such as structured noise, content errors, or unrealistic speaking styles. Moreover, the synthesis process may introduce a bias due to uneven sampling of the data manifold. We propose two novel techniques during training to mitigate the problems due to the distribution gap: (i) a rejection sampling algorithm and (ii) using separate batch normalization statistics for the real and the synthetic samples. We show that these methods significantly improve the training of speech recognition models using synthetic data. We evaluate the proposed approach on keyword detection and Automatic Speech Recognition (ASR) tasks, and observe up to 18% and 13% relative error reduction, respectively, compared to naively using the synthetic data.
Submitted 21 October, 2021;
originally announced October 2021.
-
Scalable Perception-Action-Communication Loops with Convolutional and Graph Neural Networks
Authors:
Ting-Kuei Hu,
Fernando Gama,
Tianlong Chen,
Wenqing Zheng,
Zhangyang Wang,
Alejandro Ribeiro,
Brian M. Sadler
Abstract:
In this paper, we present a perception-action-communication loop design using Vision-based Graph Aggregation and Inference (VGAI). This multi-agent decentralized learning-to-control framework maps raw visual observations to agent actions, aided by local communication among neighboring agents. Our framework is implemented by a cascade of a convolutional and a graph neural network (CNN / GNN), addressing agent-level visual perception and feature learning, as well as swarm-level communication, local information aggregation and agent action inference, respectively. By jointly training the CNN and GNN, image features and communication messages are learned in conjunction to better address the specific task. We use imitation learning to train the VGAI controller in an offline phase, relying on a centralized expert controller. This results in a learned VGAI controller that can be deployed in a distributed manner for online execution. Additionally, the controller exhibits good scaling properties, with training in smaller teams and application in larger teams. Through a multi-agent flocking application, we demonstrate that VGAI yields performance comparable to or better than other decentralized controllers, using only the visual input modality and without accessing precise location or motion state information.
Submitted 5 November, 2021; v1 submitted 24 June, 2021;
originally announced June 2021.
-
A Statistical Model for Melody Reduction
Authors:
Tianxue Hu,
Claire Arthur
Abstract:
A commonly-cited reason for the poor performance of automatic chord estimation (ACE) systems within music information retrieval (MIR) is that non-chord tones (i.e., notes outside the supporting harmony) contribute to error during the labeling process. Despite the prevalence of machine learning approaches in MIR, there are cases where alternative approaches provide a simpler alternative while allowing for insights into musicological practices. In this project, we present a statistical model for predicting chord tones based on music theory rules. Our model is currently focused on predicting chord tones in classical music, since composition in this style is highly constrained, theoretically making the placement of chord tones highly predictable. Indeed, music theorists have labeling systems for every variety of non-chord tone, primarily classified by the note's metric position and intervals of approach and departure. Using metric position, duration, and melodic intervals as predictors, we build a statistical model for predicting chord tones using the TAVERN dataset. While our probabilistic approach is similar to other efforts in the domain of automatic harmonic analysis, our focus is on melodic reduction rather than predicting harmony. However, we hope to pursue applications for ACE in the future. Finally, we implement our melody reduction model using an existing symbolic visualization tool, to assist with melody reduction and non-chord tone identification for computational musicology researchers and music theorists.
Submitted 11 May, 2021;
originally announced May 2021.
-
Collaborative Target Tracking in Elliptic Coordinates: a Binocular Coordination Approach
Authors:
Yuan Chang,
Zhiyong Sun,
Han Zhou,
Xiangke Wang,
Lincheng Shen,
Tianjiang Hu
Abstract:
This paper concentrates on the collaborative target tracking control of a pair of tracking vehicles with formation constraints. The proposed controller requires only distance measurements between tracking vehicles and the target. Its novelty lies in two aspects: 1) the elliptic coordinates are used to represent an arbitrary tracking formation without singularity, which can be deduced from inter-agent distances, and 2) the regulation of the tracking vehicle system obeys a binocular coordination principle, which simplifies the design of the control law by leveraging rich physical meanings of elliptic coordinates. The tracking system with the proposed controller is proven to be exponentially convergent when the target is stationary. When the target drifts with a small velocity, the desired tracking formation is achieved within a small margin proportional to the magnitude of the target's drift velocity. Simulation examples are provided to demonstrate the tracking performance of the proposed controller.
Submitted 21 September, 2020;
originally announced September 2020.
-
Localizing the Common Action Among a Few Videos
Authors:
Pengwan Yang,
Vincent Tao Hu,
Pascal Mettes,
Cees G. M. Snoek
Abstract:
This paper strives to localize the temporal extent of an action in a long untrimmed video. Where existing work leverages many examples with their start, their ending, and/or the class of the action during training time, we propose few-shot common action localization. The start and end of an action in a long untrimmed video are determined based on just a handful of trimmed video examples containing the same action, without knowing their common class label. To address this task, we introduce a new 3D convolutional network architecture able to align representations from the support videos with the relevant query video segments. The network contains: (i) a mutual enhancement module to simultaneously complement the representation of the few trimmed support videos and the untrimmed query video; (ii) a progressive alignment module that iteratively fuses the support videos into the query branch; and (iii) a pairwise matching module to weigh the importance of different support videos. Evaluation of few-shot common action localization in untrimmed videos containing a single or multiple action instances demonstrates the effectiveness and general applicability of our proposal.
Submitted 25 August, 2020; v1 submitted 13 August, 2020;
originally announced August 2020.
-
Unsupervised Style and Content Separation by Minimizing Mutual Information for Speech Synthesis
Authors:
Ting-Yao Hu,
Ashish Shrivastava,
Oncel Tuzel,
Chandra Dhir
Abstract:
We present a method to generate speech from input text and a style vector that is extracted from a reference speech signal in an unsupervised manner, i.e., no style annotation, such as speaker information, is required. Existing unsupervised methods, during training, generate speech by computing style from the corresponding ground truth sample and use a decoder to combine the style vector with the input text. Training the model in such a way leaks content information into the style vector. The decoder can use the leaked content and ignore some of the input text to minimize the reconstruction loss. At inference time, when the reference speech does not match the content input, the output may not contain all of the content of the input text. We refer to this problem as "content leakage", which we address by explicitly estimating and minimizing the mutual information between the style and the content through an adversarial training formulation. We call our method MIST - Mutual Information based Style Content Separation. The main goal of the method is to preserve the input content in the synthesized speech signal, which we measure by the word error rate (WER) and show substantial improvements over state-of-the-art unsupervised speech synthesis methods.
Submitted 9 March, 2020;
originally announced March 2020.
-
Flyback-Based Multiple Output dc-dc Converter with Independent Voltage Regulation
Authors:
M. Tahan,
D. Bamgboje,
T. Hu
Abstract:
This paper proposes a new single input multiple output power supply by integrating a flyback converter and several buck converters. The flyback converter works as the main regulator, and the buck converters provide series voltage compensation with the aim of tight regulation. A time multiplexing switching scheme is proposed to deliver multiple output voltage levels via a two winding transformer and to eliminate the cross regulation between output channels. This configuration reduces the size of flyback transformer and filter capacitors, and consequently improves the overall form factor. A detailed steady state analysis is conducted on the circuit to obtain the design criteria. A three output channel power supply is designed and the effectiveness of the proposed configuration is validated via simulation with a MATLAB/Simscape model. Simulation results also demonstrate satisfactory transient response to load changes.
Submitted 9 February, 2020;
originally announced February 2020.
-
VGAI: End-to-End Learning of Vision-Based Decentralized Controllers for Robot Swarms
Authors:
Ting-Kuei Hu,
Fernando Gama,
Tianlong Chen,
Zhangyang Wang,
Alejandro Ribeiro,
Brian M. Sadler
Abstract:
Decentralized coordination of a robot swarm requires addressing the tension between local perceptions and actions, and the accomplishment of a global objective. In this work, we propose to learn decentralized controllers based solely on raw visual inputs. For the first time, this integrates the learning of two key components, communication and visual perception, in one end-to-end framework. More specifically, we consider that each robot has access to a visual perception of the immediate surroundings, and communication capabilities to transmit and receive messages from other neighboring robots. Our proposed learning framework combines a convolutional neural network (CNN) for each robot to extract messages from the visual inputs, and a graph neural network (GNN) over the entire swarm to transmit, receive and process these messages in order to decide on actions. The use of a GNN and locally-run CNNs results naturally in a decentralized controller. We jointly train the CNNs and the GNN so that each robot learns to extract messages from the images that are adequate for the team as a whole. Our experiments demonstrate the proposed architecture in the problem of drone flocking and show its promising performance and scalability, e.g., achieving successful decentralized flocking for large-sized swarms consisting of up to 75 drones.
Submitted 10 December, 2020; v1 submitted 6 February, 2020;
originally announced February 2020.
-
Multiple string LED driver with flexible and high performance PWM dimming control
Authors:
M. Tahan,
T. Hu
Abstract:
The main objectives in driving multiple LED strings include achieving uniform current control and high performance PWM dimming for all strings. This work proposes a new multiple string LED driver to achieve not only current balance, but also a flexible, wide-range PWM dimming ratio for each string. A compact single-inductor multiple-output topology is adopted in the driver, accompanied by synchronous integrators and variable dimming frequency, to achieve both high efficiency and high performance dimming. With the proposed variable dimming frequency scheme, a high dimming frequency is applied to a string with a high dimming ratio, which helps to keep the deviation of the LED string current within an acceptable range, while a low dimming frequency is applied to a string with a low dimming ratio, which helps to achieve a rectangular LED current waveform. Meanwhile, the new time multiplexing control scheme automatically optimizes the LED strings' bus voltages, thus minimizing each string's power loss. A three string LED driver prototype is constructed to validate the effectiveness of the proposed control scheme, where the three strings can have different dimming ratios between 4% and 100%.
Submitted 31 January, 2020;
originally announced February 2020.
-
Screening for REM Sleep Behaviour Disorder with Minimal Sensors
Authors:
Navin Cooray,
Fernando Andreotti,
Christine Lo,
Mkael Symmonds,
Michele T. M. Hu,
Maarten De Vos
Abstract:
Rapid-Eye-Movement (REM) sleep behaviour disorder (RBD) is an early predictor of Parkinson's disease, dementia with Lewy bodies, and multiple system atrophy. This study investigates a minimal set of sensors to achieve effective screening for RBD in the population, integrating automated sleep staging (three state) followed by RBD detection without the need for cumbersome electroencephalogram (EEG) sensors. Polysomnography signals from 50 participants with RBD and 50 age-matched healthy controls were used to evaluate this study. Three-stage sleep classification was achieved using a Random Forest (RF) classifier and features derived from a combination of cost-effective and easy-to-use sensors, namely electrocardiogram (ECG), electrooculogram (EOG), and electromyogram (EMG) channels. Subsequently, RBD detection was achieved using established and new metrics derived from ECG and EMG metrics. The EOG and EMG combination provided the best minimalist fully automated performance, achieving $0.57\pm0.19$ kappa (3 stage) for sleep staging and an RBD detection accuracy of $0.90\pm0.11$ (sensitivity and specificity of $0.88\pm0.13$ and $0.92\pm0.098$, respectively). A single ECG sensor allowed three-state sleep staging with $0.28\pm0.06$ kappa and an RBD detection accuracy of $0.62\pm0.10$. This study demonstrated the feasibility of using signals from a single EOG and EMG sensor to detect RBD using fully-automated techniques. This study proposes a cost-effective, practical, and simple RBD identification support tool using only two sensors (EMG and EOG), ideal for screening purposes.
Submitted 24 October, 2019;
originally announced October 2019.
-
A Radio Signal Modulation Recognition Algorithm Based on Residual Networks and Attention Mechanisms
Authors:
Ruisen Luo,
Tao Hu,
Zuodong Tang,
Chen Wang,
Xiaofeng Gong,
Haiyan Tu
Abstract:
To solve the problem of inaccurate recognition of communication signal modulation types, an RNN recognition algorithm combining a residual block network with an attention mechanism is proposed. In this method, 10 kinds of communication signals with Gaussian white noise are generated from standard data sets, such as MASK, MPSK, MFSK, OFDM, 16QAM, AM and FM. A residual block network is added to the original RNN to solve the vanishing gradient problem caused by deep network layers. An attention mechanism is added to the network to accelerate gradient descent. In the experiment, 16QAM, 2FSK and 4FSK are used as actual samples, IQ data frames of the signals are used as input, and the RNN combined with the residual block network and attention mechanism is trained. The final recognition results show that the average recognition rate of real-time signals is over 93%. The network is highly robust and has good practical value.
Submitted 26 September, 2019;
originally announced September 2019.
-
Detection of REM Sleep Behaviour Disorder by Automated Polysomnography Analysis
Authors:
Navin Cooray,
Fernando Andreotti,
Christine Lo,
Mkael Symmonds,
Michele T. M. Hu,
Maarten De Vos
Abstract:
Evidence suggests Rapid-Eye-Movement (REM) Sleep Behaviour Disorder (RBD) is an early predictor of Parkinson's disease. This study proposes a fully-automated framework for RBD detection consisting of automated sleep staging followed by RBD identification. The framework was assessed using a limited polysomnography montage from 53 participants with RBD and 53 age-matched healthy controls. Sleep stage classification was achieved using a Random Forest (RF) classifier and 156 features extracted from electroencephalogram (EEG), electrooculogram (EOG) and electromyogram (EMG) channels. For RBD detection, an RF classifier was trained combining established techniques to quantify muscle atonia with additional features that incorporate sleep architecture and the EMG fractal exponent. Automated multi-state sleep staging achieved a 0.62 Cohen's Kappa score. RBD detection accuracy improved by 10% to 96% (compared to individual established metrics) when using manually annotated sleep staging. Accuracy remained high (92%) when using automated sleep staging. This study outperforms established metrics and demonstrates that incorporating sleep architecture and sleep stage transitions can benefit RBD detection. This study also achieved automated sleep staging with a level of accuracy comparable to manual annotation. This study validates a tractable, fully-automated, and sensitive pipeline for RBD identification that could be translated to wearable take-home technology.
Submitted 12 November, 2018;
originally announced November 2018.
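The sleep-staging results above are reported as Cohen's kappa, which corrects raw agreement for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below is a minimal illustration of that formula; the function name `cohens_kappa` and the toy label sequences are hypothetical and are not data from the study.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(labels_a)
    # Observed agreement: fraction of epochs where both raters agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)

# Toy example: W = wake, N = non-REM, R = REM (illustrative labels only).
manual = ["W", "W", "N", "N", "R", "R", "N", "W"]
auto = ["W", "N", "N", "N", "R", "N", "N", "W"]
kappa = cohens_kappa(manual, auto)
```

For these toy labels the raw agreement is 6/8 = 0.75, the chance agreement is 23/64, and kappa works out to 25/41, roughly 0.61, which shows how kappa can sit well below raw accuracy when one class dominates.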