-
The Interpretation Gap in Text-to-Music Generation Models
Authors:
Yongyi Zang,
Yixiao Zhang
Abstract:
Large-scale text-to-music generation models have significantly enhanced music creation capabilities, offering unprecedented creative freedom. However, their ability to collaborate effectively with human musicians remains limited. In this paper, we propose a framework to describe the musical interaction process, which includes expression, interpretation, and execution of controls. Following this framework, we argue that the primary gap between existing text-to-music models and musicians lies in the interpretation stage, where models lack the ability to interpret controls from musicians. We also propose two strategies to address this gap and call on the music information retrieval community to tackle the interpretation challenge to improve human-AI musical collaboration.
Submitted 14 July, 2024;
originally announced July 2024.
-
HPC: Hierarchical Progressive Coding Framework for Volumetric Video
Authors:
Zihan Zheng,
Houqiang Zhong,
Qiang Hu,
Xiaoyun Zhang,
Li Song,
Ya Zhang,
Yanfeng Wang
Abstract:
Volumetric video based on Neural Radiance Field (NeRF) holds vast potential for various 3D applications, but its substantial data volume poses significant challenges for compression and transmission. Current NeRF compression lacks the flexibility to adjust video quality and bitrate within a single model for various network and device capacities. To address these issues, we propose HPC, a novel hierarchical progressive volumetric video coding framework achieving variable bitrate using a single model. Specifically, HPC introduces a hierarchical representation with a multi-resolution residual radiance field to reduce temporal redundancy in long-duration sequences while simultaneously generating various levels of detail. Then, we propose an end-to-end progressive learning approach with a multi-rate-distortion loss function to jointly optimize both hierarchical representation and compression. Our HPC trained only once can realize multiple compression levels, while the current methods need to train multiple fixed-bitrate models for different rate-distortion (RD) tradeoffs. Extensive experiments demonstrate that HPC achieves flexible quality levels with variable bitrate by a single model and exhibits competitive RD performance, even outperforming fixed-bitrate models across various datasets.
Submitted 12 July, 2024;
originally announced July 2024.
-
DSCENet: Dynamic Screening and Clinical-Enhanced Multimodal Fusion for MPNs Subtype Classification
Authors:
Yuan Zhang,
Yaolei Qi,
Xiaoming Qi,
Yongyue Wei,
Guanyu Yang
Abstract:
The precise subtype classification of myeloproliferative neoplasms (MPNs) based on multimodal information, which assists clinicians in diagnosis and long-term treatment plans, is of great clinical significance. However, it remains a highly challenging task due to the lack of diagnostic representativeness for local patches and the absence of diagnostic-relevant features from a single modality. In this paper, we propose a Dynamic Screening and Clinical-Enhanced Network (DSCENet) for the subtype classification of MPNs on the multimodal fusion of whole slide images (WSIs) and clinical information. (1) A dynamic screening module is proposed to flexibly adapt the feature learning of local patches, reducing the interference of irrelevant features and enhancing their diagnostic representativeness. (2) A clinical-enhanced fusion module is proposed to integrate clinical indicators to explore complementary features across modalities, providing comprehensive diagnostic information. Our approach has been validated on real clinical data, achieving an increase of 7.91% AUC and 16.89% accuracy compared with the previous state-of-the-art (SOTA) methods. The code is available at https://github.com/yuanzhang7/DSCENet.
Submitted 11 July, 2024;
originally announced July 2024.
-
Pairwise Distance Distillation for Unsupervised Real-World Image Super-Resolution
Authors:
Yuehan Zhang,
Seungjun Lee,
Angela Yao
Abstract:
Standard single-image super-resolution creates paired training data from high-resolution images through fixed downsampling kernels. However, real-world super-resolution (RWSR) faces unknown degradations in the low-resolution inputs, all the while lacking paired training data. Existing methods approach this problem by learning blind general models through complex synthetic augmentations on training inputs; they sacrifice the performance on specific degradation for broader generalization to many possible ones. We address the unsupervised RWSR for a targeted real-world degradation. We study from a distillation perspective and introduce a novel pairwise distance distillation framework. Through our framework, a model specialized in synthetic degradation adapts to target real-world degradations by distilling intra- and inter-model distances across the specialized model and an auxiliary generalized model. Experiments on diverse datasets demonstrate that our method significantly enhances fidelity and perceptual quality, surpassing state-of-the-art approaches in RWSR. The source code is available at https://github.com/Yuehan717/PDD.
Submitted 9 July, 2024;
originally announced July 2024.
-
OFDM Achieves the Lowest Ranging Sidelobe Under Random ISAC Signaling
Authors:
Fan Liu,
Ying Zhang,
Yifeng Xiong,
Shuangyang Li,
Weijie Yuan,
Feifei Gao,
Shi Jin,
Giuseppe Caire
Abstract:
This paper aims to answer a fundamental question in the area of Integrated Sensing and Communications (ISAC): What is the optimal communication-centric ISAC waveform for ranging? Towards that end, we first established a generic framework to analyze the sensing performance of communication-centric ISAC waveforms built upon orthonormal signaling bases and random data symbols. Then, we evaluated their ranging performance by adopting both the periodic and aperiodic auto-correlation functions (P-ACF and A-ACF), and defined the expectation of the integrated sidelobe level (EISL) as a sensing performance metric. On top of that, we proved that among all communication waveforms with cyclic prefix (CP), the orthogonal frequency division multiplexing (OFDM) modulation is the only globally optimal waveform that achieves the lowest ranging sidelobe for quadrature amplitude modulation (QAM) and phase shift keying (PSK) constellations, in terms of both the EISL and the sidelobe level at each individual lag of the P-ACF. As a step forward, we proved that among all communication waveforms without CP, OFDM is a locally optimal waveform for QAM/PSK in the sense that it achieves a local minimum of the EISL of the A-ACF. Finally, we demonstrated by numerical results that under QAM/PSK constellations, there is no other orthogonal communication-centric waveform that achieves a lower ranging sidelobe level than that of the OFDM, in terms of both P-ACF and A-ACF cases.
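The P-ACF claim can be illustrated numerically in the simplest case. The sketch below is not the paper's proof and does not compute its EISL metric; it assumes unit-modulus PSK symbols, a 1/N-scaled inverse DFT as the OFDM modulator, and hypothetical helper names (`idft`, `periodic_acf`, `integrated_sidelobe_level`). Because the P-ACF is the inverse DFT of the power spectrum, and PSK symbols have a flat power spectrum, every non-zero lag vanishes up to floating-point error.

```python
import cmath
import random


def idft(freq_symbols):
    """Inverse DFT with 1/N scaling: maps subcarrier symbols to time samples."""
    n = len(freq_symbols)
    return [
        sum(freq_symbols[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def periodic_acf(x):
    """Periodic auto-correlation r[lag] = sum_t x[t] * conj(x[(t - lag) mod N])."""
    n = len(x)
    return [
        sum(x[t] * x[(t - lag) % n].conjugate() for t in range(n))
        for lag in range(n)
    ]


def integrated_sidelobe_level(x):
    """Sum of squared P-ACF magnitudes over all non-zero lags."""
    return sum(abs(r) ** 2 for r in periodic_acf(x)[1:])


# Assumed toy setup: 16 subcarriers carrying random QPSK (4-PSK) symbols.
rng = random.Random(0)
n_subcarriers, psk_order = 16, 4
symbols = [
    cmath.exp(2j * cmath.pi * rng.randrange(psk_order) / psk_order)
    for _ in range(n_subcarriers)
]

ofdm_signal = idft(symbols)  # one OFDM symbol in the time domain
ofdm_isl = integrated_sidelobe_level(ofdm_signal)
mainlobe = abs(periodic_acf(ofdm_signal)[0])
print(f"OFDM main lobe: {mainlobe:.3f}, integrated sidelobe level: {ofdm_isl:.2e}")
```

For QAM the symbol magnitudes vary, so the sidelobes are no longer exactly zero for a single realization; the abstract's optimality statement for that case concerns the expectation over random symbols.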
Submitted 9 July, 2024;
originally announced July 2024.
-
Potential of Multimodal Large Language Models for Data Mining of Medical Images and Free-text Reports
Authors:
Yutong Zhang,
Yi Pan,
Tianyang Zhong,
Peixin Dong,
Kangni Xie,
Yuxiao Liu,
Hanqi Jiang,
Zhengliang Liu,
Shijie Zhao,
Tuo Zhang,
Xi Jiang,
Dinggang Shen,
Tianming Liu,
Xin Zhang
Abstract:
Medical images and radiology reports are crucial for diagnosing medical conditions, highlighting the importance of quantitative analysis for clinical decision-making. However, the diversity and cross-source heterogeneity of these data challenge the generalizability of current data-mining methods. Multimodal large language models (MLLMs) have recently transformed many domains, significantly affecting the medical field. Notably, Gemini-Vision-series (Gemini) and GPT-4-series (GPT-4) models have epitomized a paradigm shift in Artificial General Intelligence (AGI) for computer vision, showcasing their potential in the biomedical domain. In this study, we exhaustively evaluated Gemini, GPT-4, and 4 popular large models across 14 medical imaging datasets, including 5 medical imaging categories (dermatology, radiology, dentistry, ophthalmology, and endoscopy), and 3 radiology report datasets. The investigated tasks encompass disease classification, lesion segmentation, anatomical localization, disease diagnosis, report generation, and lesion detection. Our experimental results demonstrated that Gemini-series models excelled in report generation and lesion detection but faced challenges in disease classification and anatomical localization. Conversely, GPT-series models exhibited proficiency in lesion segmentation and anatomical localization but encountered difficulties in disease diagnosis and lesion detection. Additionally, both the Gemini series and GPT series contain models that have demonstrated commendable generation efficiency. While both model series hold promise in reducing physician workload, alleviating pressure on limited healthcare resources, and fostering collaboration between clinical practitioners and artificial intelligence technologies, substantial enhancements and comprehensive validations remain imperative before clinical deployment.
Submitted 8 July, 2024;
originally announced July 2024.
-
Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition
Authors:
Ye Bai,
Jingping Chen,
Jitong Chen,
Wei Chen,
Zhuo Chen,
Chuang Ding,
Linhao Dong,
Qianqian Dong,
Yujiao Du,
Kepan Gao,
Lu Gao,
Yi Guo,
Minglun Han,
Ting Han,
Wenchao Hu,
Xinying Hu,
Yuxiang Hu,
Deyu Hua,
Lu Huang,
Mingkun Huang,
Youjia Huang,
Jishuo Jin,
Fanliu Kong,
Zongwei Lan,
Tianyu Li
, et al. (30 additional authors not shown)
Abstract:
Modern automatic speech recognition (ASR) models are required to accurately transcribe diverse speech signals (from different domains, languages, accents, etc.) given the specific contextual information in various application scenarios. Classic end-to-end models fused with extra language models perform well, but mainly in data matching scenarios and are gradually approaching a bottleneck. In this work, we introduce Seed-ASR, a large language model (LLM) based speech recognition model. Seed-ASR is developed based on the framework of audio conditioned LLM (AcLLM), leveraging the capabilities of LLMs by inputting continuous speech representations together with contextual information into the LLM. Through stage-wise large-scale training and the elicitation of context-aware capabilities in LLM, Seed-ASR demonstrates significant improvement over end-to-end models on comprehensive evaluation sets, including multiple domains, accents/dialects and languages. Additionally, Seed-ASR can be further deployed to support specific needs in various scenarios without requiring extra language models. Compared to recently released large ASR models, Seed-ASR achieves a 10%-40% reduction in word (or character, for Chinese) error rates on Chinese and English public test sets, further demonstrating its powerful performance.
Submitted 10 July, 2024; v1 submitted 5 July, 2024;
originally announced July 2024.
-
AI-Based Beam-Level and Cell-Level Mobility Management for High Speed Railway Communications
Authors:
Wen Li,
Wei Chen,
Shiyue Wang,
Yuanyuan Zhang,
Michail Matthaiou,
Bo Ai
Abstract:
High-speed railway (HSR) communications are pivotal for ensuring rail safety, operations, maintenance, and delivering passenger information services. The high speed of trains creates rapidly time-varying wireless channels, increases the signaling overhead, and reduces the system throughput, making it difficult to meet the growing and stringent needs of HSR applications. In this article, we explore artificial intelligence (AI)-based beam-level and cell-level mobility management suitable for HSR communications, including the use cases, inputs, outputs, and key performance indicators (KPIs) of AI models. Particularly, in comparison to traditional down-sampling spatial beam measurements, we show that the compressed spatial multi-beam measurements via compressive sensing lead to improved spatial-temporal beam prediction. Moreover, we demonstrate the performance gains of AI-assisted cell handover over traditional mobile handover mechanisms. In addition, we observe that the proposed approaches to reduce the measurement overhead achieve comparable radio link failure performance with the traditional approach that requires all the beam measurements of all cells, while the former methods can save 50% of the beam measurement overhead.
Submitted 5 July, 2024;
originally announced July 2024.
-
Perception-Guided Quality Metric of 3D Point Clouds Using Hybrid Strategy
Authors:
Yujie Zhang,
Qi Yang,
Yiling Xu,
Shan Liu
Abstract:
Full-reference point cloud quality assessment (FR-PCQA) aims to infer the quality of distorted point clouds with available references. Most of the existing FR-PCQA metrics ignore the fact that the human visual system (HVS) dynamically tackles visual information according to different distortion levels (i.e., distortion detection for high-quality samples and appearance perception for low-quality samples) and measure point cloud quality using unified features. To bridge the gap, in this paper, we propose a perception-guided hybrid metric (PHM) that adaptively leverages two visual strategies with respect to distortion degree to predict point cloud quality: to measure visible difference in high-quality samples, PHM takes into account the masking effect and employs texture complexity as an effective compensatory factor for absolute difference; on the other hand, PHM leverages spectral graph theory to evaluate appearance degradation in low-quality samples. Variations in geometric signals on graphs and changes in the spectral graph wavelet coefficients are utilized to characterize geometry and texture appearance degradation, respectively. Finally, the results obtained from the two components are combined in a non-linear method to produce an overall quality score of the tested point cloud. The results of the experiment on five independent databases show that PHM achieves state-of-the-art (SOTA) performance and offers significant performance improvement in multiple distortion environments. The code is publicly available at https://github.com/zhangyujie-1998/PHM.
Submitted 4 July, 2024;
originally announced July 2024.
-
Highly Accelerated MRI via Implicit Neural Representation Guided Posterior Sampling of Diffusion Models
Authors:
Jiayue Chu,
Chenhe Du,
Xiyue Lin,
Yuyao Zhang,
Hongjiang Wei
Abstract:
Reconstructing high-fidelity magnetic resonance (MR) images from under-sampled k-space is a commonly used strategy to reduce scan time. The posterior sampling of diffusion models based on the real measurement data holds significant promise of improved reconstruction accuracy. However, traditional posterior sampling methods often lack effective data consistency guidance, leading to inaccurate and unstable reconstructions. Implicit neural representation (INR) has emerged as a powerful paradigm for solving inverse problems by modeling a signal's attributes as a continuous function of spatial coordinates. In this study, we present a novel posterior sampler for diffusion models using INR, named DiffINR. The INR-based component incorporates both the diffusion prior distribution and the MRI physical model to ensure high data fidelity. DiffINR demonstrates superior performance on experimental datasets with remarkable accuracy, even under high acceleration factors (up to R=12 in single-channel reconstruction). Notably, our proposed framework can be a generalizable framework to solve inverse problems in other medical imaging tasks.
Submitted 2 July, 2024;
originally announced July 2024.
-
Coding-Enhanced Cooperative Jamming for Secret Communication in Fluid Antenna Systems
Authors:
Hao Xu,
Kai-Kit Wong,
Wee Kiat New,
Guyue Li,
Farshad Rostami Ghadi,
Yongxu Zhu,
Shi Jin,
Chan-Byoung Chae,
Yangyang Zhang
Abstract:
This letter investigates the secret communication problem for a fluid antenna system (FAS)-assisted wiretap channel, where the legitimate transmitter transmits an information-bearing signal to the legitimate receiver, and at the same time, transmits a jamming signal to interfere with the eavesdropper (Eve). Unlike the conventional jamming scheme, which usually transmits Gaussian noise that interferes not only with Eve but also with the legitimate receiver, in this letter, we consider that encoded codewords are transmitted to jam Eve. Then, by employing appropriate coding schemes, the legitimate receiver can successfully decode the jamming signal and then cancel the interference, while Eve cannot, even if it knows the codebooks. We aim to maximize the secrecy rate through port selection and power control. Although the problem is non-convex, we show that the optimal solution can be found. Simulation results show that by using the FAS technique and the proposed jamming scheme, the secrecy rate of the system can be significantly increased.
Submitted 2 July, 2024;
originally announced July 2024.
-
Occlusion-Aware Seamless Segmentation
Authors:
Yihong Cao,
Jiaming Zhang,
Hao Shi,
Kunyu Peng,
Yuhongxuan Zhang,
Hui Zhang,
Rainer Stiefelhagen,
Kailun Yang
Abstract:
Panoramic images can broaden the Field of View (FoV), occlusion-aware prediction can deepen the understanding of the scene, and domain adaptation can transfer across viewing domains. In this work, we introduce a novel task, Occlusion-Aware Seamless Segmentation (OASS), which simultaneously tackles all these three challenges. For benchmarking OASS, we establish a new human-annotated dataset for Blending Panoramic Amodal Seamless Segmentation, i.e., BlendPASS. Besides, we propose the first solution UnmaskFormer, aiming at unmasking the narrow FoV, occlusions, and domain gaps all at once. Specifically, UnmaskFormer includes the crucial designs of Unmasking Attention (UA) and Amodal-oriented Mix (AoMix). Our method achieves state-of-the-art performance on the BlendPASS dataset, reaching a remarkable mAPQ of 26.58% and mIoU of 43.66%. On public panoramic semantic segmentation datasets, i.e., SynPASS and DensePASS, our method outperforms previous methods and obtains 45.34% and 48.08% in mIoU, respectively. The fresh BlendPASS dataset and our source code will be made publicly available at https://github.com/yihong-97/OASS.
Submitted 2 July, 2024;
originally announced July 2024.
-
The USTC-NERCSLIP Systems for The ICMC-ASR Challenge
Authors:
Minghui Wu,
Luzhen Xu,
Jie Zhang,
Haitao Tang,
Yanyan Yue,
Ruizhi Liao,
Jintao Zhao,
Zhengzhe Zhang,
Yichi Wang,
Haoyin Yan,
Hongliang Yu,
Tongle Ma,
Jiachen Liu,
Chongliang Wu,
Yongchao Li,
Yanyong Zhang,
Xin Fang,
Yue Zhang
Abstract:
This report describes the system submitted to the In-Car Multi-Channel Automatic Speech Recognition (ICMC-ASR) challenge, which considers the ASR task with multi-speaker overlapping and Mandarin accent dynamics in the ICMC case. We implement the front-end speaker diarization using the self-supervised learning representation based multi-speaker embedding and beamforming using the speaker position, respectively. For ASR, we employ an iterative pseudo-label generation method based on a fusion model to obtain text labels of unsupervised data. To mitigate the impact of accent, an Accent-ASR framework is proposed, which captures pronunciation-related accent features at a fine-grained level and linguistic information at a coarse-grained level. On the ICMC-ASR eval set, the proposed system achieves a CER of 13.16% on track 1 and a cpCER of 21.48% on track 2, which significantly outperforms the official baseline system and obtains the first rank on both tracks.
Submitted 2 July, 2024;
originally announced July 2024.
-
FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds
Authors:
Yiming Zhang,
Yicheng Gu,
Yanhong Zeng,
Zhening Xing,
Yuancheng Wang,
Zhizheng Wu,
Kai Chen
Abstract:
We study Neural Foley, the automatic generation of high-quality sound effects synchronizing with videos, enabling an immersive audio-visual experience. Despite its wide range of applications, existing approaches encounter limitations when it comes to simultaneously synthesizing high-quality and video-aligned (i.e., semantically relevant and temporally synchronized) sounds. To overcome these limitations, we propose FoleyCrafter, a novel framework that leverages a pre-trained text-to-audio model to ensure high-quality audio generation. FoleyCrafter comprises two key components: the semantic adapter for semantic alignment and the temporal controller for precise audio-video synchronization. The semantic adapter utilizes parallel cross-attention layers to condition audio generation on video features, producing realistic sound effects that are semantically relevant to the visual content. Meanwhile, the temporal controller incorporates an onset detector and a timestamp-based adapter to achieve precise audio-video alignment. One notable advantage of FoleyCrafter is its compatibility with text prompts, enabling the use of text descriptions to achieve controllable and diverse video-to-audio generation according to user intents. We conduct extensive quantitative and qualitative experiments on standard benchmarks to verify the effectiveness of FoleyCrafter. Models and codes are available at https://github.com/open-mmlab/FoleyCrafter.
Submitted 1 July, 2024;
originally announced July 2024.
-
SpectralKAN: Kolmogorov-Arnold Network for Hyperspectral Images Change Detection
Authors:
Yanheng Wang,
Xiaohan Yu,
Yongsheng Gao,
Jianjun Sha,
Jian Wang,
Lianru Gao,
Yonggang Zhang,
Xianhui Rong
Abstract:
It has been verified that deep learning methods, including convolutional neural networks (CNNs), graph neural networks (GNNs), and transformers, can accurately extract features from hyperspectral images (HSIs). These algorithms perform exceptionally well on HSIs change detection (HSIs-CD). However, the downside of these impressive results is the enormous number of parameters, FLOPs, GPU memory, training and test times required. In this paper, we propose a spectral Kolmogorov-Arnold Network for HSIs-CD (SpectralKAN). SpectralKAN represents a multivariate continuous function with a composition of activation functions to extract HSI features and perform classification. These activation functions are B-spline functions with different parameters that can simulate various functions. In SpectralKAN, a KAN encoder is proposed to enhance computational efficiency for HSIs. In addition, a spatial-spectral KAN encoder is introduced, where the spatial KAN encoder extracts spatial features and compresses the spatial dimensions from patch size to one. The spectral KAN encoder then extracts spectral features and classifies them into changed and unchanged categories. We use five HSIs-CD datasets to verify the effectiveness of SpectralKAN. Experimental verification has shown that SpectralKAN maintains high HSIs-CD accuracy while requiring fewer parameters, FLOPs, GPU memory, training and testing times, thereby increasing the efficiency of HSIs-CD. The code will be available at https://github.com/yanhengwang-heu/SpectralKAN.
Submitted 1 July, 2024;
originally announced July 2024.
-
Reconfigurable Intelligent Computational Surfaces for MEC-Assisted Autonomous Driving Networks: Design Optimization and Analysis
Authors:
Xueyao Zhang,
Bo Yang,
Zhiwen Yu,
Xuelin Cao,
George C. Alexandropoulos,
Yan Zhang,
Merouane Debbah,
Chau Yuen
Abstract:
This paper investigates autonomous driving safety improvement via task offloading from cellular vehicles (CVs) to a multi-access edge computing (MEC) server using vehicle-to-infrastructure (V2I) links. Considering that the latter links can be reused by vehicle-to-vehicle (V2V) communications to improve spectrum utilization, the receiver of the V2I link may suffer from severe interference that can cause outages during the task offloading. To tackle this issue, we propose the deployment of a reconfigurable intelligent computational surface (RICS) whose computationally capable metamaterials are leveraged to jointly enable V2I reflective links as well as to implement interference cancellation at the V2V links. We devise a joint optimization formulation for the task offloading ratio between the CVs and the MEC server, the spectrum sharing strategy between V2V and V2I communications, as well as the RICS reflection and refraction matrices to maximize an autonomous driving safety task. Due to the non-convexity of the problem and the coupling among its free variables, we transform it into a more tractable equivalent form, which is then decomposed into three sub-problems solved via an alternate approximation method. Our simulation results showcase that the proposed RICS-assisted offloading framework significantly improves the safety of the considered autonomous driving network, yielding a nearly 34% improvement in the safety coefficient of the CVs. In addition, it is demonstrated that the V2V data rate can be improved by around 60%, indicating that the RICS-induced adjustment of the signals can effectively mitigate interference at the V2V link.
Submitted 30 June, 2024;
originally announced July 2024.
-
FLY-TTS: Fast, Lightweight and High-Quality End-to-End Text-to-Speech Synthesis
Authors:
Yinlin Guo,
Yening Lv,
Jinqiao Dou,
Yan Zhang,
Yuehai Wang
Abstract:
While recent advances in Text-To-Speech synthesis have yielded remarkable improvements in generating high-quality speech, research on lightweight and fast models is limited. This paper introduces FLY-TTS, a new fast, lightweight and high-quality speech synthesis system based on VITS. Specifically, 1) We replace the decoder with ConvNeXt blocks that generate Fourier spectral coefficients followed by the inverse short-time Fourier transform to synthesize waveforms; 2) To compress the model size, we introduce the grouped parameter-sharing mechanism to the text encoder and flow-based model; 3) We further employ the large pre-trained WavLM model for adversarial training to improve synthesis quality. Experimental results show that our model achieves a real-time factor of 0.0139 on an Intel Core i9 CPU, 8.8x faster than the baseline (0.1221), with a 1.6x parameter compression. Objective and subjective evaluations indicate that FLY-TTS exhibits comparable speech quality to the strong baseline.
Submitted 30 June, 2024;
originally announced July 2024.
-
Cost-efficient Active Illumination Camera For Hyper-spectral Reconstruction
Authors:
Yuxuan Zhang,
T. M. Sazzad,
Yangyang Song,
Spencer J. Chang,
Ritesh Chowdhry,
Tomas Mejia,
Anna Hampton,
Shelby Kucharski,
Stefan Gerber,
Barry Tillman,
Marcio F. R. Resende,
William M. Hammond,
Chris H. Wilson,
Alina Zare,
Sanjeev J. Koppal
Abstract:
Hyper-spectral imaging has recently gained increasing attention for use in different applications, including agricultural investigation, ground tracking, remote sensing and many others. However, the high cost, large physical size and complicated operation process stop hyperspectral cameras from being employed for various applications and research fields. In this paper, we introduce a cost-efficient, compact and easy-to-use active illumination camera that may benefit many applications. We developed a fully functional prototype of such a camera. With the hope of helping with agricultural research, we tested our camera for plant root imaging. In addition, a U-Net model for spectral reconstruction was trained by using a reference hyperspectral camera's data as ground truth and our camera's data as input. We demonstrated our camera's ability to obtain additional information over a typical RGB camera. In addition, the ability to reconstruct hyperspectral data from multi-spectral input makes our device compatible with models and algorithms developed for hyperspectral applications with no modifications required.
Submitted 27 June, 2024;
originally announced June 2024.
-
CMRxRecon2024: A Multi-Modality, Multi-View K-Space Dataset Boosting Universal Machine Learning for Accelerated Cardiac MRI
Authors:
Zi Wang,
Fanwen Wang,
Chen Qin,
Jun Lyu,
Ouyang Cheng,
Shuo Wang,
Yan Li,
Mengyao Yu,
Haoyu Zhang,
Kunyuan Guo,
Zhang Shi,
Qirong Li,
Ziqiang Xu,
Yajing Zhang,
Hao Li,
Sha Hua,
Binghua Chen,
Longyu Sun,
Mengting Sun,
Qin Li,
Ying-Hua Chu,
Wenjia Bai,
Jing Qin,
Xiahai Zhuang,
Claudia Prieto,
et al. (7 additional authors not shown)
Abstract:
Cardiac magnetic resonance imaging (MRI) has emerged as a clinically gold-standard technique for diagnosing cardiac diseases, thanks to its ability to provide diverse information with multiple modalities and anatomical views. Accelerated cardiac MRI is highly expected to achieve time-efficient and patient-friendly imaging, and then advanced image reconstruction approaches are required to recover high-quality, clinically interpretable images from undersampled measurements. However, the lack of publicly available cardiac MRI k-space dataset in terms of both quantity and diversity has severely hindered substantial technological progress, particularly for data-driven artificial intelligence. Here, we provide a standardized, diverse, and high-quality CMRxRecon2024 dataset to facilitate the technical development, fair evaluation, and clinical transfer of cardiac MRI reconstruction approaches, towards promoting the universal frameworks that enable fast and robust reconstructions across different cardiac MRI protocols in clinical practice. To the best of our knowledge, the CMRxRecon2024 dataset is the largest and most diverse publicly available cardiac k-space dataset. It is acquired from 330 healthy volunteers, covering commonly used modalities, anatomical views, and acquisition trajectories in clinical cardiac MRI workflows. Besides, an open platform with tutorials, benchmarks, and data processing tools is provided to facilitate data usage, advanced method development, and fair performance evaluation.
Submitted 27 June, 2024;
originally announced June 2024.
-
USLC: Universal Self-Learning Control via Physical Performance Policy-Optimization Neural Network
Authors:
Yanhui Zhang,
Weifang Chen
Abstract:
This study addresses the challenge of achieving real-time Universal Self-Learning Control (USLC) in nonlinear dynamic systems with uncertain models. The proposed control method incorporates a Universal Self-Learning module, which introduces a model-free online executor-evaluator framework to enable controller adaptation in the presence of unknown disturbances. By leveraging a neural network model trained on historical system performance data, the controller can autonomously learn to approximate optimal performance during each learning cycle. Consequently, the controller's structural parameters are incrementally adjusted to achieve a performance threshold comparable to human-level performance. Utilizing nonlinear system stability theory, specifically in the context of three-dimensional manifold space, we demonstrate the stability of USLC in Lipschitz continuous systems. We illustrate the USLC framework numerically with two case studies: a low-order circuit system and a high-order morphing fixed-wing attitude control system. The simulation results verify the effectiveness and universality of the proposed method.
Submitted 25 June, 2024;
originally announced June 2024.
-
Sparse-view Signal-domain Photoacoustic Tomography Reconstruction Method Based on Neural Representation
Authors:
Bowei Yao,
Yi Zeng,
Haizhao Dai,
Qing Wu,
Youshen Xiao,
Fei Gao,
Yuyao Zhang,
Jingyi Yu,
Xiran Cai
Abstract:
Photoacoustic tomography is a hybrid biomedical technology, which combines the advantages of acoustic and optical imaging. However, for the conventional image reconstruction method, the image quality is obviously affected by artifacts under the condition of sparse sampling. In this paper, a novel model-based sparse reconstruction method via implicit neural representation was proposed for improving the quality of images reconstructed from sparse data. Specifically, the initial acoustic pressure distribution was modeled as a continuous function of spatial coordinates and parameterized by a multi-layer perceptron. The weights of the multi-layer perceptron were determined by training the network in a self-supervised manner, and a total variation regularization term was used to provide prior knowledge. We compared our result with some ablation studies, and the results show that our method outperforms existing methods on simulation and experimental data. Under the sparse sampling condition, our method can suppress the artifacts and avoid the ill-posed problem effectively, reconstructing images with higher signal-to-noise ratio and contrast-to-noise ratio than traditional methods. The high-quality results for sparse data give the proposed method the potential to further decrease the hardware cost of photoacoustic tomography systems.
Submitted 25 June, 2024;
originally announced June 2024.
-
Multimodal Cross-Task Interaction for Survival Analysis in Whole Slide Pathological Images
Authors:
Songhan Jiang,
Zhengyu Gan,
Linghan Cai,
Yifeng Wang,
Yongbing Zhang
Abstract:
Survival prediction, utilizing pathological images and genomic profiles, is increasingly important in cancer analysis and prognosis. Despite significant progress, precise survival analysis still faces two main challenges: (1) The massive number of pixels contained in whole slide images (WSIs) complicates the processing of pathological images, making it difficult to generate an effective representation of the tumor microenvironment (TME). (2) Existing multimodal methods often rely on alignment strategies to integrate complementary information, which may lead to information loss due to the inherent heterogeneity between pathology and genes. In this paper, we propose a Multimodal Cross-Task Interaction (MCTI) framework to explore the intrinsic correlations between subtype classification and survival analysis tasks. Specifically, to capture TME-related features in WSIs, we leverage the subtype classification task to mine tumor regions. Simultaneously, multi-head attention mechanisms are applied in genomic feature extraction, adaptively performing gene grouping to obtain task-related genomic embedding. With the joint representation of pathological images and genomic data, we further introduce a Transport-Guided Attention (TGA) module that uses optimal transport theory to model the correlation between subtype classification and survival analysis tasks, effectively transferring potential information. Extensive experiments demonstrate the superiority of our approach, with MCTI outperforming state-of-the-art frameworks on three public benchmarks. https://github.com/jsh0792/MCTI.
Submitted 24 June, 2024;
originally announced June 2024.
-
Towards Open Respiratory Acoustic Foundation Models: Pretraining and Benchmarking
Authors:
Yuwei Zhang,
Tong Xia,
Jing Han,
Yu Wu,
Georgios Rizos,
Yang Liu,
Mohammed Mosuily,
Jagmohan Chauhan,
Cecilia Mascolo
Abstract:
Respiratory audio, such as coughing and breathing sounds, has predictive power for a wide range of healthcare applications, yet is currently under-explored. The main problem for those applications arises from the difficulty in collecting large labeled task-specific data for model development. Generalizable respiratory acoustic foundation models pretrained with unlabeled data would offer appealing advantages and possibly unlock this impasse. However, given the safety-critical nature of healthcare applications, it is pivotal to also ensure openness and replicability for any proposed foundation model solution. To this end, we introduce OPERA, an OPEn Respiratory Acoustic foundation model pretraining and benchmarking system, as the first approach answering this need. We curate large-scale respiratory audio datasets (~136K samples, 440 hours), pretrain three pioneering foundation models, and build a benchmark consisting of 19 downstream respiratory health tasks for evaluation. Our pretrained models demonstrate superior performance (against existing acoustic models pretrained with general audio on 16 out of 19 tasks) and generalizability (to unseen datasets and new respiratory audio modalities). This highlights the great promise of respiratory acoustic foundation models and encourages more studies using OPERA as an open resource to accelerate research on respiratory audio for health. The system is accessible from https://github.com/evelyn0414/OPERA.
Submitted 23 June, 2024;
originally announced June 2024.
-
Full-Space Wireless Sensing Enabled by Multi-Sector Intelligent Surfaces
Authors:
Yumeng Zhang,
Xiaodan Shao,
Hongyu Li,
Bruno Clerckx,
Rui Zhang
Abstract:
The multi-sector intelligent surface (IS), benefiting from a smarter wave manipulation capability, has been shown to enhance channel gain and offer full-space coverage in communications. However, the benefits of multi-sector IS in wireless sensing remain unexplored. This paper introduces the application of multi-sector IS for wireless sensing/localization. Specifically, we propose a new self-sensing system, where an active source controller uses the multi-sector IS geometry to reflect/scatter the emitted signals towards the entire space, thereby achieving full-space coverage for wireless sensing. Additionally, dedicated sensors are installed aligned with the IS elements at each sector, which collect echo signals from the target and cooperate to sense the target angle. In this context, we develop a maximum likelihood estimator of the target angle for the proposed multi-sector IS self-sensing system, along with the corresponding theoretical limits defined by the Cramér-Rao Bound. The analysis reveals that the advantages of the multi-sector IS self-sensing system stem from two aspects: enhancing the probing power on targets (thereby improving power efficiency) and increasing the rate of target angle (thereby enhancing the transceiver's sensitivity to target angles). Finally, our analysis and simulations confirm that the multi-sector IS self-sensing system, particularly the 4-sector architecture, achieves full-space sensing capability beyond the single-sector IS configuration. Furthermore, similarly to communications, employing directive antenna patterns on each sector's IS elements and sensors significantly enhances sensing capabilities. This enhancement originates from both aspects of improved power efficiency and target angle sensitivity, with the former also being observed in communications while the latter being unique in sensing.
Submitted 25 June, 2024; v1 submitted 22 June, 2024;
originally announced June 2024.
-
Functional photoacoustic noninvasive Doppler angiography in humans
Authors:
Yang Zhang,
Joshua Olick-Gibson,
Karteekeya Sastry,
Lihong V. Wang
Abstract:
Optical imaging of blood flow yields critical functional insights into the circulatory system, but its clinical implementation has typically been limited to shallow depths (~1 millimeter) due to light scattering in biological tissue. Here, we present photoacoustic noninvasive Doppler angiography (PANDA) for deep blood flow imaging. PANDA synergizes the photoacoustic and Doppler effects to generate color Doppler velocity and power Doppler blood flow maps of the vascular lumen. Our results demonstrate PANDA's ability to measure blood flow in vivo up to one centimeter in depth, marking approximately an order of magnitude improvement over existing high-resolution pure optical modalities. PANDA enhances photoacoustic flow imaging by increasing depth and enabling cross-sectional blood vessel imaging. We also showcase PANDA's clinical feasibility through three-dimensional imaging of blood flow in healthy subjects and a patient with varicose veins. By integrating the imaging system onto a mobile platform, we have designed PANDA to be a portable modality that is primed for expedient clinical translation. PANDA offers noninvasive, single modality imaging of hemoglobin and blood flow with three-dimensional capability, facilitating comprehensive assessment of deep vascular dynamics in humans.
Submitted 21 June, 2024;
originally announced June 2024.
-
ECLIPSE: Expunging Clean-label Indiscriminate Poisons via Sparse Diffusion Purification
Authors:
Xianlong Wang,
Shengshan Hu,
Yechao Zhang,
Ziqi Zhou,
Leo Yu Zhang,
Peng Xu,
Wei Wan,
Hai Jin
Abstract:
Clean-label indiscriminate poisoning attacks add invisible perturbations to correctly labeled training images, thus dramatically reducing the generalization capability of the victim models. Recently, some defense mechanisms have been proposed such as adversarial training, image transformation techniques, and image purification. However, these schemes are either susceptible to adaptive attacks, built on unrealistic assumptions, or only effective against specific poison types, limiting their universal applicability. In this research, we propose a more universally effective, practical, and robust defense scheme called ECLIPSE. We first investigate the impact of Gaussian noise on the poisons and theoretically prove that any kind of poison will be largely assimilated when imposing sufficient random noise. In light of this, we assume the victim has access to an extremely limited number of clean images (a more practical scene) and subsequently enlarge this sparse set for training a denoising probabilistic model (a universal denoising tool). We then begin by introducing Gaussian noise to absorb the poisons and then apply the model for denoising, resulting in a roughly purified dataset. Finally, to address the trade-off of the inconsistency in the assimilation sensitivity of different poisons by Gaussian noise, we propose a lightweight corruption compensation module to effectively eliminate residual poisons, providing a more universal defense approach. Extensive experiments demonstrate that our defense approach outperforms 10 state-of-the-art defenses. We also propose an adaptive attack against ECLIPSE and verify the robustness of our defense scheme. Our code is available at https://github.com/CGCL-codes/ECLIPSE.
Submitted 24 June, 2024; v1 submitted 21 June, 2024;
originally announced June 2024.
-
Trustworthy Enhanced Multi-view Multi-modal Alzheimer's Disease Prediction with Brain-wide Imaging Transcriptomics Data
Authors:
Shan Cong,
Zhoujie Fan,
Hongwei Liu,
Yinghan Zhang,
Xin Wang,
Haoran Luo,
Xiaohui Yao
Abstract:
Brain transcriptomics provides insights into the molecular mechanisms by which the brain coordinates its functions and processes. However, existing multimodal methods for predicting Alzheimer's disease (AD) primarily rely on imaging and sometimes genetic data, often neglecting the transcriptomic basis of the brain. Furthermore, while striving to integrate complementary information between modalities, most studies overlook the informativeness disparities between modalities. Here, we propose TMM, a trusted multiview multimodal graph attention framework for AD diagnosis, using extensive brain-wide transcriptomics and imaging data. First, we construct view-specific brain regional co-function networks (RRIs) from transcriptomics and multimodal radiomics data to incorporate interaction information from both biomolecular and imaging perspectives. Next, we apply graph attention (GAT) processing to each RRI network to produce graph embeddings and employ cross-modal attention to fuse the transcriptomics-derived embedding with each imaging-derived embedding. Finally, a novel true-false-harmonized class probability (TFCP) strategy is designed to assess and adaptively adjust the prediction confidence of each modality for AD diagnosis. We evaluate TMM using the AHBA database with brain-wide transcriptomics data and the ADNI database with three imaging modalities (AV45-PET, FDG-PET, and VBM-MRI). The results demonstrate the superiority of our method in identifying AD, EMCI, and LMCI compared to state-of-the-art methods. Code and data are available at https://github.com/Yaolab-fantastic/TMM.
Submitted 21 June, 2024;
originally announced June 2024.
-
Zero-Shot Image Denoising for High-Resolution Electron Microscopy
Authors:
Xuanyu Tian,
Zhuoya Dong,
Xiyue Lin,
Yue Gao,
Hongjiang Wei,
Yanhang Ma,
Jingyi Yu,
Yuyao Zhang
Abstract:
High-resolution electron microscopy (HREM) imaging technique is a powerful tool for directly visualizing a broad range of materials in real-space. However, it faces challenges in denoising due to ultra-low signal-to-noise ratio (SNR) and scarce data availability. In this work, we propose Noise2SR, a zero-shot self-supervised learning (ZS-SSL) denoising framework for HREM. Within our framework, we propose a super-resolution (SR) based self-supervised training strategy, incorporating the Random Sub-sampler module. The Random Sub-sampler is designed to generate approximate infinite noisy pairs from a single noisy image, serving as an effective data augmentation in zero-shot denoising. Noise2SR trains the network with paired noisy images of different resolutions, which is conducted via SR strategy. The SR-based training facilitates the network adopting more pixels for supervision, and the random sub-sampling helps compel the network to learn continuous signals enhancing the robustness. Meanwhile, we mitigate the uncertainty caused by random-sampling by adopting minimum mean squared error (MMSE) estimation for the denoised results. With the distinctive integration of training strategy and proposed designs, Noise2SR can achieve superior denoising performance using a single noisy HREM image. We evaluate the performance of Noise2SR in both simulated and real HREM denoising tasks. It outperforms state-of-the-art ZS-SSL methods and achieves comparable denoising performance with supervised methods. The success of Noise2SR suggests its potential for improving the SNR of images in material imaging domains.
Submitted 20 June, 2024;
originally announced June 2024.
-
A Multi-Stream Fusion Approach with One-Class Learning for Audio-Visual Deepfake Detection
Authors:
Kyungbok Lee,
You Zhang,
Zhiyao Duan
Abstract:
This paper addresses the challenge of developing a robust audio-visual deepfake detection model. In practical use cases, new generation algorithms are continually emerging, and these algorithms are not encountered during the development of detection methods. This calls for the generalization ability of the method. Additionally, to ensure the credibility of detection methods, it is beneficial for the model to interpret which cues from the video indicate it is fake. Motivated by these considerations, we propose a multi-stream fusion approach with one-class learning as a representation-level regularization technique. We study the generalization problem of audio-visual deepfake detection by creating a new benchmark that extends and re-splits the existing FakeAVCeleb dataset. The benchmark contains four categories of fake videos (Real Audio-Fake Visual, Fake Audio-Fake Visual, Fake Audio-Real Visual, and unsynchronized video). The experimental results show that our approach improves the model's detection of unseen attacks by an average of 7.31% across four test sets, compared to the baseline model. Additionally, our proposed framework offers interpretability, indicating which modality the model identifies as fake.
Submitted 20 June, 2024;
originally announced June 2024.
-
Knowledge-driven Subspace Fusion and Gradient Coordination for Multi-modal Learning
Authors:
Yupei Zhang,
Xiaofei Wang,
Fangliangzi Meng,
Jin Tang,
Chao Li
Abstract:
Multi-modal learning plays a crucial role in cancer diagnosis and prognosis. Current deep learning based multi-modal approaches are often limited by their abilities to model the complex correlations between genomics and histology data, addressing the intrinsic complexity of tumour ecosystem where both tumour and microenvironment contribute to malignancy. We propose a biologically interpretative and robust multi-modal learning framework to efficiently integrate histology images and genomics by decomposing the feature subspace of histology images and genomics, reflecting distinct tumour and microenvironment features. To enhance cross-modal interactions, we design a knowledge-driven subspace fusion scheme, consisting of a cross-modal deformable attention module and a gene-guided consistency strategy. Additionally, in pursuit of dynamically optimizing the subspace knowledge, we further propose a novel gradient coordination learning strategy. Extensive experiments demonstrate the effectiveness of the proposed method, outperforming state-of-the-art techniques in three downstream tasks of glioma diagnosis, tumour grading, and survival analysis. Our code is available at https://github.com/helenypzhang/Subspace-Multimodal-Learning.
Submitted 20 June, 2024;
originally announced June 2024.
-
Recurrent Inference Machine for Medical Image Registration
Authors:
Yi Zhang,
Yidong Zhao,
Hui Xue,
Peter Kellman,
Stefan Klein,
Qian Tao
Abstract:
Image registration is essential for medical image applications where alignment of voxels across multiple images is needed for qualitative or quantitative analysis. With recent advancements in deep neural networks and parallel computing, deep learning-based medical image registration methods become competitive with their flexible modelling and fast inference capabilities. However, compared to traditional optimization-based registration methods, the speed advantage may come at the cost of registration performance at inference time. Besides, deep neural networks ideally demand large training datasets while optimization-based methods are training-free. To improve registration accuracy and data efficiency, we propose a novel image registration method, termed Recurrent Inference Image Registration (RIIR) network. RIIR is formulated as a meta-learning solver to the registration problem in an iterative manner. RIIR addresses the accuracy and data efficiency issues, by learning the update rule of optimization, with implicit regularization combined with explicit gradient input.
We evaluated RIIR extensively on brain MRI and quantitative cardiac MRI datasets, in terms of both registration accuracy and training data efficiency. Our experiments showed that RIIR outperformed a range of deep learning-based methods, even with only 5% of the training data, demonstrating high data efficiency. Key findings from our ablation studies highlighted the important added value of the hidden states introduced in the recurrent inference framework for meta-learning. Our proposed RIIR offers a highly data-efficient framework for deep learning-based medical image registration.
Submitted 19 June, 2024;
originally announced June 2024.
-
AI-Empowered Multiple Access for 6G: A Survey of Spectrum Sensing, Protocol Designs, and Optimizations
Authors:
Xuelin Cao,
Bo Yang,
Kaining Wang,
Xinghua Li,
Zhiwen Yu,
Chau Yuen,
Yan Zhang,
Zhu Han
Abstract:
With the rapidly increasing number of bandwidth-intensive terminals capable of intelligent computing and communication, such as smart devices equipped with shallow neural network models, the complexity of multiple access for these intelligent terminals is increasing due to the dynamic network environment and ubiquitous connectivity in 6G systems. Traditional multiple access (MA) design and optimization methods are gradually losing ground to artificial intelligence (AI) techniques that have proven their superiority in handling complexity. AI-empowered MA and its optimization strategies aimed at achieving high Quality-of-Service (QoS) are attracting more attention, especially in the area of latency-sensitive applications in 6G systems. In this work, we aim to: 1) present the development and comparative evaluation of AI-enabled MA; 2) provide a timely survey focusing on spectrum sensing, protocol design, and optimization for AI-empowered MA; and 3) explore the potential use cases of AI-empowered MA in the typical application scenarios within 6G systems. Specifically, we first present a unified framework of AI-empowered MA for 6G systems by incorporating various promising machine learning techniques in spectrum sensing, resource allocation, MA protocol design, and optimization. We then introduce AI-empowered MA spectrum sensing related to spectrum sharing and spectrum interference management. Next, we discuss the AI-empowered MA protocol designs and implementation methods by reviewing and comparing the state-of-the-art, and we further explore the optimization algorithms related to dynamic resource management, parameter adjustment, and access scheme switching. Finally, we discuss the current challenges, point out open issues, and outline potential future research directions in this field.
Submitted 19 June, 2024;
originally announced June 2024.
-
An Empirical Study on the Fairness of Foundation Models for Multi-Organ Image Segmentation
Authors:
Qin Li,
Yizhe Zhang,
Yan Li,
Jun Lyu,
Meng Liu,
Longyu Sun,
Mengting Sun,
Qirong Li,
Wenyue Mao,
Xinran Wu,
Yajing Zhang,
Yinghua Chu,
Shuo Wang,
Chengyan Wang
Abstract:
The segmentation foundation model, e.g., Segment Anything Model (SAM), has attracted increasing interest in the medical image community. Early pioneering studies primarily concentrated on assessing and improving SAM's performance from the perspectives of overall accuracy and efficiency, yet little attention was given to the fairness considerations. This oversight raises questions about the potential for performance biases that could mirror those found in task-specific deep learning models like nnU-Net. In this paper, we explored the fairness dilemma concerning large segmentation foundation models. We prospectively curate a benchmark dataset of 3D MRI and CT scans of the organs including liver, kidney, spleen, lung and aorta from a total of 1056 healthy subjects with expert segmentations. Crucially, we document demographic details such as gender, age, and body mass index (BMI) for each subject to facilitate a nuanced fairness analysis. We test state-of-the-art foundation models for medical image segmentation, including the original SAM, medical SAM and SAT models, to evaluate segmentation efficacy across different demographic groups and identify disparities. Our comprehensive analysis, which accounts for various confounding factors, reveals significant fairness concerns within these foundational models. Moreover, our findings highlight not only disparities in overall segmentation metrics, such as the Dice Similarity Coefficient but also significant variations in the spatial distribution of segmentation errors, offering empirical evidence of the nuanced challenges in ensuring fairness in medical image segmentation.
Submitted 18 June, 2024;
originally announced June 2024.
-
Deep-learning-based groupwise registration for motion correction of cardiac $T_1$ mapping
Authors:
Yi Zhang,
Yidong Zhao,
Lu Huang,
Liming Xia,
Qian Tao
Abstract:
Quantitative $T_1$ mapping by MRI is an increasingly important tool for clinical assessment of cardiovascular diseases. The cardiac $T_1$ map is derived by fitting a known signal model to a series of baseline images, while the quality of this map can be degraded by involuntary respiratory and cardiac motion. To correct motion, a template image is often needed to register all baseline images, but the choice of template is nontrivial, leading to inconsistent performance sensitive to image contrast. In this work, we propose a novel deep-learning-based groupwise registration framework, which removes the need for a template and registers all baseline images simultaneously. We design two groupwise losses for this registration framework: the first is a linear principal component analysis (PCA) loss that enforces alignment of baseline images irrespective of the intensity variation, and the second is an auxiliary relaxometry loss that enforces adherence of intensity profile to the signal model. We extensively evaluated our method, termed "PCA-Relax", and other baseline methods on an in-house cardiac MRI dataset including both pre- and post-contrast $T_1$ sequences. All methods were evaluated under three distinct training-and-evaluation strategies, namely, standard, one-shot, and test-time-adaptation. The proposed PCA-Relax showed further improved performance of registration and mapping over well-established baselines. The proposed groupwise framework is generic and can be adapted to applications involving multiple images.
Submitted 21 June, 2024; v1 submitted 18 June, 2024;
originally announced June 2024.
-
Multi-Active-IRS-Assisted Cooperative Sensing: Cramér-Rao Bound and Joint Beamforming Design
Authors:
Yuan Fang,
Xianghao Yu,
Jie Xu,
Ying-Jun Angela Zhang
Abstract:
This paper studies the multi-intelligent reflecting surface (IRS)-assisted cooperative sensing, in which multiple active IRSs are deployed in a distributed manner to facilitate multi-view target sensing at the non-line-of-sight (NLoS) area of the base station (BS). Different from prior works employing passive IRSs, we leverage active IRSs with the capability of amplifying the reflected signals to overcome the severe multi-hop-reflection path loss in NLoS sensing. In particular, we consider two sensing setups without and with dedicated sensors equipped at active IRSs. In the first case without dedicated sensors at IRSs, we investigate the cooperative sensing at the BS, where the target's direction-of-arrival (DoA) with respect to each IRS is estimated based on the echo signals received at the BS. In the other case with dedicated sensors at IRSs, we consider that each IRS is able to receive echo signals and estimate the target's DoA with respect to itself. For both sensing setups, we first derive the closed-form Cramér-Rao bound (CRB) for estimating target DoA. Then, the (maximum) CRB is minimized by jointly optimizing the transmit beamforming at the BS and the reflective beamforming at the multiple IRSs, subject to the constraints on the maximum transmit power at the BS, as well as the maximum amplification power and the maximum power amplification gain constraints at individual active IRSs. To tackle the resulting highly non-convex (max-)CRB minimization problems, we propose two efficient algorithms to obtain high-quality solutions for the two cases with sensing at the BS and at the IRSs, respectively, based on alternating optimization, successive convex approximation, and semi-definite relaxation.
Submitted 18 June, 2024;
originally announced June 2024.
-
Unlocking the Potential of Early Epochs: Uncertainty-aware CT Metal Artifact Reduction
Authors:
Xinquan Yang,
Guanqun Zhou,
Wei Sun,
Youjian Zhang,
Zhongya Wang,
Jiahui He,
Zhicheng Zhang
Abstract:
In computed tomography (CT), the presence of metallic implants in patients often leads to disruptive artifacts in the reconstructed images, hindering accurate diagnosis. Recently, a large number of supervised deep learning-based approaches have been proposed for metal artifact reduction (MAR). However, these methods neglect the influence of initial training weights. In this paper, we have discovered that the uncertainty image computed from the restoration result of initial training weights can effectively highlight high-frequency regions, including metal artifacts. This observation can be leveraged to assist the MAR network in removing metal artifacts. Therefore, we propose an uncertainty constraint (UC) loss that utilizes the uncertainty image as an adaptive weight to guide the MAR network to focus on the metal artifact region, leading to improved restoration. The proposed UC loss is designed to be a plug-and-play method, compatible with any MAR framework, and easily adoptable. To validate the effectiveness of the UC loss, we conduct extensive experiments on the publicly available DeepLesion and CLINIC-metal datasets. Experimental results demonstrate that the UC loss further optimizes the network training process and significantly improves the removal of metal artifacts.
Submitted 20 June, 2024; v1 submitted 17 June, 2024;
originally announced June 2024.
-
Approximate Angular Domain Expression for Near-Field XL-MIMO Channel
Authors:
Hongbo Xing,
Yuxiang Zhang,
Jianhua Zhang,
Huixin Xu,
Guangyi Liu,
Qixing Wang
Abstract:
As Extremely Large-Scale Multiple-Input-Multiple-Output (XL-MIMO) technology advances and frequency band rises, the near-field effects in communication are intensifying. A concise and accurate near-field XL-MIMO channel model serves as the cornerstone for investigating the near-field effects. However, existing angular domain XL-MIMO channel models under near-field conditions require non-closed-form wave-number domain integrals for computation, which is complicated. To obtain a more succinct channel model, this paper introduces a closed-form approximate expression based on the principle of stationary phase. It was subsequently shown that when the scatterer distance is larger than the array aperture, the closed-form model can be further simplified as a trapezoidal spectrum. We validate the accuracy of the proposed approximation through simulations of power angular spectrum similarity. The results indicate that the proposed approximation can accurately approximate the near-field angular domain channel within the effective Rayleigh distance.
Submitted 17 June, 2024;
originally announced June 2024.
-
Risk-Aware Value-Oriented Net Demand Forecasting for Virtual Power Plants
Authors:
Yufan Zhang,
Jiajun Han,
Yuanyuan Shi
Abstract:
This paper develops a risk-aware net demand forecasting product for virtual power plants, which helps reduce the risk of high operation costs. At the training phase, a bilevel program for parameter estimation is formulated, where the upper level optimizes over the forecast model parameter to minimize the conditional value-at-risk (a risk metric) of operation costs. The lower level solves the operation problems given the forecast. Leveraging the specific structure of the operation problem, we show that the bilevel program is equivalent to a convex program when the forecast model is linear. Numerical results show that our approach effectively reduces the risk of high costs compared to the forecasting approach developed for risk-neutral decision makers.
Submitted 14 June, 2024;
originally announced June 2024.
-
On Efficient Neural Network Architectures for Image Compression
Authors:
Yichi Zhang,
Zhihao Duan,
Fengqing Zhu
Abstract:
Recent advances in learning-based image compression typically come at the cost of high complexity. Designing computationally efficient architectures remains an open challenge. In this paper, we empirically investigate the impact of different network designs in terms of rate-distortion performance and computational complexity. Our experiments involve testing various transforms, including convolutional neural networks and transformers, as well as various context models, including hierarchical, channel-wise, and space-channel context models. Based on the results, we present a series of efficient models, the final model of which has comparable performance to recent best-performing methods but with significantly lower complexity. Extensive experiments provide insights into the design of architectures for learned image compression and potential direction for future research. The code is available at https://gitlab.com/viper-purdue/efficient-compression.
Submitted 14 June, 2024;
originally announced June 2024.
-
Improving child speech recognition with augmented child-like speech
Authors:
Yuanyuan Zhang,
Zhengjun Yue,
Tanvina Patel,
Odette Scharenborg
Abstract:
State-of-the-art ASRs show suboptimal performance for child speech. The scarcity of child speech limits the development of child speech recognition (CSR). Therefore, we studied child-to-child voice conversion (VC) from existing child speakers in the dataset and additional (new) child speakers via monolingual and cross-lingual (Dutch-to-German) VC, respectively. The results showed that cross-lingual child-to-child VC significantly improved child ASR performance. Experiments on the impact of the quantity of child-to-child cross-lingual VC-generated data on fine-tuning (FT) ASR models gave the best results with two-fold augmentation for our FT-Conformer model and FT-Whisper model, which reduced WERs by ~3% absolute compared to the baseline, and with six-fold augmentation for the model trained from scratch, which improved by an absolute 3.6% WER. Moreover, using a small amount of "high-quality" VC-generated data achieved similar results to those of our best-FT models.
Submitted 12 June, 2024;
originally announced June 2024.
-
Suppressing seizure via optimal electrical stimulation to the hub of epileptic brain network
Authors:
Zhichao Liang,
Guanyi Zhao,
Yinuo Zhang,
Weiting Sun,
Jingzhe Lin,
Jialin Wang,
Quanying Liu
Abstract:
Electrical stimulation of the seizure onset zone (SOZ) serves as an efficient approach to seizure suppression. Recently, seizure dynamics have attracted widespread attention for their network propagation mechanisms. Compared with direct stimulation of the SOZ, other brain network-level approaches that can effectively suppress epileptic seizures remain under-explored. In this study, we introduce a platform equipped with a system identification module and a control strategy module to validate the effectiveness of stimulating the hub of the epileptic brain network in suppressing seizures. The identified surrogate dynamics show high predictive performance in reconstructing neural dynamics, which enables the model predictive framework to achieve accurate neural stimulation. Electrical stimulation of the hub of the epileptic brain network performs comparably to direct stimulation of the SOZ in suppressing seizure dynamics. Underpinned by network control theory, our platform offers a general tool for the validation of neural stimulation.
Submitted 14 June, 2024;
originally announced June 2024.
-
Vision Transformer Segmentation for Visual Bird Sound Denoising
Authors:
Sahil Kumar,
Jialu Li,
Youshan Zhang
Abstract:
Audio denoising, especially in the context of bird sounds, remains a challenging task due to persistent residual noise. Traditional and deep learning methods often struggle with artificial or low-frequency noise. In this work, we propose ViTVS, a novel approach that leverages the power of the vision transformer (ViT) architecture. ViTVS adeptly combines segmentation techniques to disentangle clean audio from complex signal mixtures. Our key contributions encompass the development of ViTVS, introducing comprehensive, long-range, and multi-scale representations. These contributions directly tackle the limitations inherent in conventional approaches. Extensive experiments demonstrate that ViTVS outperforms state-of-the-art methods, positioning it as a benchmark solution for real-world bird sound denoising applications. Source code is available at: https://github.com/aiai-4/ViVTS.
Submitted 13 June, 2024;
originally announced June 2024.
-
Complex Image-Generative Diffusion Transformer for Audio Denoising
Authors:
Junhui Li,
Pu Wang,
Jialu Li,
Youshan Zhang
Abstract:
The audio denoising technique has captured widespread attention in the deep neural network field. Recently, the audio denoising problem has been converted into an image generation task, and deep learning-based approaches have been applied to tackle this problem. However, its performance is still limited, leaving room for further improvement. In order to enhance audio denoising performance, this paper introduces a complex image-generative diffusion transformer that captures more information from the complex Fourier domain. We explore a novel diffusion transformer by integrating the transformer with a diffusion model. Our proposed model demonstrates the scalability of the transformer and expands the receptive field of sparse attention using attention diffusion. Our work is among the first to utilize diffusion transformers to deal with the image generation task for audio denoising. Extensive experiments on two benchmark datasets demonstrate that our proposed model outperforms state-of-the-art methods.
Submitted 13 June, 2024;
originally announced June 2024.
-
Diffusion Gaussian Mixture Audio Denoise
Authors:
Pu Wang,
Junhui Li,
Jialu Li,
Liangdong Guo,
Youshan Zhang
Abstract:
Recent diffusion models have achieved promising performance in audio-denoising tasks. The unique property of the reverse process could recover clean signals. However, the distribution of real-world noises does not comply with a single Gaussian distribution and may even be unknown. The sampling of Gaussian noise conditions limits its application scenarios. To overcome these challenges, we propose DiffGMM, a denoising model based on the diffusion and Gaussian mixture models. We employ the reverse process to estimate parameters for the Gaussian mixture model. Given a noisy audio signal, we first apply a 1D-U-Net to extract features and train linear layers to estimate parameters for the Gaussian mixture model, and we approximate the real noise distributions. The estimated noise is continuously subtracted from the noisy signal to output clean audio signals. Extensive experimental results demonstrate that the proposed DiffGMM model achieves state-of-the-art performance.
Submitted 13 June, 2024;
originally announced June 2024.
-
Multiple Prior Representation Learning for Self-Supervised Monocular Depth Estimation via Hybrid Transformer
Authors:
Guodong Sun,
Junjie Liu,
Mingxuan Liu,
Moyun Liu,
Yang Zhang
Abstract:
Self-supervised monocular depth estimation aims to infer depth information without relying on labeled data. However, the lack of labeled information poses a significant challenge to the model's representation, limiting its ability to capture the intricate details of the scene accurately. Prior information can potentially mitigate this issue, enhancing the model's understanding of scene structure and texture. Nevertheless, solely relying on a single type of prior information often falls short when dealing with complex scenes, necessitating improvements in generalization performance. To address these challenges, we introduce a novel self-supervised monocular depth estimation model that leverages multiple priors to bolster representation capabilities across spatial, context, and semantic dimensions. Specifically, we employ a hybrid transformer and a lightweight pose network to obtain long-range spatial priors in the spatial dimension. Then, the context prior attention is designed to improve generalization, particularly in complex structures or untextured areas. In addition, semantic priors are introduced by leveraging semantic boundary loss, and semantic prior attention is supplemented, further refining the semantic features extracted by the decoder. Experiments on three diverse datasets demonstrate the effectiveness of the proposed model. It integrates multiple priors to comprehensively enhance the representation ability, improving the accuracy and reliability of depth estimation. Code is available at: https://github.com/MVME-HBUT/MPRLNet
Submitted 13 June, 2024;
originally announced June 2024.
-
Adaptive Cooperative Streaming of Holographic Video Over Wireless Networks: A Proximal Policy Optimization Solution
Authors:
Wanli Wen,
Jiping Yan,
Yulu Zhang,
Zhen Huang,
Liang Liang,
Yunjian Jia
Abstract:
Adapting holographic video streaming to fluctuating wireless channels is essential to maintain consistent and satisfactory Quality of Experience (QoE) for users, which, however, is a challenging task due to the dynamic and uncertain characteristics of wireless networks. To address this issue, we propose a holographic video cooperative streaming framework designed for a generic wireless network in which multiple access points can cooperatively transmit video with different bitrates to multiple users. Additionally, we model a novel QoE metric tailored specifically for holographic video streaming, which can effectively encapsulate the nuances of holographic video quality, quality fluctuations, and rebuffering occurrences simultaneously. Furthermore, we formulate a formidable QoE maximization problem, which is a non-convex mixed integer nonlinear programming problem. Using proximal policy optimization (PPO), a new class of reinforcement learning algorithms, we devise a joint beamforming and bitrate control scheme, which can be wisely adapted to fluctuations in the wireless channel. The numerical results demonstrate the superiority of the proposed scheme over representative baselines.
Submitted 13 June, 2024;
originally announced June 2024.
-
Towards Unsupervised Speech Recognition Without Pronunciation Models
Authors:
Junrui Ni,
Liming Wang,
Yang Zhang,
Kaizhi Qian,
Heting Gao,
Mark Hasegawa-Johnson,
Chang D. Yoo
Abstract:
Recent advancements in supervised automatic speech recognition (ASR) have achieved remarkable performance, largely due to the growing availability of large transcribed speech corpora. However, most languages lack sufficient paired speech and text data to effectively train these systems. In this article, we tackle the challenge of developing ASR systems without paired speech and text corpora by proposing the removal of reliance on a phoneme lexicon. We explore a new research direction: word-level unsupervised ASR. Using a curated speech corpus containing only high-frequency English words, our system achieves a word error rate of nearly 20% without parallel transcripts or oracle word boundaries. Furthermore, we experimentally demonstrate that an unsupervised speech recognizer can emerge from joint speech-to-speech and text-to-text masked token-infilling. This innovative model surpasses the performance of previous unsupervised ASR models trained with direct distribution matching.
Submitted 12 June, 2024;
originally announced June 2024.
-
CLDTA: Contrastive Learning based on Diagonal Transformer Autoencoder for Cross-Dataset EEG Emotion Recognition
Authors:
Yuan Liao,
Yuhong Zhang,
Shenghuan Wang,
Xiruo Zhang,
Yiling Zhang,
Wei Chen,
Yuzhe Gu,
Liya Huang
Abstract:
Recent advances in non-invasive EEG technology have broadened its application in emotion recognition, yielding a multitude of related datasets. Yet, deep learning models struggle to generalize across these datasets due to variations in acquisition equipment and emotional stimulus materials. To address the pressing need for a universal model that fluidly accommodates diverse EEG dataset formats and bridges the gap between laboratory and real-world data, we introduce a novel deep learning framework: the Contrastive Learning based Diagonal Transformer Autoencoder (CLDTA), tailored for EEG-based emotion recognition. The CLDTA employs a diagonal masking strategy within its encoder to extract brain network knowledge from full-channel EEG data, facilitating transferability to datasets with fewer channels. In addition, an information separation mechanism improves model interpretability by enabling straightforward visualization of brain networks. The CLDTA framework employs contrastive learning to distill subject-independent emotional representations and uses a calibration prediction process to enable rapid adaptation of the model to new subjects with minimal samples, achieving accurate emotion recognition. Our analysis across the SEED, SEED-IV, SEED-V, and DEAP datasets highlights CLDTA's consistent performance and proficiency in detecting both task-specific and general features of EEG signals related to emotions, underscoring its potential to revolutionize emotion recognition research.
Submitted 12 June, 2024;
originally announced June 2024.
-
Development of Focused X-ray Luminescence Compute Tomography Imaging
Authors:
Yile Fang,
Yibing Zhang,
Changqing Li
Abstract:
X-ray luminescence is produced when contrast agents absorb energy from X-ray photons and release a portion of that energy by emitting photons in the visible and near-infrared range. X-ray luminescence computed tomography (XLCT) was introduced in the past decade as a hybrid molecular imaging modality combining the merits of both X-ray imaging (high spatial resolution) and optical imaging (high sensitivity to tracer nanophosphors).
Submitted 11 June, 2024;
originally announced June 2024.
-
Smart Wireless Environment Enhanced Telecommunications: A Network Stabilisation Paradigm for Mobile Operators
Authors:
Yangyishi Zhang,
Khethiwe Mhlope,
Aaron Walker,
Fraser Burton
Abstract:
Due to uncontrolled and complex real-life radio propagation environments, Claude Shannon's information theory of communications describes fundamental limits to state-of-the-art 5G radio access network (RAN) capacity, with respect to fixed radio resource usage. Fortunately, recent research has found that a new physical layer architecture based on holographic metasurfaces may hold the key to overcoming these fundamental limits of current mobile networks under a new paradigm, smart wireless environment (SWE), where the long-standing challenge of mobile communications, fading channel hostility, may be solved, leading to a step-change boost in network performance and user experience. Despite recent research activities in SWE, the best way to implement it as a network operator remains an open challenge. In this industrial review, we adopt a novel yet realistic mobile channel stabilisation perspective for network operators to understand this paradigm shift. More specifically, we provide a technical analysis of the synergy between key next-gen mobile network enablers, e.g., holographic metasurface, wireless sensing, and machine intelligence, as well as of how this synergy leads to a robust future RAN architecture. Given the as-yet unclear theoretical boundaries and low technology readiness level (TRL) of SWE-enhanced telecommunications, we conclude by identifying critical challenges in future commercial deployments.
Submitted 11 June, 2024;
originally announced June 2024.