Skip to main content

Showing 1–50 of 1,149 results for author: Zhang, J

  1. arXiv:2407.13545  [pdf, other

    eess.IV cs.CV

    DiffuX2CT: Diffusion Learning to Reconstruct CT Images from Biplanar X-Rays

    Authors: Xuhui Liu, Zhi Qiao, Runkun Liu, Hong Li, Juan Zhang, Xiantong Zhen, Zhen Qian, Baochang Zhang

    Abstract: Computed tomography (CT) is widely utilized in clinical settings because it delivers detailed 3D images of the human body. However, performing CT scans is not always feasible due to radiation exposure and limitations in certain surgical environments. As an alternative, reconstructing CT images from ultra-sparse X-rays offers a valuable solution and has gained significant interest in scientific res… ▽ More

    Submitted 18 July, 2024; originally announced July 2024.

  2. arXiv:2407.13198  [pdf, other

    cs.SD eess.AS

    DiveSound: LLM-Assisted Automatic Taxonomy Construction for Diverse Audio Generation

    Authors: Baihan Li, Zeyu Xie, Xuenan Xu, Yiwei Guo, Ming Yan, Ji Zhang, Kai Yu, Mengyue Wu

    Abstract: Audio generation has attracted significant attention. Despite remarkable enhancement in audio quality, existing models overlook diversity evaluation. This is partially due to the lack of a systematic sound class diversity framework and a matching dataset. To address these issues, we propose DiveSound, a novel framework for constructing multimodal datasets with in-class diversified taxonomy, assist… ▽ More

    Submitted 18 July, 2024; originally announced July 2024.

  3. arXiv:2407.10632  [pdf, other

    eess.IV cs.AI cs.CV

    Bidirectional Stereo Image Compression with Cross-Dimensional Entropy Model

    Authors: Zhening Liu, Xinjie Zhang, Jiawei Shao, Zehong Lin, Jun Zhang

    Abstract: With the rapid advancement of stereo vision technologies, stereo image compression has emerged as a crucial field that continues to draw significant attention. Previous approaches have primarily employed a unidirectional paradigm, where the compression of one view is dependent on the other, resulting in imbalanced compression. To address this issue, we introduce a symmetric bidirectional stereo im… ▽ More

    Submitted 15 July, 2024; originally announced July 2024.

    Comments: ECCV 2024

  4. arXiv:2407.10147  [pdf, ps, other

    eess.SP cs.IT

    Near-Field User Localization and Channel Estimation for XL-MIMO Systems: Fundamentals, Recent Advances, and Outlooks

    Authors: Hao Lei, Jiayi Zhang, Zhe Wang, Huahua Xiao, Bo Ai, Emil Björnson

    Abstract: Extremely large-scale multiple-input multipleoutput (XL-MIMO) is believed to be a cornerstone of sixth-generation (6G) wireless networks. XL-MIMO uses more antennas to both achieve unprecedented spatial degrees of freedom (DoFs) and exploit new electromagnetic (EM) phenomena occurring in the radiative near-field. The near-field effects provide the XL-MIMO array with depth perception, enabling prec… ▽ More

    Submitted 14 July, 2024; originally announced July 2024.

    Comments: 9 pages, 4 figures, 2tables, submitted to IEEE WCM

  5. arXiv:2407.10080  [pdf, other

    cs.IT eess.SY

    Design and Optimization on Successive RIS-assisted Multi-hop Wireless Communications

    Authors: Rujing Xiong, Jialong Lu, Jianan Zhang, Minggang Liu, Xuehui Dong, Tiebin Mi, Robert Caiming Qiu

    Abstract: As an emerging wireless communication technology, reconfigurable intelligent surface (RIS) has become a basic choice for providing signal coverage services in scenarios with dense obstacles or long tunnels through multi-hop configurations. Conventional works of literature mainly focus on alternating optimization or single-beam calculation in RIS phase configuration, which is limited in considering… ▽ More

    Submitted 14 July, 2024; originally announced July 2024.

  6. arXiv:2407.08550  [pdf

    cs.AI cs.ET cs.MA cs.RO eess.SY

    Incorporating Large Language Models into Production Systems for Enhanced Task Automation and Flexibility

    Authors: Yuchen Xia, Jize Zhang, Nasser Jazdi, Michael Weyrich

    Abstract: This paper introduces a novel approach to integrating large language model (LLM) agents into automated production systems, aimed at enhancing task automation and flexibility. We organize production operations within a hierarchical framework based on the automation pyramid. Atomic operation functionalities are modeled as microservices, which are executed through interface invocation within a dedica… ▽ More

    Submitted 11 July, 2024; originally announced July 2024.

    Report number: VDI-Berichte Nr. 2437, 2024

  7. arXiv:2407.08401  [pdf, other

    eess.SY

    Application of Data-Driven Model Predictive Control for Autonomous Vehicle Steering

    Authors: Jiarui Zhang, Aijing Kong, Yu Tang, Zhichao Lv, Lulu Guo, Peng Hang

    Abstract: With the development of autonomous driving technology, there are increasing demands for vehicle control, and MPC has become a widely researched topic in both industry and academia. Existing MPC control methods based on vehicle kinematics or dynamics have challenges such as difficult modeling, numerous parameters, strong nonlinearity, and high computational cost. To address these issues, this paper… ▽ More

    Submitted 18 July, 2024; v1 submitted 11 July, 2024; originally announced July 2024.

  8. arXiv:2407.08216  [pdf, other

    eess.IV cs.AI cs.CV q-bio.QM

    Multimodal contrastive learning for spatial gene expression prediction using histology images

    Authors: Wenwen Min, Zhiceng Shi, Jun Zhang, Jun Wan, Changmiao Wang

    Abstract: In recent years, the advent of spatial transcriptomics (ST) technology has unlocked unprecedented opportunities for delving into the complexities of gene expression patterns within intricate biological systems. Despite its transformative potential, the prohibitive cost of ST technology remains a significant barrier to its widespread adoption in large-scale studies. An alternative, more cost-effect… ▽ More

    Submitted 11 July, 2024; originally announced July 2024.

    Comments: BIB, Code: https://github.com/shizhiceng/mclSTExp

  9. arXiv:2407.06042  [pdf, ps, other

    eess.SP cs.IT

    Near-Optimal MIMO Detection Using Gradient-Based MCMC in Discrete Spaces

    Authors: Xingyu Zhou, Le Liang, Jing Zhang, Chao-Kai Wen, Shi Jin

    Abstract: The discrete nature of transmitted symbols poses challenges for achieving optimal detection in multiple-input multiple-output (MIMO) systems associated with a large number of antennas. Recently, the combination of two powerful machine learning methods, Markov chain Monte Carlo (MCMC) sampling and gradient descent, has emerged as a highly efficient solution to address this issue. However, existing… ▽ More

    Submitted 8 July, 2024; originally announced July 2024.

    Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

  10. arXiv:2407.04888  [pdf, other

    eess.IV cs.CV

    Unraveling Radiomics Complexity: Strategies for Optimal Simplicity in Predictive Modeling

    Authors: Mahdi Ait Lhaj Loutfi, Teodora Boblea Podasca, Alex Zwanenburg, Taman Upadhaya, Jorge Barrios, David R. Raleigh, William C. Chen, Dante P. I. Capaldi, Hong Zheng, Olivier Gevaert, Jing Wu, Alvin C. Silva, Paul J. Zhang, Harrison X. Bai, Jan Seuntjens, Steffen Löck, Patrick O. Richard, Olivier Morin, Caroline Reinhold, Martin Lepage, Martin Vallières

    Abstract: Background: The high dimensionality of radiomic feature sets, the variability in radiomic feature types and potentially high computational requirements all underscore the need for an effective method to identify the smallest set of predictive features for a given clinical problem. Purpose: Develop a methodology and tools to identify and explain the smallest set of predictive radiomic features. Mat… ▽ More

    Submitted 5 July, 2024; originally announced July 2024.

  11. arXiv:2407.04675  [pdf, other

    eess.AS cs.SD

    Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition

    Authors: Ye Bai, Jingping Chen, Jitong Chen, Wei Chen, Zhuo Chen, Chuang Ding, Linhao Dong, Qianqian Dong, Yujiao Du, Kepan Gao, Lu Gao, Yi Guo, Minglun Han, Ting Han, Wenchao Hu, Xinying Hu, Yuxiang Hu, Deyu Hua, Lu Huang, Mingkun Huang, Youjia Huang, Jishuo Jin, Fanliu Kong, Zongwei Lan, Tianyu Li , et al. (30 additional authors not shown)

    Abstract: Modern automatic speech recognition (ASR) model is required to accurately transcribe diverse speech signals (from different domains, languages, accents, etc) given the specific contextual information in various application scenarios. Classic end-to-end models fused with extra language models perform well, but mainly in data matching scenarios and are gradually approaching a bottleneck. In this wor… ▽ More

    Submitted 10 July, 2024; v1 submitted 5 July, 2024; originally announced July 2024.

  12. arXiv:2407.04544  [pdf, other

    eess.SY

    Arbitrary Waveform Generated Metasurface: A New Paradigm for Direct Modulation and Beamforming Decoupling

    Authors: Xuehui Dong, Bokai Lai, Rujing Xiong, Jianan Zhang, Miyu Feng, Tiebin Mi, Robert Caiming Qiu

    Abstract: Passive arbitrary waveform generation (AWG) are especially important in a variety of fields like radar detection, wireless communications and integrated sensing and communications. Typically, backscatter devices are used to achieve passive signal reflection modulation to facilitate information transmission or to interfere with radar echoes. Reconfigurable Intelligent Surface (RIS) or Metasurface i… ▽ More

    Submitted 8 July, 2024; v1 submitted 5 July, 2024; originally announced July 2024.

  13. arXiv:2407.03449  [pdf, other

    eess.SP

    A Tutorial on Fluid Antenna System for 6G Networks: Encompassing Communication Theory, Optimization Methods and Hardware Designs

    Authors: Wee Kiat New, Kai-Kit Wong, Hao Xu, Chao Wang, Farshad Rostami Ghadi, Jichen Zhang, Junhui Rao, Ross Murch, Pablo Ramírez-Espinosa, David Morales-Jimenez, Chan-Byoung Chae, Kin-Fai Tong

    Abstract: The advent of the sixth-generation (6G) networks presents another round of revolution for the mobile communication landscape, promising an immersive experience, robust reliability, minimal latency, extreme connectivity, ubiquitous coverage, and capabilities beyond communication, including intelligence and sensing. To achieve these ambitious goals, it is apparent that 6G networks need to incorporat… ▽ More

    Submitted 3 July, 2024; originally announced July 2024.

    Comments: 50 pages, 45 figures, 5 tables. Submitted for potential publication

  14. arXiv:2407.02182  [pdf, other

    cs.CV cs.RO eess.IV

    Occlusion-Aware Seamless Segmentation

    Authors: Yihong Cao, Jiaming Zhang, Hao Shi, Kunyu Peng, Yuhongxuan Zhang, Hui Zhang, Rainer Stiefelhagen, Kailun Yang

    Abstract: Panoramic images can broaden the Field of View (FoV), occlusion-aware prediction can deepen the understanding of the scene, and domain adaptation can transfer across viewing domains. In this work, we introduce a novel task, Occlusion-Aware Seamless Segmentation (OASS), which simultaneously tackles all these three challenges. For benchmarking OASS, we establish a new human-annotated dataset for Ble… ▽ More

    Submitted 17 July, 2024; v1 submitted 2 July, 2024; originally announced July 2024.

    Comments: Accepted to ECCV 2024. The fresh dataset and source code are available at https://github.com/yihong-97/OASS

  15. arXiv:2407.02052  [pdf, other

    eess.AS cs.SD

    The USTC-NERCSLIP Systems for The ICMC-ASR Challenge

    Authors: Minghui Wu, Luzhen Xu, Jie Zhang, Haitao Tang, Yanyan Yue, Ruizhi Liao, Jintao Zhao, Zhengzhe Zhang, Yichi Wang, Haoyin Yan, Hongliang Yu, Tongle Ma, Jiachen Liu, Chongliang Wu, Yongchao Li, Yanyong Zhang, Xin Fang, Yue Zhang

    Abstract: This report describes the submitted system to the In-Car Multi-Channel Automatic Speech Recognition (ICMC-ASR) challenge, which considers the ASR task with multi-speaker overlapping and Mandarin accent dynamics in the ICMC case. We implement the front-end speaker diarization using the self-supervised learning representation based multi-speaker embedding and beamforming using the speaker position,… ▽ More

    Submitted 2 July, 2024; originally announced July 2024.

    Comments: Accepted at ICASSP 2024

  16. arXiv:2407.01956  [pdf, other

    eess.SY cs.RO

    Cloud-Edge-Terminal Collaborative AIGC for Autonomous Driving

    Authors: Jianan Zhang, Zhiwei Wei, Boxun Liu, Xiayi Wang, Yong Yu, Rongqing Zhang

    Abstract: In dynamic autonomous driving environment, Artificial Intelligence-Generated Content (AIGC) technology can supplement vehicle perception and decision making by leveraging models' generative and predictive capabilities, and has the potential to enhance motion planning, trajectory prediction and traffic simulation. This article proposes a cloud-edge-terminal collaborative architecture to support AIG… ▽ More

    Submitted 2 July, 2024; originally announced July 2024.

  17. arXiv:2407.01872  [pdf, other

    cs.CV cs.RO eess.IV

    Referring Atomic Video Action Recognition

    Authors: Kunyu Peng, Jia Fu, Kailun Yang, Di Wen, Yufan Chen, Ruiping Liu, Junwei Zheng, Jiaming Zhang, M. Saquib Sarfraz, Rainer Stiefelhagen, Alina Roitberg

    Abstract: We introduce a new task called Referring Atomic Video Action Recognition (RAVAR), aimed at identifying atomic actions of a particular person based on a textual description and the video data of this person. This task differs from traditional action recognition and localization, where predictions are delivered for all present individuals. In contrast, we focus on recognizing the correct atomic acti… ▽ More

    Submitted 10 July, 2024; v1 submitted 1 July, 2024; originally announced July 2024.

    Comments: Accepted to ECCV 2024. The dataset and code will be made publicly available at https://github.com/KPeng9510/RAVAR

  18. arXiv:2407.00297  [pdf

    eess.IV cs.CV

    UADSN: Uncertainty-Aware Dual-Stream Network for Facial Nerve Segmentation

    Authors: Guanghao Zhu, Lin Liu, Jing Zhang, Xiaohui Du, Ruqian Hao, Juanxiu Liu

    Abstract: Facial nerve segmentation is crucial for preoperative path planning in cochlear implantation surgery. Recently, researchers have proposed some segmentation methods, such as atlas-based and deep learning-based methods. However, since the facial nerve is a tubular organ with a diameter of only 1.0-1.5mm, it is challenging to locate and segment the facial nerve in CT scans. In this work, we propose a… ▽ More

    Submitted 28 June, 2024; originally announced July 2024.

  19. arXiv:2406.19796  [pdf, other

    eess.IV cs.CV

    Comprehensive Generative Replay for Task-Incremental Segmentation with Concurrent Appearance and Semantic Forgetting

    Authors: Wei Li, Jingyang Zhang, Pheng-Ann Heng, Lixu Gu

    Abstract: Generalist segmentation models are increasingly favored for diverse tasks involving various objects from different image sources. Task-Incremental Learning (TIL) offers a privacy-preserving training paradigm using tasks arriving sequentially, instead of gathering them due to strict data sharing policies. However, the task evolution can span a wide scope that involves shifts in both image appearanc… ▽ More

    Submitted 28 June, 2024; originally announced June 2024.

    Comments: Accepted by MICCAI24

  20. arXiv:2406.19769  [pdf, other

    eess.SP

    Decision Transformer for IRS-Assisted Systems with Diffusion-Driven Generative Channels

    Authors: Jie Zhang, Jun Li, Zhe Wang, Yu Han, Long Shi, Bin Cao

    Abstract: In this paper, we propose a novel diffusion-decision transformer (D2T) architecture to optimize the beamforming strategies for intelligent reflecting surface (IRS)-assisted multiple-input single-output (MISO) communication systems. The first challenge lies in the expensive computation cost to recover the real-time channel state information (CSI) from the received pilot signals, which usually requi… ▽ More

    Submitted 28 June, 2024; originally announced June 2024.

  21. arXiv:2406.19649  [pdf

    eess.IV cs.CV

    AstMatch: Adversarial Self-training Consistency Framework for Semi-Supervised Medical Image Segmentation

    Authors: Guanghao Zhu, Jing Zhang, Juanxiu Liu, Xiaohui Du, Ruqian Hao, Yong Liu, Lin Liu

    Abstract: Semi-supervised learning (SSL) has shown considerable potential in medical image segmentation, primarily leveraging consistency regularization and pseudo-labeling. However, many SSL approaches only pay attention to low-level consistency and overlook the significance of pseudo-label reliability. Therefore, in this work, we propose an adversarial self-training consistency framework (AstMatch). First… ▽ More

    Submitted 28 June, 2024; originally announced June 2024.

  22. arXiv:2406.19608  [pdf, other

    eess.SY

    Multi-service collaboration and composition of cloud manufacturing customized production based on problem decomposition

    Authors: Hao Yue, Yingtao Wu, Min Wang, Hesuan Hu, Weimin Wu, Jihui Zhang

    Abstract: Cloud manufacturing system is a service-oriented and knowledge-based one, which can provide solutions for the large-scale customized production. The service resource allocation is the primary factor that restricts the production time and cost in the cloud manufacturing customized production (CMCP). In order to improve the efficiency and reduce the cost in CMCP, we propose a new framework which con… ▽ More

    Submitted 27 June, 2024; originally announced June 2024.

    Comments: 12 pages, 8 figures

    ACM Class: J.0

  23. arXiv:2406.19485  [pdf, other

    eess.IV cs.CV

    GAPNet: Granularity Attention Network with Anatomy-Prior-Constraint for Carotid Artery Segmentation

    Authors: Lin Zhang, Chenggang Lu, Xin-yang Shi, Caifeng Shan, Jiong Zhang, Da Chen, Laurent D. Cohen

    Abstract: Atherosclerosis is a chronic, progressive disease that primarily affects the arterial walls. It is one of the major causes of cardiovascular disease. Magnetic Resonance (MR) black-blood vessel wall imaging (BB-VWI) offers crucial insights into vascular disease diagnosis by clearly visualizing vascular structures. However, the complex anatomy of the neck poses challenges in distinguishing the carot… ▽ More

    Submitted 27 June, 2024; originally announced June 2024.

  24. arXiv:2406.18327  [pdf, other

    eess.IV cs.CV cs.LG

    Multi-modal Evidential Fusion Network for Trusted PET/CT Tumor Segmentation

    Authors: Yuxuan Qi, Li Lin, Jiajun Wang, Jingya Zhang, Bin Zhang

    Abstract: Accurate segmentation of tumors in PET/CT images is important in computer-aided diagnosis and treatment of cancer. The key issue of such a segmentation problem lies in the effective integration of complementary information from PET and CT images. However, the quality of PET and CT images varies widely in clinical settings, which leads to uncertainty in the modality information extracted by network… ▽ More

    Submitted 26 June, 2024; originally announced June 2024.

  25. arXiv:2406.18242  [pdf, other

    cs.CV eess.IV

    ConStyle v2: A Strong Prompter for All-in-One Image Restoration

    Authors: Dongqi Fan, Junhao Zhang, Liang Chang

    Abstract: This paper introduces ConStyle v2, a strong plug-and-play prompter designed to output clean visual prompts and assist U-Net Image Restoration models in handling multiple degradations. The joint training process of IRConStyle, an Image Restoration framework consisting of ConStyle and a general restoration network, is divided into two stages: first, pre-training ConStyle alone, and then freezing its… ▽ More

    Submitted 26 June, 2024; originally announced June 2024.

  26. arXiv:2406.17338  [pdf, other

    eess.IV cs.CV cs.LG

    Robustly Optimized Deep Feature Decoupling Network for Fatty Liver Diseases Detection

    Authors: Peng Huang, Shu Hu, Bo Peng, Jiashu Zhang, Xi Wu, Xin Wang

    Abstract: Current medical image classification efforts mainly aim for higher average performance, often neglecting the balance between different classes. This can lead to significant differences in recognition accuracy between classes and obvious recognition weaknesses. Without the support of massive data, deep learning faces challenges in fine-grained classification of fatty liver. In this paper, we propos… ▽ More

    Submitted 25 June, 2024; originally announced June 2024.

    Comments: MICCAI 2024

  27. arXiv:2406.17159  [pdf, other

    eess.AS cs.MM cs.SD

    Exploring compressibility of transformer based text-to-music (TTM) models

    Authors: Vasileios Moschopoulos, Thanasis Kotsiopoulos, Pablo Peso Parada, Konstantinos Nikiforidis, Alexandros Stergiadis, Gerasimos Papakostas, Md Asif Jalal, Jisi Zhang, Anastasios Drosou, Karthikeyan Saravanan

    Abstract: State-of-the art Text-To-Music (TTM) generative AI models are large and require desktop or server class compute, making them infeasible for deployment on mobile phones. This paper presents an analysis of trade-offs between model compression and generation performance of TTM models. We study compression through knowledge distillation and specific modifications that enable applicability over the var… ▽ More

    Submitted 24 June, 2024; originally announced June 2024.

    Comments: Proceedings of INTERSPEECH 2024

  28. arXiv:2406.16323  [pdf, other

    eess.SP

    Low-Complexity CSI Feedback for FDD Massive MIMO Systems via Learning to Optimize

    Authors: Yifan Ma, Hengtao He, Shenghui Song, Jun Zhang, Khaled B. Letaief

    Abstract: In frequency-division duplex (FDD) massive multiple-input multiple-output (MIMO) systems, the growing number of base station antennas leads to prohibitive feedback overhead for downlink channel state information (CSI). To address this challenge, state-of-the-art (SOTA) fully data-driven deep learning (DL)-based CSI feedback schemes have been proposed. However, the high computational complexity and… ▽ More

    Submitted 24 June, 2024; originally announced June 2024.

    Comments: submitted to IEEE for publication

  29. arXiv:2406.15222  [pdf

    eess.IV cs.AI cs.CV

    Rapid and Accurate Diagnosis of Acute Aortic Syndrome using Non-contrast CT: A Large-scale, Retrospective, Multi-center and AI-based Study

    Authors: Yujian Hu, Yilang Xiang, Yan-Jie Zhou, Yangyan He, Shifeng Yang, Xiaolong Du, Chunlan Den, Youyao Xu, Gaofeng Wang, Zhengyao Ding, Jingyong Huang, Wenjun Zhao, Xuejun Wu, Donglin Li, Qianqian Zhu, Zhenjiang Li, Chenyang Qiu, Ziheng Wu, Yunjun He, Chen Tian, Yihui Qiu, Zuodong Lin, Xiaolong Zhang, Yuan He, Zhenpeng Yuan , et al. (15 additional authors not shown)

    Abstract: Chest pain symptoms are highly prevalent in emergency departments (EDs), where acute aortic syndrome (AAS) is a catastrophic cardiovascular emergency with a high fatality rate, especially when timely and accurate treatment is not administered. However, current triage practices in the ED can cause up to approximately half of patients with AAS to have an initially missed diagnosis or be misdiagnosed… ▽ More

    Submitted 24 June, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

    Comments: under peer review

  30. arXiv:2406.13977  [pdf, other

    eess.IV cs.CV

    Similarity-aware Syncretic Latent Diffusion Model for Medical Image Translation with Representation Learning

    Authors: Tingyi Lin, Pengju Lyu, Jie Zhang, Yuqing Wang, Cheng Wang, Jianjun Zhu

    Abstract: Non-contrast CT (NCCT) imaging may reduce image contrast and anatomical visibility, potentially increasing diagnostic uncertainty. In contrast, contrast-enhanced CT (CECT) facilitates the observation of regions of interest (ROI). Leading generative models, especially the conditional diffusion model, demonstrate remarkable capabilities in medical image modality transformation. Typical conditional d… ▽ More

    Submitted 19 June, 2024; originally announced June 2024.

  31. arXiv:2406.13340  [pdf, other

    cs.CL cs.SD eess.AS

    SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words

    Authors: Junyi Ao, Yuancheng Wang, Xiaohai Tian, Dekun Chen, Jun Zhang, Lu Lu, Yuxuan Wang, Haizhou Li, Zhizheng Wu

    Abstract: Speech encompasses a wealth of information, including but not limited to content, paralinguistic, and environmental information. This comprehensive nature of speech significantly impacts communication and is crucial for human-computer interaction. Chat-Oriented Large Language Models (LLMs), known for their general-purpose assistance capabilities, have evolved to handle multi-modal inputs, includin… ▽ More

    Submitted 19 June, 2024; originally announced June 2024.

  32. arXiv:2406.13275  [pdf, other

    cs.SD cs.CL eess.AS

    Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding

    Authors: Jizhong Liu, Gang Li, Junbo Zhang, Heinrich Dinkel, Yongqing Wang, Zhiyong Yan, Yujun Wang, Bin Wang

    Abstract: Automated audio captioning (AAC) is an audio-to-text task to describe audio contents in natural language. Recently, the advancements in large language models (LLMs), with improvements in training approaches for audio encoders, have opened up possibilities for improving AAC. Thus, we explore enhancing AAC from three aspects: 1) a pre-trained audio encoder via consistent ensemble distillation (CED)… ▽ More

    Submitted 25 June, 2024; v1 submitted 19 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  33. arXiv:2406.13145  [pdf, other

    eess.SY cs.LG

    Constructing and Evaluating Digital Twins: An Intelligent Framework for DT Development

    Authors: Longfei Ma, Nan Cheng, Xiucheng Wang, Jiong Chen, Yinjun Gao, Dongxiao Zhang, Jun-Jie Zhang

    Abstract: The development of Digital Twins (DTs) represents a transformative advance for simulating and optimizing complex systems in a controlled digital space. Despite their potential, the challenge of constructing DTs that accurately replicate and predict the dynamics of real-world systems remains substantial. This paper introduces an intelligent framework for the construction and evaluation of DTs, spec… ▽ More

    Submitted 18 June, 2024; originally announced June 2024.

  34. arXiv:2406.11519  [pdf, other

    cs.CV eess.IV

    HyperSIGMA: Hyperspectral Intelligence Comprehension Foundation Model

    Authors: Di Wang, Meiqi Hu, Yao Jin, Yuchun Miao, Jiaqi Yang, Yichu Xu, Xiaolei Qin, Jiaqi Ma, Lingyu Sun, Chenxing Li, Chuan Fu, Hongruixuan Chen, Chengxi Han, Naoto Yokoya, Jing Zhang, Minqiang Xu, Lin Liu, Lefei Zhang, Chen Wu, Bo Du, Dacheng Tao, Liangpei Zhang

    Abstract: Foundation models (FMs) are revolutionizing the analysis and understanding of remote sensing (RS) scenes, including aerial RGB, multispectral, and SAR images. However, hyperspectral images (HSIs), which are rich in spectral information, have not seen much application of FMs, with existing methods often restricted to specific tasks and lacking generality. To fill this gap, we introduce HyperSIGMA,… ▽ More

    Submitted 17 June, 2024; originally announced June 2024.

    Comments: The code and models will be released at https://github.com/WHU-Sigma/HyperSIGMA

  35. arXiv:2406.11446  [pdf, other

    eess.SP

    An Approximate Wave-Number Domain Expression for Near-Field XL-MIMO Channel

    Authors: Hongbo Xing, Yuxiang Zhang, Jianhua Zhang, Huixin Xu, Guangyi Liu, Qixing Wang

    Abstract: As Extremely Large-Scale Multiple-Input-Multiple-Output (XL-MIMO) technology advances and carrier frequency rises, the near-field effects in communication are intensifying. A concise and accurate near-field XL-MIMO channel model serves as the cornerstone for investigating the near-field effects. However, existing wave-number domain XL-MIMO channel models under near-field conditions require non-clo… ▽ More

    Submitted 16 July, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

  36. arXiv:2406.09304  [pdf

    physics.app-ph eess.SP

    Self-reconfigurable Multifunctional Memristive Nociceptor for Intelligent Robotics

    Authors: Shengbo Wang, Mingchao Fang, Lekai Song, Cong Li, Jian Zhang, Arokia Nathan, Guohua Hu, Shuo Gao

    Abstract: Artificial nociceptors, mimicking human-like stimuli perception, are of significance for intelligent robotics to work in hazardous and dynamic scenarios. One of the most essential characteristics of the human nociceptor is its self-adjustable attribute, which indicates that the threshold of determination of a potentially hazardous stimulus relies on environmental knowledge. This critical attribute… ▽ More

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: 14 pages, 4 figures

  37. arXiv:2406.09190  [pdf, other

    eess.SP

    Rethinking Waveform for 6G: Harnessing Delay-Doppler Alignment Modulation

    Authors: Zhiqiang Xiao, Xianda Liu, Yong Zeng, J. Andrew Zhang, Shi Jin, Rui Zhang

    Abstract: Waveform design has served as a cornerstone for each generation of mobile communication systems. The future sixth-generation (6G) mobile communication networks are expected to employ larger-scale antenna arrays and exploit higher-frequency bands for further boosting data transmission rate and providing ubiquitous wireless sensing. This brings new opportunities and challenges for 6G waveform design… ▽ More

    Submitted 13 June, 2024; originally announced June 2024.

  38. arXiv:2406.08266  [pdf, other

    eess.AS cs.SD

    Refining Self-Supervised Learnt Speech Representation using Brain Activations

    Authors: Hengyu Li, Kangdi Mei, Zhaoci Liu, Yang Ai, Liping Chen, Jie Zhang, Zhenhua Ling

    Abstract: It was shown in literature that speech representations extracted by self-supervised pre-trained models exhibit similarities with brain activations of human for speech perception and fine-tuning speech representation models on downstream tasks can further improve the similarity. However, it still remains unclear if this similarity can be used to optimize the pre-trained speech models. In this work,… ▽ More

    Submitted 13 June, 2024; v1 submitted 12 June, 2024; originally announced June 2024.

    Comments: accpeted by Interspeech2024

  39. arXiv:2406.07992  [pdf, other

    cs.LG eess.SP

    A Federated Online Restless Bandit Framework for Cooperative Resource Allocation

    Authors: Jingwen Tong, Xinran Li, Liqun Fu, Jun Zhang, Khaled B. Letaief

    Abstract: Restless multi-armed bandits (RMABs) have been widely utilized to address resource allocation problems with Markov reward processes (MRPs). Existing works often assume that the dynamics of MRPs are known prior, which makes the RMAB problem solvable from an optimization perspective. Nevertheless, an efficient learning-based solution for RMABs with unknown system dynamics remains an open problem. In… ▽ More

    Submitted 12 June, 2024; originally announced June 2024.

  40. arXiv:2406.07914  [pdf, other

    cs.SD eess.AS

    Can Large Language Models Understand Spatial Audio?

    Authors: Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Jun Zhang, Lu Lu, Zejun Ma, Yuxuan Wang, Chao Zhang

    Abstract: This paper explores enabling large language models (LLMs) to understand spatial information from multichannel audio, a skill currently lacking in auditory LLMs. By leveraging LLMs' advanced cognitive and inferential abilities, the aim is to enhance understanding of 3D environments via audio. We study 3 spatial audio tasks: sound source localization (SSL), far-field speech recognition (FSR), and lo… ▽ More

    Submitted 14 June, 2024; v1 submitted 12 June, 2024; originally announced June 2024.

    Comments: Accepted at Interspeech 2024

  41. arXiv:2406.07880  [pdf, other

    cs.CV eess.IV

    A Comprehensive Survey on Machine Learning Driven Material Defect Detection: Challenges, Solutions, and Future Prospects

    Authors: Jun Bai, Di Wu, Tristan Shelley, Peter Schubel, David Twine, John Russell, Xuesen Zeng, Ji Zhang

    Abstract: Material defects (MD) represent a primary challenge affecting product performance and giving rise to safety issues in related products. The rapid and accurate identification and localization of MD constitute crucial research endeavours in addressing contemporary challenges associated with MD. Although conventional non-destructive testing methods such as ultrasonic and X-ray approaches have mitigat… ▽ More

    Submitted 12 June, 2024; originally announced June 2024.

  42. arXiv:2406.07842  [pdf, other

    eess.AS cs.CL

    Dual-Pipeline with Low-Rank Adaptation for New Language Integration in Multilingual ASR

    Authors: Yerbolat Khassanov, Zhipeng Chen, Tianfeng Chen, Tze Yuang Chong, Wei Li, Jun Zhang, Lu Lu, Yuxuan Wang

    Abstract: This paper addresses challenges in integrating new languages into a pre-trained multilingual automatic speech recognition (mASR) system, particularly in scenarios where training data for existing languages is limited or unavailable. The proposed method employs a dual-pipeline with low-rank adaptation (LoRA). It maintains two data flow pipelines-one for existing languages and another for new langua… ▽ More

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: 5 pages, 2 figures, 4 tables

  43. arXiv:2406.07012  [pdf, other

    cs.SD cs.CL eess.AS

    Bridging Language Gaps in Audio-Text Retrieval

    Authors: Zhiyong Yan, Heinrich Dinkel, Yongqing Wang, Jizhong Liu, Junbo Zhang, Yujun Wang, Bin Wang

    Abstract: Audio-text retrieval is a challenging task, requiring the search for an audio clip or a text caption within a database. The predominant focus of existing research on English descriptions poses a limitation on the applicability of such models, given the abundance of non-English content in real-world data. To address these linguistic disparities, we propose a language enhancement (LE), using a multi… ▽ More

    Submitted 16 June, 2024; v1 submitted 11 June, 2024; originally announced June 2024.

    Comments: interspeech2024

  44. arXiv:2406.06992  [pdf, other

    cs.SD eess.AS

    Scaling up masked audio encoder learning for general audio classification

    Authors: Heinrich Dinkel, Zhiyong Yan, Yongqing Wang, Junbo Zhang, Yujun Wang, Bin Wang

    Abstract: Despite progress in audio classification, a generalization gap remains between speech and other sound domains, such as environmental sounds and music. Models trained for speech tasks often fail to perform well on environmental or musical audio tasks, and vice versa. While self-supervised (SSL) audio representations offer an alternative, there has been limited exploration of scaling both model and… ▽ More

    Submitted 13 June, 2024; v1 submitted 11 June, 2024; originally announced June 2024.

    Comments: Interspeech 2024

  45. arXiv:2406.05499  [pdf, other

    eess.SP

    A Pixel-based Reconfigurable Antenna Design for Fluid Antenna Systems

    Authors: Jichen Zhang, Junhui Rao, Zhaoyang Ming, Zan Li, Chi-Yuk Chiu, Kai-Kit Wong, Kin-Fai Tong, Ross Murch

    Abstract: Fluid Antenna Systems (FASs) have recently been proposed for enhancing the performance of wireless communication. Previous antenna designs to meet the requirements of FAS have been based on mechanically movable or liquid antennas and therefore have limited reconfiguration speeds. In this paper, we propose a design for a pixel-based reconfigurable antenna (PRA) that meets the requirements of FAS an… ▽ More

    Submitted 14 June, 2024; v1 submitted 8 June, 2024; originally announced June 2024.

    Comments: 13 pages, 16 figures, Submitted to IEEE Transations on Antennas and Propagation

  46. arXiv:2406.05325  [pdf, other

    eess.AS cs.SD

    LDM-SVC: Latent Diffusion Model Based Zero-Shot Any-to-Any Singing Voice Conversion with Singer Guidance

    Authors: Shihao Chen, Yu Gu, Jie Zhang, Na Li, Rilin Chen, Liping Chen, Lirong Dai

    Abstract: Any-to-any singing voice conversion (SVC) is an interesting audio editing technique, aiming to convert the singing voice of one singer into that of another, given only a few seconds of singing data. However, during the conversion process, the issue of timbre leakage is inevitable: the converted singing voice still sounds like the original singer's voice. To tackle this, we propose a latent diffusi… ▽ More

    Submitted 7 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  47. arXiv:2406.03872  [pdf, other

    cs.CL cs.SD eess.AS

    BLSP-Emo: Towards Empathetic Large Speech-Language Models

    Authors: Chen Wang, Minpeng Liao, Zhongqiang Huang, Junhong Wu, Chengqing Zong, Jiajun Zhang

    Abstract: The recent release of GPT-4o showcased the potential of end-to-end multimodal models, not just in terms of low latency but also in their ability to understand and generate expressive speech with rich emotions. While the details are unknown to the open research community, it likely involves significant amounts of curated data and compute, neither of which is readily accessible. In this paper, we pr… ▽ More

    Submitted 6 June, 2024; originally announced June 2024.

  48. arXiv:2406.03002  [pdf, other

    eess.IV cs.CV

    Phy-Diff: Physics-guided Hourglass Diffusion Model for Diffusion MRI Synthesis

    Authors: Juanhua Zhang, Ruodan Yan, Alessandro Perelli, Xi Chen, Chao Li

    Abstract: Diffusion MRI (dMRI) is an important neuroimaging technique with high acquisition costs. Deep learning approaches have been used to enhance dMRI and predict diffusion biomarkers through undersampled dMRI. To generate more comprehensive raw dMRI, generative adversarial network based methods are proposed to include b-values and b-vectors as conditions, but they are limited by unstable training and l… ▽ More

    Submitted 10 July, 2024; v1 submitted 5 June, 2024; originally announced June 2024.

    Comments: Accepted by MICCAI 2024

  49. arXiv:2406.02975  [pdf, other

    eess.SP

    A Shared-Aperture Dual-Band sub-6 GHz and mmWave Reconfigurable Intelligent Surface With Independent Operation

    Authors: Junhui Rao, Yujie Zhang, Shiwen Tang, Zan Li, Zhaoyang Ming, Jichen Zhang, Chi Yuk Chiu, Ross Murch

    Abstract: A novel dual-band reconfigurable intelligent surface (DBI-RIS) design that combines the functionalities of millimeter-wave (mmWave) and sub-6 GHz bands within a single aperture is proposed. This design aims to bridge the gap between current single-band reconfigurable intelligent surfaces (RISs) and wireless systems utilizing sub-6 GHz and mmWave bands that require RIS with independently reconfigur… ▽ More

    Submitted 5 June, 2024; originally announced June 2024.

  50. arXiv:2406.02430  [pdf, other

    eess.AS cs.SD

    Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

    Authors: Philip Anastassiou, Jiawei Chen, Jitong Chen, Yuanzhe Chen, Zhuo Chen, Ziyi Chen, Jian Cong, Lelai Deng, Chuang Ding, Lu Gao, Mingqing Gong, Peisong Huang, Qingqing Huang, Zhiying Huang, Yuanyuan Huo, Dongya Jia, Chumin Li, Feiya Li, Hui Li, Jiaxin Li, Xiaoyang Li, Xingxing Li, Lin Liu, Shouda Liu, Sichao Liu , et al. (21 additional authors not shown)

    Abstract: We introduce Seed-TTS, a family of large-scale autoregressive text-to-speech (TTS) models capable of generating speech that is virtually indistinguishable from human speech. Seed-TTS serves as a foundation model for speech generation and excels in speech in-context learning, achieving performance in speaker similarity and naturalness that matches ground truth human speech in both objective and sub… ▽ More

    Submitted 4 June, 2024; originally announced June 2024.