Showing 1–18 of 18 results for author: Kong, Z

  1. arXiv:2406.15487  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Improving Text-To-Audio Models with Synthetic Captions

    Authors: Zhifeng Kong, Sang-gil Lee, Deepanway Ghosal, Navonil Majumder, Ambuj Mehrish, Rafael Valle, Soujanya Poria, Bryan Catanzaro

    Abstract: It is an open challenge to obtain high quality training data, especially captions, for text-to-audio models. Although prior methods have leveraged text-only language models to augment and improve captions, such methods have limitations related to scale and coherence between audio and captions. In this work, we propose an audio captioning pipeline that uses an audio language model…

    Submitted 8 July, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

  2. arXiv:2404.07616  [pdf, other]

    cs.CL cs.SD eess.AS

    Audio Dialogues: Dialogues dataset for audio and music understanding

    Authors: Arushi Goel, Zhifeng Kong, Rafael Valle, Bryan Catanzaro

    Abstract: Existing datasets for audio understanding primarily focus on single-turn interactions (i.e. audio captioning, audio question answering) for describing audio in natural language, thus limiting understanding audio via interactive dialogue. To address this gap, we introduce Audio Dialogues: a multi-turn dialogue dataset containing 163.8k samples for general audio sounds and music. In addition to dial…

    Submitted 11 April, 2024; originally announced April 2024.

    Comments: Demo website: https://audiodialogues.github.io/

  3. arXiv:2402.08235  [pdf, other]

    eess.IV cs.CV

    Color Image Denoising Using The Green Channel Prior

    Authors: Zhaoming Kong, Xiaowei Yang

    Abstract: Noise removal in the standard RGB (sRGB) space remains a challenging task, in that the noise statistics of real-world images can be different in R, G and B channels. In fact, the green channel usually has twice the sampling rate in raw data and a higher signal-to-noise ratio than red/blue ones. However, the green channel prior (GCP) is often understated or ignored in color image denoising since ma…

    Submitted 13 February, 2024; originally announced February 2024.

  4. arXiv:2402.01831  [pdf, other]

    cs.SD cs.LG eess.AS

    Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities

    Authors: Zhifeng Kong, Arushi Goel, Rohan Badlani, Wei Ping, Rafael Valle, Bryan Catanzaro

    Abstract: Augmenting large language models (LLMs) to understand audio -- including non-speech sounds and non-verbal speech -- is critically important for diverse real-world applications of LLMs. In this paper, we propose Audio Flamingo, a novel audio language model with 1) strong audio understanding abilities, 2) the ability to quickly adapt to unseen tasks via in-context learning and retrieval, and 3) stro…

    Submitted 28 May, 2024; v1 submitted 2 February, 2024; originally announced February 2024.

    Comments: ICML 2024

  5. CleanUNet 2: A Hybrid Speech Denoising Model on Waveform and Spectrogram

    Authors: Zhifeng Kong, Wei Ping, Ambrish Dantrey, Bryan Catanzaro

    Abstract: In this work, we present CleanUNet 2, a speech denoising model that combines the advantages of waveform denoiser and spectrogram denoiser and achieves the best of both worlds. CleanUNet 2 uses a two-stage framework inspired by popular speech synthesis methods that consist of a waveform model and a spectrogram model. Specifically, CleanUNet 2 builds upon CleanUNet, the state-of-the-art waveform den…

    Submitted 12 September, 2023; originally announced September 2023.

    Comments: INTERSPEECH 2023

    Journal ref: Proc. INTERSPEECH 2023, pages 790--794

  6. arXiv:2304.08990  [pdf, other]

    eess.IV cs.CV

    A Comparison of Image Denoising Methods

    Authors: Zhaoming Kong, Fangxi Deng, Haomin Zhuang, Jun Yu, Lifang He, Xiaowei Yang

    Abstract: The advancement of imaging devices and countless images generated every day pose an increasingly high demand on image denoising, which still remains a challenging task in terms of both effectiveness and efficiency. To improve denoising quality, numerous denoising techniques and approaches have been proposed in the past decades, including different transforms, regularization terms, algebraic represe…

    Submitted 9 May, 2023; v1 submitted 18 April, 2023; originally announced April 2023.

    Comments: In this paper, we intend to collect and compare various denoising methods to investigate their effectiveness, efficiency, applicability and generalization ability with both synthetic and real-world experiments. arXiv admin note: substantial text overlap with arXiv:2011.03462

  7. arXiv:2301.01732  [pdf, ps, other]

    eess.IV cs.CV physics.med-ph

    Explicit Abnormality Extraction for Unsupervised Motion Artifact Reduction in Magnetic Resonance Imaging

    Authors: Yusheng Zhou, Hao Li, Jianan Liu, Zhengmin Kong, Tao Huang, Euijoon Ahn, Zhihan Lv, Jinman Kim, David Dagan Feng

    Abstract: Motion artifacts compromise the quality of magnetic resonance imaging (MRI) and pose challenges to achieving diagnostic outcomes and image-guided therapies. In recent years, supervised deep learning approaches have emerged as successful solutions for motion artifact reduction (MAR). One disadvantage of these methods is their dependency on acquiring paired sets of motion artifact-corrupted (MA-corr…

    Submitted 5 July, 2024; v1 submitted 4 January, 2023; originally announced January 2023.

  8. arXiv:2202.07790  [pdf, other]

    cs.SD cs.LG eess.AS

    Speech Denoising in the Waveform Domain with Self-Attention

    Authors: Zhifeng Kong, Wei Ping, Ambrish Dantrey, Bryan Catanzaro

    Abstract: In this work, we present CleanUNet, a causal speech denoising model on the raw waveform. The proposed model is based on an encoder-decoder architecture combined with several self-attention blocks to refine its bottleneck representations, which is crucial to obtain good results. The model is optimized through a set of losses defined over both waveform and multi-resolution spectrograms. The proposed…

    Submitted 6 July, 2022; v1 submitted 15 February, 2022; originally announced February 2022.

    Comments: Published in ICASSP 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Listen to audio samples from CleanUNet at: https://cleanunet.github.io/

  9. arXiv:2111.10803  [pdf, other]

    eess.IV cs.CV

    Structure-Preserving Graph Kernel for Brain Network Classification

    Authors: Jun Yu, Zhaoming Kong, Aditya Kendre, Hao Peng, Carl Yang, Lichao Sun, Alex Leow, Lifang He

    Abstract: This paper presents a novel graph-based kernel learning approach for connectome analysis. Specifically, we demonstrate how to leverage the naturally available structure within the graph representation to encode prior knowledge in the kernel. We first proposed a matrix factorization to directly extract structural features from natural symmetric graph representations of connectome data. We then used…

    Submitted 21 February, 2022; v1 submitted 21 November, 2021; originally announced November 2021.

  10. arXiv:2011.03462  [pdf, other]

    eess.IV cs.CV

    A Comprehensive Comparison of Multi-Dimensional Image Denoising Methods

    Authors: Zhaoming Kong, Xiaowei Yang, Lifang He

    Abstract: Filtering multi-dimensional images such as color images, color videos, multispectral images and magnetic resonance images is challenging in terms of both effectiveness and efficiency. Leveraging the nonlocal self-similarity (NLSS) characteristic of images and sparse representation in the transform domain, the block-matching and 3D filtering (BM3D) based methods show powerful denoising performance.…

    Submitted 6 November, 2020; originally announced November 2020.

  11. arXiv:2009.09761  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD stat.ML

    DiffWave: A Versatile Diffusion Model for Audio Synthesis

    Authors: Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, Bryan Catanzaro

    Abstract: In this work, we propose DiffWave, a versatile diffusion probabilistic model for conditional and unconditional waveform generation. The model is non-autoregressive, and converts the white noise signal into structured waveform through a Markov chain with a constant number of steps at synthesis. It is efficiently trained by optimizing a variant of variational bound on the data likelihood. DiffWave p…

    Submitted 30 March, 2021; v1 submitted 21 September, 2020; originally announced September 2020.

    Comments: ICLR 2021 (oral)

  12. arXiv:1801.09289  [pdf, other]

    eess.SY cs.FL

    Data-Driven Approximate Abstraction for Black-Box Piecewise Affine Systems

    Authors: Gang Chen, Zhaodan Kong

    Abstract: How to effectively and reliably guarantee the correct functioning of safety-critical cyber-physical systems in uncertain conditions is a challenging problem. This paper presents a data-driven algorithm to derive approximate abstractions for piecewise affine systems with unknown dynamics. It advocates a significant shift from the current paradigm of abstraction, which starts from a model with known…

    Submitted 30 January, 2018; v1 submitted 28 January, 2018; originally announced January 2018.

  13. arXiv:1609.07409  [pdf, other]

    eess.SY

    Q-Learning for Robust Satisfaction of Signal Temporal Logic Specifications

    Authors: Derya Aksaray, Austin Jones, Zhaodan Kong, Mac Schwager, Calin Belta

    Abstract: This paper addresses the problem of learning optimal policies for satisfying signal temporal logic (STL) specifications by agents with unknown stochastic dynamics. The system is modeled as a Markov decision process, in which the states represent partitions of a continuous space and the transition probabilities are unknown. We formulate two synthesis problems where the desired STL specification is…

    Submitted 23 September, 2016; originally announced September 2016.

    Comments: This paper is accepted to IEEE CDC 2016

  14. arXiv:1603.00814  [pdf, other]

    eess.SY

    Active Requirement Mining of Bounded-Time Temporal Properties of Cyber-Physical Systems

    Authors: Gang Chen, Zachary Sabato, Zhaodan Kong

    Abstract: This paper uses active learning to solve the problem of mining bounded-time signal temporal requirements of cyber-physical systems or simply the requirement mining problem. By utilizing robustness degree, we formulate the requirement mining problem into two optimization problems, a parameter synthesis problem and a falsification problem. We then propose a new active learning algorithm called Gaus…

    Submitted 2 March, 2016; originally announced March 2016.

  15. arXiv:1510.06460  [pdf, other]

    eess.SY cs.RO

    Robust Satisfaction of Temporal Logic Specifications via Reinforcement Learning

    Authors: Austin Jones, Derya Aksaray, Zhaodan Kong, Mac Schwager, Calin Belta

    Abstract: We consider the problem of steering a system with unknown, stochastic dynamics to satisfy a rich, temporally layered task given as a signal temporal logic formula. We represent the system as a Markov decision process in which the states are built from a partition of the state space and the transition probabilities are unknown. We present provably convergent reinforcement learning algorithms to max…

    Submitted 21 October, 2015; originally announced October 2015.

    Comments: 8 pages, 4 figures

  16. arXiv:1403.5462  [pdf, ps, other]

    eess.SY

    Saliency Based Control in Random Feature Networks

    Authors: John Baillieul, Zhaodan Kong

    Abstract: The ability to rapidly focus attention and react to salient environmental features enables animals to move agilely through their habitats. To replicate this kind of high-performance control of movement in synthetic systems, we propose a new approach to feedback control that bases control actions on randomly perceived features. Connections will be made with recent work incorporating communication pr…

    Submitted 6 August, 2014; v1 submitted 21 March, 2014; originally announced March 2014.

    Comments: 9 pages, 2 figures

  17. arXiv:1311.4419  [pdf, other]

    eess.SY cs.RO physics.bio-ph

    Perception and Steering Control in Paired Bat Flight

    Authors: Zhaodan Kong, Kayhan Ozcimder, Nathan W. Fuller, John Baillieul

    Abstract: Animals within groups need to coordinate their reactions to perceived environmental features and to each other in order to safely move from one point to another. This paper extends our previously published work on the flight patterns of Myotis velifer that have been observed in a habitat near Johnson City, Texas. Each evening, these bats emerge from a cave in sequences of small groups that typical…

    Submitted 15 November, 2013; originally announced November 2013.

    Comments: Submitted to the 19th World Congress of the International Federation of Automatic Control (IFAC)

  18. arXiv:1303.3072  [pdf, other]

    eess.SY

    Optical Flow Sensing and the Inverse Perception Problem for Flying Bats

    Authors: Zhaodan Kong, Kayhan Özcimder, Nathan Fuller, Alison Greco, Diane Theriault, Zheng Wu, Thomas Kunz, Margrit Betke, John Baillieul

    Abstract: The movements of birds, bats, and other flying species are governed by complex sensorimotor systems that allow the animals to react to stationary environmental features as well as to wind disturbances, other animals in nearby airspace, and a wide variety of unexpected challenges. The paper and talk will describe research that analyzes the three-dimensional trajectories of bats flying in a habitat…

    Submitted 12 March, 2013; originally announced March 2013.

    Comments: 20 pages, 7 figures