-
ActionSwitch: Class-agnostic Detection of Simultaneous Actions in Streaming Videos
Authors:
Hyolim Kang,
Jeongseok Hyun,
Joungbin An,
Youngjae Yu,
Seon Joo Kim
Abstract:
Online Temporal Action Localization (On-TAL) is a critical task that aims to instantaneously identify action instances in untrimmed streaming videos as soon as an action concludes -- a major leap from frame-based Online Action Detection (OAD). Yet, the challenge of detecting overlapping actions is often overlooked even though it is a common scenario in streaming videos. Current methods that can ad…
▽ More
Online Temporal Action Localization (On-TAL) is a critical task that aims to instantaneously identify action instances in untrimmed streaming videos as soon as an action concludes -- a major leap from frame-based Online Action Detection (OAD). Yet, the challenge of detecting overlapping actions is often overlooked even though it is a common scenario in streaming videos. Current methods that can address concurrent actions depend heavily on class information, limiting their flexibility. This paper introduces ActionSwitch, the first class-agnostic On-TAL framework capable of detecting overlapping actions. By obviating the reliance on class information, ActionSwitch provides wider applicability to various situations, including overlapping actions of the same class or scenarios where class information is unavailable. This approach is complemented by the proposed "conservativeness loss", which directly embeds a conservative decision-making principle into the loss function for On-TAL. Our ActionSwitch achieves state-of-the-art performance in complex datasets, including Epic-Kitchens 100 targeting the challenging egocentric view and FineAction consisting of fine-grained actions.
△ Less
Submitted 17 July, 2024;
originally announced July 2024.
-
Towards Responsible Development of Generative AI for Education: An Evaluation-Driven Approach
Authors:
Irina Jurenka,
Markus Kunesch,
Kevin R. McKee,
Daniel Gillick,
Shaojian Zhu,
Sara Wiltberger,
Shubham Milind Phal,
Katherine Hermann,
Daniel Kasenberg,
Avishkar Bhoopchand,
Ankit Anand,
Miruna Pîslar,
Stephanie Chan,
Lisa Wang,
Jennifer She,
Parsa Mahmoudieh,
Aliya Rysbek,
Wei-Jen Ko,
Andrea Huber,
Brett Wiltshire,
Gal Elidan,
Roni Rabin,
Jasmin Rubinovitz,
Amit Pitaru,
Mac McAllister
, et al. (49 additional authors not shown)
Abstract:
A major challenge facing the world is the provision of equitable and universal access to quality education. Recent advances in generative AI (gen AI) have created excitement about the potential of new technologies to offer a personal tutor for every learner and a teaching assistant for every teacher. The full extent of this dream, however, has not yet materialised. We argue that this is primarily…
▽ More
A major challenge facing the world is the provision of equitable and universal access to quality education. Recent advances in generative AI (gen AI) have created excitement about the potential of new technologies to offer a personal tutor for every learner and a teaching assistant for every teacher. The full extent of this dream, however, has not yet materialised. We argue that this is primarily due to the difficulties with verbalising pedagogical intuitions into gen AI prompts and the lack of good evaluation practices, reinforced by the challenges in defining excellent pedagogy. Here we present our work collaborating with learners and educators to translate high level principles from learning science into a pragmatic set of seven diverse educational benchmarks, spanning quantitative, qualitative, automatic and human evaluations; and to develop a new set of fine-tuning datasets to improve the pedagogical capabilities of Gemini, introducing LearnLM-Tutor. Our evaluations show that LearnLM-Tutor is consistently preferred over a prompt tuned Gemini by educators and learners on a number of pedagogical dimensions. We hope that this work can serve as a first step towards developing a comprehensive educational evaluation framework, and that this can enable rapid progress within the AI and EdTech communities towards maximising the positive impact of gen AI in education.
△ Less
Submitted 21 May, 2024;
originally announced July 2024.
-
Flexible Antenna Arrays for Wireless Communications: Modeling and Performance Evaluation
Authors:
Songjie Yang,
Jiancheng An,
Yue Xiu,
Wanting Lyu,
Boyu Ning,
Zhongpei Zhang,
Merouane Debbah,
Chau Yuen
Abstract:
Flexible antenna arrays (FAAs), distinguished by their rotatable, bendable, and foldable properties, are extensively employed in flexible radio systems to achieve customized radiation patterns. This paper aims to illustrate that FAAs, capable of dynamically adjusting surface shapes, can enhance communication performances with both omni-directional and directional antenna patterns, in terms of mult…
▽ More
Flexible antenna arrays (FAAs), distinguished by their rotatable, bendable, and foldable properties, are extensively employed in flexible radio systems to achieve customized radiation patterns. This paper aims to illustrate that FAAs, capable of dynamically adjusting surface shapes, can enhance communication performances with both omni-directional and directional antenna patterns, in terms of multi-path channel power and channel angle Cramér-Rao bounds. To this end, we develop a mathematical model that elucidates the impacts of the variations in antenna positions and orientations as the array transitions from a flat to a rotated, bent, and folded state, all contingent on the flexible degree-of-freedom. Moreover, since the array shape adjustment operates across the entire beamspace, especially with directional patterns, we discuss the sum-rate in the multi-sector base station that covers the $360^\circ$ communication area. Particularly, to thoroughly explore the multi-sector sum-rate, we propose separate flexible precoding (SFP), joint flexible precoding (JFP), and semi-joint flexible precoding (SJFP), respectively. In our numerical analysis comparing the optimized FAA to the fixed uniform planar array, we find that the bendable FAA achieves a remarkable $156\%$ sum-rate improvement compared to the fixed planar array in the case of JFP with the directional pattern. Furthermore, the rotatable FAA exhibits notably superior performance in SFP and SJFP cases with omni-directional patterns, with respective $35\%$ and $281\%$.
△ Less
Submitted 5 July, 2024;
originally announced July 2024.
-
Stacked Intelligent Metasurfaces for Wireless Sensing and Communication: Applications and Challenges
Authors:
Hao Liu,
Jiancheng An,
Xing Jia,
Shining Lin,
Xianghao Yao,
Lu Gan,
Bruno Clerckx,
Chau Yuen,
Mehdi Bennis,
Mérouane Debbah
Abstract:
The rapid advancement of wireless communication technologies has precipitated an unprecedented demand for high data rates, extremely low latency, and ubiquitous connectivity. In order to achieve these goals, stacked intelligent metasurfaces (SIM) has been developed as a novel solution to perform advanced signal processing tasks directly in the electromagnetic wave domain, thus achieving ultra-fast…
▽ More
The rapid advancement of wireless communication technologies has precipitated an unprecedented demand for high data rates, extremely low latency, and ubiquitous connectivity. In order to achieve these goals, stacked intelligent metasurfaces (SIM) has been developed as a novel solution to perform advanced signal processing tasks directly in the electromagnetic wave domain, thus achieving ultra-fast computing speed and reducing hardware complexity. This article provides an overview of the SIM technology by discussing its hardware architectures, advantages, and potential applications for wireless sensing and communication. Specifically, we explore the utilization of SIMs in enabling wave-domain beamforming, channel modeling and estimation in SIM-assisted communication systems. Furthermore, we elaborate on the potential of utilizing a SIM to build a hybrid optical-electronic neural network (HOENN) and demonstrate its efficacy by examining two case studies: disaster monitoring and direction-of-arrival estimation. Finally, we identify key implementation challenges, including practical hardware imperfections, efficient SIM configuration for realizing wave-domain signal processing, and performance analysis to motivate future research on this important and far-reaching topic.
△ Less
Submitted 3 July, 2024;
originally announced July 2024.
-
Environment-Aware Codebook Design for RIS-Assisted MU-MISO Communications: Implementation and Performance Analysis
Authors:
Zhiheng Yu,
Jiancheng An,
Ertugrul Basar,
Lu Gan,
Chau Yuen
Abstract:
Reconfigurable intelligent surface (RIS) provides a new electromagnetic response control solution, which can proactively reshape the characteristics of wireless channel environments. In RIS-assisted communication systems, the acquisition of channel state information (CSI) and the optimization of reflecting coefficients constitute major design challenges. To address these issues, codebook-based sol…
▽ More
Reconfigurable intelligent surface (RIS) provides a new electromagnetic response control solution, which can proactively reshape the characteristics of wireless channel environments. In RIS-assisted communication systems, the acquisition of channel state information (CSI) and the optimization of reflecting coefficients constitute major design challenges. To address these issues, codebook-based solutions have been developed recently, which, however, are mostly environment-agnostic. In this paper, a novel environment-aware codebook protocol is proposed, which can significantly reduce both pilot overhead and computational complexity, while maintaining expected communication performance. Specifically, first of all, a channel training framework is introduced to divide the training phase into several blocks. In each block, we directly estimate the composite end-to-end channel and focus only on the transmit beamforming. Second, we propose an environment-aware codebook generation scheme, which first generates a group of channels based on statistical CSI, and then obtains their corresponding RIS configuration by utilizing the alternating optimization (AO) method offline. In each online training block, the RIS is configured based on the corresponding codeword in the environment-aware codebook, and the optimal codeword resulting in the highest sum rate is adopted for assisting in the downlink data transmission. Third, we analyze the theoretical performance of the environment-aware codebook-based protocol taking into account the channel estimation errors. Finally, numerical simulations are provided to verify our theoretical analysis and the performance of the proposed scheme. In particular, the simulation results demonstrate that our protocol is more competitive than conventional environment-agnostic codebooks.
△ Less
Submitted 13 June, 2024;
originally announced June 2024.
-
Object Aware Egocentric Online Action Detection
Authors:
Joungbin An,
Yunsu Park,
Hyolim Kang,
Seon Joo Kim
Abstract:
Advancements in egocentric video datasets like Ego4D, EPIC-Kitchens, and Ego-Exo4D have enriched the study of first-person human interactions, which is crucial for applications in augmented reality and assisted living. Despite these advancements, current Online Action Detection methods, which efficiently detect actions in streaming videos, are predominantly designed for exocentric views and thus f…
▽ More
Advancements in egocentric video datasets like Ego4D, EPIC-Kitchens, and Ego-Exo4D have enriched the study of first-person human interactions, which is crucial for applications in augmented reality and assisted living. Despite these advancements, current Online Action Detection methods, which efficiently detect actions in streaming videos, are predominantly designed for exocentric views and thus fail to capitalize on the unique perspectives inherent to egocentric videos. To address this gap, we introduce an Object-Aware Module that integrates egocentric-specific priors into existing OAD frameworks, enhancing first-person footage interpretation. Utilizing object-specific details and temporal dynamics, our module improves scene understanding in detecting actions. Validated extensively on the Epic-Kitchens 100 dataset, our work can be seamlessly integrated into existing models with minimal overhead and bring consistent performance enhancements, marking an important step forward in adapting action detection systems to egocentric video analysis.
△ Less
Submitted 3 June, 2024;
originally announced June 2024.
-
Cross-Modality Jailbreak and Mismatched Attacks on Medical Multimodal Large Language Models
Authors:
Xijie Huang,
Xinyuan Wang,
Hantao Zhang,
Jiawen Xi,
Jingkun An,
Hao Wang,
Chengwei Pan
Abstract:
Security concerns related to Large Language Models (LLMs) have been extensively explored, yet the safety implications for Multimodal Large Language Models (MLLMs), particularly in medical contexts (MedMLLMs), remain insufficiently studied. This paper delves into the underexplored security vulnerabilities of MedMLLMs, especially when deployed in clinical environments where the accuracy and relevanc…
▽ More
Security concerns related to Large Language Models (LLMs) have been extensively explored, yet the safety implications for Multimodal Large Language Models (MLLMs), particularly in medical contexts (MedMLLMs), remain insufficiently studied. This paper delves into the underexplored security vulnerabilities of MedMLLMs, especially when deployed in clinical environments where the accuracy and relevance of question-and-answer interactions are critically tested against complex medical challenges. By combining existing clinical medical data with atypical natural phenomena, we redefine two types of attacks: mismatched malicious attack (2M-attack) and optimized mismatched malicious attack (O2M-attack). Using our own constructed voluminous 3MAD dataset, which covers a wide range of medical image modalities and harmful medical scenarios, we conduct a comprehensive analysis and propose the MCM optimization method, which significantly enhances the attack success rate on MedMLLMs. Evaluations with this dataset and novel attack methods, including white-box attacks on LLaVA-Med and transfer attacks on four other state-of-the-art models, indicate that even MedMLLMs designed with enhanced security features are vulnerable to security breaches. Our work underscores the urgent need for a concerted effort to implement robust security measures and enhance the safety and efficacy of open-source MedMLLMs, particularly given the potential severity of jailbreak attacks and other malicious or clinically significant exploits in medical settings. For further research and replication, anonymous access to our code is available at https://github.com/dirtycomputer/O2M_attack. Warning: Medical large model jailbreaking may generate content that includes unverified diagnoses and treatment recommendations. Always consult professional medical advice.
△ Less
Submitted 26 May, 2024;
originally announced May 2024.
-
Disrupting Diffusion: Token-Level Attention Erasure Attack against Diffusion-based Customization
Authors:
Yisu Liu,
Jinyang An,
Wanqian Zhang,
Dayan Wu,
Jingzi Gu,
Zheng Lin,
Weiping Wang
Abstract:
With the development of diffusion-based customization methods like DreamBooth, individuals now have access to train the models that can generate their personalized images. Despite the convenience, malicious users have misused these techniques to create fake images, thereby triggering a privacy security crisis. In light of this, proactive adversarial attacks are proposed to protect users against cu…
▽ More
With the development of diffusion-based customization methods like DreamBooth, individuals now have access to train the models that can generate their personalized images. Despite the convenience, malicious users have misused these techniques to create fake images, thereby triggering a privacy security crisis. In light of this, proactive adversarial attacks are proposed to protect users against customization. The adversarial examples are trained to distort the customization model's outputs and thus block the misuse. In this paper, we propose DisDiff (Disrupting Diffusion), a novel adversarial attack method to disrupt the diffusion model outputs. We first delve into the intrinsic image-text relationships, well-known as cross-attention, and empirically find that the subject-identifier token plays an important role in guiding image generation. Thus, we propose the Cross-Attention Erasure module to explicitly "erase" the indicated attention maps and disrupt the text guidance. Besides,we analyze the influence of the sampling process of the diffusion model on Projected Gradient Descent (PGD) attack and introduce a novel Merit Sampling Scheduler to adaptively modulate the perturbation updating amplitude in a step-aware manner. Our DisDiff outperforms the state-of-the-art methods by 12.75% of FDFR scores and 7.25% of ISM scores across two facial benchmarks and two commonly used prompts on average.
△ Less
Submitted 30 May, 2024;
originally announced May 2024.
-
Stacked Intelligent Metasurfaces for Holographic MIMO Aided Cell-Free Networks
Authors:
Qingchao Li,
Mohammed El-Hajjar,
Chao Xu,
Jiancheng An,
Chau Yuen,
Lajos Hanzo
Abstract:
Large-scale multiple-input and multiple-output (MIMO) systems are capable of achieving high date rate. However, given the high hardware cost and excessive power consumption of massive MIMO systems, as a remedy, intelligent metasurfaces have been designed for efficient holographic MIMO (HMIMO) systems. In this paper, we propose a HMIMO architecture based on stacked intelligent metasurfaces (SIM) fo…
▽ More
Large-scale multiple-input and multiple-output (MIMO) systems are capable of achieving high date rate. However, given the high hardware cost and excessive power consumption of massive MIMO systems, as a remedy, intelligent metasurfaces have been designed for efficient holographic MIMO (HMIMO) systems. In this paper, we propose a HMIMO architecture based on stacked intelligent metasurfaces (SIM) for the uplink of cell-free systems, where the SIM is employed at the access points (APs) for improving the spectral- and energy-efficiency. Specifically, we conceive distributed beamforming for SIM-assisted cell-free networks, where both the SIM coefficients and the local receiver combiner vectors of each AP are optimized based on the local channel state information (CSI) for the local detection of each user equipment (UE) information. Afterward, the central processing unit (CPU) fuses the local detections gleaned from all APs to detect the aggregate multi-user signal. Specifically, to design the SIM coefficients and the combining vectors of the APs, a low-complexity layer-by-layer iterative optimization algorithm is proposed for maximizing the equivalent gain of the channel spanning from the UEs to the APs. At the CPU, the weight vector used for combining the local detections from all APs is designed based on the minimum mean square error (MMSE) criterion, where the hardware impairments (HWIs) are also taken into consideration based on their statistics. The simulation results show that the SIM-based HMIMO outperforms the conventional single-layer HMIMO in terms of the achievable rate. We demonstrate that both the HWI of the radio frequency (RF) chains at the APs and the UEs limit the achievable rate in the high signal-to-noise-ratio (SNR) region.
△ Less
Submitted 15 May, 2024;
originally announced May 2024.
-
X-posing Free Speech: Examining the Impact of Moderation Relaxation on Online Social Networks
Authors:
Arvindh Arun,
Saurav Chhatani,
Jisun An,
Ponnurangam Kumaraguru
Abstract:
We investigate the impact of free speech and the relaxation of moderation on online social media platforms using Elon Musk's takeover of Twitter as a case study. By curating a dataset of over 10 million tweets, our study employs a novel framework combining content and network analysis. Our findings reveal a significant increase in the distribution of certain forms of hate content, particularly tar…
▽ More
We investigate the impact of free speech and the relaxation of moderation on online social media platforms using Elon Musk's takeover of Twitter as a case study. By curating a dataset of over 10 million tweets, our study employs a novel framework combining content and network analysis. Our findings reveal a significant increase in the distribution of certain forms of hate content, particularly targeting the LGBTQ+ community and liberals. Network analysis reveals the formation of cohesive hate communities facilitated by influential bridge users, with substantial growth in interactions hinting at increased hate production and diffusion. By tracking the temporal evolution of PageRank, we identify key influencers, primarily self-identified far-right supporters disseminating hate against liberals and woke culture. Ironically, embracing free speech principles appears to have enabled hate speech against the very concept of freedom of expression and free speech itself. Our findings underscore the delicate balance platforms must strike between open expression and robust moderation to curb the proliferation of hate online.
△ Less
Submitted 23 May, 2024; v1 submitted 17 April, 2024;
originally announced April 2024.
-
Contextrast: Contextual Contrastive Learning for Semantic Segmentation
Authors:
Changki Sung,
Wanhee Kim,
Jungho An,
Wooju Lee,
Hyungtae Lim,
Hyun Myung
Abstract:
Despite great improvements in semantic segmentation, challenges persist because of the lack of local/global contexts and the relationship between them. In this paper, we propose Contextrast, a contrastive learning-based semantic segmentation method that allows to capture local/global contexts and comprehend their relationships. Our proposed method comprises two parts: a) contextual contrastive lea…
▽ More
Despite great improvements in semantic segmentation, challenges persist because of the lack of local/global contexts and the relationship between them. In this paper, we propose Contextrast, a contrastive learning-based semantic segmentation method that allows to capture local/global contexts and comprehend their relationships. Our proposed method comprises two parts: a) contextual contrastive learning (CCL) and b) boundary-aware negative (BANE) sampling. Contextual contrastive learning obtains local/global context from multi-scale feature aggregation and inter/intra-relationship of features for better discrimination capabilities. Meanwhile, BANE sampling selects embedding features along the boundaries of incorrectly predicted regions to employ them as harder negative samples on our contrastive learning, resolving segmentation issues along the boundary region by exploiting fine-grained details. We demonstrate that our Contextrast substantially enhances the performance of semantic segmentation networks, outperforming state-of-the-art contrastive learning approaches on diverse public datasets, e.g. Cityscapes, CamVid, PASCAL-C, COCO-Stuff, and ADE20K, without an increase in computational cost during inference.
△ Less
Submitted 16 April, 2024;
originally announced April 2024.
-
Learning Deterministic Multi-Clock Timed Automata
Authors:
Yu Teng,
Miaomiao Zhang,
Jie An
Abstract:
We present an algorithm for active learning of deterministic timed automata with multiple clocks. The algorithm is within the querying framework of Angluin's $L^*$ algorithm and follows the idea proposed in existing work on the active learning of deterministic one-clock timed automata. We introduce an equivalence relation over the reset-clocked language of a timed automaton and then transform the…
▽ More
We present an algorithm for active learning of deterministic timed automata with multiple clocks. The algorithm is within the querying framework of Angluin's $L^*$ algorithm and follows the idea proposed in existing work on the active learning of deterministic one-clock timed automata. We introduce an equivalence relation over the reset-clocked language of a timed automaton and then transform the learning problem into learning the corresponding reset-clocked language of the target automaton. Since a reset-clocked language includes the clock reset information which is not observable, we first present the approach of learning from a powerful teacher who can provide reset information by answering reset information queries from the learner. Then we extend the algorithm in a normal teacher situation in which the learner can only ask standard membership query and equivalence query while the learner guesses the reset information. We prove that the learning algorithm terminates and returns a correct deterministic timed automaton. Due to the need of guessing whether the clocks reset at the transitions, the algorithm is of exponential complexity in the size of the target automaton.
△ Less
Submitted 20 May, 2024; v1 submitted 11 April, 2024;
originally announced April 2024.
-
Rematch: Robust and Efficient Matching of Local Knowledge Graphs to Improve Structural and Semantic Similarity
Authors:
Zoher Kachwala,
Jisun An,
Haewoon Kwak,
Filippo Menczer
Abstract:
Knowledge graphs play a pivotal role in various applications, such as question-answering and fact-checking. Abstract Meaning Representation (AMR) represents text as knowledge graphs. Evaluating the quality of these graphs involves matching them structurally to each other and semantically to the source text. Existing AMR metrics are inefficient and struggle to capture semantic similarity. We also l…
▽ More
Knowledge graphs play a pivotal role in various applications, such as question-answering and fact-checking. Abstract Meaning Representation (AMR) represents text as knowledge graphs. Evaluating the quality of these graphs involves matching them structurally to each other and semantically to the source text. Existing AMR metrics are inefficient and struggle to capture semantic similarity. We also lack a systematic evaluation benchmark for assessing structural similarity between AMR graphs. To overcome these limitations, we introduce a novel AMR similarity metric, rematch, alongside a new evaluation for structural similarity called RARE. Among state-of-the-art metrics, rematch ranks second in structural similarity; and first in semantic similarity by 1--5 percentage points on the STS-B and SICK-R benchmarks. Rematch is also five times faster than the next most efficient metric.
△ Less
Submitted 2 April, 2024;
originally announced April 2024.
-
HyperCLOVA X Technical Report
Authors:
Kang Min Yoo,
Jaegeun Han,
Sookyo In,
Heewon Jeon,
Jisu Jeong,
Jaewook Kang,
Hyunwook Kim,
Kyung-Min Kim,
Munhyong Kim,
Sungju Kim,
Donghyun Kwak,
Hanock Kwak,
Se Jung Kwon,
Bado Lee,
Dongsoo Lee,
Gichang Lee,
Jooho Lee,
Baeseong Park,
Seongjin Shin,
Joonsang Yu,
Seolki Baek,
Sumin Byeon,
Eungsup Cho,
Dooseok Choe,
Jeesung Han
, et al. (371 additional authors not shown)
Abstract:
We introduce HyperCLOVA X, a family of large language models (LLMs) tailored to the Korean language and culture, along with competitive capabilities in English, math, and coding. HyperCLOVA X was trained on a balanced mix of Korean, English, and code data, followed by instruction-tuning with high-quality human-annotated datasets while abiding by strict safety guidelines reflecting our commitment t…
▽ More
We introduce HyperCLOVA X, a family of large language models (LLMs) tailored to the Korean language and culture, along with competitive capabilities in English, math, and coding. HyperCLOVA X was trained on a balanced mix of Korean, English, and code data, followed by instruction-tuning with high-quality human-annotated datasets while abiding by strict safety guidelines reflecting our commitment to responsible AI. The model is evaluated across various benchmarks, including comprehensive reasoning, knowledge, commonsense, factuality, coding, math, chatting, instruction-following, and harmlessness, in both Korean and English. HyperCLOVA X exhibits strong reasoning capabilities in Korean backed by a deep understanding of the language and cultural nuances. Further analysis of the inherent bilingual nature and its extension to multilingualism highlights the model's cross-lingual proficiency and strong generalization ability to untargeted languages, including machine translation between several language pairs and cross-lingual inference tasks. We believe that HyperCLOVA X can provide helpful guidance for regions or countries in developing their sovereign LLMs.
△ Less
Submitted 13 April, 2024; v1 submitted 2 April, 2024;
originally announced April 2024.
-
Environment-Aware Codebook for RIS-Assisted MU-MISO Communications: Implementation and Performance Analysis
Authors:
Zhiheng Yu,
Jiancheng An,
Lu Gan,
Chau Yuen
Abstract:
Reconfigurable intelligent surface (RIS) provides a new electromagnetic response control solution, which can reshape the characteristics of wireless channels. In this paper, we propose a novel environment-aware codebook protocol for RIS-assisted multi-user multiple-input single-output (MU-MISO) systems. Specifically, we first introduce a channel training protocol which consists of off-line and on-…
▽ More
Reconfigurable intelligent surface (RIS) provides a new electromagnetic response control solution, which can reshape the characteristics of wireless channels. In this paper, we propose a novel environment-aware codebook protocol for RIS-assisted multi-user multiple-input single-output (MU-MISO) systems. Specifically, we first introduce a channel training protocol which consists of off-line and on-line stages. Secondly, we propose an environment-aware codebook generation scheme, which utilizes the statistical channel state information and alternating optimization method to generate codewords offline. Then, in the on-line stage, we use these pre-designed codewords to configure the RIS, and the optimal codeword resulting in the highest sum rate is adopted for assisting in the downlink data transmission. Thirdly, we analyze the theoretical performance of the proposed protocol considering the channel estimation errors. Finally, numerical simulations are provided to verify our theoretical analysis and the performance of the proposed scheme.
△ Less
Submitted 30 March, 2024;
originally announced April 2024.
-
Algorithmic Ways of Seeing: Using Object Detection to Facilitate Art Exploration
Authors:
Louie Søs Meyer,
Johanne Engel Aaen,
Anitamalina Regitse Tranberg,
Peter Kun,
Matthias Freiberger,
Sebastian Risi,
Anders Sundnes Løvlie
Abstract:
This Research through Design paper explores how object detection may be applied to a large digital art museum collection to facilitate new ways of encountering and experiencing art. We present the design and evaluation of an interactive application called SMKExplore, which allows users to explore a museum's digital collection of paintings by browsing through objects detected in the images, as a no…
▽ More
This Research through Design paper explores how object detection may be applied to a large digital art museum collection to facilitate new ways of encountering and experiencing art. We present the design and evaluation of an interactive application called SMKExplore, which allows users to explore a museum's digital collection of paintings by browsing through objects detected in the images, as a novel form of open-ended exploration. We provide three contributions. First, we show how an object detection pipeline can be integrated into a design process for visual exploration. Second, we present the design and development of an app that enables exploration of an art museum's collection. Third, we offer reflections on future possibilities for museums and HCI researchers to incorporate object detection techniques into the digitalization of museums.
△ Less
Submitted 28 March, 2024;
originally announced March 2024.
-
The Dimensions of the Hulls of Conorm Codes from Algebraic Geometry Codes
Authors:
Junmin An,
Jon-Lark Kim
Abstract:
Chara et al. introduced conorm codes defined over algebraic geometry codes, but the hulls of conorm codes were not determined yet. In this paper, we study the dimension of the hull of conorm codes using the method introduced by Camps et al. For an algebraic geometry code $\mathcal{C}:=C_\mathscr{L}(D, G)$, we consider the divisor $\gcd(G, H)$, where $H$ is the divisor satisfying \[C_\mathscr{L}(D,…
▽ More
Chara et al. introduced conorm codes defined over algebraic geometry codes, but the hulls of conorm codes were not determined yet. In this paper, we study the dimension of the hull of conorm codes using the method introduced by Camps et al. For an algebraic geometry code $\mathcal{C}:=C_\mathscr{L}(D, G)$, we consider the divisor $\gcd(G, H)$, where $H$ is the divisor satisfying \[C_\mathscr{L}(D, G)^\perp=C_\mathscr{L}(D, H).\] Given an extension $F'/\mathbb{F}_{q^t}$ of an algebraic function field $F/\mathbb{F}_q$, we assume that the divisor $\gcd(G, H)$ is non-special. If the degree of $\gcd(G, H)$ is greater than $2g-2+{t\over [F':F]}°\text{Diff}(F'/F)$, then we have determined the exact dimension of the hull of the conorm of $\mathcal{C}$. If not, we have determined the lower bound of the dimension of the hull of the conorm of $\mathcal{C}$. We provide some examples for the dimension of the hull of certain conorm codes of AG codes defined over a rational function field.
△ Less
Submitted 26 March, 2024;
originally announced March 2024.
-
ChatGPT Rates Natural Language Explanation Quality Like Humans: But on Which Scales?
Authors:
Fan Huang,
Haewoon Kwak,
Kunwoo Park,
Jisun An
Abstract:
As AI becomes more integral in our lives, the need for transparency and responsibility grows. While natural language explanations (NLEs) are vital for clarifying the reasoning behind AI decisions, evaluating them through human judgments is complex and resource-intensive due to subjectivity and the need for fine-grained ratings. This study explores the alignment between ChatGPT and human assessment…
▽ More
As AI becomes more integral in our lives, the need for transparency and responsibility grows. While natural language explanations (NLEs) are vital for clarifying the reasoning behind AI decisions, evaluating them through human judgments is complex and resource-intensive due to subjectivity and the need for fine-grained ratings. This study explores the alignment between ChatGPT and human assessments across multiple scales (i.e., binary, ternary, and 7-Likert scale). We sample 300 data instances from three NLE datasets and collect 900 human annotations for both informativeness and clarity scores as the text quality measurement. We further conduct paired comparison experiments under different ranges of subjectivity scores, where the baseline comes from 8,346 human annotations. Our results show that ChatGPT aligns better with humans in more coarse-grained scales. Also, paired comparisons and dynamic prompting (i.e., providing semantically similar examples in the prompt) improve the alignment. This research advances our understanding of large language models' capabilities to assess the text explanation quality in different configurations for responsible AI development.
△ Less
Submitted 26 March, 2024;
originally announced March 2024.
-
AGFSync: Leveraging AI-Generated Feedback for Preference Optimization in Text-to-Image Generation
Authors:
Jingkun An,
Yinghao Zhu,
Zongjian Li,
Haoran Feng,
Bohua Chen,
Yemin Shi,
Chengwei Pan
Abstract:
Text-to-Image (T2I) diffusion models have achieved remarkable success in image generation. Despite their progress, challenges remain in both prompt-following ability, image quality and lack of high-quality datasets, which are essential for refining these models. As acquiring labeled data is costly, we introduce AGFSync, a framework that enhances T2I diffusion models through Direct Preference Optim…
▽ More
Text-to-Image (T2I) diffusion models have achieved remarkable success in image generation. Despite their progress, challenges remain in both prompt-following ability, image quality and lack of high-quality datasets, which are essential for refining these models. As acquiring labeled data is costly, we introduce AGFSync, a framework that enhances T2I diffusion models through Direct Preference Optimization (DPO) in a fully AI-driven approach. AGFSync utilizes Vision-Language Models (VLM) to assess image quality across style, coherence, and aesthetics, generating feedback data within an AI-driven loop. By applying AGFSync to leading T2I models such as SD v1.4, v1.5, and SDXL, our extensive experiments on the TIFA dataset demonstrate notable improvements in VQA scores, aesthetic evaluations, and performance on the HPSv2 benchmark, consistently outperforming the base models. AGFSync's method of refining T2I diffusion models paves the way for scalable alignment techniques.
△ Less
Submitted 3 April, 2024; v1 submitted 20 March, 2024;
originally announced March 2024.
-
Stacked Intelligent Metasurface Enabled LEO Satellite Communications Relying on Statistical CSI
Authors:
Shining Lin,
Jiancheng An,
Lu Gan,
Mérouane Debbah,
Chau Yuen
Abstract:
Low earth orbit (LEO) satellite communication systems have gained increasing attention as a crucial supplement to terrestrial wireless networks due to their extensive coverage area. This letter presents a novel system design for LEO satellite systems by leveraging stacked intelligent metasurface (SIM) technology. Specifically, the lightweight and energy-efficient SIM is mounted on a satellite to a…
▽ More
Low earth orbit (LEO) satellite communication systems have gained increasing attention as a crucial supplement to terrestrial wireless networks due to their extensive coverage area. This letter presents a novel system design for LEO satellite systems by leveraging stacked intelligent metasurface (SIM) technology. Specifically, the lightweight and energy-efficient SIM is mounted on a satellite to achieve multiuser beamforming directly in the electromagnetic wave domain, which substantially reduces the processing delay and computational load of the satellite compared to the traditional digital beamforming scheme. To overcome the challenges of obtaining instantaneous channel state information (CSI) at the transmitter and maximize the system's performance, a joint power allocation and SIM phase shift optimization problem for maximizing the ergodic sum rate is formulated based on statistical CSI, and an alternating optimization (AO) algorithm is customized to solve it efficiently. Additionally, a user grouping method based on channel correlation and an antenna selection algorithm are proposed to further improve the system performance. Simulation results demonstrate the effectiveness of the proposed SIM-based LEO satellite system design and statistical CSI-based AO algorithm.
△ Less
Submitted 9 March, 2024;
originally announced March 2024.
-
Channel Estimation for Stacked Intelligent Metasurface-Assisted Wireless Networks
Authors:
Xianghao Yao,
Jiancheng An,
Lu Gan,
Marco Di Renzo,
Chau Yuen
Abstract:
Emerging technologies, such as holographic multiple-input multiple-output (HMIMO) and stacked intelligent metasurface (SIM), are driving the development of wireless communication systems. Specifically, the SIM is physically constructed by stacking multiple layers of metasurfaces and has an architecture similar to an artificial neural network (ANN), which can flexibly manipulate the electromagnetic…
▽ More
Emerging technologies, such as holographic multiple-input multiple-output (HMIMO) and stacked intelligent metasurface (SIM), are driving the development of wireless communication systems. Specifically, the SIM is physically constructed by stacking multiple layers of metasurfaces and has an architecture similar to an artificial neural network (ANN), which can flexibly manipulate the electromagnetic waves that propagate through it at the speed of light. This architecture enables the SIM to achieve HMIMO precoding and combining in the wave domain, thus significantly reducing the hardware cost and energy consumption. In this letter, we investigate the channel estimation problem in SIM-assisted multi-user HMIMO communication systems. Since the number of antennas at the base station (BS) is much smaller than the number of meta-atoms per layer of the SIM, it is challenging to acquire the channel state information (CSI) in SIM-assisted multi-user systems. To address this issue, we collect multiple copies of the uplink pilot signals that propagate through the SIM. Furthermore, we leverage the array geometry to identify the subspace that spans arbitrary spatial correlation matrices. Based on partial CSI about the channel statistics, a pair of subspace-based channel estimators are proposed. Additionally, we compute the mean square error (MSE) of the proposed channel estimators and optimize the phase shifts of the SIM to minimize the MSE. Numerical results are illustrated to analyze the effectiveness of the proposed channel estimation schemes.
△ Less
Submitted 9 March, 2024;
originally announced March 2024.
-
Benchmarking zero-shot stance detection with FlanT5-XXL: Insights from training data, prompting, and decoding strategies into its near-SoTA performance
Authors:
Rachith Aiyappa,
Shruthi Senthilmani,
Jisun An,
Haewoon Kwak,
Yong-Yeol Ahn
Abstract:
We investigate the performance of LLM-based zero-shot stance detection on tweets. Using FlanT5-XXL, an instruction-tuned open-source LLM, with the SemEval 2016 Tasks 6A, 6B, and P-Stance datasets, we study the performance and its variations under different prompts and decoding strategies, as well as the potential biases of the model. We show that the zero-shot approach can match or outperform stat…
▽ More
We investigate the performance of LLM-based zero-shot stance detection on tweets. Using FlanT5-XXL, an instruction-tuned open-source LLM, with the SemEval 2016 Tasks 6A, 6B, and P-Stance datasets, we study the performance and its variations under different prompts and decoding strategies, as well as the potential biases of the model. We show that the zero-shot approach can match or outperform state-of-the-art benchmarks, including fine-tuned models. We provide various insights into its performance including the sensitivity to instructions and prompts, the decoding strategies, the perplexity of the prompts, and to negations and oppositions present in prompts. Finally, we ensure that the LLM has not been trained on test datasets, and identify a positivity bias which may partially explain the performance differences across decoding strategie
△ Less
Submitted 29 February, 2024;
originally announced March 2024.
-
Achievable Rate Optimization for Stacked Intelligent Metasurface-Assisted Holographic MIMO Communications
Authors:
Anastasios Papazafeiropoulos,
Jiancheng An,
Pandelis Kourtessis,
Tharmalingam Ratnarajah,
Symeon Chatzinotas
Abstract:
Stacked intelligent metasurfaces (SIM) is a revolutionary technology, which can outperform its single-layer counterparts by performing advanced signal processing relying on wave propagation. In this work, we exploit SIM to enable transmit precoding and receiver combining in holographic multiple-input multiple-output (HMIMO) communications, and we study the achievable rate by formulating a joint op…
▽ More
Stacked intelligent metasurfaces (SIM) is a revolutionary technology, which can outperform its single-layer counterparts by performing advanced signal processing relying on wave propagation. In this work, we exploit SIM to enable transmit precoding and receiver combining in holographic multiple-input multiple-output (HMIMO) communications, and we study the achievable rate by formulating a joint optimization problem of the SIM phase shifts at both sides of the transceiver and the covariance matrix of the transmitted signal. Notably, we propose its solution by means of an iterative optimization algorithm that relies on the projected gradient method, and accounts for all optimization parameters simultaneously. We also obtain the step size guaranteeing the convergence of the proposed algorithm. Simulation results provide fundamental insights such the performance improvements compared to the single-RIS counterpart and conventional MIMO system. Remarkably, the proposed algorithm results in the same achievable rate as the alternating optimization (AO) benchmark but with a less number of iterations.
△ Less
Submitted 8 May, 2024; v1 submitted 26 February, 2024;
originally announced February 2024.
-
From Text to CQL: Bridging Natural Language and Corpus Search Engine
Authors:
Luming Lu,
Jiyuan An,
Yujie Wang,
Liner yang,
Cunliang Kong,
Zhenghao Liu,
Shuo Wang,
Haozhe Lin,
Mingwei Fang,
Yaping Huang,
Erhong Yang
Abstract:
Natural Language Processing (NLP) technologies have revolutionized the way we interact with information systems, with a significant focus on converting natural language queries into formal query languages such as SQL. However, less emphasis has been placed on the Corpus Query Language (CQL), a critical tool for linguistic research and detailed analysis within text corpora. The manual construction…
▽ More
Natural Language Processing (NLP) technologies have revolutionized the way we interact with information systems, with a significant focus on converting natural language queries into formal query languages such as SQL. However, less emphasis has been placed on the Corpus Query Language (CQL), a critical tool for linguistic research and detailed analysis within text corpora. The manual construction of CQL queries is a complex and time-intensive task that requires a great deal of expertise, which presents a notable challenge for both researchers and practitioners. This paper presents the first text-to-CQL task that aims to automate the translation of natural language into CQL. We present a comprehensive framework for this task, including a specifically curated large-scale dataset and methodologies leveraging large language models (LLMs) for effective text-to-CQL task. In addition, we established advanced evaluation metrics to assess the syntactic and semantic accuracy of the generated queries. We created innovative LLM-based conversion approaches and detailed experiments. The results demonstrate the efficacy of our methods and provide insights into the complexities of text-to-CQL task.
△ Less
Submitted 21 February, 2024;
originally announced February 2024.
-
Token-Ensemble Text Generation: On Attacking the Automatic AI-Generated Text Detection
Authors:
Fan Huang,
Haewoon Kwak,
Jisun An
Abstract:
The robustness of AI-content detection models against cultivated attacks (e.g., paraphrasing or word switching) remains a significant concern. This study proposes a novel token-ensemble generation strategy to challenge the robustness of current AI-content detection approaches. We explore the ensemble attack strategy by completing the prompt with the next token generated from random candidate LLMs.…
▽ More
The robustness of AI-content detection models against cultivated attacks (e.g., paraphrasing or word switching) remains a significant concern. This study proposes a novel token-ensemble generation strategy to challenge the robustness of current AI-content detection approaches. We explore the ensemble attack strategy by completing the prompt with the next token generated from random candidate LLMs. We find the token-ensemble approach significantly drops the performance of AI-content detection models (The code and test sets will be released). Our findings reveal that token-ensemble generation poses a vital challenge to current detection models and underlines the need for advancing detection technologies to counter sophisticated adversarial strategies.
△ Less
Submitted 16 February, 2024;
originally announced February 2024.
-
Two-Dimensional Direction-of-Arrival Estimation Using Stacked Intelligent Metasurfaces
Authors:
Jiancheng An,
Chau Yuen,
Yong Liang Guan,
Marco Di Renzo,
Mérouane Debbah,
H. Vincent Poor,
Lajos Hanzo
Abstract:
Stacked intelligent metasurfaces (SIM) are capable of emulating reconfigurable physical neural networks by relying on electromagnetic (EM) waves as carriers. They can also perform various complex computational and signal processing tasks. A SIM is fabricated by densely integrating multiple metasurface layers, each consisting of a large number of small meta-atoms that can control the EM waves passi…
▽ More
Stacked intelligent metasurfaces (SIM) are capable of emulating reconfigurable physical neural networks by relying on electromagnetic (EM) waves as carriers. They can also perform various complex computational and signal processing tasks. A SIM is fabricated by densely integrating multiple metasurface layers, each consisting of a large number of small meta-atoms that can control the EM waves passing through it. In this paper, we harness a SIM for two-dimensional (2D) direction-of-arrival (DOA) estimation. In contrast to the conventional designs, an advanced SIM in front of the receiver array automatically carries out the 2D discrete Fourier transform (DFT) as the incident waves propagate through it. As a result, the receiver array directly observes the angular spectrum of the incoming signal. In this context, the DOA estimates can be readily obtained by using probes to detect the energy distribution on the receiver array. This avoids the need for power-thirsty radio frequency (RF) chains. To enable SIM to perform the 2D DFT, we formulate the optimization problem of minimizing the fitting error between the SIM's EM response and the 2D DFT matrix. Furthermore, a gradient descent algorithm is customized for iteratively updating the phase shift of each meta-atom in SIM. To further improve the DOA estimation accuracy, we configure the phase shift pattern in the zeroth layer of the SIM to generate a set of 2D DFT matrices associated with orthogonal spatial frequency bins. Additionally, we analytically evaluate the performance of the proposed SIM-based DOA estimator by deriving a tight upper bound for the mean square error (MSE). Our numerical simulations verify the capability of a well-trained SIM to perform DOA estimation and corroborate our theoretical analysis. It is demonstrated that a SIM having an optical computational speed achieves an MSE of $10^{-4}$ for DOA estimation.
△ Less
Submitted 13 February, 2024;
originally announced February 2024.
-
Prompting Large Language Models for Zero-Shot Clinical Prediction with Structured Longitudinal Electronic Health Record Data
Authors:
Yinghao Zhu,
Zixiang Wang,
Junyi Gao,
Yuning Tong,
Jingkun An,
Weibin Liao,
Ewen M. Harrison,
Liantao Ma,
Chengwei Pan
Abstract:
The inherent complexity of structured longitudinal Electronic Health Records (EHR) data poses a significant challenge when integrated with Large Language Models (LLMs), which are traditionally tailored for natural language processing. Motivated by the urgent need for swift decision-making during new disease outbreaks, where traditional predictive models often fail due to a lack of historical data,…
▽ More
The inherent complexity of structured longitudinal Electronic Health Records (EHR) data poses a significant challenge when integrated with Large Language Models (LLMs), which are traditionally tailored for natural language processing. Motivated by the urgent need for swift decision-making during new disease outbreaks, where traditional predictive models often fail due to a lack of historical data, this research investigates the adaptability of LLMs, like GPT-4, to EHR data. We particularly focus on their zero-shot capabilities, which enable them to make predictions in scenarios in which they haven't been explicitly trained. In response to the longitudinal, sparse, and knowledge-infused nature of EHR data, our prompting approach involves taking into account specific EHR characteristics such as units and reference ranges, and employing an in-context learning strategy that aligns with clinical contexts. Our comprehensive experiments on the MIMIC-IV and TJH datasets demonstrate that with our elaborately designed prompting framework, LLMs can improve prediction performance in key tasks such as mortality, length-of-stay, and 30-day readmission by about 35\%, surpassing ML models in few-shot settings. Our research underscores the potential of LLMs in enhancing clinical decision-making, especially in urgent healthcare situations like the outbreak of emerging diseases with no labeled data. The code is publicly available at https://github.com/yhzhu99/llm4healthcare for reproducibility.
△ Less
Submitted 10 February, 2024; v1 submitted 25 January, 2024;
originally announced February 2024.
-
Massive Unsourced Random Access for Near-Field Communications
Authors:
Xinyu Xie,
Yongpeng Wu,
Jianping An,
Derrick Wing Kwan Ng,
Chengwen Xing,
Wenjun Zhang
Abstract:
This paper investigates the unsourced random access (URA) problem with a massive multiple-input multiple-output receiver that serves wireless devices in the near-field of radiation. We employ an uncoupled transmission protocol without appending redundancies to the slot-wise encoded messages. To exploit the channel sparsity for block length reduction while facing the collapsed sparse structure in t…
▽ More
This paper investigates the unsourced random access (URA) problem with a massive multiple-input multiple-output receiver that serves wireless devices in the near-field of radiation. We employ an uncoupled transmission protocol without appending redundancies to the slot-wise encoded messages. To exploit the channel sparsity for block length reduction while facing the collapsed sparse structure in the angular domain of near-field channels, we propose a sparse channel sampling method that divides the angle-distance (polar) domain based on the maximum permissible coherence. Decoding starts with retrieving active codewords and channels from each slot. We address the issue by leveraging the structured channel sparsity in the spatial and polar domains and propose a novel turbo-based recovery algorithm. Furthermore, we investigate an off-grid compressed sensing method to refine discretely estimated channel parameters over the continuum that improves the detection performance. Afterward, without the assistance of redundancies, we recouple the separated messages according to the similarity of the users' channel information and propose a modified K-medoids method to handle the constraints and collisions involved in channel clustering. Simulations reveal that via exploiting the channel sparsity, the proposed URA scheme achieves high spectral efficiency and surpasses existing multi-slot-based schemes. Moreover, with more measurements provided by the overcomplete channel sampling, the near-field-suited scheme outperforms its counterpart of the far-field.
△ Less
Submitted 25 January, 2024;
originally announced January 2024.
-
BinaryAI: Binary Software Composition Analysis via Intelligent Binary Source Code Matching
Authors:
Ling Jiang,
Junwen An,
Huihui Huang,
Qiyi Tang,
Sen Nie,
Shi Wu,
Yuqun Zhang
Abstract:
While third-party libraries are extensively reused to enhance productivity during software development, they can also introduce potential security risks such as vulnerability propagation. Software composition analysis, proposed to identify reused TPLs for reducing such risks, has become an essential procedure within modern DevSecOps. As one of the mainstream SCA techniques, binary-to-source SCA id…
▽ More
While third-party libraries are extensively reused to enhance productivity during software development, they can also introduce potential security risks such as vulnerability propagation. Software composition analysis, proposed to identify reused TPLs for reducing such risks, has become an essential procedure within modern DevSecOps. As one of the mainstream SCA techniques, binary-to-source SCA identifies the third-party source projects contained in binary files via binary source code matching, which is a major challenge in reverse engineering since binary and source code exhibit substantial disparities after compilation. The existing binary-to-source SCA techniques leverage basic syntactic features that suffer from redundancy and lack robustness in the large-scale TPL dataset, leading to inevitable false positives and compromised recall. To mitigate these limitations, we introduce BinaryAI, a novel binary-to-source SCA technique with two-phase binary source code matching to capture both syntactic and semantic code features. First, BinaryAI trains a transformer-based model to produce function-level embeddings and obtain similar source functions for each binary function accordingly. Then by applying the link-time locality to facilitate function matching, BinaryAI detects the reused TPLs based on the ratio of matched source functions. Our experimental results demonstrate the superior performance of BinaryAI in terms of binary source code matching and the downstream SCA task. Specifically, our embedding model outperforms the state-of-the-art model CodeCMR, i.e., achieving 22.54% recall@1 and 0.34 MRR compared with 10.75% and 0.17 respectively. Additionally, BinaryAI outperforms all existing binary-to-source SCA tools in TPL detection, increasing the precision from 73.36% to 85.84% and recall from 59.81% to 64.98% compared with the well-recognized commercial SCA product Black Duck.
△ Less
Submitted 23 January, 2024; v1 submitted 20 January, 2024;
originally announced January 2024.
-
Dynamic Routing for Integrated Satellite-Terrestrial Networks: A Constrained Multi-Agent Reinforcement Learning Approach
Authors:
Yifeng Lyu,
Han Hu,
Rongfei Fan,
Zhi Liu,
Jianping An,
Shiwen Mao
Abstract:
The integrated satellite-terrestrial network (ISTN) system has experienced significant growth, offering seamless communication services in remote areas with limited terrestrial infrastructure. However, designing a routing scheme for ISTN is exceedingly difficult, primarily due to the heightened complexity resulting from the inclusion of additional ground stations, along with the requirement to sat…
▽ More
The integrated satellite-terrestrial network (ISTN) system has experienced significant growth, offering seamless communication services in remote areas with limited terrestrial infrastructure. However, designing a routing scheme for ISTN is exceedingly difficult, primarily due to the heightened complexity resulting from the inclusion of additional ground stations, along with the requirement to satisfy various constraints related to satellite service quality. To address these challenges, we study packet routing with ground stations and satellites working jointly to transmit packets, while prioritizing fast communication and meeting energy efficiency and packet loss requirements. Specifically, we formulate the problem of packet routing with constraints as a max-min problem using the Lagrange method. Then we propose a novel constrained Multi-Agent reinforcement learning (MARL) dynamic routing algorithm named CMADR, which efficiently balances objective improvement and constraint satisfaction during the updating of policy and Lagrange multipliers. Finally, we conduct extensive experiments and an ablation study using the OneWeb and Telesat mega-constellations. Results demonstrate that CMADR reduces the packet delay by a minimum of 21% and 15%, while meeting stringent energy consumption and packet loss rate constraints, outperforming several baseline algorithms.
△ Less
Submitted 22 December, 2023;
originally announced January 2024.
-
Bring Metric Functions into Diffusion Models
Authors:
Jie An,
Zhengyuan Yang,
Jianfeng Wang,
Linjie Li,
Zicheng Liu,
Lijuan Wang,
Jiebo Luo
Abstract:
We introduce a Cascaded Diffusion Model (Cas-DM) that improves a Denoising Diffusion Probabilistic Model (DDPM) by effectively incorporating additional metric functions in training. Metric functions such as the LPIPS loss have been proven highly effective in consistency models derived from the score matching. However, for the diffusion counterparts, the methodology and efficacy of adding extra met…
▽ More
We introduce a Cascaded Diffusion Model (Cas-DM) that improves a Denoising Diffusion Probabilistic Model (DDPM) by effectively incorporating additional metric functions in training. Metric functions such as the LPIPS loss have been proven highly effective in consistency models derived from the score matching. However, for the diffusion counterparts, the methodology and efficacy of adding extra metric functions remain unclear. One major challenge is the mismatch between the noise predicted by a DDPM at each step and the desired clean image that the metric function works well on. To address this problem, we propose Cas-DM, a network architecture that cascades two network modules to effectively apply metric functions to the diffusion model training. The first module, similar to a standard DDPM, learns to predict the added noise and is unaffected by the metric function. The second cascaded module learns to predict the clean image, thereby facilitating the metric function computation. Experiment results show that the proposed diffusion model backbone enables the effective use of the LPIPS loss, leading to state-of-the-art image quality (FID, sFID, IS) on various established benchmarks.
△ Less
Submitted 4 January, 2024;
originally announced January 2024.
-
Video Understanding with Large Language Models: A Survey
Authors:
Yunlong Tang,
Jing Bi,
Siting Xu,
Luchuan Song,
Susan Liang,
Teng Wang,
Daoan Zhang,
Jie An,
Jingyang Lin,
Rongyi Zhu,
Ali Vosoughi,
Chao Huang,
Zeliang Zhang,
Feng Zheng,
Jianguo Zhang,
Ping Luo,
Jiebo Luo,
Chenliang Xu
Abstract:
With the burgeoning growth of online video platforms and the escalating volume of video content, the demand for proficient video understanding tools has intensified markedly. Given the remarkable capabilities of Large Language Models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of the recent advancements in video understanding harnessing the power of LLMs (Vid-…
▽ More
With the burgeoning growth of online video platforms and the escalating volume of video content, the demand for proficient video understanding tools has intensified markedly. Given the remarkable capabilities of Large Language Models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of the recent advancements in video understanding harnessing the power of LLMs (Vid-LLMs). The emergent capabilities of Vid-LLMs are surprisingly advanced, particularly their ability for open-ended spatial-temporal reasoning combined with commonsense knowledge, suggesting a promising path for future video understanding. We examine the unique characteristics and capabilities of Vid-LLMs, categorizing the approaches into four main types: LLM-based Video Agents, Vid-LLMs Pretraining, Vid-LLMs Instruction Tuning, and Hybrid Methods. Furthermore, this survey presents a comprehensive study of the tasks, datasets, and evaluation methodologies for Vid-LLMs. Additionally, it explores the expansive applications of Vid-LLMs across various domains, highlighting their remarkable scalability and versatility in real-world video understanding challenges. Finally, it summarizes the limitations of existing Vid-LLMs and outlines directions for future research. For more information, readers are recommended to visit the repository at https://github.com/yunlong10/Awesome-LLMs-for-Video-Understanding.
△ Less
Submitted 3 January, 2024; v1 submitted 28 December, 2023;
originally announced December 2023.
-
Improving Real Estate Appraisal with POI Integration and Areal Embedding
Authors:
Sumin Han,
Youngjun Park,
Sonia Sabir,
Jisun An,
Dongman Lee
Abstract:
Despite advancements in real estate appraisal methods, this study primarily focuses on two pivotal challenges. Firstly, we explore the often-underestimated impact of Points of Interest (POI) on property values, emphasizing the necessity for a comprehensive, data-driven approach to feature selection. Secondly, we integrate road-network-based Areal Embedding to enhance spatial understanding for real…
▽ More
Despite advancements in real estate appraisal methods, this study primarily focuses on two pivotal challenges. Firstly, we explore the often-underestimated impact of Points of Interest (POI) on property values, emphasizing the necessity for a comprehensive, data-driven approach to feature selection. Secondly, we integrate road-network-based Areal Embedding to enhance spatial understanding for real estate appraisal. We first propose a revised method for POI feature extraction, and discuss the impact of each POI for house price appraisal. Then we present the Areal embedding-enabled Masked Multihead Attention-based Spatial Interpolation for House Price Prediction (AMMASI) model, an improvement upon the existing ASI model, which leverages masked multi-head attention on geographic neighbor houses and similar-featured houses. Our model outperforms current baselines and also offers promising avenues for future optimization in real estate appraisal methodologies.
△ Less
Submitted 20 November, 2023;
originally announced November 2023.
-
Stacked Intelligent Metasurface-Aided MIMO Transceiver Design
Authors:
Jiancheng An,
Chau Yuen,
Chao Xu,
Hongbin Li,
Derrick Wing Kwan Ng,
Marco Di Renzo,
Mérouane Debbah,
Lajos Hanzo
Abstract:
Next-generation wireless networks are expected to utilize the limited radio frequency (RF) resources more efficiently with the aid of intelligent transceivers. To this end, we propose a promising transceiver architecture relying on stacked intelligent metasurfaces (SIM). An SIM is constructed by stacking an array of programmable metasurface layers, where each layer consists of a massive number of…
▽ More
Next-generation wireless networks are expected to utilize the limited radio frequency (RF) resources more efficiently with the aid of intelligent transceivers. To this end, we propose a promising transceiver architecture relying on stacked intelligent metasurfaces (SIM). An SIM is constructed by stacking an array of programmable metasurface layers, where each layer consists of a massive number of low-cost passive meta-atoms that individually manipulate the electromagnetic (EM) waves. By appropriately configuring the passive meta-atoms, an SIM is capable of accomplishing advanced computation and signal processing tasks, such as multiple-input multiple-output (MIMO) precoding/combining, multi-user interference mitigation, and radar sensing, as the EM wave propagates through the multiple layers of the metasurface, which effectively reduces both the RF-related energy consumption and processing delay. Inspired by this, we provide an overview of the SIM-aided MIMO transceiver design, which encompasses its hardware architecture and its potential benefits over state-of-the-art solutions. Furthermore, we discuss promising application scenarios and identify the open research challenges associated with the design of advanced SIM architectures for next-generation wireless networks. Finally, numerical results are provided for quantifying the benefits of wave-based signal processing in wireless systems.
△ Less
Submitted 16 November, 2023;
originally announced November 2023.
-
Mixture-of-Experts for Open Set Domain Adaptation: A Dual-Space Detection Approach
Authors:
Zhenbang Du,
Jiayu An,
Yunlu Tu,
Jiahao Hong,
Dongrui Wu
Abstract:
Open Set Domain Adaptation (OSDA) aims to cope with the distribution and label shifts between the source and target domains simultaneously, performing accurate classification for known classes while identifying unknown class samples in the target domain. Most existing OSDA approaches, depending on the final image feature space of deep models, require manually-tuned thresholds, and may easily miscl…
▽ More
Open Set Domain Adaptation (OSDA) aims to cope with the distribution and label shifts between the source and target domains simultaneously, performing accurate classification for known classes while identifying unknown class samples in the target domain. Most existing OSDA approaches, depending on the final image feature space of deep models, require manually-tuned thresholds, and may easily misclassify unknown samples as known classes. Mixture-of-Experts (MoE) could be a remedy. Within a MoE, different experts handle distinct input features, producing unique expert routing patterns for various classes in a routing feature space. As a result, unknown class samples may display different expert routing patterns to known classes. In this paper, we propose Dual-Space Detection, which exploits the inconsistencies between the image feature space and the routing feature space to detect unknown class samples without any threshold. Graph Router is further introduced to better make use of the spatial information among image patches. Experiments on three different datasets validated the effectiveness and superiority of our approach.
△ Less
Submitted 3 July, 2024; v1 submitted 1 November, 2023;
originally announced November 2023.
-
Stacked Intelligent Metasurface Performs a 2D DFT in the Wave Domain for DOA Estimation
Authors:
Jiancheng An,
Chau Yuen,
Marco Di Renzo,
Merouane Debbah,
H. Vincent Poor,
Lajos Hanzo
Abstract:
Staked intelligent metasurface (SIM) based techniques are developed to perform two-dimensional (2D) direction-of-arrival (DOA) estimation. In contrast to the conventional designs, an advanced SIM in front of the receiving array automatically performs the 2D discrete Fourier transform (DFT) as the incident waves propagate through it. To arrange for the SIM to carry out this task, we design a gradie…
▽ More
Staked intelligent metasurface (SIM) based techniques are developed to perform two-dimensional (2D) direction-of-arrival (DOA) estimation. In contrast to the conventional designs, an advanced SIM in front of the receiving array automatically performs the 2D discrete Fourier transform (DFT) as the incident waves propagate through it. To arrange for the SIM to carry out this task, we design a gradient descent algorithm for iteratively updating the phase shift of each meta-atom in the SIM to minimize the fitting error between the SIM's response and the 2D DFT matrix. To further improve the DOA estimation accuracy, we configure the phase shifts in the input layer of SIM to generate a set of 2D DFT matrices having orthogonal spatial frequency bins. Extensive numerical simulations verify the capability of a well-trained SIM to perform 2D DFT. Specifically, it is demonstrated that the SIM having an optical computational speed achieves an MSE of $10^{-4}$ in 2D DOA estimation.
△ Less
Submitted 15 October, 2023;
originally announced October 2023.
-
Enhancing Stance Classification with Quantified Moral Foundations
Authors:
Hong Zhang,
Prasanta Bhattacharya,
Wei Gao,
Liang Ze Wong,
Brandon Siyuan Loh,
Joseph J. P. Simons,
Jisun An
Abstract:
This study enhances stance detection on social media by incorporating deeper psychological attributes, specifically individuals' moral foundations. These theoretically-derived dimensions aim to provide a comprehensive profile of an individual's moral concerns which, in recent work, has been linked to behaviour in a range of domains, including society, politics, health, and the environment. In this…
▽ More
This study enhances stance detection on social media by incorporating deeper psychological attributes, specifically individuals' moral foundations. These theoretically-derived dimensions aim to provide a comprehensive profile of an individual's moral concerns which, in recent work, has been linked to behaviour in a range of domains, including society, politics, health, and the environment. In this paper, we investigate how moral foundation dimensions can contribute to predicting an individual's stance on a given target. Specifically we incorporate moral foundation features extracted from text, along with message semantic features, to classify stances at both message- and user-levels across a range of targets and models. Our preliminary results suggest that encoding moral foundations can enhance the performance of stance detection tasks and help illuminate the associations between specific moral foundations and online stances on target topics. The results highlight the importance of considering deeper psychological attributes in stance analysis and underscores the role of moral foundations in guiding online social behavior.
△ Less
Submitted 15 October, 2023;
originally announced October 2023.
-
OpenLEAF: Open-Domain Interleaved Image-Text Generation and Evaluation
Authors:
Jie An,
Zhengyuan Yang,
Linjie Li,
Jianfeng Wang,
Kevin Lin,
Zicheng Liu,
Lijuan Wang,
Jiebo Luo
Abstract:
This work investigates a challenging task named open-domain interleaved image-text generation, which generates interleaved texts and images following an input query. We propose a new interleaved generation framework based on prompting large-language models (LLMs) and pre-trained text-to-image (T2I) models, namely OpenLEAF. In OpenLEAF, the LLM generates textual descriptions, coordinates T2I models…
▽ More
This work investigates a challenging task named open-domain interleaved image-text generation, which generates interleaved texts and images following an input query. We propose a new interleaved generation framework based on prompting large-language models (LLMs) and pre-trained text-to-image (T2I) models, namely OpenLEAF. In OpenLEAF, the LLM generates textual descriptions, coordinates T2I models, creates visual prompts for generating images, and incorporates global contexts into the T2I models. This global context improves the entity and style consistencies of images in the interleaved generation. For model assessment, we first propose to use large multi-modal models (LMMs) to evaluate the entity and style consistencies of open-domain interleaved image-text sequences. According to the LMM evaluation on our constructed evaluation set, the proposed interleaved generation framework can generate high-quality image-text content for various domains and applications, such as how-to question answering, storytelling, graphical story rewriting, and webpage/poster generation tasks. Moreover, we validate the effectiveness of the proposed LMM evaluation technique with human assessment. We hope our proposed framework, benchmark, and LMM evaluation could help establish the intriguing interleaved image-text generation task.
△ Less
Submitted 3 November, 2023; v1 submitted 11 October, 2023;
originally announced October 2023.
-
Hybrid Digital-Wave Domain Channel Estimator for Stacked Intelligent Metasurface Enabled Multi-User MISO Systems
Authors:
Qurrat-Ul-Ain Nadeem,
Jiancheng An,
Anas Chaaban
Abstract:
Stacked intelligent metasurface (SIM) is an emerging programmable metasurface architecture that can implement signal processing directly in the electromagnetic wave domain, thereby enabling efficient implementation of ultra-massive multiple-input multiple-output (MIMO) transceivers with a limited number of radio frequency (RF) chains. Channel estimation (CE) is challenging for SIM-enabled communic…
▽ More
Stacked intelligent metasurface (SIM) is an emerging programmable metasurface architecture that can implement signal processing directly in the electromagnetic wave domain, thereby enabling efficient implementation of ultra-massive multiple-input multiple-output (MIMO) transceivers with a limited number of radio frequency (RF) chains. Channel estimation (CE) is challenging for SIM-enabled communication systems due to the multi-layer architecture of SIM, and because we need to estimate large dimensional channels between the SIM and users with a limited number of RF chains. To efficiently solve this problem, we develop a novel hybrid digital-wave domain channel estimator, in which the received training symbols are first processed in the wave domain within the SIM layers, and then processed in the digital domain. The wave domain channel estimator, parametrized by the phase shifts applied by the meta-atoms in all layers, is optimized to minimize the mean squared error (MSE) using a gradient descent algorithm, within which the digital part is optimally updated. For an SIM-enabled multi-user system equipped with 4 RF chains and a 6-layer SIM with 64 meta-atoms each, the proposed estimator yields an MSE that is very close to that achieved by fully digital CE in a massive MIMO system employing 64 RF chains. This high CE accuracy is achieved at the cost of a training overhead that can be reduced by exploiting the potential low rank of channel correlation matrices.
△ Less
Submitted 28 September, 2023;
originally announced September 2023.
-
Toward Beamfocusing-Aided Near-Field Communications: Research Advances, Potential, and Challenges
Authors:
Jiancheng An,
Chau Yuen,
Linglong Dai,
Marco Di Renzo,
Merouane Debbah,
Lajos Hanzo
Abstract:
Next-generation mobile networks promise to support high throughput, massive connectivity, and improved energy efficiency. To achieve these ambitious goals, extremely large-scale antenna arrays (ELAAs) and terahertz communications constitute a pair of promising technologies. This will result in future wireless communications occurring in the near-field regions. To accurately portray the channel cha…
▽ More
Next-generation mobile networks promise to support high throughput, massive connectivity, and improved energy efficiency. To achieve these ambitious goals, extremely large-scale antenna arrays (ELAAs) and terahertz communications constitute a pair of promising technologies. This will result in future wireless communications occurring in the near-field regions. To accurately portray the channel characteristics of near-field wireless propagation, spherical wavefront-based models are required and present both opportunities as well as challenges. Following the basics of near-field communications (NFC), we contrast it to conventional far-field communications. Moreover, we cover the key challenges of NFC, including its channel modeling and estimation, near-field beamfocusing, as well as hardware design. Our numerical results demonstrate the potential of NFC in improving the spatial multiplexing gain and positioning accuracy. Finally, a suite of open issues are identified for motivating future research.
△ Less
Submitted 17 September, 2023;
originally announced September 2023.
-
DeViT: Decomposing Vision Transformers for Collaborative Inference in Edge Devices
Authors:
Guanyu Xu,
Zhiwei Hao,
Yong Luo,
Han Hu,
Jianping An,
Shiwen Mao
Abstract:
Recent years have witnessed the great success of vision transformer (ViT), which has achieved state-of-the-art performance on multiple computer vision benchmarks. However, ViT models suffer from vast amounts of parameters and high computation cost, leading to difficult deployment on resource-constrained edge devices. Existing solutions mostly compress ViT models to a compact model but still cannot…
▽ More
Recent years have witnessed the great success of vision transformer (ViT), which has achieved state-of-the-art performance on multiple computer vision benchmarks. However, ViT models suffer from vast amounts of parameters and high computation cost, leading to difficult deployment on resource-constrained edge devices. Existing solutions mostly compress ViT models to a compact model but still cannot achieve real-time inference. To tackle this issue, we propose to explore the divisibility of transformer structure, and decompose the large ViT into multiple small models for collaborative inference at edge devices. Our objective is to achieve fast and energy-efficient collaborative inference while maintaining comparable accuracy compared with large ViTs. To this end, we first propose a collaborative inference framework termed DeViT to facilitate edge deployment by decomposing large ViTs. Subsequently, we design a decomposition-and-ensemble algorithm based on knowledge distillation, termed DEKD, to fuse multiple small decomposed models while dramatically reducing communication overheads, and handle heterogeneous models by developing a feature matching module to promote the imitations of decomposed models from the large ViT. Extensive experiments for three representative ViT backbones on four widely-used datasets demonstrate our method achieves efficient collaborative inference for ViTs and outperforms existing lightweight ViTs, striking a good trade-off between efficiency and accuracy. For example, our DeViTs improves end-to-end latency by 2.89$\times$ with only 1.65% accuracy sacrifice using CIFAR-100 compared to the large ViT, ViT-L/16, on the GPU server. DeDeiTs surpasses the recent efficient ViT, MobileViT-S, by 3.54% in accuracy on ImageNet-1K, while running 1.72$\times$ faster and requiring 55.28% lower energy consumption on the edge device.
△ Less
Submitted 10 September, 2023;
originally announced September 2023.
-
Stacked Intelligent Metasurfaces for Multiuser Downlink Beamforming in the Wave Domain
Authors:
Jiancheng An,
Marco Di Renzo,
Merouane Debbah,
H. Vincent Poor,
Chau Yuen
Abstract:
Intelligent metasurface has recently emerged as a promising technology that enables the customization of wireless environments by harnessing large numbers of inexpensive configurable scattering elements. However, prior studies have predominantly focused on single-layer metasurfaces, which have limitations in terms of the number of beam patterns they can steer accurately due to practical hardware r…
▽ More
Intelligent metasurface has recently emerged as a promising technology that enables the customization of wireless environments by harnessing large numbers of inexpensive configurable scattering elements. However, prior studies have predominantly focused on single-layer metasurfaces, which have limitations in terms of the number of beam patterns they can steer accurately due to practical hardware restrictions. In contrast, this paper introduces a novel stacked intelligent metasurface (SIM) design. Specifically, we investigate the integration of SIM into the downlink of a multiuser multiple-input single-output (MISO) communication system, where a SIM, consisting of a multilayer metasurface structure, is deployed at the base station (BS) to facilitate transmit beamforming in the electromagnetic wave domain. This eliminates the need for conventional digital beamforming and high-resolution digital-to-analog converters at the BS. To this end, we formulate an optimization problem that aims to maximize the sum rate of all user equipments by jointly optimizing the transmit power allocation at the BS and the wave-based beamforming at the SIM, subject to both the transmit power budget and discrete phase shift constraints. Furthermore, we propose a computationally efficient algorithm for solving this joint optimization problem and elaborate on the potential benefits of employing SIM in wireless networks. Finally, the numerical results corroborate the effectiveness of the proposed SIM-enabled wave-based beamforming design and evaluate the performance improvement achieved by the proposed algorithm compared to various benchmark schemes. It is demonstrated that considering the same number of transmit antennas, the proposed SIM-based system achieves about 200\% improvement in terms of sum rate compared to conventional MISO systems.
△ Less
Submitted 5 September, 2023;
originally announced September 2023.
-
Pilot Power Allocation for Channel Estimation in a Multi-RIS Aided Communication System
Authors:
Jiancheng An,
Chau Yuen
Abstract:
Reconfigurable intelligent surface (RIS) is a promising technology that enables the customization of electromagnetic propagation environments in next-generation wireless networks. In this paper, we investigate the optimal pilot power allocation during the channel estimation stage to improve the ergodic channel gain of RIS-assisted systems under practical imperfect channel state information (CSI).…
▽ More
Reconfigurable intelligent surface (RIS) is a promising technology that enables the customization of electromagnetic propagation environments in next-generation wireless networks. In this paper, we investigate the optimal pilot power allocation during the channel estimation stage to improve the ergodic channel gain of RIS-assisted systems under practical imperfect channel state information (CSI). Specifically, we commence by deriving an explicit closed-form expression of the ergodic channel gain of a multi-RIS-aided communication system that takes into account channel estimation errors. Then, we formulate the pilot power allocation problem to maximize the ergodic channel gain under imperfect CSI, subject to the average pilot power constraint. Then, the method of Lagrange multipliers is invoked to obtain the optimal pilot power allocation solution, which indicates that allocating more power to the pilots for estimating the weak reflection channels is capable of effectively improving the ergodic channel gain under imperfect CSI. Finally, extensive simulation results corroborate our theoretical analysis.
△ Less
Submitted 27 August, 2023;
originally announced August 2023.
-
Enhancing Spatiotemporal Traffic Prediction through Urban Human Activity Analysis
Authors:
Sumin Han,
Youngjun Park,
Minji Lee,
Jisun An,
Dongman Lee
Abstract:
Traffic prediction is one of the key elements to ensure the safety and convenience of citizens. Existing traffic prediction models primarily focus on deep learning architectures to capture spatial and temporal correlation. They often overlook the underlying nature of traffic. Specifically, the sensor networks in most traffic datasets do not accurately represent the actual road network exploited by…
▽ More
Traffic prediction is one of the key elements to ensure the safety and convenience of citizens. Existing traffic prediction models primarily focus on deep learning architectures to capture spatial and temporal correlation. They often overlook the underlying nature of traffic. Specifically, the sensor networks in most traffic datasets do not accurately represent the actual road network exploited by vehicles, failing to provide insights into the traffic patterns in urban activities. To overcome these limitations, we propose an improved traffic prediction method based on graph convolution deep learning algorithms. We leverage human activity frequency data from National Household Travel Survey to enhance the inference capability of a causal relationship between activity and traffic patterns. Despite making minimal modifications to the conventional graph convolutional recurrent networks and graph convolutional transformer architectures, our approach achieves state-of-the-art performance without introducing excessive computational overhead.
△ Less
Submitted 20 August, 2023;
originally announced August 2023.
-
DUAW: Data-free Universal Adversarial Watermark against Stable Diffusion Customization
Authors:
Xiaoyu Ye,
Hao Huang,
Jiaqi An,
Yongtao Wang
Abstract:
Stable Diffusion (SD) customization approaches enable users to personalize SD model outputs, greatly enhancing the flexibility and diversity of AI art. However, they also allow individuals to plagiarize specific styles or subjects from copyrighted images, which raises significant concerns about potential copyright infringement. To address this issue, we propose an invisible data-free universal adv…
▽ More
Stable Diffusion (SD) customization approaches enable users to personalize SD model outputs, greatly enhancing the flexibility and diversity of AI art. However, they also allow individuals to plagiarize specific styles or subjects from copyrighted images, which raises significant concerns about potential copyright infringement. To address this issue, we propose an invisible data-free universal adversarial watermark (DUAW), aiming to protect a myriad of copyrighted images from different customization approaches across various versions of SD models. First, DUAW is designed to disrupt the variational autoencoder during SD customization. Second, DUAW operates in a data-free context, where it is trained on synthetic images produced by a Large Language Model (LLM) and a pretrained SD model. This approach circumvents the necessity of directly handling copyrighted images, thereby preserving their confidentiality. Once crafted, DUAW can be imperceptibly integrated into massive copyrighted images, serving as a protective measure by inducing significant distortions in the images generated by customized SD models. Experimental results demonstrate that DUAW can effectively distort the outputs of fine-tuned SD models, rendering them discernible to both human observers and a simple classifier.
△ Less
Submitted 18 August, 2023;
originally announced August 2023.
-
Jurassic World Remake: Bringing Ancient Fossils Back to Life via Zero-Shot Long Image-to-Image Translation
Authors:
Alexander Martin,
Haitian Zheng,
Jie An,
Jiebo Luo
Abstract:
With a strong understanding of the target domain from natural language, we produce promising results in translating across large domain gaps and bringing skeletons back to life. In this work, we use text-guided latent diffusion models for zero-shot image-to-image translation (I2I) across large domain gaps (longI2I), where large amounts of new visual features and new geometry need to be generated t…
▽ More
With a strong understanding of the target domain from natural language, we produce promising results in translating across large domain gaps and bringing skeletons back to life. In this work, we use text-guided latent diffusion models for zero-shot image-to-image translation (I2I) across large domain gaps (longI2I), where large amounts of new visual features and new geometry need to be generated to enter the target domain. Being able to perform translations across large domain gaps has a wide variety of real-world applications in criminology, astrology, environmental conservation, and paleontology. In this work, we introduce a new task Skull2Animal for translating between skulls and living animals. On this task, we find that unguided Generative Adversarial Networks (GANs) are not capable of translating across large domain gaps. Instead of these traditional I2I methods, we explore the use of guided diffusion and image editing models and provide a new benchmark model, Revive-2I, capable of performing zero-shot I2I via text-prompting latent diffusion models. We find that guidance is necessary for longI2I because, to bridge the large domain gap, prior knowledge about the target domain is needed. In addition, we find that prompting provides the best and most scalable information about the target domain as classifier-guided diffusion models require retraining for specific use cases and lack stronger constraints on the target domain because of the wide variety of images they are trained on.
△ Less
Submitted 14 August, 2023;
originally announced August 2023.
-
Domain-Scalable Unpaired Image Translation via Latent Space Anchoring
Authors:
Siyu Huang,
Jie An,
Donglai Wei,
Zudi Lin,
Jiebo Luo,
Hanspeter Pfister
Abstract:
Unpaired image-to-image translation (UNIT) aims to map images between two visual domains without paired training data. However, given a UNIT model trained on certain domains, it is difficult for current methods to incorporate new domains because they often need to train the full model on both existing and new domains. To address this problem, we propose a new domain-scalable UNIT method, termed as…
▽ More
Unpaired image-to-image translation (UNIT) aims to map images between two visual domains without paired training data. However, given a UNIT model trained on certain domains, it is difficult for current methods to incorporate new domains because they often need to train the full model on both existing and new domains. To address this problem, we propose a new domain-scalable UNIT method, termed as latent space anchoring, which can be efficiently extended to new visual domains and does not need to fine-tune encoders and decoders of existing domains. Our method anchors images of different domains to the same latent space of frozen GANs by learning lightweight encoder and regressor models to reconstruct single-domain images. In the inference phase, the learned encoders and decoders of different domains can be arbitrarily combined to translate images between any two domains without fine-tuning. Experiments on various datasets show that the proposed method achieves superior performance on both standard and domain-scalable UNIT tasks in comparison with the state-of-the-art methods.
△ Less
Submitted 26 June, 2023;
originally announced June 2023.
-
Radio Generation Using Generative Adversarial Networks with An Unrolled Design
Authors:
Weidong Wang,
Jiancheng An,
Hongshu Liao,
Lu Gan,
Chau Yuen
Abstract:
As a revolutionary generative paradigm of deep learning, generative adversarial networks (GANs) have been widely applied in various fields to synthesize realistic data. However, it is challenging for conventional GANs to synthesize raw signal data, especially in some complex cases. In this paper, we develop a novel GAN framework for radio generation called "Radio GAN". Compared to conventional met…
▽ More
As a revolutionary generative paradigm of deep learning, generative adversarial networks (GANs) have been widely applied in various fields to synthesize realistic data. However, it is challenging for conventional GANs to synthesize raw signal data, especially in some complex cases. In this paper, we develop a novel GAN framework for radio generation called "Radio GAN". Compared to conventional methods, it benefits from three key improvements. The first is learning based on sampling points, which aims to model an underlying sampling distribution of radio signals. The second is an unrolled generator design, combined with an estimated pure signal distribution as a prior, which can greatly reduce learning difficulty and effectively improve learning precision. Finally, we present an energy-constrained optimization algorithm to achieve better training stability and convergence. Experimental results with extensive simulations demonstrate that our proposed GAN framework can effectively learn transmitter characteristics and various channel effects, thus accurately modeling for an underlying sampling distribution to synthesize radio signals of high quality.
△ Less
Submitted 24 June, 2023;
originally announced June 2023.
-
A Study on Quantifying Sim2Real Image Gap in Autonomous Driving Simulations Using Lane Segmentation Attention Map Similarity
Authors:
Seongjeong Park,
Jinu Pahk,
Lennart Lorenz Freimuth Jahn,
Yongseob Lim,
Jinung An,
Gyeungho Choi
Abstract:
Autonomous driving simulations require highly realistic images. Our preliminary study found that when the CARLA Simulator image was made more like reality by using DCLGAN, the performance of the lane recognition model improved to levels comparable to real-world driving. It was also confirmed that the vehicle's ability to return to the center of the lane after deviating from it improved significant…
▽ More
Autonomous driving simulations require highly realistic images. Our preliminary study found that when the CARLA Simulator image was made more like reality by using DCLGAN, the performance of the lane recognition model improved to levels comparable to real-world driving. It was also confirmed that the vehicle's ability to return to the center of the lane after deviating from it improved significantly. However, there is currently no agreed-upon metric for quantitatively evaluating the realism of simulation images. To address this issue, based on the idea that FID (Fréchet Inception Distance) measures the feature vector distribution distance using a pre-trained model, this paper proposes a metric that measures the similarity of simulation road images using the attention map from the self-attention distillation process of ENet-SAD. Finally, this paper verified the suitability of the measurement method by applying it to the image of the CARLA map that implemented a realworld autonomous driving test road.
△ Less
Submitted 18 June, 2023;
originally announced June 2023.
-
Cheating off your neighbors: Improving activity recognition through corroboration
Authors:
Haoxiang Yu,
Jingyi An,
Evan King,
Edison Thomaz,
Christine Julien
Abstract:
Understanding the complexity of human activities solely through an individual's data can be challenging. However, in many situations, surrounding individuals are likely performing similar activities, while existing human activity recognition approaches focus almost exclusively on individual measurements and largely ignore the context of the activity. Consider two activities: attending a small grou…
▽ More
Understanding the complexity of human activities solely through an individual's data can be challenging. However, in many situations, surrounding individuals are likely performing similar activities, while existing human activity recognition approaches focus almost exclusively on individual measurements and largely ignore the context of the activity. Consider two activities: attending a small group meeting and working at an office desk. From solely an individual's perspective, it can be difficult to differentiate between these activities as they may appear very similar, even though they are markedly different. Yet, by observing others nearby, it can be possible to distinguish between these activities. In this paper, we propose an approach to enhance the prediction accuracy of an individual's activities by incorporating insights from surrounding individuals. We have collected a real-world dataset from 20 participants with over 58 hours of data including activities such as attending lectures, having meetings, working in the office, and eating together. Compared to observing a single person in isolation, our proposed approach significantly improves accuracy. We regard this work as a first step in collaborative activity recognition, opening new possibilities for understanding human activity in group settings.
△ Less
Submitted 27 May, 2023;
originally announced June 2023.