
Energy Efficient Knapsack Optimization Using Probabilistic Memristor Crossbars

Authors: Jinzhan Li, Suhas Kumar, Su-in Yi

Abstract: Constrained optimization underlies crucial societal problems (for instance, stock trading and bandwidth allocation), but is often computationally hard (complexity grows exponentially with problem size). The big-data era urgently demands low-latency and low-energy optimization at the edge, which cannot be handled by digital processors due to their non-parallel von Neumann architecture. Recent efforts using massively parallel hardware (such as memristor crossbars and quantum processors) employing annealing algorithms, while promising, have handled relatively easy and stable problems with sparse or binary representations (such as the max-cut or traveling salesman problems). However, most real-world applications embody three features, which are encoded in the knapsack problem and cannot be handled by annealing algorithms: dense and non-binary representations, with destabilizing self-feedback. Here we demonstrate a post-digital-hardware-friendly randomized competitive Ising-inspired (RaCI) algorithm performing knapsack optimization, experimentally implemented on a foundry-manufactured CMOS-integrated probabilistic analog memristor crossbar. Our solution outperforms digital and quantum approaches by over 4 orders of magnitude in energy efficiency.
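For readers unfamiliar with the problem, the 0/1 knapsack objective can be made concrete with the classical dynamic-programming baseline: choose a subset of items maximizing total value without exceeding a weight capacity. This is a minimal sketch of the problem itself, not of the RaCI algorithm or its memristor implementation; the function name and the item values, weights, and capacity are illustrative.

```python
from typing import List, Tuple


def knapsack_01(values: List[int], weights: List[int], capacity: int) -> Tuple[int, List[int]]:
    """Solve 0/1 knapsack exactly by dynamic programming over capacities.

    Returns the best total value and the (ascending) indices of the chosen items.
    """
    n = len(values)
    # best[i][c] = best value achievable with the first i items and capacity c
    best = [[0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        v, w = values[i - 1], weights[i - 1]
        for c in range(capacity + 1):
            best[i][c] = best[i - 1][c]  # option 1: skip item i-1
            if w <= c and best[i - 1][c - w] + v > best[i][c]:
                best[i][c] = best[i - 1][c - w] + v  # option 2: take item i-1
    # Backtrack: an item was taken wherever the table value changed
    chosen = []
    c = capacity
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= weights[i - 1]
    chosen.reverse()
    return best[n][capacity], chosen


print(knapsack_01([60, 100, 120], [10, 20, 30], 50))  # (220, [1, 2])
```

The table has (n + 1) × (capacity + 1) entries, so this exact method is practical only when the capacity is modest; the knapsack problem is NP-hard in general.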

Submitted 5 July, 2024; originally announced July 2024.

Comments: 16 pages, 8 figures

arXiv:2407.04295 [pdf, other]

Jailbreak Attacks and Defenses Against Large Language Models: A Survey

Authors: Sibo Yi, Yule Liu, Zhen Sun, Tianshuo Cong, Xinlei He, Jiaxing Song, Ke Xu, Qi Li

Abstract: Large Language Models (LLMs) have performed exceptionally in various text-generative tasks, including question answering, translation, code completion, etc. However, the over-assistance of LLMs has raised the challenge of "jailbreaking", which induces the model to generate malicious responses against the usage policy and society by designing adversarial prompts. With the emergence of jailbreak attack methods exploiting different vulnerabilities in LLMs, the corresponding safety alignment measures are also evolving. In this paper, we propose a comprehensive and detailed taxonomy of jailbreak attack and defense methods. For instance, the attack methods are divided into black-box and white-box attacks based on the transparency of the target model. Meanwhile, we classify defense methods into prompt-level and model-level defenses. Additionally, we further subdivide these attack and defense methods into distinct sub-classes and present a coherent diagram illustrating their relationships. We also conduct an investigation into the current evaluation methods and compare them from different perspectives. Our findings aim to inspire future research and practical implementations in safeguarding LLMs against adversarial attacks. Above all, although jailbreak remains a significant concern within the community, we believe that our work enhances the understanding of this domain and provides a foundation for developing more secure LLMs.

Submitted 5 July, 2024; originally announced July 2024.

arXiv:2406.12802 [pdf, other]

Decentralized Multi-Robot Line-of-Sight Connectivity Maintenance under Uncertainty

Authors: Yupeng Yang, Yiwei Lyu, Yanze Zhang, Sha Yi, Wenhao Luo

Abstract: In this paper, we propose a novel decentralized control method to maintain Line-of-Sight connectivity for multi-robot networks in the presence of Gaussian-distributed localization uncertainty. In contrast to most existing work that assumes perfect positional information about robots or enforces overly restrictive rigid formations against uncertainty, our method enables robots to preserve Line-of-Sight connectivity with high probability under unbounded Gaussian-like positional noise while remaining minimally intrusive to the original robots' tasks. This is achieved by a motion coordination framework that jointly optimizes the set of existing Line-of-Sight edges to preserve and control revisions to the nominal task-related controllers, subject to the safety constraints and the corresponding composition of uncertainty-aware Line-of-Sight control constraints. Such compositional control constraints, expressed by our novel notion of probabilistic Line-of-Sight connectivity barrier certificates (PrLOS-CBC) for pairwise robots using control barrier functions, explicitly characterize the deterministic admissible control space for the two robots. The resulting motion ensures Line-of-Sight connectedness for the robot team with high probability. Furthermore, we propose a fully decentralized algorithm that decomposes the motion coordination framework by interleaving the composite constraint specification and solving for the resulting optimization-based controllers. The optimality of our approach is justified by theoretical proofs. Simulation and real-world experimental results demonstrate the effectiveness of our method.
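The step from a probabilistic constraint to a deterministic admissible set rests, in the simplest case, on a standard fact about Gaussians: for x ~ N(μ, Σ) and 0 < δ < 1, the chance constraint P(aᵀx ≤ b) ≥ 1 − δ holds exactly when aᵀμ + Φ⁻¹(1 − δ)·sqrt(aᵀΣa) ≤ b, where Φ⁻¹ is the standard normal quantile. The sketch below checks that tightened condition for a single linear (half-space) constraint. It only illustrates the general idea; it is not the PrLOS-CBC formulation, whose line-of-sight constraints the abstract does not spell out, and the function name and example numbers are hypothetical.

```python
import math
from statistics import NormalDist
from typing import Sequence


def gaussian_halfspace_chance_ok(
    a: Sequence[float],
    mu: Sequence[float],
    sigma: Sequence[Sequence[float]],
    b: float,
    delta: float,
) -> bool:
    """Return True iff P(a^T x <= b) >= 1 - delta for x ~ N(mu, sigma).

    Uses the deterministic equivalent a^T mu + z * sqrt(a^T sigma a) <= b,
    where z is the (1 - delta) quantile of the standard normal distribution.
    """
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    n = len(a)
    mean = sum(a[i] * mu[i] for i in range(n))  # mean of the scalar a^T x
    var = sum(a[i] * sigma[i][j] * a[j] for i in range(n) for j in range(n))  # its variance
    z = NormalDist().inv_cdf(1.0 - delta)
    return mean + z * math.sqrt(max(var, 0.0)) <= b


identity = [[1.0, 0.0], [0.0, 1.0]]
print(gaussian_halfspace_chance_ok([1.0, 0.0], [0.0, 0.0], identity, 2.0, 0.05))  # True
print(gaussian_halfspace_chance_ok([1.0, 0.0], [0.0, 0.0], identity, 1.5, 0.05))  # False
```

Shrinking δ raises the quantile and therefore the safety margin, which is why such constraints become more conservative as the required probability grows.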

Submitted 18 June, 2024; originally announced June 2024.

Comments: Accepted by RSS 2024

arXiv:2406.12225 [pdf, other]

The Solution for CVPR2024 Foundational Few-Shot Object Detection Challenge

Authors: Hongpeng Pan, Shifeng Yi, Shouwei Yang, Lei Qi, Bing Hu, Yi Xu, Yang Yang

Abstract: This report introduces an enhanced method for the Foundational Few-Shot Object Detection (FSOD) task, leveraging the vision-language model (VLM) for object detection. However, on specific datasets, VLM may encounter the problem where the detected targets are misaligned with the target concepts of interest. This misalignment hinders the zero-shot performance of VLM and the application of fine-tuning methods based on pseudo-labels. To address this issue, we propose the VLM+ framework, which integrates the multimodal large language model (MM-LLM). Specifically, we use MM-LLM to generate a series of referential expressions for each category. Based on the VLM predictions and the given annotations, we select the best referential expression for each category by matching the maximum IoU. Subsequently, we use these referential expressions to generate pseudo-labels for all images in the training set and then combine them with the original labeled data to fine-tune the VLM. Additionally, we employ iterative pseudo-label generation and optimization to further enhance the performance of the VLM. Our approach achieves 32.56 mAP in the final test.
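The expression-selection step (choosing, per category, the referential expression whose VLM predictions best match the given annotations by IoU) can be sketched as follows. The scoring rule, averaging each annotated box's best IoU against the detections produced for a given expression, is an assumption made for illustration; the report does not specify how per-box IoUs are aggregated. Boxes are (x1, y1, x2, y2) tuples, and `select_expression` and the example expressions are hypothetical.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) with x2 > x1 and y2 > y1


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def select_expression(predictions: Dict[str, List[Box]], annotations: List[Box]) -> str:
    """Return the expression whose predicted boxes best cover the annotated boxes.

    Assumed score: mean over annotated boxes of the maximum IoU with any predicted box.
    """
    def score(preds: List[Box]) -> float:
        if not annotations:
            return 0.0
        return sum(max((iou(p, g) for p in preds), default=0.0) for g in annotations) / len(annotations)

    return max(predictions, key=lambda expr: score(predictions[expr]))


preds = {"a red car": [(0, 0, 10, 10)], "vehicle": [(5, 5, 15, 15)]}
print(select_expression(preds, [(0, 0, 10, 10)]))  # a red car
```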

Submitted 17 June, 2024; originally announced June 2024.

Comments: CVPR2024 Foundational Few-Shot Object Detection Challenge

arXiv:2405.11868 [pdf, other]

Towards Graph Contrastive Learning: A Survey and Beyond

Authors: Wei Ju, Yifan Wang, Yifang Qin, Zhengyang Mao, Zhiping Xiao, Junyu Luo, Junwei Yang, Yiyang Gu, Dongjie Wang, Qingqing Long, Siyu Yi, Xiao Luo, Ming Zhang

Abstract: In recent years, deep learning on graphs has achieved remarkable success in various domains. However, the reliance on annotated graph data remains a significant bottleneck due to its prohibitive cost and time-intensive nature. To address this challenge, self-supervised learning (SSL) on graphs has gained increasing attention and has made significant progress. SSL enables machine learning models to produce informative representations from unlabeled graph data, reducing the reliance on expensive labeled data. While SSL on graphs has witnessed widespread adoption, one critical component, Graph Contrastive Learning (GCL), has not been thoroughly investigated in the existing literature. Thus, this survey aims to fill this gap by offering a dedicated survey on GCL. We provide a comprehensive overview of the fundamental principles of GCL, including data augmentation strategies, contrastive modes, and contrastive optimization objectives. Furthermore, we explore the extensions of GCL to other aspects of data-efficient graph learning, such as weakly supervised learning, transfer learning, and related scenarios. We also discuss practical applications spanning domains such as drug discovery, genomics analysis, recommender systems, and finally outline the challenges and potential future directions in this field.
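Many GCL methods optimize an InfoNCE-style contrastive objective between two augmented views of the same graph: a node's embedding in one view should be more similar to its own embedding in the other view (the positive) than to other nodes' embeddings (the negatives). The sketch below computes a simplified variant that uses only cross-view negatives, with cosine similarity and a temperature `tau`. It shows one common objective, not a specific method from the survey, and the default temperature is an arbitrary assumption.

```python
import math
from typing import List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity, defined as 0 when either vector is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def info_nce(view1: List[Vector], view2: List[Vector], tau: float = 0.5) -> float:
    """Mean InfoNCE loss where view1[i] and view2[i] embed the same node.

    For anchor i the positive is view2[i] and the negatives are view2[j], j != i:
    loss_i = -log( exp(s_ii / tau) / sum_j exp(s_ij / tau) ).
    """
    n = len(view1)
    total = 0.0
    for i in range(n):
        logits = [cosine(view1[i], view2[j]) / tau for j in range(n)]
        m = max(logits)  # subtract the max before exponentiating, for numerical stability
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denom - logits[i]
    return total / n
```

Matched views yield a lower loss than mismatched ones; for example, `info_nce([[1, 0], [0, 1]], [[1, 0], [0, 1]])` is smaller than `info_nce([[1, 0], [0, 1]], [[0, 1], [1, 0]])`.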

Submitted 20 May, 2024; originally announced May 2024.

arXiv:2405.04773 [pdf, other]

Hypergraph-enhanced Dual Semi-supervised Graph Classification

Authors: Wei Ju, Zhengyang Mao, Siyu Yi, Yifang Qin, Yiyang Gu, Zhiping Xiao, Yifan Wang, Xiao Luo, Ming Zhang

Abstract: In this paper, we study semi-supervised graph classification, which aims at accurately predicting the categories of graphs in scenarios with limited labeled graphs and abundant unlabeled graphs. Despite the promising capability of graph neural networks (GNNs), they typically require a large number of costly labeled graphs, while a wealth of unlabeled graphs fail to be effectively utilized. Moreover, GNNs are inherently limited to encoding local neighborhood information using message-passing mechanisms, thus lacking the ability to model higher-order dependencies among nodes. To tackle these challenges, we propose a Hypergraph-Enhanced DuAL framework named HEAL for semi-supervised graph classification, which captures graph semantics from the perspective of the hypergraph and the line graph, respectively. Specifically, to better explore the higher-order relationships among nodes, we design a hypergraph structure learning to adaptively learn complex node dependencies beyond pairwise relations. Meanwhile, based on the learned hypergraph, we introduce a line graph to capture the interaction between hyperedges, thereby better mining the underlying semantic structures. Finally, we develop a relational consistency learning to facilitate knowledge transfer between the two branches and provide better mutual guidance. Extensive experiments on real-world graph datasets verify the effectiveness of the proposed method against existing state-of-the-art methods.
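In the standard definition, the line graph of a hypergraph has one vertex per hyperedge, and two vertices are adjacent when the corresponding hyperedges share at least one node; this is the kind of structure through which interactions between hyperedges can be modeled. A minimal construction from a list of hyperedges (node sets) might look like this. The function name is hypothetical, and weighting each edge by the overlap size is an optional assumption, not something the abstract specifies.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def hypergraph_line_graph(hyperedges: List[Set[int]]) -> Dict[Tuple[int, int], int]:
    """Build the line graph of a hypergraph.

    Vertex k of the line graph stands for hyperedges[k]. An edge (k, l) with k < l
    is present when the two hyperedges share at least one node; its value is the
    number of shared nodes (an optional weighting).
    """
    edges: Dict[Tuple[int, int], int] = {}
    for k, l in combinations(range(len(hyperedges)), 2):
        shared = len(hyperedges[k] & hyperedges[l])
        if shared > 0:
            edges[(k, l)] = shared
    return edges


print(hypergraph_line_graph([{0, 1, 2}, {2, 3}, {4, 5}, {1, 5}]))
# {(0, 1): 1, (0, 3): 1, (2, 3): 1}
```

The pairwise loop is quadratic in the number of hyperedges; for large hypergraphs, an inverted index from each node to the hyperedges containing it avoids comparing hyperedges that cannot overlap.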

Submitted 28 May, 2024; v1 submitted 7 May, 2024; originally announced May 2024.

Comments: Accepted by Proceedings of the 41st International Conference on Machine Learning (ICML 2024)

arXiv:2404.01954 [pdf, other]

HyperCLOVA X Technical Report

Authors: Kang Min Yoo, Jaegeun Han, Sookyo In, Heewon Jeon, Jisu Jeong, Jaewook Kang, Hyunwook Kim, Kyung-Min Kim, Munhyong Kim, Sungju Kim, Donghyun Kwak, Hanock Kwak, Se Jung Kwon, Bado Lee, Dongsoo Lee, Gichang Lee, Jooho Lee, Baeseong Park, Seongjin Shin, Joonsang Yu, Seolki Baek, Sumin Byeon, Eungsup Cho, Dooseok Choe, Jeesung Han, et al. (371 additional authors not shown)

Abstract: We introduce HyperCLOVA X, a family of large language models (LLMs) tailored to the Korean language and culture, along with competitive capabilities in English, math, and coding. HyperCLOVA X was trained on a balanced mix of Korean, English, and code data, followed by instruction-tuning with high-quality human-annotated datasets while abiding by strict safety guidelines reflecting our commitment to responsible AI. The model is evaluated across various benchmarks, including comprehensive reasoning, knowledge, commonsense, factuality, coding, math, chatting, instruction-following, and harmlessness, in both Korean and English. HyperCLOVA X exhibits strong reasoning capabilities in Korean backed by a deep understanding of the language and cultural nuances. Further analysis of the inherent bilingual nature and its extension to multilingualism highlights the model's cross-lingual proficiency and strong generalization ability to untargeted languages, including machine translation between several language pairs and cross-lingual inference tasks. We believe that HyperCLOVA X can provide helpful guidance for regions or countries in developing their sovereign LLMs.

Submitted 13 April, 2024; v1 submitted 2 April, 2024; originally announced April 2024.

Comments: 44 pages; updated authors list and fixed author names

arXiv:2403.13660 [pdf]

ProMamba: Prompt-Mamba for polyp segmentation

Authors: Jianhao Xie, Ruofan Liao, Ziang Zhang, Sida Yi, Yuesheng Zhu, Guibo Luo

Abstract: Detecting polyps through colonoscopy is an important task in medical image segmentation, which provides significant assistance and reference value for clinical surgery. However, accurate segmentation of polyps is a challenging task due to two main reasons. Firstly, polyps exhibit various shapes and colors. Secondly, the boundaries between polyps and their normal surroundings are often unclear. Additionally, significant differences between different datasets lead to limited generalization capabilities of existing methods. To address these issues, we propose a segmentation model based on Prompt-Mamba, which incorporates the latest Vision-Mamba and prompt technologies. Compared to previous models trained on the same dataset, our model not only maintains high segmentation accuracy on the validation part of the same dataset but also demonstrates superior accuracy on unseen datasets, exhibiting excellent generalization capabilities. Notably, we are the first to apply the Vision-Mamba architecture to polyp segmentation and the first to utilize prompt technology in a polyp segmentation model. Our model efficiently accomplishes segmentation tasks, surpassing previous state-of-the-art methods by an average of 5% across six datasets. Furthermore, we have developed multiple versions of our model with scaled parameter counts, achieving better performance than previous models even with fewer parameters. Our code and trained weights will be released soon.

Submitted 26 March, 2024; v1 submitted 20 March, 2024; originally announced March 2024.

Comments: 10 pages, 2 figures, 3 tables

arXiv:2403.04468 [pdf, other]

A Survey of Graph Neural Networks in Real world: Imbalance, Noise, Privacy and OOD Challenges

Authors: Wei Ju, Siyu Yi, Yifan Wang, Zhiping Xiao, Zhengyang Mao, Hourun Li, Yiyang Gu, Yifang Qin, Nan Yin, Senzhang Wang, Xinwang Liu, Xiao Luo, Philip S. Yu, Ming Zhang

Abstract: Graph-structured data exhibits universality and widespread applicability across diverse domains, such as social network analysis, biochemistry, financial fraud detection, and network security. Significant strides have been made in leveraging Graph Neural Networks (GNNs) to achieve remarkable success in these areas. However, in real-world scenarios, the training environment for models is often far from ideal, leading to substantial performance degradation of GNN models due to various unfavorable factors, including imbalance in data distribution, the presence of noise in erroneous data, privacy protection of sensitive information, and generalization capability for out-of-distribution (OOD) scenarios. To tackle these issues, substantial efforts have been devoted to improving the performance of GNN models in practical real-world scenarios, as well as enhancing their reliability and robustness. In this paper, we present a comprehensive survey that systematically reviews existing GNN models, focusing on solutions to the four mentioned real-world challenges including imbalance, noise, privacy, and OOD in practical scenarios that many existing reviews have not considered. Specifically, we first highlight the four key challenges faced by existing GNNs, paving the way for our exploration of real-world GNN models. Subsequently, we provide detailed discussions on these four aspects, dissecting how these solutions contribute to enhancing the reliability and robustness of GNN models. Last but not least, we outline promising directions and offer future perspectives in the field.

Submitted 7 March, 2024; originally announced March 2024.

arXiv:2403.01091 [pdf, other]

COOL: A Conjoint Perspective on Spatio-Temporal Graph Neural Network for Traffic Forecasting

Authors: Wei Ju, Yusheng Zhao, Yifang Qin, Siyu Yi, Jingyang Yuan, Zhiping Xiao, Xiao Luo, Xiting Yan, Ming Zhang

Abstract: This paper investigates traffic forecasting, which attempts to forecast the future state of traffic based on historical situations. This problem has received ever-increasing attention in various scenarios and facilitated the development of numerous downstream applications such as urban planning and transportation management. However, the efficacy of existing methods remains sub-optimal due to their tendency to model temporal and spatial relationships independently, thereby inadequately accounting for complex high-order interactions of both worlds. Moreover, the diversity of transitional patterns in traffic forecasting makes them challenging to capture for existing approaches, warranting a deeper exploration of their diversity. Toward this end, this paper proposes Conjoint Spatio-Temporal graph neural network (abbreviated as COOL), which models heterogeneous graphs from prior and posterior information to conjointly capture high-order spatio-temporal relationships. On the one hand, heterogeneous graphs connecting sequential observation are constructed to extract composite spatio-temporal relationships via prior message passing. On the other hand, we model dynamic relationships using constructed affinity and penalty graphs, which guide posterior message passing to incorporate complementary semantic information into node representations. Moreover, to capture diverse transitional properties to enhance traffic forecasting, we propose a conjoint self-attention decoder that models diverse temporal patterns from both multi-rank and multi-scale views. Experimental results on four popular benchmark datasets demonstrate that our proposed COOL provides state-of-the-art performance compared with the competitive baselines.

Submitted 1 March, 2024; originally announced March 2024.

Comments: Accepted by Information Fusion 2024

arXiv:2402.00447 [pdf, ps, other]

A Survey of Data-Efficient Graph Learning

Authors: Wei Ju, Siyu Yi, Yifan Wang, Qingqing Long, Junyu Luo, Zhiping Xiao, Ming Zhang

Abstract: Graph-structured data, prevalent in domains ranging from social networks to biochemical analysis, serve as the foundation for diverse real-world systems. While graph neural networks demonstrate proficiency in modeling this type of data, their success is often reliant on significant amounts of labeled data, posing a challenge in practical scenarios with limited annotation resources. To tackle this problem, tremendous efforts have been devoted to enhancing graph machine learning performance under low-resource settings by exploring various approaches to minimal supervision. In this paper, we introduce a novel concept of Data-Efficient Graph Learning (DEGL) as a research frontier, and present the first survey that summarizes the current progress of DEGL. We begin by highlighting the challenges inherent in training models that depend on large amounts of labeled data, paving the way for our exploration into DEGL. Next, we systematically review recent advances on this topic from several key aspects, including self-supervised graph learning, semi-supervised graph learning, and few-shot graph learning. Also, we state promising directions for future research, contributing to the evolution of graph machine learning.

Submitted 19 June, 2024; v1 submitted 1 February, 2024; originally announced February 2024.

Comments: Accepted by Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (IJCAI 2024)

arXiv:2401.08718 [pdf, other]

Investigating Fouling Efficiency in Football Using Expected Booking (xB) Model

Authors: Adnan Azmat, Su Su Yi

Abstract: This paper introduces the Expected Booking (xB) model, a novel metric designed to estimate the likelihood of a foul resulting in a yellow card in football. Through three iterative experiments employing ensemble methods, the model demonstrates improved performance with additional features and an expanded dataset. Analysis of FIFA World Cup 2022 data validates the model's efficacy in providing insights into team and player fouling tactics, aligning with actual defensive performance. The xB model addresses a gap in the examination of fouling efficiency, emphasizing defensive strategies, which are often overlooked. Further enhancements are suggested through the incorporation of comprehensive data and spatial features.

Submitted 16 January, 2024; originally announced January 2024.

arXiv:2311.15210 [pdf, other]

Topology combined machine learning for consonant recognition

Authors: Pingyao Feng, Siheng Yi, Qingrui Qu, Zhiwang Yu, Yifei Zhu

Abstract: In artificial-intelligence-aided signal processing, existing deep learning models often exhibit a black-box structure, and their validity and comprehensibility remain elusive. The integration of topological methods, despite its relatively nascent application, serves a dual purpose of making models more interpretable as well as extracting structural information from time-dependent data for smarter learning. Here, we provide a transparent and broadly applicable methodology, TopCap, to capture the most salient topological features inherent in time series for machine learning. Rooted in high-dimensional ambient spaces, TopCap is capable of capturing features rarely detected in datasets with low intrinsic dimensionality. Applying time-delay embedding and persistent homology, we obtain descriptors which encapsulate information such as the vibration of a time series, in terms of its variability of frequency, amplitude, and average line, demonstrated with simulated data. This information is then vectorised and fed into multiple machine learning algorithms such as k-nearest neighbours and support vector machine. Notably, in classifying voiced and voiceless consonants, TopCap achieves an accuracy exceeding 96% and is geared towards designing topological convolutional layers for deep learning of speech and audio signals.
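Time-delay embedding, the step applied before persistent homology, maps a scalar series x to the points (x[t], x[t+τ], ..., x[t+(d−1)τ]) in d-dimensional space; a periodic signal then traces a closed curve, the kind of loop that persistent homology can register as a one-dimensional feature. The sketch below implements only the embedding. The function name is hypothetical, the dimension `dim` and delay `tau` are free parameters whose example values are illustrative, and the persistent-homology and vectorisation stages are omitted.

```python
import math
from typing import List, Sequence, Tuple


def delay_embed(x: Sequence[float], dim: int, tau: int) -> List[Tuple[float, ...]]:
    """Return the delay vectors (x[t], x[t+tau], ..., x[t+(dim-1)*tau]).

    There are len(x) - (dim - 1) * tau such vectors (none if the series is too short).
    """
    if dim < 1 or tau < 1:
        raise ValueError("dim and tau must be positive")
    span = (dim - 1) * tau
    return [tuple(x[t + k * tau] for k in range(dim)) for t in range(len(x) - span)]


# A sine wave with period 20 samples, embedded with a quarter-period delay:
# each point is (sin(theta), cos(theta)), so the cloud lies on the unit circle.
signal = [math.sin(2 * math.pi * t / 20) for t in range(100)]
cloud = delay_embed(signal, dim=2, tau=5)
```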

Submitted 26 November, 2023; originally announced November 2023.

arXiv:2310.04162 [pdf, other]

Light-LOAM: A Lightweight LiDAR Odometry and Mapping based on Graph-Matching

Authors: Shiquan Yi, Yang Lyu, Lin Hua, Quan Pan, Chunhui Zhao

Abstract: Simultaneous Localization and Mapping (SLAM) plays an important role in robot autonomy. Reliability and efficiency are the two most valued features for applying SLAM in robot applications. In this paper, we consider achieving a reliable LiDAR-based SLAM function on computation-limited platforms, such as quadrotor UAVs, based on graph-based point cloud association. First, contrary to most works that select salient features for point cloud registration, we propose a non-conspicuous feature selection strategy for reliability and robustness. Then a two-stage correspondence selection method is used to register the point cloud, consisting of KD-tree-based coarse matching followed by a graph-based matching method that uses geometric consistency to vote out incorrect correspondences. Additionally, we propose an odometry approach in which the weight optimization is guided by the vote results from the aforementioned geometric consistency graph. In this way, the LiDAR odometry optimization converges rapidly to a fairly accurate transformation, allowing the back-end module to finish the mapping task efficiently. Finally, we evaluate our proposed framework on the KITTI odometry dataset and in real-world environments. Experiments show that our SLAM system achieves comparable or higher accuracy with more balanced computational efficiency than mainstream LiDAR-based SLAM solutions.
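Voting out incorrect correspondences by geometric consistency rests on a property of rigid motions: they preserve distances, so two correct correspondences (p_i → q_i) and (p_j → q_j) must satisfy ‖p_i − p_j‖ ≈ ‖q_i − q_j‖. The sketch below shows a generic pairwise-consistency vote, assuming a distance tolerance `eps` and a minimum vote fraction `min_votes_frac`; both are hypothetical parameters, and the whole routine is an illustration of the general technique rather than Light-LOAM's matching method.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def consistency_votes(src: Sequence[Point], dst: Sequence[Point], eps: float) -> List[int]:
    """For each correspondence src[i] -> dst[i], count the other correspondences
    that are geometrically consistent with it (pairwise distances agree within eps)."""
    n = len(src)
    votes = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if abs(math.dist(src[i], src[j]) - math.dist(dst[i], dst[j])) <= eps:
                votes[i] += 1
                votes[j] += 1
    return votes


def filter_correspondences(
    src: Sequence[Point], dst: Sequence[Point], eps: float = 0.1, min_votes_frac: float = 0.5
) -> List[int]:
    """Keep the indices whose votes reach min_votes_frac of the other correspondences."""
    votes = consistency_votes(src, dst, eps)
    threshold = min_votes_frac * (len(src) - 1)
    return [i for i, v in enumerate(votes) if v >= threshold]


# Three correspondences follow a translation by (5, 0, 0); the last one is an outlier.
src = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
dst = [(5.0, 0.0, 0.0), (6.0, 0.0, 0.0), (5.0, 1.0, 0.0), (9.0, 9.0, 9.0)]
print(filter_correspondences(src, dst))  # [0, 1, 2]
```

The vote counts could also serve as per-correspondence weights in a subsequent pose optimization, which is the role the abstract assigns to the vote results; how Light-LOAM maps votes to weights is not stated there.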

Submitted 6 October, 2023; originally announced October 2023.

arXiv:2310.02792 [pdf, other]

doi 10.1109/TMI.2024.3419780

Continuous 3D Myocardial Motion Tracking via Echocardiography

Authors: Chengkang Shen, Hao Zhu, You Zhou, Yu Liu, Si Yi, Lili Dong, Weipeng Zhao, David J. Brady, Xun Cao, Zhan Ma, Yi Lin

Abstract: Myocardial motion tracking stands as an essential clinical tool in the prevention and detection of cardiovascular diseases (CVDs), the foremost cause of death globally. However, current techniques suffer from incomplete and inaccurate motion estimation of the myocardium in both spatial and temporal dimensions, hindering the early identification of myocardial dysfunction. To address these challenges, this paper introduces the Neural Cardiac Motion Field (NeuralCMF). NeuralCMF leverages implicit neural representation (INR) to model the 3D structure and the comprehensive 6D forward/backward motion of the heart. This method surpasses pixel-wise limitations by offering the capability to continuously query the precise shape and motion of the myocardium at any specific point throughout the cardiac cycle, enhancing the detailed analysis of cardiac dynamics beyond traditional speckle tracking. Notably, NeuralCMF operates without the need for paired datasets, and its optimization is self-supervised through the physics knowledge priors in both space and time dimensions, ensuring compatibility with both 2D and 3D echocardiogram video inputs. Experimental validations across three representative datasets support the robustness and innovative nature of the NeuralCMF, marking significant advantages over existing state-of-the-art methods in cardiac imaging and motion tracking.

Submitted 27 June, 2024; v1 submitted 4 October, 2023; originally announced October 2023.

Comments: 18 pages, 11 figures

Journal ref: IEEE Transactions on Medical Imaging, June 2024

arXiv:2309.05287 [pdf, other]

Addressing Feature Imbalance in Sound Source Separation

Authors: Jaechang Kim, Jeongyeon Hwang, Soheun Yi, Jaewoong Cho, Jungseul Ok

Abstract: Neural networks often suffer from a feature preference problem, where they tend to overly rely on specific features to solve a task while disregarding other features, even if those neglected features are essential for the task. Feature preference problems have primarily been investigated in classification tasks. However, we observe that feature preference also occurs in high-dimensional regression tasks, specifically source separation. To mitigate feature preference in source separation, we propose FEAture BAlancing by Suppressing Easy feature (FEABASE). This approach enables efficient data utilization by learning hidden information about the neglected feature. We evaluate our method on a multi-channel source separation task, where feature preference between spatial and timbre features appears.

Submitted 4 October, 2023; v1 submitted 11 September, 2023; originally announced September 2023.

arXiv:2309.04694 [pdf, other]

Redundancy-Free Self-Supervised Relational Learning for Graph Clustering

Authors: Si-Yu Yi, Wei Ju, Yifang Qin, Xiao Luo, Luchen Liu, Yong-Dao Zhou, Ming Zhang

Abstract: Graph clustering, which learns node representations for effective cluster assignments, is a fundamental yet challenging task in data analysis and has received considerable attention alongside graph neural networks in recent years. However, most existing methods overlook the inherent relational information among the non-independent and non-identically distributed nodes in a graph. Due to the lack of exploration of relational attributes, the semantic information of graph-structured data fails to be fully exploited, which leads to poor clustering performance. In this paper, we propose a novel self-supervised deep graph clustering method named Relational Redundancy-Free Graph Clustering (R$^2$FGC) to tackle the problem. It extracts attribute- and structure-level relational information from both global and local views based on an autoencoder and a graph autoencoder. To obtain effective representations of the semantic information, we preserve the consistent relation among augmented nodes, whereas the redundant relation is further reduced for learning discriminative embeddings. In addition, a simple yet valid strategy is utilized to alleviate the over-smoothing issue. Extensive experiments are performed on widely used benchmark datasets to validate the superiority of our R$^2$FGC over state-of-the-art baselines. Our code is available at https://github.com/yisiyu95/R2FGC.

Submitted 9 September, 2023; originally announced September 2023.

Comments: Accepted by IEEE Transactions on Neural Networks and Learning Systems (TNNLS 2024)

arXiv:2309.00962 [pdf, other]

NTU4DRadLM: 4D Radar-centric Multi-Modal Dataset for Localization and Mapping

Authors: Jun Zhang, Huayang Zhuge, Yiyao Liu, Guohao Peng, Zhenyu Wu, Haoyuan Zhang, Qiyang Lyu, Heshan Li, Chunyang Zhao, Dogan Kircali, Sanat Mharolkar, Xun Yang, Su Yi, Yuanzhe Wang, Danwei Wang

Abstract: Simultaneous Localization and Mapping (SLAM) is moving towards a robust perception age. However, LiDAR- and visual-SLAM may easily fail in adverse conditions (rain, snow, smoke, fog, etc.). In comparison, SLAM based on 4D radar, thermal camera and IMU can work robustly, but only a few studies on it can be found. A major reason is the lack of related datasets, which seriously hinders the research. Even though some datasets based on 4D radar have been proposed in the past four years, they are mainly designed for object detection rather than SLAM. Furthermore, they normally do not include a thermal camera. Therefore, in this paper, NTU4DRadLM is presented to meet this requirement. The main characteristics are: 1) It is the only dataset that simultaneously includes all 6 sensors: 4D radar, thermal camera, IMU, 3D LiDAR, visual camera and RTK GPS. 2) It is specifically designed for SLAM tasks, providing fine-tuned ground truth odometry and intentionally formulated loop closures. 3) It considers both a low-speed robot platform and a fast-speed unmanned vehicle platform. 4) It covers structured, unstructured and semi-structured environments. 5) It considers both middle- and large-scale outdoor environments, i.e., the 6 trajectories range from 246 m to 6.95 km. 6) It comprehensively evaluates three types of SLAM algorithms. In total, the dataset covers around 17.6 km, 85 min and 50 GB, and it will be accessible from this link: https://github.com/junzhang2016/NTU4DRadLM

Submitted 2 September, 2023; originally announced September 2023.

Comments: 2023 IEEE International Intelligent Transportation Systems Conference (ITSC 2023)

arXiv:2308.16609 [pdf, other]

Towards Long-Tailed Recognition for Graph Classification via Collaborative Experts

Authors: Siyu Yi, Zhengyang Mao, Wei Ju, Yongdao Zhou, Luchen Liu, Xiao Luo, Ming Zhang

Abstract: Graph classification, which aims to learn graph-level representations for effective class assignments, has achieved outstanding results that rely heavily on high-quality datasets with balanced class distributions. In fact, most real-world graph data naturally present a long-tailed form, where the head classes contain many more samples than the tail classes; it is thus essential to study graph-level classification over long-tailed data, which remains largely unexplored. However, most existing long-tailed learning methods in vision fail to jointly optimize representation learning and classifier training, and neglect the mining of hard-to-classify classes. Directly applying existing methods to graphs may lead to sub-optimal performance, since a model trained on graphs would be more sensitive to the long-tailed distribution due to the complex topological characteristics. Hence, in this paper, we propose a novel long-tailed graph-level classification framework via Collaborative Multi-expert Learning (CoMe) to tackle the problem. To balance the contributions of head and tail classes, we first develop balanced contrastive learning from the view of representation learning, and then design individual-expert classifier training based on hard class mining. In addition, we execute gated fusion and disentangled knowledge distillation among the multiple experts to promote collaboration in a multi-expert framework. 
Comprehensive experiments are performed on seven widely used benchmark datasets to demonstrate the superiority of our method CoMe over state-of-the-art baselines.

Submitted 5 September, 2023; v1 submitted 31 August, 2023; originally announced August 2023.

Comments: Accepted by IEEE Transactions on Big Data (TBD 2024)

arXiv:2308.10058 [pdf]

R-C-P Method: An Autonomous Volume Calculation Method Using Image Processing and Machine Vision

Authors: MA Muktadir, Sydney Parker, Sun Yi

Abstract: Machine vision and image processing are often used with sensors for situation awareness in autonomous systems, from industrial robots to self-driving cars. 3D depth sensors, such as LiDAR (Light Detection and Ranging) and radar, are great inventions for autonomous systems. Due to the complexity of the setup, LiDAR may not be suitable for some operational environments, for example, a space environment. This study was motivated by a desire to obtain real-time volumetric and change information with multiple 2D cameras instead of a depth camera. Two cameras were used to measure the dimensions of a rectangular object in real time. The R-C-P (row-column-pixel) method is developed using image processing and edge detection. In addition to the surface areas, the R-C-P method also detects discontinuous edges or volumes. Lastly, experimental work is presented to illustrate the R-C-P method, which provides the equations for calculating surface area dimensions. Using these equations with the given distance between the object and the camera, the vision system provides the dimensions of actual objects.

Submitted 3 February, 2024; v1 submitted 19 August, 2023; originally announced August 2023.
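The R-C-P abstract above does not state its equations, so the following is only a minimal sketch under a standard pinhole-camera assumption: scan the rows and columns of a binary object mask to get its pixel extent, then convert pixels to metres using a known object-to-camera distance and focal length. The function names, the two-camera layout (front and side views), and the focal length are hypothetical illustrations, not the authors' method.

```python
import numpy as np

def pixel_extent(mask):
    """Row/column scan of a binary object mask: return (height_px, width_px)
    of the span covered by foreground pixels."""
    rows = np.flatnonzero(mask.any(axis=1))  # rows that contain the object
    cols = np.flatnonzero(mask.any(axis=0))  # columns that contain the object
    if rows.size == 0:
        return 0, 0
    return int(rows[-1] - rows[0] + 1), int(cols[-1] - cols[0] + 1)

def metric_size(extent_px, distance_m, focal_px):
    """Pinhole-camera assumption: real size = pixels * distance / focal length."""
    return extent_px * distance_m / focal_px

def box_dimensions(front_mask, side_mask, d_front, d_side, focal_px):
    """Hypothetical two-camera setup: the front camera sees height x width,
    the side camera sees height x depth of a rectangular box."""
    h_px, w_px = pixel_extent(front_mask)
    _, depth_px = pixel_extent(side_mask)
    h = metric_size(h_px, d_front, focal_px)
    w = metric_size(w_px, d_front, focal_px)
    d = metric_size(depth_px, d_side, focal_px)
    return h, w, d, h * w * d

# Synthetic masks standing in for edge-detected camera images.
front = np.zeros((100, 100), dtype=bool)
front[20:60, 10:90] = True   # 40 px tall, 80 px wide
side = np.zeros((100, 100), dtype=bool)
side[20:60, 30:70] = True    # 40 px tall, 40 px deep
print(box_dimensions(front, side, d_front=2.0, d_side=2.0, focal_px=400.0))
```

With a 400 px focal length at 2 m, each pixel spans 5 mm, so the synthetic box measures roughly 0.2 m x 0.4 m x 0.2 m. A real system would also need lens calibration and edge detection to produce the masks.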

arXiv:2307.12882 [pdf, other]

doi 10.1145/3588001.3609364

FoodWise: Food Waste Reduction and Behavior Change on Campus with Data Visualization and Gamification

Authors: Yue Yu, Sophia Yi, Xi Nan, Leo Yu-Ho Lo, Kento Shigyo, Liwenhan Xie, Jeffry Wicaksana, Kwang-Ting Cheng, Huamin Qu

Abstract: Food waste presents a substantial challenge with significant environmental and economic ramifications, and its severity in campus environments is of particular concern. In response to this, we introduce FoodWise, a dual-component system tailored to inspire and incentivize campus communities to reduce food waste. The system consists of a data storytelling dashboard that graphically displays food waste information from university canteens, coupled with a mobile web application that encourages users to log their food waste reduction actions and rewards active participants for their efforts. Deployed during a two-week food-saving campaign at The Hong Kong University of Science and Technology (HKUST) in March 2023, FoodWise engaged over 200 participants from the university community, resulting in the logging of over 800 daily food-saving actions. Feedback collected post-campaign underscores the system's efficacy in elevating user consciousness about food waste and prompting behavioral shifts towards a more sustainable campus. This paper also provides insights for enhancing our system, contributing to a broader discourse on sustainable campus initiatives.

Submitted 27 July, 2023; v1 submitted 24 July, 2023; originally announced July 2023.

Comments: Accepted in ACM SIGCAS/SIGCHI Conference on Computing and Sustainable Societies (COMPASS) 2023

arXiv:2307.05906 [pdf, other]

Mini-Batch Optimization of Contrastive Loss

Authors: Jaewoong Cho, Kartik Sreenivasan, Keon Lee, Kyunghoo Mun, Soheun Yi, Jeong-Gwan Lee, Anna Lee, Jy-yong Sohn, Dimitris Papailiopoulos, Kangwook Lee

Abstract: Contrastive learning has gained significant attention as a method for self-supervised learning. The contrastive loss function ensures that embeddings of positive sample pairs (e.g., different samples from the same class or different views of the same object) are similar, while embeddings of negative pairs are dissimilar. Practical constraints such as large memory requirements make it challenging to consider all possible positive and negative pairs, leading to the use of mini-batch optimization. In this paper, we investigate the theoretical aspects of mini-batch optimization in contrastive learning. We show that mini-batch optimization is equivalent to full-batch optimization if and only if all $\binom{N}{B}$ mini-batches are selected, while sub-optimality may arise when examining only a subset. We then demonstrate that utilizing high-loss mini-batches can speed up SGD convergence and propose a spectral clustering-based approach for identifying these high-loss mini-batches. Our experimental results validate our theoretical findings and demonstrate that our proposed algorithm outperforms vanilla SGD in practically relevant settings, providing a better understanding of mini-batch optimization in contrastive learning.

Submitted 12 July, 2023; originally announced July 2023.
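A counting sketch can make the $\binom{N}{B}$ statement above more concrete. Assuming, for illustration only, that a mini-batch loss is built from terms over pairs of samples inside the batch, enumerating all $\binom{N}{B}$ mini-batches covers every pair exactly $\binom{N-2}{B-2}$ times, so every pair is weighted equally, as in the full batch. A single disjoint partition (one shuffled epoch) sees only a few pairs. This illustrates the symmetry only; it is not the paper's proof, and `pair_coverage` is a hypothetical helper.

```python
from collections import Counter
from itertools import combinations
from math import comb

def pair_coverage(batches):
    """Count how many mini-batches contain each unordered pair (i, j)."""
    counts = Counter()
    for batch in batches:
        counts.update(combinations(sorted(batch), 2))
    return counts

N, B = 6, 3
all_batches = list(combinations(range(N), B))   # all C(N, B) mini-batches
full = pair_coverage(all_batches)

# Every one of the C(N, 2) pairs appears, and each appears C(N-2, B-2) times.
assert len(all_batches) == comb(N, B)
assert len(full) == comb(N, 2)
assert set(full.values()) == {comb(N - 2, B - 2)}

# One disjoint partition of the data only covers pairs that share a batch.
partition = [(0, 1, 2), (3, 4, 5)]
part = pair_coverage(partition)
print(len(part), "of", comb(N, 2), "pairs seen")  # 6 of 15 pairs seen
```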

arXiv:2307.05358 [pdf, other]

Combating Data Imbalances in Federated Semi-supervised Learning with Dual Regulators

Authors: Sikai Bai, Shuaicheng Li, Weiming Zhuang, Jie Zhang, Song Guo, Kunlin Yang, Jun Hou, Shuai Zhang, Junyu Gao, Shuai Yi

Abstract: Federated learning has become a popular method to learn from decentralized heterogeneous data. Federated semi-supervised learning (FSSL) has emerged to train models from a small fraction of labeled data due to label scarcity on decentralized clients. Existing FSSL methods assume independent and identically distributed (IID) labeled data across clients and consistent class distribution between labeled and unlabeled data within a client. This work studies a more practical and challenging scenario of FSSL, where the data distribution differs not only across clients but also within a client between labeled and unlabeled data. To address this challenge, we propose a novel FSSL framework with dual regulators, FedDure. FedDure lifts the previous assumptions with a coarse-grained regulator (C-reg) and a fine-grained regulator (F-reg): C-reg regularizes the updating of the local model by tracking the learning effect on labeled data distribution; F-reg learns an adaptive weighting scheme tailored for unlabeled instances in each client. We further formulate the client model training as bi-level optimization that adaptively optimizes the model in the client with two regulators. Theoretically, we show the convergence guarantee of the dual regulators. Empirically, we demonstrate that FedDure is superior to the existing methods across a wide range of settings, notably by more than 11 on the CIFAR-10 and CINIC-10 datasets.

Submitted 11 March, 2024; v1 submitted 11 July, 2023; originally announced July 2023.

Journal ref: The 38th Annual AAAI Conference on Artificial Intelligence, 2024

arXiv:2306.16265 [pdf, other]

Reconfigurable Robot Control Using Flexible Coupling Mechanisms

Authors: Sha Yi, Katia Sycara, Zeynep Temel

Abstract: Reconfigurable robot swarms are capable of connecting with each other to form complex structures. Current mechanical or magnetic connection mechanisms can be complicated to manufacture, consume high power, have a limited load-bearing capacity, or can only form rigid structures. In this paper, we present our low-cost soft anchor design that enables flexible coupling and decoupling between robots. Our asymmetric anchor requires minimal force to be pushed into the opening of another robot while having a strong pulling force so that the connection between robots can be secured. To maintain this flexible coupling mechanism as an assembled structure, we present our Model Predictive Control (MPC) frameworks with polygon constraints to model the geometric relationship between robots. We conducted experiments on the soft anchor to obtain its force profile, which informed the three-bar linkage model of the anchor in the simulations. We show that the proposed mechanism and MPC frameworks enable the robots to couple, decouple, and perform various behaviors in both the simulation environment and hardware platform. Our code is available at https://github.com/ZoomLabCMU/puzzlebot_anchor . Video is available at https://www.youtube.com/watch?v=R3gFplorCJg .

Submitted 28 June, 2023; originally announced June 2023.

arXiv:2306.10503 [pdf, other]

A Survey on User-Space Storage and Its Implementations

Authors: Junzhe Li, Xiurui Pan, Shushu Yi, Jie Zhang

Abstract: The storage stack in the traditional operating system is primarily optimized towards improving CPU utilization and hiding the long I/O latency imposed by slow I/O devices such as hard disk drives (HDDs). However, emerging storage media have undergone significant technology shifts in the past decade and now exhibit high bandwidth and low latency. These high-performance storage devices, unfortunately, suffer from the huge overheads imposed by system software, including the long storage stack and frequent context switches between user and kernel modes. Many researchers have invested huge efforts in addressing this challenge by constructing a direct software path between a user process and the underlying storage devices. We revisit such novel designs in prior work and present a survey in this paper. Specifically, we classify prior research into three categories according to their commonalities. We then present the designs of each category based on the timeline and analyze their uniqueness and contributions. This paper also reviews the applications that exploit the characteristics of these designs. Given that user-space storage is a growing research field, we believe this paper can inspire future researchers interested in user-space storage system designs.

Submitted 18 June, 2023; originally announced June 2023.

arXiv:2304.13017 [pdf, other]

DuETT: Dual Event Time Transformer for Electronic Health Records

Authors: Alex Labach, Aslesha Pokhrel, Xiao Shi Huang, Saba Zuberi, Seung Eun Yi, Maksims Volkovs, Tomi Poutanen, Rahul G. Krishnan

Abstract: Electronic health records (EHRs) recorded in hospital settings typically contain a wide range of numeric time series data that is characterized by high sparsity and irregular observations. Effective modelling for such data must exploit its time series nature, the semantic relationship between different types of observations, and information in the sparsity structure of the data. Self-supervised Transformers have shown outstanding performance in a variety of structured tasks in NLP and computer vision. But multivariate time series data contains structured relationships over two dimensions, time and recorded event type, and straightforward applications of Transformers to time series data do not leverage this distinct structure. The quadratic scaling of self-attention layers can also significantly limit the input sequence length without appropriate input engineering. We introduce the DuETT architecture, an extension of Transformers designed to attend over both time and event type dimensions, yielding robust representations from EHR data. DuETT uses an aggregated input where sparse time series are transformed into a regular sequence with fixed length; this lowers the computational complexity relative to previous EHR Transformer models and, more importantly, enables the use of larger and deeper neural networks. 
When trained with self-supervised prediction tasks, which provide rich and informative signals for model pre-training, our model outperforms state-of-the-art deep learning models on multiple downstream tasks from the MIMIC-IV and PhysioNet-2012 EHR datasets.

Submitted 15 August, 2023; v1 submitted 25 April, 2023; originally announced April 2023.

Comments: Accepted at MLHC 2023, camera-ready version
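The DuETT abstract above says sparse time series are transformed into a regular, fixed-length sequence. One plausible way to do that, sketched here under stated assumptions, is to bin irregular `(time, event_type, value)` observations into a fixed grid of time bins by event types, keeping per-bin means plus counts so the sparsity pattern is preserved. The bin count, the mean aggregation, and the `aggregate` helper are hypothetical choices, not necessarily the paper's.

```python
import numpy as np

def aggregate(obs, n_event_types, n_bins, horizon):
    """Bin irregular (time, event_type, value) observations into a fixed
    n_bins x n_event_types grid of per-bin means, plus per-bin counts that
    record where data was actually observed. Empty bins stay 0 with count 0."""
    sums = np.zeros((n_bins, n_event_types))
    counts = np.zeros((n_bins, n_event_types), dtype=int)
    for t, e, v in obs:
        if not 0 <= t < horizon:
            continue  # drop observations outside the window
        b = int(t / horizon * n_bins)  # index of the time bin containing t
        sums[b, e] += v
        counts[b, e] += 1
    # Mean per bin, leaving empty bins at 0 instead of dividing by zero.
    means = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
    return means, counts

# Hypothetical 48-hour stay, 3 event types, 4 bins of 12 hours each.
obs = [(0.5, 0, 80.0), (1.2, 0, 90.0), (1.4, 1, 7.4), (47.9, 2, 1.1)]
means, counts = aggregate(obs, n_event_types=3, n_bins=4, horizon=48.0)
print(means)
print(counts)
```

The resulting grid has one axis for time and one for event type, which is the kind of two-dimensional input a model that attends along both axes could consume; its size no longer depends on how many raw observations were recorded.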

arXiv:2304.09247 [pdf, other]

SigSegment: A Signal-Based Segmentation Algorithm for Identifying Anomalous Driving Behaviours in Naturalistic Driving Videos

Authors: Kelvin Kwakye, Younho Seong, Armstrong Aboah, Sun Yi

Abstract: In recent years, distracted driving has garnered considerable attention as it continues to pose a significant threat to public safety on the roads. This has increased the need for innovative solutions that can identify and eliminate distracted driving behavior before it results in fatal accidents. In this paper, we propose a signal-based anomaly detection algorithm that segments videos into anomalies and non-anomalies using a deep CNN-LSTM classifier to precisely estimate the start and end times of an anomalous driving event. In the anomaly detection and analysis phase, driver pose background estimation, mask extraction, and signal activity spikes are utilized. A deep CNN-LSTM classifier was applied to candidate anomalies to detect and classify final anomalies. According to experimental validation results, the proposed method achieved an overlap score of 0.5424 and ranked 9th on the public leaderboard in the AI City Challenge 2023.

Submitted 13 April, 2023; originally announced April 2023.

arXiv:2304.09131 [pdf, other]

Variational Relational Point Completion Network for Robust 3D Classification

Authors: Liang Pan, Xinyi Chen, Zhongang Cai, Junzhe Zhang, Haiyu Zhao, Shuai Yi, Ziwei Liu

Abstract: Real-scanned point clouds are often incomplete due to viewpoint, occlusion, and noise, which hampers 3D geometric modeling and perception. Existing point cloud completion methods tend to generate global shape skeletons and hence lack fine local details. Furthermore, they mostly learn a deterministic partial-to-complete mapping, but overlook structural relations in man-made objects. To tackle these challenges, this paper proposes a variational framework, Variational Relational point Completion Network (VRCNet), with two appealing properties: 1) Probabilistic Modeling. In particular, we propose a dual-path architecture to enable principled probabilistic modeling across partial and complete clouds. One path consumes complete point clouds for reconstruction by learning a point VAE. The other path generates complete shapes for partial point clouds, whose embedded distribution is guided by the distribution obtained from the reconstruction path during training. 2) Relational Enhancement. Specifically, we carefully design a point self-attention kernel and a point selective kernel module to exploit relational point features, which refine local shape details conditioned on the coarse completion. In addition, we contribute multi-view partial point cloud datasets (MVP and MVP-40) containing over 200,000 high-quality scans, which render partial 3D shapes from 26 uniformly distributed camera poses for each 3D CAD model. Extensive experiments demonstrate that VRCNet outperforms state-of-the-art methods on all standard point cloud completion benchmarks. 
Notably, VRCNet shows great generalizability and robustness on real-world point cloud scans. Moreover, we can achieve robust 3D classification for partial point clouds with the help of VRCNet, which can substantially increase classification accuracy.

Submitted 18 April, 2023; originally announced April 2023.

Comments: 12 pages, 10 figures, accepted by PAMI. project webpage: https://mvp-dataset.github.io/. arXiv admin note: substantial text overlap with arXiv:2104.10154

arXiv:2211.10582 [pdf, other]

Linear RNNs Provably Learn Linear Dynamic Systems

Authors: Lifu Wang, Tianyu Wang, Shengwei Yi, Bo Shen, Bo Hu, Xing Cao

Abstract: We study the learning ability of linear recurrent neural networks with gradient descent. We prove the first theoretical guarantee on linear RNNs to learn any stable linear dynamic system using a large class of loss functions. For an arbitrary stable linear system with a parameter $\rho_C$ related to the transition matrix $C$, we show that despite the non-convexity of the parameter optimization loss, if the width of the RNN is large enough (and the required width in hidden layers does not rely on the length of the input sequence), a linear RNN can provably learn any stable linear dynamic system with sample and time complexity polynomial in $\frac{1}{1-\rho_C}$. Our results provide the first theoretical guarantee to learn a linear RNN and demonstrate how the recurrent structure can help to learn a dynamic system.

Submitted 22 October, 2023; v1 submitted 18 November, 2022; originally announced November 2022.

Comments: 14 pages
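As an intuition only, and not the paper's analysis, the scalar case suggests why a quantity like $\frac{1}{1-\rho}$ measures how long a stable linear system "remembers" its inputs. Assuming a scalar state $h_t$ driven by inputs $x_t$ with $0 \le \rho < 1$, the total weight on past inputs and the number of steps before an input's influence falls below $\varepsilon$ both scale with $\frac{1}{1-\rho}$:

```latex
h_t = \rho\, h_{t-1} + x_t
\;\Rightarrow\;
h_t = \sum_{k \ge 0} \rho^{k} x_{t-k},
\qquad
\sum_{k \ge 0} \rho^{k} = \frac{1}{1-\rho}.

\rho^{k} \le \varepsilon
\iff k \ge \frac{\ln(1/\varepsilon)}{\ln(1/\rho)},
\qquad
\ln(1/\rho) \ge 1-\rho
\;\Rightarrow\;
k \ge \frac{\ln(1/\varepsilon)}{1-\rho} \text{ suffices.}
```

The inequality $\ln(1/\rho) \ge 1-\rho$ follows from $\ln \rho \le \rho - 1$. As $\rho \to 1$ the effective memory length grows, which is consistent with complexity bounds stated in terms of $\frac{1}{1-\rho_C}$.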

arXiv:2211.04454 [pdf, other]

SLATE: A Sequence Labeling Approach for Task Extraction from Free-form Inked Content

Authors: Apurva Gandhi, Ryan Serrao, Biyi Fang, Gilbert Antonius, Jenna Hong, Tra My Nguyen, Sheng Yi, Ehi Nosakhare, Irene Shaffer, Soundararajan Srinivasan, Vivek Gupta

Abstract: We present SLATE, a sequence labeling approach for extracting tasks from free-form content such as digitally handwritten (or "inked") notes on a virtual whiteboard. Our approach allows us to create a single, low-latency model to simultaneously perform sentence segmentation and classification of these sentences into task/non-task sentences. SLATE greatly outperforms a baseline two-model (sentence segmentation followed by classification model) approach, achieving a task F1 score of 84.4%, a sentence segmentation (boundary similarity) score of 88.4% and three times lower latency compared to the baseline. Furthermore, we provide insights into tackling challenges of performing NLP on the inking domain. We release both our code and dataset for this novel task.

Submitted 17 November, 2022; v1 submitted 8 November, 2022; originally announced November 2022.

Comments: Accepted at EMNLP 2022 as an Industry Track paper
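The SLATE abstract above describes one sequence-labeling model that segments sentences and classifies them as task or non-task at the same time. A minimal sketch of how such joint labels could be decoded, assuming a hypothetical BIO-style tag set (`B-TASK`, `I-TASK`, `B-NON`, `I-NON`) in which a `B-*` tag both starts a sentence and carries its class, is shown below; the paper's actual label scheme may differ.

```python
def decode(tokens, tags):
    """Group tokens into sentences at every B-* tag; each sentence takes the
    class from its B-* tag. Tag names are hypothetical (B-TASK, I-NON, ...)."""
    sentences = []
    for tok, tag in zip(tokens, tags):
        prefix, label = tag.split("-", 1)
        if prefix == "B" or not sentences:
            sentences.append(([tok], label))  # start a new sentence
        else:
            sentences[-1][0].append(tok)      # continue the current sentence
    return [(" ".join(words), label) for words, label in sentences]

# Inked notes often lack punctuation, so boundaries come from the tags alone.
tokens = "call Sam re budget notes from standup".split()
tags = ["B-TASK", "I-TASK", "I-TASK", "I-TASK", "B-NON", "I-NON", "I-NON"]
print(decode(tokens, tags))
# [('call Sam re budget', 'TASK'), ('notes from standup', 'NON')]
```

Because one tag per token encodes both the boundary and the class, a single forward pass yields both outputs, which is the kind of saving a two-model pipeline cannot make.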

arXiv:2210.04024 [pdf, other]

Demand Layering for Real-Time DNN Inference with Minimized Memory Usage

Authors: Mingoo Ji, Saehanseul Yi, Changjin Koo, Sol Ahn, Dongjoo Seo, Nikil Dutt, Jong-Chan Kim

Abstract: When executing a deep neural network (DNN), its model parameters are loaded into GPU memory before execution, incurring a significant GPU memory burden. There are studies that reduce GPU memory usage by exploiting CPU memory as a swap device. However, this approach is not applicable in most embedded systems with integrated GPUs where CPU and GPU share a common memory. In this regard, we present Demand Layering, which employs a fast solid-state drive (SSD) as a co-running partner of a GPU and exploits the layer-by-layer execution of DNNs. In our approach, a DNN is loaded and executed in a layer-by-layer manner, minimizing the memory usage to the order of a single layer. Also, we developed a pipeline architecture that hides most additional delays caused by the interleaved parameter loadings alongside layer executions. Our implementation shows a 96.5% memory reduction with just 14.8% delay overhead on average for representative DNNs. Furthermore, by exploiting the memory-delay tradeoff, near-zero delay overhead (under 1 ms) can be achieved with a slightly increased memory usage (still an 88.4% reduction), showing the great potential of Demand Layering.

Submitted 8 October, 2022; originally announced October 2022.

Comments: 14 pages, 16 figures. Accepted to the 43rd IEEE Real-Time Systems Symposium (RTSS), 2022
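The Demand Layering abstract above describes loading and executing a DNN layer by layer, with a pipeline that overlaps parameter loading and execution. A toy timing model, assuming simple double buffering (the load of layer i can start once layer i-2 has finished executing and freed its buffer), shows why loads are hidden when execution dominates and why resident memory stays on the order of two adjacent layers. The per-layer times and sizes are hypothetical, and this is not the paper's pipeline.

```python
def sequential_time(load, exec_):
    """Load each layer, then run it, with no overlap."""
    return sum(load) + sum(exec_)

def pipelined_time(load, exec_):
    """Double-buffered schedule: loading layer i overlaps executing layer i-1.
    Assumes one load and one execution can proceed concurrently."""
    load_end, exec_end = [], []
    for i, (l, e) in enumerate(zip(load, exec_)):
        freed = exec_end[i - 2] if i >= 2 else 0.0   # buffer of layer i-2 freed
        prev_load = load_end[i - 1] if i >= 1 else 0.0
        load_end.append(max(prev_load, freed) + l)
        prev_exec = exec_end[i - 1] if i >= 1 else 0.0
        exec_end.append(max(load_end[i], prev_exec) + e)  # weights + free GPU
    return exec_end[-1] if exec_end else 0.0

def peak_memory(sizes):
    """With double buffering, at most two consecutive layers are resident."""
    if len(sizes) < 2:
        return sum(sizes)
    return max(a + b for a, b in zip(sizes, sizes[1:]))

load, run = [2, 2, 2, 2], [3, 3, 3, 3]   # hypothetical ms per layer
print(sequential_time(load, run), pipelined_time(load, run))  # 20 14
print(sum([10, 40, 20, 30]), peak_memory([10, 40, 20, 30]))   # 100 60
```

When execution is slower than loading, the pipelined time reaches its lower bound of the first load plus all executions (2 + 12 = 14 ms here); when loading dominates, the SSD becomes the bottleneck instead.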

arXiv:2208.07137 [pdf, other]

An Empirical Study of Pseudo-Labeling for Image-based 3D Object Detection

Authors: Xinzhu Ma, Yuan Meng, Yinmin Zhang, Lei Bai, Jun Hou, Shuai Yi, Wanli Ouyang

Abstract: Image-based 3D detection is an indispensable component of the perception system for autonomous driving. However, it still suffers from unsatisfying performance, one of the main reasons for which is the limited training data. Unfortunately, annotating objects in 3D space is extremely time- and resource-consuming, which makes it hard to extend the training set arbitrarily. In this work, we focus on the semi-supervised setting and explore the feasibility of a cheaper alternative, i.e., pseudo-labeling, to leverage unlabeled data. For this purpose, we conduct extensive experiments to investigate whether pseudo-labels can provide effective supervision for the baseline models under varying settings. The experimental results not only demonstrate the effectiveness of the pseudo-labeling mechanism for image-based 3D detection (e.g., under the monocular setting, we achieve 20.23 AP for the moderate level on the KITTI-3D testing set without bells and whistles, improving the baseline model by 6.03 AP), but also show several interesting and noteworthy findings (e.g., the models trained with pseudo-labels perform better than those trained with ground-truth annotations based on the same training data). We hope this work can provide insights for the image-based 3D detection community under a semi-supervised setting. The code, pseudo-labels, and pre-trained models will be publicly available.

Submitted 15 August, 2022; originally announced August 2022.

Comments: tech report
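The pseudo-labeling abstract above does not say how pseudo-labels are produced. One common recipe in semi-supervised detection, which is assumed here and is not necessarily the one studied in the paper, runs a teacher model on unlabeled images and keeps only the confident detections. The `Detection` record, the `make_pseudo_labels` helper, and the default threshold of 0.7 are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class Detection:
    # Hypothetical detection record: class label, teacher confidence, and a
    # 3D box as (x, y, z, length, width, height, yaw).
    label: str
    score: float
    box3d: Tuple[float, ...]

def make_pseudo_labels(predictions: Dict[str, List[Detection]],
                       threshold: float = 0.7) -> Dict[str, List[Detection]]:
    """Keep only teacher detections at or above `threshold` as pseudo-labels.

    Images with no confident detection are kept with an empty list.
    The threshold value is an illustrative assumption.
    """
    return {
        image_id: [d for d in dets if d.score >= threshold]
        for image_id, dets in predictions.items()
    }

if __name__ == "__main__":
    teacher_output = {
        "img_000": [Detection("Car", 0.92, (1.0, 1.5, 20.0, 3.9, 1.6, 1.5, 0.1)),
                    Detection("Car", 0.35, (4.0, 1.5, 35.0, 4.1, 1.7, 1.5, 0.0))],
        "img_001": [Detection("Pedestrian", 0.55, (-2.0, 1.7, 12.0, 0.8, 0.6, 1.8, 1.2))],
    }
    for image_id, dets in make_pseudo_labels(teacher_output).items():
        print(image_id, [(d.label, d.score) for d in dets])
```

The threshold sets a trade-off between pseudo-label precision and recall. A real pipeline would probably tune it per class.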

arXiv:2208.00173 [pdf, other]

A Survey on Masked Autoencoder for Self-supervised Learning in Vision and Beyond

Authors: Chaoning Zhang, Chenshuang Zhang, Junha Song, John Seon Keun Yi, Kang Zhang, In So Kweon

Abstract: Masked autoencoders are scalable vision learners, as the title of MAE (He et al., 2022) states, which suggests that self-supervised learning (SSL) in vision might follow a trajectory similar to that in NLP. Specifically, generative pretext tasks with masked prediction (e.g., BERT) have become a de facto standard SSL practice in NLP. By contrast, early attempts at generative methods in vision were overshadowed by their discriminative counterparts (like contrastive learning); however, the success of masked image modeling has revived the masked autoencoder (often termed denoising autoencoder in the past). As a milestone in bridging the gap with BERT in NLP, the masked autoencoder has attracted unprecedented attention for SSL in vision and beyond. This work conducts a comprehensive survey of masked autoencoders to shed light on a promising direction of SSL. As the first review of SSL with masked autoencoders, this work focuses on its application in vision by discussing its historical developments, recent progress, and implications for diverse applications.

Submitted 30 July, 2022; originally announced August 2022.

Comments: First survey on masked autoencoder (work in progress)
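For readers new to the masked-autoencoder survey above, the core operation is random masking of image patches. The encoder sees only the visible subset, and the decoder is trained to reconstruct the masked patches. The sketch below covers only the index bookkeeping. It assumes a 14×14 patch grid (a 224×224 image with 16×16 patches) and the 75% mask ratio used in the original MAE paper. `random_masking` is a hypothetical helper, not code from the survey.

```python
import random
from typing import List, Tuple

def random_masking(num_patches: int, mask_ratio: float,
                   seed: int = 0) -> Tuple[List[int], List[int]]:
    """Randomly split patch indices into (visible, masked) lists.

    The encoder would see only the visible patches; the decoder would be
    trained to reconstruct the masked ones.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)                        # uniform random permutation
    num_keep = round(num_patches * (1.0 - mask_ratio))
    return sorted(order[:num_keep]), sorted(order[num_keep:])

if __name__ == "__main__":
    # 224x224 image with 16x16 patches -> 14x14 = 196 patches.
    visible, masked = random_masking(14 * 14, mask_ratio=0.75)
    print(len(visible), len(masked))  # 49 147
```

With a high mask ratio, the encoder processes only a quarter of the patches. This is a large part of why the approach scales well.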

arXiv:2207.11965 [pdf, other]

Machine-checked executable semantics of Stateflow

Authors: Shicheng Yi, Shuling Wang, Bohua Zhan, Naijun Zhan

Abstract: Simulink is a widely used model-based development environment for embedded systems. Stateflow is a component of Simulink for modeling event-driven control via hierarchical state machines and flow charts. However, Stateflow lacks an official formal semantics, making it difficult to formally prove properties of its models in safety-critical applications. In this paper, we define a formal semantics for a large subset of Stateflow, covering complex features such as hierarchical states and transitions, event broadcasts, early return, temporal operators, and so on. The semantics is formalized in Isabelle/HOL and proved to be deterministic. We implement a tactic for automatic execution of the semantics in Isabelle, as well as a translator in Python transforming Stateflow models to the syntax in Isabelle. Using these tools, we validate the semantics against a collection of examples illustrating the features we cover.

Submitted 25 July, 2022; originally announced July 2022.

Comments: 26 pages

arXiv:2207.05701 [pdf, other]

doi 10.11114/aef.v9i3.5610

Autoencoding Conditional GAN for Portfolio Allocation Diversification

Authors: Jun Lu, Shao Yi

Abstract: Over the decades, the Markowitz framework has been used extensively in portfolio analysis, though it puts too much emphasis on analyzing market uncertainty rather than on trend prediction. Generative adversarial networks (GANs) and conditional GANs (CGANs) have been explored to generate financial time series and extract features that can help portfolio analysis. The limitation of the CGAN framework lies in putting too much emphasis on generating series rather than keeping features that can help this generator. In this paper, we introduce an autoencoding CGAN (ACGAN) based on deep generative models that learns the internal trend of historical data while modeling market uncertainty and future trends. We evaluate the model on several real-world datasets from both US and European markets, and show that the proposed ACGAN model leads to better portfolio allocation and generates series that are closer to true data compared to the existing Markowitz and CGAN approaches.

Submitted 17 June, 2022; originally announced July 2022.

Journal ref: Applied Economics and Finance 9 (3), 55-68, 2022

arXiv:2207.01909 [pdf, other]

StyleFlow For Content-Fixed Image to Image Translation

Authors: Weichen Fan, Jinghuan Chen, Jiabin Ma, Jun Hou, Shuai Yi

Abstract: Image-to-image (I2I) translation is a challenging topic in computer vision. We divide this problem into three tasks: strongly constrained translation, normally constrained translation, and weakly constrained translation. The constraint here indicates the extent to which the content or semantic information in the original image is preserved. Although previous approaches have achieved good performance in weakly constrained tasks, they fail to fully preserve the content in strongly and normally constrained tasks, including photo-realism synthesis, style transfer, and colorization. To achieve content-preserving transfer in strongly constrained and normally constrained tasks, we propose StyleFlow, a new I2I translation model that consists of normalizing flows and a novel Style-Aware Normalization (SAN) module. With the invertible network structure, StyleFlow first projects input images into deep feature space in the forward pass, while the backward pass utilizes the SAN module to perform content-fixed feature transformation and then projects back to image space. Our model supports both image-guided translation and multi-modal synthesis. We evaluate our model on several I2I translation benchmarks, and the results show that the proposed model has advantages over previous methods in both strongly constrained and normally constrained tasks.

Submitted 5 July, 2022; originally announced July 2022.

arXiv:2206.10157 [pdf, other]

Probing Visual-Audio Representation for Video Highlight Detection via Hard-Pairs Guided Contrastive Learning

Authors: Shuaicheng Li, Feng Zhang, Kunlin Yang, Lingbo Liu, Shinan Liu, Jun Hou, Shuai Yi

Abstract: Video highlight detection is a crucial yet challenging problem that aims to identify the interesting moments in untrimmed videos. The key to this task lies in effective video representations that jointly pursue two goals, i.e., cross-modal representation learning and fine-grained feature discrimination. In this paper, these two challenges are tackled by not only enriching intra-modality and cross-modality relations for representation modeling but also shaping the features in a discriminative manner. Our proposed method mainly leverages intra-modality encoding and cross-modality co-occurrence encoding for full representation modeling. Specifically, intra-modality encoding augments the modality-wise features and dampens irrelevant modality via within-modality relation learning in both audio and visual signals. Meanwhile, cross-modality co-occurrence encoding focuses on the co-occurrence inter-modality relations and selectively captures effective information across modalities. The multi-modal representation is further enhanced by global information abstracted from the local context. In addition, we enlarge the discriminative power of the feature embedding with a hard-pairs guided contrastive learning (HPCL) scheme. A hard-pairs sampling strategy is further employed to mine hard samples for improving feature discrimination in HPCL. Extensive experiments conducted on two benchmarks demonstrate the effectiveness and superiority of our proposed method compared to other state-of-the-art methods.

Submitted 21 June, 2022; originally announced June 2022.

arXiv:2206.06193 [pdf, other]

doi 10.1145/3478513.3480498

Differentiable Transient Rendering

Authors: Shinyoung Yi, Donggun Kim, Kiseok Choi, Adrian Jarabo, Diego Gutierrez, Min H. Kim

Abstract: Recent differentiable rendering techniques have become key tools to tackle many inverse problems in graphics and vision. Existing models, however, assume steady-state light transport, i.e., infinite speed of light. While this is a safe assumption for many applications, recent advances in ultrafast imaging leverage the wealth of information that can be extracted from the exact time of flight of light. In this context, physically-based transient rendering allows one to efficiently simulate and analyze light transport considering that the speed of light is indeed finite. In this paper, we introduce a novel differentiable transient rendering framework to help bring the potential of differentiable approaches into the transient regime. To differentiate the transient path integral, we need to take into account that scattering events at path vertices are no longer independent; instead, tracking the time of flight of light requires treating such scattering events at path vertices jointly as a multidimensional, evolving manifold. We thus turn to the generalized transport theorem and introduce a novel correlated importance term, which links the time-integrated contribution of a path to its light throughput and allows us to handle discontinuities in the light and sensor functions. Finally, we present results in several challenging scenarios where the time of flight of light plays an important role, such as optimizing indices of refraction, non-line-of-sight tracking with nonplanar relay walls, and non-line-of-sight tracking around two corners.

Submitted 13 June, 2022; originally announced June 2022.

Journal ref: ACM Transactions on Graphics 40, 6, Article 286 (December 2021)

arXiv:2206.06067 [pdf, other]

Better Teacher Better Student: Dynamic Prior Knowledge for Knowledge Distillation

Authors: Zengyu Qiu, Xinzhu Ma, Kunlin Yang, Chunya Liu, Jun Hou, Shuai Yi, Wanli Ouyang

Abstract: Knowledge distillation (KD) has shown very promising capabilities in transferring learned representations from large models (teachers) to small models (students). However, as the capacity gap between students and teachers becomes larger, existing KD methods fail to achieve better results. Our work shows that 'prior knowledge' is vital to KD, especially when applying large teachers. Particularly, we propose dynamic prior knowledge (DPK), which integrates part of the teacher's features as prior knowledge before the feature distillation. This means that our method also takes the teacher's feature as 'input', not just 'target'. Besides, we dynamically adjust the ratio of the prior knowledge during the training phase according to the feature gap, thus guiding the student at an appropriate difficulty. To evaluate the proposed method, we conduct extensive experiments on two image classification benchmarks (i.e., CIFAR100 and ImageNet) and an object detection benchmark (i.e., MS COCO). The results demonstrate the superiority of our method in performance under varying settings. Besides, our DPK makes the performance of the student model positively correlated with that of the teacher model, which means that we can further boost the accuracy of students by applying larger teachers. More importantly, DPK provides a fast solution in teacher model selection for any given model.

Submitted 23 March, 2023; v1 submitted 13 June, 2022; originally announced June 2022.

Comments: ICLR'23 accepted

arXiv:2205.10507 [pdf]

Travel Time, Distance and Costs Optimization for Paratransit Operations using Graph Convolutional Neural Network

Authors: Kelvin Kwakye, Younho Seong, Sun Yi

Abstract: The provision of paratransit services is one option to meet the transportation needs of Vulnerable Road Users (VRUs). Like any other means of transportation, paratransit faces obstacles such as high operational costs and longer trip times. As a result, customers are dissatisfied, and paratransit operators have a low approval rating. Researchers have undertaken various studies over the years to better understand the travel behaviors of paratransit customers and how these services are operated. According to the findings of these studies, paratransit operators confront the challenge of determining the optimal route for their trips in order to save travel time. Depending on the nature of the challenge, most studies used different optimization techniques to solve these routing problems. Accordingly, the goal of this study is to use Graph Convolutional Neural Networks (GCNs) to assist paratransit operators in exploring various operational scenarios in a strategic setting in order to optimize routing, minimize operating costs, and minimize their users' travel time. The study was carried out using a randomized simulated dataset to help determine decisions on fleet composition and capacity under different situations. For the various scenarios investigated, the GCN assisted in determining the minimum optimal gap.

Submitted 21 May, 2022; originally announced May 2022.

arXiv:2204.08970 [pdf, other]

Rendering Nighttime Image Via Cascaded Color and Brightness Compensation

Authors: Zhihao Li, Si Yi, Zhan Ma

Abstract: Image signal processing (ISP) is crucial for camera imaging, and neural network (NN) solutions are extensively deployed for daytime scenes. The lack of sufficient nighttime image datasets and of insights into nighttime illumination characteristics poses a great challenge for high-quality rendering using existing NN ISPs. To tackle this, we first built a high-resolution nighttime RAW-RGB (NR2R) dataset with white balance and tone mapping annotated by expert professionals. Meanwhile, to best capture the characteristics of nighttime illumination light sources, we develop CBUnet, a two-stage NN ISP that cascades the compensation of color and brightness attributes. Experiments show that our method has better visual quality compared to the traditional ISP pipeline, and it ranked second in the NTIRE 2022 Night Photography Rendering Challenge in two tracks, judged by People's choice and Professional Photographer's choice respectively. The code and relevant materials are available on our website: https://njuvision.github.io/CBUnet.

Submitted 21 April, 2022; v1 submitted 19 April, 2022; originally announced April 2022.

Comments: Accepted by NTIRE 2022 (CVPR Workshop)

arXiv:2204.04382 [pdf, other]

Federated Unsupervised Domain Adaptation for Face Recognition

Authors: Weiming Zhuang, Xin Gan, Yonggang Wen, Xuesen Zhang, Shuai Zhang, Shuai Yi

Abstract: Given labeled data in a source domain, unsupervised domain adaptation has been widely adopted to generalize models for unlabeled data in a target domain, whose data distributions are different. However, existing works are inapplicable to face recognition under privacy constraints because they require sharing of sensitive face images between domains. To address this problem, we propose federated unsupervised domain adaptation for face recognition, FedFR. FedFR jointly optimizes clustering-based domain adaptation and federated learning to elevate performance on the target domain. Specifically, for unlabeled data in the target domain, we enhance a clustering algorithm with a distance constraint to improve the quality of predicted pseudo labels. Besides, we propose a new domain constraint loss (DCL) to regularize source domain training in federated learning. Extensive experiments on a newly constructed benchmark demonstrate that FedFR outperforms the baseline and classic methods on the target domain by 3% to 14% on different evaluation metrics.

Submitted 9 April, 2022; originally announced April 2022.

Comments: ICME'22. arXiv admin note: substantial text overlap with arXiv:2105.07606
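The FedFR abstract above does not give the exact aggregation rule for its federated step. As background only, the sketch below shows standard federated averaging (FedAvg): the server averages client parameters, weighted by each client's local sample count. Treating FedFR's aggregation this way is an assumption. The flat-list parameter representation and the `fed_avg` name are hypothetical.

```python
from typing import List, Sequence

def fed_avg(client_weights: Sequence[Sequence[float]],
            client_sizes: Sequence[int]) -> List[float]:
    """Average flattened client parameter vectors, weighted by dataset size.

    Only parameters are sent to the server; raw face images stay on clients.
    """
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one size per client and at least one client")
    total = sum(client_sizes)
    dim = len(client_weights[0])
    averaged = [0.0] * dim
    for weights, size in zip(client_weights, client_sizes):
        if len(weights) != dim:
            raise ValueError("all clients must share the same parameter shape")
        for i, w in enumerate(weights):
            averaged[i] += w * size / total   # size-weighted contribution
    return averaged

if __name__ == "__main__":
    # Two hypothetical clients holding 1 and 3 samples respectively.
    print(fed_avg([[1.0, 2.0], [3.0, 4.0]], [1, 3]))  # [2.5, 3.5]
```

Weighting by sample count keeps a small client from pulling the global model as hard as a large one. Methods that target domain shift often change this weighting.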

arXiv:2203.12456 [pdf, other]

doi 10.11114/aef.v9i2.5507

Reducing overestimating and underestimating volatility via the augmented blending-ARCH model

Authors: Jun Lu, Shao Yi

Abstract: The SVR-GARCH model tends to "backward eavesdrop" when forecasting financial time series volatility, in which case it tends to simply produce the prediction by deviating the previous volatility. Although the SVR-GARCH model has achieved good performance in terms of various performance measurements, trading opportunities, peak or trough behaviors in the time series are all hampered by underestimating or overestimating the volatility. We propose a blending ARCH (BARCH) model and an augmented BARCH (aBARCH) model to overcome this kind of problem and steer the prediction towards better peak or trough behaviors. The method is illustrated using real data sets including SH300 and S&P500. The empirical results suggest that the augmented and blending models improve the volatility forecasting ability.

Submitted 15 March, 2022; originally announced March 2022.

Journal ref: Applied Economics and Finance 9 (2), 48-59, 2022

arXiv:2202.13461 [pdf, other]

Configuration Control for Physical Coupling of Heterogeneous Robot Swarms

Authors: Sha Yi, Zeynep Temel, Katia Sycara

Abstract: In this paper, we present a heterogeneous robot swarm system whose robots can physically couple with each other to form functional structures and dynamically decouple to perform individual tasks. The connection between robots can be formed with a passive coupling mechanism, ensuring minimal energy consumption during coupling and decoupling. The heterogeneity of the system enables the robots to perform structural enhancement configurations based on specific environmental requirements. We propose a connection-pair oriented configuration control algorithm to form different assemblies. We show experiments with up to nine robots performing coupling, gap-crossing, and decoupling behaviors.

Submitted 1 March, 2022; v1 submitted 27 February, 2022; originally announced February 2022.

arXiv:2202.02686 [pdf, other]

doi 10.1109/ICRA48506.2021.9561610

PuzzleBots: Physical Coupling of Robot Swarms

Authors: Sha Yi, Zeynep Temel, Katia Sycara

Abstract: Robot swarms have been shown to improve the ability of individual robots through inter-robot collaboration. In this paper, we present the PuzzleBots, a low-cost robotic swarm system where robots can physically couple with each other to form functional structures with minimal energy consumption while maintaining individual mobility to navigate within the environment. Each robot has knobs and holes along the sides of its body so that the robots can couple by inserting the knobs into the holes. We present the characterization of the knob design and the results of gap-crossing behavior with up to nine robots. We show with hardware experiments that the robots are able to couple with each other to cross gaps and decouple to perform individual tasks. We anticipate that the PuzzleBots will be useful in unstructured environments, both as individuals and as coupled systems, in real-world applications.

Submitted 5 February, 2022; originally announced February 2022.

Journal ref: 2021 IEEE International Conference on Robotics and Automation (ICRA), pp. 8742-8748

arXiv:2201.07459 [pdf, other]

PT4AL: Using Self-Supervised Pretext Tasks for Active Learning

Authors: John Seon Keun Yi, Minseok Seo, Jongchan Park, Dong-Geol Choi

Abstract: Labeling a large set of data is expensive. Active learning aims to tackle this problem by requesting annotations for only the most informative data from the unlabeled set. We propose a novel active learning approach that utilizes self-supervised pretext tasks and a unique data sampler to select data that are both difficult and representative. We discover that the loss of a simple self-supervised pretext task, such as rotation prediction, is closely correlated to the downstream task loss. Before the active learning iterations, the pretext task learner is trained on the unlabeled set, and the unlabeled data are sorted and split into batches by their pretext task losses. In each active learning iteration, the main task model is used to sample the most uncertain data in a batch to be annotated. We evaluate our method on various image classification and segmentation benchmarks and achieve compelling performance on CIFAR10, Caltech-101, ImageNet, and Cityscapes. We further show that our method performs well on imbalanced datasets and can be an effective solution to the cold-start problem, where active learning performance is affected by the randomly sampled initial labeled set.

Submitted 26 July, 2022; v1 submitted 19 January, 2022; originally announced January 2022.

Comments: Code is available at https://github.com/johnsk95/PT4AL. Updated for ECCV 2022 submission.
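The sampling procedure in the PT4AL abstract above can be sketched as follows. This is an illustration under stated assumptions, not the released PT4AL code. Pretext losses are given as a plain dictionary, and batches are ordered from highest to lowest pretext loss; the visiting order is an assumption. The uncertainty function stands in for whatever score the main task model provides, for example one minus its top softmax probability. `split_by_pretext_loss` and `select_for_annotation` are hypothetical names.

```python
from typing import Callable, Dict, List, Sequence

def split_by_pretext_loss(pretext_loss: Dict[str, float],
                          num_batches: int) -> List[List[str]]:
    """Sort unlabeled sample ids by pretext-task loss (highest first, an
    assumption) and split them into num_batches near-equal contiguous batches."""
    ranked = sorted(pretext_loss, key=pretext_loss.get, reverse=True)
    size, extra = divmod(len(ranked), num_batches)
    batches, start = [], 0
    for b in range(num_batches):
        end = start + size + (1 if b < extra else 0)
        batches.append(ranked[start:end])
        start = end
    return batches

def select_for_annotation(batch: Sequence[str],
                          uncertainty: Callable[[str], float],
                          budget: int) -> List[str]:
    """Pick the `budget` most uncertain samples of one batch, as scored by
    the current main-task model (passed in as an uncertainty function)."""
    return sorted(batch, key=uncertainty, reverse=True)[:budget]

if __name__ == "__main__":
    losses = {"a": 0.9, "b": 0.1, "c": 0.5, "d": 0.7, "e": 0.3, "f": 0.2}
    batches = split_by_pretext_loss(losses, num_batches=3)
    print(batches)  # [['a', 'd'], ['c', 'e'], ['f', 'b']]
    # Active learning iteration 0 draws from batches[0] only.
    scores = {"a": 0.2, "d": 0.8}
    print(select_for_annotation(batches[0], scores.get, budget=1))  # ['d']
```

Pretext-loss batching makes each iteration's candidates share a similar difficulty. Uncertainty then chooses within that pool, so the selection is both difficult and spread across the data.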

arXiv:2201.04019 [pdf, other]

Pyramid Fusion Transformer for Semantic Segmentation

Authors: Zipeng Qin, Jianbo Liu, Xiaolin Zhang, Maoqing Tian, Aojun Zhou, Shuai Yi, Hongsheng Li

Abstract: The recently proposed MaskFormer gives a refreshed perspective on the task of semantic segmentation: it shifts from the popular pixel-level classification paradigm to a mask-level classification method. In essence, it generates paired probabilities and masks corresponding to category segments and combines them during inference to produce the segmentation maps. In our study, we find that a per-mask classification decoder on top of a single-scale feature is not effective enough to extract reliable probabilities or masks. To mine rich semantic information across the feature pyramid, we propose a transformer-based Pyramid Fusion Transformer (PFT) for per-mask semantic segmentation with multi-scale features. The proposed transformer decoder performs cross-attention between the learnable queries and each spatial feature from the feature pyramid in parallel and uses cross-scale inter-query attention to exchange complementary information. We achieve competitive performance on three widely used semantic segmentation datasets. In particular, on the ADE20K validation set, our result with a Swin-B backbone surpasses that of MaskFormer with a much larger Swin-L backbone in both single-scale and multi-scale inference, achieving 54.1 mIoU and 55.7 mIoU respectively. Using a Swin-L backbone, we achieve single-scale 56.1 mIoU and multi-scale 57.4 mIoU, obtaining state-of-the-art performance on the dataset. Extensive experiments on three widely used semantic segmentation datasets verify the effectiveness of our proposed method.

Submitted 30 May, 2023; v1 submitted 11 January, 2022; originally announced January 2022.

arXiv:2201.01901 [pdf, other]

Incremental Object Grounding Using Scene Graphs

Authors: John Seon Keun Yi, Yoonwoo Kim, Sonia Chernova

Abstract: Object grounding tasks aim to locate the target object in an image through verbal communication. Understanding human commands is an important process needed for effective human-robot communication. However, this is challenging because human commands can be ambiguous and erroneous. This paper aims to disambiguate the human's referring expressions by allowing the agent to ask relevant questions based on semantic data obtained from scene graphs. We test whether our agent can use relations between objects from a scene graph to ask semantically relevant questions that can disambiguate the original user command. In this paper, we present Incremental Grounding using Scene Graphs (IGSG), a disambiguation model that uses semantic data from an image scene graph and linguistic structures from a language scene graph to ground objects based on human commands. Compared to the baseline, IGSG shows promising results in complex real-world scenes where there are multiple identical target objects. IGSG can effectively disambiguate ambiguous or wrong referring expressions by asking disambiguating questions back to the user.

Submitted 13 November, 2022; v1 submitted 5 January, 2022; originally announced January 2022.

arXiv:2112.12322 [pdf]

doi 10.1038/s41467-022-35723-2

High-order tensor flow processing using integrated photonic circuits

Authors: Shaofu Xu, Jing Wang, Sicheng Yi, Weiwen Zou

Abstract: Tensor analytics lays the mathematical basis for the rapid development of multiway signal processing. To increase computing throughput, mainstream processors transform tensor convolutions into matrix multiplications to enhance computing parallelism. However, such order-reducing transformation produces data duplicates and consumes additional memory. Here, we demonstrate an integrated photonic tensor flow processor without tensor-matrix transformation, which outputs the convolved tensor as the input tensor 'flows' through the processor. The hybrid manipulation of the optical dimensions of wavelength, time, and space enables the direct representation and processing of high-order tensors in the optical domain. In the proof-of-concept experiment, processing of multi-channel images and videos is accomplished at a frequency of 20 GHz. A convolutional neural network is demonstrated on the processor, which achieves an accuracy of 97.9 percent on action recognition.

Submitted 22 December, 2021; originally announced December 2021.

arXiv:2109.07154 [pdf, other]

Can Language Models be Biomedical Knowledge Bases?

Authors: Mujeen Sung, Jinhyuk Lee, Sean Yi, Minji Jeon, Sungdong Kim, Jaewoo Kang

Abstract: Pre-trained language models (LMs) have become ubiquitous in solving various natural language processing (NLP) tasks. There has been increasing interest in what knowledge these LMs contain and how we can extract that knowledge, treating LMs as knowledge bases (KBs). While there has been much work on probing LMs in the general domain, there has been little attention to whether these powerful LMs can be used as domain-specific KBs. To this end, we create the BioLAMA benchmark, which comprises 49K biomedical factual knowledge triples for probing biomedical LMs. We find that biomedical LMs with recently proposed probing methods can achieve up to 18.51% Acc@5 on retrieving biomedical knowledge. Although this seems promising given the task difficulty, our detailed analyses reveal that most predictions are highly correlated with prompt templates without any subjects, hence producing similar results on each relation and hindering their use as domain-specific KBs. We hope that BioLAMA can serve as a challenging benchmark for biomedical factual probing.

Submitted 15 September, 2021; originally announced September 2021.

Comments: EMNLP 2021. Code available at https://github.com/dmis-lab/BioLAMA
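To make the Acc@5 figure in the BioLAMA abstract above concrete: Acc@k counts a query as correct when at least one of the model's top-k predictions matches a gold answer. The sketch below assumes exact string matching and allows several gold answers per triple. BioLAMA's own evaluation may normalize or match entities differently. `acc_at_k` is a hypothetical helper, and the drug and organ strings in the example are made-up data.

```python
from typing import Sequence, Set

def acc_at_k(ranked_predictions: Sequence[Sequence[str]],
             gold_answers: Sequence[Set[str]], k: int = 5) -> float:
    """Fraction of queries whose top-k predictions contain a gold answer
    (exact string match is an assumption)."""
    if len(ranked_predictions) != len(gold_answers):
        raise ValueError("one gold set per query is required")
    if not gold_answers:
        return 0.0
    hits = sum(
        1 for preds, gold in zip(ranked_predictions, gold_answers)
        if any(p in gold for p in preds[:k])
    )
    return hits / len(gold_answers)

if __name__ == "__main__":
    preds = [["aspirin", "ibuprofen"], ["insulin", "metformin"], ["liver", "kidney"]]
    gold = [{"ibuprofen"}, {"glucagon"}, {"kidney"}]
    print(acc_at_k(preds, gold, k=1))  # 0.0
    print(acc_at_k(preds, gold, k=5))  # 0.6666666666666666
```

Acc@k rises with k by construction. An 18.51% Acc@5 therefore means that about four in five triples have no correct answer even among the top five candidates.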

Showing 1–50 of 118 results for author: Yi, S