subscribe to arXiv mailings

UFQA: Utility guided Fingerphoto Quality Assessment

Authors: Amol S. Joshi, Ali Dabouei, Jeremy Dawson, Nasser Nasrabadi

Abstract: Quality assessment of fingerprints captured using digital cameras and smartphones, also called fingerphotos, is a challenging problem in biometric recognition systems. As contactless biometric modalities are gaining more attention, their reliability should also be improved. Many factors, such as illumination, image contrast, camera angle, etc., in fingerphoto acquisition introduce various types of… ▽ More Quality assessment of fingerprints captured using digital cameras and smartphones, also called fingerphotos, is a challenging problem in biometric recognition systems. As contactless biometric modalities are gaining more attention, their reliability should also be improved. Many factors, such as illumination, image contrast, camera angle, etc., in fingerphoto acquisition introduce various types of distortion that may render the samples useless. Current quality estimation methods developed for fingerprints collected using contact-based sensors are inadequate for fingerphotos. We propose Utility guided Fingerphoto Quality Assessment (UFQA), a self-supervised dual encoder framework to learn meaningful feature representations to assess fingerphoto quality. A quality prediction model is trained to assess fingerphoto quality with additional supervision of quality maps. The quality metric is a predictor of the utility of fingerphotos in matching scenarios. Therefore, we use a holistic approach by including fingerphoto utility and local quality when labeling the training data. Experimental results verify that our approach performs better than the widely used fingerprint quality metric NFIQ2.2 and state-of-the-art image quality assessment algorithms on multiple publicly available fingerphoto datasets. △ Less

Submitted 15 July, 2024; originally announced July 2024.

arXiv:2407.09481 [pdf]

ChatGPT and Vaccine Hesitancy: A Comparison of English, Spanish, and French Responses Using a Validated Scale

Authors: Saubhagya Joshi, Eunbin Ha, Yonaira Rivera, Vivek K. Singh

Abstract: ChatGPT is a popular information system (over 1 billion visits in August 2023) that can generate natural language responses to user queries. It is important to study the quality and equity of its responses on health-related topics, such as vaccination, as they may influence public health decision-making. We use the Vaccine Hesitancy Scale (VHS) proposed by Shapiro et al.1 to measure the hesitancy… ▽ More ChatGPT is a popular information system (over 1 billion visits in August 2023) that can generate natural language responses to user queries. It is important to study the quality and equity of its responses on health-related topics, such as vaccination, as they may influence public health decision-making. We use the Vaccine Hesitancy Scale (VHS) proposed by Shapiro et al.1 to measure the hesitancy of ChatGPT responses in English, Spanish, and French. We find that: (a) ChatGPT responses indicate less hesitancy than those reported for human respondents in past literature; (b) ChatGPT responses vary significantly across languages, with English responses being the most hesitant on average and Spanish being the least; (c) ChatGPT responses are largely consistent across different model parameters but show some variations across the scale factors (vaccine competency, risk). Results have implications for researchers interested in evaluating and improving the quality and equity of health-related web information. △ Less

Submitted 6 May, 2024; originally announced July 2024.

Comments: 11 pages. Appeared in the Proceedings of the AMIA Informatics Summit, 2024

arXiv:2407.02968 [pdf, other]

Unified Anomaly Detection methods on Edge Device using Knowledge Distillation and Quantization

Authors: Sushovan Jena, Arya Pulkit, Kajal Singh, Anoushka Banerjee, Sharad Joshi, Ananth Ganesh, Dinesh Singh, Arnav Bhavsar

Abstract: With the rapid advances in deep learning and smart manufacturing in Industry 4.0, there is an imperative for high-throughput, high-performance, and fully integrated visual inspection systems. Most anomaly detection approaches using defect detection datasets, such as MVTec AD, employ one-class models that require fitting separate models for each class. On the contrary, unified models eliminate the… ▽ More With the rapid advances in deep learning and smart manufacturing in Industry 4.0, there is an imperative for high-throughput, high-performance, and fully integrated visual inspection systems. Most anomaly detection approaches using defect detection datasets, such as MVTec AD, employ one-class models that require fitting separate models for each class. On the contrary, unified models eliminate the need for fitting separate models for each class and significantly reduce cost and memory requirements. Thus, in this work, we experiment with considering a unified multi-class setup. Our experimental study shows that multi-class models perform at par with one-class models for the standard MVTec AD dataset. Hence, this indicates that there may not be a need to learn separate object/class-wise models when the object classes are significantly different from each other, as is the case of the dataset considered. Furthermore, we have deployed three different unified lightweight architectures on the CPU and an edge device (NVIDIA Jetson Xavier NX). We analyze the quantized multi-class anomaly detection models in terms of latency and memory requirements for deployment on the edge device while comparing quantization-aware training (QAT) and post-training quantization (PTQ) for performance at different precision widths. In addition, we explored two different methods of calibration required in post-training scenarios and show that one of them performs notably better, highlighting its importance for unsupervised tasks. Due to quantization, the performance drop in PTQ is further compensated by QAT, which yields at par performance with the original 32-bit Floating point in two of the models considered. △ Less

Submitted 3 July, 2024; originally announced July 2024.

Comments: 20 pages

MSC Class: 68T07 ACM Class: I.2.10

arXiv:2407.00121 [pdf, other]

Granite-Function Calling Model: Introducing Function Calling Abilities via Multi-task Learning of Granular Tasks

Authors: Ibrahim Abdelaziz, Kinjal Basu, Mayank Agarwal, Sadhana Kumaravel, Matthew Stallone, Rameswar Panda, Yara Rizk, GP Bhargav, Maxwell Crouse, Chulaka Gunasekara, Shajith Ikbal, Sachin Joshi, Hima Karanam, Vineet Kumar, Asim Munawar, Sumit Neelam, Dinesh Raghu, Udit Sharma, Adriana Meza Soria, Dheeraj Sreedhar, Praveen Venkateswaran, Merve Unuvar, David Cox, Salim Roukos, Luis Lastras , et al. (1 additional authors not shown)

Abstract: Large language models (LLMs) have recently shown tremendous promise in serving as the backbone to agentic systems, as demonstrated by their performance in multi-faceted, challenging benchmarks like SWE-Bench and Agent-Bench. However, to realize the true potential of LLMs as autonomous agents, they must learn to identify, call, and interact with external tools and application program interfaces (AP… ▽ More Large language models (LLMs) have recently shown tremendous promise in serving as the backbone to agentic systems, as demonstrated by their performance in multi-faceted, challenging benchmarks like SWE-Bench and Agent-Bench. However, to realize the true potential of LLMs as autonomous agents, they must learn to identify, call, and interact with external tools and application program interfaces (APIs) to complete complex tasks. These tasks together are termed function calling. Endowing LLMs with function calling abilities leads to a myriad of advantages, such as access to current and domain-specific information in databases and knowledge sources, and the ability to outsource tasks that can be reliably performed by tools, e.g., a Python interpreter or calculator. While there has been significant progress in function calling with LLMs, there is still a dearth of open models that perform on par with proprietary LLMs like GPT, Claude, and Gemini. Therefore, in this work, we introduce the GRANITE-20B-FUNCTIONCALLING model under an Apache 2.0 license. The model is trained using a multi-task training approach on seven fundamental tasks encompassed in function calling, those being Nested Function Calling, Function Chaining, Parallel Functions, Function Name Detection, Parameter-Value Pair Detection, Next-Best Function, and Response Generation. We present a comprehensive evaluation on multiple out-of-domain datasets comparing GRANITE-20B-FUNCTIONCALLING to more than 15 other best proprietary and open models. GRANITE-20B-FUNCTIONCALLING provides the best performance among all open models on the Berkeley Function Calling Leaderboard and fourth overall. As a result of the diverse tasks and datasets used for training our model, we show that GRANITE-20B-FUNCTIONCALLING has better generalizability on multiple tasks in seven different evaluation datasets. △ Less

Submitted 27 June, 2024; originally announced July 2024.

arXiv:2406.13844 [pdf, other]

MAMA-MIA: A Large-Scale Multi-Center Breast Cancer DCE-MRI Benchmark Dataset with Expert Segmentations

Authors: Lidia Garrucho, Claire-Anne Reidel, Kaisar Kushibar, Smriti Joshi, Richard Osuala, Apostolia Tsirikoglou, Maciej Bobowicz, Javier del Riego, Alessandro Catanese, Katarzyna Gwoździewicz, Maria-Laura Cosaka, Pasant M. Abo-Elhoda, Sara W. Tantawy, Shorouq S. Sakrana, Norhan O. Shawky-Abdelfatah, Amr Muhammad Abdo-Salem, Androniki Kozana, Eugen Divjak, Gordana Ivanac, Katerina Nikiforaki, Michail E. Klontzas, Rosa García-Dosdá, Meltem Gulsun-Akpinar, Oğuz Lafcı, Ritse Mann , et al. (8 additional authors not shown)

Abstract: Current research in breast cancer Magnetic Resonance Imaging (MRI), especially with Artificial Intelligence (AI), faces challenges due to the lack of expert segmentations. To address this, we introduce the MAMA-MIA dataset, comprising 1506 multi-center dynamic contrast-enhanced MRI cases with expert segmentations of primary tumors and non-mass enhancement areas. These cases were sourced from four… ▽ More Current research in breast cancer Magnetic Resonance Imaging (MRI), especially with Artificial Intelligence (AI), faces challenges due to the lack of expert segmentations. To address this, we introduce the MAMA-MIA dataset, comprising 1506 multi-center dynamic contrast-enhanced MRI cases with expert segmentations of primary tumors and non-mass enhancement areas. These cases were sourced from four publicly available collections in The Cancer Imaging Archive (TCIA). Initially, we trained a deep learning model to automatically segment the cases, generating preliminary segmentations that significantly reduced expert segmentation time. Sixteen experts, averaging 9 years of experience in breast cancer, then corrected these segmentations, resulting in the final expert segmentations. Additionally, two radiologists conducted a visual inspection of the automatic segmentations to support future quality control studies. Alongside the expert segmentations, we provide 49 harmonized demographic and clinical variables and the pretrained weights of the well-known nnUNet architecture trained using the DCE-MRI full-images and expert segmentations. This dataset aims to accelerate the development and benchmarking of deep learning models and foster innovation in breast cancer diagnostics and treatment planning. △ Less

Submitted 19 June, 2024; originally announced June 2024.

Comments: 15 paes, 7 figures, 3 tables

arXiv:2406.08848 [pdf, other]

An Approach to Build Zero-Shot Slot-Filling System for Industry-Grade Conversational Assistants

Authors: G P Shrivatsa Bhargav, Sumit Neelam, Udit Sharma, Shajith Ikbal, Dheeraj Sreedhar, Hima Karanam, Sachindra Joshi, Pankaj Dhoolia, Dinesh Garg, Kyle Croutwater, Haode Qi, Eric Wayne, J William Murdock

Abstract: We present an approach to build Large Language Model (LLM) based slot-filling system to perform Dialogue State Tracking in conversational assistants serving across a wide variety of industry-grade applications. Key requirements of this system include: 1) usage of smaller-sized models to meet low latency requirements and to enable convenient and cost-effective cloud and customer premise deployments… ▽ More We present an approach to build Large Language Model (LLM) based slot-filling system to perform Dialogue State Tracking in conversational assistants serving across a wide variety of industry-grade applications. Key requirements of this system include: 1) usage of smaller-sized models to meet low latency requirements and to enable convenient and cost-effective cloud and customer premise deployments, and 2) zero-shot capabilities to serve across a wide variety of domains, slot types and conversational scenarios. We adopt a fine-tuning approach where a pre-trained LLM is fine-tuned into a slot-filling model using task specific data. The fine-tuning data is prepared carefully to cover a wide variety of slot-filling task scenarios that the model is expected to face across various domains. We give details of the data preparation and model building process. We also give a detailed analysis of the results of our experimental evaluations. Results show that our prescribed approach for slot-filling model building has resulted in 6.9% relative improvement of F1 metric over the best baseline on a realistic benchmark, while at the same time reducing the latency by 57%. More over, the data we prepared has helped improve F1 on an average by 4.2% relative across various slot-types. △ Less

Submitted 13 June, 2024; originally announced June 2024.

arXiv:2406.05120 [pdf, other]

Contextual fusion enhances robustness to image blurring

Authors: Shruti Joshi, Aiswarya Akumalla, Seth Haney, Maxim Bazhenov

Abstract: Mammalian brains handle complex reasoning by integrating information across brain regions specialized for particular sensory modalities. This enables improved robustness and generalization versus deep neural networks, which typically process one modality and are vulnerable to perturbations. While defense methods exist, they do not generalize well across perturbations. We developed a fusion model c… ▽ More Mammalian brains handle complex reasoning by integrating information across brain regions specialized for particular sensory modalities. This enables improved robustness and generalization versus deep neural networks, which typically process one modality and are vulnerable to perturbations. While defense methods exist, they do not generalize well across perturbations. We developed a fusion model combining background and foreground features from CNNs trained on Imagenet and Places365. We tested its robustness to human-perceivable perturbations on MS COCO. The fusion model improved robustness, especially for classes with greater context variability. Our proposed solution for integrating multiple modalities provides a new approach to enhance robustness and may be complementary to existing methods. △ Less

Submitted 7 June, 2024; originally announced June 2024.

Comments: arXiv admin note: text overlap with arXiv:2011.09526

arXiv:2405.14030 [pdf, other]

Refining Skewed Perceptions in Vision-Language Models through Visual Representations

Authors: Haocheng Dai, Sarang Joshi

Abstract: Large vision-language models (VLMs), such as CLIP, have become foundational, demonstrating remarkable success across a variety of downstream tasks. Despite their advantages, these models, akin to other foundational systems, inherit biases from the disproportionate distribution of real-world data, leading to misconceptions about the actual environment. Prevalent datasets like ImageNet are often rid… ▽ More Large vision-language models (VLMs), such as CLIP, have become foundational, demonstrating remarkable success across a variety of downstream tasks. Despite their advantages, these models, akin to other foundational systems, inherit biases from the disproportionate distribution of real-world data, leading to misconceptions about the actual environment. Prevalent datasets like ImageNet are often riddled with non-causal, spurious correlations that can diminish VLM performance in scenarios where these contextual elements are absent. This study presents an investigation into how a simple linear probe can effectively distill task-specific core features from CLIP's embedding for downstream applications. Our analysis reveals that the CLIP text representations are often tainted by spurious correlations, inherited in the biased pre-training dataset. Empirical evidence suggests that relying on visual representations from CLIP, as opposed to text embedding, is more practical to refine the skewed perceptions in VLMs, emphasizing the superior utility of visual representations in overcoming embedded biases. Our codes will be available here. △ Less

Submitted 22 May, 2024; originally announced May 2024.

Comments: 18 pages, 7 figures

arXiv:2405.06467 [pdf, other]

Attend, Distill, Detect: Attention-aware Entropy Distillation for Anomaly Detection

Authors: Sushovan Jena, Vishwas Saini, Ujjwal Shaw, Pavitra Jain, Abhay Singh Raihal, Anoushka Banerjee, Sharad Joshi, Ananth Ganesh, Arnav Bhavsar

Abstract: Unsupervised anomaly detection encompasses diverse applications in industrial settings where a high-throughput and precision is imperative. Early works were centered around one-class-one-model paradigm, which poses significant challenges in large-scale production environments. Knowledge-distillation based multi-class anomaly detection promises a low latency with a reasonably good performance but w… ▽ More Unsupervised anomaly detection encompasses diverse applications in industrial settings where a high-throughput and precision is imperative. Early works were centered around one-class-one-model paradigm, which poses significant challenges in large-scale production environments. Knowledge-distillation based multi-class anomaly detection promises a low latency with a reasonably good performance but with a significant drop as compared to one-class version. We propose a DCAM (Distributed Convolutional Attention Module) which improves the distillation process between teacher and student networks when there is a high variance among multiple classes or objects. Integrated multi-scale feature matching strategy to utilise a mixture of multi-level knowledge from the feature pyramid of the two networks, intuitively helping in detecting anomalies of varying sizes which is also an inherent problem in the multi-class scenario. Briefly, our DCAM module consists of Convolutional Attention blocks distributed across the feature maps of the student network, which essentially learns to masks the irrelevant information during student learning alleviating the "cross-class interference" problem. This process is accompanied by minimizing the relative entropy using KL-Divergence in Spatial dimension and a Channel-wise Cosine Similarity between the same feature maps of teacher and student. The losses enables to achieve scale-invariance and capture non-linear relationships. We also highlight that the DCAM module would only be used during training and not during inference as we only need the learned feature maps and losses for anomaly scoring and hence, gaining a performance gain of 3.92% than the multi-class baseline with a preserved latency. △ Less

Submitted 10 May, 2024; originally announced May 2024.

Comments: 15 pages

MSC Class: 68T07 ACM Class: I.2.10

arXiv:2405.01852 [pdf]

doi 10.4018/IJIIT.309588

Tokenization of Real Estate Assets Using Blockchain

Authors: Shashank Joshi, Arhan Choudhury

Abstract: Blockchain technology is one of the key technologies that have revolutionized various facets of society, such as the banking, healthcare, and other critical ecosystems. One area that can harness the usage of blockchain is the real estate sector. The most lucrative long-term investment is real estate, followed by gold, equities, mutual funds, and savings accounts. Nevertheless, it has administrativ… ▽ More Blockchain technology is one of the key technologies that have revolutionized various facets of society, such as the banking, healthcare, and other critical ecosystems. One area that can harness the usage of blockchain is the real estate sector. The most lucrative long-term investment is real estate, followed by gold, equities, mutual funds, and savings accounts. Nevertheless, it has administrative overheads such as lack of transparency, fraud, several intermediaries, title issues, paperwork, an increasing number of arbitrations, and the lack of liquidity. This paper proposes a framework that uses blockchain as an underlying technology. With the aid of blockchain and the suite of tools, it supports many of these problems that can be alleviated in the real estate investment ecosystem. These include smart contracts, immutable record management, tokenization, record tracking, and time-stamped storage. Tokenization of real estate lowers the entry barrier by fixing liquidity and interoperability and improving the interaction between various stakeholders. △ Less

Submitted 3 May, 2024; originally announced May 2024.

Journal ref: IJIIT vol.18, no.3 2022: pp.1-12.

arXiv:2404.19291 [pdf, other]

Dynamic Human Trust Modeling of Autonomous Agents With Varying Capability and Strategy

Authors: Jason Dekarske, Zhaodan Kong, Sanjay Joshi

Abstract: Objective We model the dynamic trust of human subjects in a human-autonomy-teaming screen-based task. Background Trust is an emerging area of study in human-robot collaboration. Many studies have looked at the issue of robot performance as a sole predictor of human trust, but this could underestimate the complexity of the interaction. Method Subjects were paired with autonomous agents to searc… ▽ More Objective We model the dynamic trust of human subjects in a human-autonomy-teaming screen-based task. Background Trust is an emerging area of study in human-robot collaboration. Many studies have looked at the issue of robot performance as a sole predictor of human trust, but this could underestimate the complexity of the interaction. Method Subjects were paired with autonomous agents to search an on-screen grid to determine the number of outlier objects. In each trial, a different autonomous agent with a preassigned capability used one of three search strategies and then reported the number of outliers it found as a fraction of its capability. Then, the subject reported their total outlier estimate. Human subjects then evaluated statements about the agent's behavior, reliability, and their trust in the agent. Results 80 subjects were recruited. Self-reported trust was modeled using Ordinary Least Squares, but the group that interacted with varying capability agents on a short time order produced a better performing ARIMAX model. Models were cross-validated between groups and found a moderate improvement in the next trial trust prediction. Conclusion A time series modeling approach reveals the effects of temporal ordering of agent performance on estimated trust. Recency bias may affect how subjects weigh the contribution of strategy or capability to trust. Understanding the connections between agent behavior, agent performance, and human trust is crucial to improving human-robot collaborative tasks. Application The modeling approach in this study demonstrates the need to represent autonomous agent characteristics over time to capture changes in human trust. △ Less

Submitted 30 April, 2024; originally announced April 2024.

arXiv:2404.16632 [pdf]

Introducing Systems Thinking as a Framework for Teaching and Assessing Threat Modeling Competency

Authors: Siddhant S. Joshi, Preeti Mukherjee, Kirsten A. Davis, James C. Davis

Abstract: Computing systems face diverse and substantial cybersecurity threats. To mitigate these cybersecurity threats, software engineers need to be competent in the skill of threat modeling. In industry and academia, there are many frameworks for teaching threat modeling, but our analysis of these frameworks suggests that (1) these approaches tend to be focused on component-level analysis rather than edu… ▽ More Computing systems face diverse and substantial cybersecurity threats. To mitigate these cybersecurity threats, software engineers need to be competent in the skill of threat modeling. In industry and academia, there are many frameworks for teaching threat modeling, but our analysis of these frameworks suggests that (1) these approaches tend to be focused on component-level analysis rather than educating students to reason holistically about a system's cybersecurity, and (2) there is no rubric for assessing a student's threat modeling competency. To address these concerns, we propose using systems thinking in conjunction with popular and industry-standard threat modeling frameworks like STRIDE for teaching and assessing threat modeling competency. Prior studies suggest a holistic approach, like systems thinking, can help understand and mitigate cybersecurity threats. Thus, we developed and piloted two novel rubrics - one for assessing STRIDE threat modeling performance and the other for assessing systems thinking performance while conducting STRIDE. To conduct this study, we piloted the two rubrics mentioned above to assess threat model artifacts of students enrolled in an upper-level software engineering course at Purdue University in Fall 2021, Spring 2023, and Fall 2023. Students who had both systems thinking and STRIDE instruction identified and attempted to mitigate component-level as well as systems-level threats. Students with only STRIDE instruction tended to focus on identifying and mitigating component-level threats and discounted system-level threats. We contribute to engineering education by: (1) describing a new rubric for assessing threat modeling based on systems thinking; (2) identifying trends and blindspots in students' threat modeling approach; and (3) envisioning the benefits of integrating systems thinking in threat modeling teaching and assessment. △ Less

Submitted 25 April, 2024; originally announced April 2024.

Comments: Presented at the Annual Conference of the American Society for Engineering Education (ASEE'24) 2024

arXiv:2404.05807 [pdf, other]

Slax: A Composable JAX Library for Rapid and Flexible Prototyping of Spiking Neural Networks

Authors: Thomas M. Summe, Siddharth Joshi

Abstract: Recent advances to algorithms for training spiking neural networks (SNNs) often leverage their unique dynamics. While backpropagation through time (BPTT) with surrogate gradients dominate the field, a rich landscape of alternatives can situate algorithms across various points in the performance, bio-plausibility, and complexity landscape. Evaluating and comparing algorithms is currently a cumberso… ▽ More Recent advances to algorithms for training spiking neural networks (SNNs) often leverage their unique dynamics. While backpropagation through time (BPTT) with surrogate gradients dominate the field, a rich landscape of alternatives can situate algorithms across various points in the performance, bio-plausibility, and complexity landscape. Evaluating and comparing algorithms is currently a cumbersome and error-prone process, requiring them to be repeatedly re-implemented. We introduce Slax, a JAX-based library designed to accelerate SNN algorithm design, compatible with the broader JAX and Flax ecosystem. Slax provides optimized implementations of diverse training algorithms, allowing direct performance comparison. Its toolkit includes methods to visualize and debug algorithms through loss landscapes, gradient similarities, and other metrics of model behavior during training. △ Less

Submitted 8 April, 2024; originally announced April 2024.

Comments: 13 pages, 11 figures, early draft

arXiv:2403.18679 [pdf]

An Exploratory Study on Upper-Level Computing Students' Use of Large Language Models as Tools in a Semester-Long Project

Authors: Ben Arie Tanay, Lexy Arinze, Siddhant S. Joshi, Kirsten A. Davis, James C. Davis

Abstract: Background: Large Language Models (LLMs) such as ChatGPT and CoPilot are influencing software engineering practice. Software engineering educators must teach future software engineers how to use such tools well. As of yet, there have been few studies that report on the use of LLMs in the classroom. It is, therefore, important to evaluate students' perception of LLMs and possible ways of adapting t… ▽ More Background: Large Language Models (LLMs) such as ChatGPT and CoPilot are influencing software engineering practice. Software engineering educators must teach future software engineers how to use such tools well. As of yet, there have been few studies that report on the use of LLMs in the classroom. It is, therefore, important to evaluate students' perception of LLMs and possible ways of adapting the computing curriculum to these shifting paradigms. Purpose: The purpose of this study is to explore computing students' experiences and approaches to using LLMs during a semester-long software engineering project. Design/Method: We collected data from a senior-level software engineering course at Purdue University. This course uses a project-based learning (PBL) design. The students used LLMs such as ChatGPT and Copilot in their projects. A sample of these student teams were interviewed to understand (1) how they used LLMs in their projects; and (2) whether and how their perspectives on LLMs changed over the course of the semester. We analyzed the data to identify themes related to students' usage patterns and learning outcomes. Results/Discussion: When computing students utilize LLMs within a project, their use cases cover both technical and professional applications. In addition, these students perceive LLMs to be efficient tools in obtaining information and completion of tasks. However, there were concerns about the responsible use of LLMs without being detrimental to their own learning outcomes. Based on our findings, we recommend future research to investigate the usage of LLM's in lower-level computer engineering courses to understand whether and how LLMs can be integrated as a learning aid without hurting the learning outcomes. △ Less

Submitted 16 April, 2024; v1 submitted 27 March, 2024; originally announced March 2024.

Comments: Accepted to the 2024 General Conference of the American Society for Engineering Education (ASEE)

arXiv:2403.17426 [pdf, other]

Knowledge-Powered Recommendation for an Improved Diet Water Footprint

Authors: Saurav Joshi, Filip Ilievski, Jay Pujara

Abstract: According to WWF, 1.1 billion people lack access to water, and 2.7 billion experience water scarcity at least one month a year. By 2025, two-thirds of the world's population may be facing water shortages. This highlights the urgency of managing water usage efficiently, especially in water-intensive sectors like food. This paper proposes a recommendation engine, powered by knowledge graphs, aiming… ▽ More According to WWF, 1.1 billion people lack access to water, and 2.7 billion experience water scarcity at least one month a year. By 2025, two-thirds of the world's population may be facing water shortages. This highlights the urgency of managing water usage efficiently, especially in water-intensive sectors like food. This paper proposes a recommendation engine, powered by knowledge graphs, aiming to facilitate sustainable and healthy food consumption. The engine recommends ingredient substitutes in user recipes that improve nutritional value and reduce environmental impact, particularly water footprint. The system architecture includes source identification, information extraction, schema alignment, knowledge graph construction, and user interface development. The research offers a promising tool for promoting healthier eating habits and contributing to water conservation efforts. △ Less

Submitted 26 March, 2024; originally announced March 2024.

Comments: 3 pages, 1 figure, AAAI'24

arXiv:2403.13890 [pdf, other]

Towards Learning Contrast Kinetics with Multi-Condition Latent Diffusion Models

Authors: Richard Osuala, Daniel M. Lang, Preeti Verma, Smriti Joshi, Apostolia Tsirikoglou, Grzegorz Skorupko, Kaisar Kushibar, Lidia Garrucho, Walter H. L. Pinaya, Oliver Diaz, Julia A. Schnabel, Karim Lekadir

Abstract: Contrast agents in dynamic contrast enhanced magnetic resonance imaging allow to localize tumors and observe their contrast kinetics, which is essential for cancer characterization and respective treatment decision-making. However, contrast agent administration is not only associated with adverse health risks, but also restricted for patients during pregnancy, and for those with kidney malfunction… ▽ More Contrast agents in dynamic contrast enhanced magnetic resonance imaging allow to localize tumors and observe their contrast kinetics, which is essential for cancer characterization and respective treatment decision-making. However, contrast agent administration is not only associated with adverse health risks, but also restricted for patients during pregnancy, and for those with kidney malfunction, or other adverse reactions. With contrast uptake as key biomarker for lesion malignancy, cancer recurrence risk, and treatment response, it becomes pivotal to reduce the dependency on intravenous contrast agent administration. To this end, we propose a multi-conditional latent diffusion model capable of acquisition time-conditioned image synthesis of DCE-MRI temporal sequences. To evaluate medical image synthesis, we additionally propose and validate the Fréchet radiomics distance as an image quality measure based on biomarker variability between synthetic and real imaging data. Our results demonstrate our method's ability to generate realistic multi-sequence fat-saturated breast DCE-MRI and uncover the emerging potential of deep learning based contrast kinetics simulation. We publicly share our accessible codebase at https://github.com/RichardObi/ccnet and provide a user-friendly library for Fréchet radiomics distance calculation at https://pypi.org/project/frd-score. △ Less

Submitted 17 July, 2024; v1 submitted 20 March, 2024; originally announced March 2024.

Comments: Early Accept at MICCAI2024

arXiv:2403.12267 [pdf, other]

Data-Efficient Contrastive Language-Image Pretraining: Prioritizing Data Quality over Quantity

Authors: Siddharth Joshi, Arnav Jain, Ali Payani, Baharan Mirzasoleiman

Abstract: Contrastive Language-Image Pre-training (CLIP) on large-scale image-caption datasets learns representations that can achieve remarkable zero-shot generalization. However, such models require a massive amount of pre-training data. Improving the quality of the pre-training data has been shown to be much more effective in improving CLIP's performance than increasing its volume. Nevertheless, finding… ▽ More Contrastive Language-Image Pre-training (CLIP) on large-scale image-caption datasets learns representations that can achieve remarkable zero-shot generalization. However, such models require a massive amount of pre-training data. Improving the quality of the pre-training data has been shown to be much more effective in improving CLIP's performance than increasing its volume. Nevertheless, finding small subsets of training data that provably generalize the best has remained an open question. In this work, we propose the first theoretically rigorous data selection method for CLIP. We show that subsets that closely preserve the cross-covariance of the images and captions of the full data provably achieve a superior generalization performance. Our extensive experiments on ConceptualCaptions3M and ConceptualCaptions12M demonstrate that subsets found by \method\ achieve over 2.7x and 1.4x the accuracy of the next best baseline on ImageNet and its shifted versions. Moreover, we show that our subsets obtain 1.5x the average accuracy across 11 downstream datasets, of the next best baseline. The code is available at: https://github.com/BigML-CS-UCLA/clipcov-data-efficient-clip. △ Less

Submitted 19 March, 2024; v1 submitted 18 March, 2024; originally announced March 2024.

Comments: AISTATS 2024, Code: https://github.com/BigML-CS-UCLA/clipcov-data-efficient-clip

arXiv:2403.11391 [pdf, other]

Investigating the Benefits of Projection Head for Representation Learning

Authors: Yihao Xue, Eric Gan, Jiayi Ni, Siddharth Joshi, Baharan Mirzasoleiman

Abstract: An effective technique for obtaining high-quality representations is adding a projection head on top of the encoder during training, then discarding it and using the pre-projection representations. Despite its proven practical effectiveness, the reason behind the success of this technique is poorly understood. The pre-projection representations are not directly optimized by the loss function, rais… ▽ More An effective technique for obtaining high-quality representations is adding a projection head on top of the encoder during training, then discarding it and using the pre-projection representations. Despite its proven practical effectiveness, the reason behind the success of this technique is poorly understood. The pre-projection representations are not directly optimized by the loss function, raising the question: what makes them better? In this work, we provide a rigorous theoretical answer to this question. We start by examining linear models trained with self-supervised contrastive loss. We reveal that the implicit bias of training algorithms leads to layer-wise progressive feature weighting, where features become increasingly unequal as we go deeper into the layers. Consequently, lower layers tend to have more normalized and less specialized representations. We theoretically characterize scenarios where such representations are more beneficial, highlighting the intricate interplay between data augmentation and input features. Additionally, we demonstrate that introducing non-linearity into the network allows lower layers to learn features that are completely absent in higher layers. Finally, we show how this mechanism improves the robustness in supervised contrastive learning and supervised learning. We empirically validate our results through various experiments on CIFAR-10/100, UrbanCars and shifted versions of ImageNet. We also introduce a potential alternative to projection head, which offers a more interpretable and controllable design. △ Less

Submitted 17 March, 2024; originally announced March 2024.

Journal ref: ICLR 2024

arXiv:2403.04890 [pdf, other]

Few shot chain-of-thought driven reasoning to prompt LLMs for open ended medical question answering

Authors: Ojas Gramopadhye, Saeel Sandeep Nachane, Prateek Chanda, Ganesh Ramakrishnan, Kshitij Sharad Jadhav, Yatin Nandwani, Dinesh Raghu, Sachindra Joshi

Abstract: Large Language models (LLMs) have demonstrated significant potential in transforming healthcare by automating tasks such as clinical documentation, information retrieval, and decision support. In this aspect, carefully engineered prompts have emerged as a powerful tool for using LLMs for medical scenarios, e.g., patient clinical scenarios. In this paper, we propose a modified version of the MedQA-… ▽ More Large Language models (LLMs) have demonstrated significant potential in transforming healthcare by automating tasks such as clinical documentation, information retrieval, and decision support. In this aspect, carefully engineered prompts have emerged as a powerful tool for using LLMs for medical scenarios, e.g., patient clinical scenarios. In this paper, we propose a modified version of the MedQA-USMLE dataset, which is subjective, to mimic real-life clinical scenarios. We explore the Chain of Thought (CoT) reasoning based on subjective response generation for the modified MedQA-USMLE dataset with appropriate LM-driven forward reasoning for correct responses to the medical questions. Keeping in mind the importance of response verification in the medical setting, we utilize a reward training mechanism whereby the language model also provides an appropriate verified response for a particular response to a clinical question. In this regard, we also include human-in-the-loop for different evaluation aspects. We develop better in-contrast learning strategies by modifying the 5-shot-codex-CoT-prompt from arXiv:2207.08143 for the subjective MedQA dataset and developing our incremental-reasoning prompt. Our evaluations show that the incremental reasoning prompt performs better than the modified codex prompt in certain scenarios. We also show that greedy decoding with the incremental reasoning method performs better than other strategies, such as prompt chaining and eliminative reasoning. △ Less

Submitted 7 March, 2024; originally announced March 2024.

arXiv:2403.02054 [pdf, other]

Large Language Model-Based Evolutionary Optimizer: Reasoning with elitism

Authors: Shuvayan Brahmachary, Subodh M. Joshi, Aniruddha Panda, Kaushik Koneripalli, Arun Kumar Sagotra, Harshil Patel, Ankush Sharma, Ameya D. Jagtap, Kaushic Kalyanaraman

Abstract: Large Language Models (LLMs) have demonstrated remarkable reasoning abilities, prompting interest in their application as black-box optimizers. This paper asserts that LLMs possess the capability for zero-shot optimization across diverse scenarios, including multi-objective and high-dimensional problems. We introduce a novel population-based method for numerical optimization using LLMs called Lang… ▽ More Large Language Models (LLMs) have demonstrated remarkable reasoning abilities, prompting interest in their application as black-box optimizers. This paper asserts that LLMs possess the capability for zero-shot optimization across diverse scenarios, including multi-objective and high-dimensional problems. We introduce a novel population-based method for numerical optimization using LLMs called Language-Model-Based Evolutionary Optimizer (LEO). Our hypothesis is supported through numerical examples, spanning benchmark and industrial engineering problems such as supersonic nozzle shape optimization, heat transfer, and windfarm layout optimization. We compare our method to several gradient-based and gradient-free optimization approaches. While LLMs yield comparable results to state-of-the-art methods, their imaginative nature and propensity to hallucinate demand careful handling. We provide practical guidelines for obtaining reliable answers from LLMs and discuss method limitations and potential research directions. △ Less

Submitted 4 March, 2024; originally announced March 2024.

arXiv:2403.01926 [pdf, other]

IndicVoices: Towards building an Inclusive Multilingual Speech Dataset for Indian Languages

Authors: Tahir Javed, Janki Atul Nawale, Eldho Ittan George, Sakshi Joshi, Kaushal Santosh Bhogale, Deovrat Mehendale, Ishvinder Virender Sethi, Aparna Ananthanarayanan, Hafsah Faquih, Pratiti Palit, Sneha Ravishankar, Saranya Sukumaran, Tripura Panchagnula, Sunjay Murali, Kunal Sharad Gandhi, Ambujavalli R, Manickam K M, C Venkata Vaijayanthi, Krishnan Srinivasa Raghavan Karunganni, Pratyush Kumar, Mitesh M Khapra

Abstract: We present INDICVOICES, a dataset of natural and spontaneous speech containing a total of 7348 hours of read (9%), extempore (74%) and conversational (17%) audio from 16237 speakers covering 145 Indian districts and 22 languages. Of these 7348 hours, 1639 hours have already been transcribed, with a median of 73 hours per language. Through this paper, we share our journey of capturing the cultural,… ▽ More We present INDICVOICES, a dataset of natural and spontaneous speech containing a total of 7348 hours of read (9%), extempore (74%) and conversational (17%) audio from 16237 speakers covering 145 Indian districts and 22 languages. Of these 7348 hours, 1639 hours have already been transcribed, with a median of 73 hours per language. Through this paper, we share our journey of capturing the cultural, linguistic and demographic diversity of India to create a one-of-its-kind inclusive and representative dataset. More specifically, we share an open-source blueprint for data collection at scale comprising of standardised protocols, centralised tools, a repository of engaging questions, prompts and conversation scenarios spanning multiple domains and topics of interest, quality control mechanisms, comprehensive transcription guidelines and transcription tools. We hope that this open source blueprint will serve as a comprehensive starter kit for data collection efforts in other multilingual regions of the world. Using INDICVOICES, we build IndicASR, the first ASR model to support all the 22 languages listed in the 8th schedule of the Constitution of India. All the data, tools, guidelines, models and other materials developed as a part of this work will be made publicly available △ Less

Submitted 4 March, 2024; originally announced March 2024.

arXiv:2402.19355 [pdf, other]

Unraveling Adversarial Examples against Speaker Identification -- Techniques for Attack Detection and Victim Model Classification

Authors: Sonal Joshi, Thomas Thebaud, Jesús Villalba, Najim Dehak

Abstract: Adversarial examples have proven to threaten speaker identification systems, and several countermeasures against them have been proposed. In this paper, we propose a method to detect the presence of adversarial examples, i.e., a binary classifier distinguishing between benign and adversarial examples. We build upon and extend previous work on attack type classification by exploring new architectur… ▽ More Adversarial examples have proven to threaten speaker identification systems, and several countermeasures against them have been proposed. In this paper, we propose a method to detect the presence of adversarial examples, i.e., a binary classifier distinguishing between benign and adversarial examples. We build upon and extend previous work on attack type classification by exploring new architectures. Additionally, we introduce a method for identifying the victim model on which the adversarial attack is carried out. To achieve this, we generate a new dataset containing multiple attacks performed against various victim models. We achieve an AUC of 0.982 for attack detection, with no more than a 0.03 drop in performance for unknown attacks. Our attack classification accuracy (excluding benign) reaches 86.48% across eight attack types using our LightResNet34 architecture, while our victim model classification accuracy reaches 72.28% across four victim models. △ Less

Submitted 29 February, 2024; originally announced February 2024.

arXiv:2402.19109 [pdf, other]

Confidence and Assurance of Percentiles

Authors: Sanjay M. Joshi

Abstract: Confidence interval of mean is often used when quoting statistics. The same rigor is often missing when quoting percentiles and tolerance or percentile intervals. This article derives the expression for confidence in percentiles of a sample population. Confidence intervals of median is compared to those of mean for a few sample distributions. The concept of assurance from reliability engineering i… ▽ More Confidence interval of mean is often used when quoting statistics. The same rigor is often missing when quoting percentiles and tolerance or percentile intervals. This article derives the expression for confidence in percentiles of a sample population. Confidence intervals of median is compared to those of mean for a few sample distributions. The concept of assurance from reliability engineering is then extended to percentiles. The assurance level of sorted samples simply matches the confidence and percentile levels. Numerical method to compute assurance using Brent's optimization method is provided as an open-source python package. △ Less

Submitted 29 February, 2024; originally announced February 2024.

Comments: 5 pages, 4 Figures

arXiv:2402.07658 [pdf, other]

The Sound of Healthcare: Improving Medical Transcription ASR Accuracy with Large Language Models

Authors: Ayo Adedeji, Sarita Joshi, Brendan Doohan

Abstract: In the rapidly evolving landscape of medical documentation, transcribing clinical dialogues accurately is increasingly paramount. This study explores the potential of Large Language Models (LLMs) to enhance the accuracy of Automatic Speech Recognition (ASR) systems in medical transcription. Utilizing the PriMock57 dataset, which encompasses a diverse range of primary care consultations, we apply a… ▽ More In the rapidly evolving landscape of medical documentation, transcribing clinical dialogues accurately is increasingly paramount. This study explores the potential of Large Language Models (LLMs) to enhance the accuracy of Automatic Speech Recognition (ASR) systems in medical transcription. Utilizing the PriMock57 dataset, which encompasses a diverse range of primary care consultations, we apply advanced LLMs to refine ASR-generated transcripts. Our research is multifaceted, focusing on improvements in general Word Error Rate (WER), Medical Concept WER (MC-WER) for the accurate transcription of essential medical terms, and speaker diarization accuracy. Additionally, we assess the role of LLM post-processing in improving semantic textual similarity, thereby preserving the contextual integrity of clinical dialogues. Through a series of experiments, we compare the efficacy of zero-shot and Chain-of-Thought (CoT) prompting techniques in enhancing diarization and correction accuracy. Our findings demonstrate that LLMs, particularly through CoT prompting, not only improve the diarization accuracy of existing ASR systems but also achieve state-of-the-art performance in this domain. This improvement extends to more accurately capturing medical concepts and enhancing the overall semantic coherence of the transcribed dialogues. These findings illustrate the dual role of LLMs in augmenting ASR outputs and independently excelling in transcription tasks, holding significant promise for transforming medical ASR systems and leading to more accurate and reliable patient records in healthcare settings. △ Less

Submitted 12 February, 2024; originally announced February 2024.

Comments: 31 pages, 17 figures

arXiv:2402.06185 [pdf, other]

Development and validation of an artificial intelligence model to accurately predict spinopelvic parameters

Authors: Edward S. Harake, Joseph R. Linzey, Cheng Jiang, Rushikesh S. Joshi, Mark M. Zaki, Jaes C. Jones, Siri S. Khalsa, John H. Lee, Zachary Wilseck, Jacob R. Joseph, Todd C. Hollon, Paul Park

Abstract: Objective. Achieving appropriate spinopelvic alignment has been shown to be associated with improved clinical symptoms. However, measurement of spinopelvic radiographic parameters is time-intensive and interobserver reliability is a concern. Automated measurement tools have the promise of rapid and consistent measurements, but existing tools are still limited by some degree of manual user-entry re… ▽ More Objective. Achieving appropriate spinopelvic alignment has been shown to be associated with improved clinical symptoms. However, measurement of spinopelvic radiographic parameters is time-intensive and interobserver reliability is a concern. Automated measurement tools have the promise of rapid and consistent measurements, but existing tools are still limited by some degree of manual user-entry requirements. This study presents a novel artificial intelligence (AI) tool called SpinePose that automatically predicts spinopelvic parameters with high accuracy without the need for manual entry. Methods. SpinePose was trained and validated on 761 sagittal whole-spine X-rays to predict sagittal vertical axis (SVA), pelvic tilt (PT), pelvic incidence (PI), sacral slope (SS), lumbar lordosis (LL), T1-pelvic angle (T1PA), and L1-pelvic angle (L1PA). A separate test set of 40 X-rays was labeled by 4 reviewers, including fellowship-trained spine surgeons and a fellowship-trained radiologist with neuroradiology subspecialty certification. Median errors relative to the most senior reviewer were calculated to determine model accuracy on test images. Intraclass correlation coefficients (ICC) were used to assess inter-rater reliability. Results. SpinePose exhibited the following median (interquartile range) parameter errors: SVA: 2.2(2.3)mm, p=0.93; PT: 1.3(1.2)°, p=0.48; SS: 1.7(2.2)°, p=0.64; PI: 2.2(2.1)°, p=0.24; LL: 2.6(4.0)°, p=0.89; T1PA: 1.1(0.9)°, p=0.42; and L1PA: 1.4(1.6)°, p=0.49. Model predictions also exhibited excellent reliability at all parameters (ICC: 0.91-1.0). Conclusions. SpinePose accurately predicted spinopelvic parameters with excellent reliability comparable to fellowship-trained spine surgeons and neuroradiologists. Utilization of predictive AI tools in spinal imaging can substantially aid in patient selection and surgical planning. △ Less

Submitted 8 February, 2024; originally announced February 2024.

Comments: 10 pages, 5 figures, to appear in Journal of Neurosurgery: Spine

arXiv:2402.02479 [pdf, other]

BRAIn: Bayesian Reward-conditioned Amortized Inference for natural language generation from feedback

Authors: Gaurav Pandey, Yatin Nandwani, Tahira Naseem, Mayank Mishra, Guangxuan Xu, Dinesh Raghu, Sachindra Joshi, Asim Munawar, Ramón Fernandez Astudillo

Abstract: Distribution matching methods for language model alignment such as Generation with Distributional Control (GDC) and Distributional Policy Gradient (DPG) have not received the same level of attention in reinforcement learning from human feedback (RLHF) as contrastive methods such as Sequence Likelihood Calibration (SLiC), Direct Preference Optimization (DPO) and its variants. We identify high varia… ▽ More Distribution matching methods for language model alignment such as Generation with Distributional Control (GDC) and Distributional Policy Gradient (DPG) have not received the same level of attention in reinforcement learning from human feedback (RLHF) as contrastive methods such as Sequence Likelihood Calibration (SLiC), Direct Preference Optimization (DPO) and its variants. We identify high variance of the gradient estimate as the primary reason for the lack of success of these methods and propose a self-normalized baseline to reduce the variance. We further generalize the target distribution in DPG, GDC and DPO by using Bayes' rule to define the reward-conditioned posterior. The resulting approach, referred to as BRAIn - Bayesian Reward-conditioned Amortized Inference acts as a bridge between distribution matching methods and DPO and significantly outperforms prior art in summarization and Antropic HH tasks. △ Less

Submitted 10 June, 2024; v1 submitted 4 February, 2024; originally announced February 2024.

Comments: Accepted at ICML 2024 (main conference)

arXiv:2401.00451 [pdf, other]

doi 10.1145/3639474.3640079

Exploring the Need of Accessibility Education in the Software Industry: Insights from a Survey of Software Professionals in India

Authors: Parthasarathy P D, Swaroop Joshi

Abstract: A UserWay study in 2021 indicates that an annual global e-commerce revenue loss of approximately $16 billion can be attributed to inaccessible websites and applications. According to the 2023 WebAIM study, only 3.7% of the world's top one million website homepages are fully accessible. This shows that many software developers use poor coding practices that don't adhere to the Web Content Accessibi… ▽ More A UserWay study in 2021 indicates that an annual global e-commerce revenue loss of approximately $16 billion can be attributed to inaccessible websites and applications. According to the 2023 WebAIM study, only 3.7% of the world's top one million website homepages are fully accessible. This shows that many software developers use poor coding practices that don't adhere to the Web Content Accessibility Guidelines (WCAG). This research centers on software professionals and their role in addressing accessibility. This work seeks to understand (a) who within the software development community actively practices accessibility, (b) when and how accessibility is considered in the software development lifecycle, (c) the various challenges encountered in building accessible software, and (d) the resources required by software professionals to enhance product accessibility. Our survey of 269 software professionals from India sheds light on the pressing need for accessibility education within the software industry. A substantial majority (69.9%, N=269) of respondents express the need for training materials, workshops, and bootcamps to enhance their accessibility skills. We present a list of actionable recommendations that can be implemented within the industry to promote accessibility awareness and skills. We also open source our raw data for further research, encouraging continued exploration in this domain. △ Less

Submitted 31 December, 2023; originally announced January 2024.

Comments: To be published in International Conference on Software Engineering (ICSE'24), Software Engineering Education and Training Track

ACM Class: K.3.2; K.4.2

arXiv:2401.00449 [pdf, other]

doi 10.1145/3639474.3640083

Teaching Digital Accessibility to Industry Professionals using the Community of Practice Framework: An Experience Report

Authors: Parthasarathy PD, Swaroop Joshi

Abstract: Despite recent initiatives aimed at improving accessibility, the field of digital accessibility remains markedly behind contemporary advancements in the software industry as a large number of real world software and web applications continue to fall short of accessibility requirements. A persisting skills deficit within the existing technology workforce has been an enduring impediment, hindering o… ▽ More Despite recent initiatives aimed at improving accessibility, the field of digital accessibility remains markedly behind contemporary advancements in the software industry as a large number of real world software and web applications continue to fall short of accessibility requirements. A persisting skills deficit within the existing technology workforce has been an enduring impediment, hindering organizations from delivering truly accessible software products. This, in turn, elevates the risk of isolating and excluding a substantial portion of potential users. In this paper, we report lessons learned from a training program for teaching digital accessibility using the Communities of Practice (CoP) framework to industry professionals. We recruited 66 participants from a large multi-national software company and assigned them to two groups: one participating in a CoP and the other using self-paced learning. We report experiences from designing the training program, conducting the actual training, and assessing the efficiency of the two approaches. Based on these findings, we provide recommendations for practitioners in Learning and Development teams and educators in designing accessibility courses for industry professionals. △ Less

Submitted 31 December, 2023; originally announced January 2024.

Comments: To be published in International Conference on Software Engineering (ICSE'24), Software Engineering Education and Training Track

ACM Class: K.3.2; K.4.2

arXiv:2312.07418 [pdf]

Attention Based Encoder Decoder Model for Video Captioning in Nepali (2023)

Authors: Kabita Parajuli, Shashidhar Ram Joshi

Abstract: Video captioning in Nepali, a language written in the Devanagari script, presents a unique challenge due to the lack of existing academic work in this domain. This work develops a novel encoder-decoder paradigm for Nepali video captioning to tackle this difficulty. LSTM and GRU sequence-to-sequence models are used in the model to produce related textual descriptions based on features retrieved fro… ▽ More Video captioning in Nepali, a language written in the Devanagari script, presents a unique challenge due to the lack of existing academic work in this domain. This work develops a novel encoder-decoder paradigm for Nepali video captioning to tackle this difficulty. LSTM and GRU sequence-to-sequence models are used in the model to produce related textual descriptions based on features retrieved from video frames using CNNs. Using Google Translate and manual post-editing, a Nepali video captioning dataset is generated from the Microsoft Research Video Description Corpus (MSVD) dataset created using Google Translate, and manual post-editing work. The efficiency of the model for Devanagari-scripted video captioning is demonstrated by BLEU, METOR, and ROUGE measures, which are used to assess its performance. △ Less

Submitted 19 May, 2024; v1 submitted 12 December, 2023; originally announced December 2023.

arXiv:2311.16151 [pdf, other]

Estimating Post-Synaptic Effects for Online Training of Feed-Forward SNNs

Authors: Thomas Summe, Clemens JS Schaefer, Siddharth Joshi

Abstract: Facilitating online learning in spiking neural networks (SNNs) is a key step in developing event-based models that can adapt to changing environments and learn from continuous data streams in real-time. Although forward-mode differentiation enables online learning, its computational requirements restrict scalability. This is typically addressed through approximations that limit learning in deep mo… ▽ More Facilitating online learning in spiking neural networks (SNNs) is a key step in developing event-based models that can adapt to changing environments and learn from continuous data streams in real-time. Although forward-mode differentiation enables online learning, its computational requirements restrict scalability. This is typically addressed through approximations that limit learning in deep models. In this study, we propose Online Training with Postsynaptic Estimates (OTPE) for training feed-forward SNNs, which approximates Real-Time Recurrent Learning (RTRL) by incorporating temporal dynamics not captured by current approximations, such as Online Training Through Time (OTTT) and Online Spatio-Temporal Learning (OSTL). We show improved scaling for multi-layer networks using a novel approximation of temporal effects on the subsequent layer's activity. This approximation incurs minimal overhead in the time and space complexity compared to similar algorithms, and the calculation of temporal effects remains local to each layer. We characterize the learning performance of our proposed algorithms on multiple SNN model configurations for rate-based and time-based encoding. OTPE exhibits the highest directional alignment to exact gradients, calculated with backpropagation through time (BPTT), in deep networks and, on time-based encoding, outperforms other approximate methods. We also observe sizeable gains in average performance over similar algorithms in offline training of Spiking Heidelberg Digits with equivalent hyper-parameters (OTTT/OSTL - 70.5%; OTPE - 75.2%; BPTT - 78.1%). △ Less

Submitted 7 November, 2023; originally announced November 2023.

arXiv:2311.14335 [pdf, other]

Comparative Analysis of Transformers for Modeling Tabular Data: A Casestudy using Industry Scale Dataset

Authors: Usneek Singh, Piyush Arora, Shamika Ganesan, Mohit Kumar, Siddhant Kulkarni, Salil R. Joshi

Abstract: We perform a comparative analysis of transformer-based models designed for modeling tabular data, specifically on an industry-scale dataset. While earlier studies demonstrated promising outcomes on smaller public or synthetic datasets, the effectiveness did not extend to larger industry-scale datasets. The challenges identified include handling high-dimensional data, the necessity for efficient pr… ▽ More We perform a comparative analysis of transformer-based models designed for modeling tabular data, specifically on an industry-scale dataset. While earlier studies demonstrated promising outcomes on smaller public or synthetic datasets, the effectiveness did not extend to larger industry-scale datasets. The challenges identified include handling high-dimensional data, the necessity for efficient pre-processing of categorical and numerical features, and addressing substantial computational requirements. To overcome the identified challenges, the study conducts an extensive examination of various transformer-based models using both synthetic datasets and the default prediction Kaggle dataset (2022) from American Express. The paper presents crucial insights into optimal data pre-processing, compares pre-training and direct supervised learning methods, discusses strategies for managing categorical and numerical features, and highlights trade-offs between computational resources and performance. Focusing on temporal financial data modeling, the research aims to facilitate the systematic development and deployment of transformer-based models in real-world scenarios, emphasizing scalability. △ Less

Submitted 24 November, 2023; originally announced November 2023.

Comments: Accepted at 7th Joint International Conference on Data Science & Management of Data (11th ACMIKDD CODS and 29th COMAD)

arXiv:2311.12944 [pdf, other]

SkyCharge: Deploying Unmanned Aerial Vehicles for Dynamic Load Optimization in Solar Small Cell 5G Networks

Authors: Daksh Dave, Vinay Chamola, Sandeep Joshi, Sherali Zeadally

Abstract: The power requirements posed by the fifth-generation and beyond cellular networks are an important constraint in network deployment and require energy-efficient solutions. In this work, we propose a novel user load transfer approach using airborne base stations (BS) mounted on drones for reliable and secure power redistribution across the micro-grid network comprising green small cell BSs. Dependi… ▽ More The power requirements posed by the fifth-generation and beyond cellular networks are an important constraint in network deployment and require energy-efficient solutions. In this work, we propose a novel user load transfer approach using airborne base stations (BS) mounted on drones for reliable and secure power redistribution across the micro-grid network comprising green small cell BSs. Depending on the user density and the availability of an aerial BS, the energy requirement of a cell with an energy deficit is accommodated by migrating the aerial BS from a high-energy to a low-energy cell. The proposed hybrid drone-based framework integrates long short-term memory with unique cost functions using an evolutionary neural network for drones and BSs and efficiently manages energy and load redistribution. The proposed algorithm reduces power outages at BSs and maintains consistent throughput stability, thereby demonstrating its capability to boost the reliability and robustness of wireless communication systems. △ Less

Submitted 9 February, 2024; v1 submitted 21 November, 2023; originally announced November 2023.

arXiv:2311.12235 [pdf, other]

Improvements in Interlayer Pipelining of CNN Accelerators Using Genetic Algorithms

Authors: Mark Horeni, Siddharth Joshi

Abstract: Deploying Convolutional Neural Networks (CNNs) on edge platforms necessitates efficient hardware acceleration. Any unnecessary data movement in such accelerators can unacceptably degrade performance and efficiency. To address this, we develop a layer fusion technique targeting CNNs, that reduces off-chip data communication using a Genetic Algorithm (GA) applied to graph-based topological sort. Res… ▽ More Deploying Convolutional Neural Networks (CNNs) on edge platforms necessitates efficient hardware acceleration. Any unnecessary data movement in such accelerators can unacceptably degrade performance and efficiency. To address this, we develop a layer fusion technique targeting CNNs, that reduces off-chip data communication using a Genetic Algorithm (GA) applied to graph-based topological sort. Results show a 1.8$\times$ increase in energy efficiency and 1.9$\times$ improvement in energy-delay product (EDP) for MobileNet-v3 on a SIMBA-like mobile architecture. Our approach consistently improves workload performance, averaging 1.4$\times$ improvement to EDP for SIMBA and 1.12$\times$ for Eyeriss. △ Less

Submitted 20 November, 2023; originally announced November 2023.

arXiv:2311.11157 [pdf, other]

Contextualizing Internet Memes Across Social Media Platforms

Authors: Saurav Joshi, Filip Ilievski, Luca Luceri

Abstract: Internet memes have emerged as a novel format for communication and expressing ideas on the web. Their fluidity and creative nature are reflected in their widespread use, often across platforms and occasionally for unethical or harmful purposes. While computational work has already analyzed their high-level virality over time and developed specialized classifiers for hate speech detection, there h… ▽ More Internet memes have emerged as a novel format for communication and expressing ideas on the web. Their fluidity and creative nature are reflected in their widespread use, often across platforms and occasionally for unethical or harmful purposes. While computational work has already analyzed their high-level virality over time and developed specialized classifiers for hate speech detection, there have been no efforts to date that aim to holistically track, identify, and map internet memes posted on social media. To bridge this gap, we investigate whether internet memes across social media platforms can be contextualized by using a semantic repository of knowledge, namely, a knowledge graph. We collect thousands of potential internet meme posts from two social media platforms, namely Reddit and Discord, and develop an extract-transform-load procedure to create a data lake with candidate meme posts. By using vision transformer-based similarity, we match these candidates against the memes cataloged in IMKG -- a recently released knowledge graph of internet memes. We leverage this grounding to highlight the potential of our proposed framework to study the prevalence of memes on different platforms, map them to IMKG, and provide context about memes on social media. △ Less

Submitted 26 February, 2024; v1 submitted 18 November, 2023; originally announced November 2023.

Comments: 10 pages, 7 figures, 2 tables

arXiv:2311.10899 [pdf, other]

Extraction and Summarization of Explicit Video Content using Multi-Modal Deep Learning

Authors: Shaunak Joshi, Raghav Gaggar

Abstract: With the increase in video-sharing platforms across the internet, it is difficult for humans to moderate the data for explicit content. Hence, an automated pipeline to scan through video data for explicit content has become the need of the hour. We propose a novel pipeline that uses multi-modal deep learning to first extract the explicit segments of input videos and then summarize their content us… ▽ More With the increase in video-sharing platforms across the internet, it is difficult for humans to moderate the data for explicit content. Hence, an automated pipeline to scan through video data for explicit content has become the need of the hour. We propose a novel pipeline that uses multi-modal deep learning to first extract the explicit segments of input videos and then summarize their content using text to determine its age appropriateness and age rating. We also evaluate our pipeline's effectiveness in the end using standard metrics. △ Less

Submitted 20 November, 2023; v1 submitted 17 November, 2023; originally announced November 2023.

Comments: 8 pages, 3 figures

ACM Class: I.2.10

arXiv:2311.10879 [pdf, other]

doi 10.1117/12.3006961

Pre- to Post-Contrast Breast MRI Synthesis for Enhanced Tumour Segmentation

Authors: Richard Osuala, Smriti Joshi, Apostolia Tsirikoglou, Lidia Garrucho, Walter H. L. Pinaya, Oliver Diaz, Karim Lekadir

Abstract: Despite its benefits for tumour detection and treatment, the administration of contrast agents in dynamic contrast-enhanced MRI (DCE-MRI) is associated with a range of issues, including their invasiveness, bioaccumulation, and a risk of nephrogenic systemic fibrosis. This study explores the feasibility of producing synthetic contrast enhancements by translating pre-contrast T1-weighted fat-saturat… ▽ More Despite its benefits for tumour detection and treatment, the administration of contrast agents in dynamic contrast-enhanced MRI (DCE-MRI) is associated with a range of issues, including their invasiveness, bioaccumulation, and a risk of nephrogenic systemic fibrosis. This study explores the feasibility of producing synthetic contrast enhancements by translating pre-contrast T1-weighted fat-saturated breast MRI to their corresponding first DCE-MRI sequence leveraging the capabilities of a generative adversarial network (GAN). Additionally, we introduce a Scaled Aggregate Measure (SAMe) designed for quantitatively evaluating the quality of synthetic data in a principled manner and serving as a basis for selecting the optimal generative model. We assess the generated DCE-MRI data using quantitative image quality metrics and apply them to the downstream task of 3D breast tumour segmentation. Our results highlight the potential of post-contrast DCE-MRI synthesis in enhancing the robustness of breast tumour segmentation models via data augmentation. Our code is available at https://github.com/RichardObi/pre_post_synthesis. △ Less

Submitted 31 May, 2024; v1 submitted 17 November, 2023; originally announced November 2023.

Comments: Accepted as oral presentation at SPIE Medical Imaging 2024 (Image Processing)

arXiv:2311.08705 [pdf, other]

Evaluating Robustness of Dialogue Summarization Models in the Presence of Naturally Occurring Variations

Authors: Ankita Gupta, Chulaka Gunasekara, Hui Wan, Jatin Ganhotra, Sachindra Joshi, Marina Danilevsky

Abstract: Dialogue summarization task involves summarizing long conversations while preserving the most salient information. Real-life dialogues often involve naturally occurring variations (e.g., repetitions, hesitations) and existing dialogue summarization models suffer from performance drop on such conversations. In this study, we systematically investigate the impact of such variations on state-of-the-a… ▽ More Dialogue summarization task involves summarizing long conversations while preserving the most salient information. Real-life dialogues often involve naturally occurring variations (e.g., repetitions, hesitations) and existing dialogue summarization models suffer from performance drop on such conversations. In this study, we systematically investigate the impact of such variations on state-of-the-art dialogue summarization models using publicly available datasets. To simulate real-life variations, we introduce two types of perturbations: utterance-level perturbations that modify individual utterances with errors and language variations, and dialogue-level perturbations that add non-informative exchanges (e.g., repetitions, greetings). We conduct our analysis along three dimensions of robustness: consistency, saliency, and faithfulness, which capture different aspects of the summarization model's performance. We find that both fine-tuned and instruction-tuned models are affected by input variations, with the latter being more susceptible, particularly to dialogue-level perturbations. We also validate our findings via human evaluation. Finally, we investigate if the robustness of fine-tuned models can be improved by training them with a fraction of perturbed data and observe that this approach is insufficient to address robustness challenges with current models and thus warrants a more thorough investigation to identify better solutions. Overall, our work highlights robustness challenges in dialogue summarization and provides insights for future research. △ Less

Submitted 15 November, 2023; originally announced November 2023.

arXiv:2311.07693 [pdf, other]

Matching aggregate posteriors in the variational autoencoder

Authors: Surojit Saha, Sarang Joshi, Ross Whitaker

Abstract: The variational autoencoder (VAE) is a well-studied, deep, latent-variable model (DLVM) that efficiently optimizes the variational lower bound of the log marginal data likelihood and has a strong theoretical foundation. However, the VAE's known failure to match the aggregate posterior often results in \emph{pockets/holes} in the latent distribution (i.e., a failure to match the prior) and/or \emph… ▽ More The variational autoencoder (VAE) is a well-studied, deep, latent-variable model (DLVM) that efficiently optimizes the variational lower bound of the log marginal data likelihood and has a strong theoretical foundation. However, the VAE's known failure to match the aggregate posterior often results in \emph{pockets/holes} in the latent distribution (i.e., a failure to match the prior) and/or \emph{posterior collapse}, which is associated with a loss of information in the latent space. This paper addresses these shortcomings in VAEs by reformulating the objective function associated with VAEs in order to match the aggregate/marginal posterior distribution to the prior. We use kernel density estimate (KDE) to model the aggregate posterior in high dimensions. The proposed method is named the \emph{aggregate variational autoencoder} (AVAE) and is built on the theoretical framework of the VAE. Empirical evaluation of the proposed method on multiple benchmark data sets demonstrates the effectiveness of the AVAE relative to state-of-the-art (SOTA) methods. △ Less

Submitted 13 November, 2023; originally announced November 2023.

arXiv:2311.06212 [pdf, other]

Differentiable VQ-VAE's for Robust White Matter Streamline Encodings

Authors: Andrew Lizarraga, Brandon Taraku, Edouardo Honig, Ying Nian Wu, Shantanu H. Joshi

Abstract: Given the complex geometry of white matter streamlines, Autoencoders have been proposed as a dimension-reduction tool to simplify the analysis streamlines in a low-dimensional latent spaces. However, despite these recent successes, the majority of encoder architectures only perform dimension reduction on single streamlines as opposed to a full bundle of streamlines. This is a severe limitation of… ▽ More Given the complex geometry of white matter streamlines, Autoencoders have been proposed as a dimension-reduction tool to simplify the analysis streamlines in a low-dimensional latent spaces. However, despite these recent successes, the majority of encoder architectures only perform dimension reduction on single streamlines as opposed to a full bundle of streamlines. This is a severe limitation of the encoder architecture that completely disregards the global geometric structure of streamlines at the expense of individual fibers. Moreover, the latent space may not be well structured which leads to doubt into their interpretability. In this paper we propose a novel Differentiable Vector Quantized Variational Autoencoder, which are engineered to ingest entire bundles of streamlines as single data-point and provides reliable trustworthy encodings that can then be later used to analyze streamlines in the latent space. Comparisons with several state of the art Autoencoders demonstrate superior performance in both encoding and synthesis. △ Less

Submitted 18 November, 2023; v1 submitted 10 November, 2023; originally announced November 2023.

Comments: 5 pages, 4 figures, 1 table

arXiv:2310.19583 [pdf, other]

GC-MVSNet: Multi-View, Multi-Scale, Geometrically-Consistent Multi-View Stereo

Authors: Vibhas K. Vats, Sripad Joshi, David J. Crandall, Md. Alimoor Reza, Soon-heung Jung

Abstract: Traditional multi-view stereo (MVS) methods rely heavily on photometric and geometric consistency constraints, but newer machine learning-based MVS methods check geometric consistency across multiple source views only as a post-processing step. In this paper, we present a novel approach that explicitly encourages geometric consistency of reference view depth maps across multiple source views at di… ▽ More Traditional multi-view stereo (MVS) methods rely heavily on photometric and geometric consistency constraints, but newer machine learning-based MVS methods check geometric consistency across multiple source views only as a post-processing step. In this paper, we present a novel approach that explicitly encourages geometric consistency of reference view depth maps across multiple source views at different scales during learning (see Fig. 1). We find that adding this geometric consistency loss significantly accelerates learning by explicitly penalizing geometrically inconsistent pixels, reducing the training iteration requirements to nearly half that of other MVS methods. Our extensive experiments show that our approach achieves a new state-of-the-art on the DTU and BlendedMVS datasets, and competitive results on the Tanks and Temples benchmark. To the best of our knowledge, GC-MVSNet is the first attempt to enforce multi-view, multi-scale geometric consistency during learning. △ Less

Submitted 21 December, 2023; v1 submitted 30 October, 2023; originally announced October 2023.

Comments: Accepted in WACV 2024 Link: https://openaccess.thecvf.com/content/WACV2024/html/Vats_GC-MVSNet_Multi-View_Multi-Scale_Geometrically-Consistent_Multi-View_Stereo_WACV_2024_paper.html

Journal ref: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2024

arXiv:2310.16370 [pdf, other]

PartRePer-MPI: Combining Fault Tolerance and Performance for MPI Applications

Authors: Sarthak Joshi, Sathish Vadhiyar

Abstract: As we have entered Exascale computing, the faults in high-performance systems are expected to increase considerably. To compensate for a higher failure rate, the standard checkpoint/restart technique would need to create checkpoints at a much higher frequency resulting in an excessive amount of overhead which would not be sustainable for many scientific applications. Replication allows for fast re… ▽ More As we have entered Exascale computing, the faults in high-performance systems are expected to increase considerably. To compensate for a higher failure rate, the standard checkpoint/restart technique would need to create checkpoints at a much higher frequency resulting in an excessive amount of overhead which would not be sustainable for many scientific applications. Replication allows for fast recovery from failures by simply dropping the failed processes and using their replicas to continue the regular operation of the application. In this paper, we have implemented PartRePer-MPI, a novel fault-tolerant MPI library that adopts partial replication of some of the launched MPI processes in order to provide resilience from failures. The novelty of our work is that it combines both fault tolerance, due to the use of the User Level Failure Mitigation (ULFM) framework in the Open MPI library, and high performance, due to the use of communication protocols in the native MPI library that is generally fine-tuned for specific HPC platforms. We have implemented efficient and parallel communication strategies with computational and replica processes, and our library can seamlessly provide fault tolerance support to an existing MPI application. Our experiments using seven NAS Parallel Benchmarks and two scientific applications show that the failure-free overheads in PartRePer-MPI when compared to the baseline MVAPICH2, are only up to 6.4% for the NAS parallel benchmarks and up to 9.7% for the scientific applications. △ Less

Submitted 25 October, 2023; originally announced October 2023.

arXiv:2310.04971 [pdf, other]

Understanding the Robustness of Multi-modal Contrastive Learning to Distribution Shift

Authors: Yihao Xue, Siddharth Joshi, Dang Nguyen, Baharan Mirzasoleiman

Abstract: Recently, multimodal contrastive learning (MMCL) approaches, such as CLIP, have achieved a remarkable success in learning representations that are robust against distribution shift and generalize to new domains. Despite the empirical success, the mechanism behind learning such generalizable representations is not understood. In this work, we rigorously analyze this problem and uncover two mechanis… ▽ More Recently, multimodal contrastive learning (MMCL) approaches, such as CLIP, have achieved a remarkable success in learning representations that are robust against distribution shift and generalize to new domains. Despite the empirical success, the mechanism behind learning such generalizable representations is not understood. In this work, we rigorously analyze this problem and uncover two mechanisms behind MMCL's robustness: \emph{intra-class contrasting}, which allows the model to learn features with a high variance, and \emph{inter-class feature sharing}, where annotated details in one class help learning other classes better. Both mechanisms prevent spurious features that are over-represented in the training data to overshadow the generalizable core features. This yields superior zero-shot classification accuracy under distribution shift. Furthermore, we theoretically demonstrate the benefits of using rich captions on robustness and explore the effect of annotating different types of details in the captions. We validate our theoretical findings through experiments, including a well-designed synthetic experiment and an experiment involving training CLIP models on MSCOCO/Conceptual Captions and evaluating them on shifted ImageNets. △ Less

Submitted 17 March, 2024; v1 submitted 7 October, 2023; originally announced October 2023.

arXiv:2310.03675 [pdf, other]

Hadamard Domain Training with Integers for Class Incremental Quantized Learning

Authors: Martin Schiemer, Clemens JS Schaefer, Jayden Parker Vap, Mark James Horeni, Yu Emma Wang, Juan Ye, Siddharth Joshi

Abstract: Continual learning is a desirable feature in many modern machine learning applications, which allows in-field adaptation and updating, ranging from accommodating distribution shift, to fine-tuning, and to learning new tasks. For applications with privacy and low latency requirements, the compute and memory demands imposed by continual learning can be cost-prohibitive for resource-constraint edge p… ▽ More Continual learning is a desirable feature in many modern machine learning applications, which allows in-field adaptation and updating, ranging from accommodating distribution shift, to fine-tuning, and to learning new tasks. For applications with privacy and low latency requirements, the compute and memory demands imposed by continual learning can be cost-prohibitive for resource-constraint edge platforms. Reducing computational precision through fully quantized training (FQT) simultaneously reduces memory footprint and increases compute efficiency for both training and inference. However, aggressive quantization especially integer FQT typically degrades model accuracy to unacceptable levels. In this paper, we propose a technique that leverages inexpensive Hadamard transforms to enable low-precision training with only integer matrix multiplications. We further determine which tensors need stochastic rounding and propose tiled matrix multiplication to enable low-bit width accumulators. We demonstrate the effectiveness of our technique on several human activity recognition datasets and CIFAR100 in a class incremental learning setting. We achieve less than 0.5% and 3% accuracy degradation while we quantize all matrix multiplications inputs down to 4-bits with 8-bit accumulators. △ Less

Submitted 5 October, 2023; originally announced October 2023.

arXiv:2309.15734 [pdf, other]

Synthetic Latent Fingerprint Generation Using Style Transfer

Authors: Amol S. Joshi, Ali Dabouei, Nasser Nasrabadi, Jeremy Dawson

Abstract: Limited data availability is a challenging problem in the latent fingerprint domain. Synthetically generated fingerprints are vital for training data-hungry neural network-based algorithms. Conventional methods distort clean fingerprints to generate synthetic latent fingerprints. We propose a simple and effective approach using style transfer and image blending to synthesize realistic latent finge… ▽ More Limited data availability is a challenging problem in the latent fingerprint domain. Synthetically generated fingerprints are vital for training data-hungry neural network-based algorithms. Conventional methods distort clean fingerprints to generate synthetic latent fingerprints. We propose a simple and effective approach using style transfer and image blending to synthesize realistic latent fingerprints. Our evaluation criteria and experiments demonstrate that the generated synthetic latent fingerprints preserve the identity information from the input contact-based fingerprints while possessing similar characteristics as real latent fingerprints. Additionally, we show that the generated fingerprints exhibit several qualities and styles, suggesting that the proposed method can generate multiple samples from a single fingerprint. △ Less

Submitted 27 September, 2023; originally announced September 2023.

arXiv:2309.12325 [pdf]

FUTURE-AI: International consensus guideline for trustworthy and deployable artificial intelligence in healthcare

Authors: Karim Lekadir, Aasa Feragen, Abdul Joseph Fofanah, Alejandro F Frangi, Alena Buyx, Anais Emelie, Andrea Lara, Antonio R Porras, An-Wen Chan, Arcadi Navarro, Ben Glocker, Benard O Botwe, Bishesh Khanal, Brigit Beger, Carol C Wu, Celia Cintas, Curtis P Langlotz, Daniel Rueckert, Deogratias Mzurikwao, Dimitrios I Fotiadis, Doszhan Zhussupov, Enzo Ferrante, Erik Meijering, Eva Weicken, Fabio A González , et al. (95 additional authors not shown)

Abstract: Despite major advances in artificial intelligence (AI) for medicine and healthcare, the deployment and adoption of AI technologies remain limited in real-world clinical practice. In recent years, concerns have been raised about the technical, clinical, ethical and legal risks associated with medical AI. To increase real world adoption, it is essential that medical AI tools are trusted and accepted… ▽ More Despite major advances in artificial intelligence (AI) for medicine and healthcare, the deployment and adoption of AI technologies remain limited in real-world clinical practice. In recent years, concerns have been raised about the technical, clinical, ethical and legal risks associated with medical AI. To increase real world adoption, it is essential that medical AI tools are trusted and accepted by patients, clinicians, health organisations and authorities. This work describes the FUTURE-AI guideline as the first international consensus framework for guiding the development and deployment of trustworthy AI tools in healthcare. The FUTURE-AI consortium was founded in 2021 and currently comprises 118 inter-disciplinary experts from 51 countries representing all continents, including AI scientists, clinicians, ethicists, and social scientists. Over a two-year period, the consortium defined guiding principles and best practices for trustworthy AI through an iterative process comprising an in-depth literature review, a modified Delphi survey, and online consensus meetings. The FUTURE-AI framework was established based on 6 guiding principles for trustworthy AI in healthcare, i.e. Fairness, Universality, Traceability, Usability, Robustness and Explainability. Through consensus, a set of 28 best practices were defined, addressing technical, clinical, legal and socio-ethical dimensions. The recommendations cover the entire lifecycle of medical AI, from design, development and validation to regulation, deployment, and monitoring. FUTURE-AI is a risk-informed, assumption-free guideline which provides a structured approach for constructing medical AI tools that will be trusted, deployed and adopted in real-world practice. Researchers are encouraged to take the recommendations into account in proof-of-concept stages to facilitate future translation towards clinical practice of medical AI. △ Less

Submitted 8 July, 2024; v1 submitted 11 August, 2023; originally announced September 2023.

ACM Class: I.2.0; I.4.0; I.5.0

arXiv:2309.05680 [pdf, other]

Evaluating Chatbots to Promote Users' Trust -- Practices and Open Problems

Authors: Biplav Srivastava, Kausik Lakkaraju, Tarmo Koppel, Vignesh Narayanan, Ashish Kundu, Sachindra Joshi

Abstract: Chatbots, the common moniker for collaborative assistants, are Artificial Intelligence (AI) software that enables people to naturally interact with them to get tasks done. Although chatbots have been studied since the dawn of AI, they have particularly caught the imagination of the public and businesses since the launch of easy-to-use and general-purpose Large Language Model-based chatbots like Ch… ▽ More Chatbots, the common moniker for collaborative assistants, are Artificial Intelligence (AI) software that enables people to naturally interact with them to get tasks done. Although chatbots have been studied since the dawn of AI, they have particularly caught the imagination of the public and businesses since the launch of easy-to-use and general-purpose Large Language Model-based chatbots like ChatGPT. As businesses look towards chatbots as a potential technology to engage users, who may be end customers, suppliers, or even their own employees, proper testing of chatbots is important to address and mitigate issues of trust related to service or product performance, user satisfaction and long-term unintended consequences for society. This paper reviews current practices for chatbot testing, identifies gaps as open problems in pursuit of user trust, and outlines a path forward. △ Less

Submitted 13 September, 2023; v1 submitted 9 September, 2023; originally announced September 2023.

arXiv:2306.15124 [pdf, other]

Identifying and Consolidating Knowledge Engineering Requirements

Authors: Bradley P. Allen, Filip Ilievski, Saurav Joshi

Abstract: Knowledge engineering is the process of creating and maintaining knowledge-producing systems. Throughout the history of computer science and AI, knowledge engineering workflows have been widely used because high-quality knowledge is assumed to be crucial for reliable intelligent agents. However, the landscape of knowledge engineering has changed, presenting four challenges: unaddressed stakeholder… ▽ More Knowledge engineering is the process of creating and maintaining knowledge-producing systems. Throughout the history of computer science and AI, knowledge engineering workflows have been widely used because high-quality knowledge is assumed to be crucial for reliable intelligent agents. However, the landscape of knowledge engineering has changed, presenting four challenges: unaddressed stakeholder requirements, mismatched technologies, adoption barriers for new organizations, and misalignment with software engineering practices. In this paper, we propose to address these challenges by developing a reference architecture using a mainstream software methodology. By studying the requirements of different stakeholders and eras, we identify 23 essential quality attributes for evaluating reference architectures. We assess three candidate architectures from recent literature based on these attributes. Finally, we discuss the next steps towards a comprehensive reference architecture, including prioritizing quality attributes, integrating components with complementary strengths, and supporting missing socio-technical requirements. As this endeavor requires a collaborative effort, we invite all knowledge engineering researchers and practitioners to join us. △ Less

Submitted 26 June, 2023; originally announced June 2023.

arXiv:2306.11957 [pdf, other]

Towards Mitigating Spurious Correlations in the Wild: A Benchmark and a more Realistic Dataset

Authors: Siddharth Joshi, Yu Yang, Yihao Xue, Wenhan Yang, Baharan Mirzasoleiman

Abstract: Deep neural networks often exploit non-predictive features that are spuriously correlated with class labels, leading to poor performance on groups of examples without such features. Despite the growing body of recent works on remedying spurious correlations, the lack of a standardized benchmark hinders reproducible evaluation and comparison of the proposed solutions. To address this, we present Sp… ▽ More Deep neural networks often exploit non-predictive features that are spuriously correlated with class labels, leading to poor performance on groups of examples without such features. Despite the growing body of recent works on remedying spurious correlations, the lack of a standardized benchmark hinders reproducible evaluation and comparison of the proposed solutions. To address this, we present SpuCo, a python package with modular implementations of state-of-the-art solutions enabling easy and reproducible evaluation of current methods. Using SpuCo, we demonstrate the limitations of existing datasets and evaluation schemes in validating the learning of predictive features over spurious ones. To overcome these limitations, we propose two new vision datasets: (1) SpuCoMNIST, a synthetic dataset that enables simulating the effect of real world data properties e.g. difficulty of learning spurious feature, as well as noise in the labels and features; (2) SpuCoAnimals, a large-scale dataset curated from ImageNet that captures spurious correlations in the wild much more closely than existing datasets. These contributions highlight the shortcomings of current methods and provide a direction for future research in tackling spurious correlations. SpuCo, containing the benchmark and datasets, can be found at https://github.com/BigML-CS-UCLA/SpuCo, with detailed documentation available at https://spuco.readthedocs.io/en/latest/. △ Less

Submitted 29 September, 2023; v1 submitted 20 June, 2023; originally announced June 2023.

Comments: Package: https://github.com/BigML-CS-UCLA/SpuCo

arXiv:2306.04879 [pdf, other]

Augmenting Hessians with Inter-Layer Dependencies for Mixed-Precision Post-Training Quantization

Authors: Clemens JS Schaefer, Navid Lambert-Shirzad, Xiaofan Zhang, Chiachen Chou, Tom Jablin, Jian Li, Elfie Guo, Caitlin Stanton, Siddharth Joshi, Yu Emma Wang

Abstract: Efficiently serving neural network models with low latency is becoming more challenging due to increasing model complexity and parameter count. Model quantization offers a solution which simultaneously reduces memory footprint and compute requirements. However, aggressive quantization may lead to an unacceptable loss in model accuracy owing to differences in sensitivity to numerical imperfection a… ▽ More Efficiently serving neural network models with low latency is becoming more challenging due to increasing model complexity and parameter count. Model quantization offers a solution which simultaneously reduces memory footprint and compute requirements. However, aggressive quantization may lead to an unacceptable loss in model accuracy owing to differences in sensitivity to numerical imperfection across different layers in the model. To address this challenge, we propose a mixed-precision post training quantization (PTQ) approach that assigns different numerical precisions to tensors in a network based on their specific needs, for a reduced memory footprint and improved latency while preserving model accuracy. Previous works rely on layer-wise Hessian information to determine numerical precision, but as we demonstrate, Hessian estimation is typically insufficient in determining an effective ordering of layer sensitivities. We address this by augmenting the estimated Hessian with additional information to capture inter-layer dependencies. We demonstrate that this consistently improves PTQ performance along the accuracy-latency Pareto frontier across multiple models. Our method combines second-order information and inter-layer dependencies to guide a bisection search, finding quantization configurations within a user-configurable model accuracy degradation range. We evaluate the effectiveness of our method on the ResNet50, MobileNetV2, and BERT models. Our experiments demonstrate latency reductions compared to a 16-bit baseline of $25.48\%$, $21.69\%$, and $33.28\%$ respectively, while maintaining model accuracy to within $99.99\%$ of the baseline model. △ Less

Submitted 7 June, 2023; originally announced June 2023.

arXiv:2306.00248 [pdf, other]

TransAct: Transformer-based Realtime User Action Model for Recommendation at Pinterest

Authors: Xue Xia, Pong Eksombatchai, Nikil Pancha, Dhruvil Deven Badani, Po-Wei Wang, Neng Gu, Saurabh Vishwas Joshi, Nazanin Farahpour, Zhiyuan Zhang, Andrew Zhai

Abstract: Sequential models that encode user activity for next action prediction have become a popular design choice for building web-scale personalized recommendation systems. Traditional methods of sequential recommendation either utilize end-to-end learning on realtime user actions, or learn user representations separately in an offline batch-generated manner. This paper (1) presents Pinterest's ranking… ▽ More Sequential models that encode user activity for next action prediction have become a popular design choice for building web-scale personalized recommendation systems. Traditional methods of sequential recommendation either utilize end-to-end learning on realtime user actions, or learn user representations separately in an offline batch-generated manner. This paper (1) presents Pinterest's ranking architecture for Homefeed, our personalized recommendation product and the largest engagement surface; (2) proposes TransAct, a sequential model that extracts users' short-term preferences from their realtime activities; (3) describes our hybrid approach to ranking, which combines end-to-end sequential modeling via TransAct with batch-generated user embeddings. The hybrid approach allows us to combine the advantages of responsiveness from learning directly on realtime user activity with the cost-effectiveness of batch user representations learned over a longer time period. We describe the results of ablation studies, the challenges we faced during productionization, and the outcome of an online A/B experiment, which validates the effectiveness of our hybrid ranking model. We further demonstrate the effectiveness of TransAct on other surfaces such as contextual recommendations and search. Our model has been deployed to production in Homefeed, Related Pins, Notifications, and Search at Pinterest. △ Less

Submitted 31 May, 2023; originally announced June 2023.

Comments: \c{opyright} {ACM} {2023}. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in KDD'23, http://dx.doi.org/10.1145/3580305.3599918

Showing 1–50 of 240 results for author: Joshi, S