-
All Neural Low-latency Directional Speech Extraction
Authors:
Ashutosh Pandey,
Sanha Lee,
Juan Azcarreta,
Daniel Wong,
Buye Xu
Abstract:
We introduce a novel all-neural model for low-latency directional speech extraction. The model uses direction of arrival (DOA) embeddings from a predefined spatial grid, which are transformed and fused into a recurrent neural network based speech extraction model. This process enables the model to effectively extract speech from a specified DOA. Unlike previous methods that relied on hand-crafted directional features, the proposed model trains DOA embeddings from scratch using speech enhancement loss, making it suitable for low-latency scenarios. Additionally, it operates at a high frame rate, taking in a DOA with each input frame, which lets it adapt quickly to changing scenes in highly dynamic real-world scenarios. We provide an extensive evaluation to demonstrate the model's efficacy in directional speech extraction, its robustness to DOA mismatch, and its capability to quickly adapt to abrupt changes in DOA.
Submitted 5 July, 2024;
originally announced July 2024.
-
Exploring Algorithmic Solutions for the Independent Roman Domination Problem in Graphs
Authors:
Kaustav Paul,
Ankit Sharma,
Arti Pandey
Abstract:
Given a graph $G=(V,E)$, a function $f:V\to \{0,1,2\}$ is said to be a \emph{Roman Dominating function} if for every $v\in V$ with $f(v)=0$, there exists a vertex $u\in N(v)$ such that $f(u)=2$. A Roman Dominating function $f$ is said to be an \emph{Independent Roman Dominating function} (or IRDF), if $V_1\cup V_2$ forms an independent set, where $V_i=\{v\in V~\vert~f(v)=i\}$, for $i\in \{0,1,2\}$. The total weight of $f$ is equal to $\sum_{v\in V} f(v)$, and is denoted as $w(f)$. The \emph{Independent Roman Domination Number} of $G$, denoted by $i_R(G)$, is defined as $\min\{w(f)~\vert~f \text{ is an IRDF of } G\}$. For a given graph $G$, the problem of computing $i_R(G)$ is defined as the \emph{Minimum Independent Roman Domination problem}. The problem is already known to be NP-hard for bipartite graphs. In this paper, we further study the algorithmic complexity of the problem.
We propose a polynomial-time algorithm to solve the Minimum Independent Roman Domination problem for distance-hereditary graphs, split graphs, and $P_4$-sparse graphs.
Submitted 12 July, 2024; v1 submitted 4 July, 2024;
originally announced July 2024.
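The IRDF definition in the abstract above can be checked mechanically on small graphs. The following is a minimal sketch, not the paper's polynomial-time algorithm: it tests the two defining conditions directly and finds $i_R(G)$ by exhaustive search, which is exponential in the number of vertices. The function names and the adjacency-dictionary representation are assumptions made for illustration.

```python
from itertools import product


def is_irdf(adj, f):
    """Check whether labelling f (vertex -> 0, 1 or 2) is an IRDF.

    adj maps each vertex to the set of its neighbours (hypothetical
    representation chosen for this sketch).
    """
    for v, neighbours in adj.items():
        # Roman condition: a vertex labelled 0 needs a neighbour labelled 2.
        if f[v] == 0 and not any(f[u] == 2 for u in neighbours):
            return False
        # Independence: V_1 union V_2 contains no two adjacent vertices.
        if f[v] > 0 and any(f[u] > 0 for u in neighbours):
            return False
    return True


def independent_roman_domination_number(adj):
    """Brute-force i_R(G): try all 3^n labellings and keep the lightest IRDF."""
    vertices = sorted(adj)
    best = None
    for labels in product((0, 1, 2), repeat=len(vertices)):
        f = dict(zip(vertices, labels))
        if is_irdf(adj, f):
            weight = sum(labels)
            if best is None or weight < best:
                best = weight
    return best


# Path graphs on 3 and 4 vertices.
P3 = {0: {1}, 1: {0, 2}, 2: {1}}
P4 = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print(independent_roman_domination_number(P3))
print(independent_roman_domination_number(P4))
```

An IRDF always exists (label a maximal independent set with 2 and everything else with 0), so the search never returns `None` on a non-empty graph.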
-
Algorithmic Results for Weak Roman Domination Problem in Graphs
Authors:
Kaustav Paul,
Ankit Sharma,
Arti Pandey
Abstract:
Consider a graph $G = (V, E)$ and a function $f: V \rightarrow \{0, 1, 2\}$. A vertex $u$ with $f(u)=0$ is defined as \emph{undefended} by $f$ if it lacks adjacency to any vertex with a positive $f$-value. The function $f$ is said to be a \emph{Weak Roman Dominating function} (WRD function) if, for every vertex $u$ with $f(u) = 0$, there exists a neighbour $v$ of $u$ with $f(v) > 0$ and a new function $f': V \rightarrow \{0, 1, 2\}$ defined in the following way: $f'(u) = 1$, $f'(v) = f(v) - 1$, and $f'(w) = f(w)$, for all vertices $w$ in $V\setminus\{u,v\}$; so that no vertices are undefended by $f'$. The total weight of $f$ is equal to $\sum_{v\in V} f(v)$, and is denoted as $w(f)$. The \emph{Weak Roman Domination Number}, denoted by $\gamma_r(G)$, is $\min\{w(f)~\vert~f \text{ is a WRD function of } G\}$. For a given graph $G$, the problem of finding a WRD function of weight $\gamma_r(G)$ is defined as the \emph{Minimum Weak Roman domination problem}. The problem is already known to be NP-hard for bipartite and chordal graphs. In this paper, we further study the algorithmic complexity of the problem. We prove the NP-hardness of the problem for star convex bipartite graphs and comb convex bipartite graphs, which are subclasses of bipartite graphs. In addition, we show that for the bounded degree star convex bipartite graphs, the problem is efficiently solvable. We also prove the NP-hardness of the problem for split graphs, a subclass of chordal graphs. On the positive side, we give polynomial-time algorithms to solve the problem for $P_4$-sparse graphs. We also present some approximation results.
Submitted 4 July, 2024;
originally announced July 2024.
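The WRD definition in the abstract above involves a "move one unit from $v$ to $u$" step that is easy to misread, so a direct check may help. This is a minimal sketch under assumed names and an assumed adjacency-dictionary representation; the exhaustive search is only meant for tiny graphs and is not one of the paper's algorithms.

```python
from itertools import product


def has_undefended(adj, f):
    """True if some vertex labelled 0 has no neighbour with a positive label."""
    return any(
        f[u] == 0 and all(f[w] == 0 for w in adj[u])
        for u in adj
    )


def is_wrd(adj, f):
    """Check the Weak Roman Dominating condition for labelling f."""
    for u in adj:
        if f[u] != 0:
            continue
        defended = False
        for v in adj[u]:
            if f[v] > 0:
                # Build f': move one unit of weight from v to u.
                moved = dict(f)
                moved[u] = 1
                moved[v] = f[v] - 1
                if not has_undefended(adj, moved):
                    defended = True
                    break
        if not defended:
            return False
    return True


def weak_roman_domination_number(adj):
    """Brute-force gamma_r(G) over all 3^n labellings."""
    vertices = sorted(adj)
    weights = [
        sum(labels)
        for labels in product((0, 1, 2), repeat=len(vertices))
        if is_wrd(adj, dict(zip(vertices, labels)))
    ]
    return min(weights)


P3 = {0: {1}, 1: {0, 2}, 2: {1}}
P4 = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print(weak_roman_domination_number(P3), weak_roman_domination_number(P4))
```

On the 3-vertex path, labelling only the middle vertex with 1 fails: after the move to one end, the other end is left undefended.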
-
SONIC: Connect the Unconnected via FM Radio & SMS
Authors:
Ayush Pandey,
Rohail Asim,
Khalid Mengal,
Matteo Varvello,
Yasir Zaki
Abstract:
As of 2022, about 2.78 billion people in developing countries do not have access to the Internet. Lack of Internet access hinders economic growth, educational opportunities, and access to information and services. Recent initiatives to ``connect the unconnected'' have either failed (project Loon and Aquila) or are characterized by exorbitant costs (Starlink and similar), which are unsustainable for users in developing regions. This paper proposes SONIC, a novel connectivity solution that repurposes a widespread communication infrastructure (AM/FM radio) to deliver access to pre-rendered webpages. Our rationale is threefold: 1) the radio network is widely accessible -- currently reaching 70% of the world -- even in developing countries, 2) unused frequencies are highly available, 3) while data over sound can be slow, when combined with the radio network, it takes advantage of its broadcast nature, efficiently reaching a large number of users. We have designed and built a proof of concept of SONIC which shows encouraging initial results.
Submitted 1 July, 2024;
originally announced July 2024.
-
Modelling financial volume curves with hierarchical Poisson processes
Authors:
Creighton Heaukulani,
Abhinav Pandey,
Lancelot F. James
Abstract:
Modeling the trading volume curves of financial instruments throughout the day is of key interest in financial trading applications. Predictions of these so-called volume profiles guide trade execution strategies; for example, a common strategy is to trade a desired quantity across many orders in line with the expected volume curve throughout the day so as not to impact the price of the instrument. The volume curves (for each day) are naturally grouped by stock and can be further gathered into higher-level groupings, such as by industry. In order to model such admixtures of volume curves, we introduce a hierarchical Poisson process model for the intensity functions of admixtures of inhomogeneous Poisson processes, which represent the trading times of the stock throughout the day. The model is based on the hierarchical Dirichlet process, and an efficient Markov Chain Monte Carlo (MCMC) algorithm is derived following the slice sampling framework for Bayesian nonparametric mixture models. We demonstrate the method on datasets of different stocks from the Trade and Quote repository maintained by Wharton Research Data Services, including the most liquid stock on the NASDAQ stock exchange, Apple, which illustrates the scalability of the approach.
Submitted 1 June, 2024;
originally announced June 2024.
-
Unlearning Climate Misinformation in Large Language Models
Authors:
Michael Fore,
Simranjit Singh,
Chaehong Lee,
Amritanshu Pandey,
Antonios Anastasopoulos,
Dimitrios Stamoulis
Abstract:
Misinformation regarding climate change is a key roadblock in addressing one of the most serious threats to humanity. This paper investigates factual accuracy in large language models (LLMs) regarding climate information. Using true/false labeled Q&A data for fine-tuning and evaluating LLMs on climate-related claims, we compare open-source models, assessing their ability to generate truthful responses to climate change questions. We investigate the detectability of models intentionally poisoned with false climate information, finding that such poisoning may not affect the accuracy of a model's responses in other domains. Furthermore, we compare the effectiveness of unlearning algorithms, fine-tuning, and Retrieval-Augmented Generation (RAG) for factually grounding LLMs on climate change topics. Our evaluation reveals that unlearning algorithms can be effective for nuanced conceptual claims, despite previous findings suggesting their inefficacy in privacy contexts. These insights aim to guide the development of more factually reliable LLMs and highlight the need for additional work to secure LLMs against misinformation attacks.
Submitted 29 May, 2024;
originally announced May 2024.
-
PVF (Parameter Vulnerability Factor): A Scalable Metric for Understanding AI Vulnerability Against SDCs in Model Parameters
Authors:
Xun Jiao,
Fred Lin,
Harish D. Dixit,
Joel Coburn,
Abhinav Pandey,
Han Wang,
Venkat Ramesh,
Jianyu Huang,
Wang Xu,
Daniel Moore,
Sriram Sankar
Abstract:
Reliability of AI systems is a fundamental concern for the successful deployment and widespread adoption of AI technologies. Unfortunately, the escalating complexity and heterogeneity of AI hardware systems make them increasingly susceptible to hardware faults, e.g., silent data corruptions (SDC), that can potentially corrupt model parameters. When this occurs during AI inference/servicing, it can potentially lead to incorrect or degraded model output for users, ultimately affecting the quality and reliability of AI services. In light of the escalating threat, it is crucial to address key questions: How vulnerable are AI models to parameter corruptions, and how do different components (such as modules, layers) of the models exhibit varying vulnerabilities to parameter corruptions? To systematically address this question, we propose a novel quantitative metric, Parameter Vulnerability Factor (PVF), inspired by architectural vulnerability factor (AVF) in computer architecture community, aiming to standardize the quantification of AI model vulnerability against parameter corruptions. We define a model parameter's PVF as the probability that a corruption in that particular model parameter will result in an incorrect output. In this paper, we present several use cases on applying PVF to three types of tasks/models during inference -- recommendation (DLRM), vision classification (CNN), and text classification (BERT), while presenting an in-depth vulnerability analysis on DLRM. PVF can provide pivotal insights to AI hardware designers in balancing the tradeoff between fault protection and performance/efficiency such as mapping vulnerable AI parameter components to well-protected hardware modules. PVF metric is applicable to any AI model and has a potential to help unify and standardize AI vulnerability/resilience evaluation practice.
Submitted 11 June, 2024; v1 submitted 2 May, 2024;
originally announced May 2024.
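The PVF definition in the abstract above (the probability that a corruption in a given parameter produces an incorrect output) lends itself to a Monte Carlo estimate. The sketch below is illustrative only and rests on several assumptions that are not taken from the paper: the fault model is a single random bit flip in a float32 value, the "model" is a toy linear classifier, and an output counts as incorrect when it differs from the fault-free prediction. All function names are hypothetical.

```python
import random
import struct


def flip_bit(value, bit):
    """Flip one bit (0..31) of value's float32 representation."""
    (as_int,) = struct.unpack("<I", struct.pack("<f", value))
    (corrupted,) = struct.unpack("<f", struct.pack("<I", as_int ^ (1 << bit)))
    return corrupted


def predict(weights, x):
    """Toy binary classifier: sign of the dot product (assumed model)."""
    score = sum(w * xi for w, xi in zip(weights, x))
    return 1 if score > 0 else 0


def estimate_pvf(weights, index, inputs, trials, rng):
    """Estimate the PVF of weights[index] by random single-bit-flip injection.

    Returns the fraction of trials in which the corrupted model's
    prediction differs from the fault-free prediction.
    """
    mismatches = 0
    for _ in range(trials):
        x = rng.choice(inputs)
        bit = rng.randrange(32)
        corrupted = list(weights)
        corrupted[index] = flip_bit(weights[index], bit)
        if predict(corrupted, x) != predict(weights, x):
            mismatches += 1
    return mismatches / trials


rng = random.Random(0)
weights = [0.5, -0.25, 1.0]
inputs = [[rng.uniform(-1, 1) for _ in weights] for _ in range(50)]
for i in range(len(weights)):
    print(i, estimate_pvf(weights, i, inputs, trials=2000, rng=rng))
```

Comparing the per-parameter estimates is the point of such a metric: parameters with a higher estimated PVF would be the ones to map to better-protected hardware, as the abstract suggests.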
-
RE-GrievanceAssist: Enhancing Customer Experience through ML-Powered Complaint Management
Authors:
Venkatesh C,
Harshit Oberoi,
Anurag Kumar Pandey,
Anil Goyal,
Nikhil Sikka
Abstract:
In recent years, digital platform companies have faced increasing challenges in managing customer complaints, driven by widespread consumer adoption. This paper introduces an end-to-end pipeline, named RE-GrievanceAssist, designed specifically for real estate customer complaint management. The pipeline consists of three key components: i) a response/no-response ML model using TF-IDF vectorization and an XGBoost classifier; ii) a user-type classifier using fastText; iii) an issue/sub-issue classifier using TF-IDF vectorization and an XGBoost classifier. The pipeline has been deployed as a batch job in Databricks, resulting in a 40% reduction in overall manual effort and a monthly cost reduction of Rs 1,50,000 since August 2023.
Submitted 29 April, 2024;
originally announced April 2024.
-
RE-RFME: Real-Estate RFME Model for customer segmentation
Authors:
Anurag Kumar Pandey,
Anil Goyal,
Nikhil Sikka
Abstract:
Marketing is one of the high-cost activities for any online platform. With the increase in the number of customers, it is crucial to understand customers based on their dynamic behaviors to design effective marketing strategies. Customer segmentation is a widely used approach to group customers into different categories and design the marketing strategy targeting each group individually. Therefore, in this paper, we propose an end-to-end pipeline RE-RFME for segmenting customers into 4 groups: high value, promising, need attention, and need activation. Concretely, we propose a novel RFME (Recency, Frequency, Monetary and Engagement) model to track behavioral features of customers and segment them into different categories. Finally, we train the K-means clustering algorithm to cluster the user into one of the 4 categories. We show the effectiveness of the proposed approach on real-world Housing.com datasets for both website and mobile application users.
Submitted 26 April, 2024;
originally announced April 2024.
-
Anomaly Detection in Power Grids via Context-Agnostic Learning
Authors:
SangWoo Park,
Amritanshu Pandey
Abstract:
An important tool grid operators use to safeguard against failures, whether naturally occurring or malicious, involves detecting anomalies in the power system SCADA data. In this paper, we aim to solve a real-time anomaly detection problem. Given time-series measurement values coming from a fixed set of sensors on the grid, can we identify anomalies in the network topology or measurement data? Existing methods, primarily optimization-based, mostly use only a single snapshot of the measurement values and do not scale well with the network size. Recent data-driven ML techniques have shown promise by using a combination of current and historical data for anomaly detection but generally do not consider physical attributes like the impact of topology or load/generation changes on sensor measurements and thus cannot accommodate regular context-variability in the historical data. To address this gap, we propose a novel context-aware anomaly detection algorithm, GridCAL, that considers the effect of regular topology and load/generation changes. This algorithm converts the real-time power flow measurements to context-agnostic values, which allows us to analyze measurement coming from different grid contexts in an aggregate fashion, enabling us to derive a unified statistical model that becomes the basis of anomaly detection. Through numerical simulations on networks up to 2383 nodes, we show that our approach is accurate, outperforming state-of-the-art approaches, and is computationally efficient.
Submitted 11 April, 2024;
originally announced April 2024.
-
Towards Decoupling Frontend Enhancement and Backend Recognition in Monaural Robust ASR
Authors:
Yufeng Yang,
Ashutosh Pandey,
DeLiang Wang
Abstract:
It has been shown that the intelligibility of noisy speech can be improved by speech enhancement (SE) algorithms. However, monaural SE has not been established as an effective frontend for automatic speech recognition (ASR) in noisy conditions compared to an ASR model trained on noisy speech directly. The divide between SE and ASR impedes the progress of robust ASR systems, especially as SE has made major advances in recent years. This paper focuses on eliminating this divide with two enhancement models: a time-domain ARN (attentive recurrent network) and a time-frequency domain CrossNet. The proposed systems fully decouple frontend enhancement from a backend ASR model trained only on clean speech. Results on the WSJ, CHiME-2, LibriSpeech, and CHiME-4 corpora demonstrate that ARN and CrossNet enhanced speech both translate to improved ASR results in noisy and reverberant environments, and generalize well to real acoustic scenarios. The proposed system outperforms the baselines trained on corrupted speech directly. Furthermore, it reaches a $5.57\%$ word error rate (WER) on CHiME-2, a $28.4\%$ relative reduction over the previous best, and achieves $3.32/4.44\%$ WER on single-channel CHiME-4 simulated/real test data without training on CHiME-4.
Submitted 10 March, 2024;
originally announced March 2024.
-
LISR: Learning Linear 3D Implicit Surface Representation Using Compactly Supported Radial Basis Functions
Authors:
Atharva Pandey,
Vishal Yadav,
Rajendra Nagar,
Santanu Chaudhury
Abstract:
Implicit 3D surface reconstruction of an object from its partial and noisy 3D point cloud scan is a classical geometry processing and 3D computer vision problem. In the literature, various 3D shape representations have been developed, differing in memory efficiency and shape retrieval effectiveness, such as volumetric, parametric, and implicit surfaces. Radial basis functions provide memory-efficient parameterization of the implicit surface. However, we show that training a neural network using the mean squared error between the ground-truth implicit surface and the linear basis-based implicit surfaces does not converge to the global solution. In this work, we propose locally supported compact radial basis functions for a linear representation of the implicit surface. This representation enables us to generate 3D shapes with arbitrary topologies at any resolution due to their continuous nature. We then propose a neural network architecture for learning the linear implicit shape representation of the 3D surface of an object. We learn linear implicit shapes within a supervised learning framework using ground truth Signed-Distance Field (SDF) data for guidance. The classical strategies face difficulties in finding linear implicit shapes from a given 3D point cloud due to numerical issues (they require inverting a large matrix) in basis and query point selection. The proposed approach achieves a better Chamfer distance than, and an F-score comparable to, the state-of-the-art approach on the benchmark dataset. We also show the effectiveness of the proposed approach by using it for the 3D shape completion task.
Submitted 11 February, 2024;
originally announced February 2024.
-
Community detection in the hypergraph stochastic block model and reconstruction on hypertrees
Authors:
Yuzhou Gu,
Aaradhya Pandey
Abstract:
We study the weak recovery problem on the $r$-uniform hypergraph stochastic block model ($r$-HSBM) with two balanced communities. In this model, $n$ vertices are randomly divided into two communities, and size-$r$ hyperedges are added randomly depending on whether all vertices in the hyperedge are in the same community. The goal of weak recovery is to recover a non-trivial fraction of the communities given the hypergraph. Pal and Zhu (2021); Stephan and Zhu (2022) established that weak recovery is always possible above a natural threshold called the Kesten-Stigum (KS) threshold. For assortative models (i.e., monochromatic hyperedges are preferred), Gu and Polyanskiy (2023) proved that the KS threshold is tight if $r\le 4$ or the expected degree $d$ is small. For other cases, the tightness of the KS threshold remained open.
In this paper we determine the tightness of the KS threshold for a wide range of parameters. We prove that for $r\le 6$ and $d$ large enough, the KS threshold is tight. This shows that there is no information-computation gap in this regime and partially confirms a conjecture of Angelini et al. (2015). On the other hand, we show that for $r\ge 5$, there exist parameters for which the KS threshold is not tight. In particular, for $r\ge 7$, the KS threshold is not tight if the model is disassortative (i.e., polychromatic hyperedges are preferred) or $d$ is large enough. This provides more evidence supporting the existence of an information-computation gap in these cases.
Furthermore, we establish asymptotic bounds on the weak recovery threshold for fixed $r$ and large $d$. We also obtain a number of results regarding the broadcasting on hypertrees (BOHT) model, including the asymptotics of the reconstruction threshold for $r\ge 7$ and impossibility of robust reconstruction at criticality.
Submitted 11 June, 2024; v1 submitted 9 February, 2024;
originally announced February 2024.
-
Comparing Spectral Bias and Robustness For Two-Layer Neural Networks: SGD vs Adaptive Random Fourier Features
Authors:
Aku Kammonen,
Lisi Liang,
Anamika Pandey,
Raúl Tempone
Abstract:
We present experimental results highlighting two key differences resulting from the choice of training algorithm for two-layer neural networks. The spectral bias of neural networks is well known, while the spectral bias dependence on the choice of training algorithm is less studied. Our experiments demonstrate that an adaptive random Fourier features algorithm (ARFF) can yield a spectral bias closer to zero compared to the stochastic gradient descent optimizer (SGD). Additionally, we train two identically structured classifiers, employing SGD and ARFF, to the same accuracy levels and empirically assess their robustness against adversarial noise attacks.
Submitted 31 January, 2024;
originally announced February 2024.
-
On the Importance of Neural Wiener Filter for Resource Efficient Multichannel Speech Enhancement
Authors:
Tsun-An Hsieh,
Jacob Donley,
Daniel Wong,
Buye Xu,
Ashutosh Pandey
Abstract:
We introduce a time-domain framework for efficient multichannel speech enhancement, emphasizing low latency and computational efficiency. This framework incorporates two compact deep neural networks (DNNs) surrounding a multichannel neural Wiener filter (NWF). The first DNN enhances the speech signal to estimate NWF coefficients, while the second DNN refines the output from the NWF. The NWF, while conceptually similar to the traditional frequency-domain Wiener filter, undergoes a training process optimized for low-latency speech enhancement, involving fine-tuning of both analysis and synthesis transforms. Our research results illustrate that the NWF output, having minimal nonlinear distortions, attains performance levels akin to those of the first DNN, deviating from conventional Wiener filter paradigms. Training all components jointly outperforms sequential training, despite its simplicity. Consequently, this framework achieves superior performance with fewer parameters and reduced computational demands, making it a compelling solution for resource-efficient multichannel speech enhancement.
Submitted 15 January, 2024;
originally announced January 2024.
-
Decoupled Spatial and Temporal Processing for Resource Efficient Multichannel Speech Enhancement
Authors:
Ashutosh Pandey,
Buye Xu
Abstract:
We present a novel model designed for resource-efficient multichannel speech enhancement in the time domain, with a focus on low latency, a lightweight design, and low computational requirements. The proposed model incorporates explicit spatial and temporal processing within deep neural network (DNN) layers. Inspired by frequency-dependent multichannel filtering, our spatial filtering process applies multiple trainable filters to each hidden unit across the spatial dimension, resulting in a multichannel output. The temporal processing is applied over a single-channel output stream from the spatial processing using a Long Short-Term Memory (LSTM) network. The output from the temporal processing stage is then further integrated into the spatial dimension through elementwise multiplication. This explicit separation of spatial and temporal processing results in a resource-efficient network design. Empirical findings from our experiments show that our proposed model significantly outperforms robust baseline models while demanding far fewer parameters and computations, and achieves an ultra-low algorithmic latency of just 2 milliseconds.
Submitted 15 January, 2024;
originally announced January 2024.
-
Practical Bias Mitigation through Proxy Sensitive Attribute Label Generation
Authors:
Bhushan Chaudhary,
Anubha Pandey,
Deepak Bhatt,
Darshika Tiwari
Abstract:
Addressing bias in the trained machine learning system often requires access to sensitive attributes. In practice, these attributes are not available either due to legal and policy regulations or data unavailability for a given demographic. Existing bias mitigation algorithms are limited in their applicability to real-world scenarios as they require access to sensitive attributes to achieve fairness. In this research work, we aim to address this bottleneck through our proposed unsupervised proxy-sensitive attribute label generation technique. Towards this end, we propose a two-stage approach of unsupervised embedding generation followed by clustering to obtain proxy-sensitive labels. The efficacy of our work relies on the assumption that bias propagates through non-sensitive attributes that are correlated to the sensitive attributes and, when mapped to the high dimensional latent space, produces clusters of different demographic groups that exist in the data. Experimental results demonstrate that bias mitigation using existing algorithms such as Fair Mixup and Adversarial Debiasing yields comparable results on derived proxy labels when compared against using true sensitive attributes.
Submitted 26 December, 2023;
originally announced December 2023.
-
GroupMixNorm Layer for Learning Fair Models
Authors:
Anubha Pandey,
Aditi Rai,
Maneet Singh,
Deepak Bhatt,
Tanmoy Bhowmik
Abstract:
Recent research has identified discriminatory behavior of automated prediction algorithms towards groups identified on specific protected attributes (e.g., gender, ethnicity, age group, etc.). When deployed in real-world scenarios, such techniques may demonstrate biased predictions resulting in unfair outcomes. Recent literature has witnessed algorithms for mitigating such biased behavior mostly by adding convex surrogates of fairness metrics such as demographic parity or equalized odds in the loss function, which are often not easy to estimate. This research proposes a novel in-processing based GroupMixNorm layer for mitigating bias from deep learning models. The GroupMixNorm layer probabilistically mixes group-level feature statistics of samples across different groups based on the protected attribute. The proposed method improves upon several fairness metrics with minimal impact on overall accuracy. Analysis on benchmark tabular and image datasets demonstrates the efficacy of the proposed method in achieving state-of-the-art performance. Further, the experimental analysis also suggests the robustness of the GroupMixNorm layer against new protected attributes during inference and its utility in eliminating bias from a pre-trained network.
Submitted 19 December, 2023;
originally announced December 2023.
-
PyPose v0.6: The Imperative Programming Interface for Robotics
Authors:
Zitong Zhan,
Xiangfu Li,
Qihang Li,
Haonan He,
Abhinav Pandey,
Haitao Xiao,
Yangmengfei Xu,
Xiangyu Chen,
Kuan Xu,
Kun Cao,
Zhipeng Zhao,
Zihan Wang,
Huan Xu,
Zihang Fang,
Yutian Chen,
Wentao Wang,
Xu Fang,
Yi Du,
Tianhao Wu,
Xiao Lin,
Yuheng Qiu,
Fan Yang,
Jingnan Shi,
Shaoshu Su,
Yiren Lu
, et al. (11 additional authors not shown)
Abstract:
PyPose is an open-source library for robot learning. It combines a learning-based approach with physics-based optimization, which enables seamless end-to-end robot learning. It has been used in many tasks due to its meticulously designed application programming interface (API) and efficient implementation. Since its initial launch in early 2022, PyPose has experienced significant enhancements, incorporating a wide variety of new features into its platform. To satisfy the growing demand for understanding and utilizing the library and to reduce the learning curve for new users, we present the fundamental design principle of the imperative programming interface, and showcase the flexible usage of diverse functionalities and modules using an extremely simple Dubins car example. We also demonstrate that PyPose can be easily used to navigate a real quadruped robot with a few lines of code.
Submitted 22 September, 2023;
originally announced September 2023.
-
Effect of Attention and Self-Supervised Speech Embeddings on Non-Semantic Speech Tasks
Authors:
Payal Mohapatra,
Akash Pandey,
Yueyuan Sui,
Qi Zhu
Abstract:
Human emotion understanding is pivotal in making conversational technology mainstream. We view speech emotion understanding as a perception task, which is a more realistic setting. With varying contexts (languages, demographics, etc.), different shares of people perceive the same speech segment as a non-unanimous emotion. As part of the ACM Multimedia 2023 Computational Paralinguistics ChallengE (ComParE) in the EMotion Share track, we leverage their rich dataset of multilingual speakers and multi-label regression target of 'emotion share' or perception of that emotion. We demonstrate that the training scheme of different foundation models dictates their effectiveness for tasks beyond speech recognition, especially for non-semantic speech tasks like emotion understanding. This is a very complex task due to multilingual speakers, variability in the target labels, and inherent imbalance in the regression dataset. Our results show that HuBERT-Large with a self-attention-based light-weight sequence model provides a 4.6% improvement over the reported baseline.
Submitted 27 September, 2023; v1 submitted 28 August, 2023;
originally announced August 2023.
-
Using Internal Bar Strength as a Key Indicator for Trading Country ETFs
Authors:
Aditya Pandey,
Kunal Joshi
Abstract:
This report aims to investigate the effectiveness of using internal bar strength (IBS) as a key indicator for trading country exchange-traded funds (ETFs). The study uses a quantitative approach to analyze historical price data for a bucket of country ETFs over a period of 10 years and uses the idea of Mean Reversion to create a profitable trading strategy. Our findings suggest that IBS can be a useful technical indicator for predicting short-term price movements in this basket of ETFs.
Submitted 14 June, 2023;
originally announced June 2023.
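The internal bar strength indicator named in the abstract above is commonly defined as the position of the close within the bar's high-low range, IBS = (close - low) / (high - low). The sketch below computes it and maps it to a simple mean-reversion signal. The thresholds (0.2 and 0.8), the neutral value for a flat bar, and the function names are illustrative assumptions, not values taken from the report.

```python
def internal_bar_strength(high, low, close):
    """Return IBS = (close - low) / (high - low), a value in [0, 1]."""
    bar_range = high - low
    if bar_range == 0:
        # Assumption: a bar with no range is treated as neutral.
        return 0.5
    return (close - low) / bar_range


def mean_reversion_signal(ibs, buy_below=0.2, sell_above=0.8):
    """Map an IBS value to a signal; the thresholds are hypothetical."""
    if ibs < buy_below:
        return "buy"   # close near the low: bet on a rebound
    if ibs > sell_above:
        return "sell"  # close near the high: bet on a pullback
    return "hold"


# Usage: a bar with high 10.0, low 8.0 and close 8.5 closes in its lower quarter.
signal = mean_reversion_signal(internal_bar_strength(10.0, 8.0, 8.5))
```

A low IBS means the close sits near the bottom of the day's range, which a mean-reversion strategy reads as a candidate for a short-term rebound.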
-
Duality in Multi-View Restricted Kernel Machines
Authors:
Sonny Achten,
Arun Pandey,
Hannes De Meulemeester,
Bart De Moor,
Johan A. K. Suykens
Abstract:
We propose a unifying setting that combines existing restricted kernel machine methods into a single primal-dual multi-view framework for kernel principal component analysis in both supervised and unsupervised settings. We derive the primal and dual representations of the framework and relate different training and inference algorithms from a theoretical perspective. We show how to achieve full equivalence in primal and dual formulations by rescaling primal variables. Finally, we experimentally validate the equivalence and provide insight into the relationships between different methods on a number of time series data sets by recursively forecasting unseen test data and visualizing the learned features.
Submitted 6 July, 2023; v1 submitted 26 May, 2023;
originally announced May 2023.
-
Towards Realistic Generative 3D Face Models
Authors:
Aashish Rai,
Hiresh Gupta,
Ayush Pandey,
Francisco Vicente Carrasco,
Shingo Jason Takagi,
Amaury Aubel,
Daeil Kim,
Aayush Prakash,
Fernando de la Torre
Abstract:
In recent years, there has been significant progress in 2D generative face models fueled by applications such as animation, synthetic data generation, and digital avatars. However, due to the absence of 3D information, these 2D models often struggle to accurately disentangle facial attributes like pose, expression, and illumination, limiting their editing capabilities. To address this limitation, this paper proposes a 3D controllable generative face model to produce high-quality albedo and precise 3D shape leveraging existing 2D generative models. By combining 2D face generative models with semantic face manipulation, this method enables editing of detailed 3D rendered faces. The proposed framework utilizes an alternating descent optimization approach over shape and albedo. Differentiable rendering is used to train high-quality shapes and albedo without 3D supervision. Moreover, this approach outperforms the state-of-the-art (SOTA) methods in the well-known NoW benchmark for shape reconstruction. It also outperforms the SOTA reconstruction models in recovering rendered faces' identities across novel poses by an average of 10%. Additionally, the paper demonstrates direct control of expressions in 3D faces by exploiting latent space leading to text-based editing of 3D faces.
Submitted 26 October, 2023; v1 submitted 24 April, 2023;
originally announced April 2023.
-
Contingency Analyses with Warm Starter using Probabilistic Graphical Model
Authors:
Shimiao Li,
Amritanshu Pandey,
Larry Pileggi
Abstract:
Cyberthreats are an increasingly common risk to the power grid and can thwart secure grid operations. We propose to extend contingency analysis to include cyberthreat evaluations. However, unlike the traditional N-1 or N-2 contingencies, cyberthreats (e.g., MadIoT) require simulating hard-to-solve N-k (with k >> 2) contingencies in a practical amount of time. Purely physics-based power flow solvers, while being accurate, are slow and may not solve N-k contingencies in a timely manner, whereas the emerging data-driven alternatives are fast but not sufficiently generalizable, interpretable, and scalable. To address these challenges, we propose a novel conditional Gaussian Random Field-based data-driven method that performs fast and accurate evaluation of cyberthreats. It achieves speedup of contingency analysis by warm-starting simulations, i.e., improving starting points, for the physical solvers. To improve the physical interpretability and generalizability, the proposed method incorporates domain knowledge by considering the graphical nature of the grid topology. To improve scalability, the method applies physics-informed regularization that reduces model complexity. Experiments validate that simulating MadIoT-induced attacks with our warm starter becomes approximately 5x faster on a realistic 2000-bus system.
Submitted 19 March, 2024; v1 submitted 10 April, 2023;
originally announced April 2023.
-
Pacti: Scaling Assume-Guarantee Reasoning for System Analysis and Design
Authors:
Inigo Incer,
Apurva Badithela,
Josefine Graebener,
Piergiuseppe Mallozzi,
Ayush Pandey,
Sheng-Jung Yu,
Albert Benveniste,
Benoit Caillaud,
Richard M. Murray,
Alberto Sangiovanni-Vincentelli,
Sanjit A. Seshia
Abstract:
Contract-based design is a method to facilitate modular system design. While there has been substantial progress on the theory of contracts, there has been less progress on scalable algorithms for the algebraic operations in this theory. In this paper, we present: 1) principles to implement a contract-based design tool at scale and 2) Pacti, a tool that can efficiently compute these operations. We then illustrate the use of Pacti in a variety of case studies.
Submitted 30 March, 2023;
originally announced March 2023.
-
Benchmarking and Security Considerations of Wi-Fi FTM for Ranging in IoT Devices
Authors:
Govind Singh,
Anshul Pandey,
Monika Prakash,
Martin Andreoni,
Michael Baddeley
Abstract:
The IEEE 802.11mc standard introduces fine time measurement (Wi-Fi FTM), allowing high-precision synchronization between peers and round-trip time calculation (Wi-Fi RTT) for location estimation - typically with a precision of one to two meters. This has considerable advantages over received signal strength (RSS)-based trilateration, which is prone to errors due to multipath reflections. We examine different commercial radios which support Wi-Fi RTT and benchmark Wi-Fi FTM ranging over different spectrums and bandwidths. Importantly, we find that while Wi-Fi FTM supports localization accuracy to within one to two meters in ideal conditions during outdoor line-of-sight experiments, for indoor environments at short ranges similar accuracy was only achievable on chipsets supporting Wi-Fi FTM on wider (VHT80) channel bandwidths rather than narrower (HT20) channel bandwidths. Finally, we explore the security implications of Wi-Fi FTM and use an on-air sniffer to demonstrate that Wi-Fi FTM messages are unprotected. We consequently propose a threat model with possible mitigations and directions for further research.
Submitted 7 March, 2023;
originally announced March 2023.
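The round-trip-time ranging described in the abstract above can be illustrated with the usual four-timestamp calculation: the time of flight is half of the round trip after subtracting the peer's turnaround delay. This is a minimal sketch under stated assumptions; the function names are hypothetical, timestamps are in seconds, and real chipsets average many bursts and apply calibration offsets that are not modelled here.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def ftm_round_trip_time(t1, t2, t3, t4):
    """Round-trip time from one FTM exchange.

    t1: departure of the FTM frame from the sender
    t2: arrival of that frame at the peer
    t3: departure of the acknowledgement from the peer
    t4: arrival of the acknowledgement back at the sender
    t1/t4 and t2/t3 are each read on one device's clock, so a constant
    clock offset between the two devices cancels out.
    """
    return (t4 - t1) - (t3 - t2)


def ftm_distance_m(t1, t2, t3, t4):
    """Estimated distance in meters: half the round trip times c."""
    return SPEED_OF_LIGHT_M_PER_S * ftm_round_trip_time(t1, t2, t3, t4) / 2.0
```

Because light travels roughly 0.3 m per nanosecond, a timestamp error of a few nanoseconds already shifts the estimate by about a meter, which is consistent with the one-to-two-meter precision quoted in the abstract.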
-
Complexity of total dominator coloring in graphs
Authors:
Michael A. Henning,
Kusum,
Arti Pandey,
Kaustav Paul
Abstract:
Let $G=(V,E)$ be a graph with no isolated vertices. A vertex $v$ totally dominates a vertex $w$ ($w \ne v$) if $v$ is adjacent to $w$. A set $D \subseteq V$ is called a total dominating set of $G$ if every vertex $v\in V$ is totally dominated by some vertex in $D$. The minimum cardinality of a total dominating set is the total domination number of $G$ and is denoted by $γ_t(G)$. A total dominator coloring of a graph $G$ is a proper coloring of the vertices of $G$ such that each vertex totally dominates some color class. The total dominator chromatic number $χ_{td}(G)$ of $G$ is the least number of colors required for a total dominator coloring of $G$. The Total Dominator Coloring problem is to find a total dominator coloring of $G$ using the minimum number of colors. It is known that the decision version of this problem is NP-complete for general graphs. We show that it remains NP-complete even when restricted to bipartite, planar and split graphs. We further study the Total Dominator Coloring problem for various graph classes, including trees, cographs and chain graphs. First, we characterize the trees having $χ_{td}(T)=γ_t(T)+1$, which completes the characterization of trees achieving all possible values of $χ_{td}(T)$. Also, we show that for a cograph $G$, $χ_{td}(G)$ can be computed in linear time. Moreover, we show that $2 \le χ_{td}(G) \le 4$ for a chain graph $G$ and give a characterization of chain graphs for every possible value of $χ_{td}(G)$ in linear time.
Submitted 3 March, 2023;
originally announced March 2023.
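The definition in the abstract above can be checked mechanically. The sketch below is a brute-force verifier, not an algorithm from the paper: it tests that a coloring is proper and that every vertex is adjacent to all vertices of at least one color class. The graph representation (a dict mapping each vertex to its neighbor set) and the function name are assumptions made for illustration.

```python
def is_total_dominator_coloring(adj, color):
    """Check the total dominator coloring conditions.

    adj:   dict mapping each vertex to the set of its neighbors
    color: dict mapping each vertex to its color
    """
    # Condition 1: the coloring is proper (adjacent vertices differ in color).
    for v, neighbors in adj.items():
        if any(color[u] == color[v] for u in neighbors):
            return False

    # Group the vertices into color classes.
    classes = {}
    for v, c in color.items():
        classes.setdefault(c, set()).add(v)

    # Condition 2: each vertex is adjacent to every vertex of some class.
    for v, neighbors in adj.items():
        if not any(cls <= neighbors for cls in classes.values()):
            return False
    return True


# Usage on the path a-b-c-d.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
three_colors = {"a": 1, "b": 2, "c": 3, "d": 1}
two_colors = {"a": 1, "b": 2, "c": 1, "d": 2}
ok = is_total_dominator_coloring(path, three_colors)
bad = is_total_dominator_coloring(path, two_colors)
```

On this path the proper 2-coloring fails because the end vertex `a` is adjacent only to `b`, which is not a whole color class, while the 3-coloring makes `{b}` and `{c}` singleton classes that every vertex can totally dominate.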
-
Addressing DAO Insider Attacks in IPv6-Based Low-Power and Lossy Networks
Authors:
Sachin Kumar Verma,
Abhishek Verma,
Avinash Chandra Pandey
Abstract:
Low-Power and Lossy Networks (LLNs) run on resource-constrained devices and play a key role in many Industrial Internet of Things and Cyber-Physical Systems based applications. However, achieving energy-efficient routing in LLNs remains a major challenge. This challenge is addressed by the Routing Protocol for Low-Power and Lossy Networks (RPL), which is currently specified in RFC 6550 as a "Proposed Standard". In RPL, a client node uses Destination Advertisement Object (DAO) control messages to pass on the destination information towards the root node. An attacker may exploit the DAO sending mechanism of RPL to perform a DAO Insider attack in LLNs. In this paper, it is shown that an aggressive attacker can drastically degrade the network performance. To address the DAO Insider attack, a lightweight defense solution is proposed. The proposed solution uses an early blacklisting strategy to significantly mitigate the attack and restore RPL performance. The proposed solution is implemented and tested on the Cooja simulator.
Submitted 1 March, 2023;
originally announced March 2023.
-
Cosecure Domination: Hardness Results and Algorithm
Authors:
Kusum,
Arti Pandey
Abstract:
For a simple graph $G=(V,E)$ without any isolated vertex, a cosecure dominating set $S$ of $G$ satisfies the following two properties: (i) $S$ is a dominating set of $G$, and (ii) for every vertex $v \in S$ there exists a vertex $u \in V \setminus S$ such that $uv \in E$ and $(S \setminus \{v\}) \cup \{u\}$ is a dominating set of $G$. The minimum cardinality of a cosecure dominating set of $G$ is called the cosecure domination number of $G$ and is denoted by $γ_{cs}(G)$. The Minimum Cosecure Domination problem is to find a cosecure dominating set of a graph $G$ of cardinality $γ_{cs}(G)$. The decision version of the problem is known to be NP-complete for bipartite, planar, and split graphs. Also, it is known that the Minimum Cosecure Domination problem is efficiently solvable for proper interval graphs and cographs.
In this paper, we work on various important graph classes in an effort to reduce the complexity gap of the Minimum Cosecure Domination problem. We show that the decision version of the problem remains NP-complete for circle graphs, doubly chordal graphs, chordal bipartite graphs, star-convex bipartite graphs and comb-convex bipartite graphs. On the positive side, we give an efficient algorithm to compute the cosecure domination number of chain graphs, which is an important subclass of bipartite graphs. In addition, we show that the problem is linear-time solvable for bounded tree-width graphs. Further, we prove that the computational complexity of this problem varies from the domination problem.
Submitted 25 February, 2023;
originally announced February 2023.
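The two properties that define a cosecure dominating set in the abstract above translate directly into a brute-force check. The following is a sketch for small graphs only, not the paper's algorithm; the adjacency-dict representation and the function names are assumptions.

```python
def is_dominating(adj, s):
    """True if every vertex is in s or has at least one neighbor in s."""
    return all(v in s or bool(adj[v] & s) for v in adj)


def is_cosecure_dominating(adj, s):
    """Check properties (i) and (ii) of a cosecure dominating set.

    adj: dict mapping each vertex to the set of its neighbors
    s:   candidate set of vertices
    """
    s = set(s)
    # Property (i): s itself dominates the graph.
    if not is_dominating(adj, s):
        return False
    # Property (ii): every v in s can be swapped with some outside
    # neighbor u so that the resulting set still dominates the graph.
    for v in s:
        outside_neighbors = adj[v] - s
        if not any(is_dominating(adj, (s - {v}) | {u}) for u in outside_neighbors):
            return False
    return True


# Usage: the path a-b-c-d and the star with center x and leaves 1, 2, 3.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
star = {"x": {1, 2, 3}, 1: {"x"}, 2: {"x"}, 3: {"x"}}
path_ok = is_cosecure_dominating(path, {"b", "c"})
star_ok = is_cosecure_dominating(star, {"x"})
```

In the star, the center alone is a dominating set, but replacing it with any single leaf leaves the other leaves undominated, so property (ii) fails.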
-
Multi-view Kernel PCA for Time series Forecasting
Authors:
Arun Pandey,
Hannes De Meulemeester,
Bart De Moor,
Johan A. K. Suykens
Abstract:
In this paper, we propose a kernel principal component analysis model for multi-variate time series forecasting, where the training and prediction schemes are derived from the multi-view formulation of Restricted Kernel Machines. The training problem is simply an eigenvalue decomposition of the summation of two kernel matrices corresponding to the views of the input and output data. When a linear kernel is used for the output view, it is shown that the forecasting equation takes the form of kernel ridge regression. When that kernel is non-linear, a pre-image problem has to be solved to forecast a point in the input space. We evaluate the model on several standard time series datasets, perform ablation studies, benchmark with closely related models and discuss its results.
Submitted 23 January, 2023;
originally announced January 2023.
-
Cross-Domain Shopping and Stock Trend Analysis
Authors:
Aditya Pandey,
Haseeba Fathiya,
Nivedita Patel
Abstract:
This paper presents a cross-domain trend analysis that aims to identify and analyze the relationships between stock prices, stock news on Twitter, and users' behaviors on e-commerce websites. The analysis is based on three datasets: a US stock dataset, a stock tweets dataset, and an e-commerce behavior dataset. The analysis is performed using Hadoop, Hive, and Tableau, allowing for efficient and scalable processing and visualization of large datasets.
The analysis includes trend analysis of Twitter sentiment (positive and negative tweets) and correlation analysis, including the correlation between tweet sentiment and stocks, the correlation between stock trends and shopping behavior, and the understanding of data based on different slices of time. By comparing different features from the datasets over time, we hope to gain insight into the factors that drive user behavior as well as the market in different categories. The results of this analysis can provide valuable insights for businesses and investors to inform decision-making.
We believe that our analysis can serve as a valuable starting point for further research and investigation into these topics.
Submitted 23 December, 2022;
originally announced December 2022.
-
Cross-Domain Consumer Review Analysis
Authors:
Aditya Pandey,
Kunal Joshi
Abstract:
The paper presents a cross-domain review analysis on four popular review datasets: Amazon, Yelp, Steam, IMDb. The analysis is performed using Hadoop and Spark, which allows for efficient and scalable processing of large datasets. By examining close to 12 million reviews from these four online forums, we hope to uncover interesting trends in sales and customer sentiment over the years. Our analysis will include a study of the number of reviews and their distribution over time, as well as an examination of the relationship between various review attributes such as upvotes, creation time, rating, and sentiment. By comparing the reviews across different domains, we hope to gain insight into the factors that drive customer satisfaction and engagement in different product categories.
Submitted 23 December, 2022;
originally announced December 2022.
-
Forecasting formation of a Tropical Cyclone Using Reanalysis Data
Authors:
Sandeep Kumar,
Koushik Biswas,
Ashish Kumar Pandey
Abstract:
The tropical cyclone formation process is one of the most complex natural phenomena, governed by various atmospheric, oceanographic, and geographic factors that vary with time and space. Despite several years of research, accurately predicting tropical cyclone formation remains a challenging task. While the existing numerical models have inherent limitations, the machine learning models fail to capture the spatial and temporal dimensions of the causal factors behind TC formation. In this study, a deep learning model has been proposed that can forecast the formation of a tropical cyclone with a lead time of up to 60 hours with high accuracy. The model uses the high-resolution reanalysis data ERA5 (ECMWF reanalysis 5th generation), and best track data IBTrACS (International Best Track Archive for Climate Stewardship) to forecast tropical cyclone formation in six ocean basins of the world. For a 60-hour lead time, the model achieves an accuracy in the range of 86.9% - 92.9% across the six ocean basins. The model takes about 5-15 minutes of training time, depending on the ocean basin and the amount of data used, and can predict within seconds, thereby making it suitable for real-life usage.
Submitted 10 December, 2022;
originally announced December 2022.
-
SpeechNet: Weakly Supervised, End-to-End Speech Recognition at Industrial Scale
Authors:
Raphael Tang,
Karun Kumar,
Gefei Yang,
Akshat Pandey,
Yajie Mao,
Vladislav Belyaev,
Madhuri Emmadi,
Craig Murray,
Ferhan Ture,
Jimmy Lin
Abstract:
End-to-end automatic speech recognition systems represent the state of the art, but they rely on thousands of hours of manually annotated speech for training, as well as heavyweight computation for inference. Of course, this impedes commercialization since most companies lack vast human and computational resources. In this paper, we explore training and deploying an ASR system in the label-scarce, compute-limited setting. To reduce human labor, we use a third-party ASR system as a weak supervision source, supplemented with labeling functions derived from implicit user feedback. To accelerate inference, we propose to route production-time queries across a pool of CUDA graphs of varying input lengths, the distribution of which best matches the traffic's. Compared to our third-party ASR, we achieve a relative improvement in word-error rate of 8% and a speedup of 600%. Our system, called SpeechNet, currently serves 12 million queries per day on our voice-enabled smart television. To our knowledge, this is the first time a large-scale, Wav2vec-based deployment has been described in the academic literature.
Submitted 21 November, 2022;
originally announced November 2022.
-
XAI-BayesHAR: A novel Framework for Human Activity Recognition with Integrated Uncertainty and Shapely Values
Authors:
Anand Dubey,
Niall Lyons,
Avik Santra,
Ashutosh Pandey
Abstract:
Human activity recognition (HAR) using IMU sensors, namely accelerometer and gyroscope, has several applications in smart homes, healthcare and human-machine interface systems. In practice, the IMU-based HAR system is expected to encounter variations in measurement due to sensor degradation, alien environment or sensor noise and will be subjected to unknown activities. In view of practical deployment of the solution, analysis of statistical confidence over the activity class score is an important metric. In this paper, we therefore propose XAI-BayesHAR, an integrated Bayesian framework that improves the overall activity classification accuracy of IMU-based HAR solutions by recursively tracking the feature embedding vector and its associated uncertainty via a Kalman filter. Additionally, XAI-BayesHAR acts as an out-of-data-distribution (OOD) detector using the predictive uncertainty, which helps to evaluate and detect alien input data distributions. Furthermore, Shapley value-based performance of the proposed framework is also evaluated to understand the importance of the feature embedding vector and is accordingly used for model compression.
Submitted 7 November, 2022;
originally announced November 2022.
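The abstract above mentions recursively tracking an embedding and its uncertainty with a Kalman filter. As a rough illustration of that idea only, the sketch below runs a textbook scalar Kalman update on one embedding dimension. It assumes a random-walk state model, a direct noisy observation of the state, and independent dimensions; none of these choices, nor the function name, are taken from the paper.

```python
def kalman_update(mean, var, observation, obs_var, process_var=0.0):
    """One predict-and-update step of a scalar Kalman filter.

    mean, var:    current estimate of one embedding dimension and its variance
    observation:  new noisy measurement of that dimension
    obs_var:      measurement noise variance (assumed known)
    process_var:  random-walk process noise variance (assumption)
    """
    # Predict: under a random-walk model the mean is unchanged and the
    # uncertainty grows by the process noise.
    predicted_var = var + process_var
    # Update: blend prediction and observation according to the Kalman gain.
    gain = predicted_var / (predicted_var + obs_var)
    new_mean = mean + gain * (observation - mean)
    new_var = (1.0 - gain) * predicted_var
    return new_mean, new_var


# Usage: track one dimension over a short sequence of observations.
mean, var = 0.0, 1.0
for z in [2.0, 2.0, 2.0]:
    mean, var = kalman_update(mean, var, z, obs_var=1.0)
```

The variance shrinks as consistent observations arrive. In an uncertainty-aware setting, an observation that lies far from the mean relative to the tracked variance is the kind of signal an out-of-distribution detector could threshold on.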
-
Harnessing the Power of Explanations for Incremental Training: A LIME-Based Approach
Authors:
Arnab Neelim Mazumder,
Niall Lyons,
Ashutosh Pandey,
Avik Santra,
Tinoosh Mohsenin
Abstract:
Explainability of neural network prediction is essential to understand feature importance and gain interpretable insight into neural network performance. However, explanations of neural network outcomes are mostly limited to visualization, and there is scarce work that looks to use these explanations as feedback to improve model performance. In this work, model explanations are fed back to the feed-forward training to help the model generalize better. To this extent, a custom weighted loss where the weights are generated by considering the Euclidean distances between true LIME (Local Interpretable Model-Agnostic Explanations) explanations and model-predicted LIME explanations is proposed. Also, in practical training scenarios, developing a solution that can help the model learn sequentially without losing information on previous data distribution is imperative due to the unavailability of all the training data at once. Thus, the framework incorporates the custom weighted loss with Elastic Weight Consolidation (EWC) to maintain performance in sequential testing sets. The proposed custom training procedure results in a consistent enhancement of accuracy ranging from 0.5% to 1.5% throughout all phases of the incremental learning setup compared to traditional loss-based training methods for the keyword spotting task using the Google Speech Commands dataset.
Submitted 11 July, 2023; v1 submitted 2 November, 2022;
originally announced November 2022.
-
Time-Domain Speech Enhancement for Robust Automatic Speech Recognition
Authors:
Yufeng Yang,
Ashutosh Pandey,
DeLiang Wang
Abstract:
It has been shown that the intelligibility of noisy speech can be improved by speech enhancement algorithms. However, speech enhancement has not been established as an effective frontend for robust automatic speech recognition (ASR) in noisy conditions compared to an ASR model trained on noisy speech directly. The divide between speech enhancement and ASR impedes the progress of robust ASR systems especially as speech enhancement has made big strides in recent years. In this work, we focus on eliminating this divide with an ARN (attentive recurrent network) based time-domain enhancement model. The proposed system fully decouples speech enhancement and an acoustic model trained only on clean speech. Results on the CHiME-2 corpus show that ARN enhanced speech translates to improved ASR results. The proposed system achieves $6.28\%$ average word error rate, outperforming the previous best by $19.3\%$ relatively.
Submitted 20 June, 2023; v1 submitted 24 October, 2022;
originally announced October 2022.
-
BiaScope: Visual Unfairness Diagnosis for Graph Embeddings
Authors:
Agapi Rissaki,
Bruno Scarone,
David Liu,
Aditeya Pandey,
Brennan Klein,
Tina Eliassi-Rad,
Michelle A. Borkin
Abstract:
The issue of bias (i.e., systematic unfairness) in machine learning models has recently attracted the attention of both researchers and practitioners. For the graph mining community in particular, an important goal toward algorithmic fairness is to detect and mitigate bias incorporated into graph embeddings since they are commonly used in human-centered applications, e.g., social-media recommendations. However, simple analytical methods for detecting bias typically involve aggregate statistics which do not reveal the sources of unfairness. Instead, visual methods can provide a holistic fairness characterization of graph embeddings and help uncover the causes of observed bias. In this work, we present BiaScope, an interactive visualization tool that supports end-to-end visual unfairness diagnosis for graph embeddings. The tool is the product of a design study in collaboration with domain experts. It allows the user to (i) visually compare two embeddings with respect to fairness, (ii) locate nodes or graph communities that are unfairly embedded, and (iii) understand the source of bias by interactively linking the relevant embedding subspace with the corresponding graph topology. Experts' feedback confirms that our tool is effective at detecting and diagnosing unfairness. Thus, we envision our tool both as a companion for researchers in designing their algorithms as well as a guide for practitioners who use off-the-shelf graph embeddings.
Submitted 12 October, 2022;
originally announced October 2022.
-
What the DAAM: Interpreting Stable Diffusion Using Cross Attention
Authors:
Raphael Tang,
Linqing Liu,
Akshat Pandey,
Zhiying Jiang,
Gefei Yang,
Karun Kumar,
Pontus Stenetorp,
Jimmy Lin,
Ferhan Ture
Abstract:
Large-scale diffusion neural networks represent a substantial milestone in text-to-image generation, but they remain poorly understood, lacking interpretability analyses. In this paper, we perform a text-image attribution analysis on Stable Diffusion, a recently open-sourced model. To produce pixel-level attribution maps, we upscale and aggregate cross-attention word-pixel scores in the denoising subnetwork, naming our method DAAM. We evaluate its correctness by testing its semantic segmentation ability on nouns, as well as its generalized attribution quality on all parts of speech, rated by humans. We then apply DAAM to study the role of syntax in the pixel space, characterizing head--dependent heat map interaction patterns for ten common dependency relations. Finally, we study several semantic phenomena using DAAM, with a focus on feature entanglement, where we find that cohyponyms worsen generation quality and descriptive adjectives attend too broadly. To our knowledge, we are the first to interpret large diffusion models from a visuolinguistic perspective, which enables future lines of research. Our code is at https://github.com/castorini/daam.
Submitted 8 December, 2022; v1 submitted 10 October, 2022;
originally announced October 2022.
-
PyPose: A Library for Robot Learning with Physics-based Optimization
Authors:
Chen Wang,
Dasong Gao,
Kuan Xu,
Junyi Geng,
Yaoyu Hu,
Yuheng Qiu,
Bowen Li,
Fan Yang,
Brady Moon,
Abhinav Pandey,
Aryan,
Jiahe Xu,
Tianhao Wu,
Haonan He,
Daning Huang,
Zhongqiang Ren,
Shibo Zhao,
Taimeng Fu,
Pranay Reddy,
Xiao Lin,
Wenshan Wang,
Jingnan Shi,
Rajat Talak,
Kun Cao,
Yi Du
, et al. (12 additional authors not shown)
Abstract:
Deep learning has had remarkable success in robotic perception, but its data-centric nature suffers when it comes to generalizing to ever-changing environments. By contrast, physics-based optimization generalizes better, but it does not perform as well in complicated tasks due to the lack of high-level semantic information and reliance on manual parametric tuning. To take advantage of these two complementary worlds, we present PyPose: a robotics-oriented, PyTorch-based library that combines deep perceptual models with physics-based optimization. PyPose's architecture is tidy and well-organized; it has an imperative-style interface and is efficient and user-friendly, making it easy to integrate into real-world robotic applications. In addition, it supports parallel computation of gradients of any order for Lie groups and Lie algebras, as well as $2^{\text{nd}}$-order optimizers such as trust region methods. Experiments show that PyPose achieves more than $10\times$ speedup in computation compared to state-of-the-art libraries. To boost future research, we provide concrete examples for several fields of robot learning, including SLAM, planning, control, and inertial navigation.
Submitted 24 March, 2023; v1 submitted 30 September, 2022;
originally announced September 2022.
-
MEDLEY: Intent-based Recommendations to Support Dashboard Composition
Authors:
Aditeya Pandey,
Arjun Srinivasan,
Vidya Setlur
Abstract:
Despite the ever-growing popularity of dashboards across a wide range of domains, their authoring still remains a tedious and complex process. Current tools offer considerable support for creating individual visualizations but provide limited support for discovering groups of visualizations that can be collectively useful for composing analytic dashboards. To address this problem, we present MEDLEY, a mixed-initiative interface that assists in dashboard composition by recommending dashboard collections (i.e., a logically grouped set of views and filtering widgets) that map to specific analytical intents. Users can specify dashboard intents (namely, measure analysis, change analysis, category analysis, or distribution analysis) explicitly through an input panel in the interface or implicitly by selecting data attributes and views of interest. The system recommends collections based on these analytic intents, and views and widgets can be selected to compose a variety of dashboards. MEDLEY also provides a lightweight direct manipulation interface to configure interactions between views in a dashboard. Based on a study with 13 participants performing both targeted and open-ended tasks, we discuss how MEDLEY's recommendations guide dashboard composition and facilitate different user workflows. Observations from the study identify potential directions for future work, including combining manual view specification with dashboard recommendations and designing natural language interfaces for dashboard authoring.
Submitted 5 August, 2022;
originally announced August 2022.
-
Multilinguals at SemEval-2022 Task 11: Complex NER in Semantically Ambiguous Settings for Low Resource Languages
Authors:
Amit Pandey,
Swayatta Daw,
Narendra Babu Unnam,
Vikram Pudi
Abstract:
We leverage pre-trained language models to solve the task of complex NER for two low-resource languages: Chinese and Spanish. We use the technique of Whole Word Masking (WWM) to boost the performance of the masked language modeling objective on large, unsupervised corpora. We experiment with multiple neural network architectures, incorporating CRF, BiLSTMs, and Linear Classifiers on top of a fine-tuned BERT layer. All our models outperform the baseline by a significant margin, and our best-performing model obtains a competitive position on the evaluation leaderboard for the blind test set.
Submitted 14 July, 2022;
originally announced July 2022.
-
Deployment of ML Models using Kubeflow on Different Cloud Providers
Authors:
Aditya Pandey,
Maitreya Sonawane,
Sumit Mamtani
Abstract:
This project aims to explore the process of deploying Machine learning models on Kubernetes using an open-source tool called Kubeflow [1] - an end-to-end ML Stack orchestration toolkit. We create end-to-end Machine Learning models on Kubeflow in the form of pipelines and analyze various points including the ease of setup, deployment models, performance, limitations and features of the tool. We hope that our project acts almost like a seminar/introductory report that can help vanilla cloud/Kubernetes users with zero knowledge on Kubeflow use Kubeflow to deploy ML models. From setup on different clouds to serving our trained model over the internet - we give details and metrics detailing the performance of Kubeflow.
Submitted 27 June, 2022;
originally announced June 2022.
-
Towards Practical Physics-Informed ML Design and Evaluation for Power Grid
Authors:
Shimiao Li,
Amritanshu Pandey,
Larry Pileggi
Abstract:
When applied to a real-world safety critical system like the power grid, general machine learning methods suffer from expensive training, non-physical solutions, and limited interpretability. To address these challenges for power grids, many recent works have explored the inclusion of grid physics (i.e., domain expertise) into their method design, primarily through including system constraints and technical limits, reducing search space and defining meaningful features in latent space. Yet, there is no general methodology to evaluate the practicality of these approaches in power grid tasks, and limitations exist regarding scalability, generalization, interpretability, etc. This work formalizes a new concept of physical interpretability which assesses how a ML model makes predictions in a physically meaningful way and introduces an evaluation methodology that identifies a set of attributes that a practical method should satisfy. Inspired by the evaluation attributes, the paper further develops a novel contingency analysis warm starter for MadIoT cyberattack, based on a conditional Gaussian random field. This method serves as an instance of an ML model that can incorporate diverse domain knowledge and improve on these identified attributes. Experiments validate that the warm starter significantly boosts the efficiency of contingency analysis for MadIoT attack even with shallow NN architectures.
Submitted 24 May, 2022; v1 submitted 7 May, 2022;
originally announced May 2022.
-
Improved far-field speech recognition using Joint Variational Autoencoder
Authors:
Shashi Kumar,
Shakti P. Rath,
Abhishek Pandey
Abstract:
Automatic Speech Recognition (ASR) systems suffer considerably when source speech is corrupted with noise or room impulse responses (RIR). Typically, speech enhancement is applied in both mismatched and matched scenario training and testing. In the matched setting, the acoustic model (AM) is trained on dereverberated far-field features, while in the mismatched setting, the AM is fixed. In the recent past, mapping speech features from far-field to close-talk using a denoising autoencoder (DA) has been explored. In this paper, we focus on matched scenario training and show that the proposed joint VAE based mapping achieves a significant improvement over DA. Specifically, we observe an absolute improvement of 2.5% in word error rate (WER) compared to DA based enhancement and 3.96% compared to an AM trained directly on far-field filterbank features.
Submitted 24 April, 2022;
originally announced April 2022.
-
Multilinguals at SemEval-2022 Task 11: Transformer Based Architecture for Complex NER
Authors:
Amit Pandey,
Swayatta Daw,
Vikram Pudi
Abstract:
We investigate the task of complex NER for the English language. The task is non-trivial due to the semantic ambiguity of the textual structure and the rarity of occurrence of such entities in the prevalent literature. Using pre-trained language models such as BERT, we obtain a competitive performance on this task. We qualitatively analyze the performance of multiple architectures for this task. All our models are able to outperform the baseline by a significant margin. Our best performing model beats the baseline F1-score by over 9%.
Submitted 5 April, 2022;
originally announced April 2022.
-
Detecting and Localizing Copy-Move and Image-Splicing Forgery
Authors:
Aditya Pandey,
Anshuman Mitra
Abstract:
In the world of fake news and deepfakes, there have been an alarmingly large number of cases of images being tampered with and published in newspapers, used in court, and posted on social media for defamation purposes. Detecting these tampered images is an important task and one we try to tackle. In this paper, we focus on the methods to detect if an image has been tampered with using both Deep Learning and Image transformation methods and comparing the performances and robustness of each method. We then attempt to identify the tampered area of the image and predict the corresponding mask. Based on the results, suggestions and approaches are provided to achieve a more robust framework to detect and identify the forgeries.
Submitted 7 February, 2022;
originally announced February 2022.
-
Complexity of Paired Domination in AT-free and Planar Graphs
Authors:
Vikash Tripathi,
Ton Kloks,
Arti Pandey,
Kaustav Paul,
Hung-Lung Wang
Abstract:
For a graph $G=(V,E)$, a subset $D$ of the vertex set $V$ is a dominating set of $G$ if every vertex not in $D$ is adjacent to at least one vertex of $D$. A dominating set $D$ of a graph $G$ with no isolated vertices is called a paired dominating set (PD-set) if $G[D]$, the subgraph induced by $D$ in $G$, has a perfect matching. The Min-PD problem asks for a PD-set of minimum cardinality. The decision version of the Min-PD problem remains NP-complete even when $G$ belongs to restricted graph classes such as bipartite graphs and chordal graphs. On the positive side, the problem is efficiently solvable for many graph classes, including interval graphs, strongly chordal graphs, and permutation graphs. In this paper, we study the complexity of the problem in AT-free graphs and planar graphs. The class of AT-free graphs contains cocomparability graphs, permutation graphs, trapezoid graphs, and interval graphs as subclasses. We propose a polynomial-time algorithm to compute a minimum PD-set in AT-free graphs. In addition, we present a linear-time $2$-approximation algorithm for the problem in AT-free graphs. Further, we prove that the decision version of the problem is NP-complete for planar graphs, which answers an open question asked by Lin et al. (in Theor. Comput. Sci., 591 (2015): 99-105 and Algorithmica, 82 (2020): 2809-2840).
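The PD-set definition in this abstract can be made concrete with a small checker. The following is only an illustrative sketch, not the paper's algorithm: the function names and the adjacency-dictionary representation are assumptions, and the perfect-matching test is an exponential brute force suitable only for tiny examples.

```python
from typing import Dict, Hashable, Iterable, Set

Graph = Dict[Hashable, Set[Hashable]]  # hypothetical representation: vertex -> neighbours


def is_dominating(adj: Graph, D: Set[Hashable]) -> bool:
    # Every vertex outside D must have at least one neighbour inside D.
    return all(v in D or adj[v] & D for v in adj)


def has_perfect_matching(adj: Graph, vertices: Iterable[Hashable]) -> bool:
    # Brute force on the induced subgraph: pick any vertex and try to pair it
    # with each of its neighbours that is still unmatched.
    remaining = set(vertices)
    if not remaining:
        return True
    v = next(iter(remaining))
    rest = remaining - {v}
    return any(has_perfect_matching(adj, rest - {u}) for u in adj[v] & rest)


def is_paired_dominating(adj: Graph, D: Set[Hashable]) -> bool:
    # D is a PD-set if it dominates G and G[D] has a perfect matching.
    return is_dominating(adj, D) and has_perfect_matching(adj, D)


# Path 1 - 2 - 3 - 4
path = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
print(is_paired_dominating(path, {2, 3}))  # dominating, and 2-3 is a matching edge
print(is_paired_dominating(path, {1, 4}))  # dominating, but G[D] has no edges
```

On the four-vertex path, {2, 3} is a PD-set, whereas {1, 4} dominates the graph but induces no edges and so admits no perfect matching.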
Submitted 10 December, 2021;
originally announced December 2021.
-
Algorithms for Maximum Internal Spanning Tree Problem for Some Graph Classes
Authors:
Gopika Sharma,
Arti Pandey,
Michael C. Wigal
Abstract:
For a given graph $G$, a maximum internal spanning tree of $G$ is a spanning tree of $G$ with maximum number of internal vertices. The Maximum Internal Spanning Tree (MIST) problem is to find a maximum internal spanning tree of the given graph. The MIST problem is a generalization of the Hamiltonian path problem. Since the Hamiltonian path problem is NP-hard, even for bipartite and chordal graphs, two important subclasses of graphs, the MIST problem also remains NP-hard for these graph classes. In this paper, we propose linear-time algorithms to compute a maximum internal spanning tree of cographs, block graphs, cactus graphs, chain graphs and bipartite permutation graphs. The optimal path cover problem, which asks to find a path cover of the given graph with maximum number of edges, is also a well studied problem. In this paper, we also study the relationship between the number of internal vertices in maximum internal spanning tree and number of edges in optimal path cover for the special graph classes mentioned above.
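To illustrate the objective being maximized, the sketch below counts internal vertices (vertices of degree at least 2) in a spanning tree given as an edge list. The function name and the edge-list representation are assumptions for illustration and are not taken from the paper. A Hamiltonian path on $n \geq 3$ vertices is a spanning tree with exactly $n-2$ internal vertices, which is the sense in which MIST generalizes the Hamiltonian path problem.

```python
from collections import Counter
from typing import Hashable, Iterable, Tuple


def count_internal_vertices(tree_edges: Iterable[Tuple[Hashable, Hashable]]) -> int:
    # A vertex of a tree is internal if its degree in the tree is at least 2;
    # all other vertices are leaves.
    degree = Counter()
    for u, v in tree_edges:
        degree[u] += 1
        degree[v] += 1
    return sum(1 for d in degree.values() if d >= 2)


# Two spanning trees of the complete graph on {1, 2, 3, 4}:
hamiltonian_path = [(1, 2), (2, 3), (3, 4)]  # internal vertices: 2 and 3
star = [(1, 2), (1, 3), (1, 4)]              # internal vertex: 1 only

print(count_internal_vertices(hamiltonian_path))
print(count_internal_vertices(star))
```

The path attains the maximum possible count of $n-2 = 2$, while the star has a single internal vertex.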
Submitted 23 December, 2021; v1 submitted 4 December, 2021;
originally announced December 2021.
-
A Mixture of Expert Based Deep Neural Network for Improved ASR
Authors:
Vishwanath Pratap Singh,
Shakti P. Rath,
Abhishek Pandey
Abstract:
This paper presents a novel deep learning architecture for the acoustic model in the context of Automatic Speech Recognition (ASR), termed MixNet. Besides the conventional layers, such as fully connected layers in DNN-HMM and memory cells in LSTM-HMM, the model uses two additional layers based on Mixture of Experts (MoE). The first MoE layer, operating at the input, is based on pre-defined broad phonetic classes, and the second, operating at the penultimate layer, is based on automatically learned acoustic classes. In natural speech, overlap in distribution across different acoustic classes is inevitable, which leads to inter-class misclassification. The ASR accuracy is expected to improve if the conventional architecture of the acoustic model is modified to better account for such overlaps. MixNet is developed with this in mind. Analysis by means of scatter diagrams verifies that MoE indeed improves the separation between classes, which translates to better ASR accuracy. Experiments on a large-vocabulary ASR task show that the proposed architecture provides 13.6% and 10.0% relative reductions in word error rate compared to the conventional models, namely DNN and LSTM respectively, trained using the sMBR criterion. In comparison to an existing method developed for phone classification (by Eigen et al.), our proposed method yields a significant improvement.
Submitted 2 December, 2021;
originally announced December 2021.