Skip to main content

Showing 1–50 of 53 results for author: Risteski, A

  1. arXiv:2407.21046  [pdf, other

    cs.CL cs.LG

    Promises and Pitfalls of Generative Masked Language Modeling: Theoretical Framework and Practical Guidelines

    Authors: Yuchen Li, Alexandre Kirchmeyer, Aashay Mehta, Yilong Qin, Boris Dadachev, Kishore Papineni, Sanjiv Kumar, Andrej Risteski

    Abstract: Autoregressive language models are the currently dominant paradigm for text generation, but they have some fundamental limitations that cannot be remedied by scale-for example inherently sequential and unidirectional generation. While alternate classes of models have been explored, we have limited mathematical understanding of their fundamental power and limitations. In this paper we focus on Gene… ▽ More

    Submitted 22 July, 2024; originally announced July 2024.

    Comments: ICML 2024

  2. arXiv:2312.01429  [pdf, other

    cs.LG cs.CL stat.ML

    Transformers are uninterpretable with myopic methods: a case study with bounded Dyck grammars

    Authors: Kaiyue Wen, Yuchen Li, Bingbin Liu, Andrej Risteski

    Abstract: Interpretability methods aim to understand the algorithm implemented by a trained model (e.g., a Transofmer) by examining various aspects of the model, such as the weight matrices or the attention patterns. In this work, through a combination of theoretical results and carefully controlled experiments on synthetic data, we take a critical view of methods that exclusively focus on individual parts… ▽ More

    Submitted 3 December, 2023; originally announced December 2023.

  3. arXiv:2312.00234  [pdf, other

    cs.LG math.NA stat.ML

    Deep Equilibrium Based Neural Operators for Steady-State PDEs

    Authors: Tanya Marwah, Ashwini Pokle, J. Zico Kolter, Zachary C. Lipton, Jianfeng Lu, Andrej Risteski

    Abstract: Data-driven machine learning approaches are being increasingly used to solve partial differential equations (PDEs). They have shown particularly striking successes when training an operator, which takes as input a PDE in some family, and outputs its solution. However, the architectural design space, especially given structural knowledge of the PDE family of interest, is still poorly understood. We… ▽ More

    Submitted 30 November, 2023; originally announced December 2023.

    Comments: NeurIPS 2023

  4. arXiv:2311.04163  [pdf, other

    cs.LG cs.AI cs.CV stat.ML

    Outliers with Opposing Signals Have an Outsized Effect on Neural Network Optimization

    Authors: Elan Rosenfeld, Andrej Risteski

    Abstract: We identify a new phenomenon in neural network optimization which arises from the interaction of depth and a particular heavy-tailed structure in natural data. Our result offers intuitive explanations for several previously reported observations about network training dynamics. In particular, it implies a conceptually new cause for progressive sharpening and the edge of stability; we also highligh… ▽ More

    Submitted 7 November, 2023; originally announced November 2023.

  5. arXiv:2310.03902  [pdf, other

    stat.ML cs.LG

    Provable benefits of annealing for estimating normalizing constants: Importance Sampling, Noise-Contrastive Estimation, and beyond

    Authors: Omar Chehab, Aapo Hyvarinen, Andrej Risteski

    Abstract: Recent research has developed several Monte Carlo methods for estimating the normalization constant (partition function) based on the idea of annealing. This means sampling successively from a path of distributions that interpolate between a tractable "proposal" distribution and the unnormalized "target" distribution. Prominent estimators in this family include annealed importance sampling and ann… ▽ More

    Submitted 9 October, 2023; v1 submitted 5 October, 2023; originally announced October 2023.

  6. arXiv:2306.09332  [pdf, ps, other

    cs.LG cs.DS

    Fit Like You Sample: Sample-Efficient Generalized Score Matching from Fast Mixing Diffusions

    Authors: Yilong Qin, Andrej Risteski

    Abstract: Score matching is an approach to learning probability distributions parametrized up to a constant of proportionality (e.g. Energy-Based Models). The idea is to fit the score of the distribution, rather than the likelihood, thus avoiding the need to evaluate the constant of proportionality. While there's a clear algorithmic benefit, the statistical "cost'' can be steep: recent work by Koehler et al… ▽ More

    Submitted 13 December, 2023; v1 submitted 15 June, 2023; originally announced June 2023.

    Comments: 41 pages

  7. arXiv:2306.01993  [pdf, ps, other

    cs.LG cs.DS stat.ML

    Provable benefits of score matching

    Authors: Chirag Pabbaraju, Dhruv Rohatgi, Anish Sevekari, Holden Lee, Ankur Moitra, Andrej Risteski

    Abstract: Score matching is an alternative to maximum likelihood (ML) for estimating a probability distribution parametrized up to a constant of proportionality. By fitting the ''score'' of the distribution, it sidesteps the need to compute this constant of proportionality (which is often intractable). While score matching and variants thereof are popular in practice, precise theoretical understanding of th… ▽ More

    Submitted 2 June, 2023; originally announced June 2023.

    Comments: 25 Pages

  8. arXiv:2306.00788  [pdf, other

    cs.LG stat.ML

    Understanding Augmentation-based Self-Supervised Representation Learning via RKHS Approximation and Regression

    Authors: Runtian Zhai, Bingbin Liu, Andrej Risteski, Zico Kolter, Pradeep Ravikumar

    Abstract: Data augmentation is critical to the empirical success of modern self-supervised representation learning, such as contrastive learning and masked language modeling. However, a theoretical understanding of the exact role of augmentation remains limited. Recent work has built the connection between self-supervised learning and the approximation of the top eigenspace of a graph Laplacian operator, su… ▽ More

    Submitted 18 January, 2024; v1 submitted 1 June, 2023; originally announced June 2023.

    Comments: ICLR 2024 spotlight. 34 pages

  9. arXiv:2303.04245  [pdf, other

    cs.LG cs.CL stat.ML

    How Do Transformers Learn Topic Structure: Towards a Mechanistic Understanding

    Authors: Yuchen Li, Yuanzhi Li, Andrej Risteski

    Abstract: While the successes of transformers across many domains are indisputable, accurate understanding of the learning mechanics is still largely lacking. Their capabilities have been probed on benchmarks which include a variety of structured and reasoning tasks -- but mathematical understanding is lagging substantially behind. Recent lines of work have begun studying representational aspects of this qu… ▽ More

    Submitted 24 July, 2023; v1 submitted 7 March, 2023; originally announced March 2023.

  10. arXiv:2210.12101  [pdf, ps, other

    cs.LG math.NA

    Neural Network Approximations of PDEs Beyond Linearity: A Representational Perspective

    Authors: Tanya Marwah, Zachary C. Lipton, Jianfeng Lu, Andrej Risteski

    Abstract: A burgeoning line of research leverages deep neural networks to approximate the solutions to high dimensional PDEs, opening lines of theoretical inquiry focused on explaining how it is that these models appear to evade the curse of dimensionality. However, most prior theoretical analyses have been limited to linear PDEs. In this work, we take a step towards studying the representational power of n… ▽ More

    Submitted 27 March, 2023; v1 submitted 21 October, 2022; originally announced October 2022.

  11. arXiv:2210.00726  [pdf, other

    cs.LG math.ST stat.ML

    Statistical Efficiency of Score Matching: The View from Isoperimetry

    Authors: Frederic Koehler, Alexander Heckett, Andrej Risteski

    Abstract: Deep generative models parametrized up to a normalizing constant (e.g. energy-based models) are difficult to train by maximizing the likelihood of the data because the likelihood and/or gradients thereof cannot be explicitly or efficiently written down. Score matching is a training method, whereby instead of fitting the likelihood $\log p(x)$ for the training data, we instead fit the score functio… ▽ More

    Submitted 22 December, 2022; v1 submitted 3 October, 2022; originally announced October 2022.

  12. arXiv:2210.00189  [pdf, other

    cs.LG stat.ML

    Pitfalls of Gaussians as a noise distribution in NCE

    Authors: Holden Lee, Chirag Pabbaraju, Anish Sevekari, Andrej Risteski

    Abstract: Noise Contrastive Estimation (NCE) is a popular approach for learning probability density functions parameterized up to a constant of proportionality. The main idea is to design a classification problem for distinguishing training data from samples from an easy-to-sample noise distribution $q$, in a manner that avoids having to calculate a partition function. It is well-known that the choice of… ▽ More

    Submitted 2 March, 2023; v1 submitted 1 October, 2022; originally announced October 2022.

    Comments: 14 pages, 1 figure

  13. arXiv:2203.15702  [pdf, other

    cs.LG cs.CV stat.ML

    Contrasting the landscape of contrastive and non-contrastive learning

    Authors: Ashwini Pokle, Jinjin Tian, Yuchen Li, Andrej Risteski

    Abstract: A lot of recent advances in unsupervised feature learning are based on designing features which are invariant under semantic data augmentations. A common way to do this is contrastive learning, which uses positive and negative samples. Some recent works however have shown promising results for non-contrastive learning, which does not require negative samples. However, the non-contrastive losses ha… ▽ More

    Submitted 29 March, 2022; originally announced March 2022.

    Comments: Accepted for publication in the AISTATS 2022 conference (http://aistats.org/aistats2022/accepted.html)

  14. arXiv:2203.14383  [pdf, ps, other

    cs.LG stat.ML

    Continual learning: a feature extraction formalization, an efficient algorithm, and fundamental obstructions

    Authors: Binghui Peng, Andrej Risteski

    Abstract: Continual learning is an emerging paradigm in machine learning, wherein a model is exposed in an online fashion to data from multiple different distributions (i.e. environments), and is expected to adapt to the distribution change. Precisely, the goal is to perform well in the new environment, while simultaneously retaining the performance on the previous environments (i.e. avoid "catastrophic for… ▽ More

    Submitted 27 March, 2022; originally announced March 2022.

  15. arXiv:2202.09305  [pdf, other

    cs.LG stat.ML

    Masked prediction tasks: a parameter identifiability view

    Authors: Bingbin Liu, Daniel Hsu, Pradeep Ravikumar, Andrej Risteski

    Abstract: The vast majority of work in self-supervised learning, both theoretical and empirical (though mostly the latter), have largely focused on recovering good features for downstream tasks, with the definition of "good" often being intricately tied to the downstream task itself. This lens is undoubtedly very interesting, but suffers from the problem that there isn't a "canonical" set of downstream task… ▽ More

    Submitted 18 February, 2022; originally announced February 2022.

  16. arXiv:2202.08907  [pdf, ps, other

    cs.DS cs.LG math.PR stat.ML

    Sampling Approximately Low-Rank Ising Models: MCMC meets Variational Methods

    Authors: Frederic Koehler, Holden Lee, Andrej Risteski

    Abstract: We consider Ising models on the hypercube with a general interaction matrix $J$, and give a polynomial time sampling algorithm when all but $O(1)$ eigenvalues of $J$ lie in an interval of length one, a situation which occurs in many models of interest. This was previously known for the Glauber dynamics when *all* eigenvalues fit in an interval of length one; however, a single outlier can force the… ▽ More

    Submitted 17 February, 2022; originally announced February 2022.

    Comments: 43 pages

  17. arXiv:2202.06856  [pdf, other

    cs.LG cs.AI

    Domain-Adjusted Regression or: ERM May Already Learn Features Sufficient for Out-of-Distribution Generalization

    Authors: Elan Rosenfeld, Pradeep Ravikumar, Andrej Risteski

    Abstract: A common explanation for the failure of deep networks to generalize out-of-distribution is that they fail to recover the "correct" features. We challenge this notion with a simple experiment which suggests that ERM already learns sufficient features and that the current bottleneck is not feature learning, but robust regression. Our findings also imply that given a small amount of data from the tar… ▽ More

    Submitted 27 October, 2022; v1 submitted 14 February, 2022; originally announced February 2022.

  18. arXiv:2112.06868  [pdf, other

    cs.LG stat.ML

    Variational autoencoders in the presence of low-dimensional data: landscape and implicit bias

    Authors: Frederic Koehler, Viraj Mehta, Chenghui Zhou, Andrej Risteski

    Abstract: Variational Autoencoders are one of the most commonly used generative models, particularly for image data. A prominent difficulty in training VAEs is data that is supported on a lower-dimensional manifold. Recent work by Dai and Wipf (2020) proposes a two-stage training algorithm for VAEs, based on a conjecture that in standard VAE training the generator will converge to a solution with 0 variance… ▽ More

    Submitted 17 May, 2022; v1 submitted 13 December, 2021; originally announced December 2021.

    Comments: Accepted as a conference paper at ICLR 2022

  19. arXiv:2110.11271  [pdf, other

    cs.LG stat.ML

    Analyzing and Improving the Optimization Landscape of Noise-Contrastive Estimation

    Authors: Bingbin Liu, Elan Rosenfeld, Pradeep Ravikumar, Andrej Risteski

    Abstract: Noise-contrastive estimation (NCE) is a statistically consistent method for learning unnormalized probabilistic models. It has been empirically observed that the choice of the noise distribution is crucial for NCE's performance. However, such observations have never been made formal or quantitative. In fact, it is not even clear whether the difficulties arising from a poorly chosen noise distribut… ▽ More

    Submitted 21 October, 2021; originally announced October 2021.

  20. arXiv:2107.04652  [pdf, ps, other

    cs.LG stat.ML

    The Effects of Invertibility on the Representational Complexity of Encoders in Variational Autoencoders

    Authors: Divyansh Pareek, Andrej Risteski

    Abstract: Training and using modern neural-network based latent-variable generative models (like Variational Autoencoders) often require simultaneously training a generative direction along with an inferential(encoding) direction, which approximates the posterior distribution over the latent variables. Thus, the question arises: how complex does the inferential model need to be, in order to be able to accur… ▽ More

    Submitted 9 July, 2021; originally announced July 2021.

    Comments: 34 pages

  21. arXiv:2107.02951  [pdf, ps, other

    cs.LG stat.ML

    Universal Approximation for Log-concave Distributions using Well-conditioned Normalizing Flows

    Authors: Holden Lee, Chirag Pabbaraju, Anish Sevekari, Andrej Risteski

    Abstract: Normalizing flows are a widely used class of latent-variable generative models with a tractable likelihood. Affine-coupling (Dinh et al, 2014-16) models are a particularly common type of normalizing flows, for which the Jacobian of the latent-to-observable-variable transformation is triangular, allowing the likelihood to be computed in linear time. Despite the widespread usage of affine couplings,… ▽ More

    Submitted 6 July, 2021; originally announced July 2021.

    Comments: 40 pages, 0 figures

  22. arXiv:2106.09913  [pdf, other

    cs.LG stat.ML

    Iterative Feature Matching: Toward Provable Domain Generalization with Logarithmic Environments

    Authors: Yining Chen, Elan Rosenfeld, Mark Sellke, Tengyu Ma, Andrej Risteski

    Abstract: Domain generalization aims at performing well on unseen test environments with data from a limited number of training environments. Despite a proliferation of proposal algorithms for this task, assessing their performance both theoretically and empirically is still very challenging. Distributional matching algorithms such as (Conditional) Domain Adversarial Networks [Ganin et al., 2016, Long et al… ▽ More

    Submitted 22 November, 2021; v1 submitted 18 June, 2021; originally announced June 2021.

    Comments: We acknowledge that the previous version of this paper (v1) contained an error - Theorem 3.2 was incorrect. We removed this theorem and updated the rest of the paper in v2

  23. arXiv:2106.01580  [pdf, other

    cs.CL cs.LG

    The Limitations of Limited Context for Constituency Parsing

    Authors: Yuchen Li, Andrej Risteski

    Abstract: Incorporating syntax into neural approaches in NLP has a multitude of practical and scientific benefits. For instance, a language model that is syntax-aware is likely to be able to produce better samples; even a discriminative model like BERT with a syntax module could be used for core NLP tasks like unsupervised syntactic parsing. Rapid progress in recent years was arguably spurred on by the empi… ▽ More

    Submitted 2 June, 2021; originally announced June 2021.

    Comments: To be published in ACL 2021 (https://2021.aclweb.org/) as a long paper

  24. arXiv:2103.02740  [pdf, ps, other

    stat.ML cs.LG

    Contrastive learning of strong-mixing continuous-time stochastic processes

    Authors: Bingbin Liu, Pradeep Ravikumar, Andrej Risteski

    Abstract: Contrastive learning is a family of self-supervised methods where a model is trained to solve a classification task constructed from unlabeled data. It has recently emerged as one of the leading learning paradigms in the absence of labels across many different domains (e.g. brain imaging, text, images). However, theoretical understanding of many aspects of training, both statistical and algorithmi… ▽ More

    Submitted 3 March, 2021; originally announced March 2021.

    Comments: Appearing in AISTATS 2021

  25. arXiv:2103.02138  [pdf, ps, other

    cs.LG math.NA stat.ML

    Parametric Complexity Bounds for Approximating PDEs with Neural Networks

    Authors: Tanya Marwah, Zachary C. Lipton, Andrej Risteski

    Abstract: Recent experiments have shown that deep networks can approximate solutions to high-dimensional PDEs, seemingly escaping the curse of dimensionality. However, questions regarding the theoretical basis for such approximations, including the required network size, remain open. In this paper, we investigate the representational power of neural networks for approximating solutions to linear elliptic PD… ▽ More

    Submitted 6 July, 2021; v1 submitted 2 March, 2021; originally announced March 2021.

  26. arXiv:2102.13128  [pdf, other

    cs.LG cs.AI cs.GT stat.ML

    An Online Learning Approach to Interpolation and Extrapolation in Domain Generalization

    Authors: Elan Rosenfeld, Pradeep Ravikumar, Andrej Risteski

    Abstract: A popular assumption for out-of-distribution generalization is that the training data comprises sub-datasets, each drawn from a distinct distribution; the goal is then to "interpolate" these distributions and "extrapolate" beyond them -- this objective is broadly known as domain generalization. A common belief is that ERM can interpolate but not extrapolate and that the latter is considerably more… ▽ More

    Submitted 18 November, 2021; v1 submitted 25 February, 2021; originally announced February 2021.

  27. arXiv:2010.05761  [pdf, other

    cs.LG cs.AI stat.ML

    The Risks of Invariant Risk Minimization

    Authors: Elan Rosenfeld, Pradeep Ravikumar, Andrej Risteski

    Abstract: Invariant Causal Prediction (Peters et al., 2016) is a technique for out-of-distribution generalization which assumes that some aspects of the data distribution vary across the training set but that the underlying causal mechanisms remain constant. Recently, Arjovsky et al. (2019) proposed Invariant Risk Minimization (IRM), an objective based on this idea for learning deep, invariant features of d… ▽ More

    Submitted 27 March, 2021; v1 submitted 12 October, 2020; originally announced October 2020.

    Comments: ICLR 2021 Camera-Ready

  28. arXiv:2010.01155  [pdf, other

    cs.LG stat.ML

    Representational aspects of depth and conditioning in normalizing flows

    Authors: Frederic Koehler, Viraj Mehta, Andrej Risteski

    Abstract: Normalizing flows are among the most popular paradigms in generative modeling, especially for images, primarily because we can efficiently evaluate the likelihood of a data point. This is desirable both for evaluating the fit of a model, and for ease of training, as maximizing the likelihood can be done by gradient descent. However, training normalizing flows comes with difficulties as well: model… ▽ More

    Submitted 25 June, 2021; v1 submitted 2 October, 2020; originally announced October 2020.

    Comments: Appeared in ICML 2021

  29. arXiv:2010.00137  [pdf, ps, other

    cs.LG math.ST stat.ML

    Efficient sampling from the Bingham distribution

    Authors: Rong Ge, Holden Lee, Jianfeng Lu, Andrej Risteski

    Abstract: We give a algorithm for exact sampling from the Bingham distribution $p(x)\propto \exp(x^\top A x)$ on the sphere $\mathcal S^{d-1}$ with expected runtime of $\operatorname{poly}(d, λ_{\max}(A)-λ_{\min}(A))$. The algorithm is based on rejection sampling, where the proposal distribution is a polynomial approximation of the pdf, and can be sampled from by explicitly evaluating integrals of polynomia… ▽ More

    Submitted 8 December, 2023; v1 submitted 30 September, 2020; originally announced October 2020.

    Comments: Corrected factor: $(1-x^2)^{(d-3)/2}$

    Journal ref: Algorithmic Learning Theory. PMLR, 2021

  30. arXiv:2008.04510  [pdf, other

    cs.LG cs.CL stat.ML

    On Learning Language-Invariant Representations for Universal Machine Translation

    Authors: Han Zhao, Junjie Hu, Andrej Risteski

    Abstract: The goal of universal machine translation is to learn to translate between any pair of languages, given a corpus of paired translated documents for \emph{a small subset} of all pairs of languages. Despite impressive empirical results and an increasing interest in massively multilingual models, theoretical analysis on translation errors made by such universal machine translation models is only nasc… ▽ More

    Submitted 11 August, 2020; originally announced August 2020.

    Comments: Appeared in ICML 2020

  31. arXiv:2002.05576  [pdf, other

    math.PR cs.DS cs.LG stat.ML

    Fast Convergence for Langevin Diffusion with Manifold Structure

    Authors: Ankur Moitra, Andrej Risteski

    Abstract: In this paper, we study the problem of sampling from distributions of the form p(x) \propto e^{-βf(x)} for some function f whose values and gradients we can query. This mode of access to f is natural in the scenarios in which such problems arise, for instance sampling from posteriors in parametric Bayesian models. Classical results show that a natural random walk, Langevin diffusion, mixes rapidly… ▽ More

    Submitted 21 September, 2020; v1 submitted 13 February, 2020; originally announced February 2020.

    Comments: 52 pages, in submission to NeurIPS 2020. This version: various typos fixed, minor reorganization

  32. arXiv:1907.00030  [pdf, other

    stat.ML cs.LG

    Empirical Study of the Benefits of Overparameterization in Learning Latent Variable Models

    Authors: Rares-Darius Buhai, Yoni Halpern, Yoon Kim, Andrej Risteski, David Sontag

    Abstract: One of the most surprising and exciting discoveries in supervised learning was the benefit of overparameterization (i.e. training a very large model) to improving the optimization landscape of a problem, with minimal effect on statistical performance (i.e. generalization). In contrast, unsupervised settings have been under-explored, despite the fact that it was observed that overparameterization c… ▽ More

    Submitted 16 July, 2020; v1 submitted 28 June, 2019; originally announced July 2019.

    Comments: 22 pages, to appear at ICML 2020

  33. arXiv:1905.13283  [pdf, ps, other

    cs.LG cs.DS math.ST stat.ML

    Sum-of-squares meets square loss: Fast rates for agnostic tensor completion

    Authors: Dylan J. Foster, Andrej Risteski

    Abstract: We study tensor completion in the agnostic setting. In the classical tensor completion problem, we receive $n$ entries of an unknown rank-$r$ tensor and wish to exactly complete the remaining entries. In agnostic tensor completion, we make no assumption on the rank of the unknown tensor, but attempt to predict unknown entries as well as the best rank-$r$ tensor. For agnostic learning of third-or… ▽ More

    Submitted 30 May, 2019; originally announced May 2019.

    Comments: To appear at COLT 2019

  34. arXiv:1812.00793  [pdf, other

    cs.LG cs.DS math.PR stat.ML

    Simulated Tempering Langevin Monte Carlo II: An Improved Proof using Soft Markov Chain Decomposition

    Authors: Rong Ge, Holden Lee, Andrej Risteski

    Abstract: A key task in Bayesian machine learning is sampling from distributions that are only specified up to a partition function (i.e., constant of proportionality). One prevalent example of this is sampling posteriors in parametric distributions, such as latent-variable generative models. However sampling (even very approximately) can be #P-hard. Classical results going back to Bakry and Émery (1985)… ▽ More

    Submitted 9 September, 2020; v1 submitted 29 November, 2018; originally announced December 2018.

    Comments: 69 pages. arXiv admin note: text overlap with arXiv:1710.02736

    Journal ref: Advances in Neural Information Processing Systems 31 (2018)

  35. arXiv:1808.07226  [pdf, ps, other

    cs.LG cs.DS math.PR stat.ML

    Mean-field approximation, convex hierarchies, and the optimality of correlation rounding: a unified perspective

    Authors: Vishesh Jain, Frederic Koehler, Andrej Risteski

    Abstract: The free energy is a key quantity of interest in Ising models, but unfortunately, computing it in general is computationally intractable. Two popular (variational) approximation schemes for estimating the free energy of general Ising models (in particular, even in regimes where correlation decay does not hold) are: (i) the mean-field approximation with roots in statistical physics, which estimates… ▽ More

    Submitted 28 August, 2018; v1 submitted 22 August, 2018; originally announced August 2018.

    Comments: This version: minor formatting changes, added grant acknowledgements

  36. arXiv:1806.10586  [pdf, other

    cs.LG cs.DS stat.ML

    Approximability of Discriminators Implies Diversity in GANs

    Authors: Yu Bai, Tengyu Ma, Andrej Risteski

    Abstract: While Generative Adversarial Networks (GANs) have empirically produced impressive results on learning complex real-world distributions, recent works have shown that they suffer from lack of diversity or mode collapse. The theoretical work of Arora et al. suggests a dilemma about GANs' statistical properties: powerful discriminators cause overfitting, whereas weak discriminators cannot detect mode… ▽ More

    Submitted 1 July, 2019; v1 submitted 27 June, 2018; originally announced June 2018.

  37. arXiv:1805.11405  [pdf, ps, other

    cs.LG cs.DS stat.ML

    Representational Power of ReLU Networks and Polynomial Kernels: Beyond Worst-Case Analysis

    Authors: Frederic Koehler, Andrej Risteski

    Abstract: There has been a large amount of interest, both in the past and particularly recently, into the power of different families of universal approximators, e.g. ReLU networks, polynomials, rational functions. However, current research has focused almost exclusively on understanding this problem in a worst-case setting, e.g. bounding the error of the best infinity-norm approximation in a box. In this s… ▽ More

    Submitted 29 May, 2018; originally announced May 2018.

  38. arXiv:1711.02651  [pdf, other

    cs.LG stat.ML

    Theoretical limitations of Encoder-Decoder GAN architectures

    Authors: Sanjeev Arora, Andrej Risteski, Yi Zhang

    Abstract: Encoder-decoder GANs architectures (e.g., BiGAN and ALI) seek to add an inference mechanism to the GANs setup, consisting of a small encoder deep net that maps data-points to their succinct encodings. The intuition is that being forced to train an encoder alongside the usual generator forces the system to learn meaningful mappings from the code to the data-point and vice-versa, which should improv… ▽ More

    Submitted 7 November, 2017; originally announced November 2017.

  39. arXiv:1710.02736  [pdf, other

    cs.LG cs.DS math.PR stat.ML

    Beyond Log-concavity: Provable Guarantees for Sampling Multi-modal Distributions using Simulated Tempering Langevin Monte Carlo

    Authors: Rong Ge, Holden Lee, Andrej Risteski

    Abstract: A key task in Bayesian statistics is sampling from distributions that are only specified up to a partition function (i.e., constant of proportionality). However, without any assumptions, sampling (even approximately) can be #P-hard, and few works have provided "beyond worst-case" guarantees for such settings. For log-concave distributions, classical results going back to Bakry and Émery (1985) s… ▽ More

    Submitted 5 November, 2017; v1 submitted 7 October, 2017; originally announced October 2017.

    Comments: 53 pages

  40. arXiv:1706.04601  [pdf, ps, other

    cs.LG stat.ML

    Provable benefits of representation learning

    Authors: Sanjeev Arora, Andrej Risteski

    Abstract: There is general consensus that learning representations is useful for a variety of reasons, e.g. efficient use of labeled data (semi-supervised learning), transfer learning and understanding hidden structure of data. Popular techniques for representation learning include clustering, manifold learning, kernel-learning, autoencoders, Boltzmann machines, etc. To study the relative merits of these… ▽ More

    Submitted 14 June, 2017; originally announced June 2017.

    Comments: 22 pages

  41. arXiv:1705.00217  [pdf, other

    cs.CL cs.IR

    Extending and Improving Wordnet via Unsupervised Word Embeddings

    Authors: Mikhail Khodak, Andrej Risteski, Christiane Fellbaum, Sanjeev Arora

    Abstract: This work presents an unsupervised approach for improving WordNet that builds upon recent advances in document and sense representation via distributional semantics. We apply our methods to construct Wordnets in French and Russian, languages which both lack good manual constructions.1 These are evaluated on two new 600-word test sets for word-to-synset matching and found to improve greatly upon sy… ▽ More

    Submitted 29 April, 2017; originally announced May 2017.

    Comments: 17 pages, 3 figures, In Submission

  42. arXiv:1702.07028  [pdf, ps, other

    cs.LG

    On the ability of neural nets to express distributions

    Authors: Holden Lee, Rong Ge, Tengyu Ma, Andrej Risteski, Sanjeev Arora

    Abstract: Deep neural nets have caused a revolution in many classification tasks. A related ongoing revolution -- also theoretically not understood -- concerns their ability to serve as generative models for complicated types of data such as images and texts. These models are trained using ideas like variational autoencoders and Generative Adversarial Networks. We take a first cut at explaining the expres… ▽ More

    Submitted 17 April, 2021; v1 submitted 22 February, 2017; originally announced February 2017.

    Comments: Accepted to COLT 2017

  43. arXiv:1612.08795  [pdf, ps, other

    cs.LG cs.DS stat.ML

    Provable learning of Noisy-or Networks

    Authors: Sanjeev Arora, Rong Ge, Tengyu Ma, Andrej Risteski

    Abstract: Many machine learning applications use latent variable models to explain structure in data, whereby visible variables (= coordinates of the given datapoint) are explained as a probabilistic function of some hidden variables. Finding parameters with the maximum likelihood is NP-hard even in very simple settings. In recent years, provably efficient algorithms were nevertheless developed for models w… ▽ More

    Submitted 27 December, 2016; originally announced December 2016.

  44. arXiv:1611.03819  [pdf, ps, other

    cs.LG cs.DS stat.ML

    Recovery Guarantee of Non-negative Matrix Factorization via Alternating Updates

    Authors: Yuanzhi Li, Yingyu Liang, Andrej Risteski

    Abstract: Non-negative matrix factorization is a popular tool for decomposing data into feature and weight matrices under non-negativity constraints. It enjoys practical success but is poorly understood theoretically. This paper proposes an algorithm that alternates between decoding the weights and updating the features, and shows that assuming a generative model of the data, it provably recovers the ground… ▽ More

    Submitted 11 November, 2016; originally announced November 2016.

    Comments: To appear in NIPS 2016. 8 pages of extended abstract; 48 pages in total

  45. arXiv:1607.03360  [pdf, ps, other

    cs.LG cs.DS stat.ML

    Approximate maximum entropy principles via Goemans-Williamson with applications to provable variational methods

    Authors: Yuanzhi Li, Andrej Risteski

    Abstract: The well known maximum-entropy principle due to Jaynes, which states that given mean parameters, the maximum entropy distribution matching them is in an exponential family, has been very popular in machine learning due to its "Occam's razor" interpretation. Unfortunately, calculating the potentials in the maximum-entropy distribution is intractable \cite{bresler2014hardness}. We provide computatio… ▽ More

    Submitted 12 July, 2016; originally announced July 2016.

    Comments: 12 pages

  46. arXiv:1607.03183  [pdf, ps, other

    cs.LG cs.DS stat.ML

    How to calculate partition functions using convex programming hierarchies: provable bounds for variational methods

    Authors: Andrej Risteski

    Abstract: We consider the problem of approximating partition functions for Ising models. We make use of recent tools in combinatorial optimization: the Sherali-Adams and Lasserre convex programming hierarchies, in combination with variational methods to get algorithms for calculating partition functions in these families. These techniques give new, non-trivial approximation guarantees for the partition func… ▽ More

    Submitted 11 July, 2016; originally announced July 2016.

    Comments: This paper was accepted for presentation at Conference on Learning Theory (COLT) 2016

    Journal ref: 29th Annual Conference on Learning Theory (pp. 1402-1416), 2016

  47. arXiv:1602.02262  [pdf, ps, other

    cs.LG cs.DS stat.ML

    Recovery guarantee of weighted low-rank approximation via alternating minimization

    Authors: Yuanzhi Li, Yingyu Liang, Andrej Risteski

    Abstract: Many applications require recovering a ground truth low-rank matrix from noisy observations of the entries, which in practice is typically formulated as a weighted low-rank approximation problem and solved by non-convex optimization heuristics such as alternating minimization. In this paper, we provide provable recovery guarantee of weighted low-rank via a simple alternating minimization algorithm… ▽ More

    Submitted 8 December, 2016; v1 submitted 6 February, 2016; originally announced February 2016.

    Comments: 40 pages. Updated with the ICML 2016 camera ready version, together with an additional algorithm which needs less assumptions in Appendix C

  48. arXiv:1601.03764  [pdf, ps, other

    cs.CL cs.LG stat.ML

    Linear Algebraic Structure of Word Senses, with Applications to Polysemy

    Authors: Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, Andrej Risteski

    Abstract: Word embeddings are ubiquitous in NLP and information retrieval, but it is unclear what they represent when the word is polysemous. Here it is shown that multiple word senses reside in linear superposition within the word embedding and simple sparse coding can recover vectors that approximately capture the senses. The success of our approach, which applies to several embedding methods, is mathemat… ▽ More

    Submitted 7 December, 2018; v1 submitted 14 January, 2016; originally announced January 2016.

    Comments: Appear in the Transactions of the Association for Computational Linguistics 2018, link: https://transacl.org/ojs/index.php/tacl/article/view/1346

  49. arXiv:1512.01829  [pdf, other

    cs.DS

    On Routing Disjoint Paths in Bounded Treewidth Graphs

    Authors: Alina Ene, Matthias Mnich, Marcin Pilipczuk, Andrej Risteski

    Abstract: We study the problem of routing on disjoint paths in bounded treewidth graphs with both edge and node capacities. The input consists of a capacitated graph $G$ and a collection of $k$ source-destination pairs $\mathcal{M} = \{(s_1, t_1), \dots, (s_k, t_k)\}$. The goal is to maximize the number of pairs that can be routed subject to the capacities in the graph. A routing of a subset $\mathcal{M}'$… ▽ More

    Submitted 6 December, 2015; originally announced December 2015.

  50. arXiv:1503.06567  [pdf, ps, other

    cs.LG cs.DS stat.ML

    On some provably correct cases of variational inference for topic models

    Authors: Pranjal Awasthi, Andrej Risteski

    Abstract: Variational inference is a very efficient and popular heuristic used in various forms in the context of latent variable models. It's closely related to Expectation Maximization (EM), and is applied when exact EM is computationally infeasible. Despite being immensely popular, current theoretical understanding of the effectiveness of variaitonal inference based algorithms is very limited. In this wo… ▽ More

    Submitted 22 August, 2015; v1 submitted 23 March, 2015; originally announced March 2015.

    Comments: 46 pages, Compared to previous version: clarified notation, a number of typos fixed throughout paper