This document discusses using regularized linear models, such as logistic regression, together with feature engineering techniques like polynomial expansion to solve classification problems at scale. It describes how polynomial expansion can make nonlinear relationships linearly separable by mapping features into a higher-dimensional space. It also explains how Elastic Net regularization, which combines L1 and L2 penalties, can select important features and scale to large datasets using Apache Spark. Experiments on several datasets show that logistic regression with degree-2 polynomial features performs comparably to nonlinear kernels while training faster.
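A minimal sketch of the approach using Spark ML's built-in stages (data and column names are illustrative):

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import PolynomialExpansion
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(0.0, Vectors.dense([0.5, -1.0])), (1.0, Vectors.dense([1.5, 2.0]))],
    ["label", "features"])

# Degree-2 expansion can make some nonlinear decision boundaries linearly separable.
poly = PolynomialExpansion(degree=2, inputCol="features", outputCol="polyFeatures")
# elasticNetParam mixes L1 (sparsity / feature selection) and L2 (stability).
lr = LogisticRegression(featuresCol="polyFeatures", regParam=0.01, elasticNetParam=0.5)
model = Pipeline(stages=[poly, lr]).fit(df)
```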
Spark offers a number of advantages over its predecessor, Hadoop MapReduce, that make it ideal for large-scale machine learning. For example, Spark includes MLlib, a library of machine learning algorithms for large data. The presentation will cover the state of MLlib and the details of some of the scalable algorithms it includes.
Spark and GraphX in the Netflix Recommender System: We at Netflix strive to deliver maximum enjoyment and entertainment to our millions of members across the world. We do so by having great content and by constantly innovating on our product. A key strategy to optimize both is to follow a data-driven method. Data allows us to find optimal approaches to applications such as content buying or our renowned personalization algorithms. But, in order to learn from this data, we need to be smart about the algorithms we use, how we apply them, and how we can scale them to our volume of data (over 50 million members and 5 billion hours streamed over three months). In this talk we describe how Spark and GraphX can be leveraged to address some of our scale challenges. In particular, we share insights and lessons learned on how to run large probabilistic clustering and graph diffusion algorithms on top of GraphX, making it possible to apply them at Netflix scale.
As machine learning matures, the standard supervised learning setup is no longer sufficient. Instead of making and serving a single prediction as a function of a data point, machine learning applications increasingly must operate in dynamic environments, react to changes in the environment, and take sequences of actions to accomplish a goal. These modern applications are better framed within the context of reinforcement learning (RL), which deals with learning to operate within an environment. RL-based applications have already led to remarkable results, such as Google’s AlphaGo beating the Go world champion, and are finding their way into self-driving cars, UAVs, and surgical robotics. These applications have very demanding computational requirements: at the high end, they may need to execute millions of tasks per second with millisecond-level latencies, and support heterogeneous and dynamic computation graphs. In this talk, we present Ray, a new cluster computing framework that meets these requirements, give some application examples, and discuss how it can be integrated with Apache Spark.
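For flavor, a minimal Ray task example (Ray's Python API; the workload here is a stand-in for an actual RL rollout):

```python
import ray

ray.init()

@ray.remote
def simulate(step):
    # Placeholder for an RL rollout or environment step.
    return step * step

# Launch many fine-grained tasks in parallel and gather the results.
futures = [simulate.remote(i) for i in range(1000)]
results = ray.get(futures)
```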
Recent workload trends indicate rapid growth in the deployment of machine learning, genomics, and scientific workloads using Apache Spark. However, efficiently running these applications on cloud computing infrastructure like Amazon EC2 is challenging, and we find that choosing the right hardware configuration can significantly improve performance and cost. The key to addressing this challenge is the ability to predict the performance of applications under various resource configurations so that we can automatically choose the optimal one. We present Ernest, a performance prediction framework for large scale analytics. Ernest builds performance models based on the behavior of the job on small samples of data and then predicts its performance on larger datasets and cluster sizes. Our evaluation on Amazon EC2 using several workloads shows that our prediction error is low while having a training overhead of less than 5% for long-running jobs.
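A sketch of the kind of scaling model such a framework fits; the feature choice loosely follows Ernest's scaling terms, but treat the specifics as illustrative:

```python
import numpy as np
from scipy.optimize import nnls

# (data_fraction, num_machines, runtime_seconds) observed on small sample runs.
runs = [(0.05, 2, 40.0), (0.05, 4, 24.0), (0.10, 4, 42.0), (0.10, 8, 26.0)]

def featurize(scale, machines):
    # Serial term, parallel work, log-shaped costs (e.g., tree aggregation),
    # and a per-machine overhead term.
    return [1.0, scale / machines, np.log2(machines), float(machines)]

A = np.array([featurize(s, m) for s, m, _ in runs])
b = np.array([t for *_, t in runs])
theta, _ = nnls(A, b)  # non-negativity keeps the coefficients physically plausible

# Extrapolate: full dataset (scale=1.0) on a 16-machine cluster.
predicted = float(np.dot(featurize(1.0, 16), theta))
```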
This document discusses using Spark Streaming and GraphX to perform near-realtime analytics on large distributed systems. The authors present a model-driven approach to implement Pregel-style graph processing to handle heterogeneous graphs. They were able to achieve over 100,000 messages per second on a 4-node cluster by using sufficiently large batch sizes. Implementation challenges included scaling graph processing across nodes, dealing with graph heterogeneity, and hidden memory costs from intermediate RDDs. Lessons learned include the importance of partitioning, testing high availability, and addressing memory sinks.
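A toy sketch of the Pregel-style compute model being scaled here (plain single-process Python; GraphX's actual API differs):

```python
# Each superstep: vertices receive merged messages, update state, send new messages.
edges = {1: [2, 3], 2: [3], 3: [1]}   # adjacency: src -> dsts
state = {v: 0.0 for v in edges}       # per-vertex state
inbox = {1: [1.0]}                    # initial messages

for superstep in range(10):
    outbox = {}
    for v, msgs in inbox.items():
        state[v] += sum(msgs)         # vertex program: merge messages, update state
        for dst in edges[v]:
            outbox.setdefault(dst, []).append(state[v] * 0.5)  # send along edges
    if not outbox:
        break                         # no messages in flight: converged
    inbox = outbox
```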
Modern Data-Intensive Scalable Computing (DISC) systems such as Apache Spark do not support sophisticated cost-based query optimizers, either because they are specifically designed to process data that resides in external storage systems (e.g., HDFS) or because they lack the necessary data statistics. Consequently, many crucial optimizations, such as join order and plan selection, are presently out of scope for these DISC system optimizers. Yet join order is one of the most important decisions a cost-based optimizer can make: a wrong order can make query response time more than an order of magnitude slower than the better order.
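A back-of-the-envelope illustration of why order matters, with made-up cardinalities and selectivities, costing each plan by the size of the rows it materializes:

```python
# Hypothetical table sizes and join selectivities.
rows = {"orders": 10_000_000, "customers": 100_000, "promo": 100}

def join_size(left, right, selectivity):
    return left * right * selectivity

# Plan A: do the selective join first -> tiny intermediate result.
a1 = join_size(rows["promo"], rows["customers"], 1e-5)   # ~100 rows
a_total = a1 + join_size(a1, rows["orders"], 1e-7)       # ~200 rows total

# Plan B: do the big join first -> huge intermediate result.
b1 = join_size(rows["orders"], rows["customers"], 1e-5)  # ~10^7 rows
b_total = b1 + join_size(b1, rows["promo"], 1e-7)        # ~10^7 rows total

print(a_total, b_total)  # Plan B materializes ~5 orders of magnitude more rows
```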
This document summarizes an online presentation about online learning with Structured Streaming in Spark. The key points are:
- Online learning updates model parameters for each data point as it arrives, unlike batch learning, which sees the full dataset before updating.
- Structured Streaming in Spark provides a single API for batch, streaming, and machine learning workloads. It offers exactly-once guarantees and understands external event time.
- Streaming machine learning on Structured Streaming works by having a stateful aggregation query that picks up the last trained model and performs a distributed update and merge on each trigger interval. This allows modeling streaming data in a fault-tolerant way.
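A sketch of that update-and-merge pattern, with plain Python standing in for the distributed step (this is not Structured Streaming's actual API):

```python
import numpy as np

def local_update(weights, batch, lr=0.1):
    # One SGD pass over a partition's mini-batch (squared loss, linear model).
    w = weights.copy()
    for x, y in batch:
        w -= lr * (w @ x - y) * x
    return w

def merge(models):
    # Merge per-partition models, e.g., by averaging.
    return np.mean(models, axis=0)

weights = np.zeros(2)
for trigger in range(5):                                      # each trigger interval...
    partitions = [[(np.array([1.0, t]), 2.0 + t)] for t in range(3)]
    updated = [local_update(weights, p) for p in partitions]  # distributed update
    weights = merge(updated)                                  # stateful aggregation keeps the model
```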
This document discusses dynamic community detection for e-commerce data using Spark Streaming and GraphX. It presents an approach for processing streaming graph data to perform community detection in real-time. Key points include using GraphX to merge small incremental graphs into a large stock graph, developing incremental algorithms like JV and UMG that make local updates to communities based on modularity optimization, and monitoring communities over time to trigger rebuilds if the modularity drops below a threshold. This dynamic approach allows for more sophisticated analysis of streaming e-commerce data compared to static community detection.
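For reference, a minimal computation of the modularity these incremental updates optimize (undirected, unweighted toy graph), together with the threshold-triggered rebuild check:

```python
import itertools

edges = [(0, 1), (1, 2), (0, 2), (2, 3)]   # toy undirected graph
comm = {0: "a", 1: "a", 2: "a", 3: "b"}    # current community assignment

m = len(edges)
deg = {}
for u, v in edges:
    deg[u] = deg.get(u, 0) + 1
    deg[v] = deg.get(v, 0) + 1

adj = set(edges) | {(v, u) for u, v in edges}
nodes = deg.keys()
# Q = (1/2m) * sum_ij [A_ij - k_i*k_j/(2m)] * delta(c_i, c_j)
Q = sum(((1.0 if (i, j) in adj else 0.0) - deg[i] * deg[j] / (2 * m))
        for i, j in itertools.product(nodes, repeat=2)
        if comm[i] == comm[j]) / (2 * m)

REBUILD_THRESHOLD = 0.3                    # hypothetical threshold
if Q < REBUILD_THRESHOLD:                  # monitoring: trigger a full rebuild
    print("modularity dropped, rebuild communities")
```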
The document describes KeystoneML, an open source software framework for building scalable machine learning pipelines on Apache Spark. It discusses standard machine learning pipelines and examples of more complex pipelines for image classification, text classification, and recommender systems. It covers KeystoneML's core abstractions, transformers and estimators, and how they are chained together into pipelines. It also discusses optimizing pipelines by choosing solvers, caching intermediate data, and operator selection. Benchmark results show KeystoneML achieves state-of-the-art accuracy on large datasets faster than other systems through end-to-end pipeline optimizations.
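The transformer/estimator pattern in miniature (generic Python, not KeystoneML's actual Scala API): a transformer maps data to data, an estimator is fit on data and emits a transformer, and chaining composes them into a pipeline.

```python
class Transformer:
    def apply(self, data):
        raise NotImplementedError
    def then(self, nxt):
        # Chaining composes two transformers into one pipeline stage.
        first = self
        class Chained(Transformer):
            def apply(self, data):
                return nxt.apply(first.apply(data))
        return Chained()

class Scale(Transformer):
    def __init__(self, factor): self.factor = factor
    def apply(self, data): return [x * self.factor for x in data]

class MeanEstimator:
    def fit(self, data):
        # An estimator consumes data and emits a fitted transformer.
        mu = sum(data) / len(data)
        class Center(Transformer):
            def apply(self, data): return [x - mu for x in data]
        return Center()

pipeline = Scale(2.0).then(MeanEstimator().fit([1.0, 2.0, 3.0]))
print(pipeline.apply([1.0, 2.0, 3.0]))  # scale, then center
```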
Tegra is a system for efficiently processing time-evolving graphs on commodity clusters. It uses a distributed graph snapshot index to represent and retrieve multiple snapshots of evolving graphs. It introduces a timelapse abstraction to perform temporal analytics on windows of snapshots, avoiding redundant computation. Tegra supports both bulk and incremental graph computations using this representation, allowing results to be reused when graphs are updated. An evaluation on real-world graphs shows Tegra can store more snapshots in memory and reduce computation time compared to baseline approaches.
Real-world graphs are seldom static. Applications that generate graph-structured data today do so continuously, giving rise to an underlying graph whose structure evolves over time. Mining these time-evolving graphs can be insightful, both from research and business perspectives. While several works have focused on individual aspects of this problem, there exists no general-purpose time-evolving graph processing engine. We present Tegra, a time-evolving graph processing system built on a general-purpose dataflow framework. We introduce Timelapse, a flexible abstraction that enables efficient analytics on evolving graphs by allowing graph-parallel stages to iterate over the complete history of nodes. We use Timelapse to present two computational models: a temporal analysis model for performing computations on multiple snapshots of an evolving graph, and a generalized incremental computation model for efficiently updating the results of computations.
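A toy sketch of the timelapse idea, with snapshots as plain dicts (Tegra's actual snapshot index shares state across snapshots rather than copying it):

```python
# Snapshots of an evolving graph's vertex property (e.g., degree) over time.
snapshots = [
    {"a": 1, "b": 1},
    {"a": 2, "b": 1, "c": 1},
    {"a": 2, "b": 2, "c": 2},
]

def timelapse(snapshots, vertex):
    # Iterate over a vertex's complete history across a window of snapshots.
    return [s.get(vertex) for s in snapshots]

# Temporal analytics: how did vertex "a" evolve over the window?
print(timelapse(snapshots, "a"))  # [1, 2, 2]
```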
This deck covers the SystemML architecture, where to find documentation on usage, algorithms, and more, and explains how to use SystemML from the command line or from a notebook.
Kyle Foreman presented on using Spark for large-scale global health simulations. The talk discussed (1) the motivation for simulations to model disease burden forecasts and alternative scenarios, (2) SimBuilder for constructing modular simulation workflows as directed acyclic graphs, and (3) benchmarks showing Spark backends can efficiently distribute simulations across a cluster. Future work aims to optimize Spark DataFrame joins and take better advantage of NumPy's vectorization for panel data simulations.
This document discusses Catalyst, the query optimizer in Apache Spark. It begins by explaining how Catalyst works at a high level, including how it abstracts user programs as trees and uses transformations and strategies to optimize logical and physical plans. It then provides more details on specific aspects like rule execution, ensuring requirements, and examples of optimizations. The document aims to help users understand how Catalyst optimizes queries automatically and provides tips on exploring its code and writing optimizations.
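Catalyst's rules are transformations over trees; here is the idea in miniature with a hypothetical expression tree and a constant-folding rule (these are not Catalyst's actual classes):

```python
from dataclasses import dataclass

@dataclass
class Lit:            # literal leaf node
    value: int

@dataclass
class Add:            # binary expression node
    left: object
    right: object

def constant_fold(node):
    # Bottom-up rewrite rule: replace Add(Lit, Lit) with a single Lit.
    if isinstance(node, Add):
        left, right = constant_fold(node.left), constant_fold(node.right)
        if isinstance(left, Lit) and isinstance(right, Lit):
            return Lit(left.value + right.value)
        return Add(left, right)
    return node

tree = Add(Lit(1), Add(Lit(2), Lit(3)))
print(constant_fold(tree))  # Lit(value=6)
```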
The document discusses using Map-Reduce for machine learning algorithms on multi-core processors. It describes rewriting machine learning algorithms in "summation form" to express the independent computations as Map tasks and aggregating results as Reduce tasks. This formulation allows the algorithms to be parallelized efficiently across multiple cores. Specific machine learning algorithms that have been implemented or analyzed in this Map-Reduce framework are listed.
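For example, ordinary least squares in summation form: map tasks compute partial sums of XᵀX and Xᵀy over independent data chunks, a reduce step adds them, and the final solve is cheap (plain NumPy sketch):

```python
import numpy as np

def map_task(chunk_X, chunk_y):
    # Independent per-chunk sufficient statistics.
    return chunk_X.T @ chunk_X, chunk_X.T @ chunk_y

def reduce_task(partials):
    # Aggregation is just elementwise addition of the partial sums.
    XtX = sum(p[0] for p in partials)
    Xty = sum(p[1] for p in partials)
    return XtX, Xty

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 3)), rng.normal(size=100)
chunks = [(X[i:i + 25], y[i:i + 25]) for i in range(0, 100, 25)]

XtX, Xty = reduce_task([map_task(cx, cy) for cx, cy in chunks])
w = np.linalg.solve(XtX, Xty)  # same answer as solving on all the data at once
```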
Spark is a new cluster computing engine that is rapidly gaining popularity — with over 150 contributors in the past year, it is one of the most active open source projects in big data, surpassing even Hadoop MapReduce. Spark was designed to both make traditional MapReduce programming easier and to support new types of applications, with one of the earliest focus areas being machine learning. In this talk, we’ll introduce Spark and show how to use it to build fast, end-to-end machine learning workflows. Using Spark’s high-level API, we can process raw data with familiar libraries in Java, Scala or Python (e.g. NumPy) to extract the features for machine learning. Then, using MLlib, its built-in machine learning library, we can run scalable versions of popular algorithms. We’ll also cover upcoming development work including new built-in algorithms and R bindings. Bio: Xiangrui Meng is a software engineer at Databricks. He has been actively involved in the development of Spark MLlib since he joined. Before Databricks, he worked as an applied research engineer at LinkedIn, where he was the main developer of an offline machine learning framework in Hadoop MapReduce. His thesis work at Stanford is on randomized algorithms for large-scale linear regression.
The slides of a talk at the Spark Taiwan User Group, sharing my experience and some general tips for participating in Kaggle competitions.
Generalized linear models (GLMs) are a class of models that includes linear regression, logistic regression, Poisson regression, and related forms. GLMs are implemented in both MLlib and SparkR in Spark. They support various solvers such as gradient descent, L-BFGS, and iteratively reweighted least squares (IRLS). Performance is optimized through techniques like exploiting sparsity, tree aggregation, and avoiding unnecessary data copies. Future work includes better handling of categoricals, more model statistics, and model parallelism.
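For instance, Spark ML's GLM interface (data and column names illustrative):

```python
from pyspark.ml.regression import GeneralizedLinearRegression
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1.0, Vectors.dense([0.0, 1.0])),
     (0.0, Vectors.dense([1.0, 2.0])),
     (1.0, Vectors.dense([0.5, 3.0]))],
    ["label", "features"])

# Binomial family with logit link = logistic regression, fit by IRLS.
glr = GeneralizedLinearRegression(family="binomial", link="logit",
                                  maxIter=10, regParam=0.3)
model = glr.fit(df)
print(model.coefficients, model.intercept)
```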
MLlib is an Apache Spark component that focuses on machine learning algorithms. It was initially contributed by the AMPLab at UC Berkeley and has supported sparse data since version 1.0. This document discusses how sparse data appears frequently in real-world big data problems and describes how MLlib exploits sparsity to improve storage needs and computation speed for algorithms like k-means, linear methods, and singular value decomposition. By avoiding unnecessary computations on zero values and leveraging sparse linear algebra, MLlib is able to efficiently handle sparse data problems at large scale.
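Sparse vectors store only the nonzero entries, which is where both the storage and compute savings come from, e.g.:

```python
from pyspark.ml.linalg import Vectors

# A length-1000 vector with 3 nonzeros: only (index, value) pairs are stored.
sv = Vectors.sparse(1000, [0, 47, 999], [1.0, 3.0, 5.0])

# Dot products and distances skip the zeros, so k-means, linear methods,
# and SVD all get cheaper when the inputs are sparse.
print(sv.dot(sv))        # 1 + 9 + 25 = 35.0
print(sv.numNonzeros())  # 3
```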
This document summarizes a proposed method for discriminative unsupervised dimensionality reduction called DUDR. It begins by introducing traditional dimensionality reduction techniques like PCA and LDA. It then discusses limitations of existing graph embedding methods that require constructing a graph beforehand. The proposed DUDR method jointly learns the graph construction and dimensionality reduction to avoid this dependency. It formulates an optimization problem to learn a projection matrix and affinity matrix simultaneously. Experimental results on synthetic and real-world datasets show DUDR achieves better clustering performance than other methods like PCA, LPP, k-means and NMF.
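The summary above describes learning a projection matrix and an affinity matrix simultaneously; objectives of this adaptive-graph family typically take a form like the following (written from the summary, so treat the exact terms as representative rather than the paper's precise formulation):

```latex
\min_{W,\,S} \; \sum_{i,j} \left( \lVert W^{\top}x_i - W^{\top}x_j \rVert_2^2 \, s_{ij} + \gamma \, s_{ij}^2 \right)
\quad \text{s.t. } W^{\top}W = I, \;\; \textstyle\sum_j s_{ij} = 1, \;\; s_{ij} \ge 0,
```

where W is the projection matrix and S the learned affinity matrix: points that stay close after projection are encouraged to have high affinity, and the affinities in turn shape the projection.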
This document summarizes techniques for mapping application topologies to interconnect network topologies. It discusses how improving data locality through topology mapping can reduce communication costs, execution time, and energy consumption. Several common mapping techniques are described, including linear programming formulations, greedy approaches, partitioning approaches, transformative approaches, and those based on graph similarity. The document notes that finding an optimal mapping is NP-complete and different techniques may work better depending on the topology.
This document discusses sensitivity analysis of machine learning hyperparameters. It begins by introducing the goal of tuning hyperparameters through trial and error while analyzing how sensitive models are to small parameter changes. Next, it outlines the machine learning process, including evaluating hypotheses, model selection with validation sets, and analyzing bias and variance. It then provides an example of applying these concepts to reinforcement learning in a gridworld environment. Specifically, it tunes learning rate and discount factor, finds the optimal point, and analyzes the hyperparameter surface around this point. In summary, the document advocates for techniques like Spearmint and canonical analysis to enable fair model comparisons and choose models with smooth hyperparameter surfaces.
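The sensitivity probe itself is simple; in the sketch below, the hypothetical `evaluate` stands in for training and scoring an agent in the gridworld:

```python
import itertools

def evaluate(lr, gamma):
    # Hypothetical stand-in for: train Q-learning in the gridworld, return mean reward.
    return -(lr - 0.1) ** 2 - (gamma - 0.9) ** 2

# Grid search over learning rate and discount factor to find the optimal point.
best = max(itertools.product([0.01, 0.05, 0.1, 0.2], [0.8, 0.9, 0.95, 0.99]),
           key=lambda p: evaluate(*p))

# Probe the surface around the optimum: perturb slightly, see how much the score moves.
eps = 0.01
for d_lr, d_g in [(eps, 0), (-eps, 0), (0, eps), (0, -eps)]:
    delta = evaluate(best[0] + d_lr, best[1] + d_g) - evaluate(*best)
    print(d_lr, d_g, delta)  # small deltas => a smooth, robust hyperparameter surface
```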
- The document proposes ProxGen, a unified framework for stochastic proximal gradient descent methods that can handle arbitrary preconditioners and non-convex regularizers.
- ProxGen derives proximal updates for popular optimizers like Adam that incorporate the preconditioner into the proximal mapping, which previous work had not addressed.
- Experiments on sparse neural networks and binary neural networks demonstrate that ProxGen converges faster and achieves better generalization than subgradient-based methods and previous proximal gradient methods.
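The core update, sketched for an L1 regularizer with a diagonal Adam-style preconditioner; the soft-thresholding is done coordinate-wise with the preconditioned step size, a simplification of the paper's general scheme:

```python
import numpy as np

def proxgen_step(theta, grad, v, lr=1e-3, lam=1e-2, beta2=0.999, eps=1e-8):
    # Diagonal preconditioner from a running average of squared gradients (Adam-style).
    v = beta2 * v + (1 - beta2) * grad ** 2
    step = lr / (np.sqrt(v) + eps)   # per-coordinate step sizes
    z = theta - step * grad          # preconditioned gradient step
    # Proximal mapping of lam * ||.||_1, using the *same* preconditioned step size,
    # i.e., the preconditioner enters the prox rather than being ignored:
    theta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return theta, v

theta, v = np.ones(4), np.zeros(4)
theta, v = proxgen_step(theta, grad=np.array([0.1, -0.2, 0.0, 0.3]), v=v)
```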
An overview of CVPR 2018 papers on mobile deep learning, including NetAdapt, ADC, and "Quantization and Training of ..." (the TensorFlow Lite quantization paper).
This document summarizes Joseph Bradley's presentation on designing distributed machine learning on Apache Spark. Bradley is a committer and PMC member of Apache Spark and works as a software engineer at Databricks. He discusses how Spark provides a unified engine for distributed workloads and libraries like MLlib make it possible to perform scalable machine learning. Bradley outlines different methods for distributing ML algorithms, using k-means clustering as an example of reorganizing an algorithm to fit the MapReduce framework in a way that minimizes communication costs.
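A sketch of that reorganization for one k-means iteration: each map task emits only k (sum, count) pairs for its partition, and the reduce step merges them, so the data crossing the network is tiny compared to the points themselves.

```python
import numpy as np

def map_partition(points, centers):
    # Per-partition sufficient statistics: sum and count for each nearest center.
    stats = {}
    for p in points:
        c = int(np.argmin([np.linalg.norm(p - m) for m in centers]))
        s, n = stats.get(c, (np.zeros_like(p), 0))
        stats[c] = (s + p, n + 1)
    return stats

def reduce_stats(all_stats, k):
    # Merge the per-partition statistics and recompute the centroids.
    merged = {c: (0.0, 0) for c in range(k)}
    for stats in all_stats:
        for c, (s, n) in stats.items():
            ms, mn = merged[c]
            merged[c] = (ms + s, mn + n)
    return [merged[c][0] / merged[c][1] for c in range(k) if merged[c][1] > 0]

rng = np.random.default_rng(0)
partitions = [rng.normal(size=(50, 2)) + off for off in (0, 5)]
centers = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]
centers = reduce_stats([map_partition(p, centers) for p in partitions], k=2)
```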
Deep learning uses multilayered neural networks to process information in a robust, generalizable, and scalable way. It has various applications including image recognition, sentiment analysis, machine translation, and more. Deep learning concepts include computational graphs, artificial neural networks, and optimization techniques like gradient descent. Prominent deep learning architectures include convolutional neural networks, recurrent neural networks, autoencoders, and generative adversarial networks.
This document discusses dimensionality reduction using principal component analysis (PCA). It explains that PCA is used to reduce the number of variables in a dataset while retaining the variation present in the original data. The document outlines the PCA algorithm, which transforms the original variables into new uncorrelated variables called principal components. It provides an example of applying PCA to reduce data from 2D to 1D. The document also discusses key PCA concepts like covariance matrices, eigenvalues, eigenvectors, and transforming data to the principal component coordinate system. Finally, it presents an assignment applying PCA and classification to a handwritten digits dataset.
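The algorithm in brief, for the 2D-to-1D case (NumPy sketch): center the data, compute the covariance matrix, take its top eigenvector as the first principal component, and project onto it.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[3.0, 1.0], [1.0, 0.5]])  # correlated 2D data

Xc = X - X.mean(axis=0)               # center the data
C = np.cov(Xc, rowvar=False)          # covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)  # eigenvectors = principal directions

# Keep the eigenvector with the largest eigenvalue: the first principal component.
pc1 = eigvecs[:, np.argmax(eigvals)]
X_1d = Xc @ pc1                       # project the 2D points onto 1D

explained = eigvals.max() / eigvals.sum()  # fraction of variance retained
```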
This document discusses various techniques for data preprocessing, including data integration, transformation, reduction, and discretization. It covers topics such as schema integration, handling redundant data, data normalization, dimensionality reduction, data cube aggregation, sampling, and entropy-based discretization. The goal of these techniques is to prepare raw data for knowledge discovery and data mining tasks by cleaning, transforming, and reducing the data into a suitable structure.
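For example, two common normalizations in a short sketch:

```python
import numpy as np

x = np.array([12.0, 18.0, 25.0, 40.0, 60.0])  # e.g., raw ages

# Min-max normalization: rescale values to [0, 1].
x_minmax = (x - x.min()) / (x.max() - x.min())

# Z-score normalization: zero mean, unit standard deviation.
x_zscore = (x - x.mean()) / x.std()
```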
Supervised learning is a machine learning paradigm in which the algorithm is trained on a labeled dataset, learning patterns and relationships between input features and corresponding output labels in order to make accurate predictions on new, unseen data. The setup resembles learning with a teacher: during training, the algorithm strives to minimize the error between its predictions and the actual outcomes.
This document discusses query processing in distributed databases. It describes query decomposition, which transforms a high-level query into an equivalent lower-level algebraic query. The main steps in query decomposition are normalization, analysis, redundancy elimination, and rewriting the query in relational algebra. Data localization then translates the algebraic query on global relations into a query on physical database fragments using fragmentation rules.
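For example, if a global relation EMP is horizontally fragmented by department, data localization rewrites a query on EMP as a query over its fragments and then prunes the fragments the predicate cannot touch (a standard textbook reduction; the fragment names are illustrative):

```latex
\sigma_{\mathrm{dept}=1}(\mathrm{EMP})
\;\equiv\; \sigma_{\mathrm{dept}=1}(\mathrm{EMP}_1 \cup \mathrm{EMP}_2)
\;\equiv\; \sigma_{\mathrm{dept}=1}(\mathrm{EMP}_1) \cup \sigma_{\mathrm{dept}=1}(\mathrm{EMP}_2)
\;\equiv\; \sigma_{\mathrm{dept}=1}(\mathrm{EMP}_1)
```

where the last step holds because EMP₁ is exactly the dept = 1 fragment, so the selection over EMP₂ is empty and can be eliminated.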