This document provides an overview of big data analytics with Scala, including common frameworks and techniques. It discusses the Lambda architecture, MapReduce, word counting examples, Scalding for batch and streaming jobs, Apache Storm, Trident, Summingbird for unified batch and streaming, and Apache Spark for fast cluster computing with resilient distributed datasets. It also covers clustering with Mahout, streaming word counting, and analytics platforms that combine batch and stream processing.
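For the word-counting example mentioned above, here is a minimal sketch of the classic Scalding word count using the fields-based API; the input and output paths are hypothetical job arguments, not taken from the document:

```scala
import com.twitter.scalding._

// Classic fields-based Scalding word count; typically run via com.twitter.scalding.Tool,
// passing --input and --output as job arguments.
class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))
    .flatMap('line -> 'word) { line: String => line.toLowerCase.split("\\s+") }
    .groupBy('word) { _.size }
    .write(Tsv(args("output")))
}
```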
Big Data with Hadoop & Spark Training: http://bit.ly/2LCTufA This CloudxLab Introduction to SparkR tutorial helps you to understand SparkR in detail. Below are the topics covered in this tutorial: 1) SparkR (R on Spark) 2) SparkR DataFrames 3) Launch SparkR 4) Creating DataFrames from Local DataFrames 5) DataFrame Operation 6) Creating DataFrames - From JSON 7) Running SQL Queries from SparkR
The document discusses Spark's DataFrame API and the Tungsten project. DataFrames make Spark accessible to different users by providing a common API across languages like Python, R and Scala. Tungsten aims to improve Spark's performance for the next five years through techniques like runtime code generation and off-heap memory management. Initial results show Tungsten doubling performance. Together, DataFrames and Tungsten will help Spark scale to larger data and queries across different languages and execution backends.
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai - To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Big Data with Hadoop & Spark Training: http://bit.ly/2sh5b3E This CloudxLab Hadoop Streaming tutorial helps you to understand Hadoop Streaming in detail. Below are the topics covered in this tutorial: 1) Hadoop Streaming and Why Do We Need it? 2) Writing Streaming Jobs 3) Testing Streaming jobs and Hands-on on CloudxLab
Spark Streaming allows processing of live data streams using the Spark framework. This document discusses using Spark Streaming to process event streams from Meetup.com, including RSVP data and event metadata. It describes extracting features from event descriptions, clustering events based on these features, and using the results to recommend connections between Meetup members with similar interests.
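As a rough illustration of the pipeline described above, here is a minimal Spark Streaming sketch that reads pre-extracted feature vectors and clusters them incrementally with MLlib's StreamingKMeans; the socket source, host/port, batch interval, and feature dimensionality are assumptions, not details from the talk:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MeetupClusteringSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("meetup-clustering").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // Assume each line is a comma-separated feature vector extracted from an event description.
    val features = ssc.socketTextStream("localhost", 9999)
      .map(line => Vectors.dense(line.split(",").map(_.toDouble)))

    // Cluster events incrementally as new batches arrive (5-dimensional features assumed).
    val model = new StreamingKMeans()
      .setK(10)
      .setDecayFactor(1.0)
      .setRandomCenters(5, 0.0)

    model.trainOn(features)
    model.predictOn(features).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```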
This was the first session about Hadoop and MapReduce. It introduces what Hadoop is and its main components. It also covers how to program your first MapReduce task and how to run it on a pseudo-distributed Hadoop installation. The session was given in Arabic, and I may provide a video for it soon.
Sparkling Water provides a transparent integration of H2O algorithms and data structures into the Spark ecosystem. It allows users to use H2O machine learning algorithms on data stored in Spark and HDFS. The presentation demonstrates loading weather and flight data using Spark and H2O APIs, building regression models to predict flight delays, and accessing prediction results from R for residual analysis. Sparkling Water applications can be developed and run as standalone jobs by creating a SparkContext and H2OContext and submitting to a Spark cluster.
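A minimal sketch of the standalone-job setup described above; the exact entry points vary across Sparkling Water versions, and the file path and column layout here are hypothetical:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.h2o.H2OContext

object SparklingWaterSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sparkling-water-demo").getOrCreate()

    // Start an H2O cloud inside the Spark cluster; older Sparkling Water versions
    // take the SparkSession/SparkContext as an argument here.
    val h2oContext = H2OContext.getOrCreate()

    // Hypothetical input; the talk used weather and flight data.
    val flights = spark.read.option("header", "true").csv("hdfs:///data/flights.csv")

    // Convert the Spark DataFrame into an H2O frame for H2O's algorithms.
    val flightsHF = h2oContext.asH2OFrame(flights)
    // ... train a regression model on flightsHF with an H2O estimator,
    // then inspect predictions/residuals from R as described above ...
  }
}
```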
Talk that I gave at #BcnDevCon13 about using Scalding and the strong points of using Scala for big data processing.
This is a deck of slides from a recent meetup of AWS Usergroup Greece, presented by Ioannis Konstantinou from the National Technical University of Athens. The presentation gives an overview of the MapReduce framework and a description of its open-source implementation (Hadoop). Amazon's own Elastic MapReduce (EMR) service is also mentioned. With the growing interest in Big Data, this is a good introduction to the subject.
Big Data with Hadoop & Spark Training: http://bit.ly/2kyRTuW This CloudxLab Advanced Spark Programming tutorial helps you to understand Advanced Spark Programming in detail. Below are the topics covered in this slide: 1) Shared Variables - Accumulators & Broadcast Variables 2) Accumulators and Fault Tolerance 3) Custom Accumulators - Version 1.x & Version 2.x 4) Examples of Broadcast Variables 5) Key Performance Considerations - Level of Parallelism 6) Serialization Format - Kryo 7) Memory Management 8) Hardware Provisioning
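A short sketch of the shared variables covered in this tutorial, using the 2.x accumulator API and a broadcast lookup table; the input path and record layout are invented for illustration:

```scala
import org.apache.spark.sql.SparkSession

object SharedVariablesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("shared-vars").getOrCreate()
    val sc = spark.sparkContext

    // Accumulator (2.x API): count malformed records without a separate pass.
    val badRecords = sc.longAccumulator("badRecords")

    // Broadcast variable: ship a small lookup table to every executor once.
    val countryNames = sc.broadcast(Map("FR" -> "France", "DE" -> "Germany"))

    val lines = sc.textFile("hdfs:///data/events.txt")
    val parsed = lines.flatMap { line =>
      val fields = line.split(",")
      if (fields.length < 2) { badRecords.add(1); None }
      else Some(countryNames.value.getOrElse(fields(1), "unknown"))
    }

    parsed.count()   // an action triggers the accumulator updates
    println(s"bad records: ${badRecords.value}")
  }
}
```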
This document summarizes a presentation about productionizing streaming jobs with Spark Streaming. It discusses: 1. The lifecycle of a Spark streaming application including how data is received in batches and processed through transformations. 2. Best practices for aggregations including reducing over windows, incremental aggregation, and checkpointing. 3. How to achieve high throughput by increasing parallelism through more receivers and partitions. 4. Tips for debugging streaming jobs using the Spark UI and ensuring processing time is less than the batch interval.
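A hedged sketch of the incremental windowed aggregation and checkpointing practices described above; the window lengths, socket source, and checkpoint path are illustrative assumptions:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowedCountsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("windowed-counts")
    val ssc  = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint("hdfs:///checkpoints/windowed-counts")   // required for the inverse-reduce form

    val events = ssc.socketTextStream("localhost", 9999)

    // Incremental aggregation: add new batches and subtract batches leaving the window,
    // instead of recomputing the whole 10-minute window every 5 seconds.
    val counts = events
      .map(word => (word, 1L))
      .reduceByKeyAndWindow(_ + _, _ - _, Seconds(600), Seconds(5))

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```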
Cascalog is an internal DSL for Clojure that allows defining MapReduce workflows for Hadoop. It provides helper functions, a way to define custom functions analogous to UDFs, and functions to programmatically generate all possible data aggregations from an input based on business requirements. The workflows can be unit tested and executed on Hadoop. Cascalog abstracts away lower-level MapReduce details and allows defining the entire workflow within a single language.
Big Data with Hadoop & Spark Training: http://bit.ly/2sm9c61 This CloudxLab Introduction to Spark SQL & DataFrames tutorial helps you to understand Spark SQL & DataFrames in detail. Below are the topics covered in this slide: 1) Loading XML 2) What is RPC - Remote Process Call 3) Loading AVRO 4) Data Sources - Parquet 5) Creating DataFrames From Hive Table 6) Setting up Distributed SQL Engine
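A brief sketch of the data-source loading covered in the tutorial (JSON, Parquet, and Hive tables); the paths and the Hive table name are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

object DataSourcesSketch {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport lets spark.table / spark.sql see the Hive metastore.
    val spark = SparkSession.builder()
      .appName("data-sources")
      .enableHiveSupport()
      .getOrCreate()

    val fromJson    = spark.read.json("hdfs:///data/people.json")       // schema inferred from JSON
    val fromParquet = spark.read.parquet("hdfs:///data/events.parquet") // columnar, self-describing
    val fromHive    = spark.table("analytics.daily_visits")             // existing Hive table

    fromJson.printSchema()
    fromParquet.createOrReplaceTempView("events")
    spark.sql("SELECT count(*) FROM events").show()
    fromHive.show(5)
  }
}
```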
Mapreduce examples starting from the basic WordCount to a more complex K-means algorithm. The code contained in these slides is available at https://github.com/andreaiacono/MapReduce
The document discusses different approaches to sorting in MapReduce frameworks over time. It describes Hadoop versions 0.10 through 0.22, where sorting was handled by buffering records in memory, spilling to disk when thresholds were exceeded, and merging the spilled files. Later versions improved on this by distributing the sorting work across maps and making the memory footprint more predictable.
Video: https://www.youtube.com/watch?v=kkOG_aJ9KjQ This talk gives details about Spark internals and an explanation of the runtime behavior of a Spark application. It explains how high level user programs are compiled into physical execution plans in Spark. It then reviews common performance bottlenecks encountered by Spark users, along with tips for diagnosing performance problems in a production application.
In this talk I discuss my recent experience working with Spark DataFrames and the Spark TimeSeries library. For DataFrames, the focus is on usability: a lot of the documentation does not cover common use cases such as the intricacies of creating DataFrames, adding or manipulating individual columns, and doing quick-and-dirty analytics. For the time series library, I dive into the kinds of use cases it supports and why it's actually super useful.
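As a rough illustration of the DataFrame usability points above, a small sketch covering creating a DataFrame from a local collection, adding and renaming columns, and quick-and-dirty analytics; the data and column names are invented:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object DataFrameUsabilitySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("df-usability").getOrCreate()
    import spark.implicits._

    // Creating a DataFrame from a local collection.
    val trips = Seq(("a", 12.5, 3), ("b", 7.0, 1), ("a", 30.2, 5)).toDF("driver", "km", "stops")

    // Adding / manipulating individual columns.
    val enriched = trips
      .withColumn("km_per_stop", $"km" / $"stops")
      .withColumnRenamed("driver", "driver_id")

    // Quick and dirty analytics.
    enriched.describe("km", "km_per_stop").show()
    enriched.groupBy("driver_id").agg(sum("km").as("total_km")).show()
  }
}
```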
Date: 16th November 2017. Location: Fast Data Theatre. Time: 12:30 - 13:00. Speaker: Gerard Maas. Organisation: Lightbend.
Apache Spark is a fast and general-purpose cluster computing system for large-scale data processing. It improves on MapReduce by allowing data to be kept in memory across jobs, enabling faster iterative jobs. Spark consists of a core engine along with libraries for SQL, streaming, machine learning, and graph processing. The document discusses new APIs in Spark including DataFrames, which provide a tabular interface like in R/Python, and data sources, which allow plugging external data systems into Spark. These changes aim to make Spark easier for data scientists to use at scale.
After migrating a three-year-old C# project to Java, we ended up with a significant portion of legacy code using lambdas in Java. We cover some of the good use cases, code that could have been written better, and the problems we had migrating from C#. At the end we look at the performance implications of using lambdas.
Paris Scala Group event, May 2019: No more struggles with Apache Spark workloads in production. Topics: Apache Spark's primary data structures (RDD, Dataset, DataFrame); a pragmatic explanation of executors, cores, containers, stages, jobs, and tasks in Spark; parallel reads from JDBC, with challenges and best practices; Bulk Load API vs JDBC write; an optimization strategy for joins (SortMergeJoin vs BroadcastHashJoin); avoiding unnecessary shuffle; an alternative to Spark's default sort; why dropDuplicates() doesn't guarantee consistent results, and what the alternative is; optimizing the Spark stage generation plan; predicate pushdown with partitioning and bucketing; and why not to use Scala's concurrent Future explicitly.
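A small sketch of the join-strategy point above, showing how hinting the small side switches the physical plan from SortMergeJoin to BroadcastHashJoin; the table paths and join key are hypothetical:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object JoinStrategySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("join-strategy").getOrCreate()

    val orders    = spark.read.parquet("hdfs:///data/orders")      // large fact table
    val countries = spark.read.parquet("hdfs:///data/countries")   // small dimension table

    // Default for two large inputs: SortMergeJoin, which shuffles both sides.
    val smj = orders.join(countries, "country_code")

    // Hinting the small side avoids the shuffle entirely: BroadcastHashJoin.
    val bhj = orders.join(broadcast(countries), "country_code")

    smj.explain()   // look for SortMergeJoin in the physical plan
    bhj.explain()   // look for BroadcastHashJoin in the physical plan
  }
}
```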
This is a quick introduction to Scalding and Monoids. Scalding is a Scala library that makes writing MapReduce jobs very easy. Monoids, on the other hand, promise parallelism and quality, and they make some more challenging algorithms look very easy. The talk was held at the Helsinki Data Science meetup on January 9th, 2014.
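To make the monoid idea concrete, here is a minimal plain-Scala sketch (not the Algebird/Scalding types from the talk) showing why an associative combine lets partial results from different mappers or partitions be merged in any grouping:

```scala
// A minimal Monoid typeclass: an associative combine plus an identity element.
trait Monoid[A] {
  def zero: A
  def plus(x: A, y: A): A
}

object Monoid {
  implicit val longSum: Monoid[Long] = new Monoid[Long] {
    val zero = 0L
    def plus(x: Long, y: Long): Long = x + y
  }

  // Map values combine pointwise - exactly what a distributed word count needs.
  implicit def mapMonoid[K, V](implicit m: Monoid[V]): Monoid[Map[K, V]] =
    new Monoid[Map[K, V]] {
      val zero = Map.empty[K, V]
      def plus(x: Map[K, V], y: Map[K, V]): Map[K, V] =
        y.foldLeft(x) { case (acc, (k, v)) =>
          acc.updated(k, m.plus(acc.getOrElse(k, m.zero), v))
        }
    }
}

object MonoidDemo {
  // Because plus is associative, partial results can be combined in any order and still agree.
  def sumAll[A](xs: Seq[A])(implicit m: Monoid[A]): A = xs.foldLeft(m.zero)(m.plus)

  def main(args: Array[String]): Unit = {
    val partialCounts = Seq(Map("spark" -> 2L, "scala" -> 1L), Map("scala" -> 3L))
    println(sumAll(partialCounts))   // Map(spark -> 2, scala -> 4)
  }
}
```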
Knoldus organized a Meetup on 1 April 2015. In this Meetup, we introduced Spark with Scala. Apache Spark is a fast and general engine for large-scale data processing. Spark is used at a wide range of organizations to process large datasets.
This document provides an overview of Spark Streaming concepts, including:
- Streams are sequences of data elements made available over time that can be accessed sequentially.
- Stream processing involves continuously and concurrently processing live data streams in micro-batches.
- Spark Streaming provides scalable and fault-tolerant stream processing using a micro-batch architecture, where streams are divided into batches that are processed through transformations on resilient distributed datasets (RDDs).
- Transformations on DStreams apply operations like map, filter, and reduce to the underlying RDDs of each batch.
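A minimal sketch of the micro-batch model described above, showing DStream transformations and foreachRDD exposing the RDD behind each batch; the socket source, batch interval, and record layout are illustrative assumptions:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MicroBatchSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("micro-batch").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(2))   // each micro-batch covers 2 seconds

    val lines = ssc.socketTextStream("localhost", 9999)

    // DStream transformations are applied to the RDD behind every batch.
    val errorCounts = lines
      .filter(_.contains("ERROR"))
      .map(line => (line.split(" ")(0), 1))   // key by the first field, e.g. a service name
      .reduceByKey(_ + _)

    // foreachRDD exposes the underlying RDD of each batch directly.
    errorCounts.foreachRDD { (rdd, time) =>
      println(s"batch at $time produced ${rdd.count()} distinct keys")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```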
This document discusses refactoring Java code to Clojure using macros. It provides examples of refactoring Java code that uses method chaining to equivalent Clojure code using the threading macros (->> and -<>). It also discusses other Clojure features like type hints, the doto macro, and polyglot projects using Leiningen.
Defining customized scalable aggregation logic is one of Apache Spark’s most powerful features. User Defined Aggregate Functions (UDAF) are a flexible mechanism for extending both Spark data frames and Structured Streaming with new functionality ranging from specialized summary techniques to building blocks for exploratory data analysis.
This document summarizes a user's journey developing a custom aggregation function for Apache Spark based on a T-Digest sketch. The user initially implemented it as a User Defined Aggregate Function (UDAF) but ran into performance problems caused by excessive serialization/deserialization of the aggregation buffer. They resolved this by reimplementing the function as a custom Aggregator using Spark 3.0's new aggregation APIs, which avoided the unnecessary serialization and yielded roughly a 70x performance improvement. The story highlights the importance of understanding how custom functions interact with Spark's execution model and optimizations such as avoiding excessive serialization.
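A hedged sketch of the Spark 3.0 Aggregator approach described above, using a simple running mean as a stand-in for the T-Digest sketch from the talk:

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.functions

// Buffer for a running mean; stands in for the T-Digest sketch in the talk.
case class MeanBuffer(sum: Double, count: Long)

object MeanAggregator extends Aggregator[Double, MeanBuffer, Double] {
  def zero: MeanBuffer = MeanBuffer(0.0, 0L)
  def reduce(b: MeanBuffer, x: Double): MeanBuffer = MeanBuffer(b.sum + x, b.count + 1)
  def merge(a: MeanBuffer, b: MeanBuffer): MeanBuffer = MeanBuffer(a.sum + b.sum, a.count + b.count)
  def finish(b: MeanBuffer): Double = if (b.count == 0) Double.NaN else b.sum / b.count
  // The buffer stays a typed object between updates; no per-row SerDe round trip as in the old UDAF API.
  def bufferEncoder: Encoder[MeanBuffer] = Encoders.product[MeanBuffer]
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

object AggregatorSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("aggregator-sketch").getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1.0), ("a", 3.0), ("b", 10.0)).toDF("key", "value")

    // Spark 3.0: wrap the Aggregator as an untyped UDAF usable from DataFrames and SQL.
    val meanUdaf = functions.udaf(MeanAggregator)
    df.groupBy("key").agg(meanUdaf($"value").as("mean_value")).show()
  }
}
```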
This document provides an agenda and overview for a Spark workshop covering Spark basics and streaming. The agenda includes sections on Scala, Spark, Spark SQL, and Spark Streaming. It discusses Scala concepts like vals, vars, defs, classes, objects, and pattern matching. It also covers Spark RDDs, transformations, actions, sources, and the spark-shell. Finally, it briefly introduces Spark concepts like broadcast variables, accumulators, and spark-submit.
The document discusses Apache Spark, an open-source cluster computing framework. It describes Spark's core components like Spark SQL, MLlib, and GraphX. It provides examples of using Spark from Python and Scala for word count tasks and joining datasets. It also demonstrates running Spark interactively on a Spark REPL and deploying Spark on Amazon EMR. Key points are that Spark can handle batch, interactive, and real-time processing and integrates with Python, Scala, and Java while programming at a higher level of abstraction than MapReduce.
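A short sketch of the dataset-join example mentioned above, using pair RDDs in Scala with invented page-view data:

```scala
import org.apache.spark.sql.SparkSession

object JoinDatasetsSketch {
  def main(args: Array[String]): Unit = {
    val sc = SparkSession.builder().appName("rdd-join").getOrCreate().sparkContext

    // Two keyed datasets: page views and page titles.
    val views  = sc.parallelize(Seq(("p1", 12), ("p2", 5), ("p3", 7)))
    val titles = sc.parallelize(Seq(("p1", "Home"), ("p2", "Pricing")))

    // join keeps keys present in both RDDs; the result pairs up both values.
    val joined = views.join(titles)          // RDD[(String, (Int, String))]
    joined.collect().foreach { case (page, (count, title)) =>
      println(s"$title ($page): $count views")
    }
  }
}
```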
At the Dublin Fashion Insights Centre, we are exploring methods of categorising the web into a set of known fashion related topics. This raises questions such as: How many fashion related topics are there? How closely are they related to each other, or to other non-fashion topics? Furthermore, what topic hierarchies exist in this landscape? Using Clojure and MLlib to harness the data available from crowd-sourced websites such as DMOZ (a categorisation of millions of websites) and Common Crawl (a monthly crawl of billions of websites), we are answering these questions to understand fashion in a quantitative manner. The latest generation of big data tools such as Apache Spark routinely handle petabytes of data while also addressing real-world realities like node and network failures. Spark's transformations and operations on data sets are a natural fit with Clojure's everyday use of transformations and reductions. Spark MLlib's excellent implementations of distributed machine learning algorithms puts the power of large-scale analytics in the hands of Clojure developers. At Zalando's Dublin Fashion Insights Centre, we're using the Clojure bindings to Spark and MLlib to answer fashion-related questions that until recently have been nearly impossible to answer quantitatively. Hunter Kelly @retnuh tech.zalando.com
Slides from Tathagata Das's talk at the Spark Meetup entitled "Deep Dive with Spark Streaming" on June 17, 2013 in Sunnyvale California at Plug and Play. Tathagata Das is the lead developer on Spark Streaming and a PhD student in computer science in the UC Berkeley AMPLab.
This document summarizes Spark and Spark Streaming internals. It discusses the Resilient Distributed Dataset (RDD) model in Spark, which allows for fault tolerance through lineage-based recomputation. It provides an example of log mining using RDD transformations and actions. It then discusses Spark Streaming, which provides a simple API for stream processing by treating streams as series of small batch jobs on RDDs. Key concepts discussed include Discretized Stream (DStream), transformations, and output operations. An example Twitter hashtag extraction job is outlined.
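A rough sketch of the hashtag-extraction job outlined above; a plain socket source stands in for the Twitter stream, and the window length is an assumption:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object HashtagSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("hashtags").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(1))

    // Stand-in for a Twitter source: assume each line on the socket is a tweet's text.
    val tweets = ssc.socketTextStream("localhost", 9999)

    val topTags = tweets
      .flatMap(_.split("\\s+"))
      .filter(_.startsWith("#"))
      .map(tag => (tag, 1))
      .reduceByKeyAndWindow(_ + _, Seconds(60))          // counts over the last minute
      .map { case (tag, count) => (count, tag) }
      .transform(_.sortByKey(ascending = false))         // per-batch RDD sort

    topTags.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```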
This document summarizes a presentation on Spark SQL and its capabilities. Spark SQL allows users to run SQL queries on Spark, including HiveQL queries with UDFs, UDAFs, and SerDes. It provides a unified interface for reading and writing data in various formats. Spark SQL also allows users to express common operations like selecting columns, joining data, and aggregation concisely through its DataFrame API. This reduces the amount of code users need to write compared to lower-level APIs like RDDs.
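A brief sketch of registering a Scala UDF and calling it from SQL, one of the capabilities summarized above; the table, columns, and function name are invented:

```scala
import org.apache.spark.sql.SparkSession

object SparkSqlUdfSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sql-udf").getOrCreate()
    import spark.implicits._

    Seq(("alice", 120), ("bob", 45)).toDF("user", "session_seconds")
      .createOrReplaceTempView("sessions")

    // Register a Scala function so it can be called from SQL like a built-in.
    spark.udf.register("to_minutes", (s: Int) => s / 60.0)

    spark.sql(
      """SELECT user, to_minutes(session_seconds) AS minutes
        |FROM sessions
        |WHERE session_seconds > 60""".stripMargin).show()
  }
}
```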
Apache Spark is a cluster computing platform designed to be fast and general-purpose. It provides a unified analytics engine for large-scale data processing across SQL, streaming, machine learning, and graph processing. Spark programs can be written in Java, Scala, Python and R. It works by building resilient distributed datasets (RDDs) that can be operated on in parallel. RDDs support transformations like map, filter and join and actions like count, collect and save. Spark also provides caching of RDDs in memory for improved performance.
This document summarizes Spark SQL and DataFrames in Spark. It notes that Spark SQL is part of the core Spark distribution and allows running SQL and HiveQL queries. DataFrames provide a way to select, filter, aggregate and plot structured data like in R and Pandas. DataFrames allow writing less code through a high-level API and reading less data by using optimized formats and partitioning. The optimizer can optimize queries across functions and push down predicates to read less data. This allows creating and running Spark programs faster.
As a developer, data engineer, or data scientist, you’ve seen how Apache Spark is expressive enough to let you solve problems elegantly and efficient enough to let you scale out to handle more data. However, if you’re solving the same problems again and again, you probably want to capture and distribute your solutions so that you can focus on new problems and so other people can reuse and remix them: you want to develop a library that extends Spark. You faced a learning curve when you first started using Spark, and you’ll face a different learning curve as you start to develop reusable abstractions atop Spark. In this talk, two experienced Spark library developers will give you the background and context you’ll need to turn your code into a library that you can share with the world. We’ll cover: issues to consider when developing parallel algorithms with Spark; designing generic, robust functions that operate on data frames and datasets; extending data frames with user-defined functions (UDFs) and user-defined aggregates (UDAFs); best practices around caching and broadcasting, and why these are especially important for library developers; integrating with ML pipelines; exposing key functionality in both Python and Scala; and how to test, build, and publish your library for the community. We’ll back up our advice with concrete examples from real packages built atop Spark. You’ll leave this talk informed and inspired to take your Spark proficiency to the next level and develop and publish an awesome library of your own.
Scala Toronto, July 2019 event at 500px. Topics: pure functional API integration; Apache Spark internals tuning; performance tuning; query execution plan optimisation; Cats Effect for switching the execution model runtime; and discovery/experience with Monix and Scala Futures.
This document discusses challenges in running machine learning applications in production environments. It notes that while Kaggle competitions focus on accuracy, real-world applications require balancing accuracy with interpretability, speed and infrastructure constraints. It also emphasizes that machine learning in production is as much a software and systems problem as a modeling problem. Key aspects that are discussed include flexible and scalable deployment architectures, model versioning, packaging and serving, online evaluation and experiments, and ensuring reproducibility of results.
The document provides guidance on tuning Apache Spark jobs. It discusses tuning memory and garbage collection, optimizing shuffle operations, increasing parallelism through partitioning, monitoring jobs, and testing Spark applications.
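A hedged sketch of a few of the tuning levers mentioned above (Kryo serialization, the shuffle-partition count, repartitioning before a wide operation, and coalescing before writing); the specific numbers and paths are illustrative, not recommendations from the document:

```scala
import org.apache.spark.sql.SparkSession

object TuningSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("tuning-sketch")
      // Kryo is usually faster and more compact than Java serialization.
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Fewer or more post-shuffle partitions depending on data volume (default is 200).
      .config("spark.sql.shuffle.partitions", "400")
      .getOrCreate()

    val events = spark.read.parquet("hdfs:///data/events")

    // Increase parallelism before an expensive wide operation...
    val repartitioned = events.repartition(400, events("user_id"))

    // ...and coalesce before writing so the output isn't thousands of tiny files.
    repartitioned
      .groupBy("user_id").count()
      .coalesce(32)
      .write.parquet("hdfs:///data/event-counts")
  }
}
```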
The document discusses the use of CRDTs (Convergent Replicated Data Types) to achieve eventual consistency in distributed systems without consensus. It describes the CAP theorem and challenges with achieving consistency in a distributed manner. CRDTs are introduced as a way to build datatypes that can automatically resolve conflicts as they propagate through replicas. Examples of commonly used CRDTs include registers, counters, sets and graphs. The document outlines some real-world implementations of CRDTs and notes their limitations.
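To illustrate, here is a minimal state-based grow-only counter (G-Counter), one of the counter CRDTs mentioned above, sketched in plain Scala:

```scala
// G-Counter: each replica increments only its own slot, and merge takes the per-replica
// maximum, so concurrent updates converge without coordination.
final case class GCounter(counts: Map[String, Long] = Map.empty) {
  def increment(replicaId: String, by: Long = 1L): GCounter =
    copy(counts.updated(replicaId, counts.getOrElse(replicaId, 0L) + by))

  def value: Long = counts.values.sum

  // Merge is commutative, associative, and idempotent - the CRDT requirements.
  def merge(other: GCounter): GCounter =
    GCounter((counts.keySet ++ other.counts.keySet).map { id =>
      id -> math.max(counts.getOrElse(id, 0L), other.counts.getOrElse(id, 0L))
    }.toMap)
}

object GCounterDemo {
  def main(args: Array[String]): Unit = {
    // Two replicas diverge, then reconcile to the same value regardless of merge order.
    val a = GCounter().increment("replica-a").increment("replica-a")
    val b = GCounter().increment("replica-b")
    assert(a.merge(b).value == 3 && b.merge(a).value == 3)
  }
}
```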
The document is a presentation on deep learning. It defines deep learning and describes techniques like convolutional neural networks and recurrent neural networks. It discusses how deep learning works by using neural networks with multiple layers to learn representations of data. It also covers challenges like vanishing gradients and overfitting when using deep networks. Examples of deep learning applications in machine translation and image captioning are provided. Finally, popular frameworks for developing deep learning models are mentioned.
This document summarizes a presentation about Finagle, a framework developed by Twitter for building reliable services. It discusses how Finagle uses asynchronous Futures and composable Filters and Services to provide high performance RPC. It also covers key Finagle concepts like load balancing, failure handling, and how it is used by many large companies for building distributed systems. The document provides code examples of defining Services and applying Filters in Finagle and Scala.
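A small sketch of the Service/Filter composition described above, in the style of Finagle's HTTP quickstart; the port and the trivial timing filter are illustrative assumptions:

```scala
import com.twitter.finagle.{Http, Service, SimpleFilter}
import com.twitter.finagle.http.{Request, Response, Status}
import com.twitter.util.{Await, Future}

object FinagleSketch {
  // A Service is an asynchronous function Request => Future[Response].
  val hello: Service[Request, Response] = new Service[Request, Response] {
    def apply(req: Request): Future[Response] = {
      val rep = Response(req.version, Status.Ok)
      rep.contentString = "hello\n"
      Future.value(rep)
    }
  }

  // A Filter wraps a Service with reusable behaviour, here simple request timing.
  val timing = new SimpleFilter[Request, Response] {
    def apply(req: Request, service: Service[Request, Response]): Future[Response] = {
      val start = System.nanoTime()
      service(req).onSuccess { _ =>
        println(s"${req.path} took ${(System.nanoTime() - start) / 1e6} ms")
      }
    }
  }

  def main(args: Array[String]): Unit = {
    // Filters compose with Services via andThen.
    val server = Http.serve(":8080", timing.andThen(hello))
    Await.ready(server)
  }
}
```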