Introduction to Real-time Big Data with Apache Spark
Introduction
About Me
https://ua.linkedin.com/in/tarasmatyashovsky
Agenda
• Buzzwords
• Spark in a Nutshell
• Spark Concepts
• Spark Core
• Live Demo Session
• Spark SQL
• Live Demo Session
• Road to Production
• Spark Drawbacks
• Our Spark Integration
• Spark Is on the Rise
Buzzword for data sets so large and complex
that they are difficult to process using on-hand
database management tools or
traditional data processing applications
https://www.linkedin.com/pulse/decoding-buzzwords-big-data-predictive-analytics-business-gordon
http://www.ibmbigdatahub.com/infographic/four-vs-big-data
Jesus Christ,
It is Big Data,
Get Hadoop!
by Sergey Shelpuk (https://ua.linkedin.com/in/shelpuk) at AI Club Meetup in Lviv
To Hadoop?
http://www.thoughtworks.com/insights/blog/hadoop-or-not-hadoop
• Batch mode, not real-time
• Unstructured or semi-structured data
• MapReduce programming model, e.g.
key/value pairs
Not to Hadoop?
• Real-time, streaming
• Structures that cannot be decomposed
into key/value pairs
• Jobs/algorithms that do not fit
the MapReduce programming model
http://www.thoughtworks.com/insights/blog/hadoop-or-not-hadoop
Not to Hadoop?
• Subset of the data is enough
Remove excessive complexity or shrink the data set via other
processing techniques, e.g. hashing or clustering
• Random, interactive access to data
For well-structured data, a bunch of scalable, mature (No)SQL
DB solutions exist (HBase, Cassandra, columnar scalable DW engines)
• Sensitive data
Security in Hadoop is still very challenging and immature
Why Spark?
As of mid-2014,
Spark is the most active Big Data project
http://www.slideshare.net/databricks/new-direction-for-spark-in-2015-spark-summit-east
Contributors per month to Spark
Spark
Fast and general-purpose
cluster computing platform
for large-scale data processing
History
Time to Sort 100TB
http://www.slideshare.net/databricks/new-direction-for-spark-in-2015-spark-summit-east
Why Is Spark Faster?
Spark processes data in-memory, while
Hadoop persists results back to disk
after each map/reduce step
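To make the difference concrete, here is a minimal, illustrative Java sketch (file name and filters are assumptions, not from the demo): an RDD kept in memory with cache() is materialized once and then reused by later actions, instead of being re-read from disk.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf().setAppName("cache-demo").setMaster("local[*]");
JavaSparkContext sc = new JavaSparkContext(conf);

JavaRDD<String> events = sc.textFile("events.log").cache(); // keep partitions in memory
long errors = events.filter(l -> l.contains("ERROR")).count();  // 1st action: reads the file
long warnings = events.filter(l -> l.contains("WARN")).count(); // 2nd action: served from memory
```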
Powered by Spark
https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark
Components Stack
https://databricks.com/blog/2015/02/09/learning-spark-book-available-from-oreilly.html
Core Concepts
automatically distribute data across the cluster
and
parallelize the operations performed on it
Distributed Application
https://databricks.com/blog/2015/02/09/learning-spark-book-available-from-oreilly.html
http://spark.apache.org/docs/latest/cluster-overview.html
Spark Core Abstractions
RDD API
Transformations:
• filter()
• map()
• flatMap()
• distinct()
• union()
• intersection()
• subtract()
• etc.
Actions:
• collect()
• reduce()
• count()
• countByValue()
• first()
• take()
• top()
• etc.
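A minimal sketch of how these compose, continuing with the JavaSparkContext sc from the earlier snippet (file name illustrative): transformations only describe a new RDD, and no work happens until an action is invoked.

```java
JavaRDD<String> lines = sc.textFile("data.csv");            // nothing read yet
JavaRDD<String> nonEmpty = lines.filter(l -> !l.isEmpty()); // transformation: lazy
JavaRDD<Integer> lengths = nonEmpty.map(String::length);    // transformation: lazy
long total = lengths.count();                               // action: triggers the job
```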
RDD Operations
• transformations are executed on
workers
• actions may transfer data from the
workers to the driver
• collect() sends all the partitions to the
single driver
http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs
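Continuing the sketch above, an illustration of the collect() caveat:

```java
import java.util.List;

List<String> everything = nonEmpty.collect(); // ships ALL partitions to the driver: may OOM
List<String> sample = nonEmpty.take(10);      // safer: fetches only the first 10 elements
```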
Pair RDD
Transformations:
• reduceByKey()
• groupByKey()
• sortByKey()
• keys()
• values()
• join()
• etc.
Actions:
• countByKey()
• collectAsMap()
• lookup()
• etc.
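The classic word count shows the pair-RDD style (a sketch reusing sc from above; file name illustrative). reduceByKey() is usually preferred over groupByKey() because it combines values map-side before the shuffle.

```java
import java.util.Arrays;
import java.util.Map;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

JavaPairRDD<String, Integer> counts = sc.textFile("data.txt")
    .flatMap(line -> Arrays.asList(line.split(" ")))  // split lines into words
    .mapToPair(word -> new Tuple2<>(word, 1))         // (word, 1) pairs
    .reduceByKey((a, b) -> a + b);                    // combines map-side, then shuffles

Map<String, Integer> asMap = counts.collectAsMap();   // action: pairs back on the driver
```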
Sample Application
https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
Requirements
Analytics about Morning@Lohika events:
• unique participants by companies
• most loyal participants
• participants by position
• etc.
https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
Data Format
Simple CSV files; all fields are optional:

First Name | Last Name    | Company      | Position          | Email                             | Present
Vladimir   | Tsukur       | GlobalLogic  | Tech/Team Lead    | flushdia@gmail.com                | 1
Mikalai    | Alimenkou    | XP Injection | Tech Lead         | mikalai.alimenkou@xpinjection.com | 1
Taras      | Matyashovsky | Lohika       | Software Engineer | taras.matyashovsky@gmail.com      | 0
https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
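A sketch of how such rows might be parsed into objects (the bean and field order are assumptions; the actual code lives in the linked repository). split(",", -1) keeps trailing empty fields, which matters since all fields are optional.

```java
// Hypothetical bean; the real sample app defines its own model.
public class Participant implements java.io.Serializable {
    public final String firstName, lastName, company, position, email;
    public final boolean present;

    public Participant(String[] f) {
        firstName = f[0]; lastName = f[1]; company = f[2];
        position = f[3]; email = f[4]; present = "1".equals(f[5]);
    }
}

JavaRDD<Participant> participants = sc.textFile("participants.csv")
    .map(line -> line.split(",", -1))  // -1: keep empty optional fields
    .map(Participant::new);
```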
Technologies
• Spring Boot 1.2.3.RELEASE
• Spark 1.3.1 - released April 17, 2015
• 2 Spark jar dependencies
• Apache 2.0 license, i.e. free to use
https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
Features
• simple HTTP-based API
• file system: local and HDFS
• data formats: CSV and Parquet
• 3 compatible implementations based on:
• RDD (Spark Core)
• Data Frame DSL (Spark SQL)
• Data Frame SQL (Spark SQL)
• serialization: default Java and Kryo
https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
Demo Time
https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
[Diagram: the driver program's SparkContext connects to a cluster manager, which launches executors on worker nodes; each executor runs multiple tasks.]
http://spark.apache.org/docs/latest/cluster-overview.html
Demo Explained
Functional Programming API Drawback
Limited opportunities for automatic optimization
Spark SQL
Structured data processing
Data Frame
Distributed collection of data organized into named columns
Data Frame API
• selecting columns
• joining different data sources
• aggregation, e.g. sum, count, average
• filtering
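A sketch of both Spark SQL flavors used in the demo, with the Spark 1.3 Java API (file, table, and column names are assumptions):

```java
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

SQLContext sqlContext = new SQLContext(sc);
DataFrame df = sqlContext.load("participants.parquet"); // Spark 1.3-era loader

// Data Frame DSL:
DataFrame byCompany = df.filter(df.col("present").equalTo(1))
                        .groupBy("company")
                        .count();

// Data Frame SQL:
df.registerTempTable("participants");
DataFrame loyal = sqlContext.sql(
    "SELECT company, COUNT(*) AS total FROM participants " +
    "WHERE present = 1 GROUP BY company");
```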
Plan Optimization & Execution
http://web.eecs.umich.edu/~prabal/teaching/resources/eecs582/armbrust15sparksql.pdf
Faster than RDD
http://www.slideshare.net/databricks/spark-sqlsse2015public
Demo Time
https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
Persistence & Caching
• by default stores the data in the JVM
heap as deserialized objects
• possibility to store on disk, as
deserialized or serialized objects
• off-heap caching is experimental and
uses Tachyon
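A sketch of choosing a storage level explicitly (RDD name from the parsing sketch above):

```java
import org.apache.spark.storage.StorageLevel;

// An RDD's storage level can be set once; pick the level that fits:
participants.persist(StorageLevel.MEMORY_ONLY());
// ...or: StorageLevel.MEMORY_AND_DISK_SER() - serialized, spills to disk
// ...or: StorageLevel.OFF_HEAP()            - experimental, Tachyon-backed
participants.unpersist(); // release when no longer needed
```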
https://spark.apache.org/docs/latest/running-on-mesos.html
Cluster Manager
Standalone, Apache Mesos, or Hadoop YARN
should be chosen and configured properly
Monitoring
via web UI(s) and metrics
Monitoring
• master web UI
• worker web UI
• driver web UI
• available only during execution
• history server
• spark.eventLog.enabled = true
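A standalone sketch of enabling the event log so the history server can replay a finished application (the log directory is an assumption and must exist beforehand):

```java
SparkConf conf = new SparkConf()
    .setAppName("monitored-app")
    .set("spark.eventLog.enabled", "true")              // record events for the history server
    .set("spark.eventLog.dir", "hdfs:///spark-events"); // assumed, pre-created directory
```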
Metrics
• based on Coda Hale Metrics library
• can be reported via HTTP, JMX, and
CSV files
https://spark.apache.org/docs/latest/tuning.html
Serialization
https://spark.apache.org/docs/latest/configuration.html#compression-and-serialization
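A sketch of switching from default Java serialization to Kryo, reusing the SparkConf and the hypothetical Participant bean from above; registering classes is optional but avoids writing full class names into the serialized data.

```java
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
conf.registerKryoClasses(new Class<?>[]{ Participant.class });
```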
Memory Management
Tune Executor Memory Fraction
• RDD storage (60%)
• shuffle and aggregation buffers (20%)
• user code (20%)
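These fractions are tunable via legacy Spark 1.x properties, e.g. shifting memory from caching toward shuffle buffers for shuffle-heavy jobs (values illustrative; reuses the SparkConf conf from the earlier sketches):

```java
conf.set("spark.storage.memoryFraction", "0.5"); // down from the 0.6 default
conf.set("spark.shuffle.memoryFraction", "0.3"); // up from the 0.2 default
```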
https://spark.apache.org/docs/latest/configuration.html#shuffle-behavior
Memory Management
Tune storage level:
• store in memory and/or on disk
• store as deserialized/serialized objects
• replicate each partition on 1 or 2 cluster
nodes
• store in Tachyon
https://spark.apache.org/docs/latest/programming-guide.html#which-storage-level-to-choose
Level of Parallelism
• spark.task.cpus
• 1 task per partition using 1 core to execute
• spark.default.parallelism
• can be controlled:
• repartition() and coalesce() functions
• degree of parallelism as an operation parameter
• storage system matters
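A sketch of the knobs above (partition counts illustrative; events and counts are the RDDs from the earlier sketches):

```java
JavaRDD<String> wide = events.repartition(64); // more partitions: full shuffle
JavaRDD<String> narrow = wide.coalesce(8);     // fewer partitions: avoids a shuffle
JavaPairRDD<String, Integer> agg =
    counts.reduceByKey((a, b) -> a + b, 32);   // parallelism as an operation parameter
conf.set("spark.default.parallelism", "64");   // cluster-wide default
```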
Data Locality
• check data locality via UI
• configure data locality settings if
needed
• spark.locality.wait timeout
• execute certain jobs on a driver
• spark.localExecution.enabled
Java API Drawbacks
• parts of the API are experimental or
intended just for development
• the Java API may lag behind, as the
Scala API is the main focus
Our Spark Integration
Product
Cloud-based analytics application
Use Cases
• supplement the Neo4j database used to
store/query big dimensions
• supplement an RDBMS for querying
high volumes of data
Use Cases
• represent existing computational graph
as flow of Spark-based operations
• predictive analytics based on the Spark
MLlib component
Lessons Learned
• Spark's simplicity is deceptive
• Each use case is unique
• Stay up to date via:
• Databricks blog
• Mailing lists & Jira
• Pull requests
Spark is kind of magic
Spark Is on the Rise
http://www.techrepublic.com/article/can-anything-dim-apache-spark/
Project Tungsten
• the largest change to Spark’s execution
engine since the project’s inception
• focuses on substantially improving the
efficiency of memory and CPU for
Spark applications
• sun.misc.Unsafe
https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html
Thank you!
Taras Matyashovsky
taras.matyashovsky@gmail.com
@tmatyashovsky
http://www.filevych.com/
References
https://www.linkedin.com/pulse/decoding-buzzwords-big-data-predictive-analytics-business-gordon
http://www.ibmbigdatahub.com/infographic/four-vs-big-data
http://www.thoughtworks.com/insights/blog/hadoop-or-not-hadoop
http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/
Learning Spark, by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia (early release ebook from O'Reilly Media)
https://spark-prs.appspot.com/#all
https://www.gitbook.com/book/databricks/databricks-spark-knowledge-base/details
http://insidebigdata.com/2015/03/06/8-reasons-apache-spark-hot/
https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark
http://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
http://web.eecs.umich.edu/~prabal/teaching/resources/eecs582/armbrust15sparksql.pdf
http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs
http://www.slideshare.net/databricks/new-direction-for-spark-in-2015-spark-summit-east
http://www.slideshare.net/databricks/spark-sqlsse2015public
https://spark.apache.org/docs/latest/running-on-mesos.html
http://spark.apache.org/docs/latest/cluster-overview.html
http://www.techrepublic.com/article/can-anything-dim-apache-spark/
http://spark-packages.org/
