Introduction to Real-time Big Data with Apache Spark
Introduction
About Me
https://ua.linkedin.com/in/tarasmatyashovsky
Agenda
• Buzzwords
• Spark in a Nutshell
• Spark Concepts
• Spark Core
• Live Demo Session
• Spark SQL
• Live Demo Session
• Road to Production
• Spark Drawbacks
• Our Spark Integration
• Spark Is on the Rise
Buzzword for data sets so large and complex
that they are difficult to process using on-hand
database management tools or
traditional data processing applications
https://www.linkedin.com/pulse/decoding-buzzwords-big-data-predictive-analytics-business-gordon
http://www.ibmbigdatahub.com/infographic/four-vs-big-data
Jesus Christ,
It is Big Data,
Get Hadoop!
by Sergey Shelpuk (https://ua.linkedin.com/in/shelpuk) at AI Club Meetup in Lviv
To Hadoop?
http://www.thoughtworks.com/insights/blog/hadoop-or-not-hadoop
• Batch mode, not real-time
• Unstructured or semi-structured data
• MapReduce programming model, e.g.
key/value pairs
Not to Hadoop?
• Real-time, streaming
• Structures that cannot be decomposed
into key/value pairs
• Jobs/algorithms that do not fit
the MapReduce programming model
http://www.thoughtworks.com/insights/blog/hadoop-or-not-hadoop
Not to Hadoop?
• Subset of the data is enough
Remove excessive complexity or shrink the data set via other
processing techniques, e.g. hashing or clustering
• Random, interactive access to data
For well-structured data, a bunch of scalable, mature (No)SQL
DB solutions exist (HBase, Cassandra, columnar scalable DW engines)
• Sensitive data
Security in Hadoop is still very challenging and immature
Why Spark?
As of mid-2014,
Spark is the most active Big Data project
http://www.slideshare.net/databricks/new-direction-for-spark-in-2015-spark-summit-east
Contributors per month to Spark
Spark
Fast and general-purpose
cluster computing platform
for large-scale data processing
History
Time to Sort 100TB
http://www.slideshare.net/databricks/new-direction-for-spark-in-2015-spark-summit-east
Why Is Spark Faster?
Spark processes data in-memory, while
Hadoop persists results back to disk
after each map/reduce step
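To make the difference concrete, here is a minimal, illustrative Java sketch (file name and filters are assumptions, not from the demo): an RDD kept in memory with cache() is materialized once and then reused by later actions, instead of being re-read from disk.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf().setAppName("cache-demo").setMaster("local[*]");
JavaSparkContext sc = new JavaSparkContext(conf);

JavaRDD<String> events = sc.textFile("events.log").cache(); // keep partitions in memory
long errors = events.filter(l -> l.contains("ERROR")).count();  // 1st action: reads the file
long warnings = events.filter(l -> l.contains("WARN")).count(); // 2nd action: served from memory
```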
Powered by Spark
https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark
Components Stack
https://databricks.com/blog/2015/02/09/learning-spark-book-available-from-oreilly.html
Core Concepts
automatically distribute data across the cluster
and
parallelize the operations performed on it
Distributed Application
https://databricks.com/blog/2015/02/09/learning-spark-book-available-from-oreilly.html
http://spark.apache.org/docs/latest/cluster-overview.html
Spark Core Abstractions
RDD API
Transformations:
• filter()
• map()
• flatMap()
• distinct()
• union()
• intersection()
• subtract()
• etc.
Actions:
• collect()
• reduce()
• count()
• countByValue()
• first()
• take()
• top()
• etc.
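A minimal sketch of how these compose, continuing with the JavaSparkContext sc from the earlier snippet (file name illustrative): transformations only describe a new RDD, and no work happens until an action is invoked.

```java
JavaRDD<String> lines = sc.textFile("data.csv");            // nothing read yet
JavaRDD<String> nonEmpty = lines.filter(l -> !l.isEmpty()); // transformation: lazy
JavaRDD<Integer> lengths = nonEmpty.map(String::length);    // transformation: lazy
long total = lengths.count();                               // action: triggers the job
```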
RDD Operations
• transformations are executed on
workers
• actions may transfer data from the
workers to the driver
• collect() sends all the partitions to the
single driver
http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs
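Continuing the sketch above, an illustration of the collect() caveat:

```java
import java.util.List;

List<String> everything = nonEmpty.collect(); // ships ALL partitions to the driver: may OOM
List<String> sample = nonEmpty.take(10);      // safer: fetches only the first 10 elements
```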
Pair RDD
Transformations:
• reduceByKey()
• groupByKey()
• sortByKey()
• keys()
• values()
• join()
• etc.
Actions:
• countByKey()
• collectAsMap()
• lookup()
• etc.
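The classic word count shows the pair-RDD style (a sketch reusing sc from above; file name illustrative). reduceByKey() is usually preferred over groupByKey() because it combines values map-side before the shuffle.

```java
import java.util.Arrays;
import java.util.Map;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

JavaPairRDD<String, Integer> counts = sc.textFile("data.txt")
    .flatMap(line -> Arrays.asList(line.split(" ")))  // split lines into words
    .mapToPair(word -> new Tuple2<>(word, 1))         // (word, 1) pairs
    .reduceByKey((a, b) -> a + b);                    // combines map-side, then shuffles

Map<String, Integer> asMap = counts.collectAsMap();   // action: pairs back on the driver
```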
Sample Application
https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
Requirements
Analytics about Morning@Lohika events:
• unique participants by companies
• most loyal participants
• participants by position
• etc.
https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
Data Format
Simple CSV files; all fields are optional:

First Name | Last Name    | Company      | Position          | Email                             | Present
Vladimir   | Tsukur       | GlobalLogic  | Tech/Team Lead    | flushdia@gmail.com                | 1
Mikalai    | Alimenkou    | XP Injection | Tech Lead         | mikalai.alimenkou@xpinjection.com | 1
Taras      | Matyashovsky | Lohika       | Software Engineer | taras.matyashovsky@gmail.com      | 0
https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
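A sketch of how such rows might be parsed into objects (the bean and field order are assumptions; the actual code lives in the linked repository). split(",", -1) keeps trailing empty fields, which matters since all fields are optional.

```java
// Hypothetical bean; the real sample app defines its own model.
public class Participant implements java.io.Serializable {
    public final String firstName, lastName, company, position, email;
    public final boolean present;

    public Participant(String[] f) {
        firstName = f[0]; lastName = f[1]; company = f[2];
        position = f[3]; email = f[4]; present = "1".equals(f[5]);
    }
}

JavaRDD<Participant> participants = sc.textFile("participants.csv")
    .map(line -> line.split(",", -1))  // -1: keep empty optional fields
    .map(Participant::new);
```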
Technologies
• Spring Boot 1.2.3.RELEASE
• Spark 1.3.1 - released April 17, 2015
• 2 Spark jar dependencies
• Apache 2.0 license, i.e. free to use
https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
Features
• simple HTTP-based API
• file system: local and HDFS
• data formats: CSV and Parquet
• 3 compatible implementations based on:
• RDD (Spark Core)
• Data Frame DSL (Spark SQL)
• Data Frame SQL (Spark SQL)
• serialization: default Java and Kryo
https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
Demo Time
https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
[Diagram: the driver program's SparkContext connects to a cluster manager, which launches executors on worker nodes; each executor runs multiple tasks.]
http://spark.apache.org/docs/latest/cluster-overview.html
Demo Explained
Functional Programming API Drawback
Limited opportunities for automatic optimization
Spark SQL
Structured data processing
Data Frame
Distributed collection of data organized into named columns
Data Frame API
• selecting columns
• joining different data sources
• aggregation, e.g. sum, count, average
• filtering
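A sketch of both Spark SQL flavors used in the demo, with the Spark 1.3 Java API (file, table, and column names are assumptions):

```java
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

SQLContext sqlContext = new SQLContext(sc);
DataFrame df = sqlContext.load("participants.parquet"); // Spark 1.3-era loader

// Data Frame DSL:
DataFrame byCompany = df.filter(df.col("present").equalTo(1))
                        .groupBy("company")
                        .count();

// Data Frame SQL:
df.registerTempTable("participants");
DataFrame loyal = sqlContext.sql(
    "SELECT company, COUNT(*) AS total FROM participants " +
    "WHERE present = 1 GROUP BY company");
```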
Plan Optimization & Execution
http://web.eecs.umich.edu/~prabal/teaching/resources/eecs582/armbrust15sparksql.pdf
Faster than RDD
http://www.slideshare.net/databricks/spark-sqlsse2015public
Demo Time
https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
Persistence & Caching
• by default stores the data in the JVM
heap as deserialized objects
• possibility to store on disk, as
deserialized or serialized objects
• off-heap caching is experimental and
uses Tachyon
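A sketch of choosing a storage level explicitly (RDD name from the parsing sketch above):

```java
import org.apache.spark.storage.StorageLevel;

// An RDD's storage level can be set once; pick the level that fits:
participants.persist(StorageLevel.MEMORY_ONLY());
// ...or: StorageLevel.MEMORY_AND_DISK_SER() - serialized, spills to disk
// ...or: StorageLevel.OFF_HEAP()            - experimental, Tachyon-backed
participants.unpersist(); // release when no longer needed
```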
https://spark.apache.org/docs/latest/running-on-mesos.html
Cluster Manager
Standalone, Apache Mesos, or Hadoop YARN
should be chosen and configured properly
Monitoring
via web UI(s) and metrics
Monitoring
• master web UI
• worker web UI
• driver web UI
• available only during execution
• history server
• spark.eventLog.enabled = true
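A standalone sketch of enabling the event log so the history server can replay a finished application (the log directory is an assumption and must exist beforehand):

```java
SparkConf conf = new SparkConf()
    .setAppName("monitored-app")
    .set("spark.eventLog.enabled", "true")              // record events for the history server
    .set("spark.eventLog.dir", "hdfs:///spark-events"); // assumed, pre-created directory
```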
Metrics
• based on Coda Hale Metrics library
• can be reported via HTTP, JMX, and
CSV files
https://spark.apache.org/docs/latest/tuning.html
Serialization
https://spark.apache.org/docs/latest/configuration.html#compression-and-serialization
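A sketch of switching from default Java serialization to Kryo, reusing the SparkConf and the hypothetical Participant bean from above; registering classes is optional but avoids writing full class names into the serialized data.

```java
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
conf.registerKryoClasses(new Class<?>[]{ Participant.class });
```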
Memory Management
Tune Executor Memory Fraction
• RDD storage (60%)
• shuffle and aggregation buffers (20%)
• user code (20%)
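These fractions are tunable via legacy Spark 1.x properties, e.g. shifting memory from caching toward shuffle buffers for shuffle-heavy jobs (values illustrative; reuses the SparkConf conf from the earlier sketches):

```java
conf.set("spark.storage.memoryFraction", "0.5"); // down from the 0.6 default
conf.set("spark.shuffle.memoryFraction", "0.3"); // up from the 0.2 default
```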
https://spark.apache.org/docs/latest/configuration.html#shuffle-behavior
Memory Management
Tune storage level:
• store in memory and/or on disk
• store as deserialized/serialized objects
• replicate each partition on 1 or 2 cluster
nodes
• store in Tachyon
https://spark.apache.org/docs/latest/programming-guide.html#which-storage-level-to-choose
Level of Parallelism
• spark.task.cpus
• 1 task per partition using 1 core to execute
• spark.default.parallelism
• can be controlled:
• repartition() and coalesce() functions
• degree of parallelism as an operation parameter
• storage system matters
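A sketch of the knobs above (partition counts illustrative; events and counts are the RDDs from the earlier sketches):

```java
JavaRDD<String> wide = events.repartition(64); // more partitions: full shuffle
JavaRDD<String> narrow = wide.coalesce(8);     // fewer partitions: avoids a shuffle
JavaPairRDD<String, Integer> agg =
    counts.reduceByKey((a, b) -> a + b, 32);   // parallelism as an operation parameter
conf.set("spark.default.parallelism", "64");   // cluster-wide default
```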
Data Locality
• check data locality via UI
• configure data locality settings if
needed
• spark.locality.wait timeout
• execute certain jobs on a driver
• spark.localExecution.enabled
Java API Drawbacks
• parts of the API are experimental or
intended just for development
• the Java API may lag behind, as the
Scala API is the main focus
Our Spark Integration
Product
Cloud-based analytics application
Use Cases
• supplement the Neo4j database used to
store/query big dimensions
• supplement an RDBMS for querying
high volumes of data
Use Cases
• represent existing computational graph
as flow of Spark-based operations
• predictive analytics based on the Spark
MLlib component
Lessons Learned
• Spark's simplicity is deceptive
• Each use case is unique
• Stay up to date via:
• Databricks blog
• Mailing lists & Jira
• Pull requests
Spark is kind of magic
Spark Is on the Rise
http://www.techrepublic.com/article/can-anything-dim-apache-spark/
Project Tungsten
• the largest change to Spark’s execution
engine since the project’s inception
• focuses on substantially improving the
efficiency of memory and CPU for
Spark applications
• sun.misc.Unsafe
https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html
Thank you!
Taras Matyashovsky
taras.matyashovsky@gmail.com
@tmatyashovsky
http://www.filevych.com/
References
https://www.linkedin.com/pulse/decoding-buzzwords-big-data-predictive-analytics-business-gordon
http://www.ibmbigdatahub.com/infographic/four-vs-big-data
http://www.thoughtworks.com/insights/blog/hadoop-or-not-hadoop
http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/
Learning Spark, by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia (early release ebook from O'Reilly Media)
https://spark-prs.appspot.com/#all
https://www.gitbook.com/book/databricks/databricks-spark-knowledge-base/details
http://insidebigdata.com/2015/03/06/8-reasons-apache-spark-hot/
https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark
http://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
http://web.eecs.umich.edu/~prabal/teaching/resources/eecs582/armbrust15sparksql.pdf
http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs
http://www.slideshare.net/databricks/new-direction-for-spark-in-2015-spark-summit-east
http://www.slideshare.net/databricks/spark-sqlsse2015public
https://spark.apache.org/docs/latest/running-on-mesos.html
http://spark.apache.org/docs/latest/cluster-overview.html
http://www.techrepublic.com/article/can-anything-dim-apache-spark/
http://spark-packages.org/
