Analytics with Spark on EMR
Jonathan Fritz
Sr. Product Manager, AWS
Spark moves at interactive speed
[Diagram: a job broken into Stages 1-3, a DAG of map, join, filter, and groupBy transformations over RDDs A-F, with cached partitions highlighted]
• Massively parallel
• Uses DAGs instead of map-reduce for execution
• Minimizes I/O by storing data in RDDs in memory
• Partitioning-aware to avoid network-intensive shuffle
Spark components to match your use case
Spark speaks your language
Use DataFrames to easily interact with data
• Distributed collection of data organized in columns
• An extension of the existing RDD API
• Optimized for query execution
Easily create DataFrames from many formats
[Diagram: DataFrames created from an existing RDD and other supported data formats]
Additional libraries for Spark SQL Data Sources at spark-packages.org
Load data with the Spark SQL Data Sources API
Additional libraries at spark-packages.org
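A minimal PySpark sketch of loading data through sqlContext.read (bucket names and paths are illustrative assumptions; the CSV reader assumes the spark-csv package from spark-packages.org):

df = sqlContext.read.json("s3://my-bucket/events/")                      # built-in JSON source
parquet_df = sqlContext.read.parquet("s3://my-bucket/events-parquet/")   # built-in Parquet source
csv_df = (sqlContext.read.format("com.databricks.spark.csv")             # spark-csv package
          .option("header", "true")
          .load("s3://my-bucket/events.csv"))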
Sample DataFrame manipulations
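For example, a hedged sketch of common DataFrame operations (the column names are assumptions):

df.printSchema()
df.select("level", "message").show(5)
df.filter(df["level"] == "ERROR").count()
df.groupBy("level").count().show()
df.registerTempTable("events")    # then query it with sqlContext.sql("SELECT ... FROM events")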
Use DataFrames for machine learning
• Spark ML libraries (replacing MLlib) use DataFrames as input/output for models
• Create ML pipelines with a variety of distributed algorithms (see the sketch below)
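A minimal pyspark.ml pipeline sketch (training_df, test_df, and their "text"/"label" columns are assumptions for illustration):

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

tokenizer = Tokenizer(inputCol="text", outputCol="words")        # split text into words
hashingTF = HashingTF(inputCol="words", outputCol="features")    # hash words into feature vectors
lr = LogisticRegression(maxIter=10)                              # expects "features" and "label" columns
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
model = pipeline.fit(training_df)                                # training_df: hypothetical labeled DataFrame
predictions = model.transform(test_df)                           # test_df: hypothetical DataFrame to score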
Create DataFrames on streaming data
• Access data in a Spark Streaming DStream
• Create a SQLContext on the SparkContext used for the Spark Streaming application for ad hoc queries
• Incorporate DataFrames in your Spark Streaming application (see the sketch below)
• Checkpoint streaming jobs
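A hedged sketch of that pattern, along the lines of the singleton-SQLContext approach in the Spark Streaming guide (words_dstream and its contents are assumptions):

from pyspark.sql import SQLContext, Row

def get_sql_context(spark_context):
    # Lazily create one SQLContext on the streaming app's SparkContext
    if not hasattr(get_sql_context, "instance"):
        get_sql_context.instance = SQLContext(spark_context)
    return get_sql_context.instance

def process(time, rdd):
    if rdd.isEmpty():
        return
    sql_context = get_sql_context(rdd.context)
    df = sql_context.createDataFrame(rdd.map(lambda w: Row(word=w)))
    df.registerTempTable("words")    # now available for ad hoc SQL queries
    sql_context.sql("SELECT word, COUNT(*) AS total FROM words GROUP BY word").show()

words_dstream.foreachRDD(process)    # words_dstream: hypothetical DStream of strings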
Spark Pipeline
Use R to interact with DataFrames
• SparkR package for using R to manipulate DataFrames
• Create SparkR applications or interactively use the SparkR shell (no Zeppelin support yet - ZEPPELIN-156)
• Comparable performance to Python and Scala DataFrames
Spark SQL
• Seamlessly mix SQL with Spark programs
• Uniform data access
• Hive compatibility - run Hive queries without modifications using HiveContext (see the sketch below)
• Connect through JDBC/ODBC using the Spark Thrift Server (coming soon natively in EMR)
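A brief HiveContext sketch (the web_logs Hive table is an assumption):

from pyspark.sql import HiveContext

sqlContext = HiveContext(sc)    # sc is your existing SparkContext
top_pages = sqlContext.sql(
    "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page ORDER BY hits DESC LIMIT 10")
top_pages.show()                # the result is a DataFrame, so it mixes freely with DataFrame code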
Spark architecture
• SparkContext runs as a library in your program, one instance per Spark app
• Cluster managers: Standalone, Mesos, or YARN
• Accesses storage via the Hadoop InputFormat API, and can use S3 with EMRFS, HBase, HDFS, and more
[Diagram: your application's SparkContext (with local threads) talks to a cluster manager, which launches Spark executors on workers backed by HDFS or other storage]
Amazon EMR runs Spark on YARN
• Dynamically share and centrally configure the same pool of cluster resources across engines
• Schedulers for categorizing, isolating, and prioritizing workloads
• Choose the number of executors to use, or allow YARN to choose (dynamic allocation)
• Kerberos authentication
[Diagram: applications (Pig, Hive, Cascading, Spark Streaming, Spark SQL) run on batch MapReduce and in-memory Spark engines, over YARN cluster resource management and S3/HDFS storage]
RDDs (and now DataFrames) and Fault Tolerance

RDDs track the transformations used to build them (their lineage) to recompute lost data. E.g.:

messages = textFile(...).filter(lambda s: "ERROR" in s).map(lambda s: s.split('\t')[2])

Lineage: HadoopRDD (path = hdfs://…) → FilteredRDD (func = contains(...)) → MappedRDD (func = split(…))
Caching RDDs can boost performance

Load error messages from a log into memory, then interactively search for patterns:

lines = spark.textFile("hdfs://...")                         # base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))       # transformed RDD
messages = errors.map(lambda s: s.split('\t')[2])
messages.cache()

messages.filter(lambda s: "foo" in s).count()                # action
messages.filter(lambda s: "bar" in s).count()
. . .

[Diagram: the driver ships tasks to three workers; each worker reads its block (Block 1-3) from HDFS, caches its partition (Cache 1-3), and returns results to the driver]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
RDD Persistence
• Cache or persist datasets in memory
• Methods: cache(), persist()
• Small RDD → MEMORY_ONLY
• Big RDD → MEMORY_ONLY_SER (CPU intensive)
• Don't spill to disk
• Use replicated storage for faster recovery (see the sketch below)
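A minimal sketch of those choices (the RDD names are assumptions):

from pyspark import StorageLevel

small_rdd.cache()                                  # shorthand for persist(StorageLevel.MEMORY_ONLY)
big_rdd.persist(StorageLevel.MEMORY_ONLY_SER)      # serialized in memory: smaller, but more CPU to read
critical_rdd.persist(StorageLevel.MEMORY_ONLY_2)   # replicated on two nodes for faster recovery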
Inside Spark Executor on YARN
Max container size on node
YARN container - controls the maximum total amount of memory used by the container
yarn.nodemanager.resource.memory-mb
Default: 116 GB (config file: yarn-site.xml)
Inside Spark Executor on YARN
Max Container size on node
Executor space - where the Spark executor runs, inside the executor container
Inside Spark Executor on YARN
Max Container size on node
Executor memory overhead - off-heap memory (VM overheads, interned strings, etc.)
spark.yarn.executor.memoryOverhead = executorMemory * 0.10
Config file: spark-defaults.conf
Inside Spark Executor on YARN
Max Container size on node
Spark executor memory - amount of memory to use per executor process
spark.executor.memory
Config file: spark-defaults.conf
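A hedged worked example (the 10 GB executor is an assumption for illustration, not a recommendation):
spark.executor.memory = 10g
spark.yarn.executor.memoryOverhead ≈ 10 GB × 0.10 = 1 GB
Total YARN container request ≈ 10 GB + 1 GB = 11 GB, which must fit under yarn.nodemanager.resource.memory-mb on the node.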
Inside Spark Executor on YARN
Max Container size on node
Shuffle memory fraction - pre-Spark 1.6
spark.shuffle.memoryFraction (default: 0.2)
Inside Spark Executor on YARN
Max Container size on node
Storage memory fraction - pre-Spark 1.6
spark.storage.memoryFraction (default: 0.6)
Inside Spark Executor on YARN
Max Container size on node
In Spark 1.6+, Spark automatically balances the amount of memory used for execution and for cached data.
Execution / cache region - default: 0.6
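A hedged arithmetic sketch (the 10 GB figure is an assumption): with spark.executor.memory = 10g and the 0.6 region above, roughly 6 GB is shared between execution and cached data, and Spark shifts the boundary between the two at runtime instead of using the fixed shuffle/storage fractions of earlier releases.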
Dynamic Allocation on YARN
Scaling up on executors
- Request when you want the job to complete faster
- Idle resources on cluster
- Exponential increase in executors over time
New default in EMR 4.4 (coming soon!)
Dynamic allocation setup
Property                                                    Value
spark.dynamicAllocation.enabled                             true
spark.shuffle.service.enabled                               true
Optional tuning:
spark.dynamicAllocation.minExecutors                        5
spark.dynamicAllocation.maxExecutors                        17
spark.dynamicAllocation.initialExecutors                    0
spark.dynamicAllocation.executorIdleTimeout                 60s
spark.dynamicAllocation.schedulerBacklogTimeout             5s
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout    5s
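A hedged sketch of supplying these as an EMR configuration object (EMR 4.x spark-defaults classification; the values mirror the table above):

[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.dynamicAllocation.enabled": "true",
      "spark.shuffle.service.enabled": "true",
      "spark.dynamicAllocation.minExecutors": "5",
      "spark.dynamicAllocation.maxExecutors": "17"
    }
  }
]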
Compress your input data set
• Always compress data files on Amazon S3
• Reduces storage cost
• Reduces bandwidth between Amazon S3 and Amazon EMR, which can speed up bandwidth-constrained jobs
Compressions
Compression types:
– Some are fast but offer less space reduction
– Some are space-efficient but slower
– Some are splittable and some are not

Algorithm   % Space Remaining   Encoding Speed   Decoding Speed
GZIP        13%                 21 MB/s          118 MB/s
LZO         20%                 135 MB/s         410 MB/s
Snappy      22%                 172 MB/s         409 MB/s
Data Serialization
• Data is serialized when cached or shuffled
• Default: Java serializer
• Kryo serialization (10x faster than Java serialization)
  • Does not support all Serializable types
  • Register the class in advance
Usage: set in SparkConf (see the sketch below)
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
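A minimal SparkConf sketch (the com.example classes are hypothetical placeholders):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        # Registering classes lets Kryo write a small ID instead of the full class name
        .set("spark.kryo.classesToRegister", "com.example.MyRecord,com.example.MyKey"))
sc = SparkContext(conf=conf)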
Running Spark on Amazon EMR

Focus on deriving insights from your data instead of manually configuring clusters:
• Easy to install and configure Spark
• Secured
• Spark submit, Oozie, or the Zeppelin UI
• Quickly add and remove capacity
• Hourly, reserved, or EC2 Spot pricing
• Use S3 to decouple compute and storage
Launch the latest Spark version
Spark 1.6.0 is the current version on EMR.
< 3 week cadence with latest open source release
Create a fully configured cluster in minutes
• AWS Management Console
• AWS Command Line Interface (CLI)
• Or use an AWS SDK directly with the Amazon EMR API
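For example, a hedged CLI sketch (release label, instance type/count, and key name are assumptions):

aws emr create-cluster --name "Spark cluster" \
  --release-label emr-4.3.0 \
  --applications Name=Spark \
  --instance-type m3.xlarge --instance-count 3 \
  --ec2-attributes KeyName=myKey \
  --use-default-roles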
Or easily change your settings
Many storage layers to choose from
• Amazon S3 - EMR File System (EMRFS)
• Amazon DynamoDB - EMR-DynamoDB connector
• Amazon RDS - JDBC Data Source with Spark SQL
• Amazon Kinesis - streaming data connectors
• Elasticsearch - Elasticsearch connector
• Amazon Redshift - Spark-Redshift connector
Decouple compute and storage by using S3 as your data layer
• S3 is designed for 11 9s of durability and is massively scalable
[Diagram: multiple Amazon EMR clusters (EC2 instance memory and HDFS) reading and writing the same data in Amazon S3]
Easy to run your Spark workloads
• Amazon EMR Step API - submit a Spark application
• SSH to the master node and use spark-submit, Oozie, or Zeppelin (see the sketches below)
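Two hedged sketches (the cluster ID, bucket, and application names are assumptions):

# Submit via the EMR Step API
aws emr add-steps --cluster-id j-XXXXXXXX \
  --steps Type=Spark,Name="MyApp",ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,s3://my-bucket/my-app.py]

# Or from the master node over SSH
spark-submit --master yarn --deploy-mode cluster s3://my-bucket/my-app.py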
Secured Spark clusters
Encryption At-Rest
• HDFS transparent encryption (AES 256)
• Local disk encryption for temporary files using LUKS encryption
• EMRFS support for Amazon S3 client-side and server-side encryption
Encryption In-Flight
• Secure communication with SSL from S3 to EC2 (nodes of cluster)
• HDFS blocks encrypted in-transit when using HDFS encryption
• SASL encryption for Spark Shuffle
Permissions
• IAM roles, Kerberos, and IAM Users
Access
• VPC and Security Groups
Auditing
• AWS CloudTrail (logs delivered to Amazon S3)
Customer use cases
Some of our customers running Spark on EMR
Integration Pattern – ETL with Spark
[Diagram: read unstructured data from Amazon S3, extract and load via HDFS on Amazon EMR, then write structured output data back to Amazon S3]
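A hedged PySpark sketch of the pattern (bucket, paths, and the three-field record layout are assumptions):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="etl-example")
sqlContext = SQLContext(sc)

raw = sc.textFile("s3://my-bucket/raw-logs/")              # unstructured input read from S3 via EMRFS
parsed = (raw.map(lambda line: line.split("\t"))
             .filter(lambda fields: len(fields) == 3))     # keep only well-formed records
df = sqlContext.createDataFrame(parsed, ["ts", "level", "message"])
df.write.parquet("s3://my-bucket/structured/logs/")        # structured output stored back to S3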
Integration Pattern – Tumbling Window Reporting
[Diagram: streaming input from Amazon Kinesis is aggregated on Amazon EMR (HDFS) over tumbling/fixed windows; periodic output is loaded into Amazon Redshift with COPY from EMR, or checkpointed to S3 and loaded with the Lambda loader app]
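A hedged Spark Streaming sketch of a tumbling-window count from Kinesis (stream name, region, record format, and window sizes are assumptions; requires the spark-streaming-kinesis-asl package on the classpath):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

sc = SparkContext(appName="tumbling-window")
ssc = StreamingContext(sc, 60)                              # 60-second batches
ssc.checkpoint("hdfs:///checkpoints/tumbling-window")

stream = KinesisUtils.createStream(
    ssc, "tumbling-window-app", "my-stream",
    "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
    InitialPositionInStream.LATEST, 60)

# Count events per key over 5-minute tumbling windows (window length == slide interval)
counts = (stream.map(lambda record: (record.split(",")[0], 1))
                .reduceByKeyAndWindow(lambda a, b: a + b, None, 300, 300))
counts.pprint()    # in practice, write each windowed RDD out for the Redshift COPY or Lambda loader

ssc.start()
ssc.awaitTermination()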
Zeppelin demo
Jonathan Fritz
Sr. Product Manager
jonfritz@amazon.com