An Overview of Apache Spark
Ilya Gulman
CTO, Twingo
June, 2014
About Twingo
• A Big Data Company
• Established in 2006 by Golan Nahum
• 25 Employees
• Reseller and expert integrator in HP VERTICA
• Reseller and integrator in MAPR
• Reseller and expert integrator in MICROSTRATEGY
• Deep knowledge in Python and Linux
• More than 20 successful Big Data projects
• Expertise in SaaS/OEM Big Data solutions
Agenda
• What is Spark?
• The Difference with Spark
• SQL on Spark
• Combining the power
• Real-World Use Cases
• Resources
What is Spark?
Fast and general MapReduce-like engine
for large-scale data processing
• Fast
In memory data storage for very fast interactive queries
Up to 100 times faster than Hadoop
• General - Unified platform that can combine:
SQL, Machine Learning , Streaming , Graph & Complex
analytics
• Ease of use
Can be developed in Java, Scala or Python
• Integrated with Hadoop
Can read from HDFS, HBase, Cassandra, and any Hadoop
data source.
What is Spark?
Spark is the Most Active Open
Source Project in Big Data
[Chart: project contributors in the past year (0–140 scale), comparing Spark with Giraph, Storm, and Tez]
The Spark Community
Unified Platform
Shark
(SQL)
Spark Streaming
(Streaming)
MLlib
(Machine learning)
Spark (General execution engine)
GraphX (Graph
computation)
Continued innovation bringing new functionality, e.g.:
• Java 8 (Closures, Lambda Expressions)
• Spark SQL (SQL on Spark, not just Hive)
• BlinkDB (Approximate Queries)
• SparkR (R wrapper for Spark)
Supported Languages
• Java
• Scala
• Python
• SQL
Data Sources
• Local Files
– file:///opt/httpd/logs/access_log
• S3
• Hadoop Distributed Filesystem
– Regular files, sequence files, any other Hadoop
InputFormat
• HBase
• Can also read from any other Hadoop data
source.
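A minimal Scala sketch of opening the sources listed above, assuming a SparkContext created as in the later examples; the paths and bucket name are placeholders invented for illustration:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("DataSources"))

val localLog = sc.textFile("file:///opt/httpd/logs/access_log")  // local file
val hdfsData = sc.textFile("hdfs:///data/events/")               // HDFS directory (illustrative path)
val s3Data   = sc.textFile("s3n://my-bucket/logs/")              // S3 via the s3n:// scheme (illustrative bucket)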
The Difference with Spark
Easy and Fast Big Data
• Easy to Develop
– Rich APIs in Java, Scala,
Python
– Interactive shell
• Fast to Run
– General execution graphs
– In-memory storage
2-5× less code; up to 10× faster on disk, 100× in memory
Resilient Distributed Datasets (RDD)
• Spark revolves around RDDs
• Fault-tolerant collection of elements that can be
operated on in parallel
– Parallelized Collection: Scala collection which is run
in parallel
– Hadoop Dataset: records of files supported by
Hadoop
http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
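As a hedged sketch (assuming the spark-shell's built-in `sc` and an invented HDFS path), the two RDD flavors above look like this in Scala:

// Parallelized collection: a local Scala collection distributed across workers
val nums = sc.parallelize(1 to 100000)

// Hadoop dataset: records of a file stored in HDFS (path is illustrative)
val lines = sc.textFile("hdfs:///data/access_log")

// Both are RDDs and can be operated on in parallel
nums.filter(_ % 2 == 0).count()
lines.filter(_.contains("ERROR")).count()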
RDD Operations
• Transformations
– Create a new dataset from an existing one
• map, filter, distinct, union, sample, groupByKey, join, etc…
• Actions
– Return a value after running a computation
• collect, count, first, takeSample, foreach, etc…
Check the documentation for a complete list
http://spark.apache.org/docs/latest/scala-programming-guide.html#rdd-operations
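A brief Scala illustration of the split between lazy transformations and actions (assumes `sc`; the input path is invented):

val words = sc.textFile("hdfs:///data/words.txt")
  .flatMap(_.split(" "))   // transformation: builds the lineage, nothing runs yet
  .distinct()              // transformation: still lazy

words.count()              // action: triggers the actual computation
words.first()              // action: returns the first element to the driver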
RDD Code Example
file = spark.textFile("hdfs://...")
errors = file.filter(lambda line: "ERROR" in line)
# Count all the errors
errors.count()
# Count errors mentioning MySQL
errors.filter(lambda line: "MySQL" in line).count()
# Fetch the MySQL errors as an array of strings
errors.filter(lambda line: "MySQL" in line).collect()
RDD Persistence / Caching
• Variety of storage levels
– MEMORY_ONLY (default), MEMORY_AND_DISK, etc…
• API Calls
– persist(StorageLevel)
– cache() – shorthand for
persist(StorageLevel.MEMORY_ONLY)
• Considerations
– Read from disk vs. recompute (MEMORY_AND_DISK)
– Total memory storage size (MEMORY_ONLY_SER)
– Replicate to a second node for faster fault recovery (MEMORY_ONLY_2)
• Think about this option if supporting a web application
http://spark.apache.org/docs/latest/scala-programming-guide.html#rdd-persistence
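A short sketch of the API calls above (assumes `sc`; the path and filter strings are illustrative):

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("hdfs:///data/access_log")

val errors = logs.filter(_.contains("ERROR"))
errors.cache()                                  // shorthand for persist(StorageLevel.MEMORY_ONLY)

val warnings = logs.filter(_.contains("WARN"))
warnings.persist(StorageLevel.MEMORY_AND_DISK)  // spill to disk rather than recompute

errors.count()   // first action materializes the cached data
errors.count()   // subsequent actions reuse it from memory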
Cache Scaling Matters
[Chart: execution time in seconds by % of working set in cache – 69 (cache disabled), 58 (25%), 41 (50%), 30 (75%), 12 (fully cached)]
RDD Fault Recovery
RDDs track lineage information that can be used to
efficiently recompute lost data
msgs = textFile.filter(lambda s: s.startswith("ERROR"))
               .map(lambda s: s.split("\t")[2])
[Diagram: lineage graph – HDFS File → filter(func = startsWith(…)) → Filtered RDD → map(func = split(…)) → Mapped RDD]
Interactive Shell
• Iterative Development
– Cache those RDDs
– Open the shell and ask questions
• We have all wished we could do this with MapReduce
– Compile / save your code for scheduled jobs later
• Scala – spark-shell
• Python – pyspark
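For example, a hedged spark-shell session might look like this (the shell provides `sc`; the path and filter string are invented):

val events = sc.textFile("hdfs:///data/events").cache()

events.count()                                  // how many events?
events.filter(_.contains("checkout")).count()   // follow-up question reuses the cached RDD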
SQL on Spark
Before Spark - Hive
• Puts structure/schema onto HDFS data
• Compiles HiveQL queries into MapReduce
jobs
• Very popular: 90+% of Facebook Hadoop
jobs generated by Hive
• Initially developed by Facebook
But.. Hive is slow
• Takes 20+ seconds even for simple
queries
• "A good day is when I can run 6 Hive
queries” – @mtraverso
SQL over Spark
Shark – SQL over Spark
• Hive-compatible (HiveQL, UDFs, metadata)
– Works in existing Hive warehouses without changing
queries or data!
• Augments Hive
– In-memory tables and columnar memory store
• Fast execution engine
– Uses Spark as the underlying execution engine
– Low-latency, interactive queries
– Scale-out and tolerates worker failures
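As a rough sketch in the style of the "Combining the power" slide later in this deck (which shows Shark's runSql hook from Scala), a Hive table could be queried directly; the table and column names here are invented, and the exact Shark API may differ:

val adCounts = sc.runSql[String, Long](
  "select campaign, count(*) from ads group by campaign")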
Machine Learning
Machine Learning - MLlib
• K-Means
• L1 and L2-regularized Linear Regression
• L1 and L2-regularized Logistic Regression
• Alternating Least Squares
• Naive Bayes
• Stochastic Gradient Descent
* As of May 14, 2014
** Don’t be surprised if you see the Mahout library converting to Spark soon
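A minimal MLlib k-means sketch against Spark 1.0-era APIs (assumes `sc`; the input path, point format, and parameters are illustrative):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Each input line holds space-separated coordinates, e.g. "34.05 -118.24"
val points = sc.textFile("hdfs:///data/tweet_locations")
  .map(line => Vectors.dense(line.split(" ").map(_.toDouble)))
  .cache()

val model = KMeans.train(points, 10, 20)   // 10 clusters, 20 iterations
println(model.clusterCenters.mkString(", "))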
Streaming
Comparison to Storm
• Higher throughput than Storm
– Spark Streaming: 670k records/sec/node
– Storm: 115k records/sec/node
– Commercial systems: 100-500k records/sec/node
[Charts: throughput per node (MB/s) vs. record size (bytes) for WordCount and Grep, Spark vs. Storm]
Combining the power
Combining the power
• Use a Machine Learning result as a table.
GENERATE KMeans(tweet_locations)
SAVE AS TABLE tweet_clusters;
• Combine SQL, ML, and streaming (Scala)
val points = sc.runSql[Double, Double](
  "select latitude, longitude from historic_tweets")
val model = KMeans.train(points, 10)
sc.twitterStream(...)
  .map(t => (model.closestCenter(t.location), 1))
  .reduceByWindow("5s", _ + _)
Real-World Use Cases
Spark at Yahoo!
• Fast Machine Learning
Personalized news pages
Spark at Yahoo!
• Hive on Spark (Shark)
Using existing BI tools to view and query advertising
analytic data collected in Hadoop.
Any tool that plugs into Hive, like Tableau, automatically
works with Shark.
Spark at Conviva
• One of the largest streaming video companies
on the Internet
• 4+ billion video feeds per month
(second only to YouTube)
• CONVIVA uses Spark Streaming to learn
network conditions in real time
Resources
Remember
• If you want to use a new technology you must learn that
new technology
• For those who have been using Hadoop for a while, at
one time you had to learn all about MapReduce and how
to manage and tune it
• To get the most out of a new technology you need to
learn that technology; this includes tuning
– There are switches you can use to optimize your work
Configuration
http://spark.apache.org/docs/latest/
Most Important
• Application Configuration
http://spark.apache.org/docs/latest/configuration.html
• Standalone Cluster Configuration
http://spark.apache.org/docs/latest/spark-standalone.html
• Tuning Guide
http://spark.apache.org/docs/latest/tuning.html
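A hedged example of setting a few commonly tuned properties programmatically (the values and master URL are illustrative only):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MyApp")
  .setMaster("spark://master:7077")
  .set("spark.executor.memory", "4g")   // memory per executor
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

val sc = new SparkContext(conf)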
Resources
• Pig on Spark
– http://apache-spark-user-list.1001560.n3.nabble.com/Pig-on-Spark-td2367.html
– https://github.com/aniket486/pig
– https://github.com/twitter/pig/tree/spork
– http://docs.sigmoidanalytics.com/index.php/Setting_up_spork_with_spark_0.8.1
– https://github.com/sigmoidanalytics/pig/tree/spork-hadoopasm-fix
• Latest on Spark
– http://databricks.com/categories/spark/
– http://www.spark-stack.org/
Thank You
www.twingo.co.il
Optional - More Examples
JavaSparkContext sc = new JavaSparkContext(master, appName, [sparkHome], [jars]);
JavaRDD<String> file = sc.textFile("hdfs://...");
JavaPairRDD<String, Integer> counts = file.flatMap(line -> Arrays.asList(line.split(" ")))
    .mapToPair(w -> new Tuple2<String, Integer>(w, 1))
    .reduceByKey((x, y) -> x + y);
counts.saveAsTextFile("hdfs://...");
val sc = new SparkContext(master, appName, [sparkHome], [jars])
val file = sc.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Word Count
• Java MapReduce (~15 lines of code)
• Java Spark (~ 7 lines of code)
• Scala and Python (4 lines of code)
– interactive shell: skip line 1 and replace the last line with counts.collect()
• Java8 (4 lines of code)
Network Word Count – Streaming
// Create the context with a 1 second batch size
val ssc = new StreamingContext(args(0), "NetworkWordCount", Seconds(1),
System.getenv("SPARK_HOME"), StreamingContext.jarOfClass(this.getClass))
// Create a NetworkInputDStream on target host:port and count the
// words in the input stream of \n delimited text (e.g. generated by 'nc')
val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_ONLY_SER)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()   // wait for the streaming computation to finish
Deploying Spark – Cluster
Manager Types
• Standalone mode
– Comes bundled (EC2 capable)
• YARN
• Mesos
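As a hedged sketch, the cluster manager is selected by the master URL passed to Spark (the hosts and ports below are illustrative):

import org.apache.spark.SparkConf

val local      = new SparkConf().setMaster("local[4]")             // local mode, 4 threads
val standalone = new SparkConf().setMaster("spark://master:7077")  // bundled standalone cluster
val mesos      = new SparkConf().setMaster("mesos://master:5050")  // Mesos
// For YARN, Spark 1.0 applications are typically launched through spark-submit
// with the master set to "yarn-client" or "yarn-cluster".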

Editor's Notes

  1. Spark is really cool…
  2. We invested more than 12 person-years in the project, the FIX project, in partnership with Matrix BI
  3. Yahoo and Adobe are in production with Spark.
  4. This sounds a lot like the reason to consider Pig vs. Java MapReduce
  5. Gracefully
  6. You can import the MLlib to use here in the shell!
  7. Don’t forget to share your experiences. This is really what the community is about. Don’t have time to contribute to open source, use it and share your experiences!
  8. This isn’t all proven out yet, but some of it should just work already.
  9. Best use case? Standalone followed by Mesos… My personal opinion is that Mesos is where the future will take us.