Intro to Apache Spark, by the CTO of Twingo
- 2. About Twingo
• A Big Data Company
• Established in 2006 by Golan Nahum
• 25 Employees
• Reseller and expert integrator of HP Vertica
• Reseller and integrator of MapR
• Reseller and expert integrator of MicroStrategy
• Deep knowledge of Python and Linux
• More than 20 successful Big Data projects delivered
• Expertise in SaaS/OEM Big Data solutions
- 3. Agenda
• What is Spark?
• The Difference with Spark
• SQL on Spark
• Combining the power
• Real-World Use Cases
• Resources
- 5. What is Spark?
Fast and general MapReduce-like engine for large-scale data processing
• Fast
In-memory data storage for very fast interactive queries
Up to 100 times faster than Hadoop
• General - Unified platform that can combine:
SQL, Machine Learning, Streaming, Graph & Complex analytics
• Ease of use
Can be developed in Java, Scala or Python
• Integrated with Hadoop
Can read from HDFS, HBase, Cassandra, and any Hadoop data source.
- 6. Spark is the Most Active Open
Source Project in Big Data
[Chart: contributors per project in the past year (scale 0-140); Spark leads Giraph, Storm, and Tez]
- 10. Data Sources
• Local Files
– file:///opt/httpd/logs/access_log
• S3
• Hadoop Distributed Filesystem
– Regular files, sequence files, any other Hadoop
InputFormat
• HBase
• Can also read from any other Hadoop data
source.
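A minimal Scala sketch of reading from several of these sources; the bucket, hostnames, and paths below are placeholders, not real endpoints:
// Assumes an existing SparkContext named sc (e.g. the one created by spark-shell)
import org.apache.spark.SparkContext._   // implicit converters for sequenceFile key/value types
val localLog = sc.textFile("file:///opt/httpd/logs/access_log")             // local file
val s3Logs   = sc.textFile("s3n://my-bucket/logs/")                         // S3 (credentials via Hadoop config)
val hdfsLogs = sc.textFile("hdfs://namenode:8020/logs/")                    // plain files on HDFS
val seqData  = sc.sequenceFile[String, String]("hdfs://namenode:8020/seq/") // Hadoop sequence files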
- 12. Easy and Fast Big Data
• Easy to Develop
– Rich APIs in Java, Scala,
Python
– Interactive shell
• Fast to Run
– General execution graphs
– In-memory storage
2-5× less code
Up to 10× faster on disk, 100× in memory
- 13. Resilient Distributed Datasets (RDD)
• Spark revolves around RDDs
• Fault-tolerant collection of elements that can be
operated on in parallel
– Parallelized Collection: Scala collection which is run
in parallel
– Hadoop Dataset: records of files supported by
Hadoop
http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
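A short Scala sketch of the two ways to create an RDD mentioned above; names and paths are illustrative only:
// Assumes an existing SparkContext named sc
// Parallelized collection: a local Scala collection distributed across the cluster
val numbers = sc.parallelize(1 to 1000)
// Hadoop dataset: records of a file stored in HDFS (or any other Hadoop-supported source)
val lines = sc.textFile("hdfs://namenode:8020/data/input.txt")
// Both are RDDs, so both support the same parallel operations
val evens = numbers.filter(_ % 2 == 0).count()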
- 14. RDD Operations
• Transformations
– Creation of a new dataset from an existing one
• map, filter, distinct, union, sample, groupByKey, join, etc…
• Actions
– Return a value after running a computation
• collect, count, first, takeSample, foreach, etc…
Check the documentation for a complete list
http://spark.apache.org/docs/latest/scala-programming-guide.html#rdd-operations
- 15. RDD Code Example
file = sc.textFile("hdfs://...")  # sc is the SparkContext (provided automatically in the pyspark shell)
errors = file.filter(lambda line: "ERROR" in line)
# Count all the errors
errors.count()
# Count errors mentioning MySQL
errors.filter(lambda line: "MySQL" in line).count()
# Fetch the MySQL errors as an array of strings
errors.filter(lambda line: "MySQL" in line).collect()
- 16. RDD Persistence / Caching
• Variety of storage levels
– MEMORY_ONLY (default), MEMORY_AND_DISK, etc…
• API Calls
– persist(StorageLevel)
– cache() – shorthand for
persist(StorageLevel.MEMORY_ONLY)
• Considerations
– Read from disk vs. recompute (MEMORY_AND_DISK)
– Total memory storage size (MEMORY_ONLY_SER)
– Replicate to a second node for faster fault recovery
(MEMORY_ONLY_2)
• Think about this option if supporting a web application
http://spark.apache.org/docs/latest/scala-programming-guide.html#rdd-persistence
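A minimal Scala sketch of the calls above; the data set is hypothetical, and note that a storage level can only be assigned once per RDD:
import org.apache.spark.storage.StorageLevel
// Assumes an existing SparkContext named sc
val logs = sc.textFile("hdfs://namenode:8020/logs/")
// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)
val errors = logs.filter(_.contains("ERROR")).cache()
// An explicit level trades recomputation for disk reads when memory is tight
val warnings = logs.filter(_.contains("WARN")).persist(StorageLevel.MEMORY_AND_DISK)
// The first action materializes the cached data; later actions reuse it
errors.count()
errors.filter(_.contains("MySQL")).count()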
- 18. RDD Fault Recovery
RDDs track lineage information that can be used to
efficiently recompute lost data
msgs = textFile.filter(lambda s: s.startswith("ERROR")) \
               .map(lambda s: s.split("\t")[2])
[Lineage diagram: HDFS File → filter(func = startsWith(…)) → Filtered RDD → map(func = split(…)) → Mapped RDD]
- 19. Interactive Shell
• Iterative Development
– Cache those RDDs
– Open the shell and ask questions
• We have all wished we could do this with MapReduce
– Compile / save your code for scheduled jobs later
• Scala – spark-shell
• Python – pyspark
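A hypothetical spark-shell session showing the iterative style described above (the path and fields are made up):
// Inside spark-shell; sc is created for you
val events = sc.textFile("hdfs://namenode:8020/events/").cache()
// Ask questions interactively; the cached RDD is not re-read from disk each time
events.count()
events.filter(_.contains("checkout")).count()
events.map(_.split(",")(0)).distinct().count()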
- 21. Before Spark - Hive
• Puts structure/schema onto HDFS data
• Compiles HiveQL queries into MapReduce
jobs
• Very popular: 90+% of Facebook's Hadoop
jobs are generated by Hive
• Initially developed by Facebook
- 22. But.. Hive is slow
• Takes 20+ seconds even for simple
queries
• "A good day is when I can run 6 Hive
queries” – @mtraverso
- 24. Shark – SQL over Spark
• Hive-compatible (HiveQL, UDFs, metadata)
– Works in existing Hive warehouses without changing
queries or data!
• Augments Hive
– In-memory tables and columnar memory store
• Fast execution engine
– Uses Spark as the underlying execution engine
– Low-latency, interactive queries
– Scale-out and tolerates worker failures
- 27. Machine Learning - MLlib
• K-Means
• L1 and L2-regularized Linear Regression
• L1 and L2-regularized Logistic Regression
• Alternating Least Squares
• Naive Bayes
• Stochastic Gradient Descent
* As of May 14, 2014
** Don’t be surprised if you see the Mahout library converting to Spark soon
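A hedged Scala sketch of calling one of these algorithms, K-Means, assuming an MLlib version whose KMeans.train takes an RDD of Vectors (Spark 1.0-era API); the input file of space-separated coordinates is hypothetical:
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
// Assumes an existing SparkContext named sc; each input line is a point, e.g. "41.2 -73.9"
val points = sc.textFile("hdfs://namenode:8020/data/points.txt")
  .map(line => Vectors.dense(line.split(" ").map(_.toDouble)))
  .cache()
// Cluster into 10 groups, at most 20 iterations
val model = KMeans.train(points, 10, 20)
model.clusterCenters.foreach(println)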
- 29. Comparison to Storm
• Higher throughput than Storm
– Spark Streaming: 670k records/sec/node
– Storm: 115k records/sec/node
– Commercial systems: 100-500k records/sec/node
[Charts: throughput per node (MB/s) vs. record size (bytes) for WordCount and Grep, Spark vs. Storm]
- 31. Combining the power
• Use a Machine Learning result as a table.
GENERATE KMeans(tweet_locations)
SAVE AS TABLE tweet_clusters;
• Combine SQL, ML, and streaming (Scala)
val points = sc.runSql[Double, Double](
  "select latitude, longitude from historic_tweets")
val model = KMeans.train(points, 10)
sc.twitterStream(...)
.map(t => (model.closestCenter(t.location), 1))
.reduceByWindow("5s", _ + _)
- 34. Spark at Yahoo!
• Hive on Spark (Shark)
Using existing BI tools to view and query advertising
analytic data collected in Hadoop.
Any tool that plugs into Hive, like Tableau, automatically
works with Shark.
- 35. Spark at Conviva
• One of the largest streaming video companies
on the Internet
• 4+ billion video feeds per month
(second only to YouTube)
• Conviva uses Spark Streaming to learn
network conditions in real time
- 37. Remember
• If you want to use a new technology you must learn that
new technology
• For those who have been using Hadoop for a while, at
one time you had to learn all about MapReduce and how
to manage and tune it
• To get the most out of a new technology you need to
learn that technology, including tuning
– There are switches you can use to optimize your work
- 39. Resources
• Pig on Spark
– http://apache-spark-user-list.1001560.n3.nabble.com/Pig-on-Spark-td2367.html
– https://github.com/aniket486/pig
– https://github.com/twitter/pig/tree/spork
– http://docs.sigmoidanalytics.com/index.php/Setting_up_spork_with_spark_0.8.1
– https://github.com/sigmoidanalytics/pig/tree/spork-hadoopasm-fix
• Latest on Spark
– http://databricks.com/categories/spark/
– http://www.spark-stack.org/
- 42. Word Count
// Java
JavaSparkContext sc = new JavaSparkContext(master, appName, [sparkHome], [jars]);
JavaRDD<String> file = sc.textFile("hdfs://...");
JavaPairRDD<String, Integer> counts = file.flatMap(line -> Arrays.asList(line.split(" ")))
    .mapToPair(w -> new Tuple2<String, Integer>(w, 1))
    .reduceByKey((x, y) -> x + y);
counts.saveAsTextFile("hdfs://...");
// Scala
val sc = new SparkContext(master, appName, [sparkHome], [jars])
val file = sc.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
• Java MapReduce (~15 lines of code)
• Java Spark (~7 lines of code)
• Scala and Python (4 lines of code)
– interactive shell: skip line 1 and replace the last line with counts.collect()
• Java 8 (4 lines of code)
- 43. Network Word Count – Streaming
// Create the context with a 1 second batch size
val ssc = new StreamingContext(args(0), "NetworkWordCount", Seconds(1),
System.getenv("SPARK_HOME"), StreamingContext.jarOfClass(this.getClass))
// Create a NetworkInputDStream on target host:port and count the
// words in the input stream of \n-delimited text (e.g. generated by 'nc')
val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_ONLY_SER)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()  // wait for the streaming computation to finish
- 44. Deploying Spark – Cluster
Manager Types
• Standalone mode
– Comes bundled (EC2 capable)
• YARN
• Mesos
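A hedged Scala sketch of how the cluster manager choice shows up in code, via the master URL handed to Spark; hostnames and ports are placeholders:
import org.apache.spark.{SparkConf, SparkContext}
// Standalone cluster manager (bundled with Spark)
val standaloneConf = new SparkConf().setAppName("demo").setMaster("spark://master-host:7077")
// Local mode, handy for development and testing
val localConf = new SparkConf().setAppName("demo").setMaster("local[4]")
// Mesos cluster manager
val mesosConf = new SparkConf().setAppName("demo").setMaster("mesos://mesos-master:5050")
// On YARN the master is usually supplied by the submission tooling rather than hard-coded
val sc = new SparkContext(standaloneConf)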
Editor's Notes
- Spark is really cool…
- We invested more than 12 person-years in the FIX project, together with Matrix BI.
- Yahoo and Adobe are in production with Spark.
- This sounds a lot like the reason to consider Pig vs. Java MapReduce
- Gracefully
- You can import the MLlib to use here in the shell!
- Don’t forget to share your experiences. This is really what the community is about.
Don’t have time to contribute to open source, use it and share your experiences!
- This isn’t all proven out yet, but some of it should just work already.
- Best use case? Standalone followed by Mesos… My personal opinion is that Mesos is where the future will take us.