A Fast Intro to Spark
And a glance at BEAM
Lightning fast cluster computing*
Who am I?
Who am I?
● So this is kind of a long shot, but American TV gets everywhere so….
● I’m not a doctor but I did stay at an IHG property last night
● Which is like a fancy version of Holiday Inn Express
○ I’m honestly not sure if this makes me more or less qualified
● And I did get my IHG points restored
● Ok but for real
Who am I?
● My name is Holden Karau
● Preferred pronouns are she/her
● Developer Advocate at Google focused on OSS Big Data
● Apache Spark PMC
● Contributor to a lot of other projects (including BEAM)
● previously IBM, Alpine, Databricks, Google, Foursquare & Amazon
● co-author of High Performance Spark & Learning Spark (+ more)
● Twitter: @holdenkarau
● Slideshare http://www.slideshare.net/hkarau
● Linkedin https://www.linkedin.com/in/holdenkarau
● Github https://github.com/holdenk
● Related Spark Videos http://bit.ly/holdenSparkVideos
Who do I think you all are?
● Nice people*
● Getting started with Spark or BEAM
○ Or wondering if you need it
● Familiar-ish with Scala or Java or Python
Amanda
What we are going to explore together!
● What is Spark?
● Getting Spark setup locally
● Spark’s primary distributed collection
● Word count in Spark
● Spark SQL / DataFrames
Then a glance at BEAM
● What is BEAM & what’s its current state
● Streaming wordcount because of course
Some things that may color my views:
● I’m on the Spark PMC -- Spark’s success => I can probably make more $s
● My employer cares about BEAM (and Spark and other things)
● I work primarily in Python & Scala these days
● I like functional programming
● Probably some others I’m forgetting
On the other hand:
● I’ve worked on Spark for a long time and know a lot of its faults
● My goals are pretty flexible
● I have x86 assembly code tattooed on my back
What is Spark?
● General purpose distributed system
○ With a really nice API including Python :)
● Apache project (one of the most active)
● Much faster than Hadoop Map/Reduce
● Good when too big for a single machine
● Built on top of two abstractions for distributed data: RDDs & Datasets
The different pieces of Spark
[Stack diagram: Apache Spark core with Scala, Java, Python & R APIs; Spark SQL, DataFrames & Datasets; Structured Streaming & Streaming; Spark ML & MLlib; GraphX & Bagel; plus packages like GraphFrames]
Paul Hudson
Why people come to Spark:
Well this MapReduce job is going to take 16 hours - how long could it take to learn Spark?
dougwoods
Why people come to Spark:
My DataFrame won’t fit in memory on my cluster anymore, let alone my MacBook Pro :( Maybe this Spark business will solve that...
brownpau
Companion (optional!) notebook funtimes:
http://bit.ly/sparkDocs
http://bit.ly/sparkPyDocs
http://bit.ly/PySparkIntroExamples (has a notebook!)
● Did you know? You can run Spark on Dataproc, thereby giving my employer money. You can also run it elsewhere.
http://bit.ly/learningSparkExamples (lots of code files)
http://bit.ly/hkMLExample (has a notebook, ML focused)
David DeHetre
SparkContext: entry to the world
● Can be used to create RDDs from many input sources
○ Native collections, local & remote FS
○ Any Hadoop Data Source
● Also create counters & accumulators
● Automatically created in the shells (called sc)
● Specify master & app name when creating
○ Master can be local[*], spark:// , yarn, etc.
○ app name should be human readable and make sense
● etc.
Petful
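A minimal sketch of building your own SparkContext outside the shell (the app name and master here are just illustrative):

from pyspark import SparkConf, SparkContext

# local[*] = run locally with one worker thread per core
conf = SparkConf().setMaster("local[*]").setAppName("fast-intro-to-spark")
sc = SparkContext(conf=conf)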
RDDs: Spark’s Primary abstraction
RDD (Resilient Distributed Dataset)
● Distributed collection
● Recomputed on node failure
● Distributes data & work across the cluster
● Lazily evaluated (transformations & actions)
Helen Olney
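A tiny sketch of that model using the shell’s sc: distribute a local collection, transform it lazily, then force work with an action:

nums = sc.parallelize(range(1, 1001), 8)      # distributed collection in 8 partitions
squares = nums.map(lambda x: x * x)           # transformation: nothing runs yet
total = squares.reduce(lambda a, b: a + b)    # action: triggers the distributed computation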
Word count (in python)
lines = sc.textFile(src)
words = lines.flatMap(lambda x: x.split(" "))
word_count = (words.map(lambda x: (x, 1))
              .reduceByKey(lambda x, y: x+y))
word_count.saveAsTextFile("output")
Photo By: Will
Keightley
Why laziness is cool (and not)
● Pipelining (can put maps, filter, flatMap together)
● Can do interesting optimizations by delaying work
● We use the DAG to recompute on failure
○ (writing data out to 3 disks on different machines is so last season)
○ Or the DAG puts the R in Resilient RDD, except DAG doesn’t have an R :(
How it hurts:
● Debugging is confusing
● Re-using data - laziness only sees up to the first action
● Some people really hate immutability
Matthew Hurst
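A small illustration of that trade-off (data.txt is just a stand-in path): transformations return immediately, and without caching, each action re-runs the whole chain:

lines = sc.textFile("data.txt").map(lambda s: s.upper())  # returns instantly, no I/O yet
lines.count()   # first action: reads & maps everything
lines.count()   # second action: reads & maps everything again unless we cache/persist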
Word count (in python)
lines = sc.textFile(src)
words = lines.flatMap(lambda x: x.split(" "))
word_count = (words.map(lambda x: (x, 1))
              .reduceByKey(lambda x, y: x+y))
word_count.saveAsTextFile("output")
No data is read or
processed until after
this line
This is an “action”
which forces spark to
evaluate the RDD
daniilr
RDD re-use - sadly not magic
● If we know we are going to re-use the RDD what should we do?
○ If it fits nicely in memory: cache it in memory
○ Persisting at another level
■ MEMORY_ONLY, MEMORY_ONLY_SER, MEMORY_AND_DISK, MEMORY_AND_DISK_SER
○ Checkpointing
● Noisy clusters
○ _2 (replicated) storage levels & checkpointing can help
● Persist first when checkpointing
Richard Gillin
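Roughly what those options look like in PySpark, reusing the words RDD from the word count example (the checkpoint path is illustrative):

from pyspark import StorageLevel

sc.setCheckpointDir("hdfs:///tmp/checkpoints")    # somewhere reliable
words.persist(StorageLevel.MEMORY_AND_DISK_2)     # _2 = replicated, helps on noisy clusters
words.checkpoint()                                # lineage is truncated once materialized
words.count()                                     # an action actually does the work

# or, when it comfortably fits in memory:
# words.cache()                                   # shorthand for persist(StorageLevel.MEMORY_ONLY)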
Some common transformations & actions
Transformations (lazy)
● map
● filter
● flatMap
● reduceByKey
● join
● cogroup
Actions (eager)
● count
● reduce
● collect
● take
● saveAsTextFile
● saveAsHadoop
● countByValue
Photo by Steve
Photo by Dan G
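A few of these strung together on made-up data (a sketch, assuming the shell’s sc):

pets = sc.parallelize([("cat", 1), ("dog", 1), ("cat", 2)])
ages = sc.parallelize([("cat", 9), ("dog", 3)])

counts = pets.reduceByKey(lambda a, b: a + b)   # lazy
joined = counts.join(ages)                      # lazy: (key, (count, age)) pairs
joined.take(2)                                  # eager: ships a couple of results to the driver
pets.countByValue()                             # eager: returns a dict to the driver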
This can feel like magic* sometimes :)
Steven Saus
*I mean not good magic.
Magic has its limits: key-skew + black boxes
● There is a worse way to do WordCount
● We can use the seemingly safe thing called groupByKey
● Then compute the sum...
_torne
Bad word count RDD :(
words = rdd.flatMap(lambda x: x.split(" "))
wordPairs = words.map(lambda w: (w, 1))
grouped = wordPairs.groupByKey()
counted_words = grouped.mapValues(lambda counts: sum(counts))
counted_words.saveAsTextFile("boop")
Tomomi
Ford Pinto by Morven
Ford Pinto by Morven, ayphen
Why should we consider Datasets?
● Performance
○ Smart optimizer
○ More efficient storage
○ Faster serialization
● Simplicity
○ Windowed operations
○ Multi-column & multi-type aggregates
Rikki's Refuge
Why are Datasets so awesome?
● Easier to mix functional style and relational style
○ No more Hive UDFs!
● Nice performance of Spark SQL with the flexibility of RDDs
○ Tungsten (better serialization)
○ Equivalent of Sortable trait
● Strongly typed
● The future (ML, Graph, etc.)
● Potential for better language interop
○ Something like Arrow has a much better chance with Datasets
○ Cross-platform libraries are easier to make & use
Will Folsom
What is the performance like?
Andrew Skudder
How is it so fast?
● Optimizer has more information (schema & operations)
● More efficient storage formats
● Faster serialization
● Some operations directly on serialized data formats
● non-JVM languages: does more computation in the JVM
Andrew Skudder
Word count w/Dataframes
from pyspark.sql import Row

df = spark.read.load(src)
# Returns an RDD
words = df.select("text").rdd.flatMap(lambda x: x.text.split(" "))
words_df = words.map(lambda x: Row(word=x, cnt=1)).toDF()
word_count = words_df.groupBy("word").sum()
word_count.write.format("parquet").save("wc.parquet")
Still have the double
serialization here :(
Word count w/Datasets
import spark.implicits._
import org.apache.spark.sql.functions.count

val df = spark.read.load(src).select("text")
val ds = df.as[String]
// Returns a Dataset!
val words = ds.flatMap(x => x.split(" "))
val grouped = words.groupBy("value")
val word_count = grouped.agg(count("*") as "count")
word_count.write.format("parquet").save("wc")
Can’t push down
filters from here
If it’s a simple type we don’t
have to define a case class
What can the optimizer do now?
● Sort on the serialized data
● Understand the aggregate (“partial aggregates”)
○ Could sort of do this before but not as awesomely, and only if we used
reduceByKey - not groupByKey
● Pack them bits nice and tight
So what’s this new groupBy?
● No longer causes explosions like RDD groupBy
○ Able to introspect and pipeline the aggregation
● Returns a GroupedData (or GroupedDataset)
● Makes it easy to perform multiple aggregations
● Built in shortcuts for aggregates like avg, min, max
● Longer list at http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
● Allows the optimizer to see what aggregates are being
performed
Sherrie Thai
Computing some aggregates by age code:
df.groupBy("age").min("hours-per-week")
OR
import org.apache.spark.sql.functions._
df.groupBy("age").agg(min("hours-per-week"))
Easily compute multiple aggregates:
df.groupBy("age").agg(min("hours-per-week"),
avg("hours-per-week"),
max("capital-gain"))
PhotoAtelier
Using Datasets to mix functional & relational style:
val ds: Dataset[RawPanda] = ...
val happiness = ds.toDF().filter($"happy" === true).as[RawPanda].
select($"attributes"(0).as[Double]).
reduce((x, y) => x + y)
So what was that?
ds.toDF().filter($"happy" === true).as[RawPanda].
select($"attributes"(0).as[Double]).
reduce((x, y) => x + y)
Convert a Dataset to a DataFrame to access more DataFrame functions (pre-2.0)
Convert the DataFrame back to a Dataset
A typed query (specifies the return type)
Traditional functional reduction: arbitrary Scala code :)
And functional style maps:
/**
* Functional map + Dataset, sums the positive attributes for the pandas
*/
def funMap(ds: Dataset[RawPanda]): Dataset[Double] = {
ds.map{rp => rp.attributes.filter(_ > 0).sum}
}
Chris Isherwood
But where do DataFrames explode?
● Iterative algorithms - large plans
○ Use your escape hatch to RDDs!
● Some push downs are sad pandas :(
● Default shuffle size is sometimes too small for big data
(200 partitions)
● Default partition size when reading in is also sad
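Both defaults are just knobs; a hedged example of turning them for a bigger job:

spark.conf.set("spark.sql.shuffle.partitions", "2000")   # default is 200
df = spark.read.load(src).repartition(2000)              # fix a too-coarse input partitioning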
Our ever growing ecosystem:
http://mattturck.com/bigdata2017/
General purpose eating the world
● Operations overhead
● Moving data from System 1 to System 2 (sqoop and friends)
● We still have specialized tools, but they’re being built on top of general frameworks
○ e.g. see mahout on Spark
○ Less closely tied things like Hive/Pig on Spark
○ TF.Transform etc.
Photo by D Coetzee
Even then, lots of general purpose tools:
Resulting in:
flink/
h2o/
hdfs/
integration/
mr/
spark/
viennacl/
And language silos (Scala, Python, Go, etc.!)
Photo by: photobom Photo: Fritz Schuman (ScalaDays CPH)
And cloud silos….
Photo By: Zechariah Judy
And the proliferation of pagers :(
Photo by: Hades2k
Mike Knell
What’s the state of non-JVM big data?
Most of the tools are built in the JVM, so how do we play together?
● Pickling, Strings, JSON, XML, oh my!
● Unix pipes
● Sockets
What about if we don’t want to copy the data all the time?
● Or standalone “pure”* re-implementations of everything
○ Reasonable option for things like Kafka where you would have the I/O regardless.
○ Also cool projects like dask -- but hard to talk to existing ecosystem
David Brown
Spark in Scala, how does PySpark work?
● Py4J + pickling + JSON and magic
○ This can be kind of slow sometimes
● Distributed collections are often collections of pickled
objects
● Spark SQL (and DataFrames) avoid some of this
○ Sometimes we can make them go fast and compile them to the JVM
● Features aren’t automatically exposed, but exposing
them is normally simple.
● SparkR depends on similar magic
kristin klein
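You can actually poke at that bridge from a PySpark shell; a harmless example (not something day-to-day code needs):

# sc._jvm is the Py4J gateway into the driver JVM
jvm_millis = sc._jvm.java.lang.System.currentTimeMillis()

# RDD elements are pickled Python objects on the JVM side
pickled = sc.parallelize([{"name": "holden", "likes": "spark"}])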
So what does that look like?
[Diagram: the Python driver talks to the JVM driver over Py4J; JVM workers (Worker 1 … Worker K) launch Python workers and exchange data over pipes]
The “future”*: faster interchange
● By future I mean availability starting in the next 3-6 months (with more
improvements after).
○ Yes much of this code exists, it just isn’t released yet so I’m sure we’ll find all sorts of bugs
and ways to improve.
○ Relatedly you can help us test in Spark 2.3 when we start the RC process to catch bugs early!
● Unifying our cross-language experience
○ And not just “normal” languages, CUDA counts yo
Tambako The Jaguar
Andrew Skudder
*Arrow: likely the future. I really hope so. Spark 2.3 and beyond!
**With early work happening to support GPUs / TF.
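One concrete piece of that Arrow work landing in Spark 2.3: vectorized (pandas) UDFs, which hand whole columns across the language boundary instead of pickling row by row. A sketch, assuming a DataFrame df with a numeric column x:

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf("double", PandasUDFType.SCALAR)
def plus_one(v):
    # v arrives as a pandas Series backed by Arrow, not as pickled rows
    return v + 1

df.withColumn("x_plus_one", plus_one(df["x"]))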
BEAM backends:
● BEAM nominally supports*
○ Dataflow
○ Flink*
○ Spark*
○ IBM Streams, etc.
● Goal of more than just lowest-common-denominator, think of it like a compiler**
*Supports as in early-stage, but we’re working on it (and we’d love your help!)
**But you know, in the same sense I compare Spark Streams to pandas coming down a wooden slide.
BEAM Languages
● JVM: Scala, Java, etc.
● non-JVM: Python w/Go and more coming
BEAM Beyond the JVM
● This part doesn’t work outside of Google’s hosted environment yet, so I’m
going to be light on the details
● tl;dr : uses grpc / protobuf
● But exciting new plans (w/ some early code) to unify the runners and ease the support of different languages (called SDKs)
○ See https://beam.apache.org/contribute/portability/
What do the different APIs look like?
● Everyone's favourite: Streaming Word Count Example
● And then windowed wordcount!
● (And also a peek at TensorFlow in case anyone is trying to raise a Series A)
Spark wordcount (Python*) - “pure” relational
# Split the lines into words
words = lines.select(
# explode turns each item in an array into a separate row
explode(
split(lines.value, ' ')
).alias('word')
)
# Generate running word count
wordCounts = words.groupBy('word').count()
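For context, lines above is a streaming DataFrame; a minimal sketch of where it might come from and how the query starts, following the Structured Streaming socket example:

from pyspark.sql.functions import explode, split

lines = (spark.readStream
    .format("socket")
    .option("host", "localhost")
    .option("port", 9999)
    .load())

# ...build words and wordCounts as above, then:
query = (wordCounts.writeStream
    .outputMode("complete")
    .format("console")
    .start())
query.awaitTermination()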
BEAM wordcount (Java)
p.apply("ExtractWords", ParDo.of(new DoFn<String, String>() {
@ProcessElement
public void processElement(ProcessContext c) {
for (String word : c.element().split("\\W+")) {
c.output(word);
}}}))
.apply(Count.<String>perElement())
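Roughly the same pipeline in the Beam Python SDK (a sketch, assuming a lines PCollection already read from somewhere):

import re
import apache_beam as beam

counts = (
    lines
    | "ExtractWords" >> beam.FlatMap(lambda line: re.findall(r"[\w']+", line))
    | "CountWords" >> beam.combiners.Count.PerElement())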
What about windowed word count?
Trish Hamme
Christer van der Meeren
What else might happen?
● One execution engine becomes super amazing at everything
● Instead of a compiler-like unifier we see something like streaming SQL become our unifier
○ Relatedly BEAM, Spark & Flink, and Kafka all have streaming SQL implementations.
● People realize their big data problem is actually three small data problems in a trench coat
bnilsen
And some upcoming talks:
● Jan
○ If there’s interest tomorrow: Office Hours? Tweet me @holdenkarau
○ LinuxConf AU - next week
○ Sydney Spark meetup 23rd
○ Data Day Texas - Nate will be there too!
● Feb
○ FOSDEM - One on testing one on scaling
○ JFokus in Stockholm - Adding deep learning to Spark
○ I disappear for a week and pretend computers work
● March
○ Strata San Jose - Big Data Beyond the JVM
Learning Spark
Fast Data Processing with Spark (Out of Date)
Fast Data Processing with Spark (2nd edition)
Advanced Analytics with Spark
Coming soon: Spark in Action
High Performance Spark
Coming soon: Learning PySpark
High Performance Spark!
Available today!
You can buy it from that scrappy Seattle bookstore; Jeff Bezos needs another newspaper and I want a cup of coffee.
http://bit.ly/hkHighPerfSpark
Cat wave photo by Quinn Dombrowski
k thnx bye!
If you <3 testing & want to fill out
survey: http://bit.ly/holdenTestingSpark
Want to tell me (and/or my boss) how I’m doing?
http://bit.ly/holdenTalkFeedback
Want to e-mail me? Promise not to be
creepy? Ok:
holden@pigscanfly.ca