Spark ETL
How to create an optimal daily fantasy baseball roster.
Chicago Hadoop Users Group
May 12, 2015
Don Drake
don@drakeconsulting.com
@dondrake
Overview
• Who am I?
• ETL
• Daily Fantasy Baseball
• Spark
• Data Flow
• Extracting - web crawler
• Parquet
• DataFrames
• Python or Scala?
• Transforming - RDDs
• Transforming - Moving Average
Who Am I?
• Don Drake @dondrake
• Currently: Principal Big Data Consultant @ Allstate
• 5 years consulting on Hadoop
• Independent consultant for last 14 years
• Previous clients:
• Navteq / Nokia
• Sprint
• Mobile Meridian - Co-Founder
• MailLaunder.com - my SaaS anti-spam service
• cars.com
• Tribune Media Services
• Family Video
• Museum of Science and Industry
ETL
• Informally: Any repeatable programmed data movement
• Extraction - Get data from another source
• Oracle/PostgreSQL
• CSV
• Web crawler
• Transform - Normalize formatting of phone #’s, addresses
• Create surrogate keys
• Joining data sources
• Aggregate
• Load -
• Load into data warehouse
• CSV
• Sent to predictive model
Spark
• Apache Spark is a fast and general purpose engine for large-scale data processing.
• It provides high-level APIs in Java, Scala, and Python (REPLs for Scala and Python)
• Includes an advanced DAG execution engine that supports in-memory computing
• RDD - Resilient Distributed Dataset — core construct of the framework
• Includes a set of high-level tools: Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming
• Can run in a cluster (Hadoop (YARN), EC2, Mesos), Standalone, Local
• Open Source, core committers from Databricks
• Latest version is 1.3.1, which includes DataFrames
• Started in 2009 (AMPLab) as a research project, an Apache project since 2013.
• LOTS of momentum.
Daily Fantasy Baseball
Spark 101- Execution
• Driver - your program’s main() method
• Only 1 per application
• Executors - do the distributed work
• As many as your cluster can handle
• You determine ahead of time how many
• You determine the amount of RAM required (a configuration sketch follows below)
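A hedged sketch of setting those values in code; the property names are standard Spark configuration keys, while the application name and numbers are made up for illustration:

from pyspark import SparkConf, SparkContext

# Executor count, memory, and cores are fixed when the application starts
conf = (SparkConf()
        .setAppName("spark-etl-demo")               # illustrative name
        .set("spark.executor.instances", "10")      # how many executors (YARN)
        .set("spark.executor.memory", "4g")         # RAM per executor
        .set("spark.executor.cores", "2"))          # cores per executor
sc = SparkContext(conf=conf)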
Spark 101 - RDD
• RDD - Resilient Distributed Dataset
• Can be created from Hadoop Input formats (text
file, sequence file, parquet file, HBase, etc.) OR by
transforming other RDDs.
• RDDs support transformations, which return pointers to new RDDs, and actions, which return values
• RDDs can contain anything

scala> val textFile = sc.textFile("README.md")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3

scala> textFile.count() // Number of items in this RDD
res0: Long = 126

scala> textFile.first() // First item in this RDD
res1: String = # Apache Spark
DEMO
Spark 101 - DEMO
#!/bin/env python

from pyspark import SparkContext

sc = SparkContext()

# create some numbers
nums = sc.parallelize(xrange(1000))

nums.getNumPartitions()

nums.count()
Spark 101 - Lazy Demo
#!/bin/env python

from pyspark import SparkContext

sc = SparkContext()

# create some numbers
nums = sc.parallelize(xrange(1000))

check = sc.accumulator(0)

def isEven(x):
    check.add(1)
    return x % 2 == 0

evens = nums.filter(isEven)

evens.count()

evens.collect()

check
Spark 101 - Lazy RDDs
• RDDs are evaluated *only* when an action is called on them.
• This allows multiple transformations on an RDD, letting Spark compute an optimal execution plan.
• Uncached RDDs are re-evaluated *every* time an action is called on them
• Cache RDDs if you know you will iterate over them more than once (see the sketch below).
• You can cache to RAM + disk; by default, persist keeps data in RAM only.
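A minimal caching sketch, reusing the numbers RDD idea from the demo above:

from pyspark import SparkContext, StorageLevel

sc = SparkContext()
nums = sc.parallelize(xrange(1000000))
evens = nums.filter(lambda x: x % 2 == 0)

# Persist before running multiple actions so the filter is computed only once
evens.persist(StorageLevel.MEMORY_AND_DISK)   # evens.cache() would keep it in RAM only

print evens.count()   # first action: evaluates the lineage and populates the cache
print evens.take(5)   # second action: served from the cached partitions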
Data Flow
Overview
Extraction & Ingestion
Extraction
• We have seen it’s easy to create a parallelized data
structure that we can execute code against
• Pro-tip: If extracting from relational database, use
Sqoop + save as Parquet
• We need to download a set of files for each
baseball game previously played (player statistics)
• TODO List Pattern
• We know the filenames to download for each
game (they are static)
• Download all files for a game in parallel
def getFiles(game):
    # Download all statistics files for a single game
    session = requests.Session()
    files = scrape.get_files([game], session=session)
    count, fails = scrape.download(files, DownloadMLB.cache)
    return (count, fails)

def summarize(a, x):
    # Combine (count, failed-URL list) pairs from each game
    total = a[0] + x[0]
    alist = a[1]
    alist.extend(x[1])
    return (total, alist)

def run(self):
    sc = SparkContext()
    start_scrape = datetime.now()
    begin, begin_parts = scrape.get_boundary(self.begin)
    end, end_parts = scrape.get_boundary(self.end)

    session = requests.Session()

    all_years_months_days = self.getYearsMonths(self.WEB_ROOT, session)
    games = scrape.get_games(all_years_months_days, session=session)

    gamesRDD = sc.parallelize(games)
    print "gamesRDD=", gamesRDD

    gamesRDD.foreach(dump)
    print "# partitions:", gamesRDD.getNumPartitions()
    print "count=", gamesRDD.count()
    res = gamesRDD.map(getFiles).reduce(summarize)
    print "res=", res

    count = res[0]
    fails = res[1]
    end_scrape = datetime.now()
    self.log.info("%d files downloaded in %s", count,
                  str(end_scrape - start_scrape))
    if fails:
        for url in fails:
            self.log.error("failed to download %s", url)

    sc.stop()
TODO DEMO
Butter…..Butter……
Parquet?
• Features
• Interoperability - Spark, Impala, Sqoop, much more
• Space efficient
• Query efficient
• Schema - (field name, data type, nullable)
• Columnar Format
• Different Encoding (Compression) Algorithms
• Delta Encoding - diffs per row per column
• Dictionary Encoding (~60k items), e.g. dates, IP addresses
• Run Length Encoding - (for repetitive data)
• Example: http://blog.cloudera.com/blog/2014/05/using-impala-at-scale-at-allstate/
http://parquet.apache.org/presentations/
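A minimal sketch of writing and reading Parquet with the Spark 1.3 DataFrame API that appears later in this deck; the path and column names are illustrative:

from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext()
sqlContext = SQLContext(sc)

# Build a tiny DataFrame and write it as Parquet (the schema is stored in the file)
rows = sc.parallelize([Row(player_id="p1", avg=0.300),
                       Row(player_id="p2", avg=0.275)])
df = sqlContext.createDataFrame(rows)
df.saveAsParquetFile("/tmp/demo.parquet")

# Read it back; the schema (names, types, nullability) comes along for free
df2 = sqlContext.parquetFile("/tmp/demo.parquet")
df2.printSchema()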
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Using DataFrame - SparkSQL
• Previously called SchemaRDD; DataFrames now contain extra functionality to query/filter
• Allows you to write SQL (joins, filters, etc.) against DataFrames (or RDDs with a little effort)
• All DataFrames contain a schema
batter_mov_avg = sqlContext.parquetFile(self.rddDir + "/" +
    "batter_moving_averages.parquet")
batter_mov_avg.registerTempTable("bma")
batter_mov_avg.persist(storageLevel=StorageLevel.MEMORY_AND_DISK)
print "batter_mov_avg=", batter_mov_avg.take(2)

batter_games = sqlContext.sql("""
    select * from games g, bma, game_players gp
    where bma.player_id = gp.player_id
    and bma.game_date = gp.game_date
    and gp.game_id = g.game_id
    and g.game_date = gp.game_date
    and g.game_date = bma.game_date""")
DataFrames to the rescue
• DataFrames offer a DSL for distributed data manipulation
• In Python, you can convert a DataFrame to a pandas data frame and vice versa (see the sketch below)
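For example, a hedged sketch assuming a SQLContext and a reasonably small DataFrame df already exist:

# Spark DataFrame -> pandas DataFrame (collects all rows to the driver, so keep it small)
pdf = df.toPandas()

# pandas DataFrame -> Spark DataFrame
df2 = sqlContext.createDataFrame(pdf)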
More DataFrames
• We can select columns from a DataFrame
• Run our own transform on it with map()
• The transform function gets a Row() object and returns one
• The toDF() function will infer data types for you.
• https://issues.apache.org/jira/browse/SPARK-7182
batting_features = batter_games.select(*unique_cols)
print "batting_features=", batting_features.schema
#print "batting_features=", batting_features.show()

def transformBatters(row_object):
    row = row_object.asDict()
    row = commonTransform(row)
    return Row(**row)

batting_features = batting_features.map(transformBatters).toDF()
batting_features.persist(storageLevel=StorageLevel.MEMORY_AND_DISK)

self.rmtree(self.rddDir + "/" + "batting_features.parquet")
batting_features.saveAsParquetFile(self.rddDir + "/" +
    "batting_features.parquet")
Programmatically Specifying the Schema

class Game(AbstractDF):
    schema = StructType(sorted(
        [
            StructField("game_id", StringType()),
            StructField("game_date", DateType()),
            StructField("id", IntegerType()),
            StructField("type", StringType()),
            StructField("local_game_time", StringType()),
            StructField("game_pk", IntegerType()),
            StructField("gameday_sw", StringType()),
            StructField("game_time_et", TimestampType()),
            StructField("home_code", StringType()),
            StructField("home_abbrev", StringType()),
            StructField("home_name", StringType()),
            StructField("home_won", IntegerType()),
            StructField("home_loss", IntegerType()),
            StructField("home_division_id", IntegerType()),
            StructField("home_league", StringType()),
            StructField("away_code", StringType()),
            StructField("away_abbrev", StringType()),
            StructField("away_name", StringType()),
            StructField("away_won", IntegerType()),
            StructField("away_loss", IntegerType()),
            StructField("away_division_id", IntegerType()),
            StructField("away_league", StringType()),
            StructField("stadium_id", IntegerType()),
            StructField("stadium_name", StringType()),
            StructField("stadium_venue_w_chan_loc", StringType()),
            StructField("stadium_location", StringType()),
            StructField("modified", TimestampType()),
        ],
        key=lambda x: x.name))
    skipSelectFields = ['modified']
AbstractDF
• Download: https://gist.github.com/dondrake/c7fcf42cf051492fdd91
• AbstractDF adds 3 major features to the class:
• Creates a Python object with attributes based on the field names defined in the schema
• e.g. g = Game(); g.game_id = '123'
• Exposes a helper function to create a Row object containing all of the fields in the schema, stored with the correct data types.
• Needed so Scala can correctly infer the data types of the values sent.
• Provides a method that returns a list of columns to use in a select statement (a minimal illustrative sketch follows below).
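A minimal sketch of what a base class along these lines might look like; this is illustrative only, based on the bullets above, and is not the actual gist:

from pyspark.sql import Row
from pyspark.sql.types import StructType

class AbstractDF(object):
    # Subclasses define `schema` (a StructType) and, optionally, skipSelectFields
    schema = StructType([])
    skipSelectFields = []

    def __init__(self):
        # One attribute per field name defined in the schema
        for field in self.schema.fields:
            setattr(self, field.name, None)

    def createRow(self):
        # Build a Row keyed by field name so Spark can infer the data types
        values = dict((f.name, getattr(self, f.name)) for f in self.schema.fields)
        return Row(**values)

    @classmethod
    def getSelectFields(cls):
        # Column list for a select statement, minus any skipped fields
        return [f.name for f in cls.schema.fields
                if f.name not in cls.skipSelectFields]

Usage would then look like: g = Game(); g.game_id = '123'; row = g.createRow().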
Python or Scala???
• Use Python for prototyping
• Access to pandas, scikit-learn, etc.
• Python is strongly typed (but not statically typed)
• Spark Python API support lags, not as popular as Scala. Bugs might
not get fixed right away or at all.
• Python is slower due to the required gateway: all data must be serialized and sent through the gateway to/from Scala (not as bad for DataFrames)
• Use Scala for application development
• Scala learning curve is steep.
• Functional Programming learning curve is steep.
• Scala is a statically typed language
• Java was intentionally left off
• Don’t bother with Java 7 (IMO)
Transformations
Building a moving average
Example of a moving average
How do we compute a 5-day MA for batting average?
• For an individual player, on a particular date, we need the previous 5 days' batting average.
• Please note: not all players play every day.
• To build a predictive model, we need historical data, so we would need to calculate this for each day, every player, for (possibly) each metric.
• We would also want to use different moving-average durations (e.g. 5-, 7-, and 14-day moving averages) for each player-day.
• Our compute space just got big, and it is also embarrassingly parallel
Some Stats
• Load table (2 full years + this year so far, ~17%):
• gamePlayers = sqlContext.parquetFile(rddDir + 'game_players.parquet').cache()
• # Hitters:
• gamePlayers.filter(gamePlayers.fd_position != 'P').count()
• 248856L (# of hitter-gamedate combinations)
• # Pitchers:
• gamePlayers.filter(gamePlayers.fd_position == 'P').count()
• 244926L
• A season-level moving average would require about 248,000 * (162/2) rows of data (~20 million rows)
MapReduce is dead.
Long live MapReduce.
• Spark has a set of transformations that can be performed on RDDs of key/value pairs (a small example follows the lists below).
• Transformations
• groupByKey([numTasks])
• reduceByKey(func, [numTasks])
• aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])
• sortByKey([ascending], [numTasks])
• join(otherDataset, [numTasks])
• cogroup(otherDataset, [numTasks])
• Actions
• reduce(func)
• countByKey()
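A small example of key/value transformations on hypothetical (player_id, hits) pairs:

pairs = sc.parallelize([("p1", 2), ("p2", 0), ("p1", 1), ("p2", 3)])

# Sum per key; reduceByKey combines on the map side, so it shuffles less data
totals = pairs.reduceByKey(lambda a, b: a + b)

# Collect all values per key (only when you really need the whole list)
grouped = pairs.groupByKey().mapValues(list)

print totals.collect()    # e.g. [('p1', 3), ('p2', 3)] (order may vary)
print grouped.collect()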
Broadcast Map-side join
• "Broadcast" a dictionary of lists of game_dates, where the keys are the year the game took place.
• A flatMap operation loops over the broadcast game_dates that apply to each input stats row
• We use flatMap because we emit many rows from a single input (7-day, 14-day, etc.)
• Key: player_id + asOfDate + moving_length
• Value: Game stats dictionary from the input
• Since this generates a lot of output, I repartition(50) the output of flatMap (see the sketch below)
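A hedged sketch of this step; statsRDD, the stat dictionary fields, and the window lengths are illustrative names and not the actual project code:

# game_dates: {year: [date, date, ...]}; small enough to broadcast to every executor
game_dates_bc = sc.broadcast(game_dates)

def explodeWindows(stat_row):
    # stat_row: dict of one player's stats for one game_date
    out = []
    year = stat_row['game_date'].year
    for as_of_date in game_dates_bc.value[year]:
        for window in (5, 7, 14):
            # emit the row under every (as-of date, window length) it falls inside
            if 0 < (as_of_date - stat_row['game_date']).days <= window:
                key = (stat_row['player_id'], as_of_date, window)
                out.append((key, stat_row))
    return out

keyed = statsRDD.flatMap(explodeWindows).repartition(50)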
Reduce
• Now call groupByKey()
• Creates an RDD with a list of values per key; these values are the stats used to calculate the average
• If you are performing a simple aggregation, use reduceByKey instead; it involves less shuffling
• Calculate the average (and stddev, etc.)
• Emit a new key of player_id + asOfDate
• Value is the dictionary of all moving-average fields
• Run reduceByKey() to combine (concatenate) the 7-day, 14-day, etc. metrics into a single row
• Save the results as Parquet; they will be joined with other features to create a predictive model (see the sketch below)
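A hedged sketch of the reduce side, continuing the illustrative names from the flatMap sketch above ('batting_avg' is a made-up stat field):

def summarize(rows):
    # rows: every game-stat dict grouped under one (player_id, as_of_date, window) key
    avgs = [r['batting_avg'] for r in rows]
    return {'avg': sum(avgs) / float(len(avgs))}   # add stddev etc. here

per_window = keyed.groupByKey().mapValues(summarize)

def rekey(kv):
    # new key: (player_id, as_of_date); prefix each metric with its window length
    (player_id, as_of_date, window), metrics = kv
    prefixed = dict(('%dday_%s' % (window, k), v) for k, v in metrics.items())
    return ((player_id, as_of_date), prefixed)

def mergeDicts(a, b):
    # concatenate the 5-, 7-, and 14-day metrics into a single row
    merged = dict(a)
    merged.update(b)
    return merged

combined = per_window.map(rekey).reduceByKey(mergeDicts)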
Useful Links
1. https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html
2. https://www.usenix.org/system/files/conference/nsdi15/nsdi15-paper-ousterhout.pdf
3. https://zeppelin.incubator.apache.org/
4. https://codewords.recurse.com/issues/one/an-introduction-to-functional-programming
5. http://apache-spark-user-list.1001560.n3.nabble.com/
Spark Books
Q & A