ACCELERATING DATA
SCIENCE WITH BETTER DATA
ENGINEERING ON DATABRICKS
Andrew Candela
WHAT IS MEDIAMATH?
• MediaMath is a demand-side media buying platform.
• We bid on ad opportunities from exchanges like
Google and Facebook and serve our clients’ ads in those
spots.
• We leverage any information we can use to increase
the performance of ads and target more efficiently.
WHAT IS ANALYTICS AT
MEDIAMATH?
• A bunch of wannabe Data Scientists turned Data
Engineers
• Had a ton of data and good ideas, but were limited
by computational capabilities
• Learned as we went. Databricks accelerated this
journey
WE NEED TO PROCESS TERABYTES OF
DATA WRITTEN TO S3 EVERY DAY IN
ORDER TO:
• Build new features for models
• Build new reporting for clients
• Set up internal data pipelines
THE VISION
Aggregate hundreds of TBs from S3 and write results to
PostgreSQL to get things like…
THE AUDIENCE INDEX REPORT (AIR)
Allows advertisers to gain insights into demographics of
users visiting their sites
BUILDING THE REPORT
THE INDEX FOR SITE P
• a measure of how many
users of a certain type
we saw vs. how many
we expected to see
• To find the index we
need to compute the
size of 4 groups of users
THE RAW DATA IN S3
• The data is provided by our
partners and is made available
for our team in S3.
• It consists of at least one
record for each user-segment
combination per day.
• MASSIVE redundancy (all we
care about is membership as
of a certain time)
User ID (String) | Segment ID (Integer) | Unix Timestamp (Integer)
A                | 1                    | 1495129113
A                | 2                    | 1495129245
A                | 2                    | 1495129250
B                | 1                    | 1495129245
This is really all we’re trying to do
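Collapsing that redundancy is the whole job: for each user/segment pair, all we need is membership as of a certain time, i.e. the min/max timestamps. A minimal pure-Python sketch of the reduction, using the sample rows from the table above (in Spark this would be a `map` to `(user, segment)` keys followed by `reduceByKey`; the stand-alone loop here is just illustrative):

```python
# Collapse redundant (user, segment, timestamp) records down to one
# record per user/segment pair, keeping only min and max timestamps.
raw = [
    ("A", 1, 1495129113),
    ("A", 2, 1495129245),
    ("A", 2, 1495129250),
    ("B", 1, 1495129245),
]

membership = {}
for user, seg, ts in raw:
    key = (user, seg)
    lo, hi = membership.get(key, (ts, ts))
    membership[key] = (min(lo, ts), max(hi, ts))

# The two ("A", 2) rows collapse into a single (min, max) entry.
```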
HAVEN’T YOU HEARD OF HIVE, SON?
WE STARTED BY TRYING HIVE.
WE STRUGGLED.
• There is a lot of data: Segments table has about 3
trillion rows. Pixel table has 16 billion.
• Naively joining and aggregating with Hive is the worst
way to do it
• Data can be transformed into a manageable format,
but one that is awkward to express with SQL
LIFE BEFORE DATABRICKS
(HIVE)
• One row per user/segment joined to one row per
user/pixel on userID
• Ran once a week and took days to complete on a cluster of
65 m4.2xlarge nodes.
• Had to be careful - a single MapReduce job would write > 1 TB
to HDFS
• The join and then the shuffle after the join was killing us
KEY:VALUE FORMAT (UDB)
• One record per user - the key
• Value is a Python dictionary of segment information (max/min
timestamps)
• Much wider than the original but far fewer records (roughly
200x fewer)
• Users expire out of UDB after not being loaded to a segment
for 30 days
• Persisted to S3 as sequence files bucketed by key
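The per-user dictionary can be built with `combineByKey`, whose three arguments are plain Python functions. A hedged sketch of what those functions might look like for the UDB layout described above (segment ID → (first_seen, last_seen)); the function names are illustrative, not from the deck:

```python
# Three functions in the shape combineByKey expects, building one
# dict per user mapping segment ID -> (min_ts, max_ts).

def create_combiner(seg_ts):
    """First (segment, ts) seen for a user starts the dict."""
    seg, ts = seg_ts
    return {seg: (ts, ts)}

def merge_value(acc, seg_ts):
    """Fold another (segment, ts) into a user's dict on the same node."""
    seg, ts = seg_ts
    lo, hi = acc.get(seg, (ts, ts))
    acc[seg] = (min(lo, ts), max(hi, ts))
    return acc

def merge_combiners(a, b):
    """Merge two partial dicts for the same user across partitions."""
    for seg, (lo, hi) in b.items():
        if seg in a:
            alo, ahi = a[seg]
            a[seg] = (min(alo, lo), max(ahi, hi))
        else:
            a[seg] = (lo, hi)
    return a

# With a Spark RDD of (user, (segment, ts)) pairs this would be:
# udb = raw_rdd.combineByKey(create_combiner, merge_value, merge_combiners)
```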
WHY NOT USE
DATAFRAMES?
• The nesting of the records makes it awkward to
deal with using SQL
• flatMap() and combineByKey() are the main drivers
of increased performance
THE MAIN DRAWBACK:
MAINTAINING UDB
• Add new users
• Expire old users
• Update existing users
• Hard, but worth it if we run this multiple times
• All this logic is conveniently expressed with Python
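The maintenance step boils down to a full outer join of yesterday's UDB against today's records, merging the dictionaries and dropping stale users. A sketch of the two per-user functions such a job might use (names and toy values are illustrative; the 30-day cutoff is from the slide):

```python
# Daily UDB maintenance: merge today's updates into each user's
# segment dict, and expire users with no activity in 30 days.

EXPIRY_SECONDS = 30 * 24 * 3600

def merge_user(old, new):
    """Combine an existing user's segment dict with today's updates.
    Either side may be None, as in a full outer join."""
    if old is None:
        return new          # brand-new user
    if new is None:
        return old          # not seen today; may expire later
    merged = dict(old)
    for seg, (lo, hi) in new.items():
        if seg in merged:
            mlo, mhi = merged[seg]
            merged[seg] = (min(mlo, lo), max(mhi, hi))
        else:
            merged[seg] = (lo, hi)
    return merged

def is_active(segments, now):
    """Keep the user only if some segment was seen within 30 days."""
    return any(hi >= now - EXPIRY_SECONDS for _, hi in segments.values())

# In Spark, roughly:
# udb.fullOuterJoin(today).mapValues(lambda p: merge_user(*p)) \
#    .filter(lambda kv: is_active(kv[1], now))
```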
THE JOIN
• Super easy as the data is already in pair RDDs.
• Since the records are wide this step is not as
painful as it used to be
• Data must be shuffled (this sucks). It’s already in S3
clustered by user (the join key), but Spark doesn’t
know that and shuffles anyway
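Because both datasets are pair RDDs keyed by user ID, the join itself is a one-liner (`segments_rdd.join(pixels_rdd)` or similar). A toy simulation with plain dicts, just to show the shape of the result (names and values are illustrative):

```python
# Inner join of two user-keyed datasets, simulated with dicts.
# In Spark this is a single pair-RDD join; Spark still shuffles
# because it can't see that the S3 files are bucketed by user.
segments = {"A": {1: (10, 20)}, "B": {1: (15, 15)}}
pixels   = {"A": {"pixelA": 30}, "C": {"pixelB": 40}}

# Keep only users present in both datasets, pairing their values.
joined = {u: (segments[u], pixels[u]) for u in segments.keys() & pixels.keys()}
```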
AGGREGATING THE DATA
COUNTING RESULTS IS NOT
EASY
• Each record represents a unique user. Just need to count up how
many pixel/segment pairs I see across all records.
• Initially tried exploding on pixel/segment and converting to a
dataframe
• Records are so heavily nested that fully exploding causes spill to
disk. We were filling up 100GB EBS volumes attached to the
nodes. Our production cluster uses 65 nodes.
• Skew was not an issue in this case
EXPLODE, BUT CAREFULLY
• flatMap - create one row per pixel for each user
(pixelA, {seg1, segA, …}), # from user 1
(pixelB, {seg1, segB, …}), # from user 1
(pixelA, {seg1, seg3, …}), # from user 2
• combineByKey - keep a running tally of how many segments are seen by each pixel. Since data is first
combined by pixel on the nodes, the shuffle stage has far less data to deal with
(pixelA,{seg1:2, seg2:1, seg3:1, …}),
(pixelB,{seg1:1, seg3:1, …})
• flatMap again - make the final dataset with one row per pixel/segment combination
(pixelA, seg1, 2),
(pixelA, seg2, 1)
• Much better. Now nothing spills to disk, takes about half the time.
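The three steps above can be sketched in pure Python; the per-record logic is what Spark would run distributed via `flatMap` and `combineByKey` (toy data matches the slide; the tally here stands in for the combine functions):

```python
# flatMap -> combineByKey -> flatMap, simulated locally.

# Step 1 output: one (pixel, segment_set) row per pixel per user.
rows = [
    ("pixelA", {"seg1", "segA"}),  # from user 1
    ("pixelB", {"seg1", "segB"}),  # from user 1
    ("pixelA", {"seg1", "seg3"}),  # from user 2
]

def tally(rows):
    """Step 2: running tally of segment counts per pixel. In Spark,
    combineByKey does this map-side first, so far less data is shuffled."""
    counts = {}
    for pixel, segs in rows:
        acc = counts.setdefault(pixel, {})
        for seg in segs:
            acc[seg] = acc.get(seg, 0) + 1
    return counts

def flatten(counts):
    """Step 3: one (pixel, segment, count) row per combination."""
    for pixel, segs in counts.items():
        for seg, n in segs.items():
            yield (pixel, seg, n)
```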
WRITING THE DATA:
JDBC
• Convert the aggregated RDD to a DataFrame. Filter out garbage
records and use the JDBC connector
• df.write.jdbc(jdbcURL, MyPostgresTable, mode='overwrite')
• Fair performance (1.6 hours for about 41MM rows) - not
affected by the presence of an index; the write takes just as
long on an unindexed table
• Write to a staging table, then swap the staging and prod
tables so the view that the app uses points to the refreshed
data
• Careful indexing of tables makes selects fast!
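The staging-then-swap pattern above might look something like this; the table names are made up for illustration, and the swap relies on PostgreSQL's transactional DDL so the app's view never sees a half-loaded table:

```python
# Write to a staging table, then atomically swap staging and prod.

def swap_statements(prod="air_report", staging="air_report_staging"):
    """PostgreSQL statements that promote the freshly loaded staging
    table to production inside one transaction."""
    return [
        "BEGIN;",
        f"ALTER TABLE {prod} RENAME TO {prod}_old;",
        f"ALTER TABLE {staging} RENAME TO {prod};",
        f"ALTER TABLE {prod}_old RENAME TO {staging};",
        "COMMIT;",
    ]

# The Spark side (not run here) loads the staging table first:
# df.write.jdbc(jdbcURL, "air_report_staging", mode="overwrite")
```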
PUTTING IT ALL TOGETHER
• Wrap logic, execution and monitoring into
objects/functions in a notebook
• Run notebook as a job
• Schedule that job right from the Databricks UI.
• Reporting, monitoring and some retry logic comes for
free!
CONCLUSION: HIVE VS SPARK
• Cannot compare directly because the implementations are
different
• The Spark implementation performs FAR better: run time of
approximately 11 hours vs. two days on similar
hardware. 1/4 the time and price!
• Development time is the real win. It’s far faster and
easier to develop new pipelines
OUR CURRENT WORKFLOW
Now we have a very versatile tool that allows us to
monetize the data in S3
LIFE WITH DATABRICKS
• If you are familiar with Python, then Databricks and
PySpark unlock a huge range of capabilities.
• We are much more productive.
• Our jobs are fun again! PySpark RDD APIs make it
easy to work with big data in an accessible way.
