ACCELERATING DATA
SCIENCE WITH BETTER DATA
ENGINEERING ON DATABRICKS
Andrew Candela
WHAT IS MEDIAMATH?
• MediaMath is a demand-side media buying platform.
• We bid on ad opportunities from exchanges like
Google and Facebook and serve our clients’ ads in those
spots.
• We leverage any information we can use to increase
the performance of ads and target more efficiently.
WHAT IS ANALYTICS AT
MEDIAMATH?
• A bunch of wannabe Data Scientists turned Data
Engineers
• Had a ton of data and good ideas, but were limited
by computational capabilities
• Learned as we went. Databricks accelerated this
journey
WE NEED TO PROCESS TERABYTES OF
DATA WRITTEN TO S3 EVERY DAY IN
ORDER TO:
• Build new features for models
• Build new reporting for clients
• Set up internal data pipelines
THE VISION
Aggregate hundreds of TBs from S3 and write results to
PostgreSQL to get things like…
THE AUDIENCE INDEX REPORT (AIR)
Allows advertisers to gain insights into demographics of
users visiting their sites
BUILDING THE REPORT
THE INDEX FOR SITE P
• a measure of how many
users of a certain type
we saw vs. how many
we expected to see
• To find the index we
need to compute the
size of 4 groups of users
THE RAW DATA IN S3
• The data is provided by our
partners and is made available
for our team in S3.
• It consists of at least one
record for each user-segment
combination per day.
• MASSIVE redundancy (all we
care about is membership as
of a certain time)
User ID (String) | Segment ID (Integer) | Unix Timestamp (Integer)
A                | 1                    | 1495129113
A                | 2                    | 1495129245
A                | 2                    | 1495129250
B                | 1                    | 1495129245
This is really all we’re trying to do
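Collapsing that redundancy is the whole job: for each user/segment pair, all we need is membership as of a certain time, i.e. the min/max timestamps. A minimal pure-Python sketch of the reduction, using the sample rows from the table above (in Spark this would be a `map` to `(user, segment)` keys followed by `reduceByKey`; the stand-alone loop here is just illustrative):

```python
# Collapse redundant (user, segment, timestamp) records down to one
# record per user/segment pair, keeping only min and max timestamps.
raw = [
    ("A", 1, 1495129113),
    ("A", 2, 1495129245),
    ("A", 2, 1495129250),
    ("B", 1, 1495129245),
]

membership = {}
for user, seg, ts in raw:
    key = (user, seg)
    lo, hi = membership.get(key, (ts, ts))
    membership[key] = (min(lo, ts), max(hi, ts))

# The two ("A", 2) rows collapse into a single (min, max) entry.
```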
HAVEN’T YOU HEARD OF HIVE, SON?
WE STARTED BY TRYING HIVE.
WE STRUGGLED.
• There is a lot of data: Segments table has about 3
trillion rows. Pixel table has 16 billion.
• Naively joining and aggregating with Hive is the worst
way to do it
• Data can be transformed into a manageable format,
but one that is awkward to express with SQL
LIFE BEFORE DATABRICKS
(HIVE)
• One row per user/segment joined to one row per
user/pixel on userID
• Ran once a week and took days to complete on a cluster of
65 m4.2xlarge nodes.
• Had to be careful - a single MapReduce job would write > 1 TB
to HDFS
• The join and then the shuffle after the join was killing us
KEY:VALUE FORMAT (UDB)
• One record per user - the key
• Value is a Python dictionary of segment information (max/min
timestamps)
• Much wider than the original but far fewer records (roughly
200x fewer)
• Users expire out of UDB after not being loaded to a segment
for 30 days
• Persisted to S3 as sequence files bucketed by key
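The per-user dictionary can be built with `combineByKey`, whose three arguments are plain Python functions. A hedged sketch of what those functions might look like for the UDB layout described above (segment ID → (first_seen, last_seen)); the function names are illustrative, not from the deck:

```python
# Three functions in the shape combineByKey expects, building one
# dict per user mapping segment ID -> (min_ts, max_ts).

def create_combiner(seg_ts):
    """First (segment, ts) seen for a user starts the dict."""
    seg, ts = seg_ts
    return {seg: (ts, ts)}

def merge_value(acc, seg_ts):
    """Fold another (segment, ts) into a user's dict on the same node."""
    seg, ts = seg_ts
    lo, hi = acc.get(seg, (ts, ts))
    acc[seg] = (min(lo, ts), max(hi, ts))
    return acc

def merge_combiners(a, b):
    """Merge two partial dicts for the same user across partitions."""
    for seg, (lo, hi) in b.items():
        if seg in a:
            alo, ahi = a[seg]
            a[seg] = (min(alo, lo), max(ahi, hi))
        else:
            a[seg] = (lo, hi)
    return a

# With a Spark RDD of (user, (segment, ts)) pairs this would be:
# udb = raw_rdd.combineByKey(create_combiner, merge_value, merge_combiners)
```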
WHY NOT USE
DATAFRAMES?
• The nesting of the records makes it awkward to
deal with using SQL
• flatMap() and combineByKey() are the main drivers
of increased performance
THE MAIN DRAWBACK:
MAINTAINING UDB
• Add new users
• Expire old users
• Update existing users
• Hard, but worth it if we run this multiple times
• All this logic is conveniently expressed with Python
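The maintenance step boils down to a full outer join of yesterday's UDB against today's records, merging the dictionaries and dropping stale users. A sketch of the two per-user functions such a job might use (names and toy values are illustrative; the 30-day cutoff is from the slide):

```python
# Daily UDB maintenance: merge today's updates into each user's
# segment dict, and expire users with no activity in 30 days.

EXPIRY_SECONDS = 30 * 24 * 3600

def merge_user(old, new):
    """Combine an existing user's segment dict with today's updates.
    Either side may be None, as in a full outer join."""
    if old is None:
        return new          # brand-new user
    if new is None:
        return old          # not seen today; may expire later
    merged = dict(old)
    for seg, (lo, hi) in new.items():
        if seg in merged:
            mlo, mhi = merged[seg]
            merged[seg] = (min(mlo, lo), max(mhi, hi))
        else:
            merged[seg] = (lo, hi)
    return merged

def is_active(segments, now):
    """Keep the user only if some segment was seen within 30 days."""
    return any(hi >= now - EXPIRY_SECONDS for _, hi in segments.values())

# In Spark, roughly:
# udb.fullOuterJoin(today).mapValues(lambda p: merge_user(*p)) \
#    .filter(lambda kv: is_active(kv[1], now))
```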
THE JOIN
• Super easy as the data is already in pair RDDs.
• Since the records are wide this step is not as
painful as it used to be
• Data must be shuffled (this sucks). It’s already in S3
clustered by user (the join key), but Spark doesn’t
know that and shuffles anyway
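Because both datasets are pair RDDs keyed by user ID, the join itself is a one-liner (`segments_rdd.join(pixels_rdd)` or similar). A toy simulation with plain dicts, just to show the shape of the result (names and values are illustrative):

```python
# Inner join of two user-keyed datasets, simulated with dicts.
# In Spark this is a single pair-RDD join; Spark still shuffles
# because it can't see that the S3 files are bucketed by user.
segments = {"A": {1: (10, 20)}, "B": {1: (15, 15)}}
pixels   = {"A": {"pixelA": 30}, "C": {"pixelB": 40}}

# Keep only users present in both datasets, pairing their values.
joined = {u: (segments[u], pixels[u]) for u in segments.keys() & pixels.keys()}
```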
AGGREGATING THE DATA
COUNTING RESULTS IS NOT
EASY
• Each record represents a unique user. Just need to count up how
many pixel/segment pairs I see across all records.
• Initially tried exploding on pixel/segment and converting to a
dataframe
• Records are so heavily nested that fully exploding causes spill to
disk. We were filling up 100GB EBS volumes attached to the
nodes. Our production cluster uses 65 nodes.
• Skew was not an issue in this case
EXPLODE, BUT CAREFULLY
• flatMap - create one row per pixel for each user
(pixelA, {seg1, segA, …}), # from user 1
(pixelB, {seg1, segB, …}), # from user 1
(pixelA, {seg1, seg3, …}), # from user 2
• combineByKey - keep a running tally of how many segments are seen by each pixel. Since data is first
combined by pixel on the nodes, the shuffle stage has far less data to deal with
(pixelA,{seg1:2, seg2:1, seg3:1, …}),
(pixelB,{seg1:1, seg3:1, …})
• flatMap again - make the final dataset with one row per pixel/segment combination
(pixelA, seg1, 2),
(pixelA, seg2, 1)
• Much better. Now nothing spills to disk, takes about half the time.
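The three steps above can be sketched in pure Python; the per-record logic is what Spark would run distributed via `flatMap` and `combineByKey` (toy data matches the slide; the tally here stands in for the combine functions):

```python
# flatMap -> combineByKey -> flatMap, simulated locally.

# Step 1 output: one (pixel, segment_set) row per pixel per user.
rows = [
    ("pixelA", {"seg1", "segA"}),  # from user 1
    ("pixelB", {"seg1", "segB"}),  # from user 1
    ("pixelA", {"seg1", "seg3"}),  # from user 2
]

def tally(rows):
    """Step 2: running tally of segment counts per pixel. In Spark,
    combineByKey does this map-side first, so far less data is shuffled."""
    counts = {}
    for pixel, segs in rows:
        acc = counts.setdefault(pixel, {})
        for seg in segs:
            acc[seg] = acc.get(seg, 0) + 1
    return counts

def flatten(counts):
    """Step 3: one (pixel, segment, count) row per combination."""
    for pixel, segs in counts.items():
        for seg, n in segs.items():
            yield (pixel, seg, n)
```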
WRITING THE DATA:
JDBC
• Convert the aggregated RDD to a DataFrame. Filter out garbage
records and use the JDBC connector
• df.write.jdbc(jdbcURL, MyPostgresTable, mode='overwrite')
• Fair performance (1.6 hours for about 41MM rows) - not
affected by the presence of an index; the write takes just as
long on an unindexed table
• Write to a staging table, then swap the staging and prod
tables so the view that the app uses points to the refreshed
data
• Careful indexing of tables makes selects fast!
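The staging-then-swap pattern above might look something like this; the table names are made up for illustration, and the swap relies on PostgreSQL's transactional DDL so the app's view never sees a half-loaded table:

```python
# Write to a staging table, then atomically swap staging and prod.

def swap_statements(prod="air_report", staging="air_report_staging"):
    """PostgreSQL statements that promote the freshly loaded staging
    table to production inside one transaction."""
    return [
        "BEGIN;",
        f"ALTER TABLE {prod} RENAME TO {prod}_old;",
        f"ALTER TABLE {staging} RENAME TO {prod};",
        f"ALTER TABLE {prod}_old RENAME TO {staging};",
        "COMMIT;",
    ]

# The Spark side (not run here) loads the staging table first:
# df.write.jdbc(jdbcURL, "air_report_staging", mode="overwrite")
```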
PUTTING IT ALL TOGETHER
• Wrap logic, execution and monitoring into
objects/functions in a notebook
• Run notebook as a job
• Schedule that job right from the Databricks UI.
• Reporting, monitoring and some retry logic comes for
free!
CONCLUSION: HIVE VS SPARK
• Cannot compare directly because the implementations are
different
• The Spark implementation performs FAR better: run time of
approximately 11 hours vs. two days on similar
hardware. 1/4 the time and price!
• Development time is the real win. It’s far faster and
easier to develop new pipelines
OUR CURRENT WORKFLOW
Now we have a very versatile tool that allows us to
monetize the data in S3
LIFE WITH DATABRICKS
• If you are familiar with Python, then Databricks and
PySpark unlock a huge range of capabilities.
• We are much more productive.
• Our jobs are fun again! PySpark RDD APIs make it
easy to work with big data in an accessible way.
