Whether you’re processing IoT data from millions of sensors or building a recommendation engine to provide a more engaging customer experience, the ability to derive actionable insights from massive volumes of diverse data is critical to success. MediaMath, a leading adtech company, relies on Apache Spark to process billions of data points spanning ads, user cookies, impressions, clicks, and more, translating to several terabytes of data per day. To support the needs of its data science teams, data engineering must build data pipelines for both ETL and feature engineering that are scalable, performant, and reliable.
Join this webinar to learn how MediaMath leverages Databricks to simplify mission-critical data engineering tasks that surface data directly to clients and drive actionable business outcomes. This webinar will cover:
- Transforming TBs of data with RDDs and PySpark responsibly
- Using the JDBC connector to write results to production databases seamlessly
- Comparisons with a similar approach using Hive
Accelerating Data Science with Better Data Engineering on Databricks
2. WHAT IS MEDIAMATH?
• MediaMath is a demand-side media buying platform.
• We bid on ad opportunities from exchanges like Google and Facebook and serve our clients' ads in those spots.
• We leverage any information we can use in order to increase the performance of ads and target more efficiently.
3. WHAT IS ANALYTICS AT MEDIAMATH?
• A bunch of wannabe Data Scientists turned Data Engineers
• We had a ton of data and good ideas, but were limited by computational capabilities
• We learned as we went; Databricks accelerated this journey
4. WE NEED TO PROCESS TERABYTES OF DATA WRITTEN TO S3 EVERY DAY IN ORDER TO:
• Build new features for models
• Build new reporting for clients
• Set up internal data pipelines
8. THE INDEX FOR SITE P
• A measure of how many users of a certain type we saw vs. how many we expected to see
• To find the index we need to compute the size of four groups of users (see the sketch below)
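The deck does not spell out the formula, but a common "lift"-style index built from four group sizes works like this: the share of site P's users who belong to segment S, divided by the share we would expect if the site drew users at random. A minimal sketch, with all numbers illustrative:

```python
# A hedged sketch of the index computation; the exact formula is an
# assumption. The four group sizes: users on site P who are in segment S,
# all users on site P, all users in segment S, and all users overall.

def site_index(n_site_and_segment, n_site, n_segment, n_total):
    """Observed share of segment-S users among site P's users, divided by
    the share we would expect if the site drew users at random."""
    observed = n_site_and_segment / n_site   # share of P's users in S
    expected = n_segment / n_total           # share of all users in S
    return observed / expected

# Example: 4,000 of site P's 20,000 users are in the segment, while the
# segment holds 100,000 of 1,000,000 total users.
print(site_index(4_000, 20_000, 100_000, 1_000_000))  # 2.0 -> 2x over-indexed
```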
9. THE RAW DATA IN S3
• The data is provided by our partners and is made available for our team in S3.
• It consists of at least one record for each user-segment combination per day.
• MASSIVE redundancy (all we care about is membership as of a certain time); see the collapsing sketch after the table below.
User ID (String) | Segment ID (Integer) | Unix Timestamp (Integer)
A                | 1                    | 1495129113
A                | 2                    | 1495129245
A                | 2                    | 1495129250
B                | 1                    | 1495129245
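Since all we care about is membership as of a certain time, the duplicate rows can be collapsed to one record per user/segment pair, keeping only the earliest and latest timestamps. A minimal sketch, assuming a tab-delimited layout and a hypothetical S3 path (sc is the SparkContext that Databricks provides):

```python
# Collapse the redundant raw rows to one record per (user, segment) pair.
# The path and the tab-delimited layout are assumptions for illustration.
raw = sc.textFile("s3://bucket/segments/2017-05-18/")

def parse(line):
    user_id, segment_id, ts = line.split("\t")
    return ((user_id, int(segment_id)), (int(ts), int(ts)))

membership = (raw.map(parse)
                 .reduceByKey(lambda a, b: (min(a[0], b[0]),
                                            max(a[1], b[1]))))
# e.g. (('A', 2), (1495129245, 1495129250)) -- one row instead of many
```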
12. WE STARTED BY TRYING HIVE. WE STRUGGLED.
• There is a lot of data: the Segments table has about 3 trillion rows; the Pixel table has 16 billion.
• Naively joining and aggregating with Hive is the worst way to do it
• The data can be transformed into a manageable format, but one that is awkward to express with SQL
13. LIFE BEFORE DATABRICKS (HIVE)
• One row per user/segment joined to one row per user/pixel on userID
• Ran once a week and took days to complete on a cluster of 65 M4.2xl nodes
• Had to take care: one M/R job would write > 1TB to HDFS
• The join, and the shuffle after it, were killing us
14. KEY:VALUE FORMAT (UDB)
• One record per user - the key
• The value is a Python dictionary of segment information (max/min timestamps)
• Much wider than the original but far fewer records (on the order of 200x fewer)
• Users expire out of UDB after not being loaded to a segment for 30 days
• Persist to S3 as sequence files bucketed by key (see the sketch below)
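A hedged sketch of building this layout with combineByKey, starting from the collapsed per-(user, segment) records in the earlier sketch. The partition count, the S3 path, and the JSON serialization of the dictionary values are assumptions:

```python
import json

def seed(sv):
    # First (segment, (lo, hi)) seen for a user starts the dictionary.
    seg, (lo, hi) = sv
    return {seg: (lo, hi)}

def add(d, sv):
    # Fold another segment observation into this user's dictionary.
    seg, (lo, hi) = sv
    if seg in d:
        old_lo, old_hi = d[seg]
        d[seg] = (min(old_lo, lo), max(old_hi, hi))
    else:
        d[seg] = (lo, hi)
    return d

def merge(d1, d2):
    # Merge two partial dictionaries for the same user.
    for seg, v in d2.items():
        add(d1, (seg, v))
    return d1

udb = (membership                                       # ((user, seg), (lo, hi))
       .map(lambda kv: (kv[0][0], (kv[0][1], kv[1])))   # (user, (seg, (lo, hi)))
       .combineByKey(seed, add, merge)                  # (user, {seg: (lo, hi)})
       .partitionBy(256))                               # bucket by key; count assumed

# Serialize the dict values so they persist cleanly in sequence files.
udb.mapValues(json.dumps).saveAsSequenceFile("s3://bucket/udb/2017-05-18/")
```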
16. WHY NOT USE DATAFRAMES?
• The nesting of the records makes it awkward to deal with using SQL
• flatMap() and combineByKey() are the main drivers of the increased performance
17. THE MAIN DRAWBACK: MAINTAINING UDB
• Add new users
• Expire old users
• Update existing users
• Hard, but worth it if we run this multiple times
• All of this logic is conveniently expressed with Python (see the sketch below)
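One way to express all three maintenance steps is a fullOuterJoin of the existing UDB against the day's fresh records, followed by a filter for the 30-day expiry. A minimal sketch, reusing the merge helper from the build sketch above; the function names and cutoff plumbing are assumptions:

```python
THIRTY_DAYS = 30 * 24 * 3600

def maintain(old_udb, todays_users, now_ts):
    """old_udb and todays_users are both (user, {seg: (lo, hi)}) pair RDDs."""
    def combine(pair):
        old, new = pair
        if old is None:           # add new users
            return new
        if new is None:           # carry existing users forward
            return old
        return merge(old, new)    # update existing users

    merged = old_udb.fullOuterJoin(todays_users).mapValues(combine)

    def still_active(segments):
        # Expire users not loaded to any segment in the last 30 days.
        newest = max(hi for (lo, hi) in segments.values())
        return now_ts - newest <= THIRTY_DAYS

    return merged.filter(lambda kv: still_active(kv[1]))
```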
18. THE JOIN
• Super easy, as the data is already in pair RDDs (see the sketch below)
• Since the records are wide, this step is not as painful as it used to be
• The data must be shuffled (this sucks). It's already in S3 clustered by user (the join key), but Spark doesn't know that and shuffles anyway
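Because both sides are pair RDDs keyed by user ID, the join itself is one line. A sketch, where pixels is a hypothetical (user, {pixel, ...}) pair RDD and the partition count is an assumption; pre-partitioning both sides with the same partitioner lets Spark avoid re-shuffling on subsequent co-partitioned operations, though data loaded fresh from S3 still has to shuffle once:

```python
part = 256  # partition count is an assumption

udb_by_user = udb.partitionBy(part)         # (user, {seg: (lo, hi)})
pixels_by_user = pixels.partitionBy(part)   # (user, {pixel, ...}), hypothetical

joined = udb_by_user.join(pixels_by_user)   # (user, (segments, pixels))
```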
20. COUNTING RESULTS IS NOT EASY
• Each record represents a unique user. We just need to count up how many pixel/segment pairs we see across all records.
• Initially we tried exploding on pixel/segment and converting to a DataFrame
• The records are so heavily nested that fully exploding causes spill to disk. We were filling up the 100GB EBS volumes attached to the nodes. Our production cluster uses 65 nodes.
• Skew was not an issue in this case
21. EXPLODE, BUT CAREFULLY
• flatMap - create one row per pixel for each user:
  (pixelA, {seg1, seg2, …}),  # from user 1
  (pixelB, {seg1, seg3, …}),  # from user 1
  (pixelA, {seg1, seg3, …}),  # from user 2
• combineByKey - keep a running tally of how many segments are seen by each pixel. Since the data is first combined by pixel on the nodes, the shuffle stage has far less data to deal with:
  (pixelA, {seg1: 2, seg2: 1, seg3: 1, …}),
  (pixelB, {seg1: 1, seg3: 1, …})
• flatMap again - make the final dataset with one row per pixel/segment combination:
  (pixelA, seg1, 2),
  (pixelA, seg2, 1)
• Much better. Now nothing spills to disk, and it takes about half the time. A runnable sketch follows below.
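A runnable sketch of the three steps above, applied to the joined (user, (segments, pixels)) records from the previous slide; the helper names are ours:

```python
from collections import Counter

def per_pixel(record):
    # flatMap step 1: one (pixel, segment_ids) row per pixel per user.
    user, (segments, pixels) = record
    seg_ids = set(segments)
    for pixel in pixels:
        yield (pixel, seg_ids)

def tally(seg_ids):
    # Start a per-pixel tally from the first user's segments.
    return Counter(seg_ids)

def add_to_tally(counter, seg_ids):
    # Fold another user's segments into the running tally.
    counter.update(seg_ids)
    return counter

def merge_tallies(c1, c2):
    # Merge partial tallies from different partitions.
    c1.update(c2)
    return c1

counts = (joined
          .flatMap(per_pixel)
          .combineByKey(tally, add_to_tally, merge_tallies)
          .flatMap(lambda kv: ((kv[0], seg, n)        # (pixel, segment, count)
                               for seg, n in kv[1].items())))
```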
23. • Convert the aggregated RDD to a DataFrame, filter out garbage records, and use the JDBC connector:
  df.write.jdbc(jdbcURL, MyPostgresTable, mode='overwrite')
• Fair performance (1.6 hours for about 41MM rows), not affected by the presence of an index; the write takes just as long on an unindexed table
• Write to a staging table, then swap the staging and prod tables so the view that the app uses points to the refreshed data (see the sketch below)
• Careful indexing of the tables makes selects fast!
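A hedged sketch of the whole load: the table and column names, the garbage filter, and the psycopg2-based swap are all assumptions for illustration (the deck only says the staging and prod tables are swapped):

```python
import psycopg2

# Convert the counts RDD to a DataFrame and drop garbage records.
df = counts.toDF(["pixel_id", "segment_id", "user_count"])
clean = df.filter("pixel_id IS NOT NULL AND user_count > 0")

# Write to a staging table over JDBC (jdbcURL as in the slide above).
clean.write.jdbc(jdbcURL, "index_counts_staging", mode="overwrite")

# Swap staging and prod in one transaction so the app's view always
# points at complete data. Table names and dsn are assumptions.
swap_sql = """
ALTER TABLE index_counts RENAME TO index_counts_old;
ALTER TABLE index_counts_staging RENAME TO index_counts;
ALTER TABLE index_counts_old RENAME TO index_counts_staging;
"""
with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
    cur.execute(swap_sql)
```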
24. PUTTING IT ALL TOGETHER
• Wrap the logic, execution, and monitoring into objects/functions in a notebook
• Run the notebook as a job
• Schedule that job right from the Databricks UI
• Reporting, monitoring, and some retry logic come for free!
25. CONCLUSION: HIVE VS SPARK
• We cannot compare directly because the implementations are different
• The Spark implementation performs FAR better: a run time of approximately 11 hours vs. two days on similar hardware, roughly a quarter of the time and price!
• Development time is the real win. It's far faster and easier to develop new pipelines.
26. OUR CURRENT WORKFLOW
Now we have a very versatile tool that allows us to monetize the data in S3.
27. LIFE WITH DATABRICKS
• If you are familiar with Python, then Databricks and PySpark unlock a huge range of capabilities.
• We are much more productive.
• Our jobs are fun again! The PySpark RDD APIs make it easy to work with big data in an accessible way.