The future is open
@ItaiYaffe, @ettigur
Digital advertising - a multi-billion dollar industry
● In 2019, internet advertising spending worldwide was over
$290B (statista.com, June 2020)
● Apple spent over $110M on iPhone & TV+ advertising during
September and October 2019 (9To5Mac.com, November 2019)
● GM is shifting ‘SIGNIFICANT’ dollars to connected TV advertising
(AdAge.com, January 2020)
$$$ spent each year
on digital advertising campaigns
What does a funnel look like?
AD EXPOSURE: 100M
  85M drop-off
HOMEPAGE: 15M
  5M drop-off
PRODUCT PAGE: 10M
So everybody wants to measure their campaigns’ efficiency!
But how???
Funnel Analysis with
Apache Spark and Druid
Etti Gur, Nielsen
Itai Yaffe, Imply
Introduction
Etti Gur (@ettigur)
● Senior Big Data Engineer @ Nielsen
● Building data pipelines using Spark, Kafka, Druid, Airflow and more
Itai Yaffe (@ItaiYaffe)
● Principal Solutions Architect @ Imply
● Prev. Big Data Tech Lead @ Nielsen
● Dealing with Big Data challenges since 2012
Nielsen Identity
● Data and Measurement company
● Media consumption
● Single source of truth of individuals and households
○ Unifies many proprietary datasets
○ Generates holistic view of a consumer
@ItaiYaffe, @ettigur
Nielsen Identity in numbers
>10B events/day 60TB/day
S3
6000 nodes/day
Tens of TB ingested/day (Druid)
The challenges
● Scalability
● Cost Efficiency
● Fault-tolerance
What will you learn?
How to overcome the technical challenges of Funnel Analysis using Apache Spark, Druid and DataSketches, and why you should even care
Campaign phases - user’s point-of-view
● Awareness - exposed to campaign (e.g. via online ad)
● Consideration - interest is expressed (e.g. clicked ad)
● Intent - steps taken towards making a purchase (e.g. added product to cart)
● Purchase
Tactic Stages
Campaign phases - campaign owner’s point-of-view
Awareness → Consideration → Intent → Purchase, with a drop-off between each pair of consecutive phases
Campaign phases - why is it called “a funnel”?
AD EXPOSURE: 100M UUs
  85M drop-off
HOMEPAGE: 15M UUs
  5M drop-off
PRODUCT PAGE: 10M UUs
  7M drop-off
CHECKOUT: 3M UUs
* UUs = Unique Users
We need to analyze the funnel, hence: “Funnel Analysis”
@ItaiYaffe, @ettigur
Views vs Unique Users
2 Unique Users
7 Views
2 Purchases $$$ $$$
But how can one measure campaign efficiency?
● Collect a huge stream of events (i.e. user activities) while the campaign is live
● Map events to funnel stages
○ E.g. ad exposure = tactic
● Provide insights quickly
So… what’s wrong with off-the-shelf alternatives?
Topic                       Off-the-shelf alternatives
Scalability                 Limited
Access to raw data          Lack access
Count-distinct operations   Very slow
* Based on tinyurl.com/qqza5ur
Introducing: Apache Druid
Why is it cool?
● Store trillions of events, petabytes of data
● Sub-second analytic queries
● Highly scalable
● Cost effective
● Decoupled architecture
○ E.g. ingestion is separated from query
Roll-up - Simple Count (Views)
Raw events:
Timestamp    Website     Device ID
2021-05-26   www.a.com   3a4c1f2d84a5c179435c1fea86e6ae02
2021-05-26   www.a.com   3a4c1f2d84a5c179435c1fea86e6ae02
2021-05-26   www.a.com   5dd59f9bd068f802a7c6dd832bf60d02
2021-05-26   www.b.com   5dd59f9bd068f802a7c6dd832bf60d02
2021-05-26   www.c.com   5dd59f9bd068f802a7c6dd832bf60d02
After roll-up (LongSumAggregator):
Timestamp    Website     Views
2021-05-26   www.a.com   3
2021-05-26   www.b.com   1
2021-05-26   www.c.com   1
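The roll-up above can be illustrated with a minimal Python sketch. It only mimics the effect of a simple count on the five sample rows; it is not how Druid implements roll-up internally, and the function name is made up for this example.

```python
from collections import Counter

# The five raw events from the slide: (timestamp, website, device ID).
raw_events = [
    ("2021-05-26", "www.a.com", "3a4c1f2d84a5c179435c1fea86e6ae02"),
    ("2021-05-26", "www.a.com", "3a4c1f2d84a5c179435c1fea86e6ae02"),
    ("2021-05-26", "www.a.com", "5dd59f9bd068f802a7c6dd832bf60d02"),
    ("2021-05-26", "www.b.com", "5dd59f9bd068f802a7c6dd832bf60d02"),
    ("2021-05-26", "www.c.com", "5dd59f9bd068f802a7c6dd832bf60d02"),
]

def rollup_views(events):
    """Group by (timestamp, website) and count rows; the device ID is dropped."""
    return Counter((timestamp, website) for timestamp, website, _device in events)

views = rollup_views(raw_events)
for (timestamp, website), count in sorted(views.items()):
    print(timestamp, website, count)
```

Note that the device ID is lost after this roll-up, which is why a plain count cannot answer "how many unique users".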
@ItaiYaffe, @ettigur
Druid architecture
@ItaiYaffe, @ettigur
Powered by Druid
@ItaiYaffe, @ettigur
Common use-cases for Druid
● Clickstream analytics
○ Funnel analysis
● Network performance monitoring
● Application performance management
● Supply chain analytics
○ Manufacturing (IoT and device) metrics
● BI and OLAP
● And more...
@ItaiYaffe, @ettigur
Druid in a nutshell
● A real-time analytics database
○ Time-series, columnar
● Can ingest and store trillions of events, and serve analytic queries in sub-second time
● Highly scalable, cost-effective
● Widely used among Big Data companies for:
○ Application performance management
○ Clickstream analytics and funnel analysis
○ And more
Why is Druid suitable for the task?
Topic                       Off-the-shelf alternatives   Druid
Scalability                 Limited                      Highly scalable
Access to raw data          Lack access                  Can store trillions of events
Count-distinct operations   Very slow                    Sub-second approximate count distinct with set operations, using the Theta Sketch module
* Based on tinyurl.com/qqza5ur
Theta Sketch???
What is Theta Sketch?
● The Theta Sketch mathematical framework is a generalization of K Minimum Values (KMV)
● Estimate set cardinality
● Supports set-theoretic operations
Theta Sketch error
Error as function of K
Number of Std Dev      1        2
Confidence Interval    68.27%   95.45%
K = 16,384             0.78%    1.56%
K = 32,768             0.55%    1.10%
K = 65,536             0.39%    0.78%
* Larger K = more memory & storage needed
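The figures in the error table are consistent with the common rule of thumb that the relative standard error of such a sketch is roughly 1/sqrt(K). The short Python sketch below reproduces the table under that approximation; it is not the library's exact error-bound computation, and the function name is hypothetical.

```python
import math

def theta_rse(k: int, num_std_dev: int = 1) -> float:
    """Approximate relative error for nominal sketch size k (rule of thumb: 1/sqrt(k))."""
    return num_std_dev / math.sqrt(k)

for k in (16_384, 32_768, 65_536):
    print(f"K={k:>6}: {theta_rse(k):.2%} (1 std dev), {theta_rse(k, 2):.2%} (2 std dev)")
```

Quadrupling K halves the error, which is the trade-off behind the "more memory & storage" footnote.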
Theta Sketch demo
tinyurl.com/ugk6p67
@ItaiYaffe, @ettigur
The Theta Sketch module in Druid
● Part of the Apache DataSketches library (datasketches.apache.org)
● At ingestion time
○ Sketches are created and stored in Druid segments
● At query time
○ Sketches are aggregated (i.e. union, intersection or difference between sketches)
○ The result - estimated number of unique entries in the aggregated sketch
● Also see this short video - tinyurl.com/vdwojh6
Roll-up - Count Distinct (Unique Users)
Raw events:
Timestamp    Website     Device ID
2021-05-26   www.a.com   3a4c1f2d84a5c179435c1fea86e6ae02
2021-05-26   www.a.com   3a4c1f2d84a5c179435c1fea86e6ae02
2021-05-26   www.a.com   5dd59f9bd068f802a7c6dd832bf60d02
2021-05-26   www.b.com   5dd59f9bd068f802a7c6dd832bf60d02
2021-05-26   www.c.com   5dd59f9bd068f802a7c6dd832bf60d02
After roll-up (ThetaSketchAggregator):
Timestamp    Website     Unique Users*
2021-05-26   www.a.com   2*
2021-05-26   www.b.com   1*
2021-05-26   www.c.com   1*
* What is actually stored is a ThetaSketch object. The actual result is calculated in real-time, which allows us to do UNIONs and INTERSECTIONs
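The difference from the simple count can be sketched in Python by using a plain set as a stand-in for the ThetaSketch object. This is only an analogy: a set is exact and grows with the data, whereas a sketch is approximate and bounded in size. The function name is made up for this example.

```python
from collections import defaultdict

# The five raw events from the slide: (timestamp, website, device ID).
raw_events = [
    ("2021-05-26", "www.a.com", "3a4c1f2d84a5c179435c1fea86e6ae02"),
    ("2021-05-26", "www.a.com", "3a4c1f2d84a5c179435c1fea86e6ae02"),
    ("2021-05-26", "www.a.com", "5dd59f9bd068f802a7c6dd832bf60d02"),
    ("2021-05-26", "www.b.com", "5dd59f9bd068f802a7c6dd832bf60d02"),
    ("2021-05-26", "www.c.com", "5dd59f9bd068f802a7c6dd832bf60d02"),
]

def rollup_unique_users(events):
    """Group by (timestamp, website) and collect device IDs into a set per row."""
    rolled = defaultdict(set)
    for timestamp, website, device in events:
        rolled[(timestamp, website)].add(device)
    return dict(rolled)

rolled = rollup_unique_users(raw_events)
site_a = rolled[("2021-05-26", "www.a.com")]
site_b = rolled[("2021-05-26", "www.b.com")]

unique_a = len(site_a)           # unique users on www.a.com
a_or_b = len(site_a | site_b)    # UNION: users seen on either site
a_and_b = len(site_a & site_b)   # INTERSECTION: users seen on both sites
print(unique_a, a_or_b, a_and_b)
```

Because the stored object still represents the set of users, the count is computed only at query time, after whatever unions or intersections the query asks for.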
@ItaiYaffe, @ettigur
Cool, so… Back to funnel analysis?
@ItaiYaffe, @ettigur
Funnel analysis - simple use-case
How many unique users viewed the online ad?
VS
How many unique users viewed the online ad AND viewed the product X page?
Funnel analysis pipeline - high-level architecture
1. Mart Generator reads the files of the last day from the Data Lake
2. Mart Generator writes files by campaign,date to the campaigns' marts
3. Enricher reads files per campaign from the campaigns' marts
4. Enricher writes files by date,campaign to the enriched data
5. Data is loaded by campaign
Funnel analysis pipeline - Data Lake
Partitions: date=2021-05-24, date=2021-05-25, date=2021-05-26
Sample events:
{event_time=2021-05-26T..., userid=uid1, attribute=online_ad}
{event_time=2021-05-26T..., userid=uid1, attribute=homepage}
{event_time=2021-05-26T..., userid=uid1, attribute=productX_page}
....
Funnel analysis pipeline - Mart Generator
Partitions: campaign=1210, campaign=1319, campaign=1472; date=2021-05-24, date=2021-05-25, date=2021-05-26
Sample events:
{event_time=2021-05-26T..., userid=uid1, attribute=online_ad, type=Tactic}
{event_time=2021-05-26T..., userid=uid1, attribute=homepage, type=Stage}
{event_time=2021-05-26T..., userid=uid1, attribute=productX_page, type=Stage}
....
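The tagging step shown above might look roughly like the following Python sketch. The talk's Mart Generator is a Spark job whose code is not shown; the campaign definition structure, the function name and the concrete event times here are all hypothetical and only illustrate the idea of labelling each event as a Tactic or a Stage for a given campaign.

```python
# Hypothetical campaign definition: which attributes count as tactics or stages.
CAMPAIGNS = {
    1472: {"tactics": {"online_ad"}, "stages": {"homepage", "productX_page"}},
}

def generate_mart(events, campaign_id, campaigns=CAMPAIGNS):
    """Keep only events relevant to the campaign and tag each as Tactic or Stage."""
    definition = campaigns[campaign_id]
    mart = []
    for event in events:
        attribute = event["attribute"]
        if attribute in definition["tactics"]:
            mart.append({**event, "type": "Tactic"})
        elif attribute in definition["stages"]:
            mart.append({**event, "type": "Stage"})
        # Events unrelated to this campaign are skipped.
    return mart

# Illustrative events; the times are made up for this example.
data_lake_events = [
    {"event_time": "2021-05-26T10:10", "userid": "uid1", "attribute": "online_ad"},
    {"event_time": "2021-05-26T10:11", "userid": "uid1", "attribute": "homepage"},
    {"event_time": "2021-05-26T10:12", "userid": "uid1", "attribute": "unrelated_site"},
]
mart = generate_mart(data_lake_events, 1472)
print(mart)
```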
Funnel analysis pipeline - Enricher
Partitions: campaign=1210, campaign=1319, campaign=1472; date=2021-05-24, date=2021-05-25, date=2021-05-26
Sample events:
{event_date=2021-05-26, userid=uid1, tactic=online_ad, stage=homepage}
{event_date=2021-05-26, userid=uid1, tactic=online_ad, stage=productX_page}
....
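One plausible reading of the enrichment step above, sketched in Python: each of a user's stage events is paired with that user's tactic, producing one flat record per (date, user, tactic, stage). Pairing within the same day and user is a simplifying assumption made for this example, as are the function name and the event times; the actual Spark job may differ.

```python
from collections import defaultdict

def enrich(mart_events):
    """Emit one record per (date, user, tactic, stage), ignoring event order."""
    tactics = defaultdict(set)
    stages = defaultdict(set)
    for event in mart_events:
        key = (event["event_time"][:10], event["userid"])  # (date, user)
        if event["type"] == "Tactic":
            tactics[key].add(event["attribute"])
        else:
            stages[key].add(event["attribute"])

    enriched = []
    for (event_date, userid), user_tactics in sorted(tactics.items()):
        for tactic in sorted(user_tactics):
            for stage in sorted(stages.get((event_date, userid), ())):
                enriched.append({"event_date": event_date, "userid": userid,
                                 "tactic": tactic, "stage": stage})
    return enriched

# Illustrative mart events; the times are made up for this example.
mart_events = [
    {"event_time": "2021-05-26T10:10", "userid": "uid1", "attribute": "online_ad", "type": "Tactic"},
    {"event_time": "2021-05-26T10:11", "userid": "uid1", "attribute": "homepage", "type": "Stage"},
    {"event_time": "2021-05-26T10:15", "userid": "uid1", "attribute": "productX_page", "type": "Stage"},
]
enriched = enrich(mart_events)
print(enriched)
```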
Funnel analysis pipeline - ingesting data into Druid
"type": "index_hadoop",
"spec": {
  "dataSchema": {
    "dataSource": "campaign_1472",
    "granularitySpec": {
      "queryGranularity": "day",
      "segmentGranularity": "day",
      "type": "uniform",
      "intervals": ["2021-05-01/2021-05-27"]
      ...
Funnel analysis pipeline - ingesting data into Druid
"timestampSpec": {
  "column": "event_date",
  "format": "yyyy-MM-dd"
},
"dimensionsSpec": {
  "dimensions": ["tactic", "stage"]
},
"metricsSpec": [{
  "fieldName": "userid",
  "type": "thetaSketch",
  "name": "user_id_sketch",
  "size": 65536}],
...
Funnel analysis pipeline - ingesting data into Druid
"inputSpec": {
  "type": "multi",
  "children": [
    {"type": "dataSource",
     "ingestionSpec": {
       "intervals": ["2021-05-01/2021-05-27"],
       "dataSource": "campaign_1472", ...}},
    {"type": "static",
     "paths": "s3://<BUCKET_NAME>/date=2021-05-26/campaign=1472",
     ...},
    ...
Funnel analysis pipeline - Druid datasources
Datasources: campaign_1210, campaign_1319, campaign_1472
Sample rows:
{__time=2021-05-26, tactic=online_ad, stage=homepage, user_id_sketch=<Object>}
{__time=2021-05-26, tactic=online_ad, stage=productX_page, user_id_sketch=<Object>}
....
Funnel analysis pipeline - querying Druid (SQL)
SELECT
  APPROX_COUNT_DISTINCT_DS_THETA(user_id_sketch, 65536) AS homepage_sketch
FROM campaign_1472
WHERE (("tactic" = 'online_ad') AND ("stage" = 'homepage'))
  AND __time BETWEEN '2021-05-01T00:00:00.000' AND '2021-05-26T23:59:59.000'
* This specific query returns the estimated number of unique users
that viewed the online ad AND viewed the homepage
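A query like the one above is typically sent to Druid's SQL HTTP API (a POST to /druid/v2/sql with a JSON body). The Python sketch below only builds that request body and does not contact a cluster. The helper name, the use of dynamic parameters with TIME_PARSE, and the `unique_users` alias are choices made for this example, not something shown in the talk.

```python
import json

DRUID_SQL_PATH = "/druid/v2/sql"  # SQL API path; host and port depend on your cluster

def build_stage_query(datasource, tactic, stage, start, end):
    """Build the JSON body for a Druid SQL request counting unique users of one stage."""
    # The datasource name cannot be a query parameter, so validate it before inlining.
    if not datasource.replace("_", "").isalnum():
        raise ValueError(f"unexpected datasource name: {datasource!r}")
    sql = (
        "SELECT APPROX_COUNT_DISTINCT_DS_THETA(user_id_sketch, 65536) AS unique_users "
        f'FROM "{datasource}" '
        'WHERE "tactic" = ? AND "stage" = ? '
        "AND __time BETWEEN TIME_PARSE(?) AND TIME_PARSE(?)"
    )
    values = [tactic, stage, start, end]
    return {
        "query": sql,
        "parameters": [{"type": "VARCHAR", "value": value} for value in values],
    }

payload = build_stage_query(
    "campaign_1472", "online_ad", "homepage",
    "2021-05-01T00:00:00", "2021-05-26T23:59:59",
)
print(json.dumps(payload, indent=2))
```

Running the same helper once per stage gives the per-stage unique-user counts that make up the funnel.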
Funnel analysis - simple use-case revisited
5/1/2021 - 5/26/2021
3,100 - 2,500 != 1,000
ONLINE AD: 8.1M UUs
HOMEPAGE: 3.1K UUs
  2.5K drop-off
PRODUCT PAGE: 1K UUs
...
* UUs = Unique Users
Funnel analysis - complex use-case
How many unique users viewed the online ad?
VS
How many unique users viewed the online ad FIRST and THEN viewed the product X page?
@ItaiYaffe, @ettigur
Funnel analysis - complex use-case
● This is what we call a sequential funnel
○ Chronological order of events is important
● The data pipeline is very similar, but…
○ Taking into account only events that happened in the pre-defined order
of the funnel
● That way we better represent the efficiency of a specific tactic (i.e. advertisement)
Funnel analysis pipeline - Data Lake
Partitions: date=2021-05-24, date=2021-05-25, date=2021-05-26
Sample events:
{event_time=2021-05-26T09:15, userid=uid1, attribute=productX_page}
{event_time=2021-05-26T10:10, userid=uid1, attribute=online_ad}
{event_time=2021-05-26T10:11, userid=uid1, attribute=homepage}
....
Funnel analysis pipeline - Mart Generator
Partitions: campaign=1210, campaign=1319, campaign=1472; date=2021-05-24, date=2021-05-25, date=2021-05-26
Sample events:
{event_time=2021-05-26T09:15, userid=uid1, attribute=productX_page, type=Stage}
{event_time=2021-05-26T10:10, userid=uid1, attribute=online_ad, type=Tactic}
{event_time=2021-05-26T10:11, userid=uid1, attribute=homepage, type=Stage}
....
Funnel analysis pipeline - Enricher
Partitions: campaign=1210, campaign=1319, campaign=1472; date=2021-05-24, date=2021-05-25, date=2021-05-26
Without taking the order of events into account:
{event_date=2021-05-26, userid=uid1, tactic=online_ad, stage=productX_page}
{event_date=2021-05-26, userid=uid1, tactic=online_ad, stage=homepage}
....
Taking the order of events into account (the productX_page view at 09:15 happened before the online_ad exposure at 10:10, so it is dropped):
{event_date=2021-05-26, userid=uid1, tactic=online_ad, stage=homepage}
....
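The ordering rule above can be sketched in Python under its simplest interpretation: a stage event is kept only if it happened after the user's first exposure to the tactic. The function name is hypothetical, and the actual Spark job may enforce a stricter order across all stages of the funnel.

```python
def sequential_enrich(mart_events):
    """Keep only stage events that occurred after the user's first tactic exposure."""
    first_exposure = {}  # (userid, tactic) -> earliest event_time
    for event in mart_events:
        if event["type"] == "Tactic":
            key = (event["userid"], event["attribute"])
            if key not in first_exposure or event["event_time"] < first_exposure[key]:
                first_exposure[key] = event["event_time"]

    enriched = []
    for event in mart_events:
        if event["type"] != "Stage":
            continue
        for (userid, tactic), exposed_at in first_exposure.items():
            # ISO-8601 strings in the same format compare chronologically.
            if userid == event["userid"] and event["event_time"] > exposed_at:
                enriched.append({"event_date": event["event_time"][:10], "userid": userid,
                                 "tactic": tactic, "stage": event["attribute"]})
    return enriched

# The three sample events from the Mart Generator slide of the complex use-case.
mart_events = [
    {"event_time": "2021-05-26T09:15", "userid": "uid1", "attribute": "productX_page", "type": "Stage"},
    {"event_time": "2021-05-26T10:10", "userid": "uid1", "attribute": "online_ad", "type": "Tactic"},
    {"event_time": "2021-05-26T10:11", "userid": "uid1", "attribute": "homepage", "type": "Stage"},
]
enriched = sequential_enrich(mart_events)
print(enriched)
```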
Funnel analysis pipeline - querying Druid (SQL)
SELECT APPROX_COUNT_DISTINCT_DS_THETA(
         THETA_SKETCH_NOT(65536,
           THETA_SKETCH_INTERSECT(65536, a, b),
           THETA_SKETCH_UNION(65536, c, d, e))) AS dropoff_sketch
FROM (
  SELECT
    DS_THETA("user_id_sketch") FILTER (WHERE tactic = 'online_ad') AS a,
    DS_THETA("user_id_sketch") FILTER (WHERE stage = 'homepage') AS b,
    DS_THETA("user_id_sketch") FILTER (WHERE stage = 'productX_page') AS c,
    DS_THETA("user_id_sketch") FILTER (WHERE stage = 'add_to_cart') AS d,
    DS_THETA("user_id_sketch") FILTER (WHERE stage = 'checkout') AS e
  FROM campaign_1472
  WHERE stage IN ('homepage', 'productX_page', 'add_to_cart', 'checkout')
    AND tactic = 'online_ad'
    AND __time BETWEEN '2021-05-01T00:00:00.000' AND '2021-05-26T23:59:59.000'
) subquery
* This specific query should return the estimated number of unique
users for the drop-off between the homepage and product X page
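The set logic of the query above, NOT(INTERSECT(a, b), UNION(c, d, e)), can be illustrated with exact Python sets standing in for the sketches. The user IDs and the function name are made up for this example; Druid performs the same operations on Theta Sketches and returns an estimate rather than an exact count.

```python
def dropoff(a, b, later_stages):
    """Users in both a and b who appear in none of the later stages."""
    return (a & b) - set().union(*later_stages)

# Illustrative user IDs per filter, mirroring the aliases a..e in the query.
a = {"u1", "u2", "u3", "u4"}   # tactic = online_ad
b = {"u1", "u2", "u3"}         # stage = homepage
c = {"u1"}                     # stage = productX_page
d = set()                      # stage = add_to_cart
e = set()                      # stage = checkout

dropped = dropoff(a, b, [c, d, e])
print(sorted(dropped), len(dropped))
```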
Funnel analysis - complex use-case
5/1/2021 - 5/26/2021
3,100 - 2,500 = 600
ONLINE AD: 8.1M UUs
HOMEPAGE: 3.1K UUs
  2.5K drop-off
PRODUCT PAGE: 0.6K UUs
...
* UUs = Unique Users
@ItaiYaffe, @ettigur
A few tips
● Use Druid with Theta Sketch for fast approximate count distinct
○ Allows set operations (intersection/union/negation)
● Use Spark to pre-process incoming events
○ Allows you to take into account only events that happened in the
pre-defined order of the funnel
○ Check out Etti’s “Optimizing Spark-based data pipelines” talk
(video - tinyurl.com/7hvyxtc8, slides - tinyurl.com/3rvc9mus)
● Optimize your ingestion process
○ Write Theta Sketch objects from Spark app
○ Load to Druid using isInputThetaSketch=true flag
What have we learned?
● Funnel analysis
○ Very important for advertisers
○ Not easy to solve technically (especially if chronological order of events matters)
● Druid is a very powerful tool for real-time analytics
○ Highly scalable, can ingest and store trillions of events, and serve analytic queries in sub-second time
○ Used for many different use-cases
● Combining Apache Spark, Druid and DataSketches FTW!
○ Pre-process events before ingesting into Druid
○ Decide how to handle out-of-order events
Want to know more?
● Women in Big Data
○ A world-wide program that aims to inspire, connect, grow, and champion the success of women in the Big Data & analytics field
○ 30+ chapters and 17,000+ members world-wide
○ Everyone can join (regardless of gender), so find a chapter near you -
www.womeninbigdata.org/membership/
● Conference talks
○ Casting the Spell: Druid in Practice (Berlin Buzzwords, June 17th 2021) - tinyurl.com/559hufnj
○ Migrating Airflow-based Spark Jobs to K8s (Data+AI Summit Europe 2020) - tinyurl.com/cbm42mn8
● Our Tech Blog - medium.com/nmc-techblog
○ Data Retention and Deletion in Apache Druid - tinyurl.com/yymrvrn2
QUESTIONS
THANK YOU
Etti Gur
Itai Yaffe
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.

More Related Content

Funnel Analysis with Apache Spark and Druid

  • 2. @ItaiYaffe, @ettigur Digital advertising - a multi-billion dollar industry ● In 2019, internet advertising spending worldwide was over $290B(statista.com, June 2020)
  • 3. @ItaiYaffe, @ettigur Digital advertising - a multi-billion dollar industry ● In 2019, internet advertising spending worldwide was over $290B(statista.com, June 2020) ● Apple spent over $110M on iPhone & TV+ advertising during September and October 2019 (9To5Mac.com, November 2019)
  • 4. @ItaiYaffe, @ettigur Digital advertising - a multi-billion dollar industry ● In 2019, internet advertising spending worldwide was over $290B(statista.com, June 2020) ● Apple spent over $110M on iPhone & TV+ advertising during September and October 2019 (9To5Mac.com, November 2019) ● GM is shifting ‘SIGNIFICANT’ dollars to connected TV advertising (AdAge.com, January 2020)
  • 5. @ItaiYaffe, @ettigur Digital advertising - a multi-billion dollar industry ● In 2019, internet advertising spending worldwide was over $290B (statista.com, June 2020) ● Apple spent over $110M on iPhone & TV+ advertising during September and October 2019 (9To5Mac.com, November 2019) ● GM is shifting ‘SIGNIFICANT’ dollars to connected TV advertising (AdAge.com, January 2020) $$$ spent each year on digital advertising campaigns
  • 6. @ItaiYaffe, @ettigur What does a funnel look like? PRODUCT PAGE 10M HOMEPAGE 15M 5M Drop-off AD EXPOSURE 100M 85M Drop-off So everybody wants to measure their campaigns’ efficiency!
  • 7. @ItaiYaffe, @ettigur What does a funnel look like? PRODUCT PAGE 10M HOMEPAGE 15M 5M Drop-off AD EXPOSURE 100M 85M Drop-off So everybody wants to measure their campaigns’ efficiency! But how???
  • 8. Funnel Analysis with Apache Spark and Druid Etti Gur, Nielsen Itai Yaffe, Imply
  • 9. @ItaiYaffe, @ettigur Introduction Etti Gur ● Senior Big Data Engineer @ Nielsen ● Building data pipelines using Spark, Kafka, Druid, Airflow and more Etti Gur @ettigur Itai Yaffe ● Principal Solutions Architect @ Imply Prev. Big Data Tech Lead @ Nielsen ● Dealing with Big Data challenges since 2012 ● Itai Yaffe @ItaiYaffe
  • 10. @ItaiYaffe, @ettigur Nielsen Identity ● Data and Measurement company ● Media consumption ● Single source of truth of individuals and households ○ Unifies many proprietary datasets ○ Generates holistic view of a consumer
  • 11. @ItaiYaffe, @ettigur Nielsen Identity in numbers >10B events/day 60TB/day S3 6000 nodes/day 10’s of TB ingested/day druid
  • 13. @ItaiYaffe, @ettigur What you will learn? How to overcome the technical challenges of Funnel Analysis
  • 14. @ItaiYaffe, @ettigur What you will learn? How to overcome the technical challenges of Funnel Analysis using Apache Spark, Druid and DataSketches,
  • 15. @ItaiYaffe, @ettigur What you will learn? How to overcome the technical challenges of Funnel Analysis using Apache Spark, Druid and DataSketches, and why you should even care
  • 16. @ItaiYaffe, @ettigur Campaign phases - user’s point-of-view Awareness Exposed to campaign (e.g via online ad) Consideration Interest is expressed (e.g clicked ad) Intent Steps taken towards making a purchase (e.g added product to cart) Purchase
  • 17. @ItaiYaffe, @ettigur Campaign phases - user’s point-of-view Awareness Exposed to campaign (e.g via online ad) Consideration Interest is expressed (e.g clicked ad) Intent Steps taken towards making a purchase (e.g added product to cart) Purchase Tactic Stages
  • 18. @ItaiYaffe, @ettigur Campaign phases - campaign owner’s point-of-view Awareness Consideration Intent Purchase Drop- off Drop- off Drop- off
  • 19. @ItaiYaffe, @ettigur PRODUCT PAGE 10M UUs HOMEPAGE 15M UUs 7M Drop-off 5M Drop-off AD EXPOSURE 100M UUs 85M Drop-off Campaign phases - why is it called “a funnel”? * UUs = Unique Users CHECKOUT 3M UUs
  • 20. @ItaiYaffe, @ettigur PRODUCT PAGE 10M UUs HOMEPAGE 15M UUs 7M Drop-off 5M Drop-off AD EXPOSURE 100M UUs 85M Drop-off Campaign phases - why is it called “a funnel”? * UUs = Unique Users CHECKOUT 3M UUs We need to analyze the funnel, hence: “Funnel Analysis”
  • 21. @ItaiYaffe, @ettigur Views vs Unique Users 2 Unique Users 7 Views 2 Purchases $$$ $$$
  • 22. @ItaiYaffe, @ettigur Everybody wants to measure their campaigns’ efficiency! What does a funnel look like? PRODUCT PAGE 10M HOMEPAGE 15M 5M Drop-off AD EXPOSURE 100M 85M Drop-off
  • 23. @ItaiYaffe, @ettigur But how can one measure campaign efficiency? ● Collect a huge stream of events (i.e user activities) while the campaign is live ● Map events to funnel stages ○ E.g ad exposure = tactic ● Provide insights quickly
  • 24. @ItaiYaffe, @ettigur So… what’s wrong with off-the-shelf alternatives? Topic Off-the-shelf alternatives Scalability Limited Access to raw data Lack access Count-distinct operations Very slow * Based on tinyurl.com/qqza5ur
  • 26. @ItaiYaffe, @ettigur Why is it cool? ● Store trillions of events, petabytes of data ● Sub-second analytic queries ● Highly scalable ● Cost effective ● Decoupled architecture ○ E.g ingestion is separated from query
  • 27. @ItaiYaffe, @ettigur Roll-up - Simple Count (Views) LongSumAggregator 2021-05-26 Timestamp Website Device ID www.a.com 3a4c1f2d84a5c179435c1fea86e6ae02 2021-05-26 www.a.com 3a4c1f2d84a5c179435c1fea86e6ae02 2021-05-26 www.a.com 5dd59f9bd068f802a7c6dd832bf60d02 2021-05-26 www.b.com 5dd59f9bd068f802a7c6dd832bf60d02 2021-05-26 www.c.com 5dd59f9bd068f802a7c6dd832bf60d02 Timestamp Website Views 2021-05-26 2021-05-26 2021-05-26 www.a.com 3 1 1 www.b.com www.c.com
  • 30. @ItaiYaffe, @ettigur Common use-cases for Druid ● Clickstream analytics ○ Funnel analysis ● Network performance monitoring ● Application performance management ● Supply chain analytics ○ Manufacturing (IoT and device) metrics ● BI and OLAP ● And more...
  • 31. @ItaiYaffe, @ettigur Druid in a nutshell ● A real-time analytics database ○ Time-series, columnar ● Can ingest and store trillions of events, and serve analytic queries in sub-second ● Highly-scalable, cost-effective ● Widely used among Big Data companies for: ○ Application performance management ○ Clickstream analytics and funnel analysis ○ And more
  • 32. @ItaiYaffe, @ettigur Druid in a nutshell ● A real-time analytics database ○ Time-series, columnar ● Can ingest and store trillions of events, and serve analytic queries in sub-second ● Highly-scalable, cost-effective ● Widely used among Big Data companies for: ○ Application performance management ○ Clickstream analytics and funnel analysis ○ And more
  • 33. @ItaiYaffe, @ettigur Why is Druid suitable for the task? Topic Off-the-shelf alternatives Druid Scalability Limited Highly scalable Access to raw data Lack access Can store trillions of events Count-distinct operations Very slow Sub-second approximate count distinct with set operations using the Theta Sketch module * Based on tinyurl.com/qqza5ur
  • 34. @ItaiYaffe, @ettigur Why is Druid suitable for the task? Topic Off-the-shelf alternatives Druid Scalability Limited Highly scalable Access to raw data Lack access Can store trillions of events Count-distinct operations Very slow Sub-second approximate count distinct with set operations using the Theta Sketch module Theta Sketch??? * Based on tinyurl.com/qqza5ur
  • 35. @ItaiYaffe, @ettigur What is Theta Sketch? ● ThetaSketch mathematical framework - generalization of KMV ● K Minimum Values (KMV) ● Estimate set cardinality ● Supports set-theoretic operations
  • 36. Number of Std Dev 1 2 Confidence Interval 68.27% 95.45% 16,384 0.78% 1.56% 32,768 0.55% 1.10% 65,536 0.39% 0.78% Error as function of K Theta Sketch error * Larger K = more memory & storage needed @ItaiYaffe
  • 37. @ItaiYaffe, @ettigur Theta Sketch demo tinyurl.com/ugk6p67
  • 38. @ItaiYaffe, @ettigur The Theta Sketch module in Druid ● Part of the Apache DataSketches library (datasketches.apache.org) ● At ingestion time ○ Sketches are created and stored in Druid segments ● At query time ○ Sketches are aggregated (i.e union, intersection or difference between sketches) ○ The result - estimated number of unique entries in the aggregated sketch ● Also see this short video - tinyurl.com/vdwojh6
  • 39. @ItaiYaffe, @ettigur Roll-up - Count Distinct (Unique Users) 2021-05-26 Timestamp Website Device ID www.a.com 3a4c1f2d84a5c179435c1fea86e6ae02 2021-05-26 www.a.com 3a4c1f2d84a5c179435c1fea86e6ae02 2021-05-26 www.a.com 5dd59f9bd068f802a7c6dd832bf60d02 2021-05-26 www.b.com 5dd59f9bd068f802a7c6dd832bf60d02 2021-05-26 www.c.com 5dd59f9bd068f802a7c6dd832bf60d02 Timestamp Website Unique Users* 2021-05-26 2021-05-26 2021-05-26 www.a.com 2* 1* 1* www.b.com www.c.com ThetaSketchAggregator * What is actually stored is a ThetaSketch object. The actual result is calculated in real-time, which allows us to do UNIONs and INTERSECTIONs
  • 40. @ItaiYaffe, @ettigur Cool, so… Back to funnel analysis?
  • 41. @ItaiYaffe, @ettigur Funnel analysis - simple use-case How many unique users viewed online ad? VS How many unique users viewed online ad AND viewed product X page?
  • 42. @ItaiYaffe, @ettigur Funnel analysis - simple use-case 5/1/2021 - 5/26/2021 5/1/2021 - 5/26/2021
  • 44. @ItaiYaffe, @ettigur Funnel analysis pipeline - high-level architecture 1. Read files of last day Data Lake 2. Write files by campaign,date Campaigns' marts 3. Read files per campaign 4. Write files by date,campaign Enriched data 5. Load data by campaign Mart Generator Enricher
  • 45. @ItaiYaffe, @ettigur Funnel analysis pipeline - Data Lake 1. Read files of last day Data Lake 2. Write files by campaign,date Campaigns' marts 3. Read files per campaign 4. Write files by date,campaign Enriched data 5. Load data by campaign Mart Generator Enricher {event_time=2021-05-26T..., userid=uid1, attribute=online_ad} {event_time=2021-05-26T..., userid=uid1, attribute=homepage} {event_time=2021-05-26T..., userid=uid1, attribute=productX_page} .... date=2021-05-24 date=2021-05-25 date=2021-05-26
  • 46. @ItaiYaffe, @ettigur Funnel analysis pipeline - Mart Generator 1. Read files of last day Data Lake 2. Write files by campaign,date Campaigns' marts 3. Read files per campaign 4. Write files by date,campaign Enriched data 5. Load data by campaign Mart Generator Enricher campaign=1210 campaign=1319 campaign=1472 date=2021-05-24 date=2021-05-25 date=2021-05-26 {event_time=2021-05-26T... , userid=uid1, attribute=online_ad, type=Tactic} {event_time=2021-05-26T... , userid=uid1, attribute=homepage, type=Stage} {event_time=2021-05-26T... , userid=uid1, attribute=productX_page , type=Stage} ....
  • 47. Funnel analysis pipeline - Enricher
    Partitions: campaign=1210, campaign=1319, campaign=1472; date=2021-05-24, date=2021-05-25, date=2021-05-26
    {event_date=2021-05-26, userid=uid1, tactic=online_ad, stage=homepage}
    {event_date=2021-05-26, userid=uid1, tactic=online_ad, stage=productX_page}
    ....
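The Enricher step for the simple use-case could look roughly like the following Python sketch. It is not the authors' Spark job: the function name `enrich` is hypothetical, the timestamps are illustrative (the slides elide them), and pairing tactics with stages per user per day is an assumption made to reproduce the records shown on the slide.

```python
from collections import defaultdict

# Mart Generator output for one campaign. Timestamps are illustrative only;
# the slide elides them.
MART_EVENTS = [
    {"event_time": "2021-05-26T08:00", "userid": "uid1", "attribute": "online_ad", "type": "Tactic"},
    {"event_time": "2021-05-26T08:05", "userid": "uid1", "attribute": "homepage", "type": "Stage"},
    {"event_time": "2021-05-26T08:07", "userid": "uid1", "attribute": "productX_page", "type": "Stage"},
]


def enrich(events):
    """Pair every tactic a user saw with every stage the user reached that day.

    Event order is ignored here, which is what makes this the "simple" funnel.
    """
    tactics = defaultdict(set)
    stages = defaultdict(set)
    for event in events:
        key = (event["event_time"][:10], event["userid"])  # (event_date, userid)
        target = tactics if event["type"] == "Tactic" else stages
        target[key].add(event["attribute"])

    rows = []
    for event_date, userid in sorted(tactics):
        for tactic in sorted(tactics[(event_date, userid)]):
            for stage in sorted(stages.get((event_date, userid), ())):
                rows.append({"event_date": event_date, "userid": userid,
                             "tactic": tactic, "stage": stage})
    return rows


enriched = enrich(MART_EVENTS)
```

Each output row carries one tactic and one stage, which is the shape the Druid ingestion spec later expects for its `tactic` and `stage` dimensions.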
  • 48. Funnel analysis pipeline - ingesting data into Druid
    "type": "index_hadoop",
    "spec": {
      "dataSchema": {
        "dataSource": "campaign_1472",
        "granularitySpec": {
          "queryGranularity": "day",
          "segmentGranularity": "day",
          "type": "uniform",
          "intervals": ["2021-05-01/2021-05-27"]
          ...
  • 49. Funnel analysis pipeline - ingesting data into Druid
    "timestampSpec": { "column": "event_date", "format": "yyyy-MM-dd" },
    "dimensionsSpec": { "dimensions": ["tactic", "stage"] },
    "metricsSpec": [{
      "fieldName": "userid",
      "type": "thetaSketch",
      "name": "user_id_sketch",
      "size": 65536
    }],
    ...
  • 50. Funnel analysis pipeline - ingesting data into Druid
    "inputSpec": {
      "type": "multi",
      "children": [
        {"type": "dataSource",
         "ingestionSpec": {
           "intervals": ["2021-05-01/2021-05-27"],
           "dataSource": "campaign_1472", ...}},
        {"type": "static",
         "paths": "s3://<BUCKET_NAME>/date=2021-05-26/campaign=1472", ...},
        ...
  • 51. Funnel analysis pipeline - Druid datasources
    Datasources: campaign_1210, campaign_1319, campaign_1472
    {__time=2021-05-26, tactic=online_ad, stage=homepage, user_id_sketch=<Object>}
    {__time=2021-05-26, tactic=online_ad, stage=productX_page, user_id_sketch=<Object>}
    ....
  • 52. Funnel analysis pipeline - querying Druid (SQL)
    SELECT APPROX_COUNT_DISTINCT_DS_THETA(user_id_sketch, 65536) AS homepage_sketch
    FROM campaign_1472
    WHERE (("tactic" = 'online_ad') AND ("stage" = 'homepage'))
      AND __time BETWEEN '2021-05-01T00:00:00.000' AND '2021-05-26T23:59:59.000'
    * This specific query returns the estimated number of unique users that viewed the online ad AND viewed the homepage
  • 53. Funnel analysis - simple use-case revisited (date range shown: 5/1/2021 - 5/26/2021)
  • 56. Funnel analysis - simple use-case revisited: 3,100 (homepage UUs) - 2,500 (drop-off) != 1,000 (product page UUs)
  • 57. Funnel analysis - simple use-case revisited
    ONLINE AD: 8.1M UUs
    HOMEPAGE: 3.1K UUs
    Drop-off: 2.5K
    PRODUCT PAGE: 1K UUs
    ...
    * UUs = Unique Users
  • 59. Funnel analysis - complex use-case: How many unique users viewed the online ad? VS. How many unique users viewed the online ad FIRST and THEN viewed the product X page?
  • 60. Funnel analysis - complex use-case
    ● This is what we call a sequential funnel
      ○ Chronological order of events is important
    ● The data pipeline is very similar, but…
      ○ It takes into account only events that happened in the pre-defined order of the funnel
    ● That way we better represent the efficiency of a specific tactic (i.e., advertisement)
  • 61. Funnel analysis pipeline - reminder: 1. Read files of last day; 2. Write files by campaign,date; 3. Read files per campaign; 4. Write files by date,campaign; 5. Load data by campaign (components: Data Lake, Mart Generator, Campaigns' marts, Enricher, Enriched data)
  • 62. Funnel analysis pipeline - Data Lake
    Partitions: date=2021-05-24, date=2021-05-25, date=2021-05-26
    {event_time=2021-05-26T09:15, userid=uid1, attribute=productX_page}
    {event_time=2021-05-26T10:10, userid=uid1, attribute=online_ad}
    {event_time=2021-05-26T10:11, userid=uid1, attribute=homepage}
    ....
  • 63. Funnel analysis pipeline - Mart Generator
    Partitions: campaign=1210, campaign=1319, campaign=1472; date=2021-05-24, date=2021-05-25, date=2021-05-26
    {event_time=2021-05-26T09:15, userid=uid1, attribute=productX_page, type=Stage}
    {event_time=2021-05-26T10:10, userid=uid1, attribute=online_ad, type=Tactic}
    {event_time=2021-05-26T10:11, userid=uid1, attribute=homepage, type=Stage}
    ....
  • 64. Funnel analysis pipeline - Enricher
    Partitions: campaign=1210, campaign=1319, campaign=1472; date=2021-05-24, date=2021-05-25, date=2021-05-26
    {event_date=2021-05-26, userid=uid1, tactic=online_ad, stage=productX_page}
    {event_date=2021-05-26, userid=uid1, tactic=online_ad, stage=homepage}
    ....
  • 66. Funnel analysis pipeline - Enricher
    Partitions: campaign=1210, campaign=1319, campaign=1472; date=2021-05-24, date=2021-05-25, date=2021-05-26
    {event_date=2021-05-26, userid=uid1, tactic=online_ad, stage=homepage}
    ....
    (The productX_page view at 09:15 preceded the online_ad exposure at 10:10, so that record is dropped.)
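The sequential filtering shown on the Enricher slides can be sketched in Python as follows. This is a simplified illustration, not the authors' Spark code: the function name `enrich_sequential` is hypothetical, and the only ordering rule enforced is "stage event after the user's first exposure to the tactic" (an assumption; a full implementation would also check the order between stages).

```python
# Mart Generator output from the slides (one user, one campaign).
MART_EVENTS = [
    {"event_time": "2021-05-26T09:15", "userid": "uid1", "attribute": "productX_page", "type": "Stage"},
    {"event_time": "2021-05-26T10:10", "userid": "uid1", "attribute": "online_ad", "type": "Tactic"},
    {"event_time": "2021-05-26T10:11", "userid": "uid1", "attribute": "homepage", "type": "Stage"},
]


def enrich_sequential(events):
    """Keep only stage events that happened after the user first saw a tactic.

    Timestamps share one ISO-8601 layout, so string comparison matches
    chronological order.
    """
    # (userid, tactic) -> earliest exposure time
    first_exposure = {}
    for event in events:
        if event["type"] == "Tactic":
            key = (event["userid"], event["attribute"])
            if key not in first_exposure or event["event_time"] < first_exposure[key]:
                first_exposure[key] = event["event_time"]

    rows = set()
    for event in events:
        if event["type"] != "Stage":
            continue
        for (userid, tactic), seen_at in first_exposure.items():
            if userid == event["userid"] and event["event_time"] > seen_at:
                rows.add((event["event_time"][:10], userid, tactic, event["attribute"]))
    return sorted(rows)


enriched = enrich_sequential(MART_EVENTS)
```

With the slide's data, only the homepage view survives, because the product page view happened before the ad exposure.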
  • 67. Funnel analysis pipeline - querying Druid (SQL)
    SELECT APPROX_COUNT_DISTINCT_DS_THETA(
             THETA_SKETCH_NOT(65536,
               THETA_SKETCH_INTERSECT(65536, a, b),
               THETA_SKETCH_UNION(65536, c, d, e))) AS dropoff_sketch
    FROM (
      SELECT
        DS_THETA("user_id_sketch") FILTER (WHERE tactic = 'online_ad') AS a,
        DS_THETA("user_id_sketch") FILTER (WHERE stage = 'homepage') AS b,
        DS_THETA("user_id_sketch") FILTER (WHERE stage = 'productX_page') AS c,
        DS_THETA("user_id_sketch") FILTER (WHERE stage = 'add_to_cart') AS d,
        DS_THETA("user_id_sketch") FILTER (WHERE stage = 'checkout') AS e
      FROM campaign_1472
      WHERE stage IN ('homepage', 'productX_page', 'add_to_cart', 'checkout')
        AND tactic = 'online_ad'
        AND __time BETWEEN '2021-05-01T00:00:00.000' AND '2021-05-26T23:59:59.000'
    ) subquery
    * This specific query should return the estimated number of unique users for the drop-off between the homepage and product X page
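The set algebra behind this query, (a ∩ b) minus (c ∪ d ∪ e), can be checked with a small Python sketch. Exact sets stand in for Theta sketches (an assumption; Druid returns estimates), and the user IDs and the function name `drop_off` are made up for illustration.

```python
# Hypothetical users per filter, standing in for the sketches a..e in the query.
a_online_ad = {"u1", "u2", "u3", "u4"}
b_homepage = {"u1", "u2", "u3"}
c_product_page = {"u1"}
d_add_to_cart = set()
e_checkout = set()


def drop_off(tactic_users, stage_users, *later_stage_users):
    """Users who reached a stage after the tactic but none of the later stages.

    Equivalent in spirit to
    THETA_SKETCH_NOT(THETA_SKETCH_INTERSECT(a, b), THETA_SKETCH_UNION(c, d, e)).
    """
    reached_stage = tactic_users & stage_users
    went_further = set().union(*later_stage_users)
    return reached_stage - went_further


dropped = drop_off(a_online_ad, b_homepage, c_product_page, d_add_to_cart, e_checkout)
dropoff_count = len(dropped)
```

Here u2 and u3 saw the ad and the homepage but never moved on, so they make up the drop-off between the homepage and the product X page.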
  • 68. Funnel analysis - complex use-case (date range shown: 5/1/2021 - 5/26/2021)
  • 71. Funnel analysis - complex use-case: 3,100 (homepage UUs) - 2,500 (drop-off) = 600 (product page UUs)
  • 72. Funnel analysis - complex use-case
    ONLINE AD: 8.1M UUs
    HOMEPAGE: 3.1K UUs
    Drop-off: 2.5K
    PRODUCT PAGE: 0.6K UUs
    ...
    * UUs = Unique Users
  • 74. A few tips
    ● Use Druid with Theta Sketch for fast approximate count distinct
      ○ Allows set operations (intersection/union/negation)
    ● Use Spark to pre-process incoming events
      ○ Allows you to take into account only events that happened in the pre-defined order of the funnel
      ○ Check out Etti’s “Optimizing Spark-based data pipelines” talk (video - tinyurl.com/7hvyxtc8, slides - tinyurl.com/3rvc9mus)
    ● Optimize your ingestion process
      ○ Write Theta Sketch objects from the Spark app
      ○ Load to Druid using the isInputThetaSketch=true flag
  • 77. What have we learned?
    ● Funnel analysis
      ○ Very important for advertisers
      ○ Not easy to solve technically (especially if chronological order of events matters)
    ● Druid is a very powerful tool for real-time analytics
      ○ Highly scalable, can ingest and store trillions of events, and serve analytic queries with sub-second latency
      ○ Used for many different use-cases
    ● Combining Apache Spark, Druid and DataSketches FTW!
      ○ Pre-process events before ingesting into Druid
      ○ Decide how to handle out-of-order events
  • 78. Want to know more?
    ● Women in Big Data
      ○ A worldwide program that aims to inspire, connect, grow, and champion the success of women in the Big Data & analytics field
      ○ 30+ chapters and 17,000+ members worldwide
      ○ Everyone can join (regardless of gender), so find a chapter near you - www.womeninbigdata.org/membership/
    ● Conference talks
      ○ Casting the Spell: Druid in Practice (Berlin Buzzwords, June 17th 2021) - tinyurl.com/559hufnj
      ○ Migrating Airflow-based Spark Jobs to K8s (Data+AI Summit Europe 2020) - tinyurl.com/cbm42mn8
    ● Our Tech Blog - medium.com/nmc-techblog
      ○ Data Retention and Deletion in Apache Druid - tinyurl.com/yymrvrn2
  • 80. THANK YOU - Etti Gur, Itai Yaffe
  • 81. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.