Building a healthy data ecosystem
around Kafka and Hadoop:
Lessons learned at LinkedIn
Mar 16, 2017
Shirshanka Das, Principal Staff Engineer, LinkedIn
Yael Garten, Director of Data Science, LinkedIn
@shirshanka, @yaelgarten
The Pursuit of #DataScienceHappiness
A original
@yaelgarten @shirshanka
write code
Share Learnings
at Strata!
Three (Naïve) Steps to #DataScienceHappiness
circa 2010
Achieving Data
“… helping everybody to access and understand data …
breaking down silos… providing access to data when and where
it is needed at any given moment.”
Collect, flow, store as much data as you can
Provide efficient access to data in all its stages of evolution
The forms of data
Key-Value++ Message Bus Fast-OLAP Search Graph Crunchable
Pinot Galene Graph DB
Blob, Data
The forms of data
At Rest | In Motion
Graph DBDocument
Blob, Data
The forms of data
At Rest | In Motion
O(10) clusters
~1.7 Trillion messages
~450 TB
O(10) clusters
~10K machines
~100 PB
At Rest | In Motion
Data Integration
Blob, Data
Data Integration: key requirements
Source, Sink
So, we built
Simplifying Data Integration
Hundreds of TB per day
Thousands of datasets
~30 different source systems
80%+ of data ingest
Open source @
Adopted by LinkedIn, Intel, Swisscom, Prezi, PayPal,
NerdWallet and many more…
Apache incubation under way
Blob, Data
At Rest | In Motion
Kafka Hadoop
Samza Jobs
hour +
Distributed Multi-dimensional OLAP
Columnar + indexes
No joins
Latency: low ms to sub-second
Site-facing apps, reporting dashboards, monitoring
Open source.
In production @
LinkedIn, Uber
At Rest | In Motion
Data Infra 1.0 for Data Democracy
Query Engines
2010 - now
write code
Share Learnings
at Strata

Data Scientist
PM Designer
We should enable users
to filter connection
suggestions by company
How much do
people utilize
existing filter
Let's see how
users send
Tracking data records user activity
(powers metrics and data products)
Scale fact:
~1000 tracking event types,
hundreds of metrics & data
Tracking data records user activity
tracking data
metric scripts
production code
Tracking Data Lifecycle
Produce -> Transport -> Consume
Member facing
data products
Business facing
decision making
Tracking Data Lifecycle & Teams
Product or App teams:
PMs, Developers, TestEng
Infra teams:
Hadoop, Kafka, DWH, ...
Data science teams: 

Analytics, ML Engineers,...
tracking data
metric scripts
production code
Member facing
data products
Business facing
decision making
Produce -> Transport -> Consume
How do we calculate a metric: ProfileViews
Record 1:
    "memberId" : 12345,
    "time" : 1454745292951,
    "appName" : "LinkedIn",
    "pageKey" : "profile_page",
    "trackingInfo" : "vid=1214,ps=EDU|EXP|SKIL| ..."

ProfileViews = sum(PageViewEvent)
where pageKey = profile_page

Record 101:
    "memberId" : 12345,
    "time" : 1454745292951,
    "appName" : "LinkedIn",
    "pageKey" : "new_profile_page",
    "trackingInfo" : "vid=1214,ps=EDU|EXP|SKIL| ..."

or new_profile_page
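The ProfileViews definition above can be sketched in a few lines. This is only an illustration of why a hard-coded pageKey filter is fragile, not LinkedIn's metric code: the sample events and the `profile_views` helper are hypothetical, shaped after the two records on the slide.

```python
from typing import Iterable, Mapping, Set

# Hypothetical sample events, shaped like Record 1 and Record 101 above.
EVENTS = [
    {"memberId": 12345, "time": 1454745292951,
     "appName": "LinkedIn", "pageKey": "profile_page"},
    {"memberId": 12345, "time": 1454745292951,
     "appName": "LinkedIn", "pageKey": "new_profile_page"},
]


def profile_views(events: Iterable[Mapping], page_keys: Set[str]) -> int:
    """Count page view events whose pageKey is one of page_keys."""
    return sum(1 for event in events if event.get("pageKey") in page_keys)


# The original metric only knows the old page key ...
original_count = profile_views(EVENTS, {"profile_page"})
# ... so it undercounts until someone adds the new key by hand.
patched_count = profile_views(EVENTS, {"profile_page", "new_profile_page"})

print(original_count, patched_count)
```

Every downstream script that embeds the filter has to be patched the same way, which is the failure mode the following slides describe.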
Ok but
forgot to notify
undesirable
Metrics ecosystem at LinkedIn: 3 yrs ago
Operational Challenges for infra teams
Diminished Trust due to multiple sources of truth
What was causing unhappiness?
1. No contracts: downstream scripts broke when upstream changed
2. "Naming things is hard": different semantics & conventions in the data events of each team
--> need to email around to figure out the correct and complete logic to use
--> inefficient and potentially wrong
3. Discrepant metric logic: duplicate tech allowed duplicate logic, which in turn allowed discrepant metric logic
So how did we solve this?
Data Modeling Tip
Say no to Fragile Formats or Schema-Free
Invest in a mature serialization protocol such as Avro, Protobuf, or Thrift for serializing
your messages to your persistent stores: Kafka, Hadoop, DBs, etc.
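As a rough illustration of what a declared schema buys you, the sketch below writes an Avro-style record schema as JSON and checks records against it with a tiny hand-rolled validator. It does not use the Avro library; the field list for PageViewEvent is assumed from the sample records in this deck, and `violations` is a hypothetical helper that only understands the `long` and `string` types.

```python
import json

# Avro-style record schema (JSON). The field list is an assumption based
# on the sample records in this deck, not LinkedIn's real schema.
PAGE_VIEW_SCHEMA = json.loads("""
{
  "type": "record",
  "name": "PageViewEvent",
  "fields": [
    {"name": "memberId", "type": "long"},
    {"name": "time",     "type": "long"},
    {"name": "appName",  "type": "string"},
    {"name": "pageKey",  "type": "string"}
  ]
}
""")

# Only the two primitive types used above are supported in this sketch.
_PY_TYPES = {"long": int, "string": str}


def violations(record: dict, schema: dict) -> list:
    """Return a list of human-readable schema violations for one record."""
    problems = []
    for field in schema["fields"]:
        name = field["name"]
        expected = _PY_TYPES[field["type"]]
        if name not in record:
            problems.append(f"missing field: {name}")
        elif isinstance(record[name], bool) or not isinstance(record[name], expected):
            # bool is a subclass of int in Python, so reject it explicitly.
            problems.append(f"wrong type for {name}")
    return problems


good = {"memberId": 12345, "time": 1454745292951,
        "appName": "LinkedIn", "pageKey": "profile_page"}
bad = {"memberId": "12345", "time": 1454745292951, "appName": "LinkedIn"}

print(violations(good, PAGE_VIEW_SCHEMA))
print(violations(bad, PAGE_VIEW_SCHEMA))
```

A real serialization library performs this check at write time, so malformed records never reach Kafka or Hadoop in the first place.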
1. No contracts
2. Naming things is hard
3. Discrepant metric logic
Chose Avro as our format
Sometimes you need a committee
Leads from product and infra teams
Review each new data model
Ensure that it follows our conventions,
patterns and best practices across entire
data lifecycle
1. No contracts
2. Naming things is hard
3. Discrepant metric logic
Data Model Review Committee
Tooling to codify conventions
“Always be reviewing”
Who and What Evolution
Unified Metrics Platform
A single source of truth for all
business metrics at LinkedIn
1. No contracts
2. Naming things is hard
3. Discrepant metric logic
- metrics processing platform as a
- a metrics computation template
- a set of tools and process to
facilitate metrics life-cycle
Central Team,
& Deploy
System Jobs, Core Metrics
1. iterate
2. create
3. review
4. check in
5,000 metrics daily
Unified Metrics Platform: Pipeline
Metrics Logic
UMP Harness
1. No contracts
2. Naming things is hard
3. Discrepant metric logic
+ Database
+ Other data
Tracking Platform: standardizing production
Schema compatibility
At Rest | In Motion
Data Infra + Platforms 2.0
Tracking Platform Unified Metrics Platform (UMP)
Production Consumption
circa 2015
What was still causing unhappiness?
1. Old bad data sticks around (e.g. old mobile app versions)
2. No clear contract for data production - Producers unaware of consumers' concerns
3. Never a good time to pay down this tech debt
We started from the bottom.
Product or App teams:
PMs, Developers, TestEng
Infra teams:
Hadoop, Kafka, DWH, ...
Data science teams: 

Analytics, ML Engineers,...
tracking data
metric scripts
production code
Member facing
data products
Business facing
decision making
3. Never a good time to pay down this "data" debt
#victimsOfTheData —> #DataScienceHappiness 

via proactively forging our own data destiny.
Features are waiting to ship to members... some of this stuff is invisible
But what is the cost
of not doing it?
The Big Problem Opportunity in 2015 

Launch a completely rewritten LinkedIn mobile app
"header" : {
    "memberId" : 12345,
    "time" : 1454745292951,
    "appName" : {
        "string" : "LinkedIn"
    },
    "pageKey" : "profile_page"
},
"trackingInfo" : {
    ["Viewee" : "23456"],
We already wanted to move to better data models
"header" : {
    "memberId" : 12345,
    "time" : 4745292951145,
    "appName" : {
        "string" : "LinkedIn"
    },
    "pageKey" : "profile_page"
},
"entityView" : {
    "viewType" : "profile-view",
    "viewerId" : "12345",
    "vieweeId" : "23456",

There were two options:

1. Keep the old tracking:
a. Cost: producers (try to) replicate it (write bad old code from
b. Save: consumers avoid migrating.

2. Evolve.
a. Cost: time on clean data modeling, and on consumer
migration to new tracking events,
b. Save: pays down data modeling tech debt
How much work would it be?
We pitched it to our Leadership team
Do it!
2. Clear contract did not exist for data production
Producers were unaware of consumers' needs, and were "throwing data over the wall".

Even with Avro, schema adherence != semantic equivalence
user engagement
tracking data


Member facing

data products
Business facing
decision making
#victimsOfTheData —> #DataScienceHappiness, via proactive joint requirements definition
Own the artifact that
feeds the data ecosystem
(and data scientists!)
Data producers
(PM, app developers)
Data consumers 

2a. Ensure dialogue between Producers & Consumers
• Awareness: Train about end-to-end data pipeline, data modeling
• Instill communication & collaborative ownership process between all: a step-by-step
playbook for who & how to develop and own tracking
2b. Standardized core data entities
• Event types and names: Page, Action, Impression
• Framework level client side tracking: views, clicks, flows
• For all else (custom) - guide when to create a new Event

Page View
Control Interaction
2c. Created clear maintainable data production contracts
Tracking specification with monitoring and alerting for adherence: 

clear, visual, consistent contract
Need tooling to support culture and process shift - "Always be tooling"

Tracking specification Tool
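A tracking specification with adherence monitoring could look roughly like the sketch below. It is not the tracking specification tool from the talk: `TrackingSpec`, `adherence_report`, the field names, and the 1% alert threshold are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Mapping, Tuple


@dataclass(frozen=True)
class TrackingSpec:
    """Hypothetical contract agreed between producers and consumers."""
    event_name: str
    allowed_page_keys: FrozenSet[str]
    required_fields: Tuple[str, ...]


def adherence_report(events: List[Mapping], spec: TrackingSpec,
                     alert_threshold: float = 0.01) -> dict:
    """Count events that break the spec and flag when the rate is too high."""
    violations = 0
    for event in events:
        missing = any(name not in event for name in spec.required_fields)
        unknown_key = event.get("pageKey") not in spec.allowed_page_keys
        if missing or unknown_key:
            violations += 1
    total = len(events)
    rate = violations / total if total else 0.0
    return {"total": total, "violations": violations,
            "rate": rate, "alert": rate > alert_threshold}


spec = TrackingSpec(
    event_name="PageViewEvent",
    allowed_page_keys=frozenset({"profile_page"}),
    required_fields=("memberId", "time", "pageKey"),
)
events = [
    {"memberId": 1, "time": 1, "pageKey": "profile_page"},
    {"memberId": 2, "time": 2, "pageKey": "profile_page"},
    {"memberId": 3, "time": 3, "pageKey": "new_profile_page"},  # key not in spec
    {"memberId": 4, "pageKey": "profile_page"},                 # missing "time"
]
report = adherence_report(events, spec)
print(report)
```

The point of the sketch is that the contract is data, so the same specification can drive documentation, review, and alerting.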
1. Old bad data sticks around
"header" : {
    "memberId" : 12345,
    "time" : 1454745292951,
    "appName" : {
        "string" : "LinkedIn"
    },
    "pageKey" :
},
"trackingInfo" : {
    ["vieweeID" : "23456"],

"header" : {
    "memberId" : 12345,
    "time" : 4745292951145,
    "appName" : {
        "string" : "LinkedIn"
    },
    "pageKey" : "profile_page"
},
"entityView" : {
    "viewType" : "profile-view",
    "viewerId" : "12345",
    "vieweeId" : "23456",
How do we handle old and new?
Producers Consumers
The Big Challenge
load '/data/tracking/PageViewEvent' using AvroStorage()
(Pig scripts)
My Raw Data
Our scripts were doing ….
My Raw Data
My Data API
We need “microservices" for Data
The Database community solved this
decades ago...
We built Dali to solve this
A Data Access Layer for LinkedIn
Abstract away underlying physical details to allow users to
focus solely on the logical concerns
Logical Tables + Views
Logical FileSystem
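The idea of a logical view over old and new physical records can be sketched as a single normalizing function. This is not Dali's implementation or API: `unified_profile_view` is a hypothetical name, the record shapes follow the old and new examples in this deck, and treating the header's memberId as the viewer of an old-style record is an assumption.

```python
from typing import Optional


def unified_profile_view(record: dict) -> Optional[dict]:
    """Map an old-style or new-style tracking record to one logical row.

    Returns None for records that are not profile views.
    """
    header = record["header"]

    # New-style records carry an explicit entityView block.
    if "entityView" in record:
        view = record["entityView"]
        if view.get("viewType") != "profile-view":
            return None
        return {"viewerId": int(view["viewerId"]),
                "vieweeId": int(view["vieweeId"]),
                "time": header["time"]}

    # Old-style records rely on pageKey plus free-form trackingInfo.
    # Assumption: the header's memberId is the viewer.
    if header.get("pageKey") == "profile_page":
        viewee = record.get("trackingInfo", {}).get("Viewee")
        if viewee is None:
            return None
        return {"viewerId": header["memberId"],
                "vieweeId": int(viewee),
                "time": header["time"]}

    return None


old_record = {
    "header": {"memberId": 12345, "time": 1454745292951, "pageKey": "profile_page"},
    "trackingInfo": {"Viewee": "23456"},
}
new_record = {
    "header": {"memberId": 12345, "time": 1454745292951, "pageKey": "profile_page"},
    "entityView": {"viewType": "profile-view", "viewerId": "12345", "vieweeId": "23456"},
}

print(unified_profile_view(old_record))
print(unified_profile_view(new_record))
```

Consumers that read only the logical row never see which physical format a record arrived in, so producers can migrate without breaking them.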
Producers Consumers
Job Seeker App
LinkedIn App
Dali: Implementation Details in Context
Dali FileSystem
Processing Engine
(MR, Spark)
Dali Datasets (Tables+Views)
Dataflow APIs
(MR, Spark,
Query Layers
(Pig, Hive,
Dali CLI
Data Catalog
Git + Artifactory
View Def +
Data Source
Data Sink
load '/data/tracking/PageViewEvent'
using AvroStorage();
load 'tracking.UnifiedProfileView' using
One small step for a script
A Few Hard Problems
Views and UDFs
Mapping to Hive metastore entities
Development lifecycle
Git as source of truth
Gradle for build
LinkedIn tooling integration for deployment
State of the world today
Pretty much all new UMP metrics use Dali
data sources
At Rest
Now brewing: Dali on Kafka
Can we take the same
views and run them
seamlessly on Kafka as
Stream Data
Standard streaming APIs
- Samza System Consumer
- Kafka Consumer
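One way to picture "the same view on data at rest and in motion" is to write the view as a pure per-record function and apply it to either a stored batch or an unbounded iterator. The sketch below uses plain Python iterables as stand-ins; it does not call the Samza or Kafka consumer APIs, and `profile_view`, `run_batch`, and `run_stream` are hypothetical names.

```python
from typing import Callable, Iterable, Iterator, List, Optional

View = Callable[[dict], Optional[dict]]


def profile_view(record: dict) -> Optional[dict]:
    """Hypothetical view: keep profile page views, project two columns."""
    if record.get("pageKey") != "profile_page":
        return None
    return {"viewerId": record["memberId"], "time": record["time"]}


def run_batch(view: View, records: Iterable[dict]) -> List[dict]:
    """Apply the view to a finite dataset (data at rest)."""
    return [row for row in map(view, records) if row is not None]


def run_stream(view: View, records: Iterable[dict]) -> Iterator[dict]:
    """Apply the same view lazily, one record at a time (data in motion)."""
    for record in records:
        row = view(record)
        if row is not None:
            yield row


data = [
    {"memberId": 12345, "time": 1454745292951, "pageKey": "profile_page"},
    {"memberId": 67890, "time": 1454745292952, "pageKey": "feed"},
]

batch_rows = run_batch(profile_view, data)
stream_rows = list(run_stream(profile_view, iter(data)))
print(batch_rows == stream_rows)
```

Because the view itself holds no state, only the runner changes between the batch and streaming cases.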
What’s next for Dali?
Selective materialization
Open source
Hive is an implementation detail, not a long term bet
Dali: When are we done dreaming?
At Rest | In Motion
Data Infra + Platforms 3.0
Tracking Platform Unified Metrics Platform (UMP)
Dali, Dr. Elephant, WhereHows
circa 2017
Did we succeed? We just handled another huge rewrite!
write code
Share Learnings
at Strata
Three (Naïve) Steps to #DataScienceHappiness
Basic data
for data democracy
Platforms, Process
to standardize
produce + consume

Tech + process

to sustain
healthy data
Our Journey towards #DataScienceHappiness
2010 -> Kafka, Hadoop, Gobblin, Pinot
2013 -> Tracking, UMP
2015 ->
The Pursuit of #DataScienceHappiness
A original
@yaelgarten @shirshanka
Thank You!
to be continued…

Sunshine Coast University diplomaSunshine Coast University diploma
Sunshine Coast University diploma
[D3T2S03] Data&AI Roadshow 2024 - Amazon DocumentDB 실습
[D3T2S03] Data&AI Roadshow 2024 - Amazon DocumentDB 실습[D3T2S03] Data&AI Roadshow 2024 - Amazon DocumentDB 실습
[D3T2S03] Data&AI Roadshow 2024 - Amazon DocumentDB 실습
Amazon Web Services Korea
Australian Catholic University degree offer diploma Transcript
Australian Catholic University  degree offer diploma TranscriptAustralian Catholic University  degree offer diploma Transcript
Australian Catholic University degree offer diploma Transcript
South Ex @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
South Ex @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model SafeSouth Ex @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
South Ex @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
simmi singh$A17
Noida Extension @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Vishakha Singla Top Model Safe
Noida Extension @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Vishakha Singla Top Model SafeNoida Extension @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Vishakha Singla Top Model Safe
Noida Extension @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Vishakha Singla Top Model Safe
kumkum tuteja$A17
iot paper presentation FINAL EDIT by kiran.pptx
iot paper presentation FINAL EDIT by kiran.pptxiot paper presentation FINAL EDIT by kiran.pptx
iot paper presentation FINAL EDIT by kiran.pptx
Nehru Place @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
Nehru Place @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model SafeNehru Place @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
Nehru Place @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
Malviya Nagar @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe
Malviya Nagar @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model SafeMalviya Nagar @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe
Malviya Nagar @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe
RK Puram @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model Safe
RK Puram @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model SafeRK Puram @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model Safe
RK Puram @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model Safe
Alisha Pathan $A17
Supervised Learning (Data Science).pptx
Supervised Learning  (Data Science).pptxSupervised Learning  (Data Science).pptx
Supervised Learning (Data Science).pptx
LLM powered Contract Compliance Application.pptx
LLM powered Contract Compliance Application.pptxLLM powered Contract Compliance Application.pptx
LLM powered Contract Compliance Application.pptx
Jyotishko Biswas

Recently uploaded (20)

University of Toronto degree offer diploma Transcript
University of Toronto  degree offer diploma TranscriptUniversity of Toronto  degree offer diploma Transcript
University of Toronto degree offer diploma Transcript
Sin Involves More Than You Might Think (We'll Explain)
Sin Involves More Than You Might Think (We'll Explain)Sin Involves More Than You Might Think (We'll Explain)
Sin Involves More Than You Might Think (We'll Explain)
Karol Bagh @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
Karol Bagh @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model SafeKarol Bagh @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
Karol Bagh @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
[D3T1S04] Aurora PostgreSQL performance monitoring and troubleshooting by use...
[D3T1S04] Aurora PostgreSQL performance monitoring and troubleshooting by use...[D3T1S04] Aurora PostgreSQL performance monitoring and troubleshooting by use...
[D3T1S04] Aurora PostgreSQL performance monitoring and troubleshooting by use...
Lajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ginni Singh Top Model Safe
Lajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ginni Singh Top Model SafeLajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ginni Singh Top Model Safe
Lajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ginni Singh Top Model Safe
Daryaganj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
Daryaganj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model SafeDaryaganj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
Daryaganj @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
How We Added Replication to QuestDB - JonTheBeach
How We Added Replication to QuestDB - JonTheBeachHow We Added Replication to QuestDB - JonTheBeach
How We Added Replication to QuestDB - JonTheBeach
Saket @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model Safe
Saket @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model SafeSaket @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model Safe
Saket @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model Safe
Sunshine Coast University diploma
Sunshine Coast University diplomaSunshine Coast University diploma
Sunshine Coast University diploma
[D3T2S03] Data&AI Roadshow 2024 - Amazon DocumentDB 실습
[D3T2S03] Data&AI Roadshow 2024 - Amazon DocumentDB 실습[D3T2S03] Data&AI Roadshow 2024 - Amazon DocumentDB 실습
[D3T2S03] Data&AI Roadshow 2024 - Amazon DocumentDB 실습
Australian Catholic University degree offer diploma Transcript
Australian Catholic University  degree offer diploma TranscriptAustralian Catholic University  degree offer diploma Transcript
Australian Catholic University degree offer diploma Transcript
South Ex @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
South Ex @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model SafeSouth Ex @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
South Ex @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
Noida Extension @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Vishakha Singla Top Model Safe
Noida Extension @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Vishakha Singla Top Model SafeNoida Extension @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Vishakha Singla Top Model Safe
Noida Extension @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Vishakha Singla Top Model Safe
iot paper presentation FINAL EDIT by kiran.pptx
iot paper presentation FINAL EDIT by kiran.pptxiot paper presentation FINAL EDIT by kiran.pptx
iot paper presentation FINAL EDIT by kiran.pptx
Nehru Place @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
Nehru Place @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model SafeNehru Place @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
Nehru Place @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
Malviya Nagar @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe
Malviya Nagar @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model SafeMalviya Nagar @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe
Malviya Nagar @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe
RK Puram @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model Safe
RK Puram @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model SafeRK Puram @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model Safe
RK Puram @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model Safe
Supervised Learning (Data Science).pptx
Supervised Learning  (Data Science).pptxSupervised Learning  (Data Science).pptx
Supervised Learning (Data Science).pptx
LLM powered Contract Compliance Application.pptx
LLM powered Contract Compliance Application.pptxLLM powered Contract Compliance Application.pptx
LLM powered Contract Compliance Application.pptx

Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at LinkedIn

  • 1. Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at LinkedIn Mar 16, 2017 Shirshanka Das, Principal Staff Engineer, LinkedIn Yael Garten, Director of Data Science, LinkedIn @shirshanka, @yaelgarten
  • 2. The Pursuit of #DataScienceHappiness A original @yaelgarten @shirshanka
  • 3. Achieve Data Democracy Data Scientists write code Unleash Insights Share Learnings at Strata! Three (Naïve) Steps to #DataScienceHappiness circa 2010
  • 5. Achieving Data Democracy: “… helping everybody to access and understand data … breaking down silos … providing access to data when and where it is needed at any given moment.” Collect, flow, and store as much data as you can. Provide efficient access to data in all its stages of evolution.
  • 6. The forms of data: Key-Value++, Message Bus, Fast-OLAP, Search, Graph, Crunchable. Espresso, Venice, Pinot, Galene, Graph DB, Document DB, DynamoDB, Azure Blob, Data Lake Storage.
  • 7. The forms of data: At Rest, In Motion. Espresso, Venice, Pinot, Galene, Graph DB, Document DB, DynamoDB, Azure Blob, Data Lake Storage.
  • 8. The forms of data: scale. In Motion: O(10) clusters, ~1.7 trillion messages, ~450 TB. At Rest: O(10) clusters, ~10K machines, ~100 PB.
  • 9. At Rest, In Motion. Data Integration: SFTP, JDBC, REST, Azure Blob, Data Lake Storage.
  • 10. Data Integration: key requirements. Source and sink diversity; batch + streaming; data quality. So, we built
  • 11. Simplifying Data Integration @LinkedIn (SFTP, JDBC, REST, Azure Blob, Data Lake Storage): hundreds of TB per day; thousands of datasets; ~30 different source systems; 80%+ of data ingest. Open source. Adopted by LinkedIn, Intel, Swisscom, Prezi, PayPal, NerdWallet and many more… Apache incubation under way.
  • 15. Kafka, Hadoop, Samza Jobs, Pinot, Query Engines; minutes, hour +. Distributed multi-dimensional OLAP: columnar + indexes; no joins; latency: low ms to sub-second.
  • 16. Site-facing apps, reporting dashboards, monitoring. Open source. In production @ LinkedIn, Uber.
  • 17. Data Infra 1.0 for Data Democracy (2010 - now): At Rest, In Motion, Processing Frameworks, Query Engines.
  • 19. How does LinkedIn build data-driven products? Data Scientist, PM, Designer, Engineer. "We should enable users to filter connection suggestions by company." "How much do people utilize existing filter capabilities?" "Let's see how users send connection invitations today."
  • 20. Tracking data records user activity InvitationClickEvent()
  • 21. Tracking data records user activity (powers metrics and data products): InvitationClickEvent(). Scale fact: ~1000 tracking event types, ~hundreds of metrics & data products.
  • 22. Tracking Data Lifecycle: Produce, Transport, Consume. User engagement, tracking data, metric scripts, production code. Member-facing data products; business-facing decision making.
  • 23. Tracking Data Lifecycle & Teams: Produce, Transport, Consume. Teams: product or app teams (PMs, developers, TestEng); infra teams (Hadoop, Kafka, DWH, ...); data science teams (analytics, ML engineers, ...). User engagement, tracking data, metric scripts, production code. Member-facing data products (members); business-facing decision making (execs).
  • 24. How do we calculate a metric: ProfileViews
 PageViewEvent Record 1: { "memberId" : 12345, "time" : 1454745292951, "appName" : "LinkedIn", "pageKey" : "profile_page", "trackingInfo" : "Viewee=1214,lnl=f,nd=1,o=1214, ^SP=pId-'pro_stars',rslvd=t,vs=v, vid=1214,ps=EDU|EXP|SKIL| ..." }
 PageViewEvent Record 101: { "memberId" : 12345, "time" : 1454745292951, "appName" : "LinkedIn", "pageKey" : "new_profile_page", "trackingInfo" : "viewee_id=1214,lnl=f,nd=1,o=1214, ^SP=pId-'pro_stars',rslvd=t,vs=v, vid=1214,ps=EDU|EXP|SKIL| ..." }
 Metric: ProfileViews = sum(PageViewEvent where pageKey = profile_page or new_profile_page)
 Annotations on the changes in Record 101: "Ok but forgot to notify"; "undesirable".
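The failure mode on this slide can be sketched in a few lines of Python. This is an illustrative sketch, not LinkedIn's metric code: the records are trimmed to the one field the metric reads, and the helper name `profile_views` is hypothetical.

```python
from typing import Iterable

# Two PageViewEvent records, trimmed to the fields the metric reads.
events = [
    {"memberId": 12345, "pageKey": "profile_page"},
    {"memberId": 12345, "pageKey": "new_profile_page"},
]


def profile_views(records: Iterable[dict], page_keys: frozenset) -> int:
    """Count page views whose pageKey is one of the accepted profile page keys."""
    return sum(1 for r in records if r["pageKey"] in page_keys)


# Original metric logic: it only knows the old key, so the second record
# is silently dropped and the metric under-counts without any error.
stale_count = profile_views(events, frozenset({"profile_page"}))

# Patched logic: someone has to hear about the new key and add it by hand.
patched_count = profile_views(events, frozenset({"profile_page", "new_profile_page"}))

print(stale_count, patched_count)
```

Nothing fails loudly in the stale case, which is why an unannounced upstream change is hard to notice downstream.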
  • 25. Metrics ecosystem at LinkedIn, 3 yrs ago: operational challenges for infra teams; diminished trust due to multiple sources of truth.
  • 26. What was causing unhappiness?
 1. No contracts: downstream scripts broke when upstream changed.
 2. "Naming things is hard": different semantics & conventions in various data events (per team) --> need to email to figure out the correct and complete logic to use --> inefficient and potentially wrong.
 3. Discrepant metric logic: duplicate tech allowed for duplicate logic, which allowed for discrepant metric logic.
 So how did we solve this?
  • 27. Data Modeling Tip: say no to fragile formats or schema-free. Invest in a mature serialization protocol like Avro, Protobuf, Thrift, etc. for serializing your messages to your persistent stores: Kafka, Hadoop, DBs, etc. We chose Avro as our format. (Problems: 1. No contracts 2. Naming things is hard 3. Discrepant metric logic)
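To make the "contract" idea concrete, here is a minimal sketch of checking a record against a schema written in Avro's JSON schema syntax. The schema itself (namespace, field set) is an assumption for illustration, and the validator is a toy stand-in that only handles a few primitive types; a real pipeline would rely on an Avro library for encoding and validation.

```python
import json

# Avro schemas are written in JSON. This is a hypothetical, trimmed-down
# PageViewEvent; the namespace and field list are assumptions.
PAGE_VIEW_EVENT_SCHEMA = json.loads("""
{
  "type": "record",
  "name": "PageViewEvent",
  "namespace": "example.tracking",
  "fields": [
    {"name": "memberId", "type": "long"},
    {"name": "time", "type": "long"},
    {"name": "appName", "type": "string"},
    {"name": "pageKey", "type": "string"}
  ]
}
""")

# Toy mapping from a few Avro primitive type names to Python types.
# A real Avro library also handles unions, defaults, nested records, etc.
_PRIMITIVES = {"long": int, "int": int, "string": str, "boolean": bool}


def contract_violations(record: dict, schema: dict) -> list:
    """Return human-readable reasons why `record` does not match `schema`."""
    problems = []
    declared = {f["name"]: f["type"] for f in schema["fields"]}
    for name, avro_type in declared.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], _PRIMITIVES[avro_type]):
            problems.append(f"wrong type for {name}: expected {avro_type}")
    for name in record:
        if name not in declared:
            problems.append(f"undeclared field: {name}")
    return problems


good = {"memberId": 12345, "time": 1454745292951,
        "appName": "LinkedIn", "pageKey": "profile_page"}
bad = {"memberId": "12345", "time": 1454745292951,
       "pageKey": "profile_page", "page": "x"}

print(contract_violations(good, PAGE_VIEW_EVENT_SCHEMA))
print(contract_violations(bad, PAGE_VIEW_EVENT_SCHEMA))
```

With a schema in place, a producer that changes a type or drops a field is rejected at write time instead of breaking a downstream script later.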
  • 28. Sometimes you need a committee: Data Model Review Committee (DMRC). Who and what: leads from product and infra teams review each new data model and ensure that it follows our conventions, patterns and best practices across the entire data lifecycle. Evolution: tooling to codify conventions; "Always be reviewing". (Problems: 1. No contracts 2. Naming things is hard 3. Discrepant metric logic)
  • 29. Unified Metrics Platform: a single source of truth for all business metrics at LinkedIn. It consists of a metrics processing platform as a service, a metrics computation template, and a set of tools and process to facilitate the metrics life-cycle. Workflow steps: 1. iterate, 2. create, 3. review, 4. check in. Diagram labels: Metric Owner, Sandbox, Metric Definition, Central Team, Relevant Stakeholders, Code Repo, Build & Deploy, System Jobs, Core Metrics Job. 5,000 metrics daily. (Problems: 1. No contracts 2. Naming things is hard 3. Discrepant metric logic)
  • 30. Unified Metrics Platform: Pipeline. Tracking + Database + Other data; Raw Data; Metrics Logic; UMP Harness (incremental, aggregate, backfill, auto-join); HDFS Aggregated Data; Pinot; Raptor dashboards; Experiment Analysis; Machine Learning; Anomaly Detection; HDFS Ad-hoc. (Problems: 1. No contracts 2. Naming things is hard 3. Discrepant metric logic)
  • 31. Tracking Platform: standardizing production. Schema compatibility, Time, Audit. Components: Client-side Tracking, Tracking Frontend, Services, Tools, Kafka.
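The slides do not say which compatibility rules the Tracking Platform enforces. As one illustration of a schema compatibility check, the sketch below applies a rule from Avro schema resolution: a reader on a newer schema can decode older data only if every field it adds declares a default. The function name and the example schemas are hypothetical.

```python
def added_fields_without_default(old_schema: dict, new_schema: dict) -> list:
    """List fields that are new in `new_schema` and declare no default.

    Under Avro schema resolution, such fields make old data unreadable
    with the new schema, so a non-empty result means the change is not
    backward compatible.
    """
    old_names = {f["name"] for f in old_schema["fields"]}
    return [
        f["name"]
        for f in new_schema["fields"]
        if f["name"] not in old_names and "default" not in f
    ]


old = {"type": "record", "name": "PageViewEvent", "fields": [
    {"name": "memberId", "type": "long"},
    {"name": "pageKey", "type": "string"},
]}

# Adds a field with a default: old records can still be read.
new_ok = {"type": "record", "name": "PageViewEvent", "fields": [
    {"name": "memberId", "type": "long"},
    {"name": "pageKey", "type": "string"},
    {"name": "appName", "type": "string", "default": "unknown"},
]}

# Adds a required field with no default: old records cannot be read.
new_bad = {"type": "record", "name": "PageViewEvent", "fields": [
    {"name": "memberId", "type": "long"},
    {"name": "pageKey", "type": "string"},
    {"name": "viewType", "type": "string"},
]}

print(added_fields_without_default(old, new_ok))
print(added_fields_without_default(old, new_bad))
```

A check of this kind can run when a schema change is proposed, so an incompatible change is caught before any producer ships it.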
  • 32. Data Infra + Platforms 2.0 (circa 2015): At Rest, In Motion, Processing Frameworks, Query Engines, Pinot. Production: Tracking Platform. Consumption: Unified Metrics Platform (UMP).
  • 33. What was still causing unhappiness? 1. Old bad data sticks around (e.g. old mobile app versions). 2. No clear contract for data production: producers unaware of consumers' concerns. 3. Never a good time to pay down this tech debt. We started from the bottom.
  • 34. 3. Never a good time to pay down this "data" debt. Features are waiting to ship to members... some of this stuff is invisible. But what is the cost of not doing it? #victimsOfTheData -> #DataScienceHappiness via proactively forging our own data destiny. Teams: product or app teams (PMs, developers, TestEng); infra teams (Hadoop, Kafka, DWH, ...); data science teams (analytics, ML engineers, ...). User engagement, tracking data, metric scripts, production code. Member-facing data products (members); business-facing decision making (execs).
  • 35. The Big Problem Opportunity in 2015: launch a completely rewritten LinkedIn mobile app.
  • 36. We already wanted to move to better data models
 PageViewEvent { "header" : { "memberId" : 12345, "time" : 1454745292951, "appName" : { "string" : "LinkedIn" }, "pageKey" : "profile_page" }, "trackingInfo" : { ["Viewee" : "23456"], ... } }
 ProfileViewEvent { "header" : { "memberId" : 12345, "time" : 4745292951145, "appName" : { "string" : "LinkedIn" }, "pageKey" : "profile_page" }, "entityView" : { "viewType" : "profile-view", "viewerId" : "12345", "vieweeId" : "23456" } }
 Annotation: viewee_ID
  • 37. There were two options:
 1. Keep the old tracking. a. Cost: producers (try to) replicate it (write bad old code from scratch). b. Save: consumers avoid migrating.
 2. Evolve. a. Cost: time on clean data modeling, and on consumer migration to new tracking events. b. Save: pays down data modeling tech debt.
  • 38. How much work would it be?
 1. Keep the old tracking. a. Cost: producers (try to) replicate it (write bad old code from scratch). b. Save: consumers avoid migrating.
 2. Evolve. a. Cost: time on clean data modeling, and on consumer migration to new tracking events. b. Save: pays down data modeling tech debt.
 #DataScienceHappiness
  • 39. We pitched it to our leadership team (CEO, CTO): "Do it!"
  • 40. 2. No clear contract existed for data production. Producers were unaware of consumers' needs and were "throwing data over the wall". Even with Avro, schema adherence != semantic equivalence. Own the artifact that feeds the data ecosystem (and data scientists!). #victimsOfTheData -> #DataScienceHappiness, via proactive joint requirements definition. Data producers (PMs, app developers); data consumers. User engagement, tracking data, metric scripts, production code. Member-facing data products; business-facing decision making.
  • 41. 2a. Ensure dialogue between Producers & Consumers • Awareness: Train about end-to-end data pipeline, data modeling • Instill communication & collaborative ownership process between all: a step-by-step playbook for who & how to develop and own tracking
  • 42. 2b. Standardized core data entities • Event types and names: Page, Action, Impression • Framework-level client-side tracking: views, clicks, flows • For all else (custom): guide when to create a new Event. Labels: Navigation, Page View, Control Interaction.
  • 43. 2c. Created clear, maintainable data production contracts. Tracking specification with monitoring and alerting for adherence: a clear, visual, consistent contract. Need tooling to support culture and process shift: "Always be tooling". Tracking specification tool.
  • 44. 1. Old bad data sticks around
 PageViewEvent { "header" : { "memberId" : 12345, "time" : 1454745292951, "appName" : { "string" : "LinkedIn" }, "pageKey" : "profile_page" }, "trackingInfo" : { ["vieweeID" : "23456"], ... } }
 ProfileViewEvent { "header" : { "memberId" : 12345, "time" : 4745292951145, "appName" : { "string" : "LinkedIn" }, "pageKey" : "profile_page" }, "entityView" : { "viewType" : "profile-view", "viewerId" : "12345", "vieweeId" : "23456" } }
  • 45. How do we handle old and new? Producers: PageViewEvent (old), ProfileViewEvent (new). Consumers: Relevance, Analytics.
  • 46. The Big Challenge. Our (Pig) scripts were doing: load '/data/tracking/PageViewEvent' using AvroStorage(); (My Raw Data)
  • 47. We need "microservices" for data: My Raw Data, My Data API.
  • 48. The Database community solved this decades ago... Views!
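A minimal sketch of the view idea, using Python's built-in `sqlite3` module as a stand-in for a real warehouse. The table and view names are hypothetical and the columns are heavily simplified; this shows the general database technique, not how Dali is implemented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Old-style events: generic page views; profile views are picked out by page_key.
    CREATE TABLE page_view_event (member_id INTEGER, page_key TEXT, viewee TEXT);
    -- New-style events: a dedicated profile view event with typed ids.
    CREATE TABLE profile_view_event (viewer_id INTEGER, viewee_id INTEGER);

    INSERT INTO page_view_event VALUES (12345, 'profile_page', '23456');
    INSERT INTO page_view_event VALUES (12345, 'feed_page', NULL);
    INSERT INTO profile_view_event VALUES (67890, 23456);

    -- The view is the contract consumers read; the tables behind it can change.
    CREATE VIEW unified_profile_view AS
        SELECT member_id AS viewer_id, CAST(viewee AS INTEGER) AS viewee_id
        FROM page_view_event
        WHERE page_key = 'profile_page'
        UNION ALL
        SELECT viewer_id, viewee_id FROM profile_view_event;
""")

rows = conn.execute(
    "SELECT viewer_id, viewee_id FROM unified_profile_view ORDER BY viewer_id"
).fetchall()
print(rows)
```

Consumers query `unified_profile_view` and never learn which physical event a row came from, so the old event can be retired later by changing only the view definition.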
  • 49. We built Dali to solve this: a Data Access Layer for LinkedIn. Abstract away underlying physical details to allow users to focus solely on the logical concerns. Logical Tables + Views; Logical FileSystem.
  • 52. Dali: Implementation Details in Context. Dali FileSystem; Processing Engine (MR, Spark); Dali Datasets (Tables + Views); Dataflow APIs (MR, Spark, Scalding); Query Layers (Pig, Hive, Spark); Dali CLI; Data Catalog; Git + Artifactory; View Def + UDFs; Dataset Owner; Data Source; Data Sink.
  • 53. From: load '/data/tracking/PageViewEvent' using AvroStorage(); To: load 'tracking.UnifiedProfileView' using DaliStorage(); One small step for a script.
  • 54. A Few Hard Problems. Versioning: views and UDFs; mapping to Hive metastore entities. Development lifecycle: Git as source of truth; Gradle for build; LinkedIn tooling integration for deployment.
  • 55. State of the world today: ~300 views. Pretty much all new UMP metrics use Dali data sources: ProfileViews, MessagesSent, Searches, InvitationsSent, ArticlesRead, JobApplications, ... (At Rest Data, Processing Frameworks)
  • 56. Now brewing: Dali on Kafka. Can we take the same views and run them seamlessly on Kafka as well? Stream data; standard streaming APIs: Samza System Consumer, Kafka Consumer.
  • 57. What's next for Dali? Selective materialization. Open source. Hive is an implementation detail, not a long-term bet.
  • 58. Dali: When are we done dreaming? At Rest, In Motion, Data Processing Frameworks, Dali.
  • 59. Data Infra + Platforms 3.0 (circa 2017): At Rest, In Motion, Processing Frameworks, Query Engines, Pinot, Tracking Platform, Unified Metrics Platform (UMP), Dali, Dr. Elephant, WhereHows.
  • 60. Did we succeed? We just handled another huge rewrite! #DataScienceHappiness
  • 62. Our Journey towards #DataScienceHappiness
 2010 ->: Basic data infrastructure for data democracy (Kafka, Hadoop, Gobblin, Pinot)
 2013 ->: Platforms, process to standardize produce + consume (Tracking, UMP, DMRC)
 2015 ->: Tech + process to sustain healthy data ecosystem (Dali, Dialogue)
 2015 ->: Evangelize investing in #DataScienceHappiness
  • 63. The Pursuit of #DataScienceHappiness A original @yaelgarten @shirshanka Thank You! to be continued…