Building and managing complex dependencies pipeline using Apache Oozie

Purshotam Shah (purushah@yahoo-inc.com)
Sr. Software Engineer, Yahoo Hadoop team
Apache Oozie PMC member and committer

Agenda
1. Oozie at Yahoo
2. Data Pipelines
3. SLA and Monitoring
4. Monitoring Limitations and User Monitoring Systems
5. Future Work

Why Oozie?
• Out-of-box support for multiple job types
  – Java, shell, distcp
  – MapReduce
    › Pipes, streaming
  – Pig, Hive, Spark
• Highly scalable
• High availability
  – Hot-hot with rolling upgrades
  – Load balanced
• Hue integration
[Diagram: Oozie alongside HBase, Pig, Hive, Spark, YARN, HDFS, Hue, and HCatalog]

Deployment Architecture at Yahoo
[Architecture diagram. Components: load balancer, Oozie Server 1, Oozie Server 2, ZooKeeper (via Curator, for lock management), Oracle RAC, and the Hadoop cluster with HBase and HCatalog. Flows: submit request to the load balancer, then request redirection to an Oozie server; inter-server communication for log streaming, sharelib updates, etc. Security labels on the connections: https + Kerberos / cookie-based auth; https + Kerberos; Kerberos.]

Scale at Yahoo
• Deployed on all clusters (production and non-production)
• One instance per cluster
• 75 products / 2000+ projects
• 255 monthly users
• 90,00 workflow jobs daily (June 2016, one busy cluster)
  – Between 1 and 8 actions; average 4 actions per workflow
  – Extreme use case: 100-200 workflow jobs submitted per minute
• 2,277 coordinator jobs daily (June 2016, one busy cluster)
  – Frequency: 5, 10, 15 minutes, hourly, daily, weekly, monthly (25% under 15 minutes)
  – 99% of workflow jobs are kicked off from coordinators
• 97 bundle jobs daily (June 2016, one busy cluster)


Data Pipelines
Ad Exchange
Ad Latency
Search Advertising
Content Management
Content Optimization
Content Personalization
Flickr Video
Audience Targeting
Behavioral Targeting
Partner Targeting
Retargeting
Web Targeting
Advertisement Content Targeting

Data Pipelines
Anti Spam
Content
Retargeting
Research
Dashboards & Reports
Forecasting
Email Data Intelligence Data Management
Audience Pipeline

Large Scale Data Pipeline Requirements
• Administrative
  – One should be able to start, stop, and pause all related pipelines, or a subset of them, at the same time
• Dependency management
  – BCP support
  – Data availability is not guaranteed; start processing even if only partial data is available
  – Mandatory and optional feeds

Large Scale Data Pipeline Requirements
• Multiple providers
  – If data is available from multiple providers, the user should be able to specify provider priority
  – Combining datasets from multiple providers
• SLA management
  – Monitor pipeline processing so that immediate action can be taken on failures or SLA misses
  – Pipeline owners should be notified when an SLA is missed

Bundle
• The Bundle system lets the user define and execute a set of loosely coupled coordinators. The coordinators may depend on each other, but the dependency is enforced only through their inputs and outputs.
• A bundle can be used to start/stop/suspend/resume/rerun the whole pipeline
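
To illustrate how the coordinators of one pipeline can be grouped, a bundle definition might look like the sketch below. The bundle name, the coordinator names, the application paths, and the `kickOffTime` and `appRoot` parameters are hypothetical placeholders, not taken from the Yahoo deployment.

```xml
<!-- Sketch of a bundle grouping two loosely coupled coordinators.
     All names, paths, and ${...} parameters are hypothetical. -->
<bundle-app name="example-pipeline" xmlns="uri:oozie:bundle:0.2">
  <controls>
    <!-- Time at which the bundle starts its coordinators -->
    <kick-off-time>${kickOffTime}</kick-off-time>
  </controls>
  <!-- Produces the dataset that the second coordinator consumes -->
  <coordinator name="ingest-coord">
    <app-path>${appRoot}/ingest-coord</app-path>
  </coordinator>
  <!-- Depends on ingest-coord only through its input dataset -->
  <coordinator name="aggregate-coord">
    <app-path>${appRoot}/aggregate-coord</app-path>
  </coordinator>
</bundle-app>
```

Nothing in the bundle itself orders the two coordinators; the ordering comes from the datasets each one reads and writes. Suspending, resuming, or killing the bundle job then applies to every coordinator it contains.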

Complex dependencies
OOZIE-1976 : Specifying coordinator input datasets in more logical ways

BCP Support
Pull data from A or B by specifying the dataset as AorB. The action starts running as soon as either dataset A or dataset B is available.
<input-logic>
  <or name="AorB">
    <data-in dataset="A"/>
    <data-in dataset="B"/>
  </or>
</input-logic>
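
For context, the sketch below shows where an `<input-logic>` block of this kind could sit inside a complete coordinator definition and how the workflow might read the resolved input. The coordinator name, dataset names, URI templates, the `${...}` parameters, the `inputDir` property, and the schema version are assumptions made for illustration; check them against the Oozie release in use.

```xml
<!-- Sketch only: names, paths, parameters, and schema version are assumptions. -->
<coordinator-app name="bcp-coord" frequency="${coord:hours(1)}"
                 start="${startTime}" end="${endTime}" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.5">
  <datasets>
    <!-- Same feed published by two providers (primary and backup) -->
    <dataset name="dataset_A" frequency="${coord:hours(1)}"
             initial-instance="${initialInstance}" timezone="UTC">
      <uri-template>${primaryRoot}/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
    </dataset>
    <dataset name="dataset_B" frequency="${coord:hours(1)}"
             initial-instance="${initialInstance}" timezone="UTC">
      <uri-template>${backupRoot}/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="A" dataset="dataset_A">
      <instance>${coord:current(0)}</instance>
    </data-in>
    <data-in name="B" dataset="dataset_B">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <!-- The logic refers to the data-in names declared above -->
  <input-logic>
    <or name="AorB">
      <data-in dataset="A"/>
      <data-in dataset="B"/>
    </or>
  </input-logic>
  <action>
    <workflow>
      <app-path>${workflowAppPath}</app-path>
      <configuration>
        <property>
          <!-- Resolves to whichever input satisfied the "or" -->
          <name>inputDir</name>
          <value>${coord:dataIn('AorB')}</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>
```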

Minimum availability processing
• Sometimes we want to start processing even if only part of the data is available. Here the action can start once at least 4 instances of dataset A are available.
<input-logic>
  <data-in dataset="A" min="4"/>
</input-logic>

Optional feeds
• Dataset B is optional. Oozie starts processing as soon as A is available and includes the data from A plus whatever is available from B.
<input-logic>
  <and name="optional">
    <data-in dataset="A"/>
    <data-in dataset="B" min="0"/>
  </and>
</input-logic>

Priority Among Dataset Instances
A has precedence over B, and B has precedence over C.
<input-logic>
  <or name="AorBorC">
    <data-in dataset="A"/>
    <data-in dataset="B"/>
    <data-in dataset="C"/>
  </or>
</input-logic>

Wait for primary
Sometimes we want to give preference to the primary data source and switch to the secondary only after waiting for a specific amount of time.
<input-logic>
  <or name="AorB">
    <data-in dataset="A" wait="120"/>
    <data-in dataset="B"/>
  </or>
</input-logic>

Combining Dataset From Multiple Providers
The combine function first checks the instances from A, then goes to B for whatever is missing in A.
<data-in name="A" dataset="dataset_A">
  <start-instance>${coord:current(-5)}</start-instance>
  <end-instance>${coord:current(-1)}</end-instance>
</data-in>
<data-in name="B" dataset="dataset_B">
  <start-instance>${coord:current(-5)}</start-instance>
  <end-instance>${coord:current(-1)}</end-instance>
</data-in>
<input-logic>
  <combine name="AB">
    <data-in dataset="A"/>
    <data-in dataset="B"/>
  </combine>
</input-logic>

Monitoring
• Configure to receive notifications
  – Email action
  – HTTP notifications for job status change
  – Email notification for SLA misses
  – JMS notification for SLA events
• By polling
  – CLI/REST API monitoring
    › Single job monitoring
    › Bulk monitoring for bundles and coordinators
    › SLA monitoring
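
As an example of the email action option listed above, a workflow node along the following lines could send mail when an earlier step fails. The node names, the transition target, and the address are hypothetical; this is a sketch, not the configuration used at Yahoo.

```xml
<!-- Sketch of a workflow node that sends mail.
     Node names and the address are hypothetical. -->
<action name="notify-failure">
  <email xmlns="uri:oozie:email-action:0.1">
    <to>oncall@example.com</to>
    <subject>Workflow ${wf:id()} failed</subject>
    <body>Failed node: ${wf:lastErrorNode()}</body>
  </email>
  <!-- Both transitions lead to a kill node assumed to be named "fail" -->
  <ok to="fail"/>
  <error to="fail"/>
</action>
```

The failing action's `<error>` transition would point at `notify-failure`, so the mail is sent before the workflow is killed.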

Monitoring
• An email action can be added to a workflow to send mail
• Job status change notification for coordinator actions
  – oozie.coord.action.notification.url
  – oozie.coord.action.notification.proxy
• Job status change notification for workflows
  – oozie.wf.workflow.notification.url
  – oozie.wf.workflow.notification.proxy
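
A minimal sketch of how these notification properties might be set in a job configuration follows. The host name and URL paths are hypothetical; `$jobId`, `$actionId`, and `$status` are placeholder tokens that Oozie substitutes before calling the URL.

```xml
<!-- Sketch of job configuration entries; the endpoint URLs are hypothetical. -->
<configuration>
  <property>
    <!-- Called on workflow job status changes -->
    <name>oozie.wf.workflow.notification.url</name>
    <value>http://monitor.example.com/wf?jobId=$jobId&amp;status=$status</value>
  </property>
  <property>
    <!-- Called on coordinator action status changes -->
    <name>oozie.coord.action.notification.url</name>
    <value>http://monitor.example.com/coord?actionId=$actionId&amp;status=$status</value>
  </property>
</configuration>
```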

Job Monitoring - polling
• Supported for both the CLI and the web service
• Single job monitoring
• Bulk job monitoring
  – Multiple filter parameters, such as bundle name, bundle id, username, startcreatedtime, endcreatedtime
  – Multiple job statuses, for example:
      oozie jobs -bulk 'bundle=bundle-app-1;actionstatus=RUNNING;actionstatus=FAILED'

SLA Monitoring
• Oozie can actively track SLAs on a job's
  – Start time, end time, duration
• Access/filter SLA info via
  – Web console dashboard
  – REST API
  – JMS messages
  – Email alerts
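
To show how start-time, end-time, and duration expectations can be declared, here is a sketch of a workflow carrying an SLA definition. The workflow name, the single fs action, the `nameNode` and `nominal_time` parameters, the thresholds, and the contact address are all illustrative assumptions.

```xml
<!-- Sketch only: names, thresholds, and the contact address are hypothetical. -->
<workflow-app name="sla-example" xmlns="uri:oozie:workflow:0.5"
              xmlns:sla="uri:oozie:sla:0.2">
  <start to="process"/>
  <action name="process">
    <fs>
      <mkdir path="${nameNode}/tmp/sla-example/output"/>
    </fs>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Workflow failed at ${wf:lastErrorNode()}</message>
  </kill>
  <end name="end"/>
  <sla:info>
    <!-- Time the job was supposed to run, usually passed in by the coordinator -->
    <sla:nominal-time>${nominal_time}</sla:nominal-time>
    <sla:should-start>${10 * MINUTES}</sla:should-start>
    <sla:should-end>${30 * MINUTES}</sla:should-end>
    <sla:max-duration>${30 * MINUTES}</sla:max-duration>
    <!-- Which misses trigger an alert, and who receives it -->
    <sla:alert-events>start_miss,end_miss,duration_miss</sla:alert-events>
    <sla:alert-contact>oncall@example.com</sla:alert-contact>
  </sla:info>
</workflow-app>
```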

SLA dashboard – tabular view

SLA dashboard – Graph view

Monitoring Limitations
• User view
• BCP SLA support
• No color coding
• Paging/on-call
• Threshold
• Consolidated email
• Multi-grid view

Data pipeline monitoring use case from Y!

Case-1
• Set up a cron job that periodically pulls SLA information from Oozie
• If there is any SLA miss, a notification is sent to the internal monitoring system, which
  › pages and sends a mobile alert to the on-call person
  › sends an email alert

Case-2
• Divided into four sections
  – SLA details
  – Error jobs
  – Long-running jobs
  – Running jobs

Long-waiting jobs – missing dependencies

Validation job
• Data pipelines also periodically run validation jobs to validate their output
• Different pipelines have different validation requirements. One example of a validation job is checking the number of click impressions against billing details.

Reprocessing
• One of the biggest requirements of a pipeline is the ability to reprocess the whole dependent DAG
• Oozie does not support any data dependencies
• This makes it very difficult to rerun the whole pipeline for a particular nominal time

Reprocessing
• To work around this Oozie limitation, the pipeline team built a job dependency DAG
• It is very similar to the job explorer -> feed lookup feature
  – Job explorer -> feed lookup is based on the output produced by coordinator jobs
  – The job dependency DAG is based on the input to jobs
• Currently there is no UI for this; the team parses Oozie jobs daily and stores the dependencies in a text file

Reprocessing
• Rerun the failed action and all dependent coordinator jobs
  – Easy to do
  – Cons: difficult to monitor
• Create a new coordinator for the timeline that failed
  – Easy to monitor

Future Work
• Oozie unit testing framework
  – No unit tests now; jobs are tested directly by running them in staging
• Coordinator dependency management
  – Better reprocessing
• Aperiodic and incremental processing
  – Currently managed through workarounds

THANK YOU
Purshotam Shah (purushah@yahoo-inc.com)
Sr. Software Engineer, Yahoo Hadoop team
