Building and managing complex dependencies pipeline using Apache Oozie
Purshotam Shah (purushah@yahoo-inc.com)
Sr. Software Engineer, Yahoo Hadoop team
Apache Oozie PMC member and committer
Agenda
1. Oozie at Yahoo
2. Data Pipelines
3. SLA and Monitoring
4. Monitoring Limitations and User monitoring systems
5. Future Work
Why Oozie?
• Out-of-box support for multiple job types
  – Java, shell, distcp
  – MapReduce
    • Pipes, streaming
  – Pig, Hive, Spark
• Highly scalable
• High availability
  – Hot-Hot with rolling upgrades
  – Load balanced
• Hue integration
[Diagram: Oozie alongside HBase, Pig, Hive, Spark, YARN, HDFS, Hue, and HCatalog]
Deployment Architecture at Yahoo
[Diagram: a submit request reaches a load balancer, which redirects it to Oozie Server 1 or Oozie Server 2. The two servers communicate with each other for log streaming, sharelib updates, etc., use ZooKeeper (via Curator) for lock management, and connect to Oracle RAC and to the Hadoop cluster, HBase, and HCatalog. Security labels on the connections: https + kerberos / cookie-based auth, https + kerberos, kerberos.]
Scale at Yahoo
Deployed on all clusters (production, non-production)
One instance per cluster
75 products / 2000+ projects
255 monthly users
90,00 workflow jobs daily (June 2016, one busy cluster)
Between 1 and 8 actions per workflow (avg. 4 actions/workflow)
Extreme use case: 100-200 workflow jobs submitted per minute
2,277 coordinator jobs daily (June 2016, one busy cluster)
Frequency: 5, 10, 15 mins, hourly, daily, weekly, monthly (25%: < 15 min)
99% of workflow jobs are kicked off by coordinators
97 bundle jobs daily (June 2016, one busy cluster)
Data Pipelines
Ad Exchange
Ad Latency
Search Advertising
Content Management
Content Optimization
Content Personalization
Flickr Video
Audience Targeting
Behavioral Targeting
Partner Targeting
Retargeting
Web Targeting
Advertisement Content Targeting
Data Pipelines
Anti Spam
Content
Retargeting
Research
Dashboards & Reports
Forecasting
Email Data Intelligence Data Management
Audience Pipeline
Use Case - Data pipeline
Large Scale Data Pipeline Requirements
• Administrative
  – One should be able to start, stop, and pause all related pipelines, or a subset of them, at the same time
• Dependency Management
  – BCP support
  – Data is not guaranteed: start processing even if only partial data is available
  – Mandatory and optional feeds
Large Scale Data Pipeline Requirements
• Multiple Providers
  – If data is available from multiple providers, I want to specify the provider priority
  – Combining datasets from multiple providers
• SLA Management
  – Monitor pipeline processing to take immediate action in case of failures or SLA misses
  – Pipeline owners should be notified if an SLA is missed
Bundle
• The Bundle system lets the user define and execute a loosely coupled set of coordinators. The coordinators may depend on each other, but the dependency is enforced only through their inputs and outputs.
• A bundle can be used to start/stop/suspend/resume/rerun the whole pipeline
Complex dependencies
OOZIE-1976 : Specifying coordinator input datasets in more logical ways
BCP Support
Pull data from A or B by specifying the dataset as AorB. The action starts running as soon as
either dataset A or dataset B is available.
<input-logic>
  <or name="AorB">
    <data-in dataset="A"/>
    <data-in dataset="B"/>
  </or>
</input-logic>
Minimum availability processing
• Sometimes we want to process even if only partial data is available.
<input-logic>
  <data-in dataset="A" min="4"/>
</input-logic>
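The effect of min can be pictured with a small Python sketch. It only models the rule described on the slide (start once at least a minimum number of instances exist); the function name and the list-of-booleans representation are illustrative assumptions, not Oozie data structures.

```python
def ready_with_min(instances_available, minimum):
    """Return True once at least `minimum` dataset instances are available.

    `instances_available` holds one boolean per expected instance
    (a hypothetical representation chosen for this sketch).
    """
    return sum(1 for ok in instances_available if ok) >= minimum


# Six expected instances of dataset A, four of which have arrived.
print(ready_with_min([True, True, False, True, True, False], 4))
```

With min="4" the check passes here because four of the six instances are present, even though two are still missing.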
Optional feeds
• Dataset B is optional. Oozie will start processing as soon as A is available, and will include
the data from A plus whatever is available from B.
<input-logic>
  <and name="optional">
    <data-in dataset="A"/>
    <data-in dataset="B" min="0"/>
  </and>
</input-logic>
Priority Among Dataset Instances
A takes precedence over B, and B takes precedence over C.
<input-logic>
  <or name="AorBorC">
    <data-in dataset="A"/>
    <data-in dataset="B"/>
    <data-in dataset="C"/>
  </or>
</input-logic>
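The precedence rule can be sketched in a few lines of Python. This is a model of the behavior stated on the slide, not Oozie code; the function name and the set-based availability check are assumptions made for illustration.

```python
def pick_by_priority(datasets_in_order, available):
    """Return the first dataset, in declaration order, that is available.

    `datasets_in_order` mirrors the order of <data-in> elements inside <or>;
    `available` is a set of dataset names that currently have data.
    """
    for name in datasets_in_order:
        if name in available:
            return name
    return None  # nothing is available yet, so the action keeps waiting


# B and C are available, A is not: B wins because it is declared before C.
print(pick_by_priority(["A", "B", "C"], {"B", "C"}))
```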
Wait for primary
Sometimes we want to give preference to the primary data source and switch to the secondary
only after waiting for a specific amount of time.
<input-logic>
  <or name="AorB">
    <data-in dataset="A" wait="120"/>
    <data-in dataset="B"/>
  </or>
</input-logic>
Combining Dataset From Multiple Providers
The combine function first checks the instances from A, then goes to B for whatever is
missing in A.
<data-in name="A" dataset="dataset_A">
  <start-instance>${coord:current(-5)}</start-instance>
  <end-instance>${coord:current(-1)}</end-instance>
</data-in>
<data-in name="B" dataset="dataset_B">
  <start-instance>${coord:current(-5)}</start-instance>
  <end-instance>${coord:current(-1)}</end-instance>
</data-in>
<input-logic>
  <combine name="AB">
    <data-in dataset="A"/>
    <data-in dataset="B"/>
  </combine>
</input-logic>
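A minimal Python sketch of the combine rule as described above: for each instance in the range, take it from A if A has it, otherwise from B. The function name, the integer offsets standing in for coord:current(-5) through coord:current(-1), and the set-based availability inputs are all assumptions for illustration only.

```python
def combine(instances, available_a, available_b):
    """Resolve each instance to provider "A", "B", or None (still missing).

    A is always preferred; B only fills the gaps left by A.
    """
    resolved = {}
    for inst in instances:
        if inst in available_a:
            resolved[inst] = "A"
        elif inst in available_b:
            resolved[inst] = "B"
        else:
            resolved[inst] = None
    return resolved


# Offsets -5..-1 stand in for coord:current(-5) .. coord:current(-1).
instances = [-5, -4, -3, -2, -1]
print(combine(instances, available_a={-5, -4, -2}, available_b={-4, -3, -1}))
```

In this example instance -4 exists in both providers and is taken from A, while -3 and -1 are filled from B.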
Monitoring
• Configure to receive notifications
  – Email action
  – HTTP notifications for job status change
  – Email notification for SLA misses
  – JMS notification for SLA events
• By polling
  – CLI/REST API monitoring
    • Single job monitoring
    • Bulk monitoring for bundles and coordinators
    • SLA monitoring
Monitoring
• An email action can be added to a workflow to send mail
• Job status change notification for coordinator actions
  – oozie.coord.action.notification.url
  – oozie.coord.action.notification.proxy
• Job status change notification for workflows
  – oozie.wf.workflow.notification.url
  – oozie.wf.workflow.notification.proxy
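As a rough illustration of the receiving side: the Oozie documentation describes replacing tokens such as $jobId and $status in the workflow notification URL before calling it. The sketch below assumes the URL was configured as http://monitor.example.com/notify?jobId=$jobId&status=$status (a hypothetical host and query layout) and shows how a receiver might pull the two values out of the incoming request URL. The job id in the example is made up.

```python
from urllib.parse import urlparse, parse_qs


def parse_notification(url):
    """Extract job id and status from a notification callback URL.

    Assumes the query parameters are named jobId and status, which is a
    choice made when configuring the URL, not something Oozie mandates.
    """
    query = parse_qs(urlparse(url).query)
    return {
        "job_id": query.get("jobId", [None])[0],
        "status": query.get("status", [None])[0],
    }


# Example callback after token substitution (illustrative values only).
example = "http://monitor.example.com/notify?jobId=0000001-160601000000000-oozie-W&status=KILLED"
print(parse_notification(example))
```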
Job Monitoring - polling
• Supported for both CLI and web service
• Single job monitoring
• Bulk job monitoring
  – Filter on multiple parameters, such as
    • bundle name, bundle id, username, startcreatedtime, endcreatedtime
  – Filter on multiple job statuses, such as
    • oozie jobs -bulk 'bundle=bundle-app-1;actionstatus=RUNNING;actionstatus=FAILED'
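For scripts that poll in bulk, the filter string can be assembled programmatically. The helper below is a hypothetical convenience function, not part of Oozie; it only joins the key=value pairs shown on the slide with semicolons.

```python
def bulk_filter(bundle, action_statuses=(), **extra):
    """Build a bulk filter string such as 'bundle=x;actionstatus=RUNNING'.

    `extra` accepts further filter keys from the slide, for example
    startcreatedtime or endcreatedtime, passed through unchanged.
    """
    parts = [f"bundle={bundle}"]
    parts += [f"actionstatus={status}" for status in action_statuses]
    parts += [f"{key}={value}" for key, value in extra.items()]
    return ";".join(parts)


print(bulk_filter("bundle-app-1", ["RUNNING", "FAILED"]))
# bundle=bundle-app-1;actionstatus=RUNNING;actionstatus=FAILED
```

When the result is passed to oozie jobs -bulk from a shell, it should be quoted, because an unquoted semicolon ends the command.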
SLA Monitoring
• Oozie can actively track SLAs on a job's
  – start time, end time, duration
• Access/filter SLA info via
  – Web-console dashboard
  – REST API
  – JMS messages
  – Email alert
SLA dashboard – tabular view
SLA dashboard – graph view
Monitoring Limitations
• User view
• BCP SLA support
• No color coding
• Paging/on-call
• Threshold
• Consolidated email
• Multi-grid view
Data pipeline monitoring use case from Y!
Case-1
• Set up a cron job that periodically pulls SLA information from Oozie
• If there is any SLA miss, a notification is sent to the internal monitoring system, which
  › pages and sends a mobile alert to the on-call person
  › sends an email alert
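The cron-driven check in Case-1 could look roughly like the Python sketch below. It covers only the filtering step and leaves out the HTTP fetch and the hand-off to the paging system. The JSON field names (slaSummaryList, slaStatus, id, appName) are assumptions about the shape of the SLA response and should be verified against the Oozie SLA REST documentation for the deployed version; the sample payload is invented.

```python
import json


def find_sla_misses(payload):
    """Return (job id, app name) pairs for entries whose SLA status is MISS.

    Field names are assumed, not confirmed; adjust them to the real response.
    """
    misses = []
    for entry in payload.get("slaSummaryList", []):
        if entry.get("slaStatus") == "MISS":
            misses.append((entry.get("id"), entry.get("appName")))
    return misses


# Invented sample standing in for a response pulled from Oozie by the cron job.
sample = json.loads("""
{"slaSummaryList": [
  {"id": "job-1", "appName": "audience-hourly", "slaStatus": "MET"},
  {"id": "job-2", "appName": "billing-daily", "slaStatus": "MISS"}
]}
""")

for job_id, app_name in find_sla_misses(sample):
    # A real script would notify the internal monitoring system here.
    print(f"SLA miss: {app_name} ({job_id})")
```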
Case-2
• Divided into four sections
  – SLA Details
  – Error jobs
  – Long Running Jobs
  – Running jobs
SLA information
SLA-status
Long Waiting jobs
Long Waiting jobs – missing dependencies
Error Jobs
Running job details
Job explorer
Feeds - jobs
Validation job
• The data pipeline also periodically runs validation jobs to validate its output.
• Different pipelines have different validation requirements. One example of a validation
job is validating the number of click impressions against billing details.
Alert
Reprocessing
• One of the biggest requirements of a pipeline is the ability to reprocess the whole
dependent DAG.
• Oozie does not track data dependencies between coordinator jobs.
• This makes it very difficult to rerun the whole pipeline for a particular
nominal time.
Reprocessing
• To work around this Oozie limitation, the pipeline team built a job dependency DAG.
• It is very similar to the job explorer -> feed lookup feature.
• The job explorer -> feed lookup is based on the output produced by
coordinator jobs.
• The job dependency DAG is based on the inputs to jobs.
• Currently there is no UI for this; they parse Oozie jobs daily and store the
dependencies in a text file.
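One way to picture such an input-based dependency DAG is the Python sketch below. It is not the pipeline team's implementation: the job names, dataset names, and the dictionary layout are invented, and the traversal simply collects every job downstream of a failed one.

```python
from collections import defaultdict, deque


def build_downstream(jobs):
    """Map each job to the jobs that consume any dataset it produces.

    `jobs` has the form {job_name: {"inputs": [...], "outputs": [...]}}.
    """
    producers = defaultdict(set)
    for name, spec in jobs.items():
        for dataset in spec["outputs"]:
            producers[dataset].add(name)
    downstream = defaultdict(set)
    for name, spec in jobs.items():
        for dataset in spec["inputs"]:
            for producer in producers[dataset]:
                downstream[producer].add(name)
    return downstream


def jobs_to_rerun(jobs, failed):
    """Return the failed job plus all transitive consumers, breadth-first."""
    downstream = build_downstream(jobs)
    seen, order, queue = {failed}, [failed], deque([failed])
    while queue:
        for nxt in sorted(downstream[queue.popleft()]):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


# Hypothetical pipeline: ingest feeds two consumers, one of which feeds a report.
jobs = {
    "ingest": {"inputs": [], "outputs": ["clicks"]},
    "sessionize": {"inputs": ["clicks"], "outputs": ["sessions"]},
    "audience": {"inputs": ["clicks"], "outputs": []},
    "billing_report": {"inputs": ["sessions"], "outputs": []},
}
print(jobs_to_rerun(jobs, "ingest"))
```

The breadth-first order is convenient for listing affected jobs but is not guaranteed to be a valid execution order for every DAG; a topological sort would be needed for that.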
Reprocessing
• Rerun the failed action and all dependent coordinator jobs.
  – Pro: easy to do
  – Con: difficult to monitor
• Create a new coordinator for the time range that failed.
  – Pro: easy to monitor
Consolidated SLA Monitoring
Future Work
• Oozie unit testing framework
  – No unit tests now; testing is done directly by running in staging
• Coordinator dependency management
  – Better reprocessing
• Aperiodic and incremental processing
  – Currently managed through workarounds
Oozie BOF at Ballroom B
THANK YOU
Purshotam Shah (purushah@yahoo-inc.com)
Sr. Software Engineer, Yahoo Hadoop team
