Apache Airflow

Presented By: Divyansh Jain
Software Consultant
Knoldus Inc
Apache Airflow

Agenda
What is Airflow?
How it works?
Pros & Cons
UI Screenshots
Demo
Why use Airflow ?

c
Airflow is a platform to
programmatically author, schedule
and monitor workflows (a.k.a DAGs or
Directed Acyclic Graph)
What is Apache
Airflow?

Incubating
- Created @ Airbnb in 2015 by Maxime
Beauchemin
- 6100+ Forks
- 16k+ Github Stars
- 1k+ Contributors
- 150+ companies officially using it
- More than 8k committers

What is Airflow?
Pipelines are configured as
code, allowing for dynamic
pipeline generation
A platform to schedule
data pipelines
Easy to define our own
operators, executors
It’s all about DAGs
A platform to monitor and
control data pipelines

Why to use Airflow ?
When would you use a Workflow Scheduler like
Airflow?
- ETL Pipelines
- Machine Learning Projects
- Predictive Data Pipelines such as Fraud Detection etc.
- Job Scheduling

Why to use Airflow ?
What should a Workflow do well?
- Schedule your workflow plan
- Handle task failures
- Monitor Performance

Very Flexible
- Rich User Interface
- Easily Extensible
- Allow communication between task
- Backfill Control
- Efficient CLI

How the Job initiates?
Web Server
Scheduler
Worker Worker
Metadata
User
- A user manages/ schedules
DAGs using Airflow
- Airflow webserver stores
scheduling metadata in
metadata DB
- Scheduler picks up new
schedule and distributes work
over Celery or RabbitMQ
- Workers picks up Airflow Task
Note- Celery is used to manage
node and Redis or RabbitMQ for
communication

Pros
Easy-to-use user interface - Independent scheduler Active open-source
community
● can easily view task logs
● can easily view hierarchy,
statuses, and code runs
● allows you to easily
change task statuses,
rerun historical tasks
● comes with its own
scheduler
● supports multiple DAGs
● chatroom itself is very
active
● newbies can get their
questions answered within
a span of few hours.

Cons
Task optimization - No direct dealing with
- tasks
Lesser flexibility
● sometimes unclear of
how to organize multiple
tasks into the massive
pipeline
● doesn't deal with data sets
or files as inputs of tasks
directly
● If database is lost, harder
to restore
● Workers cannot start the
task independently or pick
tasks flexibly.

Apache Airflow

Related slideshows

More Related Content

Apache Airflow