This presentation gives an overview of the Apache Airflow project. It explains Apache Airflow in terms of its pipelines, tasks, integration, and UI.


What Is Apache Airflow
● A workflow management platform
● Uses Python-based workflows
● Schedule by time or event
● Open source, Apache 2.0 license
● Written in Python
● Monitor workflows in the UI
● Has a wide range of integration options
● Originally developed at Airbnb

What Is Apache Airflow
● Uses SQLite as the default back-end DB but can use
  – MySQL, Postgres, JDBC, etc.
● Install extra packages using the pip command
  – Wide variety available, including
  – Many databases, cloud services
  – Hadoop ecosystem
  – Security, web services, queues
  – Many more

Airflow Pipelines
● These are Python-based workflows
● Are actually directed acyclic graphs (DAGs)
● Pipelines use Jinja templating
● Pipelines contain user-defined tasks
● Tasks can run on different workers at different times
● Jinja scripts can be embedded in tasks
● Comments can be added to tasks in varying formats
● Inter-task dependencies can be defined
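
To make the DAG idea concrete, here is a minimal sketch of how inter-task dependencies determine which tasks can run, and which can run in parallel on different workers. It uses only Python's standard library (`graphlib`, Python 3.9+), not Airflow's own API; the task names and the `parallel_batches` helper are hypothetical and exist only for illustration.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline: each key is a task, each value is the set of
# upstream tasks that must finish before that task may start.
dependencies = {
    "extract": set(),
    "clean_orders": {"extract"},
    "clean_users": {"extract"},
    "load": {"clean_orders", "clean_users"},
}


def parallel_batches(deps):
    """Group tasks into batches; tasks in one batch have no unmet dependencies."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises CycleError if the graph is not acyclic
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only to make output stable
        batches.append(ready)
        sorter.done(*ready)  # mark the batch finished so downstream tasks unlock
    return batches


batches = parallel_batches(dependencies)
print(batches)
# [['extract'], ['clean_orders', 'clean_users'], ['load']]

# A cycle makes the graph invalid as a DAG, so no run order exists.
try:
    parallel_batches({"a": {"b"}, "b": {"a"}})
    cycle_detected = False
except CycleError:
    cycle_detected = True
print(cycle_detected)
# True
```

The two cleaning tasks land in the same batch because neither depends on the other, which is what allows a scheduler to hand them to different workers at the same time.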


Airflow Tasks
● Tasks have a life cycle
● Tasks use operators to execute; the operator depends on the task type
  – For instance MySqlOperator
● Hooks are used to access external systems, e.g. databases
● Worker-specific queues can be used for tasks
● XCom allows tasks to exchange messages
● Pipelines or DAGs allow
  – Branching
  – SubDAGs
  – Service level agreements (SLAs)
  – Trigger rules
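
The message exchange between tasks can be pictured with a small sketch. This is a toy in-memory model, not Airflow's XCom implementation: the `MessageStore` class, the task functions, and the `row_count` key are all hypothetical, and the choice to serialize values as JSON is an assumption made here to show that tasks pass small messages through a shared store rather than sharing objects in memory.

```python
import json


class MessageStore:
    """Toy stand-in for a shared message store keyed by (task_id, key)."""

    def __init__(self):
        self._rows = {}

    def push(self, task_id, key, value):
        # Serialize so only plain, message-like data can be exchanged.
        self._rows[(task_id, key)] = json.dumps(value)

    def pull(self, task_id, key, default=None):
        raw = self._rows.get((task_id, key))
        return default if raw is None else json.loads(raw)


def extract(store):
    """Upstream task: publishes a small result for later tasks."""
    store.push("extract", "row_count", 42)


def report(store):
    """Downstream task: reads the value published by 'extract'."""
    count = store.pull("extract", "row_count")
    return f"extract produced {count} rows"


store = MessageStore()
extract(store)
message = report(store)
print(message)
# extract produced 42 rows
```

The two task functions never call each other; they only share the store, which is why they could run on different workers at different times.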

Airflow Task Stages
● Tasks have life cycle stages

Airflow Task Life Cycle
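
As a rough illustration of life cycle stages, the sketch below models a task instance as a small state machine. The state names are a subset of the task instance states used by Airflow, but the transition table is a simplified assumption made for this sketch and is not taken from the slides or from Airflow's scheduler code; `ALLOWED`, `advance`, and `run_path` are hypothetical names.

```python
# Simplified, assumed transition table: state -> states it may move to.
ALLOWED = {
    "none": {"scheduled", "upstream_failed", "skipped"},
    "scheduled": {"queued"},
    "queued": {"running"},
    "running": {"success", "failed", "up_for_retry"},
    "up_for_retry": {"scheduled"},
    # Terminal states in this sketch: no outgoing transitions.
    "success": set(),
    "failed": set(),
    "upstream_failed": set(),
    "skipped": set(),
}


def advance(current, new):
    """Move to a new state, rejecting transitions the table does not allow."""
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current} -> {new}")
    return new


def run_path(path):
    """Walk a sequence of states starting from 'none'; return the final state."""
    state = "none"
    for new in path:
        state = advance(state, new)
    return state


# A task that fails once, is retried, and then succeeds.
final = run_path([
    "scheduled", "queued", "running", "up_for_retry",
    "scheduled", "queued", "running", "success",
])
print(final)
# success
```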

Airflow UI
● Airflow UI provides views
  – DAG, Tree, Graph, Variables, Gantt chart
  – Task duration, Code view
● Select a task instance in any view to manage it
● Monitor and troubleshoot pipelines in views
● Monitor DAGs by owner, schedule, run time, etc.
● Use views to find pipeline problem areas
● Use views to find bottlenecks


Airflow Integration
● Airflow integrates with
  – Azure: Microsoft Azure
  – AWS: Amazon Web Services
  – Databricks
  – GCP: Google Cloud Platform
  – Cloud Speech Translate Operators
  – Qubole
● Kubernetes
  – Run tasks as pods

Airflow Metrics
● Airflow can send metrics to StatsD
  – A network daemon that runs on Node.js
  – Listens for statistics like counters, gauges, timers
  – Statistics sent over UDP or TCP
● Install metrics support using the pip command
● Specify which stats to record, e.g.
  – scheduler, executor, dagrun
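
To show what "statistics sent over UDP" looks like on the wire, here is a minimal sketch using only Python's standard library. It formats metrics in the plain-text StatsD line format (`<name>:<value>|<type>`, with `c` for counters, `g` for gauges, and `ms` for timers) and sends one to a local UDP socket that stands in for a StatsD daemon. The metric names and the `format_metric` and `send_metric` helpers are hypothetical; this is not how Airflow itself emits metrics, only an illustration of the protocol.

```python
import socket

# StatsD type codes for the three statistic kinds named above.
TYPE_CODES = {"counter": "c", "gauge": "g", "timer": "ms"}


def format_metric(name, value, kind):
    """Build one StatsD line, e.g. 'demo.tasks_started:1|c'."""
    return f"{name}:{value}|{TYPE_CODES[kind]}"


def send_metric(line, host, port):
    """Send a single metric line as one UDP datagram (fire and forget)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


# Local stand-in for a StatsD daemon: bind to any free loopback port.
listener = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
listener.bind(("127.0.0.1", 0))
listener.settimeout(2.0)
host, port = listener.getsockname()

line = format_metric("demo.tasks_started", 1, "counter")
send_metric(line, host, port)
received = listener.recvfrom(1024)[0].decode("ascii")
listener.close()

print(received)
# demo.tasks_started:1|c
```

A real daemon would conventionally listen on a fixed port rather than a random one; the random port here only keeps the sketch from colliding with anything already running locally.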

