Open-source Workflow Orchestration Tool, where work is represented as DAGs
Airflow is NOT a data streaming solution; it's a workflow orchestrator
Overview: Scheduler → submits individual Tasks to the Executor, which assigns the tasks to Workers. Airflow's UI → the Webserver lets users connect, trigger runs, check logs, etc. for the tasks
DAG Directory → folder of DAG files, ready to be accessed and executed
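Where that DAG directory lives is configurable; a minimal sketch of the usual settings (the path below is just an example, the default is ~/airflow/dags):
# set the DAG directory in airflow.cfg (default location ~/airflow/airflow.cfg), [core] section:
#   dags_folder = /home/user/airflow/dags
# or override it with an environment variable (example path, adjust to your setup)
export AIRFLOW__CORE__DAGS_FOLDER=/home/user/airflow/dags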
Why use Airflow?
Open-source, free, fully Python, integrations (plug and play), easy to use, scalable (ready to scale to infinity because its modular architecture uses a message queue to orchestrate an arbitrary number of workers)
Advantages of representing Data Pipelines as DAGs
DAG is a Directed Acyclic Graph
All trees are DAGs, but not all DAGs are trees.
DAGs represent workflows: each node is a task and each edge is a dependency between tasks.
Tasks and Operators: a Task is a node in the DAG (one unit of work); an Operator is a template that defines what a task does (e.g., BashOperator, PythonOperator). See the sketch below.
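A minimal sketch of how tasks built from operators are chained into a DAG (the dag_id, bash commands, and schedule are made-up example values):
from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.utils.dates import days_ago

with DAG(dag_id='example_etl', start_date=days_ago(1), schedule_interval='@daily') as dag:
    extract = BashOperator(task_id='extract', bash_command='echo extract')
    transform = BashOperator(task_id='transform', bash_command='echo transform')
    load = BashOperator(task_id='load', bash_command='echo load')

    # >> declares a dependency: a directed edge in the graph (no cycles allowed)
    extract >> transform >> load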
Advantages of Workflows as code & Airflow UI:
Maintainable, version-able, collaborative, and testable! (You can put DAGs through unit tests too; see the sketch after this list)
Comes with a nice UI to trigger DAGs, check logs, see retries attempted, etc.
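Since DAG files are plain Python, they can be loaded and checked in a unit test. A minimal sketch using Airflow's DagBag with pytest (the dags/ folder path and the example_etl dag_id are assumptions for illustration):
from airflow.models import DagBag

def test_dags_import_without_errors():
    # import_errors is empty when every file in the DAG folder parses cleanly
    dag_bag = DagBag(dag_folder='dags/', include_examples=False)
    assert dag_bag.import_errors == {}

def test_example_etl_structure():
    dag_bag = DagBag(dag_folder='dags/', include_examples=False)
    dag = dag_bag.get_dag('example_etl')
    assert dag is not None
    assert len(dag.tasks) == 3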
# AIRFLOW COMMON COMMANDS
# check where airflow is installed
which airflow
# check the version
airflow version
# initialize the database
airflow db init
# start the web server, default port is 8080
airflow webserver -p 8080
# To actually make Airflow run the DAGs, we need to run its scheduler as well
# start the scheduler
airflow scheduler
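With the scheduler and webserver running, a couple of extra commands are handy for checking that DAGs were picked up (the dag_id, task_id, and date below are example values):
# list the DAGs Airflow has loaded from the DAG directory
airflow dags list
# run a single task on its own, without the scheduler
airflow tasks test example_etl extract 2021-01-01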
First, we have the imports:
from datetime import timedelta
# The DAG object; we'll need this to instantiate a DAG
from airflow.models import DAG
# Operators; you need this to write tasks!
from airflow.operators.bash_operator import BashOperator
# This makes scheduling easy
from airflow.utils.dates import days_ago
Then, we have the default arguments for the DAG.
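A sketch of what those default arguments and the DAG definition typically look like, using the imports above (owner, retries, email, and dag_id are placeholder values, not requirements):
# default arguments applied to every task in the DAG (placeholder values)
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

# instantiate the DAG itself
dag = DAG(
    'example_etl',
    default_args=default_args,
    description='A simple example DAG',
    schedule_interval=timedelta(days=1),
    start_date=days_ago(1),
)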