Airflow Operators - A Comparison
Airflow provides a variety of operators for turning your business logic into executable tasks in a workflow. It is often confusing to decide which one to use. In this article, we will discuss the pros and cons of each in detail.
PythonOperator
When using the Airflow PythonOperator, all the business logic and its associated code reside in the Airflow DAG directory. The PythonOperator imports and runs them during execution. A typical layout looks like this (a minimal DAG sketch follows the layout):
airflow
\__ dags
    \__ classification_workflow.py
    \__ tweet_classification
        \__ preprocess.py
        \__ predict.py
        \__ __init__.py
\__ logs
\__ airflow.cfg
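To make this concrete, here is a minimal sketch of what classification_workflow.py could look like with this layout. The preprocess and predict function names, the schedule, and the task ids are assumptions for illustration, not the actual project code.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# The tweet_classification package sits inside the dags/ folder,
# so the scheduler and workers can import it directly.
from tweet_classification.preprocess import preprocess  # assumed function
from tweet_classification.predict import predict        # assumed function

with DAG(
    dag_id='classification_workflow',
    start_date=datetime(2021, 1, 1),
    schedule_interval='@daily',
    catchup=False,
) as dag:
    preprocess_task = PythonOperator(
        task_id='preprocess_tweets',
        python_callable=preprocess,
    )
    predict_task = PythonOperator(
        task_id='predict_tweets',
        python_callable=predict,
    )

    preprocess_task >> predict_task

Because the callables are imported when the DAG file is parsed, any change to preprocess.py or predict.py means redeploying the DAG folder, which is exactly the coupling listed under the cons below.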
Pros
- Best option when the code lives in the same repository as Airflow
- Simple and easy to use
- Works well for small teams
Cons:
- Couples Airflow code with business logic
- Any business-logic change means redeploying the Airflow code
- Sharing a single Airflow instance across multiple projects becomes a nightmare
- Can run only Python code, well, duh.
DockerOperator
When using Airflow’s DockerOperator, all the business logic and its associated code reside in a Docker image.
During execution:
- Airflow pulls the specified image
- Spins up a container
- Executes the respective command
Note that we have to ensure a Docker daemon is running on the worker machine.
from airflow.providers.docker.operators.docker import DockerOperator

DockerOperator(
    dag=dag,
    task_id='docker_task',
    # Image pulled from a container registry, e.g. GCR
    image='gcr.io/project-predict/predict-api:v1',
    # Remove the container once the task finishes
    auto_remove=True,
    # Talk to the local Docker daemon over its Unix socket
    docker_url='unix://var/run/docker.sock',
    command='python extract_from_api_or_something.py'
)
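Since docker_url is just the address of a Docker daemon, the Unix socket above can also be swapped for a remote daemon address if your workers do not run Docker themselves; either way, the daemon requirement mentioned above still applies.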
Pros
- Works well across cross-functional teams
- Can run projects that are not built in Python
- Works well when your infrastructure already runs on Docker, e.g. Docker Compose
Cons
- Needs Docker installed on the worker machine
- Depending on the available resources, the load on the worker machine can get heavy when multiple containers run at the same time
KubernetesPodOperator
When using the KubernetesPodOperator, all the business logic and its associated code reside in a Docker image. During execution, Airflow spins up a worker pod, which pulls the specified Docker image and executes the respective command.
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

KubernetesPodOperator(
    task_id='classify_tweets',
    # Name of the pod created in the cluster
    name='classify_tweets',
    cmds=['python', 'app/classify.py'],
    # Kubernetes namespace the pod runs in
    namespace='airflow',
    image='gcr.io/tweet_classifier/dev:0.0.1')
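Here the pod is created in the airflow namespace; whichever namespace you pick has to exist in the cluster, and the Airflow deployment needs permission to create pods in it.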
Pros
- Works well across cross-functional teams
- A single Airflow instance can be shared across teams without hassle
- Decouples the DAG from the business logic
Cons:
- Adds complexity on the infrastructure side, since it uses both Docker and Kubernetes