Description
This lab demonstrates the core functionality of Airflow: programmatically authoring, automating, scheduling, and monitoring workflows.
- Airflow is a platform to programmatically author, schedule, and monitor workflows.
- In Airflow, a DAG – or a Directed Acyclic Graph – is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies.
- The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies.
- Rich command line utilities make performing complex surgeries on DAGs a snap.
- The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.
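The DAG idea above can be illustrated without Airflow itself: tasks are nodes, dependencies are edges, and any valid execution order must respect every edge. A minimal sketch using only the Python standard library (the task names are illustrative, not from the lab):

```python
from graphlib import TopologicalSorter

# Hypothetical tasks: each key maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},   # runs only after extract
    "load": {"transform"},      # runs only after transform
    "report": {"load"},
}

# A scheduler must execute tasks in an order that respects every
# dependency edge; TopologicalSorter yields one such order.
order = list(TopologicalSorter(dependencies).static_order())
print(order)  # ['extract', 'transform', 'load', 'report']
```

Because this graph is a simple chain, only one order is valid; in a wider DAG, independent tasks could run in parallel, which is exactly what the Airflow scheduler does across its workers.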
Experiment Setup
- Create a new project and install the required dependencies.
pip install apache-airflow
- Export AIRFLOW_HOME, pointing it at the present working directory
export AIRFLOW_HOME=$(pwd)
- Initialize the instance
airflow db init
- Create an admin user
airflow users create \
  --username admin \
  --firstname YourName \
  --lastname YourLastName \
  --role Admin \
  --email example@example.com
You will be prompted for a password since none is passed on the command line.
- Start the webserver daemon in the background.
airflow webserver -D
By default it listens on port 8080
- To check whether Airflow Daemon is running:
List the services running on port 8080
lsof -i tcp:8080
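As an alternative to lsof, the same check can be sketched in Python by attempting a TCP connection to the port (8080 is just Airflow's default webserver port; the helper name is ours):

```python
import socket

def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is accepting TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1.0)
        # connect_ex returns 0 on a successful connect, an errno otherwise.
        return s.connect_ex((host, port)) == 0

print("webserver up:", port_in_use(8080))
```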
- Start the scheduler
airflow scheduler
Check the web server on 127.0.0.1:8080
- Create folder dags inside AIRFLOW_HOME
Place the DemoDag python file under the ‘dags’ folder.
- Restart the webserver so the new DAG shows up in the UI
List the process listening on port 8080: lsof -i tcp:8080
Kill the PID of that process
- Start the webserver again with 'airflow webserver -D'
- The file can now be seen under the ‘dags’ folder.
- DAGs can be scheduled to run every minute, hourly, or daily
- You can also pause/unpause a DAG depending on the requirement
- Trigger your DAG
- Check the logs for additional information
- Adding tasks to a DAG
Adding task_2 by making changes in the code and clicking the ‘Update’ button
Checking logs for additional information
- Restructuring the code
Airflow TFX
- Install the dependencies listed in 'requirements.txt'
pip install -r requirements.txt
- You can now see the ‘taxi_pipeline’ dag in the dags folder.
- The Tree view visualizes the pipeline's task hierarchy and the status of each run.
- Successful execution of 'taxi_pipeline'.
- ExampleGen ingests and splits the input dataset.
- StatisticsGen calculates statistics for the dataset.
- SchemaGen examines the statistics and creates a data schema.
- ExampleValidator looks for anomalies and missing values in the dataset.
Lessons learned
- This lab helps us understand how Airflow allows users to create workflows with high granularity and track the progress as they execute.
- Airflow enables us to have a platform that can run and automate all the jobs on a schedule.
- You can also add/transform jobs as and when required.