Sunday, 18 August 2019

Apache Airflow: an orchestration tool to schedule and orchestrate your data workflows



Hello, techie! Have you heard about Apache Airflow? Do you know what Airflow is used for? If not, no need to worry. You are in the right place.
In this post, we will demonstrate how Apache Airflow works. We are not going to discuss the installation of Apache Airflow here. We will first go through a short introduction and then focus on how to use Apache Airflow. So, let's start…
Apache Airflow is an open-source workflow automation and scheduling platform. It is used to build data pipelines and is similar to Apache Oozie, Azkaban, and Luigi.
In Apache Airflow, we create DAGs (Directed Acyclic Graphs) using Python code. A DAG includes the following information:
• A configuration file that outlines HOW to execute a task
• A collection of tasks
• The order in which the tasks execute
• The time (schedule) at which the tasks run

[DAG diagram: task A followed by its downstream tasks]
The DAG diagram above shows the order in which the tasks execute. Task A executes first, and the other tasks then execute according to the flow. We can change the order of execution of the tasks according to our business requirements.
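As a minimal sketch, an order like the one in the diagram could be expressed in a DAG file as shown below. The DummyOperator is used here only as a stand-in for real tasks, and the dag id and task names are purely illustrative.

from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

dag = DAG('order_example', start_date=datetime(2019, 8, 1), schedule_interval=None)

task_a = DummyOperator(task_id='task_a', dag=dag)
task_b = DummyOperator(task_id='task_b', dag=dag)
task_c = DummyOperator(task_id='task_c', dag=dag)

# task_a runs first; task_b and task_c run only after it has finished
task_a >> task_b
task_a >> task_c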

Before diving into the Python programming part, let's take a quick glance at the Apache Airflow console.



So far, we have got a basic idea about Airflow. If you did not follow everything in the discussion above, don't worry. From here on, we will learn about Apache Airflow hands-on. We assume you are using a Linux machine or instance for Apache Airflow.

Airflow Commands:
#To check the installed Airflow version
airflow version
#To initialise the database where Airflow saves the workflows and their states
airflow initdb
#To list the available DAGs (it shows only the DAG ids)
airflow list_dags
#To list the tasks of a specific DAG id
airflow list_tasks <dag_id>
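A few more commands that come in handy while working with Airflow are listed below. Treat this as a quick reference; the exact flags can vary slightly between Airflow versions.

#To start the Airflow web server (UI) on port 8080
airflow webserver -p 8080
#To start the scheduler that actually runs the tasks
airflow scheduler
#To run a single task for a given execution date without recording its state (useful for testing)
airflow test <dag_id> <task_id> <execution_date>
#To trigger a full DAG run manually
airflow trigger_dag <dag_id>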

Operator
An operator describes a single task in a workflow. Operators determine what actually gets executed when your DAG runs. Airflow provides operators for many common tasks, including:
  • BashOperator – executes a bash command
  • PythonOperator – calls an arbitrary Python function
  • EmailOperator – sends an email
  • SimpleHttpOperator – sends an HTTP request
  • MySqlOperator, SqliteOperator, PostgresOperator, MsSqlOperator, OracleOperator, JdbcOperator, etc. – executes a SQL command
  • Sensor – waits for a certain time, file, database row, S3 key, etc.

In addition to these basic building blocks, there are many more specific operators: DockerOperator, HiveOperator, S3FileTransformOperator, PrestoToMysqlOperator, SlackOperator, etc.
Here we discuss only the BashOperator and the PythonOperator. If you want to use operators other than these two, you can follow the official Apache Airflow documentation (https://airflow.apache.org/_modules/index.html).


First, open the terminal on the machine where you have installed Apache Airflow and go to the Airflow directory. You should be able to see the sub-directories shown below.
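The exact contents depend on your setup, but a freshly initialised Airflow home directory typically looks something like this (the dags folder is usually created by you if it does not exist yet):

airflow.cfg      # main configuration file
airflow.db       # SQLite metadata database created by airflow initdb
unittests.cfg    # configuration used for tests
logs/            # task and scheduler logs
dags/            # place your DAG .py files here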




Then go to the dags directory and create a Python file (the file name can be anything you want) with the .py extension, and write the DAG code in this file.
Example 1: (Hello_World.py)
In this example, we will use the BashOperator for the tasks.
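Below is a minimal sketch of what Hello_World.py could look like. The task ids and bash commands are only illustrative; the dag_id "HelloWorld" is the one we will look for in the console.

from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2019, 8, 1),
}

# The dag_id "HelloWorld" is what will appear in the Airflow console
dag = DAG('HelloWorld', default_args=default_args, schedule_interval='@daily')

# Each BashOperator runs a single bash command
print_hello = BashOperator(task_id='print_hello', bash_command='echo "Hello"', dag=dag)
print_world = BashOperator(task_id='print_world', bash_command='echo "World"', dag=dag)

# print_hello runs first, then print_world
print_hello >> print_world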



Then go to the Apache Airflow console, click on the DAGs tab and find the dag id. In this example we have used "HelloWorld" as the dag id.


Note: You can see all the tasks by clicking on the Graph View button next to the trigger button.

Example 2: (FCA.py)

In this example, we will use the PythonOperator for the tasks.
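Again, treat the code below as a minimal sketch of what FCA.py might contain. The dag_id, task id and the Python function are only placeholders; the important part is how a function is wired to a PythonOperator through python_callable.

from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2019, 8, 1),
}

# The dag_id "FCA" is assumed to match the file name; use whatever id you prefer
dag = DAG('FCA', default_args=default_args, schedule_interval='@daily')

# The python_callable can be any Python function; this one is only a placeholder
def print_message():
    print('Hello from the PythonOperator!')

print_task = PythonOperator(task_id='print_message', python_callable=print_message, dag=dag)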

