Hello everyone! In this article, we will discuss a few ecosystems that are widely used in the big data domain. I have designed the entire stack on Docker containers, and I will also give you a basic understanding of Docker along the way.
Let’s start with a high-level understanding of the project…
Here we have created four ecosystems on Docker containers, which are given below:
1) Apache Airflow: used for orchestrating and scheduling the jobs.
2) Apache Livy: used to submit the Spark jobs through its REST API.
3) Apache Spark: used for the data processing (aggregations and manipulation of records) as per the business requirements.
4) MySQL: this is our destination. It might be different in your case; it could be object storage (S3, GCS, etc.), an RDBMS (Oracle, MySQL, PostgreSQL, etc.) or the local filesystem, as per your use case.
Let’s set up the environments on Docker containers. First, we are going to create the environment for Apache Airflow using a docker-compose.yaml file.
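Below is a minimal sketch of what that docker-compose.yaml can look like. It assumes the official apache/airflow image with the LocalExecutor, a Postgres metadata database, and the webserver published on host port 9099 (the port used later in this article); the compose file in your setup may differ, so treat it as a starting point only.

version: "3.8"

x-airflow-common: &airflow-common
  image: apache/airflow:2.7.3
  environment:
    AIRFLOW__CORE__EXECUTOR: LocalExecutor
    AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
    AIRFLOW__CORE__LOAD_EXAMPLES: "false"
  volumes:
    - ./dags:/opt/airflow/dags          # the DAG file created later goes here
    - ./logs:/opt/airflow/logs
  depends_on:
    - postgres

services:
  postgres:                             # Airflow metadata database
    image: postgres:13
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow

  airflow-init:                         # one-shot service: migrate the DB and create the airflow/airflow user
    <<: *airflow-common
    command: >
      bash -c "airflow db migrate &&
               airflow users create --username airflow --password airflow
               --firstname Admin --lastname User --role Admin --email admin@example.com"

  airflow-webserver:
    <<: *airflow-common
    command: webserver
    ports:
      - "9099:8080"                     # host port 9099 -> webserver port 8080 inside the container

  airflow-scheduler:
    <<: *airflow-common
    command: scheduler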
Copy the above code snippet, save it in a docker-compose.yaml file, and execute the below command in a terminal or cmd.
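Assuming the file is saved as docker-compose.yaml in the current directory, the standard compose bring-up is all that is needed:

docker-compose up -d     # or: docker compose up -d (newer Docker releases with the compose plugin)
docker-compose ps        # optional: verify that all the Airflow services are running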
Once the Airflow services have started in their respective containers, try to access the Airflow webserver in a web browser. In my case, I am running Docker locally and have mapped port 9099 to the Airflow webserver.
http://localhost:9099
Default user: airflow
Default password: airflow
Perfect! We have successfully built the Apache Airflow setup on Docker containers. Now let’s jump to the Spark and Livy setup.
Below is the yaml snippet for creating the Apache Spark and Apache Livy containers. We need to bring it up with the same docker-compose command as before.
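Here is a minimal sketch of such a compose file. The Spark services use the bitnami/spark image purely as an example, and since there is no single official Apache Livy image, the livy service points at a hypothetical local Dockerfile (./livy) that you would build yourself or swap for a community image; the images, versions, and ports in your setup may differ.

version: "3.8"

services:
  spark-master:
    image: bitnami/spark:3.3
    environment:
      - SPARK_MODE=master
    ports:
      - "8090:8080"                # Spark master web UI
      - "7077:7077"                # Spark master RPC endpoint

  spark-worker:
    image: bitnami/spark:3.3
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
    depends_on:
      - spark-master

  livy:
    build: ./livy                  # hypothetical local Dockerfile for Apache Livy
    environment:
      - SPARK_MASTER=spark://spark-master:7077   # the image should wire this into livy.conf (livy.spark.master)
    ports:
      - "8998:8998"                # Livy REST API
    depends_on:
      - spark-master

Bring it up the same way, with docker-compose up -d from the directory containing this file.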
We have successfully built the Spark and Livy setup as well. Now we need to test the environments. For that, we are going to do a spark-submit through Livy, and we will also schedule the spark-submit job from Apache Airflow to Apache Livy; the code will execute on the Spark worker nodes and write the outcome to the MySQL database.
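For reference, here is a minimal sketch of what such a PySpark job can look like: it builds a small DataFrame, runs a simple aggregation, and writes the result to MySQL over JDBC. The file name, table, input data, and connection details are hypothetical placeholders, and the MySQL JDBC driver jar (mysql-connector-java) must be available to Spark.

# sample_agg_job.py -- minimal sketch of a PySpark job to be submitted through Livy
# (file name, table, and connection details below are placeholders; adjust to your setup)
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sample_agg_job").getOrCreate()

# Hypothetical input; a real job would read from files, Kafka, a database, etc.
data = [("electronics", 120.0), ("electronics", 80.5), ("grocery", 42.0)]
df = spark.createDataFrame(data, ["category", "amount"])

# Simple aggregation as per the business requirement
agg_df = df.groupBy("category").agg(F.sum("amount").alias("total_amount"))

# Write the outcome to the MySQL database over JDBC
agg_df.write.format("jdbc") \
    .option("url", "jdbc:mysql://mysql:3306/testdb") \
    .option("driver", "com.mysql.cj.jdbc.Driver") \
    .option("dbtable", "category_totals") \
    .option("user", "root") \
    .option("password", "root") \
    .mode("overwrite") \
    .save()

spark.stop()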
Copy the PySpark code from your local machine to the target path on the Livy container using the below command.
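Assuming the Livy container is named livy and the job file from the sketch above is sample_agg_job.py, a docker cp along these lines does the copy (the container name and target path are placeholders; adjust them to your setup):

docker cp sample_agg_job.py livy:/opt/livy/jobs/sample_agg_job.py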
Now log in to the Airflow webserver node and create a DAG Python file in the dags directory.
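Here is a minimal sketch of such a DAG file. It uses the LivyOperator from the apache-airflow-providers-apache-livy package (which needs to be installed in the Airflow containers) and assumes an Airflow connection named livy_default pointing at the Livy container on port 8998; the file path matches the docker cp placeholder above, and note that Livy only accepts local: paths that are whitelisted via livy.file.local-dir-whitelist in livy.conf.

# dags/spark_submit_via_livy.py -- minimal sketch of a DAG that submits the PySpark job via Livy
# (connection id, schedule, and file path are assumptions; adjust to your setup)
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.livy.operators.livy import LivyOperator

with DAG(
    dag_id="spark_submit_via_livy",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    submit_spark_job = LivyOperator(
        task_id="submit_spark_job",
        livy_conn_id="livy_default",                    # Airflow connection pointing to the Livy host on port 8998
        file="local:/opt/livy/jobs/sample_agg_job.py",  # path copied into the Livy container above
        polling_interval=30,                            # poll Livy every 30 seconds until the batch finishes
    )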
It will look like the snapshot below, and we are good to trigger the job.
That’s it! We have validated all the integrations. Going further, we can build the Spark code as per the business requirements, and for the automation we just have to follow the same procedure.