Tuesday, 14 February 2023

Demonstrating a Spark project on Docker

Hello everyone! In this article, we will discuss a few ecosystems that are widely used in the big data domain. All of the stacks are set up on Docker containers, and I will also give you a little bit of basic understanding of Docker.

Let’s start with a high-level understanding of the project.



Here we have created four ecosystems on Docker containers, which are given below:

1) Apache Airflow: This is used for orchestrating and scheduling the jobs.

2) Apache Livy: We use this to submit the Spark jobs over its REST API.

3) Apache Spark: This is used for the data processing (aggregations and record manipulation) as per the business requirements.

4) MySQL: This is our destination. Yours might be different; it could be object storage (S3, GCS, etc.), an RDBMS (Oracle, MySQL, Postgres, etc.), or a local filesystem, as per your use case.


Let’s set up the environments on Docker containers. First, we will create the environment for Apache Airflow using a docker-compose.yaml file.
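Here is a minimal sketch of such a docker-compose file, assuming a LocalExecutor setup with a Postgres metadata database. The image tag and the 9099 port mapping are illustrative choices to match this walkthrough; the official docker-compose.yaml from the Airflow documentation is a more complete starting point.

version: "3.8"

x-airflow-common: &airflow-common
  image: apache/airflow:2.5.1                  # image tag is an assumption
  environment: &airflow-env
    AIRFLOW__CORE__EXECUTOR: LocalExecutor
    AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
  volumes:
    - ./dags:/opt/airflow/dags                 # DAG files live here on the host
  depends_on:
    - postgres
  restart: always

services:
  postgres:
    image: postgres:13
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow

  airflow-init:
    <<: *airflow-common
    restart: on-failure
    command: version
    environment:
      <<: *airflow-env
      _AIRFLOW_DB_UPGRADE: "true"              # run the DB migrations on first start
      _AIRFLOW_WWW_USER_CREATE: "true"         # create the default airflow/airflow login
      _AIRFLOW_WWW_USER_USERNAME: airflow
      _AIRFLOW_WWW_USER_PASSWORD: airflow

  airflow-webserver:
    <<: *airflow-common
    command: webserver
    ports:
      - "9099:8080"                            # Airflow UI exposed on localhost:9099

  airflow-scheduler:
    <<: *airflow-common
    command: scheduler

The airflow-init service runs the database migrations and creates the airflow/airflow login used later in this post.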



Copy the above code snippet, store it in a docker-compose.yaml file, and execute the below command in a terminal or cmd.
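For example, assuming the file is saved as docker-compose.yaml in the current directory:

docker-compose up -d     # start all Airflow services in the background
docker ps                # confirm the containers are running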


Once the Airflow services have started in their containers, try to access the Airflow webserver in a web browser. In my case I am running Docker locally and have mapped port 9099 for the Airflow webserver.


http://localhost:9099

Default user: airflow

Default password: airflow


Perfect! We have successfully built the Apache Airflow setup on Docker containers. Now let’s jump to the Spark and Livy setup.

Below is the YAML file snippet for creating the Apache Spark and Apache Livy containers. We need to bring this YAML file up with the docker-compose command as well.
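Here is a minimal sketch of such a file. The Spark master and worker use Bitnami Spark images purely as an example; there is no official Livy image, so the livy image name below is a placeholder for whatever Livy image or custom Dockerfile you use.

version: "3.8"
services:
  spark-master:
    image: bitnami/spark:3.3                     # example image/tag, adjust to your Spark version
    environment:
      - SPARK_MODE=master
    ports:
      - "8080:8080"                              # Spark master web UI
      - "7077:7077"                              # Spark master RPC port

  spark-worker:
    image: bitnami/spark:3.3
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
    depends_on:
      - spark-master

  livy:
    image: my-livy-image:latest                  # placeholder: build or pick your own Livy image
    environment:
      - SPARK_MASTER=spark://spark-master:7077   # how this is read depends on your Livy image
    ports:
      - "8998:8998"                              # Livy REST endpoint / UI
    depends_on:
      - spark-master

Bring these services up the same way, for example with docker-compose -f spark-livy-compose.yaml up -d, and confirm that the Livy UI responds at http://localhost:8998.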


We have successfully built the Spark and Livy setup as well. Now we need to test the environments. For that, we will do a spark-submit through Livy, and we will also schedule the spark-submit job from Apache Airflow to Apache Livy; the code will execute on the Spark worker nodes and write the outcome to the MySQL database.
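As an illustration, such a PySpark job could look like the sketch below. The file name (etl_job.py), input path, database name, table name, and credentials are all placeholders, and the MySQL JDBC driver jar must be available on the Spark classpath.

# etl_job.py - illustrative PySpark job: aggregate input records and write the result to MySQL over JDBC.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-to-mysql").getOrCreate()

# Read the raw records (source path and format are assumptions).
orders = spark.read.option("header", "true").csv("/data/input/orders.csv")

# A simple aggregation standing in for the real business logic.
daily_totals = (
    orders.groupBy("order_date")
    .agg(F.sum(F.col("amount").cast("double")).alias("total_amount"))
)

# Write the outcome to the MySQL container (connection details are placeholders).
(
    daily_totals.write.format("jdbc")
    .option("url", "jdbc:mysql://mysql:3306/analytics")
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .option("dbtable", "daily_order_totals")
    .option("user", "root")
    .option("password", "example")
    .mode("overwrite")
    .save()
)

spark.stop()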

Copy the PySpark code from the local machine to a path on the Livy container using the below command.
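For example, with docker cp (the container name and target path are assumptions; adjust them to your compose service/container names):

docker cp ./etl_job.py livy:/opt/livy/jobs/etl_job.py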


Now log in to the Airflow webserver node and create a DAG Python file in the dags directory.
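Below is a minimal sketch of such a DAG, assuming the apache-airflow-providers-apache-livy package is installed in the Airflow containers and an Airflow connection named livy_default points at the Livy container on port 8998. The DAG id, schedule, and file path are illustrative.

# dags/spark_etl_via_livy.py - submit the PySpark script to Spark through Livy.
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.livy.operators.livy import LivyOperator

with DAG(
    dag_id="spark_etl_via_livy",
    start_date=datetime(2023, 2, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    submit_spark_job = LivyOperator(
        task_id="submit_spark_job",
        file="local:/opt/livy/jobs/etl_job.py",   # path of the script inside the Livy container
        livy_conn_id="livy_default",
        polling_interval=30,                      # poll Livy until the batch finishes
    )

The polling_interval makes the task wait for the Livy batch to finish, so the DAG run reflects the actual outcome of the Spark job.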

In the Airflow UI it will look like the below snapshot, and we are good to trigger the job.


That’s it! We have validated all the integrations. Going forward, we can build the Spark code as per the business requirements, and for automation we follow the same procedure.
