Friday, 28 April 2023

Spark internals

 


Spark is a distributed data processing engine, and every Spark application is itself a distributed application. Spark needs a cluster along with a resource manager to perform distributed computation. As of now, Spark supports the cluster managers listed below (a small sketch of how each one is selected follows the list).

·       Standalone

·       YARN

·       Kubernetes

·       Mesos
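Just to make the list concrete, the cluster manager is chosen through the master URL when we create the Spark session (or pass --master to spark-submit). This is only a minimal sketch; the application name and the URLs below are example placeholders, not values from this article.

from pyspark.sql import SparkSession

# Example master URL forms (placeholders, adjust host/port for your cluster):
#   "local[*]"                 - run locally without a cluster manager
#   "spark://host:7077"        - Spark Standalone
#   "yarn"                     - YARN (cluster details come from HADOOP_CONF_DIR)
#   "k8s://https://host:6443"  - Kubernetes
#   "mesos://host:5050"        - Mesos
spark = (SparkSession.builder
         .appName("cluster-manager-demo")   # hypothetical application name
         .master("local[*]")                # swap in one of the URLs above
         .getOrCreate())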

In this article, however, we assume that YARN is the resource manager for the Spark cluster. Below is the high-level architecture of the Spark cluster.


 

Let’s try to understand how Spark works internally. This is an important concept that helps in every aspect of dealing with large amounts of data. I will try to make it very simple.

 

Firstly, we need to know how we can execute a Spark application. There are two ways to run a Spark application:

1)      Interactive mode

2)      Batch mode

Interactive mode is generally used for testing purposes because we need immediate output after each instruction. We can do this by using spark-shell or a Jupyter Notebook.

Batch mode is used in production once the Spark application is fully developed. The application executes end to end and generates the final output as per the business requirements. We can do this by using the spark-submit command.
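To make batch mode concrete, here is a minimal sketch of a PySpark application and how it might be submitted. The file name app.py, the input/output paths, and the country column are hypothetical placeholders.

from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("batch-demo").getOrCreate()

    # Read the input, apply a transformation, and write the final output.
    df = spark.read.csv("hdfs:///data/input", header=True, inferSchema=True)
    df.groupBy("country").count().write.mode("overwrite").parquet("hdfs:///data/output")

    spark.stop()

# Submitted in batch mode with something like:
#   spark-submit --master yarn --deploy-mode cluster app.py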

Now let’s come to the above diagram, where you can see the programmer doing the spark-submit. Spark is written in Scala (Scala is a JVM language), which means Spark’s native language is Scala.

Let’s suppose I want to execute a PySpark application. We use the spark-submit command on the terminal, and our request is submitted to the resource manager. The YARN RM then selects a worker node with enough free resources, creates one Application Master (AM) container on it, and starts the application’s main method inside that container. This raises another question: what is a container? A container is an isolated virtual run-time environment with a fixed allocation of CPU and memory. And here is one more question: if Spark is written in Scala, how can Spark execute applications written in other languages like Python, Java and R?

Let’s come inside the Application Master container. The AM container is responsible for executing the main method of the application, but here we have a Python application, that is, a PySpark application. In the case of a PySpark application, the AM container consists of a PySpark driver and a JVM driver. When we submit the PySpark application, the PySpark driver executes the main() of the Python code, because PySpark is designed to execute the main() of the PySpark application. The PySpark driver then starts the JVM main method with the help of Py4J (a library that lets Python code call into the JVM).
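As a small illustrative sketch of that bridge: the Python driver holds a Py4J gateway to the JVM driver, and a few internal attributes of SparkContext expose it. These are internal APIs shown purely for illustration, not something to rely on in application code.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("py4j-peek").getOrCreate()   # hypothetical app name
sc = spark.sparkContext

# The Python driver process talks to the JVM driver through a Py4J gateway.
print(sc._gateway)   # the Py4J JavaGateway connecting Python to the JVM
print(sc._jvm)       # entry point for calling JVM classes from Python
print(sc._jsc)       # the underlying JavaSparkContext object living in the JVM

spark.stop()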

Refer to the below two diagrams for better visualization.





Once the JVM main method starts executing, the AM requests executors from the YARN resource manager.

YARN then creates the executors as per the requested configuration and returns the details of all the executors to the AM.
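As a rough sketch, the executor request that the AM makes is driven by configuration like the following; the numbers are arbitrary examples, and the same values can also be passed on the command line through spark-submit options such as --num-executors, --executor-cores and --executor-memory.

from pyspark.sql import SparkSession

# Hypothetical executor sizing; the AM asks YARN for executor containers matching these settings.
spark = (SparkSession.builder
         .appName("executor-config-demo")
         .master("yarn")
         .config("spark.executor.instances", "4")   # number of executor containers to request
         .config("spark.executor.cores", "2")       # CPU cores per executor
         .config("spark.executor.memory", "4g")     # heap memory per executor
         .getOrCreate())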




In conclusion, Spark helps us break intensive, high-computation jobs down into smaller tasks, which are then executed in parallel by the worker nodes. The same basic architecture also lets it process both real-time and archived (batch) data.

