Spark is a distributed data processing engine, and every Spark application is itself a distributed data processing application. Spark needs a cluster along with a resource manager to perform distributed computation. As of now, Spark supports the cluster technologies below.
· Standalone
· YARN
· Kubernetes
· Mesos
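The choice of cluster manager is made when the application is launched, through the master setting. Here is a minimal sketch; the application name is arbitrary, and the non-YARN master URLs are shown only as examples:

from pyspark.sql import SparkSession

# The master setting tells Spark which cluster manager to connect to:
#   "yarn"                      -> YARN
#   "spark://host:7077"         -> Standalone
#   "k8s://https://host:6443"   -> Kubernetes
#   "mesos://host:5050"         -> Mesos
#   "local[*]"                  -> no cluster, run everything in one JVM (testing only)
spark = (SparkSession.builder
         .appName("cluster-manager-demo")   # arbitrary name
         .master("yarn")
         .getOrCreate())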
In this article, we assume that a YARN resource manager manages the Spark cluster. Below is the high-level architecture of the Spark cluster.
Let’s try to understand how Spark works internally. This is an important concept that helps us in every aspect when dealing with large amounts of data. I will try to keep it very simple.
Firstly, we need to know how we can execute a Spark application. There are two ways to run a Spark application:
1) Interactive mode
2) Batch mode
Interactive mode is generally used for testing purposes, because we need immediate output after each instruction. We can do this by using spark-shell (or pyspark) or a Jupyter Notebook.
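For example, a quick interactive session in the pyspark shell might look like the sketch below. The pyspark shell already creates a SparkSession called spark for us, and the data here is made up just to show the immediate feedback:

# Start the shell first, e.g.:  pyspark --master yarn
# Each statement runs immediately, which is what makes this mode convenient for testing.
df = spark.range(1, 6).withColumnRenamed("id", "value")
df.show()           # the 5 rows are printed right away
print(df.count())   # 5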
Batch mode is used in production, once the Spark application is fully developed. It executes end to end and generates the final output as per the business requirements. We can do this by using the spark-submit command.
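As a sketch, a tiny PySpark batch application could look like the following. The file name, paths, and column name are hypothetical placeholders, and the spark-submit line at the end shows how such a script would typically be submitted to YARN:

# my_app.py -- a hypothetical batch application
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("my-batch-app").getOrCreate()

# Read, transform, and write; the input/output paths and the "category" column are placeholders.
df = spark.read.csv("hdfs:///data/input.csv", header=True, inferSchema=True)
df.groupBy("category").count().write.mode("overwrite").parquet("hdfs:///data/output")

spark.stop()

# Typical submission to a YARN cluster:
#   spark-submit --master yarn --deploy-mode cluster my_app.py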
Now let’s come to the above diagram. Here you can see that the programmer does the spark-submit. Spark is written in Scala (Scala is a JVM language), which means Spark's native language is Scala.
Let’s suppose I want to execute a PySpark application. We use the spark-submit command on the terminal, and our request is submitted to the resource manager. The YARN RM then selects a worker node (based on available resources), creates one Application Master (AM) container, and starts the main method inside the AM container. One more question arises here: what is a container? A container is an isolated virtual run-time environment. And here is another question: if Spark is written in Scala, then how can Spark execute applications in other languages like Python, Java, and R?
Let’s go inside the Application Master container. The AM container is responsible for executing the main method of the application, and here we have a Python application, that is, a PySpark application. In the case of a PySpark application, the AM container consists of a PySpark driver and a JVM driver. When we submit the PySpark application, the PySpark driver executes the main method of the PySpark application, because PySpark is designed to execute the main() of the PySpark application. The PySpark driver then starts the JVM main method with the help of Py4J.
Refer to the two diagrams below for better visualization.
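To make the Py4J bridge a little more concrete, here is a small sketch that peeks at PySpark's gateway objects. Note that _jvm and _gateway are internal, non-public attributes, used here only for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("py4j-demo").getOrCreate()

# The Python driver holds a Py4J gateway to the JVM driver; Python calls are
# translated into calls on JVM objects through this gateway.
jvm = spark.sparkContext._jvm                     # internal Py4J view of the JVM (not a public API)
print(jvm.java.lang.System.currentTimeMillis())   # this call actually runs on the JVM side
print(spark.sparkContext._gateway)                # the underlying py4j JavaGateway object

spark.stop()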
Once the JVM main method starts executing, the AM requests executors from the YARN resource manager. YARN then creates executors as per the configuration and returns the details of these executors to the AM.
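The executor configuration the AM asks for is something we supply ourselves. As a sketch, it can be set either in code or on the spark-submit command line; the numbers below are arbitrary examples:

from pyspark.sql import SparkSession

# Executor resources requested from YARN -- the values are arbitrary examples.
spark = (SparkSession.builder
         .appName("executor-config-demo")
         .config("spark.executor.instances", "4")   # number of executors
         .config("spark.executor.cores", "2")       # cores per executor
         .config("spark.executor.memory", "4g")     # memory per executor
         .getOrCreate())

# Equivalent spark-submit flags:
#   spark-submit --master yarn --num-executors 4 --executor-cores 2 --executor-memory 4g my_app.py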
In conclusion, Spark helps us break down intensive, high-computation jobs into smaller, more manageable tasks, which are then executed by the worker nodes. With this basic architecture, it can process both real-time and archived data.
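As a small illustration of this breakdown into tasks (a sketch only; actual partition counts depend on the data and configuration), the number of partitions of a DataFrame roughly determines how many parallel tasks a stage is split into:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("task-breakdown-demo").getOrCreate()

# A DataFrame is split into partitions; each partition becomes one task in a stage.
df = spark.range(0, 1_000_000, numPartitions=8)
print(df.rdd.getNumPartitions())   # typically prints 8 -> 8 parallel tasks on the executors

df.selectExpr("sum(id)").show()    # the action triggers a job; its tasks run on the worker nodes
spark.stop()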