Sunday, 19 July 2020

Understanding How Spark Works


Hello everyone! Today we are going to talk about the components of Apache Spark. Spark contains several components, and each component has a specific role in executing a Spark program.


First we will discuss the components and their responsibilities, and then we will deep dive into the workflow of each component.



1)      Worker: It is one of the nodes that belong to the cluster, also called a slave node. Workers provide the resources (CPU + RAM) for executing tasks.

2)      Cluster Manager: It keeps all the information about the workers and knows how many resources are available on each of them. Based on resource availability, it decides how to distribute work to the workers. Spark ships with an embedded resource manager, which is used when we deploy Spark in standalone mode. Apart from standalone mode, Spark can run in local mode or use an external resource manager such as YARN, Mesos or Kubernetes (a resource-configuration sketch follows this list).
3)      Driver: As the part of the Spark application responsible for instantiating a SparkSession, the Spark driver has multiple roles: it communicates with the cluster manager; it requests resources (CPU, memory, etc.) from the cluster manager for Spark's executors (JVMs); and it transforms all the Spark operations into DAG computations, schedules them, and distributes their execution as tasks across the Spark executors. Once the resources are allocated, it communicates directly with the executors.

Question: What is the difference between SparkSession and SparkContext?
Answer: Prior to Spark 2.0, the entry points for a Spark application included the SparkContext (used for the core RDD API) along with SQLContext, HiveContext and StreamingContext. A Spark application can have only one SparkContext.

The SparkSession object introduced in Spark 2.0 combines all these objects into a single entry point that can be used for any Spark application. The SparkSession holds all the runtime configuration properties set by the user, such as the master, the application name, the number of executors and more.
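A minimal Scala sketch of how the two relate (the application name and the local master URL below are placeholder values I am assuming, not anything prescribed by Spark):

import org.apache.spark.sql.SparkSession

object SessionVsContext {
  def main(args: Array[String]): Unit = {
    // Since Spark 2.0 the SparkSession is the single unified entry point.
    val spark = SparkSession.builder()
      .appName("understanding-spark")   // placeholder application name
      .master("local[*]")               // placeholder master; point this at your cluster in production
      .getOrCreate()

    // The older SparkContext still exists; the SparkSession simply wraps it.
    val sc = spark.sparkContext

    // The RDD API goes through the SparkContext ...
    val rdd = sc.parallelize(1 to 10)

    // ... while the Dataset/DataFrame API goes through the SparkSession.
    val df = spark.range(10)

    println(s"RDD count = ${rdd.count()}, DataFrame count = ${df.count()}")
    spark.stop()
  }
}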

4)      Executors: A Spark executor runs on each worker node in the cluster. The executors communicate with the driver program and are responsible for executing tasks on the workers. In most deployment modes, only a single executor runs per node.

5)      Tasks: A task is the smallest unit of work in Spark. The driver splits each job into tasks, one per data partition, and sends them to the executors for execution.
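To make the cluster manager and executor roles more concrete, here is a hedged, spark-shell-style sketch of how the master URL and executor resources are typically declared when building the session. The executor count, cores and memory below are illustrative assumptions, not recommendations:

import org.apache.spark.sql.SparkSession

// The master URL decides which cluster manager the driver negotiates with:
//   "local[*]"               -> no cluster manager, everything runs in one JVM
//   "spark://host:7077"      -> Spark's embedded standalone cluster manager
//   "yarn"                   -> Hadoop YARN
//   "k8s://https://host:443" -> Kubernetes
val spark = SparkSession.builder()
  .appName("resource-demo")                   // placeholder application name
  .master("spark://master-host:7077")         // placeholder standalone master URL
  .config("spark.executor.memory", "2g")      // illustrative: 2 GB per executor
  .config("spark.executor.cores", "2")        // illustrative: 2 cores per executor
  .config("spark.executor.instances", "2")    // illustrative; honored on YARN/Kubernetes
  .getOrCreate()

The same properties can also be passed to spark-submit on the command line instead of being hard-coded in the application.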



The SparkSession/SparkContext is initiated on the client machine and makes the connection to the master. Whenever we perform a Spark operation, the driver records it in a logical directed acyclic graph called the DAG. The master has all the information about the workers' resources, so the driver talks to the cluster manager and negotiates the resources it needs, and the cluster manager launches executors on the worker nodes on behalf of the driver. When the executors start, they register themselves with the driver, so the driver has a complete view of the executors that will run its tasks. The driver then sends tasks to the executors based on data placement. During the execution of tasks, the driver program monitors the executors that are running them and schedules future tasks, again based on data placement.
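Putting the whole flow together, here is a small hedged Scala snippet (paste-able into spark-shell); the comments map each step to the description above, and the partition count and numbers are made up for illustration:

import org.apache.spark.sql.SparkSession

// A tiny job to walk through the workflow described above.
val spark = SparkSession.builder()
  .appName("workflow-demo")   // placeholder application name
  .master("local[*]")         // placeholder master
  .getOrCreate()
val sc = spark.sparkContext

// 1. Transformations are lazy: Spark only extends the logical DAG here,
//    nothing runs on the executors yet.
val numbers = sc.parallelize(1 to 100, numSlices = 4) // 4 partitions => 4 tasks per stage
val squares = numbers.map(n => n * n)
val evens   = squares.filter(_ % 2 == 0)

// 2. An action triggers the job: the driver turns the DAG into stages and tasks,
//    the cluster manager provides the executors, and the driver ships one task
//    per partition to them based on data placement.
val total = evens.reduce(_ + _)

// 3. The driver monitors the executors' tasks and collects the final result.
println(s"Sum of even squares = $total")
spark.stop()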

1 comment:

  1. In simple words, SparkContext is the entry point used to create the RDD API, whereas
    SparkSession is a unified context (sc, ssc, sqlContext, etc.) and is used to create Datasets.
    Thanks for sharing your knowledge, Ravi Kumar.
    Regards
    Venu