Sunday 17 May 2020

Lazy Evaluation in Spark



Hello everyone! Today we will discuss "Lazy Evaluation" in Spark. When I started with Spark and read about lazy evaluation for the first time, I did not really get the concept, because I am the kind of person who believes in practice rather than theory. At that time I thought it was not an important concept and skipped it, but then I went through multiple interviews and always struggled with the same question. So I started exploring lazy evaluation in Spark, and what I have learned so far, I am going to discuss here.
As we know, there are two kinds of operations we generally perform in Spark: transformations and actions. Transformations return new RDD objects, while actions return values or data to the driver program. Without wasting time, let's start with the practical experience…
What I understood about lazy evaluation: simply put, lazy evaluation in Spark means that execution will not start until an action is triggered.
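For example, here is a minimal, self-contained PySpark sketch (in the pyspark shell you would already have sc, so you could skip the SparkContext setup): the map() transformation returns a new RDD instantly without touching the data, and only the collect() action makes Spark actually compute anything.

from pyspark import SparkContext

sc = SparkContext(appName="lazy-evaluation-demo")

numbers = sc.parallelize([1, 2, 3, 4, 5])

# Transformation: returns a new RDD immediately, nothing is computed yet.
doubled = numbers.map(lambda x: x * 2)

# Action: only now does Spark run the computation and send the
# results back to the driver program.
print(doubled.collect())   # [2, 4, 6, 8, 10]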
I have a file "string.txt" in my /root directory.

I am going to pass the right path of the file to create the RDD. Here we have created the RDD using the right path of the file, as you can see in the screenshot below.

Okay! We have passed the right path for RDD creation and are able to perform the collect() action.
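In PySpark, the two steps from the screenshots look roughly like the sketch below (the exact shell in the screenshots may differ, so treat this only as an illustration):

# Create the RDD from the correct path; textFile() is a transformation,
# so the file is not actually read at this point.
rdd = sc.textFile("/root/string.txt")

# collect() is an action, so Spark now reads the file and returns
# its lines to the driver.
print(rdd.collect())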
Let's see what happens if we pass a wrong path for RDD creation.



You can see that when I passed a wrong path to the dataset, Spark did not validate it; it just created the RDD.
Now we are going to perform the collect() action, and it will throw an error for this RDD.
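In code, the wrong-path experiment looks something like this sketch (the file name wrong_string.txt is just a made-up path that does not exist):

# Wrong path: the file does not exist, but Spark still creates the RDD,
# because textFile() is only a transformation.
bad_rdd = sc.textFile("/root/wrong_string.txt")

# The action is where the path is finally validated, so this is the
# point where the error shows up.
try:
    bad_rdd.collect()
except Exception as e:
    print("collect() failed because the input path does not exist:", e)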

What does this mean? It means that when Spark creates an RDD, it does not validate whether the path is right or wrong; only when we perform an action does it validate the path. This is the best example of lazy evaluation: Spark only evaluates at the time of an action, not at the time of a transformation.


Now we have a question in our mind: what is the need for lazy evaluation?


Whenever we perform multiple transformations, Spark creates a DAG (Directed Acyclic Graph). Based on this graph, Spark can make a more educated decision about how to optimize the overall process. So Spark only processes the RDDs that are required for a particular operation, instead of performing operations on all the RDDs.
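As a rough sketch of why this matters (reusing the /root/string.txt file and a hypothetical filter on the word "error"): Spark first builds the DAG for the whole chain of transformations, and when the take(5) action runs it already knows the full plan, so it does not have to materialize every intermediate RDD in full.

# Only the DAG is built here; no data is read and no work is done yet.
lines  = sc.textFile("/root/string.txt")
errors = lines.filter(lambda line: "error" in line)
upper  = errors.map(lambda line: line.upper())

# take(5) is an action. Knowing the full plan, Spark can stop scanning
# the input once it has found five matching lines.
print(upper.take(5))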

