Hello everyone! Today we will discuss lazy evaluation in Spark. When I
started with Spark and first read about lazy evaluation, I did not really
understand the concept, because I am the kind of person who believes in
practice rather than theory. At that time I thought it was not an important
concept and skipped it, but then I went through multiple interviews and
always struggled with this same question. So I started exploring lazy
evaluation in Spark, and here I am going to discuss what I have learned so
far.
As we know, there are two kinds of operations we generally perform in
Spark: transformations and actions. Transformations return new RDD objects,
and actions return values or data to the driver program. Without wasting
time, let's start with a practical example…
What I understood about lazy evaluation: simply put, lazy evaluation in
Spark means that execution will not start until an action is triggered.
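To make this definition concrete, here is a minimal sketch in plain Python. It is only a toy model and not Spark itself: the `ToyRDD` class and the `double` function are hypothetical names I made up for illustration. The idea is that `map` (a transformation) only remembers what to do, and `collect` (an action) is the point where the work really happens.

```python
# Toy model only: this is NOT Spark, just plain Python that imitates the idea.
calls = []  # records every time the transformation function really runs


def double(x):
    calls.append(x)
    return x * 2


class ToyRDD:
    """Hypothetical stand-in for an RDD: it stores a recipe, not results."""

    def __init__(self, data, funcs=()):
        self._data = list(data)
        self._funcs = tuple(funcs)

    def map(self, func):
        # Transformation: remember the function and return a new ToyRDD.
        return ToyRDD(self._data, self._funcs + (func,))

    def collect(self):
        # Action: only now are the recorded functions applied.
        result = self._data
        for func in self._funcs:
            result = [func(x) for x in result]
        return result


numbers = ToyRDD([1, 2, 3])
doubled = numbers.map(double)       # nothing runs yet
calls_before_action = len(calls)    # still zero at this point
output = doubled.collect()          # execution starts here
```

In this sketch `double` is not called a single time until `collect()` runs, which is the same behaviour the definition above describes for Spark.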
I have a file "string.txt" in my /root directory.
First, I am going to pass the right path of the file when creating the RDD. You can see this in the screenshot below.
Okay! We passed the right path for the RDD creation and were able to
perform the collect() action.
Let's see what happens if we pass a wrong path for the RDD creation.
You can see that when I passed a wrong path for the dataset, Spark did not
validate it; it just created the RDD.
Now we are going to perform the collect() action, and it will give an error
for this RDD.
What does this mean? It means that when Spark creates an RDD, it does not
check whether the path is right or wrong; the path is only validated when an
action is performed. This is a good example of lazy evaluation: Spark
evaluates only at the time of an action, not at the time of a transformation.
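The behaviour described above can be imitated with a small plain-Python sketch. This is an assumption-laden toy and not Spark code: `LazyTextFile` is a hypothetical class, and a temporary file stands in for the input file. In Spark itself, the RDD in such a test would typically be created with `sc.textFile(path)` and evaluated with `collect()`.

```python
import os
import tempfile


class LazyTextFile:
    """Hypothetical stand-in for an RDD built from a file path."""

    def __init__(self, path):
        # Only the path is stored; the file is not opened or checked here.
        self.path = path

    def collect(self):
        # Action: the file is opened now, so a bad path fails here.
        with open(self.path, encoding="utf-8") as handle:
            return [line.rstrip("\n") for line in handle]


# A real temporary file plays the role of the existing input file.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("hello\nspark\n")
    good_path = tmp.name

good = LazyTextFile(good_path)
good_lines = good.collect()                  # right path: works

bad = LazyTextFile(good_path + ".missing")   # wrong path, yet no error here
try:
    bad.collect()                            # the error appears only now
    error_seen = False
except FileNotFoundError:
    error_seen = True

os.remove(good_path)                         # clean up the temporary file
```

Creating `bad` succeeds silently, and the failure shows up only when `collect()` is called, just like in the wrong-path experiment.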
Now a question comes to mind: what is the need for lazy evaluation?
Whenever we perform multiple transformations, Spark creates a DAG
(directed acyclic graph). Based on this graph, Spark can make a more educated
decision about how to optimize the overall process. So Spark processes only those RDDs that are required for a particular operation, instead of performing operations on all the RDDs.
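Here is a rough plain-Python sketch of that last point. It is a toy lineage graph, not Spark's real DAG scheduler, and the `Node` class and the dataset names are hypothetical. Two datasets are derived from the same source, but the action is called on only one of them, so the other one is never computed.

```python
computed = []  # names of the toy datasets that were actually evaluated


class Node:
    """Hypothetical node in a toy lineage graph (not Spark's internal DAG)."""

    def __init__(self, name, func, parent=None):
        self.name = name
        self.func = func
        self.parent = parent

    def transform(self, name, func):
        # Transformation: add a child node to the graph; compute nothing.
        return Node(name, func, parent=self)

    def collect(self):
        # Action: walk up the lineage and evaluate only the ancestors.
        data = self.parent.collect() if self.parent else None
        computed.append(self.name)
        return self.func(data)


source = Node("source", lambda _: [1, 2, 3, 4])
evens = source.transform("evens", lambda xs: [x for x in xs if x % 2 == 0])
squares = source.transform("squares", lambda xs: [x * x for x in xs])  # never used

result = evens.collect()  # only "source" and "evens" are evaluated
```

Because no action is ever called on `squares`, it stays a node in the graph and costs nothing, which is the kind of saving lazy evaluation makes possible.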