Monday, 13 May 2019

Why Apache Spark




When I first started reading about Spark, one question kept coming to mind: "if MapReduce already exists and provides similar functionality, why was Apache Spark introduced?" That question pushed me to compare the two, so in this post we will focus on the differences between Spark and MapReduce.
The short answer is speed: Spark can be dramatically faster than MapReduce.


Apache Spark processes data in memory, while Hadoop MapReduce persists results back to disk after each map or reduce step, so Spark tends to outperform Hadoop MapReduce.
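To make the in-memory point concrete, here is a minimal PySpark sketch; the file name and column are assumptions for illustration, not taken from the post. The first action pulls the data from disk, and later actions are served from the cache:

```python
# A minimal sketch of Spark's in-memory caching. "events.csv" and the
# "status" column are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

df = spark.read.csv("events.csv", header=True, inferSchema=True)
df.cache()  # mark the DataFrame for in-memory storage

print(df.count())                               # first action: reads from disk, fills the cache
print(df.filter(df["status"] == "ok").count())  # second action: served from memory
```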
Nonetheless, Spark needs a lot of memory. Much like a conventional database, it loads data into memory and keeps it there until further notice, for the sake of caching. If Spark runs on Hadoop YARN alongside other resource-hungry services, or if the data is too big to fit entirely in memory, Spark can suffer major performance degradation.
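When the data may not fit entirely in memory, Spark lets you pick a storage level that spills overflow partitions to disk instead of holding everything in RAM. A hedged sketch, with a hypothetical Parquet file:

```python
# A sketch of choosing a storage level when data may exceed available RAM.
# "large_events.parquet" is a hypothetical dataset.
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-level-demo").getOrCreate()

df = spark.read.parquet("large_events.parquet")
df.persist(StorageLevel.MEMORY_AND_DISK)  # keep what fits in RAM, spill the rest to disk
df.count()                                # materialize the cache
```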
MapReduce, by contrast, kills its processes as soon as a job finishes, so it can easily run alongside other services with only minor performance impact.
Spark has the upper hand for iterative computations that need to pass over the same data many times. But for one-pass, ETL-like jobs, for example data transformation or data integration, MapReduce is the better fit: that is exactly what it was designed for.
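Here is a rough sketch of the iterative case: a tiny gradient-descent loop that makes fifty passes over the same cached RDD. The data points and update rule are invented purely for illustration; the point is that every pass reads from memory rather than from disk:

```python
# A rough sketch of an iterative job where caching pays off: every pass of
# the loop rereads the same RDD from memory instead of from disk.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iterative-demo").getOrCreate()
sc = spark.sparkContext

# toy (x, y) points, roughly y = 2x
points = sc.parallelize([(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]).cache()

w = 0.0
for _ in range(50):  # fifty passes over the same cached data
    grad = points.map(lambda p: (w * p[0] - p[1]) * p[0]).sum()
    w -= 0.01 * grad  # plain gradient-descent step for least squares

print(f"fitted slope: {w:.3f}")  # close to 2.0 for this toy data
```

Running the same fifty-pass job in MapReduce would mean fifty rounds of reading the input from disk and writing intermediate results back, which is where the speed gap comes from.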


