Friday 17 May 2019

Features of Apache Spark


Hello folks!! I am very excited about this post. You know why? Because this post will give you a clear picture of Apache Spark. Earlier we discussed the introduction and importance of Apache Spark, but without knowing the core components or features of Spark, we cannot draw a clear picture of it. So, let’s start…
As we know, every object has some features that make it different from others; similarly, Apache Spark has some features, which are mentioned below:
1)  Spark SQL
2)  MLlib
3)  Spark Streaming
4)  GraphX
Each of the above features deserves a complete chapter of its own, and we will discuss them in more detail in separate sections. Here we will only get an idea of their purpose, so let’s start one by one:
1) Spark SQL:
The Apache Spark ecosystem has a built-in native API to execute SQL queries or SQL jobs. The Spark SQL module is used for structured data processing. It provides a DataFrame abstraction in Python, Java, and Scala to simplify working with structured datasets. DataFrames are like tables in a relational database. We can read and write data in a variety of structured formats (e.g., JSON, Hive tables, and Parquet) using Spark SQL. It lets you query the data using SQL, both inside a Spark program and from external tools that connect to Spark SQL through standard database connectors (JDBC/ODBC), such as business intelligence tools like Tableau, Power BI, or Pentaho.
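To make this a little more concrete, below is a minimal sketch in Scala (Spark SQL works the same way from Python and Java); the file people.json and the columns name and age are made-up examples for illustration only:

import org.apache.spark.sql.SparkSession

object SparkSqlExample {
  def main(args: Array[String]): Unit = {
    // SparkSession is the entry point for DataFrame and SQL functionality
    val spark = SparkSession.builder()
      .appName("SparkSqlExample")
      .master("local[*]")
      .getOrCreate()

    // Read a structured file into a DataFrame (hypothetical path)
    val people = spark.read.json("people.json")

    // Query it with the DataFrame API...
    people.select("name", "age").filter("age > 21").show()

    // ...or register a temporary view and use plain SQL
    people.createOrReplaceTempView("people")
    spark.sql("SELECT name, age FROM people WHERE age > 21").show()

    spark.stop()
  }
}

The same DataFrame can also be written back out, for example with people.write.parquet(...), which is how Spark SQL moves data between the structured formats mentioned above.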
2) MLlib:
Apache Spark has its own machine learning library, MLlib, which is used for running machine learning algorithms.
Spark MLlib provides the following tools:

·         ML Algorithms: ML Algorithms form the core of MLlib. These include common learning algorithms such as classification, regression, clustering and collaborative filtering.
·         Featurization: Featurization includes feature extraction, transformation, dimensionality reduction and selection.
·         Pipelines: Pipelines provide tools for constructing, evaluating and tuning ML Pipelines.
·         Persistence: Persistence helps in saving and loading algorithms, models and Pipelines.
·         Utilities: Utilities for linear algebra, statistics and data handling.
MLlib Algorithms:
The popular algorithms and utilities in Spark MLlib are listed below (a small code sketch follows the list):

·         Basic Statistics
·         Regression
·         Classification
·         Recommendation System
·         Clustering
·         Dimensionality Reduction
·         Feature Extraction
·         Optimization
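To show how the pieces above (featurization, an ML algorithm, a Pipeline, and persistence) fit together, here is a minimal Scala sketch; the toy data, the column names f1, f2 and label, and the save path /tmp/lr-pipeline-model are invented for illustration:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object MLlibExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("MLlibExample")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Toy training data: two feature columns and a label (purely illustrative)
    val training = Seq(
      (0.0, 1.1, 0.0),
      (2.0, 1.0, 1.0),
      (2.0, 1.3, 1.0),
      (0.0, 1.2, 0.0)
    ).toDF("f1", "f2", "label")

    // Featurization: assemble raw columns into a single feature vector
    val assembler = new VectorAssembler()
      .setInputCols(Array("f1", "f2"))
      .setOutputCol("features")

    // ML Algorithm: a logistic regression classifier
    val lr = new LogisticRegression().setMaxIter(10)

    // Pipeline: chain featurization and the learning algorithm
    val pipeline = new Pipeline().setStages(Array(assembler, lr))
    val model = pipeline.fit(training)

    // Persistence: save the fitted model for later reuse (hypothetical path)
    model.write.overwrite().save("/tmp/lr-pipeline-model")

    model.transform(training).select("features", "label", "prediction").show()
    spark.stop()
  }
}

Chaining the stages into a Pipeline is the usual design choice because the whole fitted workflow can then be saved and reloaded as a single model.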
3)  Spark Streaming:
Spark Streaming is an extension of core Spark that enables scalable, fault-tolerant processing of data streams. It receives data streams from input sources, processes them in a cluster, and pushes the results out to databases or dashboards.




It chops up the data streams into batches of a few seconds each. Spark treats each batch of data as an RDD and processes it using RDD operations, and the processed results are pushed out in batches as well.
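Here is a minimal Scala sketch of this micro-batch model; the socket source on localhost:9999 is just an assumed test input (for example fed by nc -lk 9999), not anything specific to this post:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingExample {
  def main(args: Array[String]): Unit = {
    // Batch interval of 5 seconds: incoming data is chopped into 5-second batches
    val conf = new SparkConf().setAppName("StreamingExample").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Input source: a TCP socket stream (hypothetical host and port)
    val lines = ssc.socketTextStream("localhost", 9999)

    // Each batch is an RDD under the hood, so RDD-style operations apply
    val wordCounts = lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Push the processed results out batch by batch (here: print to the console)
    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}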


4) GraphX:
GraphX is a new component in Spark for graphs and graph-parallel computation. At a high level, GraphX extends the Spark RDD by introducing a new Graph abstraction: a directed multigraph with properties attached to each vertex and edge.



To support graph computation, GraphX exposes a set of fundamental operators (e.g., subgraph, joinVertices, and aggregateMessages) as well as an optimized variant of the Pregel API. In addition, GraphX includes a growing collection of graph algorithms and builders to simplify graph analytics tasks.
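Below is a minimal Scala sketch of the Graph abstraction and one of the built-in algorithms (PageRank); the users, the relationships, and the 0.001 tolerance are made-up example values:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD

object GraphXExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("GraphXExample").setMaster("local[*]"))

    // Vertices: (id, property) pairs -- here the property is a user name
    val users: RDD[(Long, String)] = sc.parallelize(Seq(
      (1L, "alice"), (2L, "bob"), (3L, "carol")
    ))

    // Edges: directed links with a property attached (here, a relationship label)
    val relationships: RDD[Edge[String]] = sc.parallelize(Seq(
      Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "follows")
    ))

    // The Graph abstraction: a directed multigraph with vertex and edge properties
    val graph = Graph(users, relationships)

    // One of the built-in graph algorithms: PageRank
    val ranks = graph.pageRank(0.001).vertices
    ranks.join(users).collect().foreach { case (_, (rank, name)) =>
      println(s"$name has rank $rank")
    }

    sc.stop()
  }
}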
