Hello folks! I am very excited about this post. You know why? Because this post will give you a clear picture of Apache Spark. We earlier discussed the introduction and importance of Apache Spark, but without knowing the core components or features of Spark, we cannot draw a clear picture of it. So, let's start…
As we know, every object has some features
that make it different from others. Similarly, Apache Spark has some features,
which are mentioned below:
1) Spark SQL
2) MLlib
3) Spark Streaming
4) GraphX
Each of the above features deserves a complete chapter of its own; we will discuss
them in more detail in separate sections. Here we will only get an idea of their purpose, so let's
start one by one:
1) Spark SQL:
The Apache Spark ecosystem has a built-in native API to
execute SQL queries and SQL jobs. The Spark SQL module is used for structured
data processing. It provides a DataFrame abstraction in Python, Java, and
Scala to simplify working with structured datasets. DataFrames are like tables
in a relational database. Using Spark SQL, we can read and write data in a variety of structured
formats (e.g., JSON, Hive tables, and Parquet). It lets you
query the data using SQL, both inside a Spark program and from external tools
that connect to Spark SQL through standard database connectors (JDBC/ODBC),
such as business intelligence tools like Tableau, Power BI, or Pentaho.
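Spark itself isn't needed to see the core idea. The pure-Python sketch below (using sqlite3 and made-up rows) answers the same question two ways: once as a declarative SQL query and once as programmatic filtering. That duality, SQL and a programmatic API over the same structured data, is exactly what Spark SQL offers over distributed datasets.

```python
import sqlite3

# Hypothetical rows standing in for a structured dataset (like a DataFrame).
people = [("Alice", 34), ("Bob", 19), ("Carol", 45)]

# SQL route: load the rows into a table and query them declaratively,
# the way Spark SQL lets you run SQL over DataFrames or via JDBC/ODBC.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?)", people)
sql_result = [row[0] for row in
              conn.execute("SELECT name FROM people WHERE age > 30 ORDER BY name")]

# Programmatic route: the same filter expressed in code, analogous to
# df.filter(df.age > 30).select("name") in Spark's DataFrame API.
api_result = sorted(name for name, age in people if age > 30)

print(sql_result)               # ['Alice', 'Carol']
print(sql_result == api_result) # True
```

In Spark, both routes compile down to the same optimized execution plan, so you can freely mix SQL strings and DataFrame method calls.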
2) MLlib:
Apache Spark has its own machine learning library, MLlib,
which is used for running machine learning algorithms.
Spark MLlib provides the following tools:
· ML Algorithms: these form the core of MLlib and include common learning algorithms such as classification, regression, clustering, and collaborative filtering.
· Featurization: feature extraction, transformation, dimensionality reduction, and selection.
· Pipelines: tools for constructing, evaluating, and tuning ML Pipelines.
· Persistence: saving and loading algorithms, models, and Pipelines.
· Utilities: linear algebra, statistics, and data handling.
MLlib Algorithms:
The popular algorithms and utilities in Spark MLlib
are:
· Basic Statistics
· Regression
· Classification
· Recommendation System
· Clustering
· Dimensionality Reduction
· Feature Extraction
· Optimization
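To make the Pipelines idea concrete, here is a minimal pure-Python sketch of the fit/transform pattern that MLlib pipelines follow. This is a toy stand-in, not the MLlib API: the stage classes and their learned parameters are invented for illustration. Each stage learns from the data in fit() and applies what it learned in transform() before passing the result to the next stage.

```python
# Toy stand-in for an ML pipeline: each stage learns parameters in fit()
# and applies them in transform(), mirroring the MLlib Pipeline pattern.
class Scaler:
    def fit(self, xs):
        self.max = max(xs)            # learn the scaling factor
        return self
    def transform(self, xs):
        return [x / self.max for x in xs]

class Thresholder:
    def fit(self, xs):
        self.cut = sum(xs) / len(xs)  # learn the mean as a cutoff
        return self
    def transform(self, xs):
        return [1 if x >= self.cut else 0 for x in xs]

class Pipeline:
    def __init__(self, stages):
        self.stages = stages
    def fit_transform(self, xs):
        for stage in self.stages:     # each stage fits, then feeds the next
            xs = stage.fit(xs).transform(xs)
        return xs

labels = Pipeline([Scaler(), Thresholder()]).fit_transform([10, 20, 30, 40])
print(labels)  # [0, 0, 1, 1]
```

In real MLlib, the stages would be Transformers and Estimators operating on DataFrames, and the fitted Pipeline could be persisted and reloaded later.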
3) Spark Streaming:
Spark Streaming is an extension of core Spark
that enables scalable, fault-tolerant processing of data streams. It receives
data streams from input sources, processes them in a cluster, and pushes the results out to
databases or dashboards.
It chops the data streams up into batches of a few
seconds each. Spark treats each batch of data as an RDD and processes it using RDD
operations, and the processed results are also pushed out in batches.
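The micro-batch idea can be sketched in plain Python. The event stream below is hypothetical, and real Spark Streaming would build the batches as RDDs on a cluster, but the shape is the same: chop the incoming stream into fixed-length time windows, then process each window with an ordinary batch operation.

```python
from collections import defaultdict

# Hypothetical (timestamp_seconds, word) events standing in for a live stream.
events = [(0.5, "spark"), (1.2, "sql"), (1.9, "spark"),
          (2.4, "graphx"), (3.7, "spark")]

def chop_into_batches(stream, interval):
    """Group events into micro-batches of `interval` seconds each,
    the way Spark Streaming chops a stream into small batches."""
    batches = defaultdict(list)
    for ts, value in stream:
        batches[int(ts // interval)].append(value)
    return [batches[k] for k in sorted(batches)]

# Each batch is then processed with an ordinary batch operation
# (Spark would treat each batch as an RDD); here, a per-batch word count.
counts = [{w: batch.count(w) for w in set(batch)}
          for batch in chop_into_batches(events, 2.0)]
print(counts)  # first batch counts spark twice and sql once, second batch
               # counts graphx once and spark once
```

The results come out batch by batch, matching the source's point that processed results are also pushed out in batches.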
4) GraphX:
GraphX is a new component in Spark for graphs and
graph-parallel computation. At a high level, GraphX extends the Spark RDD by
introducing a new Graph abstraction: a directed multigraph with properties
attached to each vertex and edge.
To support graph computation, GraphX exposes a
set of fundamental operators (e.g., subgraph, joinVertices, and aggregateMessages)
as well as an optimized variant of the Pregel API. In addition, GraphX includes
a growing collection of graph algorithms and builders to simplify graph
analytics tasks.
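As a rough illustration of the Pregel-style model that GraphX optimizes (pure Python on a toy graph, not the GraphX API): each vertex holds a property, repeatedly receives messages along incoming edges, keeps the best one, and the computation stops when nothing changes. Here the vertex property is the hop distance from a chosen source vertex.

```python
# Toy directed graph as (src, dst) edges; the vertex property we compute
# is the hop distance from source vertex "a".
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
dist = {v: float("inf") for v in {v for edge in edges for v in edge}}
dist["a"] = 0

# Pregel-style supersteps: every edge sends a candidate distance to its
# destination; each vertex keeps the minimum of its messages; iterate
# until no vertex property changes.
changed = True
while changed:
    changed = False
    for src, dst in edges:
        candidate = dist[src] + 1      # message sent along the edge
        if candidate < dist[dst]:      # vertex program: keep the minimum
            dist[dst] = candidate
            changed = True

print(sorted(dist.items()))  # [('a', 0), ('b', 1), ('c', 1), ('d', 2)]
```

GraphX runs the same send-message / merge-message / update loop in parallel across a partitioned graph, which is why algorithms like shortest paths and PageRank fit it so naturally.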