Hello everyone! Today we are going to discuss quite an interesting topic: the next step after data processing, which is presenting a visualization of the processed data to a business user or anyone else concerned with it. In this article, I will use Spark with Python (the PySpark library). We will create some sample data values, build a dataframe out of that dataset, run a few operations to process it as per our requirement, and finally use the matplotlib library to create the visualization.
Why am I using Python instead of other languages? Because Python provides very rich libraries for this kind of work, like data visualization, and it is very easy to apply to various use cases. If you want to implement it along with me, there are some basic prerequisites. First, you should have Python installed on your system; if you have not installed it yet, you can go to the Python website and download it for your operating system (I am using CentOS). You should also have an IDE for Python; PyCharm is the most popular IDE for Python, so I am using it to create a new project. Second, you should have Hadoop installed on your machine. Now we are ready to perform the data processing and visualization.
Now open the PyCharm IDE, click on Create New Project, give the project a name, and select the interpreter for whichever Python version you have installed on your system. Under the main project directory, create a Python file. Before writing any code in that file, import the Spark-related libraries and jars into your project: go to File > Settings, choose Project Structure, and add the Spark libraries there.
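The exact entries depend on where Spark is installed on your machine; as a sketch, assuming Spark is unpacked under /opt/spark, the paths you would typically add as content roots are:
/opt/spark/python
/opt/spark/python/lib/py4j-<version>-src.zip
Alternatively, if you only need Spark in local mode, you can simply install the library into the project's interpreter with pip install pyspark.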
So, let’s start…
First we need to start Hadoop, because Spark needs a resource manager to execute the program. I have installed Hadoop in standalone mode in my /opt directory, so I run the command below:
/opt/Hadoop/sbin/start-all.sh
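You can confirm that the daemons came up with the jps command, which on a typical single-node setup should list processes such as NameNode, DataNode, ResourceManager, and NodeManager:
jps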
Now everything is ready for writing the code 😊
I have a sales dataset (sales.csv) which contains the quantity of each product sold in 2017, 2018, and 2019.
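For illustration, here is a small hypothetical sample of what such a sales.csv could look like (the column names match the schema we define below; your actual file will differ):
InvoiceNo,Product,Quantity,Year
1001,Notebook,120,2017
1002,Pen,300,2018
1003,Notebook,150,2019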
Here I am importing the Spark SQL, pandas, and matplotlib modules (pandas is needed later when we call toPandas()):
from pyspark.sql import SparkSession
from pyspark.sql.types import *
import matplotlib.pyplot as plt
import pandas
Creating the Spark Session
spark=SparkSession.builder.master('local').appName('Spark-Operations').getOrCreate()
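Note: 'local' runs Spark with a single worker thread; if you want to use all the cores on your machine, you can pass 'local[*]' instead.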
Defining the schema according to the dataset
Schema = StructType([
    # each StructField is (column name, data type, nullable)
    StructField("InvoiceNo", IntegerType(), True),
    StructField("Product", StringType(), True),
    StructField("Quantity", IntegerType(), True),
    StructField("Year", IntegerType(), True),
])
Creating the SparkContext object and reading the dataset (since we pass header=True together with an explicit schema, Spark skips the header row and uses the column names from the schema):
sc = spark.sparkContext  # not strictly required here, but handy if you need RDD operations
df = spark.read.csv("/root/Desktop/sales.csv", header=True, sep=',', schema=Schema)
df.show()
When we execute the above code, df.show() prints the contents of the dataframe.
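With the hypothetical sample rows shown earlier, the output would look roughly like this:
+---------+--------+--------+----+
|InvoiceNo| Product|Quantity|Year|
+---------+--------+--------+----+
|     1001|Notebook|     120|2017|
|     1002|     Pen|     300|2018|
|     1003|Notebook|     150|2019|
+---------+--------+--------+----+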
Now that we have successfully created the dataframe, we are going to perform a SQL-style aggregation and visualize the processed data.
dfplot = df.groupBy("Year").sum("Quantity").orderBy("Year")  # total quantity per year, sorted so the bars appear in chronological order
pdf = dfplot.toPandas()  # collect the small aggregated result to the driver as a pandas dataframe
x = pdf["Year"].values.tolist()
y = pdf["sum(Quantity)"].values.tolist()  # sum("Quantity") names the result column "sum(Quantity)"
plt.bar(x, y, color="blue", label="Sales Visualization")
plt.title("Sales Report")
plt.xlabel("Year")
plt.ylabel("Sales")
plt.legend(facecolor="gray")
plt.show()
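Once you are done, it is good practice to stop the Spark session so that the resources it holds are released:
spark.stop()
Tip: if you want to keep the chart as an image file, call plt.savefig("sales_report.png") (a hypothetical file name) before plt.show(), since some matplotlib backends discard the figure after it is shown.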