Sunday 22 December 2019

Data To Visualization Using Pyspark and Python Library


Hello everyone! Today we are going to discuss a quite interesting topic: the step that comes after data processing, namely visualizing the processed data for a business user or anyone else concerned with it. In this article I will use Spark with Python (the PySpark library). We will create some sample data, build a DataFrame from it, run a few operations to process the data as per our requirement, and finally use the matplotlib library to create a visualization.




Why am I using Python instead of other languages? Because Python provides very rich libraries for tasks like data visualization, and it is very easy to apply to various use cases. If you want to follow along with me, there are some basic prerequisites. First, you should have Python installed on your system; if you do not, you can go to the Python website and download it for your operating system (I am using CentOS with a GUI). You should also have an IDE for Python; PyCharm is the most popular IDE for Python, so I am using it to create a new project. Second, you should have Hadoop installed on your machine. Now we are ready to process the data and visualize it.
So, let’s start ….
First we need to start Hadoop, because Spark needs a resource manager to execute the program. I have installed Hadoop in standalone mode in my /opt directory, so I run the command below:

/opt/Hadoop/sbin/start-all.sh


Now open the PyCharm IDE, click on Create New Project, enter the project name, and select the interpreter for whichever Python version you have installed on your system. Under the main project directory, create a Python file. Before writing any code in that file, import the Spark-related libraries and jars into your project: go to File Settings, choose Project Structure, and add the libraries and jars below:
Now everything is ready for writing the code 😊

I have a sales dataset (sales.csv) which contains the quantity of each product sold in 2017, 2018, and 2019.
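If you want to follow along without the original file, a small sample sales.csv can be generated with plain Python. The column names match the schema defined later in this article; the rows themselves are hypothetical values I made up purely for illustration:

```python
import csv

# Hypothetical sample rows (invented for illustration), matching the
# columns used by the schema below: InvoiceNo, Product, Quantity, Year
rows = [
    (1001, "Laptop", 120, 2017),
    (1002, "Phone",  340, 2017),
    (1003, "Laptop", 150, 2018),
    (1004, "Phone",  410, 2018),
    (1005, "Laptop", 200, 2019),
    (1006, "Phone",  500, 2019),
]

# Write a header row followed by the data rows
with open("sales.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["InvoiceNo", "Product", "Quantity", "Year"])
    writer.writerows(rows)
```

Adjust the path in the read step below to wherever you save this file.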


Here I am importing the Spark SQL, pandas, and matplotlib modules:
       
from pyspark.sql import SparkSession
from pyspark.sql.types import *
import matplotlib.pyplot as plt
import pandas
Creating the Spark Session       

spark = SparkSession.builder.master('local').appName('Spark-Operations').getOrCreate()

       
 
Defining the schema according to the dataset:

Schema = StructType([
    StructField("InvoiceNo", IntegerType(), True),
    StructField("Product", StringType(), True),
    StructField("Quantity", IntegerType(), True),
    StructField("Year", IntegerType(), True),
])

       
 
Creating the SparkContext object and reading the dataset:

sc = spark.sparkContext
df = spark.read.csv("/root/Desktop/sales.csv", header=True, sep=',', schema=Schema)
df.show()

       
 

When we execute the above code, it will generate the below output.

Now that we have successfully created the DataFrame, we are going to run some SQL-style operations and visualize the processed data.
       

dfplot = df.groupBy("Year").sum("Quantity")
pdf = dfplot.toPandas()  # convert to pandas once instead of twice
x = pdf["Year"].values.tolist()
y = pdf["sum(Quantity)"].values.tolist()

plt.bar(x, y, color="blue", label="Sales Visualization")
plt.title("Sales Report")
plt.xlabel("Year")   # xlabel/ylabel are functions, not attributes
plt.ylabel("Sales")
plt.legend(facecolor="gray")
plt.show()
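For a small aggregation like this, you can sanity-check the Spark result with plain pandas, which we already imported above. A minimal sketch using hypothetical rows that mirror the dataset's columns:

```python
import pandas as pd

# Hypothetical rows (invented for illustration) with the same
# Year/Quantity columns as the Spark DataFrame above
pdf = pd.DataFrame({
    "Year":     [2017, 2017, 2018, 2018, 2019],
    "Quantity": [120, 340, 150, 410, 200],
})

# Same aggregation as df.groupBy("Year").sum("Quantity") in Spark
totals = pdf.groupby("Year")["Quantity"].sum()
print(totals)
```

The resulting series (Year as index, summed Quantity as values) can be plotted directly with `totals.plot(kind="bar")` if you prefer to skip the explicit x/y lists.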
       
 

So, finally, we have plotted the graph of the dataset.









