Hello Everyone!!
As we know, today's world is heterogeneous: we have different technologies operating on different platforms, and a huge amount of data is generated every day. According to one survey, in the year 2017 there were 4.7 trillion photos stored on social media. Many organizations generate a huge amount of revenue by using this information. These days it would not be wrong to compare data with oxygen, because we deal with data in every aspect of our lives.
We get data from different sources and in different forms, structured, semi-structured, and unstructured, so we need to be sure about data quality before visualization. There is every chance the data is inconsistent, incomplete, ambiguous, or duplicated, and we cannot get meaningful information from such raw data. This is where data profiling comes in.
Data profiling is the activity of checking the accuracy, completeness, and validity of data.
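For instance, a rough sketch of a completeness check in PySpark (assuming a DataFrame named df, like the one we create below) counts the null values in each column:
>>> from pyspark.sql.functions import col, count, when
>>> # completeness check: number of null values per column
>>> df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns]).show()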
There are lots of tools available in the market for data profiling, but here I am going to clean the data for visualization using PySpark.
In this example we have a dataset named d.json stored in HDFS; we create a SparkSession and pull the data into a DataFrame.
1) Remove missing values:
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.appName("profiling").getOrCreate()
>>> df = spark.read.format("json").load("hdfs:///d.json")
>>> df.show()
>>> # drop rows that have fewer than 2 non-null values
>>> df.na.drop(thresh=2).show()
2) Remove duplicate values:
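A minimal sketch using Spark's built-in dropDuplicates() (here "name" is a hypothetical column used only for illustration):
>>> # drop rows that are duplicated across every column
>>> df.dropDuplicates().show()
>>> # or drop rows that are duplicated on selected columns only
>>> df.dropDuplicates(["name"]).show()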
3) Remove junk values:
Junk values include special characters like $, &, *, %, #, etc. We can either remove the particular rows that contain these values or update the cell values using Spark's built-in functions, as shown in the sketch below.
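A minimal sketch of both approaches (again assuming a hypothetical string column named "name"):
>>> from pyspark.sql.functions import col, regexp_replace
>>> # option 1: drop rows whose "name" column contains a junk character
>>> df.filter(~col("name").rlike("[$&*%#]")).show()
>>> # option 2: keep the rows but strip the junk characters from the cell values
>>> df.withColumn("name", regexp_replace(col("name"), "[$&*%#]", "")).show()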
Replace null values with 0.0:
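A minimal sketch using the built-in na.fill() (the "salary" column is a hypothetical example):
>>> # replace nulls in every numeric column with 0.0
>>> df.na.fill(0.0).show()
>>> # or target one specific column only
>>> df.na.fill(0.0, subset=["salary"]).show()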
Note: similarly, we can remove all other junk values using the built-in functions shown above.
4) Remove identical rows:
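Rows that are identical in every column add no information, so we can keep a single copy of each with the built-in distinct(), which is equivalent to dropDuplicates() with no arguments:
>>> # keep a single copy of each fully identical row
>>> df.distinct().show()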