Sunday 1 December 2019

Data Profiling to Improve Data Quality Using Scala


Hello everyone! As we know, today's world is a heterogeneous one: we have different technologies operating on different platforms, and a huge amount of data is being generated every day. According to one survey, about 4.7 trillion photos were stored on social media in 2017. Many organizations generate a huge amount of revenue by using this information. Nowadays it would not be wrong to compare data with oxygen, because in every aspect we work with data.




We receive data from different sources and in different forms, structured, semi-structured and unstructured, and we need to be sure about data quality before visualization. The raw data is often inconsistent, incomplete, ambiguous or duplicated, and we cannot get meaningful information from it directly. This is where data profiling comes in.
Data profiling is an activity in which we check the accuracy, completeness and validity of data.

There are lots of tools available in the market for data profiling, but here I am going to clean the data for visualization using Scala with Apache Spark.
In this example we have a dataset named d.json stored in HDFS; we pull the data into a DataFrame through the SparkSession.
1)      Missing value removal:
scala> // spark-shell already provides a SparkSession named spark
scala> val df = spark.read.format("json").load("hdfs:///d.json")
scala> df.show()



scala> df.na.drop(2).show()   // keep only rows with at least 2 non-null values

2)      Duplicate value removal:
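A minimal sketch in the Scala spark-shell, assuming the df loaded above and a hypothetical id column: dropDuplicates keeps only one row per distinct value of the listed columns.

scala> // keep one row per distinct value of the (hypothetical) "id" column
scala> df.dropDuplicates("id").show()

scala> // with no column list, dropDuplicates compares every column
scala> df.dropDuplicates().show()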

















3)      Junk value removal:
It includes special characters like $, &, *, %, #, etc. We can either remove the rows that contain these values or update the value of the cell with Spark's built-in Scala functions, as shown in the sketch below.
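A minimal sketch, assuming a hypothetical string column named name in df: the built-in regexp_replace function rewrites the junk characters, while filter with rlike drops the affected rows instead.

scala> import org.apache.spark.sql.functions.{col, regexp_replace}

scala> // replace special characters in the (hypothetical) "name" column with an empty string
scala> df.withColumn("name", regexp_replace(col("name"), "[$&*%#]", "")).show()

scala> // or drop the rows whose "name" column contains any of these characters
scala> df.filter(!col("name").rlike("[$&*%#]")).show()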

Replace null values with 0.0, as sketched below:
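A minimal sketch, assuming the nulls sit in numeric columns of df; the built-in na.fill function does the replacement (the salary column name is only an example).

scala> // fill every null in numeric columns with 0.0
scala> df.na.fill(0.0).show()

scala> // or limit the fill to a specific (hypothetical) column
scala> df.na.fill(0.0, Seq("salary")).show()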





Note:
Similarly, we can remove all junk values by using the built-in functions shown above.


4)      Identical row removal:
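A minimal sketch, assuming the same df: distinct (equivalent to dropDuplicates with no arguments) keeps a single copy of each row that is identical across all columns.

scala> // keep only one copy of each fully identical row
scala> df.distinct().show()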




5)      Blank row removal:
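A minimal sketch, assuming a blank row is one where every column is null; na.drop("all") removes exactly those rows.

scala> // drop rows in which every column is null
scala> df.na.drop("all").show()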






