Wednesday 12 May 2021

Operations on Dataframe

 

Hello everyone! In this post, I will demonstrate some operations that are very important to know when you are dealing with a DataFrame.

 

 


We get data from different sources and in different forms (structured, semi-structured and unstructured), and we need to be sure about data quality before visualization. There is a chance of getting inconsistent, incomplete, ambiguous or duplicate data, and we can't get meaningful information from raw data in that state, so we need some operations to manipulate the data.


Here I will load datasets into a DataFrame and apply some operations to them.

So let's start...

Before performing operations on a DataFrame, we need to bring the data into one.

I am going to create a Spark session and then load the data into a DataFrame.

 

1) How to project columns?


There are three ways to project columns from a DataFrame.

1st Method:

 


2nd Method:

3rd Method:


2) How can we retrieve the desired records from the DataFrame?

There are two methods, a) filter(condition) and b) where(condition). With either method we can retrieve only the records that satisfy the given conditions; where() is an alias for filter(), so the two behave identically.

For example:

In the example below, I am using where with some conditions.


In the example below, I am using the filter method with some conditions.


3) How can we merge two columns into a single column?

For this activity, I am going to load a student dataset into a new DataFrame, df2. In this DataFrame the student name is split across two separate columns, first_name and last_name. Now I am going to merge them into a single column.

Here I am using the concatenate method to merge the columns.


4) How can we rename a column in a DataFrame?

Let's suppose we have a dataset whose column names are not in a proper form. There are two options to define the schema:

i) We can infer the schema from the dataset and rename the columns as per our requirement, but inferring the schema is not recommended for large datasets.

Let's see in the example below how we can rename a column.

Before renaming the DataFrame column:


After renaming the DataFrame column:


 

 

ii) Define the schema and pass it to the DataFrame when you are creating it.

For example:





