How to remove NaN or NULL values in data using Python
Handling NaN (missing) values is an essential step in data analysis.
Let’s see how we can remove NaN values from our data.
See image 1
Our data contains three columns: the 1st and 2nd columns are categorical and the 3rd column is numerical.
We then use data.isna().sum() to count how many NaN values each column contains.
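As a minimal sketch of this step, the snippet below builds a small stand-in dataset and counts the NaN values per column. The column names (color, size, price) and the values are hypothetical and are not the data shown in the images.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: two categorical (object) columns and one numerical column.
data = pd.DataFrame({
    "color": pd.Series(["red", "blue", np.nan, "red", "red"], dtype=object),
    "size": pd.Series(["S", np.nan, "M", "M", np.nan], dtype=object),
    "price": [10.0, np.nan, 30.0, np.nan, np.nan],
})

# isna() marks each missing cell as True; sum() then counts the True values per column.
nan_counts = data.isna().sum()
print(nan_counts)
```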
See image 2
To sort the counts from highest to lowest, we use data.isna().sum().sort_values(ascending=False).
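A short sketch of the sorted count, again on hypothetical stand-in data (the column names and values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with a different number of NaN values in each column.
data = pd.DataFrame({
    "color": pd.Series(["red", "blue", np.nan, "red", "red"], dtype=object),
    "size": pd.Series(["S", np.nan, "M", "M", np.nan], dtype=object),
    "price": [10.0, np.nan, 30.0, np.nan, np.nan],
})

# Count NaN values per column, then list the columns with the most NaN values first.
sorted_counts = data.isna().sum().sort_values(ascending=False)
print(sorted_counts)
```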
We can also delete the rows that contain NaN values using data.dropna().
Deleting rows that contain NaN values is usually not recommended, because it also discards the valid values in those rows, but it is good to know that the option exists.
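The sketch below shows what dropna() does on hypothetical stand-in data (names and values are assumptions). In this made-up example only one row has no NaN at all, which illustrates how much data can be lost.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: only the first row is complete.
data = pd.DataFrame({
    "color": pd.Series(["red", "blue", np.nan, "red", "red"], dtype=object),
    "size": pd.Series(["S", np.nan, "M", "M", np.nan], dtype=object),
    "price": [10.0, np.nan, 30.0, np.nan, np.nan],
})

# dropna() returns a new DataFrame without any row that contains at least one NaN.
cleaned = data.dropna()
print(f"rows before: {len(data)}, rows after: {len(cleaned)}")
```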
See image 3
We can also fill in NaN values with a constant value, such as 0, using data.fillna(0).
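A minimal sketch of the constant fill on hypothetical stand-in data (names and values are assumptions). Note that the categorical columns receive the number 0 as well, which is one reason a constant is rarely a meaningful replacement.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: two object columns and one numerical column.
data = pd.DataFrame({
    "color": pd.Series(["red", "blue", np.nan, "red", "red"], dtype=object),
    "size": pd.Series(["S", np.nan, "M", "M", np.nan], dtype=object),
    "price": [10.0, np.nan, 30.0, np.nan, np.nan],
})

# fillna(0) returns a new DataFrame in which every NaN is replaced by 0.
filled = data.fillna(0)
print(filled)
```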
That covers the basic syntax for handling NaN values. Let’s see how to fill NaN values with more meaningful values.
So what do we mean by meaningful values?
Replacing NaN values with a constant, as we do in image 3, is usually not the best option.
If the column data type is numerical, a common choice is to fill the NaN values with that column's mean (average) value. In some other cases, we also use median and mode, but in most cases, we use mean and
If the column data type is object, then it is best to use the most frequent value of that column.
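To make these candidate fill values concrete, here is a small pandas sketch that computes the mean and median of a numerical column and the most frequent value of an object column. The data and column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one object column and one numerical column.
data = pd.DataFrame({
    "color": pd.Series(["red", "blue", np.nan, "red", "red"], dtype=object),
    "price": [10.0, np.nan, 30.0, np.nan, np.nan],
})

# pandas skips NaN values by default when computing these statistics.
price_mean = data["price"].mean()
price_median = data["price"].median()

# mode() returns a Series because ties are possible; take the first entry.
color_mode = data["color"].mode()[0]

print(price_mean, price_median, color_mode)
```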
So how do we do that? We use the SimpleImputer class from scikit-learn to fill a numerical column with the mean and an object column with the most frequent value.
SimpleImputer() takes a strategy argument that defaults to "mean"; for object columns we set it with SimpleImputer(strategy="most_frequent").
Using strategy="most_frequent" tells the imputer to fill each column with its most frequent value.
Let’s see how we do that in Python.
See image 4
We import SimpleImputer, which will be used to fill the NaN values.
The first step is to select the numerical column, as we do in the second line of image 4; then we use the fit_transform method to fill the NaN values with the mean.
Our numerical column's NaN values are now replaced with the mean value.
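The step described above might look like the following sketch. The column name price and its values are hypothetical; the actual code is in image 4.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical numerical column with missing values.
data = pd.DataFrame({"price": [10.0, np.nan, 30.0, np.nan, np.nan]})

# The default strategy is "mean"; it is written out here for clarity.
num_imputer = SimpleImputer(strategy="mean")

# Double brackets select a 2D DataFrame, which is the shape SimpleImputer expects.
# fit_transform learns the column mean and returns a NumPy array with NaN replaced.
price_filled = num_imputer.fit_transform(data[["price"]])
print(price_filled.ravel())
```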
See image 5
We do the same thing as before, but we pass strategy="most_frequent" because our columns are of object type, so this fills those two columns with the most frequent value of each column.
Now the NaN values in these columns are filled.
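A sketch of the same step for the two object columns, using hypothetical column names and values rather than the data from image 5:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical categorical (object) columns with missing values.
data = pd.DataFrame({
    "color": pd.Series(["red", "blue", np.nan, "red", "red"], dtype=object),
    "size": pd.Series(["S", np.nan, "M", "M", np.nan], dtype=object),
})

# "most_frequent" works on string data; each column is filled with its own mode.
cat_imputer = SimpleImputer(strategy="most_frequent")
cat_filled = cat_imputer.fit_transform(data[["color", "size"]])
print(cat_filled)
```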
See image 6
Now combine all the columns, and that’s it. All our NaN values are now replaced with meaningful values.
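One way to combine the imputed columns is sketched below with pd.concat. This is an assumption about how the final step could be done, not necessarily the code in image 6, and the data and column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical sample data: two object columns and one numerical column.
data = pd.DataFrame({
    "color": pd.Series(["red", "blue", np.nan, "red", "red"], dtype=object),
    "size": pd.Series(["S", np.nan, "M", "M", np.nan], dtype=object),
    "price": [10.0, np.nan, 30.0, np.nan, np.nan],
})
cat_cols = ["color", "size"]
num_cols = ["price"]

# Impute each group of columns with the strategy that suits its data type.
cat_filled = SimpleImputer(strategy="most_frequent").fit_transform(data[cat_cols])
num_filled = SimpleImputer(strategy="mean").fit_transform(data[num_cols])

# fit_transform returns NumPy arrays, so wrap them in DataFrames before combining.
cat_df = pd.DataFrame(cat_filled, columns=cat_cols, index=data.index)
num_df = pd.DataFrame(num_filled, columns=num_cols, index=data.index)
clean = pd.concat([cat_df, num_df], axis=1)

print(clean.isna().sum())
```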