
Missing value
Duplicated data
Outlier
Normalization
One-Hot Encoding
Binning
Broadly speaking, there are two ways to deal with missing values: 1. delete them, or 2. impute them (replace the missing values with substituted values).
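As a rough sketch of the second option (imputation), here is one way it might look. Mean imputation is just an assumption for illustration, since these notes don't say which substitute value to use, and the column name "age" is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column name "age" is an assumption.
df = pd.DataFrame({"age": [20.0, np.nan, 40.0]})

# Impute: replace each missing value with the column mean.
# Series.mean() skips NaN by default, so the mean is (20 + 40) / 2 = 30.
imputed = df["age"].fillna(df["age"].mean())

print(imputed.tolist())
```

The median or a constant could be used in the same way by swapping the argument passed to fillna().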
The DataFrame.isnull() function can be used to detect missing values. It returns a boolean object of the same size indicating whether each value is NA. NA values (None or numpy.nan) get mapped to True and all other values get mapped to False.
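A small sketch of what that mask looks like, using a made-up two-column DataFrame (the column names "a" and "b" are assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical data with one numpy.nan and one None.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", None, "z"]})

# Same shape as df, with True wherever a value is missing.
mask = df.isnull()

print(mask)
```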
DataFrame.isnull().mean() computes the mean of the Boolean mask for each column (True evaluates as 1 and False as 0), which gives the fraction of missing values in each column.
Let's say there's a column that has values:
[np.nan, 2, 3, 4]
is evaluated as:
[True, False, False, False]
interpreted as:
[1, 0, 0, 0]
so the mean, i.e. the fraction of missing values in that column, is 1/4 = 0.25.
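The same steps can be checked in code. This is only a sketch on a single column built from the values above; the variable names are my own.

```python
import numpy as np
import pandas as pd

col = pd.Series([np.nan, 2, 3, 4])

mask = col.isnull()            # True where the value is missing
as_ints = mask.astype(int)     # True -> 1, False -> 0
missing_fraction = mask.mean() # mean of the 0/1 values

print(mask.tolist())
print(as_ints.tolist())
print(missing_fraction)
```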
DataFrame.isnull().mean().sort_values(ascending = False) sorts the resulting Series by its values (the fraction of missing values in each column) in descending order, not by column name.
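To see what the sort actually orders by, here is a minimal sketch on a made-up DataFrame (column names "a", "b", "c" are assumptions). The columns come back ordered by how much data they are missing, with the worst column first.

```python
import numpy as np
import pandas as pd

# Hypothetical data: each column has a different share of missing values.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],        # nothing missing
    "b": [np.nan, np.nan, 3.0, 4.0],  # half missing
    "c": [np.nan, 2.0, 3.0, 4.0],     # one quarter missing
})

# One value per column, sorted by that value from largest to smallest.
missing = df.isnull().mean().sort_values(ascending=False)

print(missing)
```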
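sort_values() is a method of Series and DataFrame objects, not of single numbers. I created my own dataframe to try this, but when I typed df.isnull().mean() the result came back as a numpy.float64, which has no sort_values() method. That happens when the object is a single column (a Series) rather than a DataFrame: .isnull().mean() on a Series returns one number, while on a DataFrame it returns a Series with one value per column, which can be sorted.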
AttributeError: 'numpy.float64' object has no attribute 'sort_values'
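One likely way to run into this error is sketched below. It is an assumption, since the dataframe that produced the error isn't shown here: calling .isnull().mean() on a single column gives one number, while calling it on the whole DataFrame gives a Series.

```python
import numpy as np
import pandas as pd

# Hypothetical data; column names are assumptions.
df = pd.DataFrame({"a": [np.nan, 2, 3, 4], "b": [1, 2, 3, 4]})

per_column = df.isnull().mean()   # Series: one value per column
scalar = df["a"].isnull().mean()  # single number for one column

# The Series can be sorted.
sorted_fractions = per_column.sort_values(ascending=False)

# The single number cannot, because it has no sort_values method.
message = None
try:
    scalar.sort_values()
except AttributeError as err:
    message = str(err)

print(sorted_fractions)
print(message)
```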
Looking more closely, there are 7 ways in total to handle missing values in machine learning.
Rows or columns that have null values (missing values) can be deleted.
If more than half of the rows in a column are null, the entire column can be dropped. Rows that have null values in one or more columns can also be dropped.
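A minimal sketch of both kinds of deletion, on a made-up DataFrame (the column names and the exact 0.5 cutoff are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 4.0],  # 75% missing
    "c": [np.nan, 2.0, 3.0, 4.0],        # 25% missing
})

# Drop columns where more than half of the rows are null.
keep = df.columns[df.isnull().mean() <= 0.5]
fewer_columns = df[keep]

# Drop rows that have a null in one or more of the remaining columns.
fewer_rows = fewer_columns.dropna(axis=0, how="any")

print(fewer_columns.columns.tolist())
print(fewer_rows)
```

Dropping the mostly empty column first means fewer rows are lost in the second step.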
Pros:
Cons: