Aiffel Day 5

Serena Chang · January 3, 2022

AIFFEL

LMS & Python master class day 5

What I've learned

Data preprocessing part 1

Missing value
Duplicated data
Outlier
Normalization
One-Hot Encoding
Binning

1. Missing value

Broadly speaking, there are two ways to handle missing values: 1. delete them, or 2. impute them (replace the missing values with substituted values).
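To make the two options concrete, here is a minimal sketch using pandas. The toy DataFrame and its column names (`age`, `city`) are made up for illustration, and the fill values (column mean, an `"unknown"` label) are just one possible choice, not a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are invented for this example.
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 40.0],
    "city": ["Seoul", "Busan", None, "Seoul"],
})

# Option 1: delete every row that contains at least one missing value.
dropped = df.dropna()

# Option 2: impute. Here the numeric column gets its mean and the
# text column gets a placeholder label.
imputed = df.fillna({"age": df["age"].mean(), "city": "unknown"})
```

Deleting is simpler but throws away the rest of each affected row, while imputing keeps every row at the cost of inserting values that were never observed.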

The DataFrame.isnull() function detects missing values. It returns a Boolean object of the same shape that indicates whether each value is NA: NA values (None or numpy.nan) are mapped to True, and all other values are mapped to False.

DataFrame.isnull().mean() computes the mean of that Boolean mask for each column (True counts as 1 and False as 0), which gives the fraction of missing values per column.

Let's say there's a column that has the values:

[np.nan, 2, 3, 4]

is evaluated as:

[True, False, False, False]

interpreted as:

[1, 0, 0, 0]

so the mean is 0.25: one quarter of the values in the column are missing.
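The same column can be checked directly in pandas. This is a small sketch; the variable names are only for illustration.

```python
import numpy as np
import pandas as pd

# The example column from above.
s = pd.Series([np.nan, 2, 3, 4])

# Boolean mask: True where the value is missing.
mask = s.isnull()

# Mean of the mask: True counts as 1 and False as 0,
# so this is the fraction of missing values.
missing_ratio = mask.mean()
```

Here `mask` holds `[True, False, False, False]` and `missing_ratio` is `0.25`.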

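DataFrame.isnull().mean().sort_values(ascending=False) sorts the resulting Series by its values (the fraction of missing values in each column) in descending order, so the columns with the most missing data come first.
When I tried this on my own data, I got the error below. It appears when isnull().mean() is called on a single column (a Series) instead of a whole DataFrame: the result is then a single numpy.float64 number, which has no sort_values() method. Called on a DataFrame, isnull().mean() returns a Series whose dtype is float64, and sort_values() works on it.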

AttributeError: 'numpy.float64' object has no attribute 'sort_values'
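The difference can be reproduced with a small sketch. The DataFrame below is made up for illustration; my guess is that the error came from a single-column selection like the one in the last lines.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a different share of missing values per column.
df = pd.DataFrame({
    "a": [np.nan, 2, 3, 4],
    "b": [np.nan, np.nan, np.nan, 4],
    "c": [1, 2, 3, 4],
})

# On a DataFrame, the result is a Series with one ratio per column.
per_column = df.isnull().mean()

# A Series can be sorted by value: highest missing ratio first.
ranked = per_column.sort_values(ascending=False)

# On a single column, the result is one scalar number instead.
single = df["a"].isnull().mean()
# single.sort_values() would raise the AttributeError shown above.
```

`ranked` lists the columns in the order `b`, `a`, `c`, while `single` is just the number `0.25`.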

Looking more closely, there are seven ways in total to handle missing values in machine learning:

Deleting rows with missing values
Impute missing values for continuous variables
Impute missing values for categorical variables
Other imputation methods
Using algorithms that support missing values
Prediction of missing values
Imputation using a deep learning library - Datawig

1. Deleting rows with missing values

Rows or columns that contain null (missing) values can be deleted.
If more than half of the rows in a column are null, the entire column can be dropped. Rows that have a null value in one or more columns can also be dropped.
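The two deletion rules can be sketched in pandas as follows. The data is made up, and the 0.5 cutoff simply mirrors the "more than half" rule of thumb above; it is an assumption, not a fixed standard.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column "b" is mostly missing.
df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [np.nan, np.nan, np.nan, 4.0],
    "c": [1.0, 2.0, 3.0, 4.0],
})

# Step 1: drop columns where more than half of the rows are null.
null_ratio = df.isnull().mean()
kept = df.loc[:, null_ratio <= 0.5]

# Step 2: drop any remaining row that has at least one null value.
cleaned = kept.dropna(axis=0, how="any")
```

In this example column `b` is removed in step 1, and the row with the missing value in `a` is removed in step 2, leaving three complete rows.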

Pros: