Aiffel Day 5

Serena Chang · January 3, 2022

AIFFEL


LMS & Python master class day 5

What I've learned

Data preprocessing part 1

  1. Missing value

  2. Duplicated data

  3. Outlier

  4. Normalization

  5. One-Hot Encoding

  6. Binning

1. Missing value

Broadly, there are two ways to handle missing values: (1) delete them, or (2) impute them, that is, replace each missing value with a substitute value.
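To make the two options concrete, here is a minimal sketch. The DataFrame and its column names (age, city) are made up for illustration and are not from the course material; the mean and the "unknown" placeholder are just one possible choice of substitute values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each column.
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0],
    "city": ["Seoul", "Busan", None],
})

# Option 1: delete every row that contains at least one missing value.
dropped = df.dropna()

# Option 2: impute. Here "age" gets the column mean and "city" a placeholder string.
imputed = df.fillna({"age": df["age"].mean(), "city": "unknown"})

print(dropped)
print(imputed)
```

With this toy data, deleting keeps only the first row, while imputing keeps all three rows.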

DataFrame.isnull() detects missing values. It returns a boolean object of the same shape as the input, indicating whether each value is NA. NA values (None or numpy.nan) are mapped to True and all other values are mapped to False.

DataFrame.isnull().mean() computes the mean of that Boolean mask (True counts as 1 and False as 0), which gives the fraction of missing values in each column.

Let's say there's a column that has the values:

[np.nan, 2, 3, 4]

is evaluated as:

[True, False, False, False]

interpreted as:

[1, 0, 0, 0]

so its mean is 1/4 = 0.25, meaning 25% of the values in the column are missing.
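The same example can be checked directly in pandas. This is a small sketch using a Series to stand in for the single column; the variable names are my own.

```python
import numpy as np
import pandas as pd

# The example column from above.
col = pd.Series([np.nan, 2, 3, 4])

# Boolean mask: True where the value is missing.
mask = col.isnull()

# Mean of the mask: True counts as 1, False as 0.
missing_ratio = mask.mean()

print(mask.tolist())   # [True, False, False, False]
print(missing_ratio)   # 0.25
```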

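DataFrame.isnull().mean().sort_values(ascending=False) sorts the resulting Series by its values in descending order, so the columns with the largest share of missing values come first.
sort_values() is a method of pandas Series and DataFrame objects, not of plain numbers. When I tried this on my own data, df.isnull().mean() came back as a single numpy.float64 instead of a Series. That most likely means my df was actually a single column (a Series) rather than a DataFrame, so sort_values() could not be called on the result: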

AttributeError: 'numpy.float64' object has no attribute 'sort_values'
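The difference can be reproduced with a small sketch. The DataFrame below is hypothetical (columns a, b, c are made up), and the single-column case is only my guess at what caused the error above.

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame with a different share of missing values per column.
df = pd.DataFrame({
    "a": [np.nan, 2, 3, 4],
    "b": [np.nan, np.nan, 3, 4],
    "c": [1, 2, 3, 4],
})

# On a DataFrame, isnull().mean() returns a Series (one ratio per column),
# so sort_values() is available.
per_column = df.isnull().mean()
ranked = per_column.sort_values(ascending=False)
print(ranked)

# On a single column (a Series), isnull().mean() returns one scalar number,
# and a scalar has no sort_values() method.
single = df["a"].isnull().mean()
has_sort = hasattr(single, "sort_values")
print(single, has_sort)
```

Here ranked lists b first (0.5), then a (0.25), then c (0.0), while single is just the number 0.25.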

Looking more closely, there are seven ways to handle missing values in machine learning:

  1. Deleting rows with missing values
  2. Impute missing values for continuous variable
  3. Impute missing values for categorical variable
  4. Other Imputation Methods
  5. Using Algorithms that support missing values
  6. Prediction of missing values
  7. Imputation using Deep Learning Library - Datawig

1. Deleting rows with missing values

Rows or columns that contain null (missing) values can be deleted.
If more than half of the rows in a column are null, the entire column can be dropped. Rows that have a null value in one or more columns can also be dropped.
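A minimal sketch of both deletions, assuming a made-up DataFrame (the column names mostly_missing, feature, and label are hypothetical) and using 0.5 as the "more than half" cutoff:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one column is mostly null, another has a single null.
df = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 4.0],
    "feature": [1.0, np.nan, 3.0, 4.0],
    "label": [0, 1, 0, 1],
})

# Step 1: drop columns where more than half of the rows are null.
null_ratio = df.isnull().mean()
kept = df.loc[:, null_ratio <= 0.5]

# Step 2: drop the remaining rows that have a null in any column.
cleaned = kept.dropna(axis=0, how="any")

print(cleaned)
```

Dropping the column first matters here: calling dropna() on the original df would leave only the last row, because mostly_missing is null in the other three.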

Pros:

  • Training on data with all missing values removed can produce a robust model.

Cons:

  • Loss of a lot of information.
  • Works poorly if the percentage of missing values is high relative to the complete dataset.