Data Preprocessing

been_29 · August 6, 2024

한국경제신문 (Korea Economic Daily) with Toss bank MLOps course


Data preprocessing is a crucial step that transforms raw data into a clean, usable format.

Handling Missing Values: Identify and address missing or null values.
- Removal : Drop rows or columns with missing values.
- Imputation : Fill missing values using methods like mean, median, mode, or more advanced techniques like KNN.
Removing Duplicates: Detect and remove duplicate entries.
Handling Outliers: Identify outliers that may skew the data analysis and either remove or adjust them.
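
The cleaning steps above can be sketched in plain Python. This is a minimal illustration, not a definitive implementation: the column names and values are hypothetical, missing entries are assumed to be represented as `None`, and the 1.5 × IQR rule is just one common convention for flagging outliers.

```python
from statistics import mean, quantiles

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def drop_duplicates(rows):
    """Remove repeated rows, keeping the first occurrence and the original order."""
    return list(dict.fromkeys(rows))

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical data for illustration only.
ages = [25, None, 31, 40, None]            # column with missing entries
rows = [("a", 1), ("b", 2), ("a", 1)]      # rows as hashable tuples
incomes = [30, 32, 35, 31, 33, 34, 500]    # column with one extreme value

filled = impute_mean(ages)           # missing ages become the mean of 25, 31, 40
unique_rows = drop_duplicates(rows)  # the second ("a", 1) is dropped
outliers = iqr_outliers(incomes)     # only 500 falls outside the IQR fences
```

Whether a flagged value should be removed or adjusted (for example, capped at the fence) depends on whether it is a measurement error or a genuine extreme observation.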

Normalization/Standardization : Scale numerical features to a standard range (e.g., 0 to 1) or to have a mean of 0 and a standard deviation of 1.
Encoding Categorical Variables:
- Label Encoding : Convert categorical labels to integer values.
- One-Hot Encoding : Create binary columns for each category.
Binning: Group continuous variables into bins or intervals.
Feature Engineering: Create new features based on existing data to improve model performance.
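
Binning in particular is easy to show in a few lines. The sketch below assumes hand-picked cut points and labels (`edges`, `labels`, and the age groups are hypothetical names chosen for illustration); other strategies, such as equal-width or quantile bins, are equally valid.

```python
from bisect import bisect_right

def bin_values(values, edges, labels):
    """Assign each value the label of the interval it falls into.

    edges are the sorted interior cut points, so len(labels) == len(edges) + 1.
    A value equal to an edge goes to the bin on its right.
    """
    if len(labels) != len(edges) + 1:
        raise ValueError("need exactly one more label than edges")
    return [labels[bisect_right(edges, v)] for v in values]

# Hypothetical ages grouped into three intervals: <18, 18-64, >=65.
ages = [5, 17, 18, 42, 65, 80]
age_groups = bin_values(ages, edges=[18, 65], labels=["minor", "adult", "senior"])
```

The resulting `age_groups` column is itself a new categorical feature, so this also doubles as a small feature engineering example.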

Feature Selection : Select relevant features and discard irrelevant or redundant ones.
Dimensionality Reduction : Use techniques like Principal Component Analysis (PCA) to reduce the number of features while retaining most of the variance in the data.
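
As a rough illustration of feature selection, the sketch below drops features whose variance does not exceed a threshold, since a constant column cannot help a model distinguish samples. This is only one simple filter, not PCA, which needs an eigen-decomposition and is usually done with a numerical library. The feature names and values are hypothetical.

```python
from statistics import pvariance

def select_by_variance(columns, threshold=0.0):
    """Keep only the columns whose population variance exceeds threshold."""
    return {
        name: values
        for name, values in columns.items()
        if pvariance(values) > threshold
    }

# Hypothetical feature table stored as column name -> list of values.
features = {
    "height_cm": [160, 172, 181, 168],
    "country_code": [1, 1, 1, 1],   # constant column, carries no information
    "weight_kg": [55, 70, 82, 63],
}
selected = select_by_variance(features)  # "country_code" is discarded
```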

Split the dataset into training and testing sets to evaluate the performance of machine learning models.
Optionally, create a validation set for model tuning and selection.
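
A minimal sketch of such a split, assuming the rows are independent and can be shuffled freely (time series or grouped data would need a different strategy). The function name, the 70/10/20 ratios, and the fixed seed are illustrative choices, not recommendations.

```python
import random

def train_val_test_split(rows, val_ratio=0.1, test_ratio=0.2, seed=0):
    """Shuffle a copy of rows and cut it into train, validation and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = int(len(shuffled) * test_ratio)
    n_val = int(len(shuffled) * val_ratio)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

data = list(range(100))  # stand-in for 100 samples
train, val, test = train_val_test_split(data)
```

The three parts do not overlap, so the test set stays untouched while the validation set is used for tuning.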

Definition : Label encoding converts categorical data into numerical data by assigning each category a unique integer.
Example
- 'Male' -> 0
- 'Female' -> 1
Usage : Suitable for ordinal categorical data such as 'Low', 'Medium', 'High'.
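
A small sketch of this mapping in plain Python. The helper name `fit_label_encoder` and its `order` parameter are hypothetical; passing an explicit order is what lets the integers reflect the ranking of ordinal categories.

```python
def fit_label_encoder(values, order=None):
    """Build a category -> integer mapping.

    If order is given, integers follow that ranking (ordinal data);
    otherwise categories are numbered in sorted order.
    """
    categories = order if order is not None else sorted(set(values))
    return {category: index for index, category in enumerate(categories)}

levels = ["Medium", "Low", "High", "Low"]
mapping = fit_label_encoder(levels, order=["Low", "Medium", "High"])
encoded = [mapping[v] for v in levels]
```

Here `mapping` is `{'Low': 0, 'Medium': 1, 'High': 2}`, so the encoded integers preserve Low < Medium < High. Without an explicit order the numbering is arbitrary, and a model may read a ranking into it that does not exist.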

Definition : One-hot encoding converts categorical data into binary vectors, with one position per category.
Example : For a 'City' column with values 'New York', 'Los Angeles', 'San Francisco'.
- 'New York' -> [1,0,0]
- 'Los Angeles' -> [0,1,0]
- 'San Francisco' -> [0,0,1]
Usage : Appropriate for nominal categorical data where there is no ordinal relationship.
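
The 'City' example can be reproduced with a short sketch. The helper below is hypothetical and assumes categories are numbered in the order they first appear; libraries may order them differently (for example, alphabetically).

```python
def one_hot_encode(values, categories=None):
    """Return (categories, vectors), where each vector contains a single 1."""
    if categories is None:
        categories = list(dict.fromkeys(values))  # unique values, first-seen order
    index = {category: i for i, category in enumerate(categories)}
    vectors = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        vectors.append(row)
    return categories, vectors

cities = ["New York", "Los Angeles", "San Francisco", "New York"]
categories, vectors = one_hot_encode(cities)
```

Each vector has as many positions as there are categories, so a column with many distinct values produces many new columns.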

Transform the data so that it fits within a specific range.

Definition : Normalization (min-max scaling) transforms the data to fit within a specific range, typically 0 to 1.
Method : Scale each feature to a given range by subtracting the minimum value of the feature and then dividing by the range of the feature values
Formula
$X_{norm} = \frac{X-X_{min}}{X_{max}-X_{min}}$
Benefits
- Useful when you know the boundaries of your data and when the data does not contain outliers
- Maintains the relationship between data points
Drawbacks
- Sensitive to outliers, which can skew the results
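
The formula translates directly into code. The sketch below is a plain-Python illustration with made-up numbers; the second call shows how a single extreme value compresses everything else toward 0.

```python
def min_max_normalize(values):
    """Rescale values to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal, so the range is zero")
    return [(x - lo) / (hi - lo) for x in values]

scores = [10, 20, 30, 40, 50]
normalized = min_max_normalize(scores)

# Same data with one outlier: the maximum is now 1000.
with_outlier = min_max_normalize([10, 20, 30, 40, 1000])
```

For `scores`, the result is evenly spaced between 0 and 1. In `with_outlier`, the first four values all land below 0.05 because the range is dominated by the outlier.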

Definition : Standardization (z-score scaling) transforms the data to have a mean of 0 and a standard deviation of 1.
Method : Scale the feature by subtracting the mean and dividing by the standard deviation of the feature values
Formula

$X_{standard} = \frac{X-\mu}{\sigma}$
- $\mu$ is the mean of the feature values
- $\sigma$ is the standard deviation of the feature values
Benefits
- Less sensitive to outliers than normalization
- Useful when the data follows a Gaussian distribution
Drawbacks
- Does not bound the data within a specific range, which might be less interpretable in some cases.
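
A minimal sketch of the z-score formula, assuming $\sigma$ is the population standard deviation (dividing by $n$, not $n-1$); the sample values are made up so that the arithmetic is easy to follow.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 using (x - mu) / sigma."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        raise ValueError("all values are equal, so the standard deviation is zero")
    return [(x - mu) / sigma for x in values]

# Mean is 5 and population standard deviation is 2 for this data.
values = [2, 4, 4, 4, 5, 5, 7, 9]
z_scores = standardize(values)
```

Each z-score reads as "how many standard deviations from the mean": the value 9 becomes 2.0 and the value 2 becomes -1.5. Unlike min-max scaling, the output is not confined to a fixed interval.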

Normalization : When you need the data to be bounded within a certain range, and the algorithm does not assume any distribution of the data (e.g., k-means clustering, neural networks).
Standardization : When the algorithm assumes the data follows a Gaussian distribution or works best with zero-centered features of comparable variance (e.g., linear regression, logistic regression, SVM).
