Data Preprocessing

been_29 · August 6, 2024

💡 Data Preprocessing

Data preprocessing is a crucial step that transforms raw data into a clean, usable format before analysis or modeling.


Main Steps of Data Preprocessing

  1. Data Collection
  2. Data Cleaning
  • Handling Missing Values: Identify and address missing or null values.
    • Removal : Drop rows or columns with missing values.
    • Imputation : Fill missing values using methods like mean, median, mode, or more advanced techniques like KNN.
  • Removing Duplicates: Detect and remove duplicate entries.
  • Handling Outliers: Identify outliers that may skew the data analysis and either remove or adjust them.
  3. Data Transformation
  • Normalization/Standardization : Scale numerical features to a standard range (e.g., 0 to 1) or to have a mean of 0 and a standard deviation of 1.
  • Encoding Categorical Variables:
    • Label Encoding : Convert categorical labels to integer values.
    • One-Hot Encoding : Create binary columns for each category.
  • Binning: Group continuous variables into bins or intervals.
  • Feature Engineering: Create new features based on existing data to improve model performance.
  4. Data Integration
  • Combine data from multiple sources into a coherent dataset.
  • Ensure that the combined data maintains consistency and integrity.
  5. Data Reduction
  • Feature Selection : Select relevant features and discard irrelevant or redundant ones.
  • Dimensionality Reduction : Use techniques like Principal Component Analysis (PCA) to reduce the number of features while retaining most of the variance in the data.
  6. Data Splitting
  • Split the dataset into training and testing sets to evaluate the performance of machine learning models.
  • Optionally, create a validation set for model tuning and selection.
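
To make the cleaning and splitting steps concrete, here is a minimal sketch in plain Python (standard library only). The function names `impute_median`, `drop_duplicates`, and `split_rows`, the sample rows, and the 25% test ratio are all assumptions made up for illustration. In practice a library such as pandas or scikit-learn would usually handle these steps.

```python
import random
from statistics import median

def impute_median(rows, key):
    """Fill missing (None) values of `key` with the median of the observed values."""
    observed = [r[key] for r in rows if r[key] is not None]
    fill = median(observed)
    return [{**r, key: fill if r[key] is None else r[key]} for r in rows]

def drop_duplicates(rows):
    """Remove exact duplicate rows, keeping the first occurrence."""
    seen = set()
    unique = []
    for r in rows:
        fingerprint = tuple(sorted(r.items()))  # hashable summary of the row
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(r)
    return unique

def split_rows(rows, test_ratio=0.25, seed=0):
    """Shuffle a copy of the rows and return (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical raw data: one missing age and one exact duplicate row.
raw = [
    {"age": 25, "city": "Seoul"},
    {"age": None, "city": "Busan"},
    {"age": 40, "city": "Seoul"},
    {"age": 25, "city": "Seoul"},
    {"age": 31, "city": "Daegu"},
]

clean = drop_duplicates(impute_median(raw, "age"))
train, test = split_rows(clean, test_ratio=0.25)
print(len(clean), len(train), len(test))  # 4 3 1
```

One design note: this sketch imputes before splitting to keep the example short. When evaluating a model, the fill value is normally computed from the training set only, so that no information from the test set leaks into training.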


Label Encoding

  • Definition : The process of converting categorical data into numerical data. Each category is assigned a unique integer.
  • Example
    • 'Male' -> 0
    • 'Female' -> 1
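  • Usage : Suitable for ordinal categorical data such as 'Low', 'Medium', 'High', where the integer order can mirror the category order. For nominal data, a model may read the integers as an ordering that does not exist.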
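
As a rough illustration of label encoding, the sketch below maps categories to integers in plain Python. The helper name `label_encode` and its `order` parameter are hypothetical. They are there to show why an explicit order matters for ordinal data.

```python
def label_encode(values, order=None):
    """Map each category to an integer.

    If `order` is given, integers follow that order (useful for ordinal data);
    otherwise categories are numbered in sorted (alphabetical) order.
    """
    categories = list(order) if order is not None else sorted(set(values))
    mapping = {category: index for index, category in enumerate(categories)}
    return [mapping[v] for v in values], mapping

levels = ["Low", "High", "Medium", "Low"]
codes, mapping = label_encode(levels, order=["Low", "Medium", "High"])
print(codes)    # [0, 2, 1, 0]
print(mapping)  # {'Low': 0, 'Medium': 1, 'High': 2}
```

Without the explicit `order`, alphabetical sorting would assign 'High' -> 0, 'Low' -> 1, 'Medium' -> 2, which no longer reflects the real ranking.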


One-Hot Encoding

  • Definition : Convert categorical data into binary vectors.
  • Example : For a 'City' column with the values 'New York', 'Los Angeles', and 'San Francisco':
    • 'New York' -> [1,0,0]
    • 'Los Angeles' -> [0,1,0]
    • 'San Francisco' -> [0,0,1]
  • Usage : Appropriate for nominal categorical data where there is no ordinal relationship.
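
The 'City' example above can be reproduced with a small plain-Python sketch. The function name `one_hot_encode` is an assumption for illustration, and the column order follows the first appearance of each category.

```python
def one_hot_encode(values, categories=None):
    """Return one binary vector per value, with a single 1 at the category's position."""
    if categories is None:
        categories = list(dict.fromkeys(values))  # unique values, first-seen order
    index = {category: i for i, category in enumerate(categories)}
    vectors = []
    for v in values:
        vector = [0] * len(categories)
        vector[index[v]] = 1
        vectors.append(vector)
    return vectors, categories

cities = ["New York", "Los Angeles", "San Francisco", "New York"]
vectors, categories = one_hot_encode(cities)
for city, vector in zip(cities, vectors):
    print(city, vector)
# New York [1, 0, 0]
# Los Angeles [0, 1, 0]
# San Francisco [0, 0, 1]
# New York [1, 0, 0]
```

Each category gets its own column, so the number of columns grows with the number of distinct categories.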


Scaling

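Transform the data so that features are on a comparable scale, either by fitting them within a specific range or by giving them a common mean and standard deviation.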


Normalization

  • Definition : Transform the data to fit within a specific range, typically 0 to 1

  • Method : Scale each feature to a given range by subtracting the minimum value of the feature and then dividing by the range of the feature values

  • Formula

    $$X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}}$$
  • Benefits

    • Useful when you know the boundaries of your data and when the data does not contain outliers
    • Maintains the relationships between data points
  • Drawbacks

    • Sensitive to outliers: a single extreme value sets the minimum or maximum and compresses the remaining values into a narrow band
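
Here is a minimal sketch of min-max normalization in plain Python, following the formula above. The function name `min_max_normalize` and the sample values are assumptions for illustration. The second call shows the outlier sensitivity mentioned under Drawbacks.

```python
def min_max_normalize(values):
    """Rescale values to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature has zero range; return 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]

even = min_max_normalize([10, 20, 30, 40, 50])
print(even)  # [0.0, 0.25, 0.5, 0.75, 1.0]

# Same data, but the last value is an outlier.
skewed = min_max_normalize([10, 20, 30, 40, 1000])
print(skewed)
```

In the second case the outlier maps to 1.0 and the other four values are squeezed into roughly the bottom 3% of the range.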

Standardization

  • Definition : Transform the data to have a mean of 0 and a standard deviation of 1

  • Method : Scale the feature by subtracting the mean and dividing by the standard deviation of the feature values

  • Formula

    $$X_{standard} = \frac{X - \mu}{\sigma}$$
    • $\mu$ is the mean of the feature values
    • $\sigma$ is the standard deviation of the feature values
  • Benefits

    • Less sensitive to outliers than normalization, although the mean and standard deviation are still affected by extreme values
    • Useful when the data follows a Gaussian distribution
  • Drawbacks

    • Does not bound the data within a specific range, which can make the values less interpretable in some cases.
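
Standardization can be sketched the same way in plain Python. The function name `standardize` is hypothetical, and using the population standard deviation (`statistics.pstdev`) is an assumption, since the text does not say whether the population or the sample formula is meant.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to zero mean and unit standard deviation: (x - mu) / sigma."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumed here)
    if sigma == 0:
        # A constant feature has no spread; return 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]

z = standardize([10, 20, 30, 40, 50])
print([round(v, 3) for v in z])  # [-1.414, -0.707, 0.0, 0.707, 1.414]
```

The result is centered on 0 but is not limited to a fixed interval such as 0 to 1, which matches the drawback listed above.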

When to Use Which?

  • Normalization : When you need the data to be bounded within a certain range, and the algorithm does not assume any distribution of the data (e.g., k-means clustering, neural networks).
  • Standardization : When the features are approximately Gaussian, or the algorithm works best with zero-centered, unit-variance inputs (e.g., linear regression, logistic regression, SVM).