Data Preprocessing
A crucial step that involves transforming raw data into a clean and usable format.
Main Steps of Data Preprocessing
- Data Collection
- Data Cleaning
- Handling Missing Values: Identify and address missing or null values.
- Removal : Drop rows or columns with missing values.
- Imputation : Fill missing values using methods like mean, median, mode, or more advanced techniques like KNN.
- Removing Duplicates: Detect and remove duplicate entries.
- Handling Outliers: Identify outliers that may skew the data analysis and either remove or adjust them.
- Data Transformation
- Normalization/Standardization : Scale numerical features to a standard range (e.g., 0 to 1) or to have a mean of 0 and a standard deviation of 1.
- Encoding Categorical Variables:
- Label Encoding : Convert categorical labels to integer values.
- One-Hot Encoding : Create binary columns for each category.
- Binning: Group continuous variables into bins or intervals.
- Feature Engineering: Create new features based on existing data to improve model performance.
- Data Integration
- Combine data from multiple sources into a coherent dataset.
- Ensure that the combined data maintains consistency and integrity.
- Data Reduction
- Feature Selection : Select relevant features and discard irrelevant or redundant ones.
- Dimensionality Reduction : Use techniques like Principal Component Analysis (PCA) to reduce the number of features while retaining most of the variance in the data.
- Data Splitting
- Split the dataset into training and testing sets to evaluate the performance of machine learning models.
- Optionally, create a validation set for model tuning and selection.
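The cleaning and splitting steps above can be sketched with the Python standard library alone. This is a minimal illustration, not a prescribed pipeline: the sample `rows`, the helper names `drop_duplicates`, `impute_mean`, and `train_test_split`, and the 80/20 split ratio are all assumptions made for the example.

```python
import random
from statistics import mean

# Hypothetical raw data: (age, city) records with one missing age and one duplicate row.
rows = [
    (25, "New York"),
    (None, "Los Angeles"),
    (35, "San Francisco"),
    (25, "New York"),  # exact duplicate of the first row
    (40, "Los Angeles"),
]


def drop_duplicates(records):
    """Remove exact duplicate rows, keeping the first occurrence."""
    seen = set()
    unique = []
    for record in records:
        if record not in seen:
            seen.add(record)
            unique.append(record)
    return unique


def impute_mean(records):
    """Fill missing ages (None) with the mean of the observed ages."""
    observed = [age for age, _ in records if age is not None]
    fill = mean(observed)
    return [(fill if age is None else age, city) for age, city in records]


def train_test_split(records, test_ratio=0.2, seed=0):
    """Shuffle a copy of the records and split it into (train, test) lists."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


# Deduplicate first so the repeated row does not bias the imputed mean.
cleaned = impute_mean(drop_duplicates(rows))
train, test = train_test_split(cleaned)
print(len(cleaned), len(train), len(test))
```

Splitting after cleaning keeps the example short. When imputation statistics are learned from the data, computing them on the training set only avoids leaking information from the test set.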
Label Encoding
- Definition : The process of converting categorical data into numerical data. Each category is assigned a unique integer.
- Example
- 'Male' -> 0
- 'Female' -> 1
- Usage : Suitable for ordinal categorical data such as 'Low', 'Medium', 'High'.
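A minimal sketch of label encoding for the ordinal example above. The `ORDER` list, the `label_encode` helper, and the sample `sizes` column are assumptions made for illustration; the mapping is written by hand so that the integers preserve Low < Medium < High.

```python
# Assumed category order for an ordinal feature.
ORDER = ["Low", "Medium", "High"]

# Assign each category a unique integer that reflects its rank.
label_to_int = {label: index for index, label in enumerate(ORDER)}


def label_encode(values, mapping):
    """Replace each categorical label with its integer code."""
    return [mapping[value] for value in values]


sizes = ["Medium", "Low", "High", "Low"]
encoded_sizes = label_encode(sizes, label_to_int)
print(encoded_sizes)  # [1, 0, 2, 0]
```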
One-Hot Encoding
- Definition : Convert categorical data into binary vectors.
- Example : For a 'City' column with values 'New York', 'Los Angeles', 'San Francisco'.
- 'New York' -> [1,0,0]
- 'Los Angeles' -> [0,1,0]
- 'San Francisco' -> [0,0,1]
- Usage : Appropriate for nominal categorical data where there is no ordinal relationship.
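The 'City' example can be reproduced with a short helper. This is a sketch under assumptions: the function name `one_hot_encode` and the sample `column` are hypothetical, and the category order is fixed by the `cities` list.

```python
def one_hot_encode(values, categories):
    """Map each value to a binary vector with a single 1 at its category's index."""
    index = {category: i for i, category in enumerate(categories)}
    vectors = []
    for value in values:
        vector = [0] * len(categories)
        vector[index[value]] = 1
        vectors.append(vector)
    return vectors


cities = ["New York", "Los Angeles", "San Francisco"]
column = ["Los Angeles", "New York", "San Francisco"]
encoded_cities = one_hot_encode(column, cities)
print(encoded_cities)  # [[0, 1, 0], [1, 0, 0], [0, 0, 1]]
```

Each vector has as many entries as there are categories, so columns with many distinct values produce wide, sparse encodings.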
Scaling
Transform the data so that it fits within a specific range.
Normalization
- Definition : Transform the data to fit within a specific range, typically 0 to 1.
- Method : Scale each feature to a given range by subtracting the minimum value of the feature and then dividing by the range of the feature values.
- Formula

$$X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}}$$
- Benefits
- Useful when you know the boundaries of your data and when the data does not contain outliers
- Maintains the relationship between data points
- Drawbacks
- Sensitive to outliers, which can skew the results
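The method described above (subtract the minimum, then divide by the range) can be sketched as follows. The helper name `min_max_normalize`, the sample values, and the choice to map a constant feature to 0.0 are assumptions for illustration.

```python
def min_max_normalize(values):
    """Rescale values to the range [0, 1] using (x - min) / (max - min)."""
    x_min, x_max = min(values), max(values)
    span = x_max - x_min
    if span == 0:
        # Assumption: a constant feature is mapped to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(x - x_min) / span for x in values]


ages = [20, 30, 40, 60]
print(min_max_normalize(ages))  # [0.0, 0.25, 0.5, 1.0]

# A single extreme value compresses every other point toward 0.
ages_with_outlier = [20, 30, 40, 1000]
squeezed = min_max_normalize(ages_with_outlier)
print(squeezed)
```

The second call illustrates the drawback listed above: with 1000 in the data, the values 20, 30, and 40 all land below 0.03.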
Standardization
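As described under Data Transformation, standardization rescales a feature to have a mean of 0 and a standard deviation of 1. A minimal sketch, assuming the usual z-score form (x - mean) / std and the population standard deviation; the helper name `standardize` and the sample `scores` are hypothetical.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    if sigma == 0:
        # Assumption: a constant feature is mapped to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]


scores = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5, population standard deviation 2
z_scores = standardize(scores)
print([round(z, 2) for z in z_scores])
```

Unlike min-max normalization, the result is not bounded to a fixed range; values far from the mean simply receive large positive or negative z-scores.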
When to Use Which?
- Normalization : When you need the data to be bounded within a certain range, and the algorithm does not assume any distribution of the data (e.g., k-means clustering, neural networks).
- Standardization : When the features are roughly Gaussian, or when the algorithm is sensitive to feature scale and works best with zero-mean, unit-variance inputs (e.g., linear regression, logistic regression, SVM).