π‘ Dimensionality Reduction
Used to improve model performance or facilitate visualization by removing unnecessary or highly correlated variables while preserving important features of the data
π¨ Curse of Dimensionality
In data analysis or machine learning, the problem where data becomes sparse or difficult to analyze as the number of dimensions increases
Problems caused by the curse of dimensionality
- Data Sparsity
- As dimensions increase, data points spread out more in space, making distances between points larger -> Difficult to detect patterns or measure similarity
- Cause significant problems, especially in distance-based algorithms like k-NN and k-means
- Overfitting
- The higher the dimensionality, the more variables and patterns the model tries to learn, leading to potential overfitting to the training data
- Increased Computational Complexity
- The amount and complexity of data to compute increase, resulting in a significant rise in required computations -> Longer processing times and much higher memory usage
- Increased Sample Requirements
- As dimensionality increases, more data samples are needed to cover the distribution in each dimension -> In practice, it is often difficult to obtain enough data in high-dimensional spaces
π¨ Feature Selection
The process of selecting the most important existing features from the data while removing irrelevant or redundant ones
Filter Methods
- A method that evaluates each feature independently by considering only the statistical characteristics of the feature
- Independent of model training, calculates the statistical importance of each feature and removes unimportant features
- Types
- Variance Threshold: Remove features with very low variance (i.e., features that barely change)
- Correlation: Select features with a high correlation to the target variable by calculating the correlation between each feature and the target variable
- Chi-Squared Test: Evaluate the independence of categorical features from the target feature to select important features
- Observed Frequencies: Frequencies recorded based on how often values corresponding to each category combination are observed in the actual data
- Expected Frequencies: The expected frequency for each category assuming the two variables are independent
Eijβ=GrandΒ Total(RowΒ Total)Γ(ColumnΒ Total)β
- Chi-Squared Statistic Calculation: Calculate the chi-squared statistic based on the difference between the observed and expected frequencies
Ο2=βEijβ(OijββEijβ)2β
- Degrees of Freedom: Used to assess the significance of the chi-squared statistic or to find the p-value after calculating the chi-squared statistic
(NumberΒ ofΒ Rowsβ1)Γ(NumberΒ ofΒ Columnsβ1)
- p-value: The probability of obtaining the observed result or a more extreme result under the assumption that the null hypothesis is true
- Example) Assuming a chi-squared statistic of 5.99 with 2 degrees of freedom, the p-value is 0.05 -> When compared with the significance level Ξ±=0.05, the p-value of 0.05 indicates the result is at the threshold for rejecting the null hypothesis -> In this case, the null hypothesis can be rejected, indicating a significant relationship between the two variables
Wrapper Methods
- A method that evaluates various feature combinations using a model and selects the optimal feature set
- Features are evaluated based on model performance
- Types
- Forward Selection: Start with no features selected and adds features one by one, selecting the feature that improves model performance the most
- Backward Elimination: Start with all features included and removes the least important features one by one, based on their impact on model performance
- Recursive Feature Elimination: Evaluate the importance of each feature and iteratively removes the least important features to find the optimal feature set
Embedded Methods
- Feature selection is performed simultaneously during model training
- Combines the advantages of both filter and wrapper methods by allowing the model to inherently perform feature selection
- Types
- Lasso Regression: Uses regularization techniques to adjust feature weights while setting the weights of unnecessary features to zero -> automatically eliminates unimportant features
- Ridge Regression: Does not set weights to zero, but reduces the weights of less important features to guide feature selection
- Elastic Net: A combination of L1 and L2 regularization techniques
- Tree-based model: Tree-based algorithms automatically evaluate feature importance and reduce the impact of unimportant features during the learning process
The process of transforming existing features into new features, often reducing dimensionality while preserving important information
PCA (Principal Component Analysis)
LDA (Linear Discriminant Analysis)
-
Similar to PCA, but reduce dimensionality by maximizing the variance between classes and minimizing the variance within classes in classification problems
-
Primarily used for classification tasks, unlike PCA, it considers class labels when reducing dimensions
-
How it works
- Calculate the mean of each class: Compute the center point of each class
- Calculate within-class and between-class variance: Calculate the variance within each class and the variance between classes
- Find the optimal axis: Find the axis that maximizes between-class variance and minimizes within-class variance to project the data
-
LDA Objective
J(W)=β£WTSWβWβ£β£WTSBβWβ£β
- Where SBβ is the between-class scatter matrix and SWβ is the within-class scatter matrix
-
Characteristics
- Provide clear class separation in classification problems
- Since it considers class labels, it can offer better classification performance than PCA
- Effective only when the data can be linearly separated
- Less effective when the between-class and within-class variances are similar
t-SNE (t-Distributed Stochastic Neighbor Embedding)
-
A nonlinear dimensionality reduction technique
-
Often used for visualizing high-dimensional data in lower-dimensional space while preserving the distances between data points
-
Particularly useful for visualizing data cluster structures
-
How it works
- Probabilistically computes the nearest neighbors in the high-dimensional space and looks for a similar probability distribution in the lower-dimensional space
- Transform the data by minimizing the difference between the two probability distributions, ensuring that data points that are close in high-dimensional space remain close in the lower-dimensional space
-
Characteristics
- Computationally expensive and the results may not always be interpretable
Autoencoder
- A neural network-based nonlinear dimensionality reduction technique
- Compress the input data to learn a lower-dimensional representation, then reconstruct it to extract important information
- Structure
- Encoder: Transform the input data into a lower-dimensional space -> extract only the important features from the input
- Decoder: Restore the lower-dimensional data generated by the encoder back to its original dimensions
- Characteristics
- Minimize information loss during the reconstruction process, automatically learning the important features of the data
- Proper hyperparameter tuning is crucial
ICA (Independent Component Analysis)
-
A method for extracting statistically independent signals from data
-
Commonly used in signal processing, speech separation, and image processing
-
How it works
- Assume that the data is composed of a combination of independent components
- Identify statistically independent components from the data -> often used in signal processing to separate individual signals
-
Characteristics
- Effective only if the components of the data are statistically independent
SVD (Singular Value Decomposition)
-
A matrix decomposition technique that allows high-dimensional data to be approximated in a lower-dimensional space
-
Definition: Decompose an arbitrary mΓn matrix A into the product of the following three matrices
A=UΞ£VT
- A: Original data matrix
- U: mΓm orthogonal matrix (left singular vectors of the data)
- Ξ£: mΓn diagonal matrix (contains singular values)
- VT: nΓn orthogonal matrix (right singular vectors of the data)
-
How it works
- Decompose data based on singular values, retaining the most important information (large singular values) while discarding unnecessary information (small singular values) to reduce dimensionality
- Compress the data by preserving essential information
LSA (Latent Semantic Analysis)