Dimensionality Reduction

been_29·2024년 9월 12일

한국경제신문 with Toss bank MLOps 과정

목록 보기

16/26

💡 Dimensionality Reduction

Used to improve model performance or facilitate visualization by removing unnecessary or highly correlated variables while preserving important features of the data

🎨 Curse of Dimensionality

In data analysis or machine learning, the problem where data becomes sparse or difficult to analyze as the number of dimensions increases

Problems caused by the curse of dimensionality

Data Sparsity
- As dimensions increase, data points spread out more in space, making distances between points larger -> Difficult to detect patterns or measure similarity
- Cause significant problems, especially in distance-based algorithms like k-NN and k-means
Overfitting
- The higher the dimensionality, the more variables and patterns the model tries to learn, leading to potential overfitting to the training data
Increased Computational Complexity
- The amount and complexity of data to compute increase, resulting in a significant rise in required computations -> Longer processing times and much higher memory usage
Increased Sample Requirements
- As dimensionality increases, more data samples are needed to cover the distribution in each dimension -> In practice, it is often difficult to obtain enough data in high-dimensional spaces

🎨 Feature Selection

The process of selecting the most important existing features from the data while removing irrelevant or redundant ones

Filter Methods

A method that evaluates each feature independently by considering only the statistical characteristics of the feature
Independent of model training, calculates the statistical importance of each feature and removes unimportant features
Types
- Variance Threshold: Remove features with very low variance (i.e., features that barely change)
- Correlation: Select features with a high correlation to the target variable by calculating the correlation between each feature and the target variable
- Chi-Squared Test: Evaluate the independence of categorical features from the target feature to select important features
  - Observed Frequencies: Frequencies recorded based on how often values corresponding to each category combination are observed in the actual data
  - Expected Frequencies: The expected frequency for each category assuming the two variables are independent
    $E_{ij} = \frac{(Row\ Total) \times (Column\ Total)}{Grand\ Total}$
  - Chi-Squared Statistic Calculation: Calculate the chi-squared statistic based on the difference between the observed and expected frequencies
    $\chi^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$
  - Degrees of Freedom: Used to assess the significance of the chi-squared statistic or to find the p-value after calculating the chi-squared statistic
    $(Number\ of\ Rows - 1) \times (Number\ of\ Columns - 1)$
  - p-value: The probability of obtaining the observed result or a more extreme result under the assumption that the null hypothesis is true
  - Example) Assuming a chi-squared statistic of 5.99 with 2 degrees of freedom, the p-value is 0.05 -> When compared with the significance level $\alpha=0.05$ , the p-value of 0.05 indicates the result is at the threshold for rejecting the null hypothesis -> In this case, the null hypothesis can be rejected, indicating a significant relationship between the two variables

Wrapper Methods

A method that evaluates various feature combinations using a model and selects the optimal feature set
Features are evaluated based on model performance
Types
- Forward Selection: Start with no features selected and adds features one by one, selecting the feature that improves model performance the most
- Backward Elimination: Start with all features included and removes the least important features one by one, based on their impact on model performance
- Recursive Feature Elimination: Evaluate the importance of each feature and iteratively removes the least important features to find the optimal feature set

Embedded Methods

Feature selection is performed simultaneously during model training
Combines the advantages of both filter and wrapper methods by allowing the model to inherently perform feature selection
Types
- Lasso Regression: Uses regularization techniques to adjust feature weights while setting the weights of unnecessary features to zero -> automatically eliminates unimportant features
- Ridge Regression: Does not set weights to zero, but reduces the weights of less important features to guide feature selection
- Elastic Net: A combination of L1 and L2 regularization techniques
- Tree-based model: Tree-based algorithms automatically evaluate feature importance and reduce the impact of unimportant features during the learning process

🎨 Feature Extraction

The process of transforming existing features into new features, often reducing dimensionality while preserving important information

PCA (Principal Component Analysis)

A method that creates new axes (principal components) in the direction that maximizes the variance of high-dimensional data
How it works
1. Data Standardization: Before applying PCA, standardize the data to align the scale of each feature -> adjust the mean of each feature to 0 and variance to 1
2. Covariance Matrix Calculation: Calculate the covariance matrix based on the standardized data -> the covariance matrix represents the correlation between features
3. Eigen Decomposition: Decompose the covariance matrix into eigenvalues and eigenvectors -> eigenvalues represent the amount of variance explained by each principal component, and eigenvectors represent the direction of the principal components
4. Principal Component Selection: Select the principal components in order of the largest eigenvalues and transform the data -> you can reduce the dimensionality by selecting only the top few principal components
Formula

$Z = XW$
- Here, $Z$ is the transformed data (principal components), $X$ is the original data, and $W$ is the eigenvector matrix
Characteristics
- Reduce dimensionality while preserving as much of the original data's variance as possible
- Remove correlations between features, creating more interpretable features
- Based on linear transformation, so it's not suitable for nonlinear data
- Sensitive to the scale of the data, so standardization is essential

LDA (Linear Discriminant Analysis)

Similar to PCA, but reduce dimensionality by maximizing the variance between classes and minimizing the variance within classes in classification problems
Primarily used for classification tasks, unlike PCA, it considers class labels when reducing dimensions
How it works
1. Calculate the mean of each class: Compute the center point of each class
2. Calculate within-class and between-class variance: Calculate the variance within each class and the variance between classes
3. Find the optimal axis: Find the axis that maximizes between-class variance and minimizes within-class variance to project the data
LDA Objective

$J(W) = \frac{|W^T S_B W|}{|W^T S_W W|}$
- Where $S_B$ is the between-class scatter matrix and $S_W$ is the within-class scatter matrix
Characteristics
- Provide clear class separation in classification problems
- Since it considers class labels, it can offer better classification performance than PCA
- Effective only when the data can be linearly separated
- Less effective when the between-class and within-class variances are similar

t-SNE (t-Distributed Stochastic Neighbor Embedding)

A nonlinear dimensionality reduction technique
Often used for visualizing high-dimensional data in lower-dimensional space while preserving the distances between data points
Particularly useful for visualizing data cluster structures
How it works
1. Probabilistically computes the nearest neighbors in the high-dimensional space and looks for a similar probability distribution in the lower-dimensional space
2. Transform the data by minimizing the difference between the two probability distributions, ensuring that data points that are close in high-dimensional space remain close in the lower-dimensional space
Characteristics
- Computationally expensive and the results may not always be interpretable

Autoencoder

A neural network-based nonlinear dimensionality reduction technique
Compress the input data to learn a lower-dimensional representation, then reconstruct it to extract important information
Structure
- Encoder: Transform the input data into a lower-dimensional space -> extract only the important features from the input
- Decoder: Restore the lower-dimensional data generated by the encoder back to its original dimensions
Characteristics
- Minimize information loss during the reconstruction process, automatically learning the important features of the data
- Proper hyperparameter tuning is crucial

ICA (Independent Component Analysis)

A method for extracting statistically independent signals from data
Commonly used in signal processing, speech separation, and image processing
How it works
1. Assume that the data is composed of a combination of independent components
2. Identify statistically independent components from the data -> often used in signal processing to separate individual signals
Characteristics
- Effective only if the components of the data are statistically independent

SVD (Singular Value Decomposition)

A matrix decomposition technique that allows high-dimensional data to be approximated in a lower-dimensional space
Definition: Decompose an arbitrary $m \times n$ matrix $A$ into the product of the following three matrices

$A = U \Sigma V^T$
- $A$ : Original data matrix
- $U$ : $m \times m$ orthogonal matrix (left singular vectors of the data)
- $\Sigma$ : $m \times n$ diagonal matrix (contains singular values)
- $V^T$ : $n \times n$ orthogonal matrix (right singular vectors of the data)
How it works
- Decompose data based on singular values, retaining the most important information (large singular values) while discarding unnecessary information (small singular values) to reduce dimensionality
- Compress the data by preserving essential information

LSA (Latent Semantic Analysis)

A method for extracting hidden meanings (latent semantics) from text data, based on SVD
How it works
- Create Document-Term Matrix: Build a document-term matrix based on the frequency of occurrences of words in each document -> each element of this matrix represents how often a word appears in a document
- Apply SVD: Apply SVD to the document-term matrix to transform the documents and words into a lower-dimensional latent semantic space -> places related documents or words closer together in this space
  
  $A = U \Sigma V^T$
  - $A$ : Document-Term Matrix
  - $U$ : Principal component vectors of documents (how documents are represented in the latent semantic space)
  - $\Sigma$ : Singular value matrix (represents the importance of each principal component)
  - $V^T$ : Principal component vectors of words (how words are represented in the latent semantic space)
- Generate Latent Semantic Space: In the latent semantic space created by SVD, synonymous words or semantically similar documents are more easily distinguished
Characteristics
- Convert text data into vectors for use in document retrieval systems or document classification
- Based on simple linear transformations, so it may not effectively capture nonlinear relationships