Supervised Learning vs. Unsupervised Learning

been_29 · July 30, 2024

Korea Economic Daily with Toss Bank MLOps Course

💡 Supervised Learning

Provides both problems (features) and answers (labels) to the machine learning model.

Classification

Definition : Predicts discrete (categorical) output values
Types
- Binary Classification : The model predicts one of two possible classes. Example: Spam vs. Not Spam.
- Multiclass Classification : The model predicts one of more than two classes. Example: Classifying emails into categories like "work," "personal," "spam," etc.
- Multilabel Classification : Each instance can belong to multiple classes simultaneously. Example: Tagging an image with multiple objects like "cat," "dog," "tree."
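
The three types differ mainly in how a model's scores are turned into a prediction. The sketch below illustrates this with made-up scores; the function names, the 0.5 threshold, and the score values are all assumptions for illustration, not output from a real model.

```python
# Minimal sketch: how the decision rule differs between classification types.
# All scores below are invented for illustration.

def predict_binary(p_positive, threshold=0.5):
    """Binary: one probability, one threshold, two possible classes."""
    return "spam" if p_positive >= threshold else "not spam"

def predict_multiclass(scores):
    """Multiclass: exactly one class wins (the one with the highest score)."""
    return max(scores, key=scores.get)

def predict_multilabel(scores, threshold=0.5):
    """Multilabel: every class whose score clears the threshold is kept."""
    return sorted(label for label, s in scores.items() if s >= threshold)

binary_pred = predict_binary(0.83)
multiclass_pred = predict_multiclass({"work": 0.2, "personal": 0.1, "spam": 0.7})
multilabel_pred = predict_multilabel({"cat": 0.9, "dog": 0.6, "tree": 0.3})

print(binary_pred)      # spam
print(multiclass_pred)  # spam
print(multilabel_pred)  # ['cat', 'dog']
```

Note that the multiclass rule always returns exactly one label, while the multilabel rule can return several labels or none.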

Regression

Definition : Predicts continuous outputs
Types
- Simple Linear Regression : Examines the linear relationship between two variables, where $y$ is the dependent variable, $x$ is the independent variable, $\beta_0$ is the intercept, $\beta_1$ is the slope, and $\epsilon$ is the error term.
  $y = \beta_0 + \beta_1 x + \epsilon$
- Multiple Linear Regression : Extends simple linear regression to include multiple independent variables, where $x_1$, $x_2$, ..., $x_n$ are the independent variables.
  $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \epsilon$
- Polynomial Regression : Fits a polynomial equation to the data. It can model non-linear relationships.
  $y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_n x^n + \epsilon$
- Logistic Regression : Despite its name, this is used for binary classification problems, where the dependent variable is categorical. The model predicts the probability of the positive outcome:
  $P(y=1|x) = \frac{1}{1+e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n)}}$
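- Ridge and Lasso Regression : These are regularized regression techniques. They add a penalty on the size of the coefficients to the loss to reduce overfitting: Ridge penalizes the sum of squared coefficients (L2), and Lasso penalizes the sum of absolute coefficients (L1), which can shrink some coefficients exactly to zero.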
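
To make the formulas above concrete, here is a minimal sketch in plain Python. It fits the simple linear regression line with the closed-form least-squares solution, shows how an L2 (ridge) penalty shrinks the slope, and evaluates the logistic function for one input. The function names, the toy data, and the choice to penalize only the slope (not the intercept) are assumptions made for this illustration.

```python
import math

def fit_simple_linear(xs, ys, l2_penalty=0.0):
    """Least-squares fit of y = b0 + b1 * x.

    With l2_penalty > 0 the slope is shrunk toward zero (ridge penalty on
    the slope only; the intercept is left unpenalized in this sketch).
    """
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b1 = sxy / (sxx + l2_penalty)
    b0 = y_mean - b1 * x_mean
    return b0, b1

def logistic_probability(x, b0, b1):
    """P(y = 1 | x) for a single feature x."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

# Toy data generated from y = 1 + 2x with no noise.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

b0, b1 = fit_simple_linear(xs, ys)
ridge_b0, ridge_b1 = fit_simple_linear(xs, ys, l2_penalty=5.0)

print(b0, b1)              # 1.0 2.0
print(ridge_b0, ridge_b1)  # 3.5 1.0
print(logistic_probability(0.0, 0.0, 1.0))  # 0.5
```

With the penalty set to 5.0 the slope drops from 2.0 to 1.0, which is the shrinkage effect that regularization relies on to reduce overfitting.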

Challenges in Supervised Learning

Data-Related : Supervised learning requires large, high-quality datasets. Insufficient or poor-quality data can lead to models that do not perform well, and imbalanced datasets can result in biased models.
Labeling : Labeling data is often expensive and time-consuming, and incorrect or subjective labels can mislead the model, reducing its accuracy.
Model-Related : Selecting the right algorithm can be difficult due to many options. Models can overfit (learn the noise) or underfit (be too simple) the training data, leading to poor performance.
Training : Training complex models requires significant computational resources and time. Additionally, finding the optimal hyperparameters is often a trial-and-error process.
Evaluation : Choosing the appropriate metrics to evaluate model performance is crucial, and ensuring the model generalizes well to new, unseen data is essential for its success.
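
The points about imbalanced data and metric choice can be seen in a small sketch. Below, a hypothetical "model" that always predicts the majority class reaches 95% accuracy on a 5%-positive dataset while finding none of the positives. The data, the helper name `binary_metrics`, and the convention of reporting 0.0 when a denominator is zero are assumptions for illustration.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, and recall for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Assumed convention: report 0.0 when the denominator is zero.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Imbalanced toy labels: 5 positives, 95 negatives.
y_true = [1] * 5 + [0] * 95
# A degenerate "model" that always predicts the majority class.
y_pred = [0] * 100

metrics = binary_metrics(y_true, y_pred)
print(metrics)  # {'accuracy': 0.95, 'precision': 0.0, 'recall': 0.0}
```

Accuracy alone would make this model look good, which is why recall and precision are usually reported alongside it when classes are imbalanced.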

💡 Unsupervised Learning

A type of machine learning where the model is trained on unlabeled data. The primary goal is to find hidden patterns, groupings, or features in the data.

Clustering

Definition : Grouping similar data points together
Types
- K-means Clustering : Partitions data into $K$ clusters, with each data point assigned to the cluster with the nearest mean. The algorithm iteratively adjusts the cluster centroids until convergence.
- Hierarchical Clustering : Builds a tree of clusters (dendrogram) by recursively merging or splitting existing clusters. Can be agglomerative (bottom-up) or divisive (top-down).
- DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Groups together points that are closely packed together, marking points in low-density regions as outliers.
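
The K-means loop described above (assign each point to the nearest centroid, then move each centroid to the mean of its points, repeat until nothing changes) can be sketched in plain Python. This is a minimal illustration on 2-D points with caller-supplied starting centroids; the function name, the toy points, and the initial centroids are assumptions, and real implementations usually choose the starting centroids more carefully.

```python
def kmeans(points, centroids, max_iters=100):
    """Basic K-means on 2-D points with caller-supplied initial centroids."""
    centroids = [tuple(c) for c in centroids]
    clusters = []
    for _ in range(max_iters):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)

        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append((
                    sum(p[0] for p in cluster) / len(cluster),
                    sum(p[1] for p in cluster) / len(cluster),
                ))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place

        if new_centroids == centroids:  # converged: no centroid moved
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])

print(centroids)
print(clusters)
```

On this toy data the two groups are well separated, so the loop converges after the second pass with three points in each cluster.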

Association (Rule Learning)

Definition : Finding rules that describe large portions of the data
Types
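- Apriori Algorithm : Finds frequent itemsets level by level, pruning candidates with the fact that every subset of a frequent itemset must also be frequent, and then generates association rules from them.
- Eclat Algorithm : Uses a depth-first search over a vertical data layout (each item mapped to the set of transactions that contain it) and finds frequent itemsets by intersecting those sets, which can be more efficient than Apriori in some contexts.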
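
As a rough illustration of frequent-itemset mining, the sketch below follows the Apriori-style level-wise idea: count the support of single items, then build larger candidates only from itemsets that were already frequent. The function names, the toy shopping baskets, and the 0.5 support threshold are assumptions for illustration; this is not an optimized implementation.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise (Apriori-style) search for itemsets with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item in the itemset.
        return sum(1 for t in transactions if itemset <= t) / n

    items = sorted({item for t in transactions for item in t})
    current = {frozenset([i]): support(frozenset([i])) for i in items}
    current = {s: sup for s, sup in current.items() if sup >= min_support}

    result = {}
    k = 1
    while current:
        result.update(current)
        k += 1
        # Candidate generation: join frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Pruning: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c: support(c) for c in candidates}
        current = {c: sup for c, sup in current.items() if sup >= min_support}
    return result

def confidence(itemsets, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent from stored supports."""
    a = frozenset(antecedent)
    return itemsets[a | frozenset(consequent)] / itemsets[a]

baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
itemsets = frequent_itemsets(baskets, min_support=0.5)
rule_confidence = confidence(itemsets, {"butter"}, {"bread"})

print(len(itemsets))    # 5
print(rule_confidence)  # 1.0
```

Here {milk, butter} appears in only one of four baskets, so it is dropped, and the three-item candidate is pruned without ever being counted. The rule butter -> bread has confidence 1.0 because every basket with butter also contains bread.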

Dimensionality Reduction

Definition : Simplifies the data by reducing its dimensions while retaining most of the variability.
Types
- Principal Component Analysis (PCA) : Transforms data into a set of linearly uncorrelated variables called principal components, ordered by the amount of variance they explain.
- t-Distributed Stochastic Neighbor Embedding (t-SNE) : Reduces dimensions while maintaining the local structure of the data, often used for visualization of high-dimensional data.
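- Linear Discriminant Analysis (LDA) : Finds the linear combinations of features that best separate different classes in the data, often used for feature extraction and dimensionality reduction. Note that LDA needs class labels, so strictly speaking it is a supervised technique; it is listed here because of its use in dimensionality reduction.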
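
The idea behind PCA can be sketched without any libraries: center the data, compute the covariance matrix, and find the direction of largest variance. The example below uses power iteration to approximate the first principal component only. The function name, the toy data, the fixed starting vector, and the iteration count are assumptions for illustration; the sketch also assumes the data is not constant and that the starting vector is not orthogonal to the leading direction.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (direction, explained_variance_ratio) of the first principal component."""
    n = len(rows)
    d = len(rows[0])

    # Center each feature at zero mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]

    # Sample covariance matrix (d x d).
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]

    # Power iteration: repeatedly multiply by the covariance matrix and normalize.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]

    # Variance along v, divided by the total variance of the data.
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    total_variance = sum(cov[i][i] for i in range(d))
    return v, eigenvalue / total_variance

# Toy data lying close to the line y = 2x.
rows = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
direction, explained_ratio = first_principal_component(rows)

print(direction)
print(explained_ratio)
```

Because the points lie almost on a line, a single component explains nearly all of the variance, so the two features could be replaced by one with little loss.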

Challenges in Unsupervised Learning

Evaluation : Unlike supervised learning, there are no ground-truth labels to score predictions against, so evaluating unsupervised models is less straightforward. Internal measures of cluster quality exist, but validation often still requires domain expertise and subjective judgment.
Scalability : Handling large datasets can be computationally intensive, requiring efficient algorithms and infrastructure.
Interpretability : The results of unsupervised learning, such as clusters or reduced dimensions, can be difficult to interpret without domain knowledge.
Data Quality : Unsupervised learning is sensitive to the quality of the data. Noise, outliers, and missing values can significantly impact the results.
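
One commonly used way to evaluate a clustering without labels is the silhouette score, which compares how close each point is to its own cluster versus the nearest other cluster (values near 1 suggest well-separated clusters). The plain-Python sketch below is an illustration only; the function name and toy data are assumptions, and it assumes at least two clusters and Python 3.8+ for `math.dist`.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette over all points (assumes at least two clusters)."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)

    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for single-point clusters
            continue
        # a: mean distance to the other points in the same cluster
        # (the distance from p to itself is 0, so it does not affect the sum).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the points of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
good_score = silhouette_score(points, [0, 0, 0, 1, 1, 1])  # matches the visible groups
bad_score = silhouette_score(points, [0, 1, 0, 1, 0, 1])   # mixes the two groups

print(good_score)
print(bad_score)
```

The labeling that matches the two visible groups scores much higher than the mixed one. A score like this still only measures geometric separation, so whether the clusters are meaningful remains a question for domain knowledge.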

💡 Comparison of Supervised and Unsupervised Learning

| Feature | Supervised Learning | Unsupervised Learning |
| --- | --- | --- |
| Data Type | Labeled data (input-output pairs) | Unlabeled data (only inputs) |
| Goal | Predict outcomes for new data | Discover hidden patterns or structures |
| Algorithms | Regression, classification | Clustering, association, dimensionality reduction |
| Training Process | Guided by labels, iterative improvement | Self-organized based on data structure |
| Examples | Image classification, spam detection, price prediction | Customer segmentation, anomaly detection, market basket analysis |
| Applications | When labeled data is available and prediction is needed | When labeled data is not available, or the goal is to explore data |