[ML&DL] 10. Unsupervised Learning

KBC · December 14, 2024

Unsupervised Learning

  • Most of this course focuses on supervised learning methods such as regression and classification
  • In that setting we observe both a set of features $X_1,X_2,\dots,X_p$ for each object, as well as a response or outcome variable $Y$
  • The goal is then to predict $Y$ using $X_1,X_2,\dots,X_p$
  • Here we instead focus on unsupervised learning, where we observe only the features $X_1,X_2,\dots,X_p$
  • We are not interested in prediction, because we do not have an associated response variable $Y$

The Goals of Unsupervised Learning

  • The goal is to discover interesting things about the measurements: is there an informative way to visualize the data?
  • Can we discover subgroups among the variables or among the observations?
  • We discuss two methods
    • principal components analysis
      A tool used for data visualization or data pre-processing before supervised techniques are applied
    • clustering
      A broad class of methods for discovering unknown subgroups in data

Principal Components Analysis

  • PCA produces a low-dimensional representation of a dataset
  • It finds a sequence of linear combinations of the variables that have maximal variance and are mutually uncorrelated
  • Apart from producing derived variables for use in supervised learning problems, PCA also serves as a tool for data visualization
  • The first principal component of a set of features $X_1,X_2,\dots,X_p$ is the normalized linear combination of the features
    $$Z_1=\phi_{11}X_1+\phi_{21}X_2+\dots+\phi_{p1}X_p$$
    that has the largest variance. By normalized, we mean that $\sum^p_{j=1}\phi^2_{j1}=1$
  • We refer to the elements $\phi_{11},\dots,\phi_{p1}$ as the loadings of the first principal component; together, the loadings make up the principal component loading vector
    $$\phi_1=\left(\phi_{11}\;\phi_{21}\;\dots\;\phi_{p1}\right)^T$$
  • We constrain the loadings so that their sum of squares is equal to one, since otherwise setting these elements to be arbitrarily large in absolute value could result in an arbitrarily large variance

Computation of Principal Components

  • Suppose we have an $n\times p$ data set $X$
  • Since we are only interested in variance, we assume that each of the variables in $X$ has been centered to have mean zero (that is, the column means of $X$ are zero)
  • We then look for the linear combination of the sample feature values of the form
    $$z_{i1}=\phi_{11}x_{i1}+\phi_{21}x_{i2}+\cdots+\phi_{p1}x_{ip}$$
    for $i=1,\dots,n$ that has the largest sample variance, subject to the constraint that $\sum^p_{j=1}\phi^2_{j1}=1$
  • Since each of the $x_{ij}$ has mean zero, so does $z_{i1}$ (for any value of $\phi_{j1}$)
  • Hence the sample variance of the $z_{i1}$ can be written as $\frac{1}{n}\sum^n_{i=1}z^2_{i1}$
  • Plugging this expression in, the first principal component loading vector solves the optimization problem
    $$\text{maximize}_{\phi_{11}, \dots, \phi_{p1}} \frac{1}{n} \sum_{i=1}^n \left( \sum_{j=1}^p \phi_{j1} x_{ij} \right)^2 \quad\text{subject to } \sum_{j=1}^p \phi_{j1}^2 = 1$$
  • This problem can be solved via a singular-value decomposition of the matrix $X$, a standard technique in linear algebra (see the sketch below)
  • We refer to $Z_1$ as the first principal component, with realized values $z_{11},\dots,z_{n1}$
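
As a minimal sketch of this computation (my own illustration, not part of the lecture notes), the first loading vector and its scores can be obtained from NumPy's SVD; the toy data and variable names below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))      # toy data: n = 100 observations, p = 5 features
X = X - X.mean(axis=0)             # center each column so variances are about the mean

# SVD: X = U diag(s) V^T; the rows of Vt are the principal component directions
U, s, Vt = np.linalg.svd(X, full_matrices=False)

phi1 = Vt[0]                       # first loading vector phi_1 (its sign is arbitrary)
z1 = X @ phi1                      # first principal component scores z_11, ..., z_n1

print(np.sum(phi1**2))             # ~1.0: the normalization constraint is satisfied
print(z1.var())                    # the largest sample variance achievable by a unit-norm combination
```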

Geometry of PCA

  • The loading vector $\phi_1$ with elements $\phi_{11},\phi_{21},\dots,\phi_{p1}$ defines a direction in feature space along which the data vary the most
  • If we project the $n$ data points $x_1,\dots,x_n$ onto this direction, the projected values are the principal component scores $z_{11},\dots,z_{n1}$ themselves

Further principal components

  • The second principal component is the linear combination of $X_1,\dots,X_p$ that has maximal variance among all linear combinations that are uncorrelated with $Z_1$
  • The second principal component scores $z_{12},z_{22},\dots,z_{n2}$ take the form
    $$z_{i2}=\phi_{12}x_{i1}+\dots+\phi_{p2}x_{ip}$$
    where $\phi_2$ is the second principal component loading vector, with elements $\phi_{12},\dots,\phi_{p2}$
  • It turns out that constraining $Z_2$ to be uncorrelated with $Z_1$ is equivalent to constraining the direction $\phi_2$ to be orthogonal (perpendicular) to the direction $\phi_1$, and so on
  • The principal component directions $\phi_1,\phi_2,\phi_3,\dots$ are the ordered sequence of right singular vectors of the matrix $X$, and the variances of the components are $\frac{1}{n}$ times the squares of the singular values
  • There are at most $\min(n-1,p)$ principal components

    PCA should be performed after standardization
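
Continuing in the same vein (again an illustration with assumed toy data, not from the notes), the full SVD returns every loading vector at once; after standardizing the columns, the variance of the $m$th component equals the $m$th squared singular value divided by $n$.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5)) * np.array([1.0, 5.0, 0.5, 2.0, 10.0])  # features on very different scales

Xs = (X - X.mean(axis=0)) / X.std(axis=0)           # standardize: mean zero, standard deviation one
n = Xs.shape[0]

U, s, Vt = np.linalg.svd(Xs, full_matrices=False)   # rows of Vt: phi_1, phi_2, ... in order
Z = Xs @ Vt.T                                       # all principal component scores

print(np.isclose(Vt[0] @ Vt[1], 0))                 # True: phi_1 and phi_2 are orthogonal
print(np.allclose(Z.var(axis=0), s**2 / n))         # True: Var(Z_m) = s_m^2 / n
```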

PCA finds the hyperplane closest to the observations

  • The first principal component loading vector has a very special property: it defines the line in $p$-dimensional space that is closest to the $n$ observations (using average squared Euclidean distance as a measure of closeness)
  • The notion of principal components as the dimensions that are closest to the $n$ observations extends beyond just the first principal component
  • For instance, the first two principal components of a data set span the plane that is closest to the $n$ observations, in terms of average squared Euclidean distance
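
A small numerical check of this property (illustrative data and names are my own): the plane spanned by the first two loading vectors is at least as close to the observations, on average, as a randomly chosen plane through the origin.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))
X = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
Phi2 = Vt[:2].T                              # p x 2 matrix whose columns are phi_1 and phi_2
X_proj = X @ Phi2 @ Phi2.T                   # projection of each observation onto the PC plane
pc_dist = np.mean(np.sum((X - X_proj)**2, axis=1))       # average squared distance to that plane

Q, _ = np.linalg.qr(rng.normal(size=(4, 2)))             # an arbitrary 2-dimensional plane
rand_dist = np.mean(np.sum((X - X @ Q @ Q.T)**2, axis=1))

print(pc_dist <= rand_dist)                  # True: no other plane through the origin is closer on average
```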

Scaling of the variables matters

  • If the variables are in different units, scaling each to have standard deviation equal to one is recommended
  • If they are in the same units, you might or might not scale the variables

Proportion of Variance Explained

  • To understand the strength of each component, we are interested in knowing the proportion of variance explained (PVE) by each one
  • The total variance present in a data set (assuming that the variables have been centered to have mean zero) is defined as
    $$\sum^p_{j=1}\text{Var}(X_j)=\sum^p_{j=1}\frac{1}{n}\sum^n_{i=1}x_{ij}^2$$
    where $i=1,\dots,n$ indexes the observations (rows) and $j=1,\dots,p$ the variables, and the variance explained by the $m$th principal component is
    $$\text{Var}(Z_m)=\frac{1}{n}\sum^n_{i=1}z^2_{im}$$
  • It can be shown that $\sum^p_{j=1}\text{Var}(X_j)=\sum^M_{m=1}\text{Var}(Z_m)$, with $M=\min(n-1,p)$
  • Therefore, the PVE of the $m$th principal component is given by the positive quantity between $0$ and $1$
    $$\frac{\sum^n_{i=1}z^2_{im}}{\sum^p_{j=1}\sum^n_{i=1}x^2_{ij}}$$
  • The PVEs sum to one
  • We sometimes display the cumulative PVEs
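
A minimal sketch of the PVE computation (toy data assumed; not part of the original notes). scikit-learn's `PCA` exposes the same quantity as `explained_variance_ratio_`.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 6))
X = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
Z = X @ Vt.T                                  # scores of every principal component

pve = np.sum(Z**2, axis=0) / np.sum(X**2)     # PVE of each component
print(pve)                                    # each entry lies between 0 and 1
print(np.cumsum(pve))                         # cumulative PVE
print(np.isclose(pve.sum(), 1.0))             # True: the PVEs sum to one
```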

How many principal components should we use?

  • If we use principal components as a summary of our data, how many components are sufficient?
    • No simple answer to this question, as cross-validation is not available for this purpose
      • Why not?
      • When could we use cross-validation to select the number of components?
    • A scree plot can be used as a guide: we look for an elbow (see the sketch below)
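
Since the scree plot itself is not reproduced here, a sketch of how one could be drawn (assuming the `pve` array from the previous snippet):

```python
import numpy as np
import matplotlib.pyplot as plt

components = np.arange(1, len(pve) + 1)

plt.plot(components, pve, "o-", label="PVE")
plt.plot(components, np.cumsum(pve), "s--", label="Cumulative PVE")
plt.xlabel("Principal component")
plt.ylabel("Proportion of variance explained")
plt.legend()
plt.show()        # look for the 'elbow' where the PVE curve flattens out
```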

Matrix Completion via Principal Components

  • We pose instead a modified version of the approximation criterion
    $$\text{minimize}_{A\in\mathbb{R}^{n\times M},\,B\in\mathbb{R}^{p\times M}}\left\{\sum_{(i,j)\in O}\left(x_{ij}-\sum^M_{m=1}a_{im}b_{jm}\right)^2\right\}$$
    where $O$ is the set of all observed pairs of indices $(i,j)$, a subset of the possible $n\times p$ pairs
  • Once we solve this problem:
    • we can estimate a missing observation $x_{ij}$ using $\hat x_{ij}=\sum^M_{m=1}\hat a_{im}\hat b_{jm}$, where $\hat a_{im}$ and $\hat b_{jm}$ are the $(i,m)$ and $(j,m)$ elements of the solution matrices $\hat A$ and $\hat B$
    • we can (approximately) recover the $M$ principal component scores and loadings, as if the data were complete

Iterative Algorithm for Matrix Completion

  1. Initialize: create a complete data matrix $\tilde X$ by filling in the missing values using mean imputation
  2. Repeat steps (a)-(c) until the objective in (c) fails to decrease
    • (a) Solve
      $$\text{minimize}_{A\in\mathbb{R}^{n\times M},\,B\in\mathbb{R}^{p\times M}}\left\{\sum^p_{j=1}\sum^n_{i=1}\left(\tilde x_{ij}-\sum^M_{m=1}a_{im}b_{jm}\right)^2\right\}$$
      by computing the principal components of $\tilde X$
    • (b) For each missing entry $(i,j)\notin O$, set $\tilde x_{ij} \leftarrow \sum^M_{m=1}\hat a_{im}\hat b_{jm}$
    • (c) Compute the objective
      $$\sum_{(i,j)\in O}\left(x_{ij}-\sum^M_{m=1}\hat a_{im}\hat b_{jm}\right)^2$$
  3. Return the estimated missing entries $\tilde x_{ij},\;(i,j)\notin O$
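
A compact sketch of the whole iteration (my own NumPy illustration; the rank `M`, the tolerance, and the toy data are arbitrary choices, and missing entries are marked with `NaN`):

```python
import numpy as np

def complete_matrix(X, M=2, max_iter=200, tol=1e-7):
    """Fill the missing (NaN) entries of X using a rank-M principal component approximation."""
    observed = ~np.isnan(X)                                   # the set O of observed pairs (i, j)
    X_tilde = np.where(observed, X, np.nanmean(X, axis=0))    # step 1: mean imputation

    prev_obj = np.inf
    for _ in range(max_iter):
        # (a) principal components of the completed matrix via an SVD
        U, s, Vt = np.linalg.svd(X_tilde, full_matrices=False)
        approx = (U[:, :M] * s[:M]) @ Vt[:M]                  # hat A hat B^T, the rank-M fit

        # (b) overwrite only the missing entries with the fitted values
        X_tilde = np.where(observed, X_tilde, approx)

        # (c) objective: squared error over the observed entries only
        obj = np.sum((X[observed] - approx[observed]) ** 2)
        if prev_obj - obj < tol:                              # stop once the objective stops decreasing
            break
        prev_obj = obj
    return X_tilde                                            # step 3: matrix with missing entries estimated

# toy example: a noisy low-rank matrix with roughly 20% of its entries removed
rng = np.random.default_rng(4)
full = rng.normal(size=(50, 2)) @ rng.normal(size=(2, 8)) + 0.1 * rng.normal(size=(50, 8))
mask = rng.random(full.shape) < 0.2
X_missing = np.where(mask, np.nan, full)

X_hat = complete_matrix(X_missing, M=2)
print(np.mean((X_hat[mask] - full[mask]) ** 2))               # small error on the held-out entries
```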

Clustering

K-means clustering

  • Note that there is no ordering of the clusters, so that cluster coloring is arbitrary
  • Let $C_1,\dots,C_K$ denote sets containing the indices of the observations in each cluster
  • These sets satisfy two properties
    1. $C_1\cup C_2 \cup \dots\cup C_K=\{1,\dots,n\}$. In other words, each observation belongs to at least one of the $K$ clusters
    2. $C_k \cap C_{k'}=\emptyset$ for all $k\neq k'$
      In other words, the clusters are non-overlapping: no observation belongs to more than one cluster
    • For instance, if the $i$th observation is in the $k$th cluster, then $i\in C_k$

  • The idea behind $K$-means clustering is that a good clustering is one for which the within-cluster variation is as small as possible
  • The within-cluster variation for cluster $C_k$ is a measure $\text{WCV}(C_k)$ of the amount by which the observations within a cluster differ from each other
  • Hence we want to solve the problem
    $$\text{minimize}_{C_1,\dots,C_K}\left\{\sum^K_{k=1}\text{WCV}(C_k)\right\}$$
  • In words, this formula says that we want to partition the observations into $K$ clusters such that the total within-cluster variation, summed over all $K$ clusters, is as small as possible

How to define within-cluster variation

  • Typically we use Euclidean distance (see the helper sketched after this list)
    $$\text{WCV}(C_k)=\frac{1}{|C_k|}\sum_{i,i'\in C_k}\sum^p_{j=1}(x_{ij}-x_{i'j})^2$$
    where $|C_k|$ denotes the number of observations in the $k$th cluster
  • Combining this definition with the minimization problem above gives the optimization problem that defines $K$-means clustering
    $$\text{minimize}_{C_1,\dots,C_K}\left\{\sum^K_{k=1}\frac{1}{|C_k|}\sum_{i,i'\in C_k}\sum^p_{j=1}(x_{ij}-x_{i'j})^2\right\}$$
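
A direct translation of this definition into a small helper (illustrative; the function name and toy data are my own):

```python
import numpy as np

def within_cluster_variation(X, idx):
    """WCV(C_k): sum of pairwise squared Euclidean distances within the cluster, divided by |C_k|."""
    cluster = X[np.asarray(idx)]                       # the observations whose indices are in C_k
    diffs = cluster[:, None, :] - cluster[None, :, :]  # all pairwise feature differences
    return np.sum(diffs ** 2) / len(cluster)

# example: WCV of a tight cluster is much smaller than WCV of a spread-out one
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 0.1, (10, 2)), rng.normal(0, 3.0, (10, 2))])
print(within_cluster_variation(X, range(10)), within_cluster_variation(X, range(10, 20)))
```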

K-Means Clustering Algorithm

  1. Randomly assign a number, from $1$ to $K$, to each of the observations. These serve as initial cluster assignments for the observations
  2. Iterate until the cluster assignments stop changing
    2.1. For each of the $K$ clusters, compute the cluster centroid. The $k$th cluster centroid is the vector of the $p$ feature means for the observations in the $k$th cluster
    2.2. Assign each observation to the cluster whose centroid is closest (where closest is defined using Euclidean distance)
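
A from-scratch sketch of exactly these two steps (illustrative only; in practice one would usually reach for `sklearn.cluster.KMeans`, which also handles several random starts via its `n_init` parameter):

```python
import numpy as np

def k_means(X, K, seed=0, max_iter=100):
    rng = np.random.default_rng(seed)
    labels = rng.integers(K, size=len(X))               # step 1: random initial cluster labels
    for _ in range(max_iter):
        # step 2.1: each centroid is the vector of feature means for its cluster
        centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else X[rng.integers(len(X))]
            for k in range(K)                            # (re-seed a cluster from a random point if it empties)
        ])
        # step 2.2: assign every observation to the closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):           # assignments stopped changing: done
            break
        labels = new_labels
    return labels, centroids

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])   # two well-separated groups
labels, centroids = k_means(X, K=2)
print(centroids)                                         # roughly (0, 0) and (5, 5), in some order
```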

Properties of the Algorithm

  • This algorithm is guaranteed to decrease the value of the K-means objective at each step, because of the identity
    $$\frac{1}{|C_k|} \sum_{i,i' \in C_k} \sum_{j=1}^p (x_{ij} - x_{i'j})^2 = 2 \sum_{i \in C_k} \sum_{j=1}^p (x_{ij} - \bar{x}_{kj})^2,$$
    where $\bar{x}_{kj} = \frac{1}{|C_k|} \sum_{i \in C_k} x_{ij}$ is the mean for feature $j$ in cluster $C_k$: step 2.1 minimizes the right-hand side over the centroids, and step 2.2 can only reduce it further by moving each observation to its nearest centroid
  • However, it is not guaranteed to give the global minimum
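
The identity can be verified numerically with a few lines (toy data assumed); because only a local minimum is guaranteed, the algorithm is typically run from several random starts and the solution with the lowest objective is kept (this is what `n_init` does in `sklearn.cluster.KMeans`).

```python
import numpy as np

rng = np.random.default_rng(7)
cluster = rng.normal(size=(20, 3))                     # one cluster C_k of 20 observations
centroid = cluster.mean(axis=0)                        # the vector of feature means for the cluster

pairwise = cluster[:, None, :] - cluster[None, :, :]
lhs = np.sum(pairwise ** 2) / len(cluster)             # (1/|C_k|) * sum of pairwise squared distances
rhs = 2 * np.sum((cluster - centroid) ** 2)            # 2 * sum of squared deviations from the centroid

print(np.isclose(lhs, rhs))                            # True: the identity holds
```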

Hierarchical Clustering

  • K-means requires us to pre-specify the number of clusters $K$
  • We describe bottom-up or agglomerative clustering
  • The approach in words :
    • Start with each point in its own cluster
    • Identify the closest two clusters and merge them
    • Repeat
    • Ends when all points are in a single cluster
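
A sketch using SciPy's agglomerative clustering (the data, the complete-linkage choice, and the cut at two clusters are all illustrative assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

rng = np.random.default_rng(8)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(4, 1, (20, 2))])

Z = linkage(X, method="complete")          # bottom-up merges using complete linkage and Euclidean distance
dendrogram(Z)                              # the full merge tree; cutting it at a height yields clusters
plt.show()

labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree so that two clusters remain
print(labels)
```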

Linkage

  • The dissimilarity between two clusters is defined by the linkage: common choices are complete, single, average, and centroid linkage

Choice of Dissimilarity Measure

  • So far we have used Euclidean distance
  • An alternative is correlation-based distance, which considers two observations to be similar if their features are highly correlated
  • This is an unusual use of correlation, which is normally computed between variables; here it is computed between the observation profiles for each pair of observations
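
Correlation-based distance drops in the same way (a sketch with assumed toy data; `pdist` with `metric="correlation"` computes one minus the correlation between each pair of observation profiles):

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(9)
X = rng.normal(size=(30, 10))                     # 30 observations measured on 10 features

D = pdist(X, metric="correlation")                # 1 - corr(x_i, x_i') for every pair of observations
Z = linkage(D, method="average")                  # hierarchical clustering on that dissimilarity
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```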

Practical issues

  • Scaling of the variables matters
  • Should the observations of features first be standardized in some way?
  • For instance, maybe the variables should be centered to have mean zero and scaled to have standard deviation one
  • In the case of hierarchical clustering,
    • What dissimilarity measure should be used?
    • What type of linkage should be used?
  • How many clusters to choose? (in both K-means and hierarchical clustering)

    Difficult problem

  • No agreed-upon method
  • Which features should we use to drive the clustering?

All contents are written based on the GIST Machine Learning & Deep Learning lecture (Instructor: Prof. Sun-dong Kim)
