Unsupervised Learning
- Most of this course focuses on supervised learning methods such as regression and classification
- In that setting we observe both a set of features X1,X2,…,Xp for each object, as well as a response or outcome variable Y
- The goal is then to predict Y using X1,X2,…,Xp
- Here we instead focus on unsupervised learning, where we observe only the features X1,X2,…,Xp
- We are not interested in prediction, because we do not have an associated response variable Y
The Goals of Unsupervised Learning
- The goal is to discover interesting things about the measurements: is there an informative way to visualize the data?
- Can we discover subgroups among the variables or among the observations?
- We discuss two methods:
- principal components analysis: a tool used for data visualization or data pre-processing before supervised techniques are applied
- clustering: a broad class of methods for discovering unknown subgroups in data
Principal Components Analysis
- PCA produces a low-dimensional representation of a dataset
- It finds a sequence of linear combinations of the variables that have maximal variance and are mutually uncorrelated
- Apart from producing derived variables for use in supervised learning problems, PCA also serves as a tool for data visualization
- The first principal component of a set of features X1,X2,…,Xp is the normalized linear combination of the features
$$Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + \cdots + \phi_{p1} X_p$$
that has the largest variance. By normalized, we mean that $\sum_{j=1}^p \phi_{j1}^2 = 1$
- We refer to the elements $\phi_{11},\ldots,\phi_{p1}$ as the loadings of the first principal component; together, the loadings make up the principal component loading vector $\phi_1 = (\phi_{11}\ \phi_{21}\ \ldots\ \phi_{p1})^T$
- We constrain the loadings so that their sum of squares is equal to one, since otherwise setting these elements to be arbitrarily large in absolute value could result in an arbitrarily large variance

Computation of Principal Components
- Suppose we have an n×p data set X
- Since we are only interested in variance, we assume that each of the variables in X has been centered to have mean zero (that is, the column means of X are zero)
- We then look for the linear combination of the sample feature values of the form
$$z_{i1} = \phi_{11} x_{i1} + \phi_{21} x_{i2} + \cdots + \phi_{p1} x_{ip}, \quad i = 1,\ldots,n$$
that has largest sample variance, subject to the constraint that $\sum_{j=1}^p \phi_{j1}^2 = 1$
- Since each of the $x_{ij}$ has mean zero, then so does $z_{i1}$ (for any value of $\phi_{j1}$)
- Hence the sample variance of the $z_{i1}$ can be written as $\frac{1}{n}\sum_{i=1}^n z_{i1}^2$
- Plugging in (1), the first principal component loading vector solves the optimization problem
$$\underset{\phi_{11},\ldots,\phi_{p1}}{\text{maximize}}\ \frac{1}{n}\sum_{i=1}^n \Big(\sum_{j=1}^p \phi_{j1} x_{ij}\Big)^2 \quad \text{subject to} \quad \sum_{j=1}^p \phi_{j1}^2 = 1.$$
- This problem can be solved via a singular-value decomposition of the matrix X, a standard technique in linear algebra
- We refer to Z1 as the first principal component, with realized values z11,…,zn1
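As a concrete illustration, here is a minimal numpy sketch of this SVD-based computation on a toy centered data matrix; the data and variable names are purely illustrative, not from the source:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))      # toy n x p data matrix (illustrative)
X = X - X.mean(axis=0)             # center each column to have mean zero

# SVD: X = U diag(d) V^T; the rows of Vt (columns of V) are the
# principal component loading vectors (right singular vectors)
U, d, Vt = np.linalg.svd(X, full_matrices=False)

phi1 = Vt[0]                       # first loading vector phi_1
z1 = X @ phi1                      # first principal component scores z_{i1}

print(np.sum(phi1**2))             # 1.0: the normalization constraint
print(z1.var(), d[0]**2 / len(X))  # sample variance of z1 equals d_1^2 / n
```

Up to sign, the remaining rows of Vt give the further loading vectors discussed below.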
Geometry of PCA
- The loading vector ϕ1 with elements ϕ11,ϕ21,…,ϕp1 defines a direction in feature space along which the data vary the most
- If we project the n data points x1,…,xn onto this direction, the projected values are the principal component scores z11,…,zn1 themselves
Further principal components
- The second principal component is the linear combination of X1,…,Xp that has maximal variance among all linear combinations that are uncorrelated with Z1
- The second principal component scores $z_{12}, z_{22}, \ldots, z_{n2}$ take the form
$$z_{i2} = \phi_{12} x_{i1} + \cdots + \phi_{p2} x_{ip}$$
where $\phi_2$ is the second principal component loading vector, with elements $\phi_{12},\ldots,\phi_{p2}$
- It turns out that constraining Z2 to be uncorrelated with Z1 is equivalent to constraining the direction ϕ2 to be orthogonal (perpendicular) to the direction ϕ1, and so on for the later components
- The principal component directions ϕ1, ϕ2, ϕ3, … are the ordered sequence of right singular vectors of the matrix X, and the variances of the components are $\frac{1}{n}$ times the squares of the singular values
- There are at most min(n−1,p) principal components
PCA should be performed after standardization
PCA finds the hyperplane closest to the observations
- The first principal component loading vector has a very special property: it defines the line in p-dimensional space that is closest to the n observations (using average squared Euclidean distance as a measure of closeness)
- The notion of principal components as the dimensions that are closest to the n observations extends beyond just the first principal component
- For instance, the first two principal components of a data set span the plane that is closest to the n observations, in terms of average squared Euclidean distance
Scaling of the variables matters
- If the variables are in different units, scaling each to have standard deviation equal to one is recommended
- If they are in the same units, you might or might not scale the variables
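For example, a short numpy sketch of scaling each column to mean zero and standard deviation one before PCA; the feature values are made up for illustration:

```python
import numpy as np

# Two features on very different scales, e.g. an amount in dollars and a rate in percent
X = np.array([[50000.0, 2.1],
              [62000.0, 3.4],
              [58000.0, 2.9],
              [71000.0, 4.0]])

X_std = (X - X.mean(axis=0)) / X.std(axis=0)   # each column: mean 0, standard deviation 1
print(X_std.std(axis=0))                        # [1. 1.]
```

scikit-learn's StandardScaler performs the same standardization as a preprocessing step.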

Proportion of Variance Explained
- To understand the strength of each component, we are interested in knowing the proportion of variance explained (PVE) by each one
- The total variance present in a data set (assuming that the variables have been centered to have mean zero) is defined as
$$\sum_{j=1}^p \operatorname{Var}(X_j) = \sum_{j=1}^p \frac{1}{n}\sum_{i=1}^n x_{ij}^2$$
(here i = 1,…,n indexes the observations and j = 1,…,p the variables), and the variance explained by the mth principal component is
$$\operatorname{Var}(Z_m) = \frac{1}{n}\sum_{i=1}^n z_{im}^2$$
- It can be shown that $\sum_{j=1}^p \operatorname{Var}(X_j) = \sum_{m=1}^M \operatorname{Var}(Z_m)$, with $M = \min(n-1, p)$
- Therefore, the PVE of the mth principal component is given by the positive quantity between 0 and 1
$$\frac{\sum_{i=1}^n z_{im}^2}{\sum_{j=1}^p \sum_{i=1}^n x_{ij}^2}$$
- The PVEs sum to one
- We sometimes display the cumulative PVEs
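A short numpy sketch of computing the PVEs and cumulative PVEs from the singular values; the toy data is illustrative only:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
X = X - X.mean(axis=0)                 # centered n x p data

_, d, _ = np.linalg.svd(X, full_matrices=False)
var_per_component = d**2 / len(X)      # Var(Z_m) = d_m^2 / n
pve = var_per_component / var_per_component.sum()

print(pve)                             # PVE of each component
print(pve.sum())                       # 1.0: the PVEs sum to one
print(np.cumsum(pve))                  # cumulative PVEs, often plotted next to the scree plot
```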

How many principal components should we use?
- If we use principal components as a summary of our data, how many components are sufficient?
- No simple answer to this question, as cross-validation is not available for this purpose
- Why not?
- When could we use cross-validation to select the number of components?
- A scree plot of the PVEs can be used as a guide: we look for an elbow
Matrix Completion via Principal Components
- We pose instead a modified version of the approximation criterion
$$\underset{\mathbf{A}\in\mathbb{R}^{n\times M},\ \mathbf{B}\in\mathbb{R}^{p\times M}}{\text{minimize}} \left\{ \sum_{(i,j)\in\mathcal{O}} \Big(x_{ij} - \sum_{m=1}^M a_{im} b_{jm}\Big)^2 \right\}$$
where $\mathcal{O}$ is the set of all observed pairs of indices (i, j), a subset of the possible n×p pairs
- Once we solve this problem:
- we can estimate a missing observation $x_{ij}$ using $\hat{x}_{ij} = \sum_{m=1}^M \hat{a}_{im}\hat{b}_{jm}$, where $\hat{a}_{im}$ and $\hat{b}_{jm}$ are the (i, m) and (j, m) elements of the solution matrices $\hat{\mathbf{A}}$ and $\hat{\mathbf{B}}$
- we can (approximately) recover the M principal component scores and loadings, as if the data were complete
Iterative Algorithm for Matrix Completion
Initialize: create a complete data matrix $\tilde{\mathbf{X}}$ by filling in the missing values using mean imputation
Repeat steps (a)-(c) until the objective in (c) fails to decrease
- (a) Solve
$$\underset{\mathbf{A}\in\mathbb{R}^{n\times M},\ \mathbf{B}\in\mathbb{R}^{p\times M}}{\text{minimize}} \left\{ \sum_{j=1}^p \sum_{i=1}^n \Big(\tilde{x}_{ij} - \sum_{m=1}^M a_{im} b_{jm}\Big)^2 \right\}$$
by computing the principal components of $\tilde{\mathbf{X}}$
- (b) For each missing entry $(i,j)\notin\mathcal{O}$, set $\tilde{x}_{ij} \leftarrow \sum_{m=1}^M \hat{a}_{im}\hat{b}_{jm}$
- (c) Compute the objective
$$\sum_{(i,j)\in\mathcal{O}} \Big(x_{ij} - \sum_{m=1}^M \hat{a}_{im}\hat{b}_{jm}\Big)^2$$
- Return the estimated missing entries $\tilde{x}_{ij}$, $(i,j)\notin\mathcal{O}$
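The following is a rough Python sketch of this iterative procedure; the function name, the default number of components M, and the stopping logic are my own illustrative choices, not from the source:

```python
import numpy as np

def impute_by_pca(X, M=1, max_iter=50):
    """Sketch of the iterative matrix-completion algorithm above.

    X is an n x p array with np.nan marking the missing entries;
    M is the number of principal components used in the approximation.
    """
    missing = np.isnan(X)
    X_tilde = X.copy()
    col_means = np.nanmean(X, axis=0)
    X_tilde[missing] = col_means[np.where(missing)[1]]    # initialize: mean imputation

    prev_obj = np.inf
    for _ in range(max_iter):
        # (a) best rank-M approximation of X_tilde via its SVD / principal components
        U, d, Vt = np.linalg.svd(X_tilde, full_matrices=False)
        X_hat = (U[:, :M] * d[:M]) @ Vt[:M]               # entries sum_m a_im * b_jm
        # (b) overwrite only the missing entries with their fitted values
        X_tilde[missing] = X_hat[missing]
        # (c) objective: squared error over the observed entries only
        obj = np.sum((X[~missing] - X_hat[~missing]) ** 2)
        if obj >= prev_obj:                               # stop when it fails to decrease
            break
        prev_obj = obj
    return X_tilde
```

In use, something like `impute_by_pca(X_with_nans, M=2)` returns a completed matrix whose observed entries are left unchanged.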
Clustering
K-means clustering
- Note that there is no ordering of the clusters, so the cluster coloring is arbitrary
- Let C1,…,CK denote sets containing the indices of the observations in each cluster
- These sets satisfy two properties
- C1∪C2∪⋯∪CK={1,…,n}. In other words, each observation belongs to at least one of the K clusters
- $C_k \cap C_{k'} = \emptyset$ for all $k \neq k'$. In other words, the clusters are non-overlapping: no observation belongs to more than one cluster
- For instance, if the ith observation is in the kth cluster, then i∈Ck
- The idea behind K-means clustering is that a good clustering is one for which the within-cluster variation is as small as possible
- The within-cluster variation for cluster Ck is a measure WCV(Ck) of the amount by which the observations within a cluster differ from each other
- Hence we want to solve the problem
$$\underset{C_1,\ldots,C_K}{\text{minimize}} \left\{ \sum_{k=1}^K \text{WCV}(C_k) \right\}$$
- In words, this formula says that we want to partition the observations into K clusters such that the total within-cluster variation, summed over all K clusters, is as small as possible
How to define within-cluster variation
- Typically we use Euclidean distance:
$$\text{WCV}(C_k) = \frac{1}{|C_k|} \sum_{i,i'\in C_k} \sum_{j=1}^p (x_{ij} - x_{i'j})^2$$
where $|C_k|$ denotes the number of observations in the kth cluster
- Combining (2) and (3) gives the optimization problem that defines K-means clustering:
$$\underset{C_1,\ldots,C_K}{\text{minimize}} \left\{ \sum_{k=1}^K \frac{1}{|C_k|} \sum_{i,i'\in C_k} \sum_{j=1}^p (x_{ij} - x_{i'j})^2 \right\}$$
K-Means Clustering Algorithm
1. Randomly assign a number, from 1 to K, to each of the observations. These serve as initial cluster assignments for the observations
2. Iterate until the cluster assignments stop changing:
2.1. For each of the K clusters, compute the cluster centroid. The kth cluster centroid is the vector of the p feature means for the observations in the kth cluster
2.2. Assign each observation to the cluster whose centroid is closest (where closest is defined using Euclidean distance)
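A compact Python sketch of these two steps; it is purely illustrative, ignores the corner case of a cluster becoming empty, and in practice sklearn.cluster.KMeans with several random starts is the usual choice:

```python
import numpy as np

def kmeans(X, K, seed=0, max_iter=100):
    """Minimal sketch of the K-means algorithm described above."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, K, size=len(X))              # 1. random initial assignments
    for _ in range(max_iter):
        # 2.1 centroid of each cluster: the vector of feature means
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # 2.2 reassign each observation to the closest centroid (Euclidean distance)
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):             # assignments stopped changing
            break
        labels = new_labels
    return labels, centroids
```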
Properties of the Algorithm
- This algorithm is guaranteed to decrease the value of the objective (4) at each step, since
$$\frac{1}{|C_k|} \sum_{i,i'\in C_k} \sum_{j=1}^p (x_{ij} - x_{i'j})^2 = 2 \sum_{i\in C_k} \sum_{j=1}^p (x_{ij} - \bar{x}_{kj})^2,$$
where $\bar{x}_{kj} = \frac{1}{|C_k|} \sum_{i\in C_k} x_{ij}$ is the mean for feature j in cluster $C_k$
- However, it is not guaranteed to give the global minimum
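A quick numerical check of this identity on random data (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(2)
Xk = rng.normal(size=(6, 3))            # observations belonging to one cluster C_k
n_k = len(Xk)

# Left-hand side: scaled sum of squared pairwise differences within the cluster
lhs = sum(((Xk[i] - Xk[ip]) ** 2).sum() for i in range(n_k) for ip in range(n_k)) / n_k

# Right-hand side: twice the squared deviations from the cluster centroid
rhs = 2 * ((Xk - Xk.mean(axis=0)) ** 2).sum()

print(np.isclose(lhs, rhs))             # True: the two expressions match
```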

Hierarchical Clustering
K-means requires us to pre-specify the number of clusters K
- We describe bottom-up or agglomerative clustering
- The approach in words:
- Start with each point in its own cluster
- Identify the closest two clusters and merge them
- Repeat
- Ends when all points are in a single cluster
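A small scipy sketch of this bottom-up procedure on toy data; the choice of complete linkage and the two-group toy data are assumptions for illustration (linkage types are discussed next):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, size=(10, 2)),   # toy data with two loose groups
               rng.normal(5, 1, size=(10, 2))])

# Agglomerative (bottom-up) clustering with Euclidean distance and complete linkage
Z = linkage(X, method="complete", metric="euclidean")

# Cut the resulting merge tree to obtain, say, two clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
# scipy.cluster.hierarchy.dendrogram(Z) would draw the full dendrogram with matplotlib
```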

Linkage
- The dissimilarity between two clusters is defined by the linkage; common types are complete, single, average, and centroid linkage
Choice of Dissimilarity Measure
- So far we have used Euclidean distance
- An alternative is correlation-based distance, which considers two observations to be similar if their features are highly correlated
- This is an unusual use of correlation, which is normally computed between variables; here it is computed between the observation profiles for each pair of observations
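A brief scipy sketch of computing correlation-based distances between observation profiles and feeding them to hierarchical clustering; the data and the choice of average linkage are illustrative assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(4)
X = rng.normal(size=(8, 20))                 # 8 observations, 20 features

# "correlation" distance between rows: 1 - correlation of the two observation profiles
D = pdist(X, metric="correlation")           # condensed vector of pairwise distances
print(squareform(D).shape)                   # (8, 8) dissimilarity matrix

# A condensed distance vector can be passed directly to hierarchical clustering
Z = linkage(D, method="average")
```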

Practical issues
Scaling of the variables matters
- Should the observations or features first be standardized in some way?
- For instance, maybe the variables should be centered to have mean zero and scaled to have standard deviation one
- In the case of hierarchical clustering:
- What dissimilarity measure should be used?
- What type of linkage should be used?
- How many clusters to choose? (in both K-means and hierarchical clustering)
Difficult problem
- No agreed-upon method
- Which features should we use to drive the clustering?
All contents are based on the GIST Machine Learning & Deep Learning lesson (Instructor: Prof. Sun-dong Kim)