Unsupervised Learning in Python


1. Clustering for dataset exploration


Unsupervised Learning

Unsupervised learning: a class of machine learning techniques for discovering patterns in data

  • (ex) clustering customers by their purchases, compressing data using purchase patterns (dimension reduction)
  • Supervised learning vs. Unsupervised learning
    • supervised: finds patterns for a prediction task
      • (ex) classify tumors as benign/cancerous (labels)
    • unsupervised: finds patterns in data w/o labels (w/o a specific prediction task)

K-means clustering

  • finds clusters of samples
    • number of clusters must be specified
  • uses sklearn (scikit-learn)
from sklearn.cluster import KMeans
model = KMeans(n_clusters=3) #3 species of Iris in the case of Iris dataset
model.fit(samples) #samples passed in form of array
model.cluster_centers_ #centroids
  • model.fit(samples)
    • fits the model to the data by locating & remembering the regions where the different clusters occur
labels = model.predict(samples)

Cluster labels for new samples

  • new samples can be assigned to existing clusters
  • k-means remembers the mean of each cluster (the centroids)
  • new samples are assigned to the cluster whose centroid is closest
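A minimal sketch of labeling new data, assuming new_samples is a hypothetical array of unseen measurements & model is the fitted KMeans from above:
new_labels = model.predict(new_samples) #each new sample gets the label of the nearest centroid
print(new_labels)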

Scatter plots

import matplotlib.pyplot as plt
xs = samples[:,0]
ys = samples[:,2]
plt.scatter(xs, ys, c=labels) #c: color defined by labels
plt.show()
  • plt.scatter(xs, ys, c=labels, marker='D', s=50)
    • c: color defined by labels
    • marker='D': diamond as marker
    • s: size of marker

Evaluating a clustering

  • compare the clusters with the original data
  • measure quality of a clustering
  • informs choice of how many clusters to look for
import pandas as pd
df = pd.DataFrame({'labels': labels, 'species': species})
ct = pd.crosstab(df['labels'], df['species'])

Crosstab of labels and species

Measuring clustering quality using only samples & cluster labels

  • a good clustering has tight clusters
    • tight clusters: samples in each cluster bunched together (not spread out)
  • inertia: measures how spread out the clusters are
    • lower is better
    • measures distance from each sample to centroid of its cluster
    • available after fit() method, as attribute inertia_
    • decreases with increasing number of clusters
from sklearn.cluster import KMeans
model = KMeans(n_clusters=3)
model.fit(samples)
print(model.inertia_)
  • how to choose a good clustering when there is a tradeoff between inertia & number of clusters
    • low inertia & not too many clusters
    • choose the "elbow" point, where inertia begins to decrease more slowly as the number of clusters increases (see the sketch below)
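A minimal sketch of such an "elbow" plot, assuming samples is defined as above:
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

ks = range(1, 7)
inertias = []
for k in ks:
    model = KMeans(n_clusters=k) #fit one model per candidate number of clusters
    model.fit(samples)
    inertias.append(model.inertia_)

plt.plot(ks, inertias, '-o') #look for the "elbow" where the curve flattens
plt.xlabel('number of clusters, k')
plt.ylabel('inertia')
plt.xticks(ks)
plt.show()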

Transforming features for better clustering

  • in KMeans: feature variance = feature influence
    • variance of a feature corresponds to its influence on the clustering algorithm
  • to give every feature a chance, data needs to be transformed so that features have equal variance
    • StandardScaler transforms each feature to have mean 0 & variance 1
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(samples)
#output: StandardScaler(copy=True, with_mean=True, with_std=True)
samples_scaled = scaler.transform(samples)
  • StandardScaler vs. KMeans
    • StandardScaler: fit()/transform()
      • transforms data
    • KMeans: fit()/predict()
      • assigns cluster labels to samples
  • StandardScaler then KMeans
    • use sklearn pipeline to combine the steps
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
scaler = StandardScaler()
kmeans = KMeans(n_clusters=3)
from sklearn.pipeline import make_pipeline
pipeline = make_pipeline(scaler, kmeans)
pipeline.fit(samples)
labels = pipeline.predict(samples)

Normalizer()

  • StandardScaler() standardizes features by removing the mean & scaling to unit variance
  • Normalizer() rescales each sample independently of the others (see the pipeline sketch below)
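A minimal sketch of Normalizer() in a clustering pipeline, assuming movements is a hypothetical array of samples (e.g., daily stock price movements):
from sklearn.preprocessing import Normalizer
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
normalizer = Normalizer() #rescales each sample (row) independently
kmeans = KMeans(n_clusters=10)
pipeline = make_pipeline(normalizer, kmeans)
pipeline.fit(movements)
labels = pipeline.predict(movements)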

2. Visualization with hierarchical clustering and t-SNE


Visualizing hierarchies

  • t-SNE: creates 2D map of a dataset
    • conveys useful information about the proximity of samples to one another
  • hierarchical clustering

Hierarchical clustering (Agglomerative)

  1. every country begins in a separate cluster
  2. at each step, the 2 closest clusters are merged
  3. continue until all countries in a single cluster
  • divisive clustering works the other way around

The dendrogram

  • read from bottom up
  • vertical lines represent clusters
  • joining of vertical lines = merging of clusters
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
mergings = linkage(samples, method='complete')
dendrogram(mergings, labels=country_names, leaf_rotation=90, leaf_font_size=6)
plt.show()

Cluster labels in hierarchical clustering

Intermediate clusterings & heights on dendrogram

  • intermediate stage in the hierarchical clustering is specified by choosing a height on the dendrogram

Dendrograms

  • y-axis (height on dendrogram) = distance between merging clusters
    • don’t merge clusters further apart than this
  • distance b/w clusters
    • defined by the linkage method, specified via the method parameter
    • “complete” linkage: distance b/w clusters is the distance b/w the furthest points
    • “single” linkage: distance b/w clusters is the distance b/w the closest points (see the sketch below)
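A minimal sketch comparing linkage methods on the same data, assuming samples & country_names are defined as in the dendrogram example above:
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt
mergings_single = linkage(samples, method='single') #closest points define the cluster distance
dendrogram(mergings_single, labels=country_names, leaf_rotation=90, leaf_font_size=6)
plt.show()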

Extracting cluster labels

  • fcluster() function
    • 2nd argument: height
  • returns a NumPy array of cluster labels
    • cluster labels start at 1 (not 0 like scikit-learn)
from scipy.cluster.hierarchy import linkage
mergings = linkage(samples, method='complete')
from scipy.cluster.hierarchy import fcluster
labels = fcluster(mergings, 15, criterion='distance') 

Aligning cluster labels w/ country names

import pandas as pd
pairs = pd.DataFrame({'labels': labels, 'countries': country_names})
pairs.sort_values('labels')

t-SNE for 2-dimensional maps

  • t-SNE: unsupervised learning method for visualization
  • t-distributed stochastic neighbor embedding
  • maps samples from high-dimensional space into 2 or 3D space so they can be visualized
  • approximately preserves nearness of samples
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
model = TSNE(learning_rate=100)
transformed = model.fit_transform(samples)
xs = transformed[:,0]
ys = transformed[:,1]
plt.scatter(xs, ys, c=species)
plt.show()
  • fit_transform() method
    • simultaneously fits the model & transforms the data
    • has no separate fit() & transform() methods
      • can’t extend the map to include new data samples
      • must start over each time
  • t-SNE learning rate
    • choose learning rate for the data set
    • wrong choice: points bunch together
    • try values 50~200
  • axes of t-SNE plot have no meaning
    • changes every time even on same dataset

3. Decorrelating your data and dimension reduction


Visualizing the PCA transformation

Dimension reduction: finds patterns in data & uses the patterns to re-express the data in a compressed form

  • more efficient storage & computation
  • remove less-informative noise features
    • noise features cause problems for prediction tasks (e.g., classification, regression)

Principal Component Analysis (PCA)

  • fundamental dimension reduction technique
  • Steps
    1. decorrelation
    2. dimension reduction
  • decorrelation
    • rotates data samples to be aligned w/ axes

    • shifts the samples so that they have mean of zero

    • no information is lost

PCA in coding

  • PCA is a scikit-learn component
  • fit() learns the transformation from given data
    • how to shift & how to rotate the samples
    • does not actually change them
  • transform() applies the transformation that fit learned
    • can be also applied to new data
    • returns a new array of transformed samples
      • same number of rows & columns
      • columns: PCA features
from sklearn.decomposition import PCA
model = PCA()
model.fit(samples)
transformed = model.transform(samples)
  • PCA features are not correlated (unlike features of original dataset)

Pearson correlation

  • measures linear correlation of features
  • value b/w -1 & 1
  • value of 0 = no linear correlation
from scipy.stats import pearsonr
correlation, pvalue = pearsonr(width, length)

Principal components

  • principal components = directions of variance
    • directions in which samples vary the most
  • PCA aligns principal components w/ the axes
  • available as .components_ attribute of PCA object
    • numpy array w/ 1 row for each principal component
    • each row defines displacement from mean

attributes of PCA

  • .components_ : principal components
  • .mean_ : coordinates of the mean of data
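A minimal sketch of inspecting these attributes, assuming model is the fitted PCA instance from above (first_pc & mean are reused in the arrow example below):
print(model.components_) #one row per principal component
print(model.mean_) #coordinates of the mean of the data
first_pc = model.components_[0,:] #direction of greatest variance
mean = model.mean_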

Intrinsic dimension

Intrinsic dimension: number of features needed to approximate the dataset

  • informs dimension reduction b/c it tells how much a dataset can be compressed
    • the most compact representation
  • can be detected w/ PCA

PCA identifies intrinsic dimensions

  • intrinsic dimension = number of PCA features w/ significant variance

Plotting variances of PCA features

  • samples: array of samples
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
pca = PCA()
pca.fit(samples)
features = range(pca.n_components_) #enumerates PCA features

#make a bar plot of variances
plt.bar(features, pca.explained_variance_)
plt.xticks(features)
plt.ylabel('variance')
plt.xlabel('PCA feature')
plt.show()

How to draw arrow

# Plot first_pc as an arrow, starting at mean
plt.arrow(mean[0], mean[1], first_pc[0], first_pc[1], color='red', width=0.01)
  • .arrow()
    • 1st argument: x coordinate of starting point
    • 2nd argument: y coordinate of starting point
    • 3rd argument: length of the arrow along x (dx)
    • 4th argument: length of the arrow along y (dy)

Dimension reduction w/ PCA

  • dimension reduction: represents the same data using fewer features

Dimension reduction w/ PCA

  • PCA performs dimension reduction by discarding the PCA features w/ lower variance, which it assumes to be noise, & retaining the higher-variance PCA features, which it assumes to be informative
  • specify how many features to keep, e.g., PCA(n_components=2)
    • intrinsic dimension is a good choice

Code

  • samples: array of measurements (4 features)
    • aim to decrease to 2 features
  • species: list of species numbers
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(samples)
transformed = pca.transform(samples)
print(transformed.shape)
  • results
    • PCA reduced dimension to 2
    • retained the 2 PCA features w/ highest variance
    • important information preserved: species remain distinct
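A minimal sketch of visualizing this result, assuming transformed & species come from the code above:
import matplotlib.pyplot as plt
xs = transformed[:,0] #first PCA feature
ys = transformed[:,1] #second PCA feature
plt.scatter(xs, ys, c=species)
plt.show()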

Word Frequency arrays

  • rows represent documents & columns represent words
  • entries measure presence of each word in each document
  • most entries of the word frequency array are zero
    • = sparse
  • use scipy.sparse.csr_matrix
    • csr_matrix remembers only the non-zero entries
    • scikit-learn's PCA doesn't support csr_matrix; use TruncatedSVD instead (performs a similar dimension reduction)
from sklearn.decomposition import TruncatedSVD
model = TruncatedSVD(n_components=3)
model.fit(documents) #documents is csr_matrix
transformed = model.transform(documents)

How to create a tf-idf word frequency array

  • TfidfVectorizer from sklearn
  • transforms a list of documents into a word frequency array in the form of csr_matrix
  • has fit() & transform() methods
  • tf: frequency of word in document
  • idf: reduces the influence of frequent words (e.g., "the")
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
csr_mat = tfidf.fit_transform(documents) #word-frequency array in csr_matrix format
csr_mat.toarray()
words = tfidf.get_feature_names() #columns of the array correspond to words

4. Discovering interpretable features


Non-negative matrix factorization (NMF)

  • dimension reduction technique
  • NMF models are interpretable (unlike PCA)
  • all sample features must be non-negative for NMF to be applied

Interpretable parts

  • NMF achieves its interpretability by decomposing samples as sums of their parts
  • NMF expresses documents as combinations of topics (or themes)
    • expresses images as combinations of patterns

Using scikit-learn NMF

  • unlike PCA, desired number of components must always be specified
  • works with NumPy arrays & csr_matrix
from sklearn.decomposition import NMF
model = NMF(n_components=2)
model.fit(samples)
nmf_features = model.transform(samples)
model.components_ #each component has the same dimension as the original samples

NMF features

  • non-negative
  • can be used to reconstruct samples when combined with components

Reconstruction of sample

  • multiply components by feature values & add up
    • [2, 1] * [[1, 0.5, 0], [0.2, 0.1, 2.1]] → [2.2, 1.1, 2.1]
  • can be expressed as a product of matrices (see the sketch below)
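A minimal NumPy sketch of the reconstruction above:
import numpy as np
features = np.array([2, 1]) #NMF feature values of one sample
components = np.array([[1, 0.5, 0], [0.2, 0.1, 2.1]]) #NMF components
reconstruction = features.dot(components) #matrix product → [2.2, 1.1, 2.1]
print(reconstruction)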

NMF fits to non-negative data only

NMF learns interpretable parts

from sklearn.decomposition import NMF
nmf = NMF(n_components=10)
nmf.fit(articles)

NMF components

  • for documents:
    • NMF components represent topics
    • NMF features combine topics into documents
  • for images, NMF components are parts of images
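A minimal sketch of inspecting one topic, assuming nmf is fitted on the articles word-frequency array & words holds the column names from the tf-idf step:
import pandas as pd
components_df = pd.DataFrame(nmf.components_, columns=words)
component = components_df.iloc[3] #one topic (one row of components_)
print(component.nlargest()) #words w/ the largest weights in this topic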

Grayscale images: no colors, only different shades of grey

  • since there are only shades of grey, a grayscale image can be encoded by the brightness of every pixel
  • represent brightness w/ value b/w 0 & 1
    • 0 is black
  • convert to 2D array of numbers

Grayscale images as flat arrays

  • enumerate the entries
    • read-off the values row-by-row
    • from left to right, top to bottom
  • a flat array of non-negative numbers
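A minimal sketch of flattening a tiny 2x3 grayscale image (hypothetical pixel values); the reshape() example below recovers the image:
import numpy as np
image = np.array([[0.0, 1.0, 0.5], [1.0, 0.0, 1.0]]) #2x3 grid of pixel brightnesses
flat = image.flatten() #read off row-by-row → [0.  1.  0.5 1.  0.  1. ]
print(flat)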

A collection of grayscale images of the same size can be encoded as a 2D array

  • each row represents an image as a flattened array
  • each column represents a pixel
  • NMF can be used

To recover the image

  • reshape() method
    • specify the dimensions of the original image as a tuple
    • returns 2D array of pixel brightnesses
  • use pyplot to show the image
bitmap = sample.reshape((2, 3))
from matplotlib import pyplot as plt
plt.imshow(bitmap, cmap='gray', interpolation='nearest')
plt.show()
def show_as_image(sample):
    bitmap = sample.reshape((13, 8))
    plt.figure()
    plt.imshow(bitmap, cmap='gray', interpolation='nearest')
    plt.colorbar()
    plt.show()

Building recommender systems using NMF

task: recommend articles similar to article being read by customers

Strategy

  • apply NMF to the word-frequency array of articles & use the resulting NMF features
  • NMF feature values describe the topics
    • so similar documents have similar NMF feature values

Apply NMF to the word-frequency array

  • articles: word frequency array
from sklearn.decomposition import NMF
nmf = NMF(n_components=6)
nmf_features = nmf.fit_transform(articles)

Compare articles by NMF features

  • different versions of the same document have the same topic proportions, but the exact feature values may be different
    • e.g., one version may use many meaningless words, which reduces the values of the NMF features representing the topics
  • however, on a scatter plot of the NMF features, all these versions (weak & strong) lie on a single line passing through the origin

Cosine similarity: the cosine of the angle between the lines

  • higher values: greater similarity
from sklearn.preprocessing import normalize
norm_features = normalize(nmf_features)
#if the current article has index 23
current_article = norm_features[23,:]
similarities = norm_features.dot(current_article)
similarities

Dataframes and Labels

  • label similarities with article titles
import pandas as pd
norm_features = normalize(nmf_features)
df = pd.DataFrame(norm_features, index=titles)
current_article = df.loc['Dog bites man']
similarities = df.dot(current_article) #calculate cosine similarities
similarities.nlargest() #top 5 by default

MaxAbsScaler

  • transforms the data so that all users have the same influence on the model regardless of how many different artists they’ve listened to
from sklearn.preprocessing import MaxAbsScaler
scaler = MaxAbsScaler()
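A minimal sketch of how MaxAbsScaler might be combined w/ NMF & Normalizer in a recommender pipeline, assuming artists is a hypothetical sparse array of listening counts:
from sklearn.decomposition import NMF
from sklearn.preprocessing import MaxAbsScaler, Normalizer
from sklearn.pipeline import make_pipeline
scaler = MaxAbsScaler() #scales each feature by its maximum absolute value
nmf = NMF(n_components=20)
normalizer = Normalizer() #so similarities can be computed as dot products
pipeline = make_pipeline(scaler, nmf, normalizer)
norm_features = pipeline.fit_transform(artists)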