Unsupervised Learning in Python


1. Clustering for dataset exploration


Unsupervised Learning

Unsupervised learning: a class of machine learning techniques for discovering patterns in data

  • (ex) clustering customers by their purchases, compressing data using purchase patterns (dimension reduction)
  • Supervised learning vs. Unsupervised learning
    • supervised: finds patterns for a prediction task
      • (ex) classify tumors as benign/cancerous (labels)
    • unsupervised: finds patterns in data w/o labels (w/o a specific prediction task)

K-means clustering

  • finds clusters of samples
    • number of clusters must be specified
  • uses sklearn (scikit-learn)
from sklearn.cluster import KMeans
model = KMeans(n_clusters=3) #3 species of Iris in the case of Iris dataset
model.fit(samples) #samples passed in form of array
model.cluster_centers_ #centroids
  • model.fit(samples)
    • fits the model to the data by locating & remembering the regions where the different clusters occur
labels = model.predict(samples)

Cluster labels for new samples

  • new samples can be assigned to existing clusters
  • k-means remembers the mean of each cluster (the centroids)
  • new samples are assigned to the cluster whose centroid is closest
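A minimal sketch of labeling new data, assuming new_samples is a hypothetical array of unseen measurements & model is the fitted KMeans from above:
new_labels = model.predict(new_samples) #each new sample gets the label of the nearest centroid
print(new_labels)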

Scatter plots

import matplotlib.pyplot as plt
xs = samples[:,0]
ys = samples[:,2]
plt.scatter(xs, ys, c=labels) #c: color defined by labels
plt.show()
  • plt.scatter(xs, ys, c=labels, marker='D', s=50)
    • c: color defined by labels
    • marker='D': diamond as marker
    • s: size of marker

Evaluating a clustering

  • compare the clusters with the original data
  • measure quality of a clustering
  • informs choice of how many clusters to look for
import pandas as pd
df = pd.DataFrame({'labels': labels, 'species': species})
ct = pd.crosstab(df['labels'], df['species'])

Crosstab of labels and species

Measuring clustering quality using only samples & cluster labels

  • a good clustering has tight clusters
    • tight clusters: samples in each cluster bunched together (not spread out)
  • inertia: measures how spread out the clusters are
    • lower is better
    • measures distance from each sample to centroid of its cluster
    • available after fit() method, as attribute inertia_
    • decreases with increasing number of clusters
from sklearn.cluster import KMeans
model = KMeans(n_clusters=3)
model.fit(samples)
print(model.inertia_)
  • how to choose a good clustering when there is a tradeoff between inertia & number of clusters
    • low inertia & not too many clusters
    • choose the "elbow" point, where inertia begins to decrease more slowly as the number of clusters increases (see the sketch below)
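A minimal sketch of such an "elbow" plot, assuming samples is defined as above:
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

ks = range(1, 7)
inertias = []
for k in ks:
    model = KMeans(n_clusters=k) #fit one model per candidate number of clusters
    model.fit(samples)
    inertias.append(model.inertia_)

plt.plot(ks, inertias, '-o') #look for the "elbow" where the curve flattens
plt.xlabel('number of clusters, k')
plt.ylabel('inertia')
plt.xticks(ks)
plt.show()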

Transforming features for better clustering

  • in KMeans: feature variance = feature influence
    • variance of a feature corresponds to its influence on the clustering algorithm
  • to give every feature a chance, data needs to be transformed so that features have equal variance
    • StandardScaler transforms each feature to have mean 0 & variance 1
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(samples)
#output: StandardScaler(copy=True, with_mean=True, with_std=True)
samples_scaled = scaler.transform(samples)
  • StandardScaler vs. KMeans
    • StandardScaler: fit()/transform()
      • transforms data
    • KMeans: fit()/predict()
      • assigns cluster labels to samples
  • StandardScaler then KMeans
    • use sklearn pipeline to combine the steps
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
scaler = StandardScaler()
kmeans = KMeans(n_clusters=3)
from sklearn.pipeline import make_pipeline
pipeline = make_pipeline(scaler, kmeans)
pipeline.fit(samples)
labels = pipeline.predict(samples)

Normalizer()

  • StandardScaler() standardizes features by removing the mean & scaling to unit variance
  • Normalizer() rescales each sample independently of the others (see the pipeline sketch below)
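A minimal sketch of Normalizer() in a clustering pipeline, assuming movements is a hypothetical array of samples (e.g., daily stock price movements):
from sklearn.preprocessing import Normalizer
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
normalizer = Normalizer() #rescales each sample (row) independently
kmeans = KMeans(n_clusters=10)
pipeline = make_pipeline(normalizer, kmeans)
pipeline.fit(movements)
labels = pipeline.predict(movements)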

2. Visualization with hierarchical clustering and t-SNE


Visualizing hierarchies

  • t-SNE: creates 2D map of a dataset
    • conveys useful information about the proximity of samples to one another
  • hierarchical clustering

Hierarchical clustering (Agglomerative)

  1. every country begins in a separate cluster
  2. at each step, the 2 closest clusters are merged
  3. continue until all countries in a single cluster
  • divisive clustering works the other way around

The dendrogram

  • read from bottom up
  • vertical lines represent clusters
  • joining of vertical lines = merging of clusters
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
mergings = linkage(samples, method='complete')
dendrogram(mergings, labels=country_names, leaf_rotation=90, leaf_font_size=6)
plt.show()

Cluster labels in hierarchical clustering

Intermediate clusterings & heights on dendrogram

  • intermediate stage in the hierarchical clustering is specified by choosing a height on the dendrogram

Dendrograms

  • y-axis (height on dendrogram) = distance between merging clusters
    • don’t merge clusters further apart than this
  • distance b/w clusters
    • defined by the linkage method, specified via the method parameter
    • “complete” linkage: distance b/w clusters is the distance b/w the furthest points
    • “single” linkage: distance b/w clusters is the distance b/w the closest points (see the sketch below)
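A minimal sketch comparing linkage methods on the same data, assuming samples & country_names are defined as in the dendrogram example above:
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt
mergings_single = linkage(samples, method='single') #closest points define the cluster distance
dendrogram(mergings_single, labels=country_names, leaf_rotation=90, leaf_font_size=6)
plt.show()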

Extracting cluster labels

  • fcluster() function
    • 2nd argument: height
  • returns a NumPy array of cluster labels
    • cluster labels start at 1 (not 0 like scikit-learn)
from scipy.cluster.hierarchy import linkage
mergings = linkage(samples, method='complete')
from scipy.cluster.hierarchy import fcluster
labels = fcluster(mergings, 15, criterion='distance') 

Aligning cluster labels w/ country names

import pandas as pd
pairs = pd.DataFrame({'labels': labels, 'countries': country_names})
pairs.sort_values('labels')

t-SNE for 2-dimensional maps

  • t-SNE: unsupervised learning method for visualization
  • t-distributed stochastic neighbor embedding
  • maps samples from high-dimensional space into 2 or 3D space so they can be visualized
  • approximately preserves nearness of samples
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
model = TSNE(learning_rate=100)
transformed = model.fit_transform(samples)
xs = transformed[:,0]
ys = transformed[:,1]
plt.scatter(xs, ys, c=species)
plt.show()
  • fit_transform() method
    • simultaneously fits the model & transforms the data
    • has no separate fit() & transform() methods
      • can’t extend the map to include new data samples
      • must start over each time
  • t-SNE learning rate
    • choose learning rate for the data set
    • wrong choice: points bunch together
    • try values 50~200
  • axes of t-SNE plot have no meaning
    • changes every time even on same dataset

3. Decorrelating your data and dimension reduction


Visualizing the PCA transformation

Dimension reduction: finds patterns in data & uses the patterns to re-express the data in a compressed form

  • more efficient storage & computation
  • remove less-informative noise features
    • noise features cause problems for prediction tasks (e.g., classification, regression)

Principal Component Analysis (PCA)

  • fundamental dimension reduction technique
  • Steps
    1. decorrelation
    2. dimension reduction
  • decorrelation
    • rotates data samples to be aligned w/ axes

    • shifts the samples so that they have mean of zero

    • no information is lost

PCA in coding

  • PCA is a scikit-learn component
  • fit() learns the transformation from given data
    • how to shift & how to rotate the samples
    • does not actually change them
  • transform() applies the transformation that fit learned
    • can be also applied to new data
    • returns a new array of transformed samples
      • same number of rows & columns
      • columns: PCA features
from sklearn.decomposition import PCA
model = PCA()
model.fit(samples)
transformed = model.transform(samples)
  • PCA features are not correlated (unlike features of original dataset)

Pearson correlation

  • measures linear correlation of features
  • value b/w -1 & 1
  • value of 0 = no linear correlation
from scipy.stats import pearsonr
correlation, pvalue = pearsonr(width, length)

Principal components

  • principal components = directions of variance
    • directions in which samples vary the most
  • PCA aligns principal components w/ the axes
  • available as .components_ attribute of PCA object
    • numpy array w/ 1 row for each principal component
    • each row defines displacement from mean

attributes of PCA

  • .components_ : principal components
  • .mean_ : coordinates of the mean of data
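A minimal sketch of inspecting these attributes, assuming model is the fitted PCA instance from above (first_pc & mean are reused in the arrow example below):
print(model.components_) #one row per principal component
print(model.mean_) #coordinates of the mean of the data
first_pc = model.components_[0,:] #direction of greatest variance
mean = model.mean_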

Intrinsic dimension

Intrinsic dimension: number of features needed to approximate the dataset

  • informs dimension reduction b/c it tells how much a dataset can be compressed
    • the most compact representation
  • can be detected w/ PCA

PCA identifies intrinsic dimensions

  • intrinsic dimension = number of PCA features w/ significant variance

Plotting variances of PCA features

  • samples: array of samples
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
pca = PCA()
pca.fit(samples)
features = range(pca.n_components_) #enumerates PCA features

#make a bar plot of variances
plt.bar(features, pca.explained_variance_)
plt.xticks(features)
plt.ylabel('variance')
plt.xlabel('PCA feature')
plt.show()

How to draw arrow

# Plot first_pc as an arrow, starting at mean
plt.arrow(mean[0], mean[1], first_pc[0], first_pc[1], color='red', width=0.01)
  • .arrow()
    • 1st argument: x coordinate of starting point
    • 2nd argument: y coordinate of starting point
    • 3rd argument: length of the arrow along x (dx)
    • 4th argument: length of the arrow along y (dy)

Dimension reduction w/ PCA

  • dimension reduction: represents the same data using fewer features

Dimension reduction w/ PCA

  • PCA performs dimension reduction by discarding the PCA features w/ lower variance, which it assumes to be noise, & retaining the higher-variance PCA features, which it assumes to be informative
  • specify how many features to keep, e.g., PCA(n_components=2)
    • intrinsic dimension is a good choice

Code

  • samples: array of measurements (4 features)
    • aim to decrease to 2 features
  • species: list of species numbers
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(samples)
transformed = pca.transform(samples)
print(transformed.shape)
  • results
    • PCA reduced dimension to 2
    • retained the 2 PCA features w/ highest variance
    • important information preserved: species remain distinct
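A minimal sketch of visualizing this result, assuming transformed & species come from the code above:
import matplotlib.pyplot as plt
xs = transformed[:,0] #first PCA feature
ys = transformed[:,1] #second PCA feature
plt.scatter(xs, ys, c=species)
plt.show()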

Word Frequency arrays

  • rows represent documents & columns represent words
  • entries measure presence of each word in each document
  • most entries of the word frequency array are zero
    • = sparse
  • use scipy.sparse.csr_matrix
    • csr_matrix remembers only the non-zero entries
    • scikit-learn's PCA doesn't support csr_matrix; use TruncatedSVD instead (performs a similar dimension reduction)
from sklearn.decomposition import TruncatedSVD
model = TruncatedSVD(n_components=3)
model.fit(documents) #documents is csr_matrix
transformed = model.transform(documents)

How to create a tf-idf word frequency array

  • TfidfVectorizer from sklearn
  • transforms a list of documents into a word frequency array in the form of csr_matrix
  • has fit() & transform() methods
  • tf: frequency of word in document
  • idf: reduces the influence of frequent words (e.g., "the")
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
csr_mat = tfidf.fit_transform(documents) #word-frequency array in csr_matrix format
csr_mat.toarray()
words = tfidf.get_feature_names() #columns of the array correspond to words

4. Discovering interpretable features


Non-negative matrix factorization (NMF)

  • dimension reduction technique
  • NMF models are interpretable (unlike PCA)
  • all sample features must be non-negative for NMF to be applied

Interpretable parts

  • NMF achieves its interpretability by decomposing samples as sums of their parts
  • NMF expresses documents as combinations of topics (or themes)
    • expresses images as combinations of patterns

Using scikit-learn NMF

  • unlike PCA, desired number of components must always be specified
  • works with NumPy arrays & csr_matrix
from sklearn.decomposition import NMF
model = NMF(n_components=2)
model.fit(samples)
nmf_features = model.transform(samples)
model.components_ #each component has the same dimension as the original samples

NMF features

  • non-negative
  • can be used to reconstruct samples when combined with components

Reconstruction of sample

  • multiply components by feature values & add up
    • [2, 1] * [[1, 0.5, 0], [0.2, 0.1, 2.1]] → [2.2, 1.1, 2.1]
  • can be expressed as a product of matrices (see the sketch below)
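A minimal NumPy sketch of the reconstruction above:
import numpy as np
features = np.array([2, 1]) #NMF feature values of one sample
components = np.array([[1, 0.5, 0], [0.2, 0.1, 2.1]]) #NMF components
reconstruction = features.dot(components) #matrix product → [2.2, 1.1, 2.1]
print(reconstruction)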

NMF fits to non-negative data only

NMF learns interpretable parts

from sklearn.decomposition import NMF
nmf = NMF(n_components=10)
nmf.fit(articles)

NMF components

  • for documents:
    • NMF components represent topics
    • NMF features combine topics into documents
  • for images, NMF components are parts of images
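A minimal sketch of inspecting one topic, assuming nmf is fitted on the articles word-frequency array & words holds the column names from the tf-idf step:
import pandas as pd
components_df = pd.DataFrame(nmf.components_, columns=words)
component = components_df.iloc[3] #one topic (one row of components_)
print(component.nlargest()) #words w/ the largest weights in this topic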

Grayscale images: no colors, only different shades of grey

  • since there are only shades of grey, a grayscale image can be encoded by the brightness of every pixel
  • represent brightness w/ value b/w 0 & 1
    • 0 is black
  • convert to 2D array of numbers

Grayscale images as flat arrays

  • enumerate the entries
    • read-off the values row-by-row
    • from left to right, top to bottom
  • a flat array of non-negative numbers
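A minimal sketch of flattening a tiny 2x3 grayscale image (hypothetical pixel values); the reshape() example below recovers the image:
import numpy as np
image = np.array([[0.0, 1.0, 0.5], [1.0, 0.0, 1.0]]) #2x3 grid of pixel brightnesses
flat = image.flatten() #read off row-by-row → [0.  1.  0.5 1.  0.  1. ]
print(flat)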

A collection of grayscale images of the same size can be encoded as a 2D array

  • each row represents an image as a flattened array
  • each column represents a pixel
  • NMF can be used

To recover the image

  • reshape() method
    • specify the dimensions of the original image as a tuple
    • returns 2D array of pixel brightnesses
  • use pyplot to show the image
bitmap = sample.reshape((2, 3))
from matplotlib import pyplot as plt
plt.imshow(bitmap, cmap='gray', interpolation='nearest')
plt.show()
def show_as_image(sample):
    bitmap = sample.reshape((13, 8))
    plt.figure()
    plt.imshow(bitmap, cmap='gray', interpolation='nearest')
    plt.colorbar()
    plt.show()

Building recommender systems using NMF

task: recommend articles similar to article being read by customers

Strategy

  • apply NMF to the word-frequency array of articles & use the resulting NMF features
  • NMF feature values describe the topics
    • so similar documents have similar NMF feature values

Apply NMF to the word-frequency array

  • articles: word frequency array
from sklearn.decomposition import NMF
nmf = NMF(n_components=6)
nmf_features = nmf.fit_transform(articles)

Compare articles by NMF features

  • different versions of the same document have the same topic proportions, but the exact feature values may be different
    • e.g., one version may use many meaningless words, which reduces the values of the NMF features representing the topics
  • however, on a scatter plot of the NMF features, all these versions (weak & strong) lie on a single line passing through the origin

Cosine similarity: the cosine of the angle between the lines

  • higher values: greater similarity
from sklearn.preprocessing import normalize
norm_features = normalize(nmf_features)
#if the current article has index 23
current_article = norm_features[23,:]
similarities = norm_features.dot(current_article)
similarities

Dataframes and Labels

  • label similarities with article titles
import pandas as pd
norm_features = normalize(nmf_features)
df = pd.DataFrame(norm_features, index=titles)
current_article = df.loc['Dog bites man']
similarities = df.dot(current_article) #calculate cosine similarities
similarities.nlargest() #top 5 by default

MaxAbsScaler

  • transforms the data so that all users have the same influence on the model regardless of how many different artists they’ve listened to
from sklearn.preprocessing import MaxAbsScaler
scaler = MaxAbsScaler()
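A minimal sketch of how MaxAbsScaler might be combined w/ NMF & Normalizer in a recommender pipeline, assuming artists is a hypothetical sparse array of listening counts:
from sklearn.decomposition import NMF
from sklearn.preprocessing import MaxAbsScaler, Normalizer
from sklearn.pipeline import make_pipeline
scaler = MaxAbsScaler() #scales each feature by its maximum absolute value
nmf = NMF(n_components=20)
normalizer = Normalizer() #so similarities can be computed as dot products
pipeline = make_pipeline(scaler, nmf, normalizer)
norm_features = pipeline.fit_transform(artists)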