[Clustering] Multinomial Mixture Model (MMM)

안암동컴맹 · March 3, 2024

Multinomial Mixture Model (MMM)

Introduction

Multinomial Mixture Models are probabilistic models used for clustering categorical data. These models assume that the data are generated from a mixture of several multinomial distributions, each representing a cluster. The model is particularly useful in applications where objects are represented by counts or frequencies of events, such as text document clustering, where documents are represented by word counts.

Background and Theory

In contrast to Gaussian Mixture Models (GMMs) that are suited for continuous data, Multinomial Mixture Models are designed for discrete data. The key assumption is that the observed data are generated from a finite mixture of multinomial distributions, with each distribution corresponding to a different underlying process or cluster.

Mathematical Formulation

Given a dataset of $N$ items, where each item is a $D$-dimensional vector of counts, the probability of observing a particular item $x$ under a Multinomial Mixture Model is given by:

$$P(x|\theta) = \sum_{k=1}^{K} \pi_k \cdot \text{Mult}(x|p_k)$$

where:

  • $K$ is the number of clusters (multinomial distributions) in the mixture.
  • $\pi_k$ is the mixing coefficient for cluster $k$, indicating the probability that an item belongs to cluster $k$, with $\sum_{k=1}^{K} \pi_k = 1$.
  • $\text{Mult}(x|p_k)$ is the probability of item $x$ under the multinomial distribution for cluster $k$, characterized by the parameter vector $p_k = (p_{k1}, p_{k2}, \cdots, p_{kD})$, where $p_{kj}$ is the probability of the $j$-th category in the $k$-th cluster, and $\sum_{j=1}^{D} p_{kj} = 1$.
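
For concreteness, here is a minimal NumPy sketch that evaluates this mixture probability for a single count vector. The helper multinomial_pmf and all numeric values below are illustrative choices, not part of any library:

import numpy as np
from scipy.special import gammaln


def multinomial_pmf(x: np.ndarray, p: np.ndarray) -> float:
    # Mult(x|p) = n! / (x_1! ... x_D!) * prod_j p_j^{x_j}, with n = sum_j x_j
    n = x.sum()
    log_coef = gammaln(n + 1) - gammaln(x + 1).sum()
    return np.exp(log_coef + (x * np.log(p)).sum())


# Toy mixture with K = 2 clusters over D = 3 categories (arbitrary numbers)
pi = np.array([0.6, 0.4])
p = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6]])

x = np.array([5, 2, 3])  # one observed count vector
print(sum(pi[k] * multinomial_pmf(x, p[k]) for k in range(2)))  # P(x|theta)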

Procedural Steps

Initialization

Choose the number of clusters $K$. Initialize the mixing coefficients $\pi_k$ and the probability vectors $p_k$ for each cluster.
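
One common choice, sketched below with NumPy (the Dirichlet draws and the values of K and D are arbitrary assumptions here), is to sample both parameter sets from Dirichlet distributions so that each automatically sums to one:

import numpy as np

rng = np.random.default_rng(42)
K, D = 3, 5  # number of clusters and categories (illustrative)

pi = rng.dirichlet(np.ones(K))           # mixing coefficients, sum to 1
p = rng.dirichlet(np.ones(D), size=K)    # (K, D) category probabilities, each row sums to 1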

Expectation (E) Step

Calculate the posterior probabilities (responsibilities) that each item belongs to each cluster, given the current parameters. For item $i$ and cluster $k$, this is:

$$\gamma_{ik} = \frac{\pi_k \cdot \text{Mult}(x_i|p_k)}{\sum_{l=1}^{K} \pi_l \cdot \text{Mult}(x_i|p_l)}$$
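
In code, the responsibilities for all items can be computed at once. The sketch below assumes X is an (N, D) matrix of counts and pi, p are the current parameters; it works in the log domain to avoid numerical underflow for long count vectors:

import numpy as np
from scipy.special import gammaln, logsumexp


def e_step(X: np.ndarray, pi: np.ndarray, p: np.ndarray) -> np.ndarray:
    # log Mult(x_i|p_k) for every item i and cluster k -> shape (N, K)
    n = X.sum(axis=1)
    log_coef = gammaln(n + 1) - gammaln(X + 1).sum(axis=1)   # log multinomial coefficient
    log_mult = log_coef[:, None] + X @ np.log(p).T
    log_joint = np.log(pi)[None, :] + log_mult               # log pi_k + log Mult(x_i|p_k)
    # gamma_ik = exp(log_joint_ik - logsumexp_l(log_joint_il))
    return np.exp(log_joint - logsumexp(log_joint, axis=1, keepdims=True))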

Maximization (M) Step

Update the parameters $\pi_k$ and $p_k$ to maximize the expected log-likelihood of the observed data, given the current responsibilities (a code sketch of both updates follows the list below):

  • Update $\pi_k$:
    $$\pi_k = \frac{1}{N} \sum_{i=1}^{N} \gamma_{ik}$$
  • Update $p_k$: For each category $j$ in cluster $k$, update $p_{kj}$:
    $$p_{kj} = \frac{\sum_{i=1}^{N} \gamma_{ik}\, x_{ij}}{\sum_{j'=1}^{D} \sum_{i=1}^{N} \gamma_{ik}\, x_{ij'}}$$
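
Continuing the sketch from the E step above (gamma is the (N, K) responsibility matrix, X the (N, D) count matrix; the helper name m_step is just illustrative), these two updates become:

import numpy as np


def m_step(X: np.ndarray, gamma: np.ndarray) -> tuple:
    N = X.shape[0]
    pi = gamma.sum(axis=0) / N                  # pi_k = (1/N) * sum_i gamma_ik
    weighted = gamma.T @ X                      # (K, D): sum_i gamma_ik * x_ij
    p = weighted / weighted.sum(axis=1, keepdims=True)
    return pi, p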

Iteration and Convergence

Repeat the E and M steps until the change in the log-likelihood or the parameters between iterations falls below a predefined threshold.
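
Tying the steps together, a compact EM loop might track the log-likelihood and stop once its improvement drops below tol. This is a self-contained sketch under the same assumptions as above, not the implementation used by the library discussed below:

import numpy as np
from scipy.special import gammaln, logsumexp


def fit_mmm(X, K, max_iter=100, tol=1e-5, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = rng.dirichlet(np.ones(K))
    p = rng.dirichlet(np.ones(D), size=K)
    prev_ll = -np.inf
    for _ in range(max_iter):
        # E step (log domain) and current log-likelihood
        log_coef = gammaln(X.sum(axis=1) + 1) - gammaln(X + 1).sum(axis=1)
        log_joint = np.log(pi)[None, :] + log_coef[:, None] + X @ np.log(p).T
        ll = logsumexp(log_joint, axis=1).sum()
        gamma = np.exp(log_joint - logsumexp(log_joint, axis=1, keepdims=True))
        # M step
        pi = gamma.sum(axis=0) / N
        weighted = gamma.T @ X
        p = weighted / weighted.sum(axis=1, keepdims=True)
        # Convergence check on the log-likelihood improvement
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return pi, p, gamma, ll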

Implementation

Parameters

  • n_clusters: int
    Number of clusters (mixtures) to estimate
  • max_iter: int, default = 100
    Maximum number of iterations
  • tol: float, default = 0.00001
    Convergence tolerance for early stopping

Examples

Test on a synthesized 2D multinomial dataset with 2 mixtures:

from luma.clustering.mixture import MultinomialMixture
from luma.visual.evaluation import ConfusionMatrix

import matplotlib.pyplot as plt
import numpy as np


def generate_multi_dataset(n_samples: int, 
                           component_probs: list, 
                           mixture_weights: list) -> tuple:
    n_components = len(component_probs)
    dataset, labels = [], []
    for _ in range(n_samples):
        component = np.random.choice(range(n_components), p=mixture_weights)
        sample = np.random.multinomial(1, component_probs[component])
        
        dataset.append(sample)
        labels.append(component)
        
    return np.array(dataset), np.array(labels)


X, y = generate_multi_dataset(n_samples=300,
                              component_probs=[[0.2, 0.8],
                                               [0.7, 0.3]],
                              mixture_weights=[0.5, 0.5])

mmm = MultinomialMixture(n_clusters=2, max_iter=1000)
mmm.fit(X)

fig = plt.figure(figsize=(10, 5))
ax1 = fig.add_subplot(1, 2, 1)
ax2 = fig.add_subplot(1, 2, 2)

n_clusters = 2
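# Count how often the first coordinate of the one-hot samples is 0 or 1 within each true component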
bincounts = [np.bincount(X[y == i, 0]) for i in range(2)]

width = 0.2
ax1.bar(np.arange(n_clusters) - width / 2, bincounts[0], 
        width=width,
        label='Cluster 0')

ax1.bar(np.arange(n_clusters) + width / 2, bincounts[1], 
        width=width,
        label='Cluster 1')

ax1.set_xticks([0, 1])
ax1.set_ylabel('Count')
ax1.set_title('Frequency Counts')
ax1.legend()

conf = ConfusionMatrix(y_true=y, y_pred=mmm.labels)
conf.plot(ax=ax2, show=True)

Applications

  • Text Clustering: Grouping documents into topics based on word frequencies.
  • Market Basket Analysis: Clustering customer purchase patterns to identify common baskets of products.
  • Genetic Sequence Analysis: Clustering sequences into groups based on the frequency of nucleotides or amino acids.

Strengths and Limitations

Strengths

  • Suitable for Categorical Data: Specifically designed to handle discrete count data efficiently.
  • Interpretable Clusters: The parameters of the multinomial distributions can provide insights into the characteristics of each cluster.

Limitations

  • Assumption of Independence: Assumes that the features (e.g., words in text clustering) are conditionally independent given the cluster, which may not always be true.
  • Selection of Number of Clusters: Like other mixture models, determining the optimal number of clusters $K$ requires additional criteria or validation methods.

Advanced Topics

  • Model Selection: Techniques like the Bayesian Information Criterion (BIC) or cross-validation can be used to select the number of clusters; a BIC sketch follows this list.
  • Extension to Other Distributions: For different types of data, the multinomial distribution can be replaced with other appropriate distributions within the mixture model framework.
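
As a rough illustration of BIC-based selection (the helper below and its parameter count are written for this multinomial mixture setting; the maximized log-likelihood would come from an EM fit such as the one sketched earlier):

import numpy as np


def bic(log_likelihood: float, K: int, D: int, N: int) -> float:
    # Free parameters: (K - 1) mixing weights + K * (D - 1) category probabilities
    n_params = (K - 1) + K * (D - 1)
    return n_params * np.log(N) - 2.0 * log_likelihood

# Fit candidate values of K, compute their maximized log-likelihoods,
# and keep the K with the lowest BIC.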
