Multinomial Mixture Models are probabilistic models used for clustering categorical data. These models assume that the data are generated from a mixture of several multinomial distributions, each representing a cluster. The model is particularly useful in applications where objects are represented by counts or frequencies of events, such as text document clustering, where documents are represented by word counts.
In contrast to Gaussian Mixture Models (GMMs), which are suited for continuous data, Multinomial Mixture Models are designed for discrete count data. The key assumption is that the observed data are generated from a finite mixture of multinomial distributions, with each distribution corresponding to a different underlying process or cluster.
Given a dataset of $N$ items, where each item $\mathbf{x}_i = (x_{i1}, \dots, x_{id})$ is a $d$-dimensional vector of counts, the probability of observing a particular item under a Multinomial Mixture Model is given by:

$$P(\mathbf{x}_i) = \sum_{k=1}^{K} \pi_k \, P(\mathbf{x}_i \mid \boldsymbol{\theta}_k)$$

where:

- $K$ is the number of clusters (mixture components),
- $\pi_k$ is the mixing coefficient of cluster $k$, with $\pi_k \ge 0$ and $\sum_{k=1}^{K} \pi_k = 1$,
- $\boldsymbol{\theta}_k = (\theta_{k1}, \dots, \theta_{kd})$ is the probability vector of cluster $k$, with $\sum_{j=1}^{d} \theta_{kj} = 1$,
- $P(\mathbf{x}_i \mid \boldsymbol{\theta}_k) = \dfrac{\left(\sum_{j} x_{ij}\right)!}{\prod_{j} x_{ij}!} \prod_{j=1}^{d} \theta_{kj}^{x_{ij}}$ is the multinomial likelihood of item $i$ under cluster $k$.
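As a quick numeric illustration of the formula above, the snippet below evaluates $P(\mathbf{x}_i)$ for a single count vector with plain NumPy. The parameter values and names (`pi`, `theta`, `x`) are made up for this example and are not tied to luma's implementation.

```python
import numpy as np
from math import factorial

# Hypothetical parameters: K = 2 clusters over d = 3 categories
pi = np.array([0.6, 0.4])                # mixing coefficients, sum to 1
theta = np.array([[0.7, 0.2, 0.1],       # probability vector of cluster 0
                  [0.1, 0.3, 0.6]])      # probability vector of cluster 1

x = np.array([3, 1, 1])                  # one item: counts over the 3 categories

# Multinomial coefficient (sum_j x_j)! / prod_j x_j!
coef = factorial(x.sum()) / np.prod([factorial(c) for c in x])

# P(x | theta_k) for each cluster, then P(x) = sum_k pi_k * P(x | theta_k)
likelihoods = coef * np.prod(theta ** x, axis=1)
p_x = float(pi @ likelihoods)
print(p_x)   # mixture probability of observing x
```

With these numbers, most of the probability mass for `x` comes from cluster 0; normalizing these per-cluster terms is exactly what the E-step below does to obtain responsibilities.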
The parameters are typically estimated with the Expectation-Maximization (EM) algorithm:

1. Initialization: Choose the number of clusters $K$. Initialize the mixing coefficients $\pi_k$ and the probability vectors $\boldsymbol{\theta}_k$ for each cluster.
2. E-step: Calculate the posterior probabilities (responsibilities) that each item belongs to each cluster, given the current parameters. For item $i$ and cluster $k$, this is:

   $$\gamma_{ik} = \frac{\pi_k \, P(\mathbf{x}_i \mid \boldsymbol{\theta}_k)}{\sum_{l=1}^{K} \pi_l \, P(\mathbf{x}_i \mid \boldsymbol{\theta}_l)}$$

3. M-step: Update the parameters $\pi_k$ and $\boldsymbol{\theta}_k$ to maximize the expected log-likelihood of the observed data, given the current responsibilities:

   $$\pi_k = \frac{1}{N} \sum_{i=1}^{N} \gamma_{ik}, \qquad \theta_{kj} = \frac{\sum_{i=1}^{N} \gamma_{ik} \, x_{ij}}{\sum_{i=1}^{N} \gamma_{ik} \sum_{j'=1}^{d} x_{ij'}}$$

4. Convergence check: Repeat the E and M steps until the change in the log-likelihood or the parameters between iterations falls below a predefined threshold (see the sketch after this list).
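Below is a minimal NumPy sketch of these EM updates, written as an illustration rather than a description of luma's internal code; the function name `multinomial_mixture_em`, the Dirichlet initialization, and the log-likelihood convergence test are assumptions made for this example.

```python
import numpy as np

def multinomial_mixture_em(X, n_clusters, max_iter=100, tol=1e-5, seed=0):
    """Toy EM for a multinomial mixture; X is an (N, d) array of counts."""
    rng = np.random.default_rng(seed)
    N, d = X.shape

    pi = np.full(n_clusters, 1.0 / n_clusters)            # mixing coefficients
    theta = rng.dirichlet(np.ones(d), size=n_clusters)    # per-cluster probability vectors

    prev_ll = -np.inf
    for _ in range(max_iter):
        # E-step: responsibilities gamma_ik, computed in log space;
        # the multinomial coefficient cancels in the normalization
        log_unnorm = np.log(pi) + X @ np.log(theta).T      # shape (N, K)
        log_norm = np.logaddexp.reduce(log_unnorm, axis=1, keepdims=True)
        gamma = np.exp(log_unnorm - log_norm)

        # M-step: re-estimate pi and theta from responsibility-weighted counts
        pi = gamma.mean(axis=0)
        weighted_counts = gamma.T @ X + 1e-12              # shape (K, d), smoothed to avoid log(0)
        theta = weighted_counts / weighted_counts.sum(axis=1, keepdims=True)

        # Convergence check on the log-likelihood (up to a constant term)
        ll = log_norm.sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll

    return pi, theta, gamma.argmax(axis=1)
```

Working in log space keeps the E-step numerically stable, and the multinomial coefficient can be dropped because it is constant with respect to the parameters. Calling `multinomial_mixture_em(X, n_clusters=2)` on an `(N, d)` count matrix returns the fitted mixing coefficients, the per-cluster probability vectors, and hard cluster assignments.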
- `n_clusters` : int
- `max_iter` : int, default = 100
- `tol` : float, default = 0.00001

Test on a synthesized 2D multinomial dataset with 2 mixtures:
from luma.clustering.mixture import MultinomialMixture
from luma.visual.evaluation import ConfusionMatrix

import matplotlib.pyplot as plt
import numpy as np


def generate_multi_dataset(n_samples: int,
                           component_probs: list,
                           mixture_weights: list) -> tuple:
    # Draw each sample from a mixture component chosen by the mixture weights
    n_components = len(component_probs)
    dataset, labels = [], []
    for _ in range(n_samples):
        component = np.random.choice(range(n_components), p=mixture_weights)
        sample = np.random.multinomial(1, component_probs[component])
        dataset.append(sample)
        labels.append(component)

    return np.array(dataset), np.array(labels)


X, y = generate_multi_dataset(n_samples=300,
                              component_probs=[[0.2, 0.8],
                                               [0.7, 0.3]],
                              mixture_weights=[0.5, 0.5])

mmm = MultinomialMixture(n_clusters=2, max_iter=1000)
mmm.fit(X)

fig = plt.figure(figsize=(10, 5))
ax1 = fig.add_subplot(1, 2, 1)
ax2 = fig.add_subplot(1, 2, 2)

# Count the values of the first feature (0 or 1) within each true component
n_clusters = 2
bincounts = [np.bincount(X[y == i, 0]) for i in range(2)]

width = 0.2
ax1.bar(np.arange(n_clusters) - width / 2, bincounts[0],
        width=width,
        label='Cluster 0')
ax1.bar(np.arange(n_clusters) + width / 2, bincounts[1],
        width=width,
        label='Cluster 1')

ax1.set_xticks([0, 1])
ax1.set_ylabel('Count')
ax1.set_title('Frequency Counts')
ax1.legend()

conf = ConfusionMatrix(y_true=y, y_pred=mmm.labels)
conf.plot(ax=ax2, show=True)