Bernoulli Naive Bayes is a variant of the Naive Bayes algorithm, designed specifically for binary/boolean features. It operates on the principle that features are independent given the class, but unlike its counterparts, it assumes that all features are binary-valued (Bernoulli distributed). This model is particularly effective in text classification problems where the presence or absence of a feature (e.g., a word) is more significant than its frequency.
Naive Bayes classifiers apply Bayes' theorem with the assumption of independence among predictors. Bernoulli Naive Bayes is tailored for dichotomous variables and models the presence or absence of a characteristic with a Bernoulli distribution. This approach is suited for datasets where features can be encoded as binary variables, representing the presence or absence of a feature.
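As a small illustration of this encoding, the sketch below converts hypothetical word counts into presence/absence indicators; the documents and vocabulary are made up for the example.

```python
import numpy as np

# Hypothetical word-count matrix: rows are documents, columns are vocabulary terms
counts = np.array([[3, 0, 1],
                   [0, 2, 0]])

# Bernoulli Naive Bayes only uses the presence (1) or absence (0) of each term
X_binary = (counts > 0).astype(int)
print(X_binary)  # [[1 0 1]
                 #  [0 1 0]]
```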
The foundation of Naive Bayes classification is Bayes' Theorem, which describes the probability of an event based on prior knowledge of conditions that might be related to the event. It is mathematically expressed as:

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$

where:

- $P(A \mid B)$ is the posterior probability of $A$ given $B$
- $P(B \mid A)$ is the likelihood of $B$ given $A$
- $P(A)$ is the prior probability of $A$
- $P(B)$ is the evidence, i.e., the marginal probability of $B$
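For a quick numerical check with made-up values, suppose $P(A) = 0.3$, $P(B \mid A) = 0.8$, and $P(B) = 0.5$; then

$$P(A \mid B) = \frac{0.8 \times 0.3}{0.5} = 0.48.$$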
The Naive Bayes classifier computes the posterior probability of each class given a set of predictors, assuming that the predictors are conditionally independent given the class. For a set of predictors $\mathbf{x} = (x_1, \dots, x_n)$ and a class $c$, the posterior probability is:

$$P(c \mid x_1, \dots, x_n) \propto P(c) \prod_{i=1}^{n} P(x_i \mid c)$$
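A minimal sketch of this factorization with illustrative numbers (two classes, three binary features; all probabilities below are made up):

```python
import numpy as np

priors = np.array([0.6, 0.4])            # assumed P(c) for classes 0 and 1
p = np.array([[0.9, 0.2, 0.5],           # assumed P(x_i = 1 | c = 0)
              [0.1, 0.7, 0.5]])          # assumed P(x_i = 1 | c = 1)
x = np.array([1, 0, 1])                  # observed binary feature vector

# Independence assumption: multiply the per-feature conditionals
likelihood = np.where(x == 1, p, 1 - p).prod(axis=1)
posterior = priors * likelihood
posterior /= posterior.sum()             # normalize over classes
print(posterior)                         # approximately [0.973, 0.027]
```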
The Bernoulli distribution is a discrete distribution with two possible outcomes: 1 (success) with probability $p$, and 0 (failure) with probability $1 - p$. For a random variable $X$ following a Bernoulli distribution, the probability mass function (PMF) is given by:

$$P(X = x) = p^{x}(1 - p)^{1 - x}$$

where $x \in \{0, 1\}$.
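As a quick check with an assumed $p = 0.7$: $P(X = 1) = 0.7^{1}(0.3)^{0} = 0.7$ and $P(X = 0) = 0.7^{0}(0.3)^{1} = 0.3$.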
In Bernoulli Naive Bayes, we are interested in estimating the probability $p_{i,c}$ that a feature $x_i$ is present (i.e., equals 1) in a given class $c$. The likelihood of observing a dataset given these parameters is the product of the probabilities of observing each individual data point, and the goal of maximum likelihood estimation (MLE) is to find the values $p_{i,c}$ that maximize this likelihood.
Given a dataset, the MLE for the probability $p_{i,c}$ is calculated as the ratio of:

- the number of samples of class $c$ in which feature $x_i$ is present, to
- the total number of samples of class $c$.

Mathematically, this is expressed as:

$$\hat{p}_{i,c} = \frac{N_{i,c}}{N_c}$$

where:

- $N_{i,c}$ is the number of samples of class $c$ in which feature $x_i$ equals 1
- $N_c$ is the total number of samples of class $c$
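A minimal NumPy sketch of this estimate on a toy binary dataset (the data here is illustrative only):

```python
import numpy as np

X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 0, 1],
              [1, 1, 1]])   # toy binary feature matrix
y = np.array([0, 0, 1, 1])  # toy class labels

# MLE: fraction of class-c samples in which each feature is present
for c in np.unique(y):
    p_hat = X[y == c].mean(axis=0)
    print(c, p_hat)         # class 0: [1.0, 0.5, 0.5], class 1: [0.5, 0.5, 1.0]
```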
To deal with the issue of zero probabilities (for instance, when a feature does not appear in any sample of a class in the training set), Laplace smoothing (or add-one smoothing) is often applied:

$$\hat{p}_{i,c} = \frac{N_{i,c} + \alpha}{N_c + 2\alpha}$$

where $\alpha$ is the smoothing parameter, typically set to 1. This adjustment ensures that each class-feature combination has a non-zero probability.
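Continuing the toy data above, the smoothed estimate with $\alpha = 1$ could be sketched as:

```python
import numpy as np

alpha = 1.0
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 0, 1],
              [1, 1, 1]])
y = np.array([0, 0, 1, 1])

for c in np.unique(y):
    X_c = X[y == c]
    # Add-one smoothing: no class-feature probability is ever exactly 0 or 1
    p_hat = (X_c.sum(axis=0) + alpha) / (X_c.shape[0] + 2 * alpha)
    print(c, p_hat)   # class 0: [0.75, 0.5, 0.5], class 1: [0.5, 0.5, 0.75]
```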
For a binary feature $x_i$ and class $c$, the probability of observing $x_i$ given $c$ is modeled as:

$$P(x_i \mid c) = p_{i,c}^{\,x_i}(1 - p_{i,c})^{1 - x_i}$$

where $p_{i,c}$ is the probability of feature $x_i$ being present (1) in class $c$, estimated from the training data.
The posterior probability for class $c$ given an observation $\mathbf{x} = (x_1, \dots, x_n)$ is calculated using Bayes' theorem:

$$P(c \mid \mathbf{x}) \propto P(c) \prod_{i=1}^{n} p_{i,c}^{\,x_i}(1 - p_{i,c})^{1 - x_i}$$
Classification is performed by selecting the class that maximizes $P(c \mid \mathbf{x})$:

$$\hat{c} = \underset{c}{\arg\max}\; P(c) \prod_{i=1}^{n} p_{i,c}^{\,x_i}(1 - p_{i,c})^{1 - x_i}$$
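Putting the pieces together, the sketch below is a minimal NumPy illustration of fitting and applying this decision rule in log space for numerical stability; it is not the luma implementation, and all names and data are illustrative.

```python
import numpy as np

def fit_bernoulli_nb(X, y, alpha=1.0):
    classes = np.unique(y)
    priors = np.array([(y == c).mean() for c in classes])
    # Smoothed per-class feature probabilities p_{i,c}
    probs = np.array([(X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)
                      for c in classes])
    return classes, priors, probs

def predict_bernoulli_nb(X, classes, priors, probs):
    # Log-posterior up to a constant: log P(c) + sum_i log P(x_i | c)
    log_lik = X @ np.log(probs).T + (1 - X) @ np.log(1 - probs).T
    return classes[np.argmax(np.log(priors) + log_lik, axis=1)]

X = np.array([[1, 0, 1], [1, 1, 0], [0, 0, 1], [1, 1, 1]])
y = np.array([0, 0, 1, 1])
classes, priors, probs = fit_bernoulli_nb(X, y)
print(predict_bernoulli_nb(X, classes, priors, probs))
```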
No parameters.
Test with a synthesized dataset of binary features and 3 classes:
```python
from luma.classifier.naive_bayes import BernoulliNaiveBayes
from luma.model_selection.split import TrainTestSplit
from luma.visual.evaluation import ConfusionMatrix, ROCCurve

from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
import numpy as np

# Generate a synthetic 3-class dataset and binarize the features
X, y = make_classification(n_samples=500,
                           n_informative=10,
                           n_redundant=10,
                           n_clusters_per_class=1,
                           random_state=42,
                           n_classes=3)
X_binary = (X > 0).astype(int)

X_train, X_test, y_train, y_test = TrainTestSplit(X_binary, y,
                                                  test_size=0.2,
                                                  random_state=42).get

bnb = BernoulliNaiveBayes()
bnb.fit(X_train, y_train)

fig = plt.figure(figsize=(10, 5))
ax1 = fig.add_subplot(1, 2, 1)
ax2 = fig.add_subplot(1, 2, 2)

# Evaluate on the full dataset with a confusion matrix and ROC curves
X_concat = np.concatenate((X_train, X_test))
y_concat = np.concatenate((y_train, y_test))

conf = ConfusionMatrix(y_concat, bnb.predict(X_concat))
conf.plot(ax=ax1)

roc = ROCCurve(y_concat, bnb.predict_proba(X_concat))
roc.plot(ax=ax2, show=True)
```
Bernoulli Naive Bayes is especially useful in text classification tasks where features encode the presence or absence of terms, such as spam filtering and sentiment analysis, and more generally in problems whose features are naturally binary.
Further exploration into Bernoulli Naive Bayes can include comparing it with the multinomial event model for text classification and studying the effect of the smoothing parameter $\alpha$ on performance.
- McCallum, Andrew, and Kamal Nigam. "A comparison of event models for Naive Bayes text classification." AAAI-98 workshop on learning for text categorization. Vol. 752. No. 1. 1998.
- Rish, Irina. "An empirical study of the naive Bayes classifier." IJCAI 2001 workshop on empirical methods in artificial intelligence. Vol. 3. No. 22. 2001.
- Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. "Introduction to Information Retrieval." Cambridge University Press, 2008.