Bernoulli Naive Bayes is a variant of the Naive Bayes algorithm. It is typically used when the data is binary, and it models the occurrence of each feature with a Bernoulli distribution. It is suited to classification problems whose features are binary, such as 'Yes'/'No', '1'/'0' or 'True'/'False'. As with every Naive Bayes model, the features are assumed to be independent of one another given the class. In this article we discuss how it works and how to use it.
Mathematics Behind Bernoulli Naive Bayes
The core of Bernoulli Naive Bayes is Bayes' Theorem, which gives the conditional probability of a class y given some data x = (x_1, x_2, ..., x_n). In the Bernoulli Naive Bayes model we assume that each feature is conditionally independent given the class y, so the likelihood of each feature can be written as:
P(x_i \mid y) = P(i \mid y)\, x_i + (1 - P(i \mid y))(1 - x_i)
- Here, P(x_i \mid y) is the conditional probability of feature x_i given that class y has occurred.
- P(i \mid y) is the probability of feature i being present in class y.
- x_i holds a binary value, either 0 or 1 (present or absent).
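As a quick illustrative check (the value 0.75 is just an assumed probability), suppose a word i appears in 75% of spam messages, i.e. P(i \mid \text{Spam}) = 0.75. Then:

P(x_i = 1 \mid \text{Spam}) = 0.75 \cdot 1 + (1 - 0.75) \cdot 0 = 0.75, \qquad P(x_i = 0 \mid \text{Spam}) = 0.75 \cdot 0 + (1 - 0.75) \cdot 1 = 0.25

so the formula simply returns P(i \mid y) when the feature is present and 1 - P(i \mid y) when it is absent.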
Since Bernoulli Naive Bayes is built on the Bernoulli distribution, let us look at that distribution first.
Bernoulli distribution
The Bernoulli distribution is a discrete probability distribution that models a single trial with exactly two outcomes, success or failure. The random variable takes the value 1 with probability p and the value 0 with probability (1 - p).
Its probability mass function is:
f(x)=\begin{cases} p^x (1-p)^{1-x} & \text{if } x \in \{0, 1\} \\ 0 & \text{otherwise} \end{cases}
In the above function, putting x = 1 gives f(x) = p and putting x = 0 gives f(x) = 1 - p. Here p denotes the probability of success of the event.
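As a minimal sketch of this formula (the function name bernoulli_pmf is just an illustrative choice), it can be evaluated directly in Python:

```python
def bernoulli_pmf(x, p):
    """Bernoulli PMF: returns p if x == 1, (1 - p) if x == 0, else 0."""
    if x not in (0, 1):
        return 0.0
    return p ** x * (1 - p) ** (1 - x)

print(bernoulli_pmf(1, 0.7))  # 0.7 -> probability of success
print(bernoulli_pmf(0, 0.7))  # 0.3 -> probability of failure
```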
Example:
To understand how Bernoulli Naive Bayes works, here's a simple binary classification problem.
| Message ID | Message Text | Class |
|---|---|---|
| M1 | "buy cheap now" | Spam |
| M2 | "limited offer buy" | Spam |
| M3 | "meet me now" | Not Spam |
| M4 | "let's catch up" | Not Spam |
1. Vocabulary
Extract all unique words from the training data:
\text{Vocabulary} = \{\text{buy, cheap, now, limited, offer, meet, me, let's, catch, up}\}
Vocabulary size V = 10
2. Binary Feature Matrix (Presence = 1, Absence = 0)
Each message is represented using binary features indicating the presence (1) or absence (0) of a word.
| ID | buy | cheap | now | limited | offer | meet | me | let's | catch | up | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| M1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Spam |
| M2 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | Spam |
| M3 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | Not Spam |
| M4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | Not Spam |
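The same binary matrix can be produced programmatically. Below is a small sketch using scikit-learn's CountVectorizer with binary=True; note that its default tokenizer treats words slightly differently from the hand-built vocabulary (for example, "let's" is split and becomes "let"), so the columns may not match the table exactly:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Training messages and labels from the table above
messages = ["buy cheap now", "limited offer buy", "meet me now", "let's catch up"]
labels = ["Spam", "Spam", "Not Spam", "Not Spam"]

# binary=True records only presence (1) or absence (0) of each word
cv = CountVectorizer(binary=True)
X = cv.fit_transform(messages)

print(cv.get_feature_names_out())
print(X.toarray())
```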
3. Apply Laplace Smoothing
P(w_i = 1 \mid C) = \frac{\text{count}(w_i, C) + 1}{N_C + 2}
where count(w_i, C) is the number of documents of class C that contain word w_i and N_C is the number of documents in class C. Here N_C = 2 for both classes (2 documents per class), so the denominator becomes 4.
4. Word Probabilities
For Spam class:
- P(\text{buy} \mid \text{Spam}) = \frac{2+1}{4} = 0.75
- P(\text{cheap} \mid \text{Spam}) = \frac{1+1}{4} = 0.5
- P(\text{now} \mid \text{Spam}) = \frac{1+1}{4} = 0.5
- P(\text{limited} \mid \text{Spam}) = \frac{1+1}{4} = 0.5
- P(\text{offer} \mid \text{Spam}) = \frac{1+1}{4} = 0.5
- P(\text{others} \mid \text{Spam}) = \frac{0+1}{4} = 0.25
For Not Spam class:
- P(\text{now} \mid \text{Not Spam}) = \frac{1+1}{4} = 0.5
- P(\text{meet} \mid \text{Not Spam}) = \frac{1+1}{4} = 0.5
- P(\text{me} \mid \text{Not Spam}) = \frac{1+1}{4} = 0.5
- P(\text{let's} \mid \text{Not Spam}) = \frac{1+1}{4} = 0.5
- P(\text{catch} \mid \text{Not Spam}) = \frac{1+1}{4} = 0.5
- P(\text{up} \mid \text{Not Spam}) = \frac{1+1}{4} = 0.5
- P(\text{others} \mid \text{Not Spam}) = \frac{0+1}{4} = 0.25
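These smoothed probabilities can be reproduced with a few lines of NumPy, starting from the binary matrix of step 2 (the column order is assumed to match the table above):

```python
import numpy as np

# Binary feature matrix from step 2 (rows M1..M4); column order:
# buy, cheap, now, limited, offer, meet, me, let's, catch, up
X = np.array([
    [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],  # M1, Spam
    [1, 0, 0, 1, 1, 0, 0, 0, 0, 0],  # M2, Spam
    [0, 0, 1, 0, 0, 1, 1, 0, 0, 0],  # M3, Not Spam
    [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],  # M4, Not Spam
])
y = np.array([1, 1, 0, 0])  # 1 = Spam, 0 = Not Spam

# Laplace-smoothed P(w_i = 1 | C) = (count(w_i, C) + 1) / (N_C + 2)
for c, name in [(1, "Spam"), (0, "Not Spam")]:
    Xc = X[y == c]
    probs = (Xc.sum(axis=0) + 1) / (Xc.shape[0] + 2)
    print(name, probs)
```

Running this prints 0.75 for "buy" in the Spam class, 0.5 for the other spam words and 0.25 for the rest, matching the values listed above.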
5. Classify Message "buy now"
The message contains the words "buy" and "now", so the feature vector is:
\text{buy}=1, \quad \text{now}=1, \quad \text{others}=0
P(\text{Spam} \mid d) \propto P(\text{Spam}) \cdot P(\text{buy}=1 \mid \text{Spam}) \cdot P(\text{now}=1 \mid \text{Spam}) = 0.5 \cdot 0.75 \cdot 0.5 = 0.1875
P(\text{Not Spam} \mid d) \propto P(\text{Not Spam}) \cdot P(\text{buy}=1 \mid \text{Not Spam}) \cdot P(\text{now}=1 \mid \text{Not Spam}) = 0.5 \cdot 0.25 \cdot 0.5 = 0.0625
6. Final Classification
P(\text{Spam} \mid d) = 0.1875,\quad P(\text{Not Spam} \mid d) = 0.0625
Since P(\text{Spam} \mid d) > P(\text{Not Spam} \mid d), the message is classified as: \boxed{\text{Spam}}
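Strictly speaking, the full Bernoulli likelihood also multiplies in a factor (1 - P(w_i = 1 \mid C)) for every vocabulary word that is absent from the message; the calculation above keeps only the present-word factors for brevity, which does not change the final decision in this example. The small sketch below reproduces both the simplified scores used above and the full Bernoulli scores:

```python
import numpy as np

# Smoothed P(w = 1 | class) from step 4, in the column order:
# buy, cheap, now, limited, offer, meet, me, let's, catch, up
p_spam     = np.array([0.75, 0.5, 0.5, 0.5, 0.5, 0.25, 0.25, 0.25, 0.25, 0.25])
p_not_spam = np.array([0.25, 0.25, 0.5, 0.25, 0.25, 0.5, 0.5, 0.5, 0.5, 0.5])
prior = 0.5  # each class holds 2 of the 4 training messages

# Feature vector for "buy now": buy = 1, now = 1, everything else 0
x = np.array([1, 0, 1, 0, 0, 0, 0, 0, 0, 0])

for name, p in [("Spam", p_spam), ("Not Spam", p_not_spam)]:
    simplified = prior * np.prod(p[x == 1])             # present-word factors only
    full = prior * np.prod(np.where(x == 1, p, 1 - p))  # full Bernoulli likelihood
    print(f"{name:9s} simplified = {simplified:.4f}  full = {full:.6f}")
```

Both scoring schemes rank Spam above Not Spam for this message.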
Implementing Bernoulli Naive Bayes
To demonstrate classification with Bernoulli Naive Bayes we use an email dataset.
The dataset has four columns: Unnamed: 0, label, label_num and text. The label is either ham or spam; ham is assigned the number 0 and spam the number 1. The text column contains the body of the email. The dataset has 5171 rows.
The dataset can be downloaded from here.
1. Importing Libraries
We import the necessary libraries: numpy, pandas and scikit-learn. BernoulliNB is part of the sklearn package.
```python
import numpy as np
import pandas as pd
from sklearn.naive_bayes import BernoulliNB
from sklearn.feature_extraction.text import CountVectorizer
```
2. Data Analysis
We take a quick look at the data: reading the CSV file, printing its shape and column names, and dropping the unnecessary Unnamed: 0 column.
```python
df = pd.read_csv("spam_ham_dataset.csv")
print(df.shape)
print(df.columns)
df = df.drop(['Unnamed: 0'], axis=1)
```
Output:
(5171, 4)
Index(['Unnamed: 0', 'label', 'text', 'label_num'], dtype='object')
3. Count Vectorizer
Since text data is used to train our classifier, we convert each message into a numeric feature vector using CountVectorizer so the model can work with it.

```python
x = df["text"].values
y = df["label_num"].values
cv = CountVectorizer()
x = cv.fit_transform(x)
```
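CountVectorizer produces raw word counts; in the next step BernoulliNB binarizes them internally through its binarize parameter. As an alternative sketch, the vectorizer itself can be asked to emit 0/1 presence features:

```python
# Alternative: produce 0/1 features directly, so no further binarization is needed
cv = CountVectorizer(binary=True)
x = cv.fit_transform(df["text"].values)
```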
4. Data Splitting, Model Training and Prediction
```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=0)

# binarize=0.0 turns the word counts into 0/1 presence features
bnb = BernoulliNB(binarize=0.0)
model = bnb.fit(X_train, y_train)
y_pred = bnb.predict(X_test)

print(classification_report(y_test, y_pred))
```
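As a quick usage example (the message text below is made up for illustration), the fitted vectorizer and trained model can score a new email:

```python
# Score a new, unseen message with the trained pipeline
new_msg = ["Subject: limited offer, buy cheap meds now"]
new_vec = cv.transform(new_msg)      # reuse the already-fitted vectorizer
print(bnb.predict(new_vec))          # predicted label: 1 = spam, 0 = ham
print(bnb.predict_proba(new_vec))    # class probabilities
```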
Output:
The classification report shows that for class 0 (ham) precision, recall and F1-score are 0.84, 0.98 and 0.91 respectively, while for class 1 (spam) they are 0.92, 0.56 and 0.70. Recall for class 1 drops because spam is the minority class in the data. The overall accuracy of the model is 86%, which is good.
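To see where the spam recall is being lost, a confusion matrix is a useful complement to the classification report (a minimal sketch using the same y_test and y_pred as above):

```python
from sklearn.metrics import confusion_matrix, accuracy_score

print(confusion_matrix(y_test, y_pred))  # rows = true class, columns = predicted class
print(accuracy_score(y_test, y_pred))
```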
Bernoulli Naive Bayes is commonly used for spam detection, text classification and sentiment analysis, and more generally wherever a model needs to reason about whether a word (or any binary feature) is present in a document or not.
Difference Between Different Naive Bayes Models
| Aspect | Gaussian Naive Bayes | Multinomial Naive Bayes | Bernoulli Naive Bayes |
|---|---|---|---|
| Feature Type | Continuous (real-valued features) | Discrete (count data or frequency-based features) | Binary (presence or absence of features) |
| Assumption | Assumes data follows a Gaussian (normal) distribution | Assumes data follows a multinomial distribution | Assumes data follows a Bernoulli (binary) distribution |
| Common Use Case | Suitable for continuous features like height, weight, etc. | Suitable for text classification (word counts) | Suitable for binary classification tasks (e.g., spam detection) |
| Data Representation | Features are treated as continuous variables | Features are treated as discrete counts or frequencies | Features are treated as binary (0 or 1) values |
| Mathematical Model | Uses Gaussian distribution (mean and variance) for each feature | Uses the multinomial distribution for word counts in text classification | Uses Bernoulli distribution (probability of a feature being present) |
| Example | Predicting whether an email is spam based on numeric features | Predicting whether a document is spam based on word counts | Classifying a document as spam or not based on word presence |
The table above gives a quick comparison between the three types of Naive Bayes: Gaussian Naive Bayes, Multinomial Naive Bayes and Bernoulli Naive Bayes.
Bernoulli Naive Bayes is a simple yet effective model for binary classification tasks. Its efficiency in handling binary data makes it suitable for applications like spam detection, sentiment analysis and many more, and its simplicity and speed make it a good fit for real-time classification problems.