Skip to content
geeksforgeeks
  • Tutorials
    • Python
    • Java
    • Data Structures & Algorithms
    • ML & Data Science
    • Interview Corner
    • Programming Languages
    • Web Development
    • CS Subjects
    • DevOps And Linux
    • School Learning
    • Practice Coding Problems
  • Courses
    • DSA to Development
    • Get IBM Certification
    • Newly Launched!
      • Master Django Framework
      • Become AWS Certified
    • For Working Professionals
      • Interview 101: DSA & System Design
      • Data Science Training Program
      • JAVA Backend Development (Live)
      • DevOps Engineering (LIVE)
      • Data Structures & Algorithms in Python
    • For Students
      • Placement Preparation Course
      • Data Science (Live)
      • Data Structure & Algorithm-Self Paced (C++/JAVA)
      • Master Competitive Programming (Live)
      • Full Stack Development with React & Node JS (Live)
    • Full Stack Development
    • Data Science Program
    • All Courses
  • NLP
  • Data Analysis Tutorial
  • Python - Data visualization tutorial
  • NumPy
  • Pandas
  • OpenCV
  • R
  • Machine Learning Tutorial
  • Machine Learning Projects
  • Machine Learning Interview Questions
  • Machine Learning Mathematics
  • Deep Learning Tutorial
  • Deep Learning Project
  • Deep Learning Interview Questions
  • Computer Vision Tutorial
  • Computer Vision Projects
  • NLP
  • NLP Project
  • NLP Interview Questions
  • Statistics with Python
  • 100 Days of Machine Learning
Open In App
Next Article:
Classification of Text Documents using Naive Bayes
Next article icon

Classification of Text Documents using Naive Bayes

Last Updated : 16 May, 2025
Comments
Improve
Suggest changes
Like Article
Like
Report

In natural language processing and machine learning Naive Bayes is a popular method for classifying text documents. It can be used to classifies documents into pre defined types based on likelihood of a word occurring by using Bayes theorem. In this article we will implement Text Classification using Naive Bayes in Python.

Text-Classification-using-naive-bayes
Text Classification using Naive Bayes

The dataset we will be using will be of text data categorized into four labels: Technology, Sports, Politics and Entertainment. Each entry contains a short sentence or statement related to a specific topic with the label indicating the category it belongs to.

1. Importing Libraries

We will need to import the necessary libraries like scikit-learn, Pandas and Numpy.

  • CountVectorizer to convert text data into numerical features using word counts.
  • MultinomialNB: The Naive Bayes classifier for multinomial data and is ideal for text classification.
Python
import pandas as pd import numpy as np from sklearn.model_selection import train_test_split from sklearn.feature_extraction.text import CountVectorizer from sklearn.naive_bayes import MultinomialNB from sklearn.metrics import accuracy_score, confusion_matrix from sklearn.feature_extraction.text import CountVectorizer 

2. Loading the Dataset

You can download the dataset from here.

Python
data = pd.read_csv('synthetic_text_data.csv') X = data['text'] y = data['label'] 

3. Splitting the Data

Now we split the dataset into training and testing sets. The training set is used to train the model while the testing set is used to evaluate its performance.

  • train_test_split: Splits the data into training (80%) and testing (20%) sets.
  • random_state: ensures reproducibility.
Python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 

4. Text Preprocessing: Converting Text to Numeric Features

We need to convert the text data into numerical format before feeding it to the model. We use CountVectorizer to convert the text into a matrix of token counts.

  • CountVectorizer(): Converts the raw text into a matrix of word counts.
  • fit_transform(): Learns the vocabulary from the training data and transforms the text into vector.
  • transform(): Applies the learned vocabulary from the training data to the test data.
Python
vectorizer = CountVectorizer() X_train_vectorized = vectorizer.fit_transform(X_train) X_test_vectorized = vectorizer.transform(X_test) 

5. Training the Naive Bayes Classifier

With the data now in the right format we train the Naive Bayes classifier on the training data. Here we use Multinomial Naive Bayes.

Multinomial Naive Bayes is a variant of the Naive Bayes classifier specifically suited for classification tasks where the features or input data are discrete such as word counts or frequencies in text classification.

Python
model = MultinomialNB() model.fit(X_train_vectorized, y_train) 

6. Making Predictions

Now that the model is trained we can use it to predict the labels for the test data using X_test_vectorized.

Python
y_pred = model.predict(X_test_vectorized) 

7. Evaluating the Model

After making predictions we need to evaluate the model's performance. We'll calculate the accuracy and confusion matrix to understand how well the model is performing.

  • accuracy_score(): Calculates the accuracy of the model by comparing the predicted labels (y_pred) with the true labels (y_test).
  • confusion_matrix(): Generates a confusion matrix to visualize how well the model classifies each category.
Python
accuracy = accuracy_score(y_test, y_pred) conf_matrix = confusion_matrix(y_test, y_pred)  print(f'Accuracy: {accuracy *100}%')  class_labels = np.unique(y_test)  plt.figure(figsize=(8, 6)) sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=class_labels, yticklabels=class_labels) plt.title('Confusion Matrix Heatmap') plt.xlabel('Predicted Label') plt.ylabel('True Label') plt.show() 

Output:

Accuracy: 88.23529411764706%

download-
Confusion Matrix

The accuracy of the model is approximately 88% meaning it correctly predicted the categories for about 88% of the test data.

Looking at the confusion matrix heatmap we can see the model made correct predictions for Sports (2), Technology (5), Politics (2) and Entertainment (6). Heatmap shows these values with darker colors representing correct predictions. However there were some misclassifications.

8. Prediction on Unseen Data

Python
user_input = ("I love artificial intelligence and machine learning")  user_input_vectorized = vectorizer.transform([user_input]) predicted_label = model.predict(user_input_vectorized) print(f"The input text belongs to the '{predicted_label[0]}' category.") 

Output:

The input text belongs to the 'Technology' category.

Here we can see our model is working fine and can predict on unseen data accurately. Naive Bayes is a strong baseline model for text classification tasks especially when the dataset is large and the features (words) are relatively independent.


Next Article
Classification of Text Documents using Naive Bayes

K

kothavvsaakash
Improve
Article Tags :
  • Machine Learning
  • AI-ML-DS
  • data mining
  • Natural-language-processing
Practice Tags :
  • Machine Learning

Similar Reads

    Naive Bayes vs. SVM for Text Classification
    Text classification is a fundamental task in natural language processing (NLP), with applications ranging from spam detection to sentiment analysis and document categorization. Two popular machine learning algorithms for text classification are Naive Bayes classifier (NB) and Support Vector Machines
    9 min read
    Text classification using CNN
    Text classification is a widely used NLP task in different business problems, and using Convolution Neural Networks (CNNs) has become the most popular choice. In this article, you will learn about the basics of Convolutional neural networks and the implementation of text classification using CNNs, a
    5 min read
    Toxic Comment Classification using BERT
    Social media users frequently encounter abuse, harassment, and insults from other users on a majority of online communication platforms like Facebook, Instagram and Youtube due to which many users stop expressing their ideas and opinions. What is the solution? The solution to this problem is to crea
    15+ min read
    Text Classification using Logistic Regression
    Text classification is a fundamental task in Natural Language Processing (NLP) that involves assigning predefined categories or labels to textual data. It has a wide range of applications, including spam detection, sentiment analysis, topic categorization, and language identification. Logistic Regre
    4 min read
    Text Classification using Decision Trees in Python
    Text classification is the process of classifying the text documents into predefined categories. In this article, we are going to explore how we can leverage decision trees to classify the textual data. Text Classification and Decision Trees Text classification involves assigning predefined categori
    5 min read
geeksforgeeks-footer-logo
Corporate & Communications Address:
A-143, 7th Floor, Sovereign Corporate Tower, Sector- 136, Noida, Uttar Pradesh (201305)
Registered Address:
K 061, Tower K, Gulshan Vivante Apartment, Sector 137, Noida, Gautam Buddh Nagar, Uttar Pradesh, 201305
GFG App on Play Store GFG App on App Store
Advertise with us
  • Company
  • About Us
  • Legal
  • Privacy Policy
  • In Media
  • Contact Us
  • Advertise with us
  • GFG Corporate Solution
  • Placement Training Program
  • Languages
  • Python
  • Java
  • C++
  • PHP
  • GoLang
  • SQL
  • R Language
  • Android Tutorial
  • Tutorials Archive
  • DSA
  • Data Structures
  • Algorithms
  • DSA for Beginners
  • Basic DSA Problems
  • DSA Roadmap
  • Top 100 DSA Interview Problems
  • DSA Roadmap by Sandeep Jain
  • All Cheat Sheets
  • Data Science & ML
  • Data Science With Python
  • Data Science For Beginner
  • Machine Learning
  • ML Maths
  • Data Visualisation
  • Pandas
  • NumPy
  • NLP
  • Deep Learning
  • Web Technologies
  • HTML
  • CSS
  • JavaScript
  • TypeScript
  • ReactJS
  • NextJS
  • Bootstrap
  • Web Design
  • Python Tutorial
  • Python Programming Examples
  • Python Projects
  • Python Tkinter
  • Python Web Scraping
  • OpenCV Tutorial
  • Python Interview Question
  • Django
  • Computer Science
  • Operating Systems
  • Computer Network
  • Database Management System
  • Software Engineering
  • Digital Logic Design
  • Engineering Maths
  • Software Development
  • Software Testing
  • DevOps
  • Git
  • Linux
  • AWS
  • Docker
  • Kubernetes
  • Azure
  • GCP
  • DevOps Roadmap
  • System Design
  • High Level Design
  • Low Level Design
  • UML Diagrams
  • Interview Guide
  • Design Patterns
  • OOAD
  • System Design Bootcamp
  • Interview Questions
  • Inteview Preparation
  • Competitive Programming
  • Top DS or Algo for CP
  • Company-Wise Recruitment Process
  • Company-Wise Preparation
  • Aptitude Preparation
  • Puzzles
  • School Subjects
  • Mathematics
  • Physics
  • Chemistry
  • Biology
  • Social Science
  • English Grammar
  • Commerce
  • World GK
  • GeeksforGeeks Videos
  • DSA
  • Python
  • Java
  • C++
  • Web Development
  • Data Science
  • CS Subjects
@GeeksforGeeks, Sanchhaya Education Private Limited, All rights reserved
We use cookies to ensure you have the best browsing experience on our website. By using our site, you acknowledge that you have read and understood our Cookie Policy & Privacy Policy
Lightbox
Improvement
Suggest Changes
Help us improve. Share your suggestions to enhance the article. Contribute your expertise and make a difference in the GeeksforGeeks portal.
geeksforgeeks-suggest-icon
Create Improvement
Enhance the article with your expertise. Contribute to the GeeksforGeeks community and help create better learning resources for all.
geeksforgeeks-improvement-icon
Suggest Changes
min 4 words, max Words Limit:1000

Thank You!

Your suggestions are valuable to us.

What kind of Experience do you want to share?

Interview Experiences
Admission Experiences
Career Journeys
Work Experiences
Campus Experiences
Competitive Exam Experiences