Continuous bag of words (CBOW) in NLP


To make a computer understand written text, we can represent words as numerical vectors. One way to do so is with word embeddings: numerical vector representations of words that capture their meaning and their relationships to other words in the language. Word embeddings can be generated using unsupervised learning algorithms such as Word2vec, GloVe, or FastText.

Word2vec is a neural network-based method for generating word embeddings, which are dense vector representations of words that capture their semantic meaning and relationships. There are two main approaches to implementing Word2vec:

  • Continuous bag-of-words (CBOW) 
  • Skip-gram
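
As a quick aside, both Word2vec variants are available off the shelf in the gensim library. The snippet below is a hedged sketch of that usage: the toy sentences and parameter values are illustrative assumptions, and gensim is not used in the rest of this article, where CBOW is instead built from scratch with TensorFlow/Keras.

Python
from gensim.models import Word2Vec

# A tiny tokenized toy corpus (illustrative assumption; any tokenized text works)
sentences = [
    ["she", "is", "a", "great", "dancer"],
    ["she", "is", "a", "great", "singer"],
]

# In gensim 4.x, sg=0 selects CBOW and sg=1 selects skip-gram
cbow_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(cbow_model.wv["dancer"][:5])           # first few embedding values
print(cbow_model.wv.most_similar("dancer"))  # nearest words by cosine similarity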

What is a Continuous Bag of Words (CBOW)?

Continuous Bag of Words (CBOW) is a popular natural language processing technique used to generate word embeddings. Word embeddings are important for many NLP tasks because they capture semantic and syntactic relationships between words in a language. CBOW is a neural network-based algorithm that predicts a target word given its surrounding context words. It is a type of "unsupervised" learning, meaning that it can learn from unlabeled data, and it is often used to pre-train word embeddings that can be used for various NLP tasks such as sentiment analysis, text classification, and machine translation. 

Example of a CBOW Model

Is there any difference between the Bag-of-Words (BoW) model and the Continuous Bag-of-Words (CBOW) model?

  • The Bag-of-Words model and the Continuous Bag-of-Words model are both techniques used in natural language processing to represent text in a computer-readable format, but they differ in how they capture context.
  • The BoW model represents text as a collection of words and their frequency in a given document or corpus. It does not consider the order or context in which the words appear, and therefore, it may not capture the full meaning of the text. The BoW model is simple and easy to implement, but it has limitations in capturing the meaning of language.
  • In contrast, the CBOW model is a neural network-based approach that captures the context of words. It learns to predict the target word based on the words that appear before and after it in a given context window. By considering the surrounding words, the CBOW model can better capture the meaning of a word in a given context. A minimal sketch contrasting the two representations follows this list.
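
To make the contrast concrete, here is a minimal sketch of a plain BoW representation using scikit-learn's CountVectorizer. The two sentences are illustrative assumptions, not the article's corpus; the point is that BoW records only per-document word counts and loses word order, which is exactly the context information CBOW recovers by predicting a word from its neighbours.

Python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy documents (illustrative assumptions, not the article's corpus)
docs = [
    "She is a great dancer",
    "She is a great singer and a great dancer",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# Note: the default tokenizer ignores single-character tokens such as "a"
print(vectorizer.get_feature_names_out())  # the learned vocabulary (order is lost)
print(X.toarray())                         # per-document word counts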

Architecture of the CBOW model

The CBOW model uses the context words around the target word to predict it. Consider the example sentence "She is a great dancer." The CBOW model converts this phrase into pairs of context words and target words. With a window size of 2 (two context words in total: one on each side of the target), the pairings look like this: ([she, a], is), ([is, great], a), ([a, dancer], great). A small sketch of how such pairs can be generated is shown below.
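
The following sketch generates these (context, target) pairs; the sentence is the article's example, and the loop is an illustration rather than part of any library.

Python
# Generate (context, target) pairs for CBOW, taking one word on each
# side of the target (two context words in total, as in the pairs above).
sentence = "she is a great dancer".split()
half_window = 1  # context words taken on each side of the target

pairs = []
for i in range(half_window, len(sentence) - half_window):
    context = sentence[i - half_window:i] + sentence[i + 1:i + half_window + 1]
    target = sentence[i]
    pairs.append((context, target))

print(pairs)
# [(['she', 'a'], 'is'), (['is', 'great'], 'a'), (['a', 'dancer'], 'great')]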


 

CBOW Architecture

The model takes the context words as input and tries to predict the target word. If four context words are used to predict one target word, four 1×W one-hot input vectors are passed to the input layer, where W is the size of the vocabulary. The hidden layer multiplies each input vector by a W×N weight matrix, producing a 1×N vector per context word. These 1×N vectors then enter the sum layer, where they are combined element-wise (summed or averaged) before a final softmax activation produces the prediction at the output layer.
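
To make the flow of matrices concrete, here is a minimal NumPy sketch of this forward pass. The sizes (a vocabulary of W = 6 words, embedding size N = 3) and the random weights are illustrative assumptions; the projections are combined by averaging here, matching the Keras implementation later in the article (summing is an equally common choice).

Python
import numpy as np

# Illustrative sizes: vocabulary of W = 6 words, hidden (embedding) size N = 3
W, N = 6, 3
rng = np.random.default_rng(0)

W_in = rng.normal(size=(W, N))    # input-to-hidden weight matrix (W x N)
W_out = rng.normal(size=(N, W))   # hidden-to-output weight matrix (N x W)

context_ids = [0, 2, 3, 5]        # indices of the four context words
one_hot = np.eye(W)[context_ids]  # four 1 x W one-hot input vectors

# Each 1 x W vector times the W x N matrix gives a 1 x N projection;
# the projections are combined element-wise (averaged here).
hidden = (one_hot @ W_in).mean(axis=0)

scores = hidden @ W_out                        # raw scores over the vocabulary
probs = np.exp(scores) / np.exp(scores).sum()  # softmax activation

print("Predicted target word id:", probs.argmax())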

Code Implementation of CBOW

Let's implement word embeddings with the CBOW model to show the similarity of words. In this article we define our own small corpus, but you can use any dataset. First, we will import all the necessary libraries and load the corpus. Next, we will tokenize the text and convert each word into an integer index.

Python
# Import the necessary modules
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, Lambda
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Define the corpus
corpus = [
    'The cat sat on the mat',
    'The dog ran in the park',
    'The bird sang in the tree'
]

# Convert the corpus to sequences of integers
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
sequences = tokenizer.texts_to_sequences(corpus)
print("After converting our words in the corpus into vector of integers:")
print(sequences)

Output:

After converting our words in the corpus into vector of integers:
[[1, 3, 4, 5, 1, 6], [1, 7, 8, 2, 1, 9], [1, 10, 11, 2, 1, 12]]
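
Each integer is simply the index the tokenizer assigned to a word; inspecting that mapping makes the sequences above easier to read. This small check assumes the tokenizer from the previous block is still in scope.

Python
# Inspect the word-to-index mapping behind the sequences above
print(tokenizer.word_index)
# Consistent with the output above: {'the': 1, 'in': 2, 'cat': 3, 'sat': 4, ...}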

Now, we will build the CBOW model with a window size of 2 (two context words on each side of the target).

Python
# Define the parameters
vocab_size = len(tokenizer.word_index) + 1
embedding_size = 10
window_size = 2

# Generate the context-target pairs
contexts = []
targets = []
for sequence in sequences:
    for i in range(window_size, len(sequence) - window_size):
        context = sequence[i - window_size:i] + sequence[i + 1:i + window_size + 1]
        target = sequence[i]
        contexts.append(context)
        targets.append(target)

# Convert the contexts and targets to numpy arrays
X = np.array(contexts)
y = to_categorical(targets, num_classes=vocab_size)

# Define the CBOW model
model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_size,
                    input_length=2 * window_size))
model.add(Lambda(lambda x: tf.reduce_mean(x, axis=1)))
model.add(Dense(units=vocab_size, activation='softmax'))

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])

# Train the model
model.fit(X, y, epochs=100, verbose=0)
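
Before moving on to visualization, here is a small optional check, a sketch that assumes the model, tokenizer and numpy import from the blocks above are in scope: feed the model the four words surrounding "ran" in the second sentence and see which word it predicts. With such a tiny corpus the prediction is not guaranteed to be correct on every run.

Python
# Predict the middle word of "the dog ran in the park" from its
# four surrounding context words (assumes `model`, `tokenizer`, `np` above)
context_words = ['the', 'dog', 'in', 'the']   # the words around "ran"
context_ids = [tokenizer.word_index[w] for w in context_words]

pred = model.predict(np.array([context_ids]), verbose=0)
predicted_id = int(np.argmax(pred, axis=-1)[0])

# Map the predicted index back to a word
index_to_word = {idx: word for word, idx in tokenizer.word_index.items()}
print("Predicted target word:", index_to_word.get(predicted_id))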

Next, we will extract the learned embeddings from the model and visualize them.

Python
# Extract the embeddings
embedding_layer = model.layers[0]
embeddings = embedding_layer.get_weights()[0]

# Perform PCA to reduce the dimensionality of the embeddings
pca = PCA(n_components=2)
reduced_embeddings = pca.fit_transform(embeddings)

# Visualize the embeddings
plt.figure(figsize=(5, 5))
for word, idx in tokenizer.word_index.items():
    x, y = reduced_embeddings[idx]
    plt.scatter(x, y)
    plt.annotate(word, xy=(x, y), xytext=(5, 2),
                 textcoords='offset points', ha='right', va='bottom')
plt.title("Word Embeddings Visualized")
plt.show()

Output:

Graphical Representation of the Word Vectors Created via CBOW


This visualization allows us to observe the similarity of the words based on their embeddings. Words that are similar in meaning or context are expected to be close to each other in the plot.
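
As a complement to the plot, similarity can also be measured numerically. The sketch below, which assumes the embeddings, tokenizer and numpy import from the blocks above are in scope, computes the cosine similarity between pairs of learned vectors; with such a tiny corpus the exact values will vary from run to run.

Python
# Compare words by cosine similarity of their learned embeddings
# (assumes `embeddings`, `tokenizer` and `np` from the blocks above)
def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v_cat = embeddings[tokenizer.word_index['cat']]
v_dog = embeddings[tokenizer.word_index['dog']]
v_tree = embeddings[tokenizer.word_index['tree']]

print("cat vs dog :", cosine_similarity(v_cat, v_dog))
print("cat vs tree:", cosine_similarity(v_cat, v_tree))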

