Pre-trained Word Embedding using GloVe in NLP Models

Last Updated : 03 Jan, 2024

In this article, we are going to see how to use pre-trained GloVe word embeddings in NLP models using Python.

What is GloVe?

Global Vectors for Word Representation, or GloVe for short, is an unsupervised learning algorithm that generates vector representations, or embeddings, of words. It was first presented in 2014 by the Stanford researchers Jeffrey Pennington, Richard Socher, and Christopher D. Manning. By using the statistical co-occurrence data of words in a given corpus, GloVe is designed to capture the semantic relationships between words.

The fundamental concept underlying GloVe is the representation of words as vectors in a continuous vector space, where the direction and distance between vectors reflect the semantic relationships between the corresponding words. To do this, GloVe builds a co-occurrence matrix from word pairs and then optimizes the word vectors so that the dot product of two word vectors approximates the logarithm of how often the two words co-occur.
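As a rough illustration of the first step (this is not the actual GloVe training code), the sketch below counts co-occurrences within a symmetric context window over a toy corpus; the original implementation additionally weights each count by the inverse of the distance between the two words, which is omitted here for simplicity.

Python3

from collections import defaultdict

# toy corpus and a symmetric context window of size 2
corpus = ["the", "cat", "sat", "on", "the", "mat"]
window = 2

# co-occurrence counts: cooc[(w1, w2)] = how often w2 appears
# within `window` positions of w1
cooc = defaultdict(float)
for i, word in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if i != j:
            cooc[(word, corpus[j])] += 1.0

print(cooc[("the", "cat")])   # 1.0 -> 'cat' appears once near 'the'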

Word embedding

In NLP models we deal with text that is human-readable and understandable, but a machine understands only numbers. Word embedding is therefore the technique of converting each word into an equivalent dense float vector. Various techniques exist depending on the use case of the model and the dataset; some of them are One-Hot Encoding, TF-IDF, Word2Vec, and FastText.

Example: 

'the': [-0.123, 0.353, 0.652, -0.232]
'the' is a very frequently used word in texts of any kind, and its equivalent 4-dimensional dense vector is shown above.
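To make the idea concrete, here is a minimal sketch (with made-up vector values, not real GloVe numbers) of an embedding table stored as a plain Python dictionary:

Python3

import numpy as np

# a tiny, made-up embedding table: each word maps to a 4-dimensional dense vector
embeddings = {
    'the':  np.array([-0.123, 0.353, 0.652, -0.232]),
    'king': np.array([0.501, 0.930, -0.210, 0.441]),
}

# looking up a word returns its dense float vector
print(embeddings['the'])    # [-0.123  0.353  0.652 -0.232]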

GloVe Data

The pre-trained GloVe files provide dense vectors for a vocabulary of roughly 400,000 English tokens, including many general-use characters such as commas, braces, and semicolons, learned from a corpus of about 6 billion tokens (hence the name glove.6B used below). The algorithm's developers make these pre-trained GloVe embeddings freely available, so it is not necessary to train the model from scratch: the embeddings can be downloaded and used immediately in a variety of natural language processing (NLP) applications. Users can select a pre-trained GloVe embedding in a dimension (e.g., 50-d, 100-d, 200-d, or 300-d vectors) that best fits their needs in terms of computational resources and task specificity.

Here d stands for dimension: 100d means that in this file each word has an equivalent vector of size 100. GloVe files are simple text files, with one word per line followed by its vector values, so they can be read like a dictionary in which words are the keys and dense vectors are the values.
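For example, a minimal loader (the file name below assumes you have downloaded and unzipped glove.6B.zip) could read such a file into a Python dictionary like this:

Python3

import numpy as np

# each line of a GloVe file looks like:
#   word v1 v2 ... vd        (whitespace-separated floats)
def load_glove(filepath):
    glove = {}
    with open(filepath, encoding="utf8") as f:
        for line in f:
            word, *values = line.split()
            glove[word] = np.asarray(values, dtype=np.float32)
    return glove

# glove = load_glove('glove.6B.100d.txt')
# print(glove['the'].shape)   # (100,)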

GloVe Embeddings Applications

GloVe embeddings are a popular option for representing words in text data and have found applications in various natural language processing (NLP) tasks. The following are some typical uses for GloVe embeddings:

Text Classification:

  • GloVe embeddings can be utilised as features in machine learning models for sentiment analysis, topic classification, spam detection, and other applications.

Named Entity Recognition (NER):

  • By capturing the semantic relationships between words and enhancing the model’s capacity to identify entities in text, GloVe embeddings can improve the performance of NER systems.

Machine Translation:

  • GloVe embeddings can be used to represent words in the source and target languages in machine translation systems, which aim to translate text from one language to another, thereby enhancing the quality of the translation.

Question Answering Systems:

  • To help models comprehend the context and relationships between words and produce more accurate answers, GloVe embeddings are used in question-answering tasks.

Document Similarity and Clustering:

  • GloVe embeddings enable applications in information retrieval and document organization by measuring the semantic similarity between documents or grouping documents according to their content.

Word Analogy Tasks:

  • In word analogy tasks, GloVe embeddings frequently yield good results. For instance, the vector computed for "king" − "man" + "woman" closely resembles the vector for "queen", demonstrating the capacity to capture semantic relationships (see the sketch below).
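A minimal sketch of such an analogy query, assuming a dictionary glove of pre-trained vectors has already been loaded (for instance with the loader shown earlier):

Python3

import numpy as np

def cosine(a, b):
    # cosine similarity between two vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def analogy(glove, a, b, c, topn=1):
    # return the word(s) whose vector is closest to vector(a) - vector(b) + vector(c)
    target = glove[a] - glove[b] + glove[c]
    candidates = [(w, cosine(v, target))
                  for w, v in glove.items() if w not in (a, b, c)]
    return sorted(candidates, key=lambda t: t[1], reverse=True)[:topn]

# analogy(glove, 'king', 'man', 'woman')   # expected to rank 'queen' highly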

Semantic Search:

  • In semantic search applications, where retrieving documents or passages according to their semantic relevance to a user’s query is the aim, GloVe embeddings are helpful.

How does GloVe work?

The creation of a word co-occurrence matrix is the fundamental component of GloVe. This matrix provides a quantitative measure of the semantic affinity between words by capturing how frequently they appear together in a given context. GloVe then optimizes the word vectors so that, for every word pair, the dot product of their vectors (plus per-word bias terms) stays close to the logarithm of their co-occurrence count. This lets GloVe produce dense vector representations that capture both syntactic and semantic relationships.
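For reference, the weighted least-squares objective minimized in the original GloVe paper is

J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2

where X_{ij} is the number of times word j occurs in the context of word i, w_i and \tilde{w}_j are the word and context vectors, b_i and \tilde{b}_j are bias terms, and f is a weighting function that limits the influence of very frequent co-occurrences.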

Create Vocabulary Dictionary

Vocabulary is the collection of all unique words present in the training dataset. First the dataset is tokenized into words, then the frequency of each word is counted, and finally the words are sorted in decreasing order of frequency, so that the most frequent words are placed at the beginning of the dictionary.

Dataset = {The peon is ringing the bell}
Vocabulary = {'the': 2, 'peon': 1, 'is': 1, 'ringing': 1, 'bell': 1}
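A minimal sketch of building such a frequency-sorted vocabulary with plain Python (index 0 is conventionally reserved for padding):

Python3

from collections import Counter

# toy dataset
dataset = "The peon is ringing the bell"
tokens = dataset.lower().split()

# count word frequencies and assign indices in decreasing order of frequency
counts = Counter(tokens)
vocab = {word: i + 1                      # index 0 is reserved for padding
         for i, (word, _) in enumerate(counts.most_common())}

print(vocab)   # {'the': 1, 'peon': 2, 'is': 3, 'ringing': 4, 'bell': 5}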

Algorithm for word embedding

  • Preprocess the text data.
  • Create the dictionary (vocabulary).
  • Traverse the GloVe file of a specific dimension and compare each word with all words in the dictionary.
  • If a match occurs, copy the equivalent vector from the GloVe file into embedding_matrix at the corresponding index.

Code Implementation:

Using TensorFlow's Tokenizer, the code tokenizes a set of words and prints the resulting vocabulary. Lines for downloading and unzipping the GloVe vectors are included but commented out; before running the code for the first time, uncomment them (or download and unzip glove.6B.zip manually) so that the GloVe file is available. The code then defines a function that uses the pre-trained GloVe vectors to build an embedding matrix for the given vocabulary, and finally prints the dense vector for the first word in the vocabulary.

Python3

# code for GloVe word embedding
from tensorflow.keras.preprocessing.text import Tokenizer
import numpy as np

x = {'text', 'the', 'leader', 'prime',
     'natural', 'language'}

# create the vocabulary dictionary
tokenizer = Tokenizer()
tokenizer.fit_on_texts(x)

# number of unique words in the dictionary
print("Number of unique words in dictionary=",
      len(tokenizer.word_index))
print("Dictionary is = ", tokenizer.word_index)

# download GloVe and unzip it in the notebook
# !wget http://nlp.stanford.edu/data/glove.6B.zip
# !unzip glove*.zip

# vocab: 'the': 1, mapping of words to integer indices 1, 2, 3, ...
# embedding: index -> dense vector
def embedding_for_vocab(filepath, word_index, embedding_dim):
    # add 1 because index 0 is reserved for padding
    vocab_size = len(word_index) + 1
    embedding_matrix_vocab = np.zeros((vocab_size, embedding_dim))

    with open(filepath, encoding="utf8") as f:
        for line in f:
            word, *vector = line.split()
            if word in word_index:
                idx = word_index[word]
                embedding_matrix_vocab[idx] = np.array(
                    vector, dtype=np.float32)[:embedding_dim]

    return embedding_matrix_vocab


# embedding matrix for the vocabulary in word_index
embedding_dim = 50
embedding_matrix_vocab = embedding_for_vocab(
    '../glove.6B.50d.txt', tokenizer.word_index, embedding_dim)

print("Dense vector for first word is => ",
      embedding_matrix_vocab[1])

Output:

Number of unique words in dictionary= 6
Dictionary is = {'leader': 1, 'the': 2, 'prime': 3, 'natural': 4, 'language': 5, 'text': 6}
Dense vector for first word is => [-0.1567 0.26117 0.78881001 0.65206999 1.20019996 0.35400999
-0.34298 0.31702 -1.15020001 -0.16099 0.15798 -0.53501999
-1.34679997 0.51783001 -0.46441001 -0.19846 0.27474999 -0.26154
0.25531 0.33388001 -1.04130006 0.52525002 -0.35442999 -0.19137
-0.08964 -2.33139992 0.12433 -0.94405001 -1.02330005 1.35070002
2.55240011 -0.16897 -1.72899997 0.32548001 -0.30914 -0.63056999
-0.22211 -0.15589 -0.43597999 0.0568 -0.090885 0.75028002
-1.31529999 -0.75358999 0.82898998 0.051397 -1.48049998 -0.11134
0.27090001 -0.48712999]
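
Although the snippet above stops after printing a vector, a common next step is to use the matrix to initialise a Keras Embedding layer so that a model starts from the pre-trained GloVe vectors. A minimal sketch, reusing tokenizer, embedding_dim, and embedding_matrix_vocab from the code above:

Python3

import tensorflow as tf

# initialise an Embedding layer with the pre-trained GloVe matrix
embedding_layer = tf.keras.layers.Embedding(
    input_dim=len(tokenizer.word_index) + 1,   # vocab size incl. padding index 0
    output_dim=embedding_dim,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix_vocab),
    trainable=False                            # keep the pre-trained vectors frozen
)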

