Self-Attention in NLP

Last Updated : 06 May, 2025

Self-attention is a technique used in NLP that helps models understand relationships between words or entities in a sentence, no matter where they appear. It is an important part of the Transformer model, which is used in tasks like translation and text generation.

Understanding Attention in NLP

The goal of the self-attention mechanism is to improve the performance of traditional models such as the encoder-decoder models built on RNNs (Recurrent Neural Networks). In a traditional encoder-decoder model, the input sequence is compressed into a single fixed-length vector, which is then used to generate the output. This works well for short sequences but struggles with long ones, because important information can be lost when everything is compressed into a single vector. The self-attention mechanism was introduced to overcome this problem.

Encoder-Decoder Model

An encoder-decoder model is used in machine learning tasks that involve sequences like translating sentences, generating text or creating captions for images. Here's how it works:

  • Encoder: It takes the input sequence, such as a sentence, and processes it, converting the input into a fixed-size summary called a latent vector or context vector. This vector holds all the important information from the input sequence.
  • Decoder: It then uses this summary to generate an output sequence, such as a translated sentence, trying to reconstruct the desired output from the encoded information.
[Figure: Encoder-Decoder Model]
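To make the fixed-length bottleneck concrete, here is a deliberately schematic Python sketch (the encode/decode helpers are toy stand-ins for illustration, not a real RNN): no matter how long the input is, everything the decoder sees is one fixed-size context vector.

```python
import numpy as np

def encode(token_embeddings, hidden_size=256):
    """Toy encoder: squeezes the whole input sequence into ONE fixed-size vector.
    Assumes each embedding already has hidden_size dimensions."""
    context = np.zeros(hidden_size)
    for emb in token_embeddings:        # a real RNN would update a hidden state here
        context += emb                  # every token is mixed into the same vector
    return context / max(len(token_embeddings), 1)

def decode(context, steps=10):
    """Toy decoder: every output step can only look at that single context vector,
    which is why long inputs lose detail without attention."""
    return [context for _ in range(steps)]   # a real decoder would emit tokens here

# A 5-token input and a 500-token input both end up as a single 256-dim summary:
short = encode(np.random.randn(5, 256))
long = encode(np.random.randn(500, 256))
print(short.shape, long.shape)   # (256,) (256,)
```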

Attention Layer in Transformer

The attention layers sit inside the full Transformer architecture, which includes:

[Figure: Transformer architecture]

1. Input Embedding: Input text, such as a sentence, is first converted into embeddings, which are vector representations of words in a continuous space.

2. Positional Encoding: Since the Transformer doesn't process words sequentially like RNNs, positional encodings are added to the input embeddings to encode the position of each word in the sentence.
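The article does not spell out which positional encoding is used; the original Transformer paper uses fixed sinusoidal encodings. Here is a minimal NumPy sketch of that variant (the function name and shapes are illustrative):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix of sinusoidal positional encodings."""
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    dims = np.arange(d_model)[None, :]             # (1, d_model)
    # Each pair of dimensions (2i, 2i+1) shares one frequency: 1 / 10000^(2i / d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates               # (seq_len, d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])          # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])          # odd dimensions use cosine
    return pe

# The encodings are simply added to the word embeddings before the first attention layer:
# embeddings = embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```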

3. Multi-Head Attention:

  • Multiple attention heads are applied in parallel, processing different parts of the sequence simultaneously.
  • Each head computes attention scores from queries (Q), keys (K) and values (V) and gathers information from different parts of the input.
  • The outputs of all attention heads are combined and passed on for further processing.

4. Add and Norm: This layer applies residual connections followed by layer normalization, which helps avoid vanishing gradients and ensures stable training.

5. Feed Forward: After attention, the output is passed through a position-wise feed-forward neural network for further transformation.
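A minimal NumPy sketch of steps 4 and 5 together; the layer_norm helper, weight shapes and ReLU choice are illustrative assumptions (the original Transformer uses a two-layer ReLU feed-forward block), not details given in this article.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each token's feature vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer_output):
    # Step 4: residual connection followed by layer normalization.
    return layer_norm(x + sublayer_output)

def feed_forward(x, W1, b1, W2, b2):
    # Step 5: position-wise feed-forward network, Linear -> ReLU -> Linear.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2
```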

6. Masked Multi-Head Attention (Decoder): Used in the decoder, this ensures that each word can only attend to previous words in the sequence, not future ones.

7. Output Embedding: Finally, the transformed output is mapped to the output space and passed through a softmax function to generate output probabilities.

Self-Attention Mechanism

This mechanism captures long-range dependencies by computing attention between all pairs of words in the sequence, letting the model look at the entire sequence at once. Unlike traditional models that process words one by one, it lets the model identify which words are most relevant to each other, which is useful for tasks like translation or text generation. Here's how the self-attention mechanism works:

  1. Input Vectors and Weight Matrices: Each encoder input vector is multiplied by three trained weight matrices (W(Q), W(K), W(V)) to generate the query, key and value vectors.
  2. Query-Key Interaction: The query vector of the current input is multiplied by the key vectors of all inputs to calculate the attention scores.
  3. Scaling Scores: The attention scores are divided by the square root of the key dimension d_k (typically 64, so the divisor is 8) to prevent the values from becoming too large and destabilizing the computation.
  4. Softmax Function: The softmax function is applied to the scaled scores to normalize them into probabilities.
  5. Weighted Value Vectors: Each value vector is multiplied by its corresponding softmax score.
  6. Summing Weighted Vectors: The weighted value vectors are summed to produce the self-attention output for that input.

The above procedure is applied to every position in the input sequence. Mathematically, the self-attention output for the input matrices (Q, K, V) is calculated as:

Attention\left ( Q, K, V \right ) = softmax\left ( \frac{QK^{T}}{\sqrt{d_{k}}} \right )V

where Q, K and V are the matrices formed by stacking the query, key and value vectors. In multi-head attention (described below), each head applies this same operation to its own learned projections of the input:

head_{i} = Attention \left ( QW_{i}^{Q}, KW_{i}^{K}, VW_{i}^{V} \right )
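The six numbered steps above are exactly what this formula computes. Below is a minimal NumPy sketch of scaled dot-product self-attention; the function names, shapes and random weights are illustrative (in a trained model the weight matrices are learned), not taken from any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """X: (seq_len, d_model) input embeddings; W_q, W_k, W_v: (d_model, d_k)."""
    Q = X @ W_q                                   # step 1: query vectors
    K = X @ W_k                                   # step 1: key vectors
    V = X @ W_v                                   # step 1: value vectors
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # steps 2-3: query-key dot products, scaled
    weights = softmax(scores, axis=-1)            # step 4: normalize scores to probabilities
    return weights @ V                            # steps 5-6: weighted sum of value vectors

# Toy example: 5 tokens, d_model = 16, d_k = 8 (all values random for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))
W_q, W_k, W_v = [rng.normal(size=(16, 8)) for _ in range(3)]
output = self_attention(X, W_q, W_k, W_v)         # shape: (5, 8)
```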

Multi-Headed Attention

In the multi-headed attention mechanism, multiple attention heads are used in parallel, which allows the model to focus on different parts of the input sequence simultaneously. This increases the model's ability to capture various relationships between words in the sequence. Here's a step-by-step breakdown of how multi-headed attention works:

[Figure: Multi-headed attention]
  1. Generate Embeddings: For each word in the input sentence, generate its embedding representation.
  2. Create Multiple Attention Heads: Create h attention heads (e.g. h = 8), each with its own weight matrices W(Q), W(K), W(V).
  3. Matrix Multiplication: Multiply the input matrix by each head's weight matrices W(Q), W(K), W(V) to produce its query, key and value matrices.
  4. Apply Attention: Apply the attention mechanism to the query, key and value matrices of each head, producing an output matrix per head.
  5. Concatenate and Transform: Concatenate the output matrices from all attention heads and multiply the result by the weight matrix W_{O} to generate the final output of the multi-headed attention layer.

Mathematically, multi-head attention can be represented as:

MultiHead\left ( Q, K, V \right ) = concat\left ( head_{1}, head_{2}, ..., head_{n} \right )W_{O}
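As a companion to this formula, here is a minimal NumPy sketch of the multi-head computation described above; the head count, dimensions and random weights are illustrative assumptions rather than values from the article.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q_heads, W_k_heads, W_v_heads, W_o):
    """X: (seq_len, d_model); each per-head weight: (d_model, d_k); W_o: (h*d_k, d_model)."""
    heads = []
    for W_q, W_k, W_v in zip(W_q_heads, W_k_heads, W_v_heads):
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        scores = Q @ K.T / np.sqrt(K.shape[-1])    # scaled dot-product attention per head
        heads.append(softmax(scores) @ V)          # output of one head: (seq_len, d_k)
    concatenated = np.concatenate(heads, axis=-1)  # concat(head_1, ..., head_h)
    return concatenated @ W_o                      # final linear projection with W_O

# Toy example: h = 8 heads, d_model = 64, d_k = 8, 5 tokens
rng = np.random.default_rng(1)
h, d_model, d_k, seq_len = 8, 64, 8, 5
X = rng.normal(size=(seq_len, d_model))
W_q_heads = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_k_heads = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_v_heads = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_o = rng.normal(size=(h * d_k, d_model))
out = multi_head_attention(X, W_q_heads, W_k_heads, W_v_heads, W_o)  # shape: (5, 64)
```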

Why Multi-Headed Attention?

  1. Captures Different Aspects: Each attention head focuses on a different part of the sequence, allowing the model to capture a variety of relationships between words.
  2. Parallel Processing: Attention for the different heads can be computed in parallel, which speeds up training.
  3. Improved Performance: By combining the results from multiple heads, the model becomes more effective and flexible at understanding complex relationships within the input data.

The Transformer architecture uses multi-headed attention in three places:

  • Encoder-Decoder Attention: In this layer, queries come from the previous decoder layer while the keys and values come from the encoder’s output. This allows each position in the decoder to focus on all positions in the input sequence.
  • Encoder Self-Attention: This layer receives queries, keys and values from the output of the previous encoder layer. Each position in the encoder looks at all positions from the previous layer to calculate attention scores.
[Figure: Self-attention in the encoder]
  • Decoder Self-Attention: Similar to the encoder's self-attention, but here the queries, keys and values come from the previous decoder layer. Each position can attend to the current and previous positions, but future positions are masked (set to -Inf before the softmax) to prevent the model from looking ahead when generating the output; this is called masked self-attention (a minimal sketch of this mask follows below).
[Figure: Self-attention in the decoder]
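To make the masking step concrete, here is a minimal NumPy sketch of a causal (look-ahead) mask: positions after the current one are set to -inf before the softmax, so they receive exactly zero attention weight. The shapes and helper names are illustrative.

```python
import numpy as np

def causal_mask(seq_len):
    # -inf above the diagonal (future positions), 0 on and below it.
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

def masked_attention_weights(scores):
    """scores: (seq_len, seq_len) scaled dot-product scores for one head."""
    masked = scores + causal_mask(scores.shape[0])        # future positions -> -inf
    masked = masked - masked.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(masked)                                    # exp(-inf) = 0, so future weights vanish
    return e / e.sum(axis=-1, keepdims=True)

# With uniform scores, each row spreads its weight only over current and past positions:
print(np.round(masked_attention_weights(np.zeros((4, 4))), 2))
# [[1.   0.   0.   0.  ]
#  [0.5  0.5  0.   0.  ]
#  [0.33 0.33 0.33 0.  ]
#  [0.25 0.25 0.25 0.25]]
```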

Advantages of Self-Attention

  1. Parallelization: Unlike sequential models, it allows fully parallel processing, which speeds up training.
  2. Long-Range Dependencies: It provides direct access to distant elements making it easier to model complex structures and relationships across long sequences.
  3. Contextual Understanding: Each token’s representation is influenced by the entire sequence which integrates global context and improves accuracy.
  4. Interpretable Weights: Attention maps can show which parts of the input were most influential in making decisions.

Key Challenges of Self-Attention

Despite its many advantages, self-attention also has some limitations:

  1. Computational Cost: Self-attention computes pairwise interactions between all input tokens, which gives a time and memory complexity of O(n^2), where n is the sequence length; doubling the sequence length quadruples the number of pairwise scores. This becomes inefficient for long sequences.
  2. Memory Usage: The large number of pairwise calculations requires a lot of memory when working with very long sequences or large batch sizes.
  3. Lack of Local Context: Self-attention focuses on global dependencies across all tokens and may not capture local patterns effectively, which can be a drawback when local context matters more than global context.
  4. Overfitting: Because of its capacity to model complex relationships, it can overfit when trained on small datasets.

As machine learning continues to grow, the exploration of attention mechanisms is opening up new opportunities and changing the way models understand and process complex data.

