Multi-Head Attention Mechanism

Last Updated : 13 Feb, 2025

The multi-head attention mechanism is a key component of the Transformer architecture, introduced in the seminal paper "Attention Is All You Need" by Vaswani et al. in 2017. It plays a crucial role in enhancing the ability of models to focus on different parts of an input sequence simultaneously, making it particularly effective for tasks such as machine translation, text generation, and more.

Understanding Attention Mechanism

Before diving into multi-head attention, let’s first understand the standard self-attention mechanism, also known as scaled dot-product attention.

Given a set of input vectors, self-attention computes attention scores to determine how much focus each element in the sequence should have on the others. This is done using three key matrices:

  • Query (Q) – Represents what the current word is looking for in the other words of the sequence.
  • Key (K) – Represents what each word offers to be matched against the queries.
  • Value (V) – Contains the actual word representations that are combined to produce the output.

The self-attention is computed as:

\text{Attention}(Q, K, V) = \text{softmax} \left( \frac{QK^T}{\sqrt{d_k}} \right) V
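
To make the formula concrete, below is a minimal PyTorch sketch of scaled dot-product attention for a single sequence. The function name and the toy sizes (4 tokens, 8 dimensions) are chosen purely for illustration.

Python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (seq_len, d_k) -- every query is compared with every key
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # (seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)             # each row sums to 1
    return weights @ V                              # weighted sum of the values

# Toy example: 4 tokens with 8-dimensional representations
Q = torch.rand(4, 8)
K = torch.rand(4, 8)
V = torch.rand(4, 8)
print(scaled_dot_product_attention(Q, K, V).shape)  # torch.Size([4, 8])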

What is Multi-Head Attention?

Multi-head attention extends self-attention by splitting the input into multiple heads, enabling the model to capture diverse relationships and patterns.

Instead of using a single set of Q, K, V matrices, the input embeddings are projected into multiple sets (heads), each with its own Q, K, V:

  1. Linear Transformation: The input X is projected into multiple smaller-dimensional subspaces using different weight matrices.
    Q_i = XW_i^Q, \quad K_i = XW_i^K, \quad V_i = XW_i^V
    where i denotes the head index.
  2. Independent Attention Computation: Each head independently computes its own self-attention using the scaled dot-product formula.
  3. Concatenation: The outputs from all heads are concatenated.
  4. Final Linear Transformation: A final weight matrix is applied to transform the concatenated output into the desired dimension.

Mathematically, multi-head attention is expressed as:

\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \text{head}_2, \dots, \text{head}_h) W^O

where:

\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)

W^O is a final weight matrix to project the concatenated output back into the model’s required dimensions.
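
Putting the four steps and the formulas together, here is an illustrative from-scratch sketch in PyTorch. The class SimpleMultiHeadAttention and its internals are a minimal version written for this article (not a standard API): it packs the per-head projections W_i^Q, W_i^K, W_i^V into one linear layer each and omits dropout, masking and other options that production implementations provide.

Python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMultiHeadAttention(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        assert embed_dim % num_heads == 0, "embed_dim must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        # Step 1: linear projections (all heads packed into one matrix each)
        self.w_q = nn.Linear(embed_dim, embed_dim)
        self.w_k = nn.Linear(embed_dim, embed_dim)
        self.w_v = nn.Linear(embed_dim, embed_dim)
        # Step 4: final output projection W^O
        self.w_o = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):
        # x: (batch, seq_len, embed_dim)
        batch, seq_len, _ = x.shape
        def split_heads(t):
            return t.view(batch, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        Q, K, V = split_heads(self.w_q(x)), split_heads(self.w_k(x)), split_heads(self.w_v(x))
        # Step 2: scaled dot-product attention computed independently per head
        scores = Q @ K.transpose(-2, -1) / self.head_dim ** 0.5   # (batch, heads, seq, seq)
        weights = F.softmax(scores, dim=-1)
        heads = weights @ V                                       # (batch, heads, seq, head_dim)
        # Step 3: concatenate the heads back into embed_dim
        concat = heads.transpose(1, 2).reshape(batch, seq_len, -1)
        # Step 4: final linear transformation
        return self.w_o(concat)

mha = SimpleMultiHeadAttention(embed_dim=64, num_heads=8)
x = torch.rand(2, 10, 64)          # (batch, seq_len, embed_dim)
print(mha(x).shape)                # torch.Size([2, 10, 64])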

Why Use Multiple Attention Heads?

Multi-head attention provides several advantages:

  • Captures different relationships: different heads can attend to different aspects of the input, such as syntactic structure and semantic similarity.
  • Improves learning efficiency: each head works in a smaller subspace, so the heads run in parallel at roughly the same total cost as a single full-dimensional attention head.
  • Enhances robustness: the model does not depend on a single attention pattern, which makes the learned representations less brittle.

Multi-Head Attention in Transformers

Multi-head attention is used in several places within a Transformer model:

  1. Encoder Self-Attention: lets every word in the input sequence attend to every other input word, building contextual representations of the source.
  2. Masked Decoder Self-Attention: lets each output position attend only to earlier positions in the already generated sequence; a causal mask blocks access to future tokens.
  3. Encoder-Decoder (Cross) Attention: lets the decoder attend to the encoder's output, linking the generated sequence to the input sequence.
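
The snippet below sketches these three uses with PyTorch's nn.MultiheadAttention. For brevity it reuses a single attention layer and made-up toy sizes (a 12-token source and a 10-token target); a real Transformer uses separate attention layers at each of these locations.

Python
import torch
import torch.nn as nn

embed_dim, num_heads = 64, 8
src_len, tgt_len, batch_size = 12, 10, 2

enc_out = torch.rand(src_len, batch_size, embed_dim)  # encoder output (seq, batch, embed)
tgt = torch.rand(tgt_len, batch_size, embed_dim)      # decoder input generated so far

attn = nn.MultiheadAttention(embed_dim, num_heads)

# 1. Encoder self-attention: query, key and value all come from the input sequence
enc_self, _ = attn(enc_out, enc_out, enc_out)

# 2. Decoder self-attention: a causal mask (True = blocked) hides future positions
causal_mask = torch.triu(torch.ones(tgt_len, tgt_len, dtype=torch.bool), diagonal=1)
dec_self, _ = attn(tgt, tgt, tgt, attn_mask=causal_mask)

# 3. Encoder-decoder attention: queries from the decoder, keys/values from the encoder
cross, _ = attn(tgt, enc_out, enc_out)

print(enc_self.shape, dec_self.shape, cross.shape)
# torch.Size([12, 2, 64]) torch.Size([10, 2, 64]) torch.Size([10, 2, 64])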

Implementing Multi-head Attention using PyTorch

Here's how you can implement multi-head attention using PyTorch's nn.MultiheadAttention. This code initializes an 8-head multi-head attention mechanism with a 64-dimensional embedding size and applies it to a sample input tensor.

Python
import torch
import torch.nn as nn

# Define model parameters
embed_dim = 64
num_heads = 8
seq_length = 10
batch_size = 2

# Create random input tensor of shape (seq_length, batch_size, embed_dim)
x = torch.rand(seq_length, batch_size, embed_dim)

# Define multi-head attention layer and apply self-attention (query = key = value = x)
multihead_attn = nn.MultiheadAttention(embed_dim, num_heads)
output, _ = multihead_attn(x, x, x)

print("Output shape:", output.shape)

Output:

Output shape: torch.Size([10, 2, 64])
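
Note that nn.MultiheadAttention expects inputs of shape (seq_length, batch_size, embed_dim) by default, which is why the sequence dimension comes first above; constructing the layer with batch_first=True switches the expected shape to (batch_size, seq_length, embed_dim).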

Applications of Multi-Head Attention

Multi-head attention is widely used in various domains:

  • Natural Language Processing (NLP)
    • Machine translation (e.g., Google Translate)
    • Text summarization
    • Chatbots and conversational AI
  • Computer Vision: Vision Transformers (ViTs) for image recognition
  • Speech Processing: Speech-to-text models (e.g., Whisper by OpenAI)

The multi-head attention mechanism is one of the most powerful innovations in deep learning. By attending to multiple aspects of the input sequence in parallel, it enables better representation learning, enhanced contextual understanding, and improved performance across NLP, vision, and speech tasks.

