DistilBERT in Natural Language Processing

Last Updated : 24 Mar, 2025

DistilBERT is a distilled version of BERT: it is trained using knowledge distillation, a technique where a smaller model (the student) learns from a larger model (the teacher). It retains about 97% of BERT’s performance while being 40% smaller and 60% faster, making it highly efficient for NLP tasks such as text classification, sentiment analysis and question answering.

DistilBERT focuses on the following key objectives:

  • Computational Efficiency: BERT requires substantial computational resources because of its large number of parameters. DistilBERT reduces the model size by 40%, so it needs less computation and time, which is especially useful when working with large datasets.
  • Faster Inference Speed: BERT's size leads to slow inference times. DistilBERT is smaller and optimized for speed, giving roughly 60% faster inference than BERT; for on-device applications, such as mobile question-answering apps, it is 71% faster.
  • Comparable Performance: Although DistilBERT is much smaller, it retains about 97% of BERT’s performance on popular NLP benchmarks. This balance between size reduction and minimal performance degradation makes it a solid alternative to BERT.

How Does DistilBERT Work?

DistilBERT uses knowledge distillation, in which a smaller model (the student) learns to replicate the behavior of a larger model (the teacher). The student is trained to mimic the teacher's predictions and internal representations.

[Figure] Teacher-student model for knowledge distillation

In the above diagram, the teacher model (BERT) is a large neural network with many parameters. The student model (DistilBERT) is a smaller network trained to replicate the teacher’s behavior through knowledge transfer. The distillation process minimizes the difference between the teacher’s soft predictions and the student’s output, allowing the student model to retain most of the teacher’s knowledge while being significantly smaller.
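A quick way to see what "soft predictions" means: dividing the teacher's logits by a temperature greater than 1 before the softmax spreads probability mass over the non-top classes, so the student also learns how the teacher ranks the alternatives. The logits below are made up purely for illustration.

Python
import torch
import torch.nn.functional as F

teacher_logits = torch.tensor([4.0, 1.5, 0.2])   # hypothetical teacher scores for 3 classes

hard_targets = F.softmax(teacher_logits, dim=-1)        # temperature T = 1
soft_targets = F.softmax(teacher_logits / 2.0, dim=-1)  # softened targets, T = 2

print(hard_targets)  # roughly [0.90, 0.07, 0.02] -- dominated by the top class
print(soft_targets)  # roughly [0.70, 0.20, 0.10] -- more of the teacher's "dark knowledge" survives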

Training DistilBERT

DistilBERT is trained using a triple loss, which combines:

  1. Masked Language Modeling (MLM) Loss – Trains the student to predict randomly masked tokens in a sentence (the same objective BERT uses).
  2. Distillation Loss – Encourages the student model to mimic the teacher’s soft predictions.
  3. Cosine-Distance Loss – Aligns the hidden state representations of the student and teacher models.

By combining these losses, DistilBERT learns efficiently from BERT while maintaining high performance.
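As a rough sketch of how such a combined objective could look in code (the loss weights and temperature below are hypothetical placeholders, not the values used to train DistilBERT):

Python
import torch
import torch.nn.functional as F

def triple_loss(student_logits, teacher_logits, labels,
                student_hidden, teacher_hidden,
                temperature=2.0, w_mlm=1.0, w_distil=1.0, w_cos=1.0):
    # 1. Masked language modeling loss on the hard labels (-100 marks unmasked positions)
    mlm_loss = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                               labels.view(-1), ignore_index=-100)
    # 2. Distillation loss: match the teacher's temperature-softened distribution
    distil_loss = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                           F.softmax(teacher_logits / temperature, dim=-1),
                           reduction="batchmean") * temperature ** 2
    # 3. Cosine loss aligning student and teacher hidden states
    #    (student_hidden / teacher_hidden flattened to shape (N, hidden_dim))
    target = torch.ones(student_hidden.size(0), device=student_hidden.device)
    cos_loss = F.cosine_embedding_loss(student_hidden, teacher_hidden, target)
    return w_mlm * mlm_loss + w_distil * distil_loss + w_cos * cos_loss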

Implementation: Text Classification with DistilBERT

Let’s implement DistilBERT for a text classification task using the transformers library by Hugging Face. We’ll use the IMDb movie review dataset to classify reviews as positive or negative.

Step 1: Install Required Libraries

First install the necessary libraries:

pip install transformers datasets torch

Step 2: Load the Dataset

We'll use the IMDb dataset available in Hugging Face's datasets library.

Python
from datasets import load_dataset

dataset = load_dataset("imdb")
train_dataset, test_dataset = dataset['train'], dataset['test']
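The IMDb split has 25,000 training and 25,000 test reviews, so full fine-tuning takes a while. If you only want to smoke-test the pipeline first, you could work on a small random subset (optional; the subset sizes below are arbitrary):

Python
# Optional: a small random subset keeps experiment turnaround fast
small_train = dataset["train"].shuffle(seed=42).select(range(2000))
small_test = dataset["test"].shuffle(seed=42).select(range(500))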

Step 3: Preprocess the Data

DistilBERT requires input data to be tokenized. We’ll use the AutoTokenizer class to preprocess the text.

Python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def preprocess_function(examples):
    return tokenizer(examples['text'], truncation=True, padding=True, max_length=512)

tokenized_train = train_dataset.map(preprocess_function, batched=True)
tokenized_test = test_dataset.map(preprocess_function, batched=True)
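As a side note, instead of padding inside the tokenization step you could pad dynamically per batch with DataCollatorWithPadding, a common alternative when review lengths vary a lot (a sketch; the variable names are just for illustration):

Python
from transformers import DataCollatorWithPadding

# Tokenize with truncation only; the collator pads each batch to its longest example
def preprocess_no_pad(examples):
    return tokenizer(examples["text"], truncation=True, max_length=512)

tokenized_train_dynamic = train_dataset.map(preprocess_no_pad, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
# Later: pass data_collator=data_collator when constructing the Trainer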

Step 4: Load the Pre-trained DistilBERT Model

We’ll use the AutoModelForSequenceClassification class to load pre-trained DistilBERT with a sequence-classification head on top, which we will then fine-tune.

Python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
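Transformers will warn that the classification head is newly initialized; that is expected, since only the encoder weights are pre-trained and the head is learned during fine-tuning. Optionally, you could also attach human-readable label names (IMDb uses 0 = negative, 1 = positive), for example:

Python
# Optional: human-readable label names for nicer predictions
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2,
    id2label={0: "NEGATIVE", 1: "POSITIVE"},
    label2id={"NEGATIVE": 0, "POSITIVE": 1},
)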

Step 5: Train the Model

We’ll use the Trainer API from Hugging Face to simplify the training process.

Python
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
)

trainer.train()

Output:


TrainOutput(global_step=4689, training_loss=0.17010223817522782, metrics={'train_runtime': 4774.8481, 'train_samples_per_second': 15.707, 'train_steps_per_second': 0.982, 'total_flos': 9935054899200000.0, 'train_loss': 0.17010223817522782, 'epoch': 3.0})

Step 6: Evaluate the Model

After training, evaluate the model on the test dataset.

Python
results = trainer.evaluate()
print(f"Evaluation Results: {results}")

Output:

Evaluation Results: {'eval_loss': 0.28448769450187683, 'eval_runtime': 383.5344, 'eval_samples_per_second': 65.183, 'eval_steps_per_second': 4.075, 'epoch': 3.0}
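The default evaluation only reports the loss. If you also want accuracy, you could pass a compute_metrics function when building the Trainer (a minimal sketch using NumPy):

Python
import numpy as np

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": (predictions == labels).mean()}

# Pass it when constructing the Trainer: Trainer(..., compute_metrics=compute_metrics)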

Step 7: Make Predictions

You can use the trained model to make predictions on new data.

Python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

new_review = "This movie was fantastic! I loved every minute of it."
inputs = tokenizer(new_review, return_tensors="pt", truncation=True, padding=True, max_length=512)
inputs = {key: value.to(device) for key, value in inputs.items()}  # Move inputs to device

# Get model predictions
outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=-1)
print("Positive" if predictions.item() == 1 else "Negative")

Output:

Positive
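For quick one-off predictions you could also wrap the fine-tuned model in a text-classification pipeline instead of running the forward pass by hand (an alternative sketch):

Python
import torch
from transformers import pipeline

# Wrap the fine-tuned model and tokenizer in a ready-made inference pipeline
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer,
                      device=0 if torch.cuda.is_available() else -1)
print(classifier("This movie was fantastic! I loved every minute of it."))
# e.g. [{'label': 'LABEL_1', 'score': 0.99}]  (or 'POSITIVE' if you set id2label earlier)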

Advantages of DistilBERT

  • Speed and Efficiency: With fewer parameters (66 million vs. BERT’s 110 million), DistilBERT is faster to train and deploy, making it ideal for resource-constrained settings.
  • Scalability: Its smaller footprint allows it to scale across edge devices, democratizing access to advanced NLP.
  • Performance: Despite its size, DistilBERT delivers near-BERT-level accuracy, making it a practical choice without sacrificing too much quality.

Applications in NLP

DistilBERT shines in a variety of NLP tasks:

  • Sentiment Analysis: Businesses use it to quickly analyze customer reviews or social media posts.
  • Chatbots: Its efficiency powers responsive, context-aware conversational agents.
  • Text Summarization: Its sentence representations can power extractive summarization, condensing lengthy documents into concise summaries.
  • Named Entity Recognition (NER): It identifies key entities like names or locations in text with high accuracy.

Limitations of DistilBERT

While DistilBERT is impressive, it’s not without trade-offs. The reduction in size means it may struggle with extremely complex language tasks where BERT’s deeper architecture excels. For cutting-edge research or niche applications requiring peak performance, the original BERT or even larger models like RoBERTa might still be preferred.

DistilBERT offers an excellent balance between performance and efficiency, making it a go-to choice for many NLP applications. Whether you’re working on sentiment analysis, question answering, or any other NLP task, DistilBERT is a powerful tool that can help you achieve strong results without breaking the bank on computational resources.

