Understanding TF-IDF (Term Frequency-Inverse Document Frequency)

Last Updated : 07 Feb, 2025

TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used in natural language processing and information retrieval to evaluate the importance of a word in a document relative to a collection of documents (corpus).

Unlike simple word frequency, TF-IDF balances common and rare words to highlight the most meaningful terms.

How Does TF-IDF Work?

TF-IDF combines two components: Term Frequency (TF) and Inverse Document Frequency (IDF).

Term Frequency (TF): Measures how often a word appears in a document. If a term appears frequently in a document, it is likely relevant to the document's content. Formula:

TF(t, d) = (number of times term t appears in document d) / (total number of terms in document d)

Limitations of TF Alone:

  • TF does not account for the global importance of a term across the entire corpus.
  • Common words like "the" or "and" may have high TF scores but are not meaningful in distinguishing documents.

Inverse Document Frequency (IDF): Reduces the weight of words that are common across many documents while increasing the weight of rare words. If a term appears in fewer documents, it is more likely to be meaningful and specific. Formula:

IDF(t, D) = log(total number of documents in corpus D / number of documents containing term t)

  • The logarithm dampens the effect of very large or very small ratios, so the IDF score scales appropriately.
  • It also balances the impact of terms that appear in extremely few or extremely many documents.

Limitations of IDF Alone:

  • IDF does not consider how often a term appears within a specific document.
  • A term might be rare across the corpus (high IDF) but irrelevant in a specific document (low TF).

Converting Text into Vectors with TF-IDF: An Example

To better grasp how TF-IDF works, let’s walk through a detailed example. Imagine we have a corpus (a collection of documents) with three documents:

  1. Document 1: "The cat sat on the mat."
  2. Document 2: "The dog played in the park."
  3. Document 3: "Cats and dogs are great pets."

Our goal is to calculate the TF-IDF score for specific terms in these documents. Let’s focus on the word "cat" and see how TF-IDF evaluates its importance.

Step 1: Calculate Term Frequency (TF)

For Document 1:

  • The word "cat" appears 1 time.
  • The total number of terms in Document 1 is 6 ("the", "cat", "sat", "on", "the", "mat").
  • So, TF(cat,Document 1) = 1/6

For Document 2:

  • The word "cat" does not appear.
  • So, TF(cat,Document 2)=0.

For Document 3:

  • The word "cat" appears 1 time (as "cats").
  • The total number of terms in Document 3 is 6 ("cats", "and", "dogs", "are", "great", "pets").
  • So, TF(cat,Document 3)=1/6
  • In Document 1 and Document 3, the word "cat" has the same TF score. This means it appears with the same relative frequency in both documents.
  • In Document 2, the TF score is 0 because the word "cat" does not appear.
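The TF step above can be sketched in plain Python (a minimal illustration, not scikit-learn's implementation; it assumes whitespace tokenization, lowercasing, punctuation stripping, and that a trailing "s" is dropped so "cats" counts as "cat", matching the worked example):

```python
import string

def tokenize(doc):
    # lowercase, strip punctuation, split on whitespace
    words = doc.lower().translate(str.maketrans('', '', string.punctuation)).split()
    # crude normalization so "cats" counts as "cat" (an assumption of this example)
    return [w[:-1] if w.endswith('s') else w for w in words]

def tf(term, doc):
    # term frequency: occurrences of the term / total terms in the document
    tokens = tokenize(doc)
    return tokens.count(term) / len(tokens)

docs = ["The cat sat on the mat.",
        "The dog played in the park.",
        "Cats and dogs are great pets."]

print([round(tf("cat", d), 4) for d in docs])  # [0.1667, 0.0, 0.1667]
```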

Step 2: Calculate Inverse Document Frequency (IDF)

  • Total number of documents in the corpus (D): 3
  • Number of documents containing the term "cat": 2 (Document 1 and Document 3).

So, IDF(cat, D) = log(3/2) ≈ 0.176 (using a base-10 logarithm).

The IDF score for "cat" is relatively low. This indicates that the word "cat" is not very rare in the corpus—it appears in 2 out of 3 documents. If a term appeared in only 1 document, its IDF score would be higher, indicating greater uniqueness.
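The same IDF computation can be sketched with the standard library (a base-10 logarithm is used here to match the worked example; scikit-learn instead uses a smoothed natural logarithm, so its values differ):

```python
import math

def idf(term, tokenized_docs):
    # number of documents containing the term
    df = sum(1 for tokens in tokenized_docs if term in tokens)
    # base-10 log of (total documents / documents containing the term)
    return math.log10(len(tokenized_docs) / df)

corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "played", "in", "the", "park"],
          ["cat", "and", "dog", "are", "great", "pet"]]

print(round(idf("cat", corpus), 3))  # 0.176
```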

Step 3: Calculate TF-IDF

The TF-IDF score for "cat" is 0.029 in Document 1 and Document 3, and 0 in Document 2 that reflects both the frequency of the term in the document (TF) and its rarity across the corpus (IDF).

The-TF-IDF-score_
TF-IDF

A higher TF-IDF score means the term is more important in that specific document.
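Putting the two steps together reproduces the numbers above (again using a base-10 logarithm to match the worked example):

```python
import math

tf_cat = 1 / 6                  # "cat" appears once among 6 terms (Documents 1 and 3)
idf_cat = math.log10(3 / 2)     # 3 documents, 2 of them contain "cat"
tfidf_cat = tf_cat * idf_cat    # TF-IDF = TF x IDF

print(round(tfidf_cat, 3))  # 0.029
```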

Why is TF-IDF Useful in This Example?

1. Identifying Important Terms: TF-IDF helps us understand that "cat" is somewhat important in Document 1 and Document 3 but irrelevant in Document 2.

If we were building a search engine, this score would help rank Document 1 and Document 3 higher for a query like "cat".

2. Filtering Common Words: Words like "the" or "and" would have high TF scores but very low IDF scores because they appear in almost all documents. Their TF-IDF scores would be close to 0, indicating they are not meaningful.

3. Highlighting Unique Terms: If a term like "mat" appeared only in Document 1, it would have a higher IDF score, making its TF-IDF score more significant in that document.

Implementing TF-IDF in Sklearn with Python

In Python, tf-idf values can be computed using the TfidfVectorizer() class from scikit-learn's sklearn.feature_extraction.text module.

Syntax:

sklearn.feature_extraction.text.TfidfVectorizer(input='content')

Parameters:

  • input: Specifies how documents are passed: 'filename', 'file', or 'content' (the default, where each item of the iterable is the text itself).

Attributes:

  • vocabulary_: A dictionary mapping terms to feature indices.
  • idf_: The inverse document frequency vector learned from the corpus.

Methods:

  • fit_transform(): Learns the vocabulary and idf from the corpus and returns a sparse document-term matrix of tf-idf values.
  • get_feature_names_out(): Returns an array of feature names (it replaces get_feature_names(), which was removed in scikit-learn 1.2).

Step-by-step Approach:

  • Import the required module.

```python
# import required module
from sklearn.feature_extraction.text import TfidfVectorizer
```

  • Collect strings from the documents and create a corpus: a list holding the strings of documents d0, d1, and d2.

```python
# assign documents
d0 = 'Geeks for geeks'
d1 = 'Geeks'
d2 = 'r2j'

# merge documents into a single corpus
string = [d0, d1, d2]
```

  • Get tf-idf values from the fit_transform() method.

```python
# create object
tfidf = TfidfVectorizer()

# get tf-idf values
result = tfidf.fit_transform(string)
```

  • Display the idf values of the words present in the corpus.

```python
# get idf values
print('\nidf values:')
for ele1, ele2 in zip(tfidf.get_feature_names_out(), tfidf.idf_):
    print(ele1, ':', ele2)
```

  • Display the tf-idf values along with their indexing.

```python
# get indexing
print('\nWord indexes:')
print(tfidf.vocabulary_)

# display tf-idf values
print('\ntf-idf value:')
print(result)

# in matrix form
print('\ntf-idf values in matrix form:')
print(result.toarray())
```

The result variable holds each unique word together with its tf-idf value. The output can be summarized in the following table:

| Document | Word  | Document Index | Word Index | tf-idf value |
|----------|-------|----------------|------------|--------------|
| d0       | for   | 0              | 0          | 0.549        |
| d0       | geeks | 0              | 1          | 0.8355       |
| d1       | geeks | 1              | 1          | 1.000        |
| d2       | r2j   | 2              | 2          | 1.000        |
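The d0 row values can be reproduced by hand. This sketch follows scikit-learn's default formula, a smoothed natural-log idf, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each document vector, which is why these numbers differ from the plain log(N/df) formula used earlier:

```python
import math

n = 3                                        # number of documents in the corpus
idf_for = math.log((1 + n) / (1 + 1)) + 1    # "for" appears in 1 document
idf_geeks = math.log((1 + n) / (1 + 2)) + 1  # "geeks" appears in 2 documents

# raw tf-idf vector for d0 = 'Geeks for geeks': count(for)=1, count(geeks)=2
raw = [1 * idf_for, 2 * idf_geeks]
norm = math.sqrt(sum(x * x for x in raw))    # L2 norm of the document vector
d0 = [round(x / norm, 3) for x in raw]

print(d0)  # [0.549, 0.836]
```

The single-term documents d1 and d2 each normalize to a single weight of 1.000, matching the last two rows of the table.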

Below are some examples which depict how to compute tf-idf values of words from a corpus: 

Example 1: Below is the complete program based on the above approach:

```python
# import required module
from sklearn.feature_extraction.text import TfidfVectorizer

# assign documents
d0 = 'Geeks for geeks'
d1 = 'Geeks'
d2 = 'r2j'

# merge documents into a single corpus
string = [d0, d1, d2]

# create object
tfidf = TfidfVectorizer()

# get tf-idf values
result = tfidf.fit_transform(string)

# get idf values
print('\nidf values:')
for ele1, ele2 in zip(tfidf.get_feature_names_out(), tfidf.idf_):
    print(ele1, ':', ele2)

# get indexing
print('\nWord indexes:')
print(tfidf.vocabulary_)

# display tf-idf values
print('\ntf-idf value:')
print(result)

# in matrix form
print('\ntf-idf values in matrix form:')
print(result.toarray())
```

Output (abridged; the tf-idf values match the table above):

idf values:
for : 1.693...
geeks : 1.287...
r2j : 1.693...

Word indexes:
{'geeks': 1, 'for': 0, 'r2j': 2}

Example 2: Here, tf-idf values are computed from a corpus in which every document contains a different unique word.

```python
# import required module
from sklearn.feature_extraction.text import TfidfVectorizer

# assign documents
d0 = 'geek1'
d1 = 'geek2'
d2 = 'geek3'
d3 = 'geek4'

# merge documents into a single corpus
string = [d0, d1, d2, d3]

# create object
tfidf = TfidfVectorizer()

# get tf-idf values
result = tfidf.fit_transform(string)

# get indexing
print('\nWord indexes:')
print(tfidf.vocabulary_)

# display tf-idf values
print('\ntf-idf values:')
print(result)
```

Output: each document consists of a single unique word, so after L2 normalization every stored tf-idf value is 1.0.

Example 3: In this program, tf-idf values are computed from a corpus of identical documents.

```python
# import required module
from sklearn.feature_extraction.text import TfidfVectorizer

# assign documents
d0 = 'Geeks for geeks!'
d1 = 'Geeks for geeks!'

# merge documents into a single corpus
string = [d0, d1]

# create object
tfidf = TfidfVectorizer()

# get tf-idf values
result = tfidf.fit_transform(string)

# get indexing
print('\nWord indexes:')
print(tfidf.vocabulary_)

# display tf-idf values
print('\ntf-idf values:')
print(result)
```

Output: both documents produce identical vectors; because every term appears in every document, all idf values equal 1.0, and the tf-idf weights are determined by term frequency alone.

Example 4: Below is a program that computes the tf-idf value of a single word, "geeks", repeated across multiple documents.

```python
# import required module
from sklearn.feature_extraction.text import TfidfVectorizer

# assign corpus
string = ['Geeks geeks'] * 5

# create object
tfidf = TfidfVectorizer()

# get tf-idf values
result = tfidf.fit_transform(string)

# get indexing
print('\nWord indexes:')
print(tfidf.vocabulary_)

# display tf-idf values
print('\ntf-idf values:')
print(result)
```

Output: "geeks" is the only term in the vocabulary and appears in every document, so each document's single tf-idf weight normalizes to 1.0.


Author: riturajsaha