TF-IDF Representations in TensorFlow
Last Updated : 12 Feb, 2025
Text data is one of the most common forms of unstructured data, and converting it into a numerical representation is essential for machine learning models.
Term Frequency-Inverse Document Frequency (TF-IDF) is a widely used text vectorization technique that represents text numerically while capturing how important each word is. It scores a word's importance in a document relative to a collection (corpus) of documents, and combines two components:
- Term Frequency (TF): Measures how often a word appears in a document.

TF(w) = \frac{\text{Number of times word w appears in the document}}{\text{Total number of words in the document}}

- Inverse Document Frequency (IDF): Measures the significance of a word across multiple documents.

IDF(w) = \log \left( \frac{\text{Total number of documents}}{\text{Number of documents containing the word w}} + 1 \right)
The final TF-IDF score is calculated as:
TF-IDF(w) = TF(w) \times IDF(w)
Words that appear frequently in a document but are rare across the corpus will have higher TF-IDF scores.
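Before turning to TensorFlow, the formulas above can be sketched directly in plain Python. The toy corpus below is purely for illustration:

```python
import math

# Toy corpus of tokenized documents (illustrative only)
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "dogs and cats can be friends".split(),
]

def term_freq(word, doc):
    # TF(w): occurrences of the word / total words in the document
    return doc.count(word) / len(doc)

def inverse_doc_freq(word, docs):
    # IDF(w): log(total documents / documents containing the word + 1)
    df = sum(1 for d in docs if word in d)
    return math.log(len(docs) / df + 1)

def tf_idf(word, doc, docs):
    return term_freq(word, doc) * inverse_doc_freq(word, docs)

# "the" occurs in two of the three documents, while "cat" occurs in
# only one, so "cat" receives the larger IDF weight.
print(tf_idf("the", docs[0], docs))
print(tf_idf("cat", docs[0], docs))
```

This mirrors the definitions exactly: a term frequent within one document but rare across the corpus ends up with a higher score.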
Implementing TF-IDF in TensorFlow
TensorFlow provides efficient ways to handle text preprocessing, including TF-IDF representation. We will use the tf.keras.layers.TextVectorization layer to compute TF-IDF features.
Step 1: Import Required Libraries
```python
import tensorflow as tf
import numpy as np
```
Step 2: Prepare the Dataset
```python
corpus = [
    "TensorFlow is an open-source machine learning framework.",
    "Machine learning models improve by training on data.",
    "Deep learning is a subset of machine learning.",
    "TF-IDF helps in text vectorization for NLP tasks."
]
```
Step 3: Create a TextVectorization Layer with TF-IDF Mode
TensorFlow’s TextVectorization layer can be used to automatically compute TF-IDF values.
```python
vectorizer = tf.keras.layers.TextVectorization(
    output_mode="tf_idf",
    ngrams=None
)

# Adapting the vectorizer to the corpus
vectorizer.adapt(corpus)
```
Step 4: Convert Text to TF-IDF Representation
```python
tfidf_matrix = vectorizer(corpus)
tfidf_matrix_np = tfidf_matrix.numpy()

# Print the TF-IDF matrix
print(tfidf_matrix_np)
```
Running this prints the TF-IDF matrix as a NumPy array. Each row corresponds to a document in the corpus, and each column to a token in the learned vocabulary; the values indicate the importance of each word within each document.
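To see which column corresponds to which token, you can query the fitted layer's vocabulary with `get_vocabulary()`. A short sketch (the exact vocabulary order depends on the adapted corpus):

```python
import tensorflow as tf

corpus = [
    "TensorFlow is an open-source machine learning framework.",
    "Machine learning models improve by training on data.",
    "Deep learning is a subset of machine learning.",
    "TF-IDF helps in text vectorization for NLP tasks."
]

vectorizer = tf.keras.layers.TextVectorization(output_mode="tf_idf")
vectorizer.adapt(corpus)

vocab = vectorizer.get_vocabulary()        # token for each column
tfidf_matrix = vectorizer(corpus).numpy()  # shape: (num_docs, vocab_size)

print(tfidf_matrix.shape)

# Pair each token with its score in the first document
for token, score in zip(vocab, tfidf_matrix[0]):
    if score > 0:
        print(f"{token}: {score:.3f}")
```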
Advantages of Using TensorFlow for TF-IDF
- Scalability: TensorFlow handles large text datasets efficiently using GPU acceleration.
- Ease of Integration: Works seamlessly with other TensorFlow components like tf.data pipelines.
- Customization: Allows users to apply preprocessing (lowercasing, tokenization) and integrate TF-IDF with deep learning models.
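As a sketch of that integration, the adapted vectorizer can be mapped over a tf.data pipeline feeding a small Keras classifier. The binary labels here are hypothetical, added only to make the example runnable:

```python
import tensorflow as tf

corpus = [
    "TensorFlow is an open-source machine learning framework.",
    "Machine learning models improve by training on data.",
    "Deep learning is a subset of machine learning.",
    "TF-IDF helps in text vectorization for NLP tasks."
]
# Hypothetical binary labels, purely for illustration
labels = [0, 0, 0, 1]

vectorizer = tf.keras.layers.TextVectorization(output_mode="tf_idf")
vectorizer.adapt(corpus)

# Map raw strings to TF-IDF vectors inside the tf.data pipeline
dataset = (
    tf.data.Dataset.from_tensor_slices((corpus, labels))
    .batch(2)
    .map(lambda text, label: (vectorizer(text), label))
)

# A tiny classifier over the TF-IDF features
model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(dataset, epochs=2, verbose=0)
```

Applying the vectorizer inside `Dataset.map` keeps preprocessing on the input pipeline, so the model itself only ever sees dense TF-IDF vectors.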
TF-IDF is a fundamental technique for representing text in a way that emphasizes important words. TensorFlow’s TextVectorization layer simplifies TF-IDF computation, making it a great choice for NLP applications. With this approach, you can efficiently preprocess text and feed it into machine learning models for tasks like classification, clustering, and information retrieval.