TF-IDF Representations in TensorFlow
Last Updated : 12 Feb, 2025
Text data is one of the most common forms of unstructured data, and converting it into a numerical representation is essential for machine learning models.
Term Frequency-Inverse Document Frequency (TF-IDF) is a widely used text vectorization technique that represents text numerically while capturing how important each word is. It scores a word's importance in a document relative to a collection (corpus) of documents, and combines two components:
- Term Frequency (TF): Measures how often a word appears in a document.

TF(w) = \frac{\text{Number of times word w appears in the document}}{\text{Total number of words in the document}}

- Inverse Document Frequency (IDF): Measures the significance of a word across multiple documents.

IDF(w) = \log \left( \frac{\text{Total number of documents}}{\text{Number of documents containing the word w}} + 1 \right)
The final TF-IDF score is calculated as:
TF-IDF(w) = TF(w) \times IDF(w)
Words that appear frequently in a document but are rare across the corpus will have higher TF-IDF scores.
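Before turning to TensorFlow, the formulas above can be sketched directly in plain Python. The toy corpus below is purely for illustration:

```python
import math

# Toy corpus of tokenized documents (illustrative only)
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "dogs and cats can be friends".split(),
]

def term_freq(word, doc):
    # TF(w): occurrences of the word / total words in the document
    return doc.count(word) / len(doc)

def inverse_doc_freq(word, docs):
    # IDF(w): log(total documents / documents containing the word + 1)
    df = sum(1 for d in docs if word in d)
    return math.log(len(docs) / df + 1)

def tf_idf(word, doc, docs):
    return term_freq(word, doc) * inverse_doc_freq(word, docs)

# "the" occurs in two of the three documents, while "cat" occurs in
# only one, so "cat" receives the larger IDF weight.
print(tf_idf("the", docs[0], docs))
print(tf_idf("cat", docs[0], docs))
```

This mirrors the definitions exactly: a term frequent within one document but rare across the corpus ends up with a higher score.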
Implementing TF-IDF in TensorFlow
TensorFlow provides efficient ways to handle text preprocessing, including TF-IDF representation. We will use the tf.keras.layers.TextVectorization layer to compute TF-IDF features.
Step 1: Import Required Libraries
```python
import tensorflow as tf
import numpy as np
```
Step 2: Prepare the Dataset
```python
corpus = [
    "TensorFlow is an open-source machine learning framework.",
    "Machine learning models improve by training on data.",
    "Deep learning is a subset of machine learning.",
    "TF-IDF helps in text vectorization for NLP tasks."
]
```
Step 3: Create a TextVectorization Layer with TF-IDF Mode
TensorFlow’s TextVectorization layer can be used to automatically compute TF-IDF values.
```python
vectorizer = tf.keras.layers.TextVectorization(
    output_mode="tf_idf",
    ngrams=None
)

# Adapting the vectorizer to the corpus
vectorizer.adapt(corpus)
```
Step 4: Convert Text to TF-IDF Representation
```python
tfidf_matrix = vectorizer(corpus)
tfidf_matrix_np = tfidf_matrix.numpy()

# Print the TF-IDF matrix
print(tfidf_matrix_np)
```
Running this prints the TF-IDF matrix as a NumPy array. Each row corresponds to a document in the corpus, and each column to a token in the learned vocabulary; the values indicate the importance of each word within each document.
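To see which column corresponds to which token, you can query the fitted layer's vocabulary with `get_vocabulary()`. A short sketch (the exact vocabulary order depends on the adapted corpus):

```python
import tensorflow as tf

corpus = [
    "TensorFlow is an open-source machine learning framework.",
    "Machine learning models improve by training on data.",
    "Deep learning is a subset of machine learning.",
    "TF-IDF helps in text vectorization for NLP tasks."
]

vectorizer = tf.keras.layers.TextVectorization(output_mode="tf_idf")
vectorizer.adapt(corpus)

vocab = vectorizer.get_vocabulary()        # token for each column
tfidf_matrix = vectorizer(corpus).numpy()  # shape: (num_docs, vocab_size)

print(tfidf_matrix.shape)

# Pair each token with its score in the first document
for token, score in zip(vocab, tfidf_matrix[0]):
    if score > 0:
        print(f"{token}: {score:.3f}")
```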
Advantages of Using TensorFlow for TF-IDF
- Scalability: TensorFlow handles large text datasets efficiently using GPU acceleration.
- Ease of Integration: Works seamlessly with other TensorFlow components like tf.data pipelines.
- Customization: Allows users to apply preprocessing (lowercasing, tokenization) and integrate TF-IDF with deep learning models.
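As a sketch of that integration, the adapted vectorizer can be mapped over a tf.data pipeline feeding a small Keras classifier. The binary labels here are hypothetical, added only to make the example runnable:

```python
import tensorflow as tf

corpus = [
    "TensorFlow is an open-source machine learning framework.",
    "Machine learning models improve by training on data.",
    "Deep learning is a subset of machine learning.",
    "TF-IDF helps in text vectorization for NLP tasks."
]
# Hypothetical binary labels, purely for illustration
labels = [0, 0, 0, 1]

vectorizer = tf.keras.layers.TextVectorization(output_mode="tf_idf")
vectorizer.adapt(corpus)

# Map raw strings to TF-IDF vectors inside the tf.data pipeline
dataset = (
    tf.data.Dataset.from_tensor_slices((corpus, labels))
    .batch(2)
    .map(lambda text, label: (vectorizer(text), label))
)

# A tiny classifier over the TF-IDF features
model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(dataset, epochs=2, verbose=0)
```

Applying the vectorizer inside `Dataset.map` keeps preprocessing on the input pipeline, so the model itself only ever sees dense TF-IDF vectors.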
TF-IDF is a fundamental technique for representing text in a way that emphasizes important words. TensorFlow’s TextVectorization layer simplifies TF-IDF computation, making it a great choice for NLP applications. With this approach, you can efficiently preprocess text and feed it into machine learning models for tasks like classification, clustering, and information retrieval.