Text Preprocessing in Python

Last Updated : 26 Apr, 2025

Text preprocessing is a key part of Natural Language Processing (NLP). It cleans raw text and converts it into a format suitable for analysis and machine learning. In this article, we perform text preprocessing with several Python libraries and techniques, focusing on the NLTK (Natural Language Toolkit) library.

1. Importing Libraries

We import nltk, string, re (Python's regular expression module) and inflect.

Python
import nltk
import string
import re
import inflect
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

2. Convert to Lowercase

We lowercase the text to reduce the size of the vocabulary of our text data.

Python
def text_lowercase(text):
    # Lowercase every character so "Hey" and "hey" become the same token
    return text.lower()

input_str = "Hey, did you know that the summer break is coming? Amazing right !! It's only 5 more days !!"
text_lowercase(input_str)

Output: 

"hey, did you know that the summer break is coming? amazing right !! it's only 5 more days !!"
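For text that contains non-English characters, `str.lower()` can leave some case distinctions in place. The sketch below assumes Python's built-in `str.casefold()` suits the task; the helper name `text_casefold` is only an illustration and does not come from the article.

```python
def text_casefold(text):
    # casefold() is a more aggressive form of lower() meant for
    # caseless matching; for example it maps the German "ß" to "ss"
    return text.casefold()

print(text_casefold("Hey, Did You Know?"))    # hey, did you know?
print("Straße".lower() == "strasse")          # False
print(text_casefold("Straße") == "strasse")   # True
```

For plain English text the two methods give the same result, so `lower()` remains a reasonable default.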

3. Removing Numbers

We can either remove numbers or convert the numbers into their textual representations. To remove the numbers we can use regular expressions.

Python
def remove_numbers(text):
    # Delete every run of one or more digits
    result = re.sub(r'\d+', '', text)
    return result

input_str = "There are 3 balls in this bag, and 12 in the other one."
remove_numbers(input_str)

Output: 

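'There are  balls in this bag, and  in the other one.'

Note that the spaces on either side of each removed number remain, so the result contains double spaces. The whitespace step in section 6 collapses them.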
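Removing digits leaves gaps where the numbers used to be. The variant below is a minimal sketch that also treats simple decimals such as "3.5" as one number and tidies the leftover spaces; the function name `remove_numbers_clean` and the decimal pattern are assumptions added for illustration.

```python
import re

def remove_numbers_clean(text):
    # Remove integers and simple decimals such as "3" or "3.5"
    without_numbers = re.sub(r"\d+(?:\.\d+)?", "", text)
    # Collapse the gaps left behind into single spaces
    return " ".join(without_numbers.split())

print(remove_numbers_clean("There are 3 balls in this bag, and 12 in the other one."))
# There are balls in this bag, and in the other one.
```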

4. Converting Numerical Values

We can also convert the numbers into words. This can be done by using the inflect library.

Python
p = inflect.engine()

def convert_number(text):
    # Split on whitespace and convert each all-digit token to words
    temp_str = text.split()
    new_string = []

    for word in temp_str:
        if word.isdigit():
            temp = p.number_to_words(word)
            new_string.append(temp)
        else:
            new_string.append(word)

    temp_str = ' '.join(new_string)
    return temp_str

input_str = 'There are 3 balls in this bag, and 12 in the other one.'
convert_number(input_str)

Output:

'There are three balls in this bag, and twelve in the other one.'
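The function above only converts tokens that consist entirely of digits, so a number with punctuation attached (such as "5,") is skipped. One possible alternative is to let `re.sub` call a conversion function for every run of digits. In the sketch below, `small_number_to_words` is a hypothetical stand-in for inflect that covers 0 to 12 only, so the example runs without extra packages.

```python
import re

# Hypothetical stand-in for inflect's number_to_words, covering 0-12 only
SMALL_NUMBERS = ["zero", "one", "two", "three", "four", "five", "six",
                 "seven", "eight", "nine", "ten", "eleven", "twelve"]

def small_number_to_words(digits):
    value = int(digits)
    # Leave numbers outside the table unchanged
    return SMALL_NUMBERS[value] if value < len(SMALL_NUMBERS) else digits

def convert_numbers(text, to_words=small_number_to_words):
    # re.sub calls the function for each run of digits, wherever it occurs
    return re.sub(r"\d+", lambda match: to_words(match.group()), text)

print(convert_numbers("It's only 5, maybe 12 more days, not 40."))
# It's only five, maybe twelve more days, not 40.
```

With inflect installed, passing `p.number_to_words` as the `to_words` argument should give full coverage, assuming it accepts digit strings as it does in the article's code.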

5. Removing Punctuation

We remove punctuation so that the same word does not appear in several forms. For example, if punctuation is left in place, "been.", "been," and "been!" are treated as three different tokens.

Python
def remove_punctuation(text):
    # Build a translation table that deletes every ASCII punctuation character
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

input_str = "Hey, did you know that the summer break is coming? Amazing right !! It's only 5 more days !!"
remove_punctuation(input_str)

Output:

'Hey did you know that the summer break is coming Amazing right  Its only 5 more days '
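Deleting punctuation outright fuses hyphenated words: "state-of-the-art" becomes "stateoftheart". A minimal sketch of one alternative replaces each punctuation character with a space and then collapses the extra spaces. The name `replace_punctuation` is an assumption for illustration, and the trade-off is that contractions split ("It's" becomes "It s").

```python
import string

# Map every ASCII punctuation character to a space
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def replace_punctuation(text):
    spaced = text.translate(PUNCT_TO_SPACE)
    # Collapse the runs of spaces introduced by the replacement
    return " ".join(spaced.split())

print(replace_punctuation("A state-of-the-art model, really!"))
# A state of the art model really
```

Note that `string.punctuation` covers ASCII characters only, so typographic quotes and dashes pass through both versions unchanged.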

6. Removing Whitespace

We can combine the split and join functions to strip leading and trailing whitespace and collapse each run of whitespace into a single space.

Python
def remove_whitespace(text):
    # split() with no arguments splits on any run of whitespace
    return " ".join(text.split())

input_str = "we don't need   the given questions"
remove_whitespace(input_str)

Output:

"we don't need the given questions"
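Because `split()` treats newlines as whitespace, the function above also merges multiple lines into one. When line boundaries matter, a possible variant normalizes each line separately; the helper name `normalize_whitespace_keep_lines` is an assumption for illustration.

```python
def normalize_whitespace_keep_lines(text):
    # Clean each line on its own so the line breaks survive
    lines = (" ".join(line.split()) for line in text.splitlines())
    return "\n".join(lines)

print(normalize_whitespace_keep_lines("we   don't\tneed \n  the given   questions"))
# we don't need
# the given questions
```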

7. Removing Stopwords

Stopwords are words that contribute little to the meaning of a sentence, so they can often be removed. The NLTK library provides a list of stopwords that we can use to filter our text. The function below tokenizes the text and drops every token found in the English list.

Python
nltk.download('punkt_tab')   # tokenizer data used by word_tokenize
nltk.download('stopwords')   # stopword lists used by stopwords.words

def remove_stopwords(text):
    stop_words = set(stopwords.words("english"))
    word_tokens = word_tokenize(text)
    filtered_text = [word for word in word_tokens if word not in stop_words]
    return filtered_text

example_text = "This is a sample sentence and we are going to remove the stopwords from this."
remove_stopwords(example_text)

Output:

['This', 'sample', 'sentence', 'going', 'remove', 'stopwords', '.']
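The output keeps "This" because the NLTK list is lowercase and the membership test is case-sensitive. A minimal sketch of a case-insensitive filter follows. To stay runnable without NLTK data it uses a hypothetical mini stopword set and a rough regex tokenizer; both are stand-ins for illustration, not the NLTK resources.

```python
import re

# Hypothetical mini stopword list; NLTK's English list is much longer
STOP_WORDS = {"this", "is", "a", "and", "we", "are", "to", "the", "from"}

def simple_tokenize(text):
    # Words or single punctuation marks; a rough stand-in for word_tokenize
    return re.findall(r"\w+|[^\w\s]", text)

def remove_stopwords_ci(text, stop_words=STOP_WORDS):
    # Compare the lowercased token so "This" matches "this"
    return [tok for tok in simple_tokenize(text) if tok.lower() not in stop_words]

print(remove_stopwords_ci(
    "This is a sample sentence and we are going to remove the stopwords from this."))
# ['sample', 'sentence', 'going', 'remove', 'stopwords', '.']
```

Lowercasing the whole text first, as in section 2, achieves the same effect with the article's original function.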

8. Applying Stemming

Stemming is the process of reducing a word to its root form. The stem, or root, is the part of a word to which affixes such as -ed, -ize, -s or de- are attached. A stemmer produces it by stripping affixes with heuristic rules, so the result is not always a dictionary word.

Example: 

books —> book
looked —> look
denied —> deni
flies —> fli

Three stemming algorithms are in common use: the Porter Stemmer, the Snowball Stemmer and the Lancaster Stemmer. The Porter Stemmer is the most common among them.

Python
stemmer = PorterStemmer()

def stem_words(text):
    # Tokenize first, then stem each token independently
    word_tokens = word_tokenize(text)
    stems = [stemmer.stem(word) for word in word_tokens]
    return stems

text = 'data science uses scientific methods algorithms and many types of processes'
stem_words(text)

Output:

['data',
 'scienc',
 'use',
 'scientif',
 'method',
 'algorithm',
 'and',
 'mani',
 'type',
 'of',
 'process']
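To make the idea of suffix stripping concrete, here is a toy stemmer built from four hand-written rules. It is only an illustration that reproduces the four examples listed earlier; it is not the Porter algorithm, and the rule list and the `min_stem` parameter are assumptions.

```python
# Toy suffix rules, checked in order; NOT the Porter algorithm
SUFFIX_RULES = [("ied", "i"), ("ies", "i"), ("ed", ""), ("s", "")]

def toy_stem(word, min_stem=2):
    for suffix, replacement in SUFFIX_RULES:
        stem = word[: len(word) - len(suffix)]
        # Only strip the suffix when enough of the word remains
        if word.endswith(suffix) and len(stem) >= min_stem:
            return stem + replacement
    return word

for w in ["books", "looked", "denied", "flies", "is"]:
    print(w, "->", toy_stem(w))
# books -> book
# looked -> look
# denied -> deni
# flies -> fli
# is -> is
```

Real stemmers apply many more rules and conditions, which is why a library implementation such as `PorterStemmer` is preferable in practice.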

9. Applying Lemmatization

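Lemmatization is an NLP technique that reduces a word to its dictionary form, called the lemma. Unlike a stem, a lemma is a valid dictionary word. This helps in tasks such as text analysis and search because it lets you match words that are related but appear in different forms. By default, NLTK's WordNetLemmatizer treats every word as a noun unless a part-of-speech tag is passed through its pos argument, which is why "uses" becomes "us" in the output below.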

Python
nltk.download('wordnet')   # WordNet data used by the lemmatizer
lemmatizer = WordNetLemmatizer()

def lemma_words(text):
    # Tokenize first, then lemmatize each token (as a noun by default)
    word_tokens = word_tokenize(text)
    lemmas = [lemmatizer.lemmatize(word) for word in word_tokens]
    return lemmas

input_str = "data science uses scientific methods algorithms and many types of processes"
lemma_words(input_str)

Output:

['data',
 'science',
 'us',
 'scientific',
 'method',
 'algorithm',
 'and',
 'many',
 'type',
 'of',
 'process']
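The lemma of a word can depend on its part of speech. The toy lookup below illustrates that idea with a hypothetical four-entry lexicon; it is not WordNet, and `toy_lemmatize` is a made-up helper that returns the word unchanged when it has no entry.

```python
# Hypothetical mini lexicon keyed by (word, part of speech)
LEMMAS = {
    ("saw", "v"): "see",        # "I saw it" -> see
    ("saw", "n"): "saw",        # "a rusty saw" -> saw
    ("meeting", "v"): "meet",   # "we are meeting" -> meet
    ("meeting", "n"): "meeting" # "a long meeting" -> meeting
}

def toy_lemmatize(word, pos="n"):
    # Fall back to the word itself when the lexicon has no entry
    return LEMMAS.get((word, pos), word)

print(toy_lemmatize("saw"))            # saw
print(toy_lemmatize("saw", pos="v"))   # see
print(toy_lemmatize("meeting", "v"))   # meet
```

This is why pairing a lemmatizer with part-of-speech tags usually gives better lemmas than relying on the default noun treatment.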

In this guide we covered several NLP text preprocessing techniques that can be used when building NLP-based applications and projects.
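The individual steps are usually chained into one function. The sketch below combines only the steps that need the standard library (lowercasing, number removal, punctuation removal and whitespace cleanup); the name `preprocess` and the order of the steps are assumptions, and the right order depends on the task.

```python
import re
import string

def preprocess(text):
    text = text.lower()                        # 1. lowercase
    text = re.sub(r"\d+", "", text)            # 2. remove numbers
    text = text.translate(                     # 3. remove punctuation
        str.maketrans("", "", string.punctuation))
    return " ".join(text.split())              # 4. collapse whitespace

print(preprocess("Hey, did you know that the summer break is coming? "
                 "Amazing right !! It's only 5 more days !!"))
# hey did you know that the summer break is coming amazing right its only more days
```

Stopword removal, stemming or lemmatization can then be applied to the tokens of the cleaned string.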

Must Read:

  • Natural Language Processing (NLP) Tutorial
  • Phases of Natural Language Processing (NLP)
  • POS(Parts-Of-Speech) Tagging in NLP


jacobperalta