Processing text using NLP | Basics

Last Updated : 22 Sep, 2022

In this article, we walk through the steps used to preprocess text data before it is used to train a machine learning model.

Importing Libraries

The following libraries are used in this article. NLTK and BeautifulSoup must be installed in the current working environment, as must TextBlob, which is used later for lemmatization; urllib ships with Python's standard library:

  • NLTK library: The Natural Language Toolkit, a Python library with modules and data for processing natural-language text such as English.
  • urllib library: Python's standard-library package for working with URLs.
  • BeautifulSoup library: A library for extracting data from HTML and XML documents.

Python3




import nltk
from bs4 import BeautifulSoup
from urllib.request import urlopen
 
 

After importing the libraries, we need to obtain the text to process. It may already be available as a Python string, or it may have to be read from a file or web page first.

Extracting Data

For this article, we use web scraping to read a web page, then call the get_text() function to convert its content to str format.

Python3




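# read the page as raw bytes
raw = urlopen("https://www.w3.org/TR/PNG/iso_8859-1.txt").read()

# parse with an explicit parser, then extract the text as a str
raw1 = BeautifulSoup(raw, "html.parser")
raw2 = raw1.get_text()
raw2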
 
 

Output: [image "input" showing the extracted text; not reproduced here]
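Since urlopen(...).read() returns bytes rather than str, a plain-text resource can also be decoded directly. The following is a minimal offline sketch of that idea: the helper name bytes_to_text and the sample payload are hypothetical, and the iso-8859-1 encoding is an assumption suggested only by the file name in the URL.

```python
# Stand-in for the bytes that urlopen(...).read() would return (hypothetical sample)
raw_bytes = "Caf\u00e9 prices: 10 \u00a3".encode("iso-8859-1")


def bytes_to_text(data, encoding="iso-8859-1"):
    """Decode a bytes payload into str; the default encoding is an assumption."""
    return data.decode(encoding)


text = bytes_to_text(raw_bytes)
print(type(text).__name__)  # str
print(text)
```

Decoding with the wrong encoding can silently produce garbled characters, so it is worth checking what encoding the server actually declares before relying on a default.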

 

Data Preprocessing

Once the text has been extracted, it is ready to be preprocessed. Follow these steps:

1. Removing Punctuation and Numerals

Python3




# deletion of punctuation and numerical values
import re

def punc(raw2):
    # replace every character that is not an ASCII letter with a space
    raw2 = re.sub('[^a-zA-Z]', ' ', raw2)
    return raw2
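To see what this substitution does in practice, here is a small sketch on a made-up sentence. The function name strip_non_letters and the sample string are illustrative assumptions; the regular expression is the same one used above.

```python
import re


def strip_non_letters(text):
    """Replace every character outside a-z / A-Z with a single space."""
    return re.sub('[^a-zA-Z]', ' ', text)


sample = "PNG 1.2, released in 1999!"
cleaned = strip_non_letters(sample)
print(repr(cleaned))  # digits and punctuation become spaces, one for one

# Optionally collapse the resulting runs of spaces
collapsed = ' '.join(cleaned.split())
print(collapsed)  # PNG released in
```

Note that the pattern keeps only unaccented ASCII letters, so a word such as "café" loses its final letter. For text in other languages a different character class would be needed.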
 
 

2. Creating Tokens

Python3




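# extracting tokens
# word_tokenize needs NLTK's Punkt data: run nltk.download("punkt") once
# (recent NLTK releases may ask for "punkt_tab" instead)
def token(raw2):
    tokens = nltk.word_tokenize(raw2)
    return tokens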
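Once punctuation and digits have been replaced by spaces, the text consists only of letters and whitespace, so a plain whitespace split gives a rough picture of what tokenization produces. This is a dependency-free sketch, not a replacement for nltk.word_tokenize: the name simple_tokens is hypothetical, and NLTK's tokenizer applies additional rules, so results can differ on some words.

```python
def simple_tokens(text):
    """Split on runs of whitespace; a rough stand-in for nltk.word_tokenize."""
    return text.split()


cleaned = "Portable Network Graphics   PNG  specification"
print(simple_tokens(cleaned))
# ['Portable', 'Network', 'Graphics', 'PNG', 'specification']
```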
 
 

3. Removing Stopwords

Python3




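# lowercase the letters
# removing stopwords
from nltk.corpus import stopwords

def remove_(tokens):
    # build the stopword set once; requires nltk.download("stopwords")
    stop_words = set(stopwords.words("english"))
    # lowercase first so capitalized stopwords such as "The" are also dropped
    final = [word.lower() for word in tokens if word.lower() not in stop_words]
    return final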
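NLTK's English stopword list is written in lowercase, so the order of lowercasing and filtering matters. The sketch below illustrates the difference with a tiny hand-written stopword set; STOP_WORDS and both function names are assumptions made for illustration, and the real NLTK list is much longer.

```python
# Tiny illustrative stopword set (an assumption; NLTK's English list is much longer)
STOP_WORDS = {"the", "is", "a", "of", "and"}


def filter_then_lower(tokens):
    """Compare tokens to the stopword set before lowercasing them."""
    return [t.lower() for t in tokens if t not in STOP_WORDS]


def lower_then_filter(tokens):
    """Lowercase each token before comparing it to the stopword set."""
    return [t.lower() for t in tokens if t.lower() not in STOP_WORDS]


tokens = ["The", "format", "is", "a", "standard"]
print(filter_then_lower(tokens))  # ['the', 'format', 'standard']
print(lower_then_filter(tokens))  # ['format', 'standard']
```

Comparing before lowercasing lets a capitalized stopword such as "The" slip through, which is why the lowercase comparison is the safer choice.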
 
 

4. Lemmatization

Python3




# Lemmatizing
from textblob import TextBlob

def lemma(final):
    # join the tokens into one string so TextBlob can parse them
    str1 = ' '.join(final)
    s = TextBlob(str1)
    # lemmatize each word and return the list so it can be joined later
    lemmatized = [w.lemmatize() for w in s.words]
    return lemmatized
 
 

5. Joining the final tokens

Python3




# Joining the final results
def join_(final):
    review = ' '.join(final)
    return review
 
 

To run the functions above in sequence, use the following code:

Python3




# Calling all the functions
raw2 = punc(raw2)
tokens = token(raw2)
final = remove_(tokens)
final = lemma(final)
ans = join_(final)
ans
 
 

Output: [image "output" showing the preprocessed text; not reproduced here]
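The calling sequence above threads one value through a fixed series of functions. As a design sketch, the same idea can be expressed as a list of steps applied in order. Everything here is an assumption for illustration: run_pipeline is a hypothetical helper, and the steps are toy stand-ins that avoid the NLTK and TextBlob dependencies (lemmatization is left out).

```python
import re
from functools import reduce


def run_pipeline(value, steps):
    """Feed value through each step in order, passing each result to the next."""
    return reduce(lambda acc, step: step(acc), steps, value)


# Toy stand-ins for the article's steps (assumptions, not the NLTK/TextBlob versions)
steps = [
    lambda text: re.sub('[^a-zA-Z]', ' ', text),   # drop punctuation and digits
    str.split,                                     # whitespace tokenization
    lambda tokens: [t.lower() for t in tokens
                    if t.lower() not in {"the", "is", "a"}],  # tiny stopword set
    ' '.join,                                      # join tokens back into a string
]

result = run_pipeline("The PNG format, v1.2, is a standard!", steps)
print(result)  # png format v standard
```

Keeping the steps in a list makes it easy to reorder, remove, or swap a step (for example, replacing the whitespace split with a real tokenizer) without touching the calling code.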

 



noob_coders_ka_baap
