Skip to content
geeksforgeeks
  • Courses
    • DSA to Development
    • Get IBM Certification
    • Newly Launched!
      • Master Django Framework
      • Become AWS Certified
    • For Working Professionals
      • Interview 101: DSA & System Design
      • Data Science Training Program
      • JAVA Backend Development (Live)
      • DevOps Engineering (LIVE)
      • Data Structures & Algorithms in Python
    • For Students
      • Placement Preparation Course
      • Data Science (Live)
      • Data Structure & Algorithm-Self Paced (C++/JAVA)
      • Master Competitive Programming (Live)
      • Full Stack Development with React & Node JS (Live)
    • Full Stack Development
    • Data Science Program
    • All Courses
  • Tutorials
    • Data Structures & Algorithms
    • ML & Data Science
    • Interview Corner
    • Programming Languages
    • Web Development
    • CS Subjects
    • DevOps And Linux
    • School Learning
  • Practice
    • Build your AI Agent
    • GfG 160
    • Problem of the Day
    • Practice Coding Problems
    • GfG SDE Sheet
  • Contests
    • Accenture Hackathon (Ending Soon!)
    • GfG Weekly [Rated Contest]
    • Job-A-Thon Hiring Challenge
    • All Contests and Events
  • Beautiful Soup
  • Selenium
  • Scrapy
  • urllib
  • Request
  • open cv
  • Data analysis
  • Machine learning
  • NLP
  • Deep learning
  • Data Science
  • Interview question
  • ML math
  • ML Projects
  • ML interview
  • DL interview
Open In App
Next Article:
GET and POST Requests Using Python
Next article icon

Clean Web Scraping Data Using clean-text in Python

Last Updated : 31 Mar, 2022
Comments
Improve
Suggest changes
Like Article
Like
Report

If you like to play with API’s or like to scrape data from various websites, you must’ve come around random annoying text, numbers, keywords that come around with data. Sometimes it can be really complicating and frustrating to clean scraped data to obtain the actual data that we want. 

In this article, we are going to explore a python library called clean-text which will help you to clean your scraped data in a matter of seconds without writing any fancy, long code. Let’s begin

Installation

Use the following command

pip install clean-text

Note: CleanText package requires Python 3.7 or greater.

Syntax

cleantext.clean_words( text , {operations})

  • text: string
  • operations: mentions below 

Different cleantext operations:

The clean-text function provides a range of arguments that specifies how to clean the given raw text input and return the cleaned text in the form of a string. Here is the list of arguments that you can use to clean your required data.

  • fix_unicode: Fix Unicode errors, takes the value as True or False
  • to_ascii: Translate to ASCII representation, takes the value as True or False
  • lower: Convert input to lowercase, takes the value as True or False            
  • no_line_breaks: Remove all the line breaks
  • no_urls: replace all URLs with a special token
  • no_emails: replace all email addresses with a special token
  • no_phone_numbers: replace all phone numbers with a special token
  • no_numbers: replace all numbers with a special token
  • no_digits: replace all digits with a special token
  • no_currency_symbols: replace all currency symbols with a special token   
  • no_punct : Remove all punctuation       
  • replace_with_punct=”” : Replace all punctuation with given input
  • replace_with_url=”<URL>” : Replace data URL’s with given input
  • replace_with_email=”<EMAIL>” : Replace data email’s with given input
  • replace_with_phone_number=”<PHONE>” : Replace phone numbers with given input
  • replace_with_number=”<NUMBER>” : Replace numbers with given input
  • replace_with_digit=”0″ : Replace digits with given input
  • replace_with_currency_symbol=”<CUR>” : Replace data email’s with given input
  • lang=“en” (Only English and German languages are supported)

Code Implementation:

Python3

# import library
from cleantext import clean
 
# input string
text = """
    A bunch of \\u2018new\\u2019 references,
    including [Moana]. »Yóù àré rïght <3!«
    """
 
print(clean(text=text,
            fix_unicode=True,
            to_ascii=True,
            lower=True,
            no_line_breaks=False,
            no_urls=False,
            no_emails=False,
            no_phone_numbers=False,
            no_numbers=False,
            no_digits=False,
            no_currency_symbols=False,
            no_punct=False,
            replace_with_punct="",
            replace_with_url="This is a URL",
            replace_with_email="Email",
            replace_with_phone_number="",
            replace_with_number="123",
            replace_with_digit="0",
            replace_with_currency_symbol="$",
            lang="en"
            ))
                      
                       

Output:

 



Next Article
GET and POST Requests Using Python

A

adityasangave21
Improve
Article Tags :
  • Geeks Premier League
  • Python
  • Geeks-Premier-League-2022
  • python-modules
  • Web-scraping
Practice Tags :
  • python

Similar Reads

  • Python Web Scraping Tutorial
    In today’s digital world, data is the key to unlocking valuable insights, and much of this data is available on the web. But how do you gather large amounts of data from websites efficiently? That’s where Python web scraping comes in.Web scraping, the process of extracting data from websites, has em
    12 min read
  • Introduction to Web Scraping

    • Introduction to Web Scraping
      Web scraping is a technique to fetch data from websites. While surfing on the web, many websites prohibit the user from saving data for personal use. This article will brief you about What is Web Scraping, Uses, Techniques, Tools, and challenges of Web Scraping. Table of Content What is Web Scraping
      6 min read

    • What is Web Scraping and How to Use It?
      Suppose you want some information from a website. Let’s say a paragraph on Donald Trump! What do you do? Well, you can copy and paste the information from Wikipedia into your file. But what if you want to get large amounts of information from a website as quickly as possible? Such as large amounts o
      7 min read

    • Web Scraping - Legal or Illegal?
      If you're connected with the term 'Web Scraping' anyhow, then you must come across a question - Is Web Scraping legal or illegal? Okay, so let's discuss it. If you look closely, you will find out that in today's era the biggest asset of any business is Data! Even the top giants like Facebook, Amazon
      5 min read

    • Difference between Web Scraping and Web Crawling
      1. Web Scraping : Web Scraping is a technique used to extract a large amount of data from websites and then saving it to the local machine in the form of XML, excel or SQL. The tools used for web scraping are known as web scrapers. On the basis of the requirements given, they can extract the data fr
      2 min read

    • Web Scraping using cURL in PHP
      We all have tried getting data from a website in many ways. In this article, we will learn how to web scrape using bots to extract content and data from a website.  We will use PHP cURL to scrape a web page, it looks like a typo from leaving caps lock on, but that’s really how you write it. cURL is
      2 min read

    Basics of Web Scraping

    • HTML Basics
      HTML (HyperText Markup Language) is the standard markup language for creating and structuring web pages. It defines the structure of a webpage using elements and tags.HTML is responsible for displaying text, images, and other content.It serves as the foundation for building websites and web applicat
      6 min read

    • Tags vs Elements vs Attributes in HTML
      In HTML, tags represent the structural components of a document, such as <h1> for headings. Elements are formed by tags and encompass both the opening and closing tags along with the content. Attributes provide additional information or properties to elements, enhancing their functionality or
      2 min read

    • CSS Introduction
      CSS (Cascading Style Sheets) is a language designed to simplify the process of making web pages presentable. It allows you to apply styles to HTML documents by prescribing colors, fonts, spacing, and positioning.The main advantages are the separation of content (in HTML) and styling (in CSS) and the
      5 min read

    • CSS Syntax
      CSS (Cascading Style Sheets) is a stylesheet language used to describe the presentation of a document written in HTML. Understanding CSS syntax is fundamental for creating visually appealing and well-structured web pages. Basic CSS SyntaxCSS is written as rulesets. A ruleset consists of a selector a
      6 min read

    • JavaScript Cheat Sheet - A Basic Guide to JavaScript
      JavaScript is a lightweight, open, and cross-platform programming language. It is omnipresent in modern development and is used by programmers across the world to create dynamic and interactive web content like applications and browsers JavaScript (JS) is a versatile, high-level programming language
      15+ min read

    Setting Up the Environment

    • Installing BeautifulSoup: A Beginner's Guide
      BeautifulSoup is a Python library that makes it easy to extract data from HTML and XML files. It helps you find, navigate, and change the information in these files quickly and simply. It’s a great tool that can save you a lot of time when working with web data. The latest version of BeautifulSoup i
      2 min read

    • How to Install Requests in Python - For Windows, Linux, Mac
      Requests is an elegant and simple HTTP library for Python, built for human beings. One of the most famous libraries for Python is used by developers all over the world. This article revolves around how one can install the requests library of Python in Windows/ Linux/ macOS using pip. Table of Conten
      7 min read

    • Selenium Python Introduction and Installation
      Selenium's Python Module is built to perform automated testing with Python. Selenium in Python bindings provides a simple API to write functional/acceptance tests using Selenium WebDriver. Through Selenium Python API you can access all functionalities of python selenium webdriver intuitively. Table
      4 min read

    • How to Install Python Scrapy on Windows?
      Scrapy is a web scraping library that is used to scrape, parse and collect web data. Now once our spider has scrapped the data then it decides whether to: Keep the data.Drop the data or items.stop and store the processed data items. In this article, we will look into the process of installing the Sc
      2 min read

    Extracting Data from Web Pages

    • Implementing Web Scraping in Python with BeautifulSoup
      There are mainly two ways to extract data from a website: Use the API of the website (if it exists). For example, Facebook has the Facebook Graph API which allows retrieval of data posted on Facebook.Access the HTML of the webpage and extract useful information/data from it. This technique is called
      8 min read

    • How to extract paragraph from a website and save it as a text file?
      Perquisites: Beautiful soupUrllib Scraping is an essential technique which helps us to retrieve useful data from a URL or a html file that can be used in another manner. The given article shows how to extract paragraph from a URL and save it as a text file. Modules Needed bs4: Beautiful Soup(bs4) is
      2 min read

    • Extract all the URLs from the webpage Using Python
      Scraping is a very essential skill for everyone to get data from any website. In this article, we are going to write Python scripts to extract all the URLs from the website or you can save it as a CSV file. Module Needed: bs4: Beautiful Soup(bs4) is a Python library for pulling data out of HTML and
      2 min read

    • How to Scrape Nested Tags using BeautifulSoup?
      We can scrap the Nested tag in beautiful soup with help of. (dot) operator. After creating a soup of the page if we want to navigate nested tag then with the help of. we can do it. For scraping Nested Tag using Beautifulsoup follow the below-mentioned steps. Step-by-step Approach Step 1: The first s
      3 min read

    • Extract all the URLs that are nested within <li> tags using BeautifulSoup
      Beautiful Soup is a python library used for extracting html and xml files. In this article we will understand how we can extract all the URLSs from a web page that are nested within <li> tags. Module needed and installation:BeautifulSoup: Our primary module contains a method to access a webpag
      4 min read

    • Clean Web Scraping Data Using clean-text in Python
      If you like to play with API's or like to scrape data from various websites, you must've come around random annoying text, numbers, keywords that come around with data. Sometimes it can be really complicating and frustrating to clean scraped data to obtain the actual data that we want.  In this arti
      2 min read

    Fetching Web Pages

    • GET and POST Requests Using Python
      This post discusses two HTTP (Hypertext Transfer Protocol) request methods  GET and POST requests in Python and their implementation in Python.  What is HTTP? HTTP is a set of protocols designed to enable communication between clients and servers. It works as a request-response protocol between a cl
      7 min read

    • BeautifulSoup - Scraping Paragraphs from HTML
      In this article, we will discuss how to scrap paragraphs from HTML using Beautiful Soup Method 1: using bs4 and urllib. Module Needed: bs4: Beautiful Soup(bs4) is a Python library for pulling data out of HTML and XML files. For installing the module-pip install bs4.urllib: urllib is a package that c
      3 min read

    HTTP Request Methods

    • GET method - Python requests
      Requests library is one of the important aspects of Python for making HTTP requests to a specified URL. This article revolves around how one can make GET request to a specified URL using requests.GET() method. Before checking out GET method, let's figure out what a GET request is - GET Http Method T
      2 min read

    • POST method - Python requests
      Requests library is one of the important aspects of Python for making HTTP requests to a specified URL. This article revolves around how one can make POST request to a specified URL using requests.post() method. Before checking out the POST method, let's figure out what a POST request is -   POST Ht
      2 min read

    • PUT method - Python requests
      The requests library is a powerful and user-friendly tool in Python for making HTTP requests. The PUT method is one of the key HTTP request methods used to update or create a resource at a specific URI. Working of HTTP PUT Method If the resource exists at the given URI, it is updated with the new da
      2 min read

    • DELETE method- Python requests
      Requests library is one of the important aspects of Python for making HTTP requests to a specified URL. This article revolves around how one can make DELETE request to a specified URL using requests.delete() method. Before checking out the DELETE method, let's figure out what a Http DELETE request i
      2 min read

    • HEAD method - Python requests
      Requests library is one of the important aspects of Python for making HTTP requests to a specified URL. This article revolves around how one can make HEAD request to a specified URL using requests.head() method. Before checking out the HEAD method, let's figure out what a Http HEAD request is - HEAD
      2 min read

    • PATCH method - Python requests
      Requests library is one of the important aspects of Python for making HTTP requests to a specified URL. This article revolves around how one can make PATCH request to a specified URL using requests.patch() method. Before checking out the PATCH method, let's figure out what a Http PATCH request is -
      3 min read

    Searching and Extract for specific tags Beautifulsoup

    • Python BeautifulSoup - find all class
      Prerequisite:- Requests , BeautifulSoup The task is to write a program to find all the classes for a given Website URL. In Beautiful Soup there is no in-built method to find all classes. Module needed: bs4: Beautiful Soup(bs4) is a Python library for pulling data out of HTML and XML files. This modu
      2 min read

    • BeautifulSoup - Search by text inside a tag
      Prerequisites: Beautifulsoup Beautifulsoup is a powerful python module used for web scraping. This article discusses how a specific text can be searched inside a given tag. INTRODUCTION: BeautifulSoup is a Python library for parsing HTML and XML documents. It provides a simple and intuitive API for
      4 min read

    • Scrape Google Search Results using Python BeautifulSoup
      In this article, we are going to see how to Scrape Google Search Results using Python BeautifulSoup. Module Needed:bs4: Beautiful Soup(bs4) is a Python library for pulling data out of HTML and XML files. This module does not come built-in with Python. To install this type the below command in the te
      3 min read

    • Get tag name using Beautifulsoup in Python
      Prerequisite: Beautifulsoup Installation Name property is provided by Beautiful Soup which is a web scraping framework for Python. Web scraping is the process of extracting data from the website using automated tools to make the process faster. Name object corresponds to the name of an XML or HTML t
      1 min read

    • Extracting an attribute value with beautifulsoup in Python
      Prerequisite: Beautifulsoup Installation Attributes are provided by Beautiful Soup which is a web scraping framework for Python. Web scraping is the process of extracting data from the website using automated tools to make the process faster. A tag may have any number of attributes. For example, the
      2 min read

    • BeautifulSoup - Modifying the tree
      Prerequisites: BeautifulSoup Beautifulsoup is a Python library used for web scraping. This powerful python tool can also be used to modify html webpages. This article depicts how beautifulsoup can be employed to modify the parse tree. BeautifulSoup is used to search the parse tree and allow you to m
      5 min read

    • Find the text of the given tag using BeautifulSoup
      Web scraping is a process of using software bots called web scrapers in extracting information from HTML or XML content of a web page. Beautiful Soup is a library used for scraping data through python. Beautiful Soup works along with a parser to provide iteration, searching, and modifying the conten
      2 min read

    • Remove spaces from a string in Python
      Removing spaces from a string is a common task in Python that can be solved in multiple ways. For example, if we have a string like " g f g ", we might want the output to be "gfg" by removing all the spaces. Let's look at different methods to do so: Using replace() methodTo remove all spaces from a
      2 min read

    • Understanding Character Encoding
      Ever imagined how a computer is able to understand and display what you have written? Ever wondered what a UTF-8 or UTF-16 meant when you were going through some configurations? Just think about how "HeLLo WorlD" should be interpreted by a computer. We all know that a computer stores data in bits an
      6 min read

    • XML parsing in Python
      This article focuses on how one can parse a given XML file and extract some useful data out of it in a structured way. XML: XML stands for eXtensible Markup Language. It was designed to store and transport data. It was designed to be both human- and machine-readable.That's why, the design goals of X
      7 min read

    • Python - XML to JSON
      A JSON file is a file that stores simple data structures and objects in JavaScript Object Notation (JSON) format, which is a standard data interchange format. It is primarily used for transmitting data between a web application and a server. A JSON object contains data in the form of a key/value pai
      4 min read

    Scrapy Basics

    • Scrapy - Command Line Tools
      Prerequisite: Implementing Web Scraping in Python with Scrapy Scrapy is a python library that is used for web scraping and searching the contents throughout the web. It uses Spiders which crawls throughout the page to find out the content specified in the selectors. Hence, it is a very handy tool to
      5 min read

    • Scrapy - Item Loaders
      In this article, we are going to discuss Item Loaders in Scrapy. Scrapy is used for extracting data, using spiders, that crawl through the website. The obtained data can also be processed, in the form, of Scrapy Items. The Item Loaders play a significant role, in parsing the data, before populating
      15+ min read

    • Scrapy - Item Pipeline
      Scrapy is a web scraping library that is used to scrape, parse and collect web data. For all these functions we are having a pipelines.py file which is used to handle scraped data through various components (known as class) which are executed sequentially. In this article, we will be learning throug
      10 min read

    • Scrapy - Selectors
      Scrapy Selectors as the name suggest are used to select some things. If we talk of CSS, then there are also selectors present that are used to select and apply CSS effects to HTML tags and text. In Scrapy we are using selectors to mention the part of the website which is to be scraped by our spiders
      7 min read

    • Scrapy - Shell
      Scrapy is a well-organized framework, used for large-scale web scraping. Using selectors, like XPath or CSS expressions, one can scrape data seamlessly. It allows systematic crawling, and scraping the data, and storing the content in different file formats. Scrapy comes equipped with a shell, that h
      9 min read

    • Scrapy - Spiders
      Scrapy is a free and open-source web-crawling framework which is written purely in python. Thus, scrapy can be installed and imported like any other python package. The name of the package is self-explanatory. It is derived from the word 'scraping' which literally means extracting desired substance
      11 min read

    • Scrapy - Feed exports
      Scrapy is a fast high-level web crawling and scraping framework written in Python used to crawl websites and extract structured data from their pages. It can be used for many purposes, from data mining to monitoring and automated testing. This article is divided into 2 sections:Creating a Simple web
      5 min read

    • Scrapy - Link Extractors
      In this article, we are going to learn about Link Extractors in scrapy. "LinkExtractor" is a class provided by scrapy to extract links from the response we get while fetching a website. They are very easy to use which we'll see in the below post.  Scrapy - Link Extractors Basically using the "LinkEx
      5 min read

    • Scrapy - Settings
      Scrapy is an open-source tool built with Python Framework. It presents us with a strong and robust web crawling framework that can easily extract the info from the online page with the assistance of selectors supported by XPath. We can define the behavior of Scrapy components with the help of Scrapy
      7 min read

    • Scrapy - Sending an E-mail
      Prerequisites: Scrapy Scrapy provides its own facility for sending e-mails which is extremely easy to use, and it’s implemented using Twisted non-blocking IO, to avoid interfering with the non-blocking IO of the crawler. This article discusses how mail can be sent using scrapy.  For this MailSender
      2 min read

    • Scrapy - Exceptions
      Python-based Scrapy is a robust and adaptable web scraping platform. It provides a variety of tools for systematic, effective data extraction from websites. It helps us to automate data extraction from numerous websites. Scrapy Python Scrapy describes the spider that browses websites and gathers dat
      7 min read

    Selenium Python Basics

    • Navigating links using get method in Selenium - Python
      Selenium's Python module allows you to automate web testing using Python. The Selenium Python bindings provide a straightforward API to write functional and acceptance tests with Selenium WebDriver. Through this API, you can easily access all WebDriver features in a user-friendly way. This article e
      2 min read

    • Interacting with Webpage - Selenium Python
      Selenium’s Python module is designed for automating web testing tasks in Python. It provides a straightforward API through Selenium WebDriver, allowing you to write functional and acceptance tests. To open a webpage, you can use the get() method for navigation. However, the true power of Selenium li
      3 min read

    • Locating single elements in Selenium Python
      Locators Strategies in Selenium Python are methods that are used to locate elements from the page and perform an operation on the same. Selenium’s Python Module is built to perform automated testing with Python. Selenium Python bindings provide a simple API to write functional/acceptance tests using
      5 min read

    • Locating multiple elements in Selenium Python
      Locators Strategies in Selenium Python are methods that are used to locate single or multiple elements from the page and perform operations on the same. Selenium’s Python Module is built to perform automated testing with Python. Selenium Python bindings provide a simple API to write functional/accep
      5 min read

    • Locator Strategies - Selenium Python
      Locators Strategies in Selenium Python are methods that are used to locate elements from the page and perform an operation on the same. Selenium’s Python Module is built to perform automated testing with Python. Selenium Python bindings provides a simple API to write functional/acceptance tests usin
      2 min read

    • Writing Tests using Selenium Python
      Selenium's Python Module is built to perform automated testing with Python. Selenium Python bindings provides a simple API to write functional/acceptance tests using Selenium WebDriver. Through Selenium Python API you can access all functionalities of Selenium WebDriver in an intuitive way. This art
      2 min read

geeksforgeeks-footer-logo
Corporate & Communications Address:
A-143, 7th Floor, Sovereign Corporate Tower, Sector- 136, Noida, Uttar Pradesh (201305)
Registered Address:
K 061, Tower K, Gulshan Vivante Apartment, Sector 137, Noida, Gautam Buddh Nagar, Uttar Pradesh, 201305
GFG App on Play Store GFG App on App Store
Advertise with us
  • Company
  • About Us
  • Legal
  • Privacy Policy
  • In Media
  • Contact Us
  • Advertise with us
  • GFG Corporate Solution
  • Placement Training Program
  • Languages
  • Python
  • Java
  • C++
  • PHP
  • GoLang
  • SQL
  • R Language
  • Android Tutorial
  • Tutorials Archive
  • DSA
  • Data Structures
  • Algorithms
  • DSA for Beginners
  • Basic DSA Problems
  • DSA Roadmap
  • Top 100 DSA Interview Problems
  • DSA Roadmap by Sandeep Jain
  • All Cheat Sheets
  • Data Science & ML
  • Data Science With Python
  • Data Science For Beginner
  • Machine Learning
  • ML Maths
  • Data Visualisation
  • Pandas
  • NumPy
  • NLP
  • Deep Learning
  • Web Technologies
  • HTML
  • CSS
  • JavaScript
  • TypeScript
  • ReactJS
  • NextJS
  • Bootstrap
  • Web Design
  • Python Tutorial
  • Python Programming Examples
  • Python Projects
  • Python Tkinter
  • Python Web Scraping
  • OpenCV Tutorial
  • Python Interview Question
  • Django
  • Computer Science
  • Operating Systems
  • Computer Network
  • Database Management System
  • Software Engineering
  • Digital Logic Design
  • Engineering Maths
  • Software Development
  • Software Testing
  • DevOps
  • Git
  • Linux
  • AWS
  • Docker
  • Kubernetes
  • Azure
  • GCP
  • DevOps Roadmap
  • System Design
  • High Level Design
  • Low Level Design
  • UML Diagrams
  • Interview Guide
  • Design Patterns
  • OOAD
  • System Design Bootcamp
  • Interview Questions
  • Inteview Preparation
  • Competitive Programming
  • Top DS or Algo for CP
  • Company-Wise Recruitment Process
  • Company-Wise Preparation
  • Aptitude Preparation
  • Puzzles
  • School Subjects
  • Mathematics
  • Physics
  • Chemistry
  • Biology
  • Social Science
  • English Grammar
  • Commerce
  • World GK
  • GeeksforGeeks Videos
  • DSA
  • Python
  • Java
  • C++
  • Web Development
  • Data Science
  • CS Subjects
@GeeksforGeeks, Sanchhaya Education Private Limited, All rights reserved
We use cookies to ensure you have the best browsing experience on our website. By using our site, you acknowledge that you have read and understood our Cookie Policy & Privacy Policy
Lightbox
Improvement
Suggest Changes
Help us improve. Share your suggestions to enhance the article. Contribute your expertise and make a difference in the GeeksforGeeks portal.
geeksforgeeks-suggest-icon
Create Improvement
Enhance the article with your expertise. Contribute to the GeeksforGeeks community and help create better learning resources for all.
geeksforgeeks-improvement-icon
Suggest Changes
min 4 words, max Words Limit:1000

Thank You!

Your suggestions are valuable to us.

What kind of Experience do you want to share?

Interview Experiences
Admission Experiences
Career Journeys
Work Experiences
Campus Experiences
Competitive Exam Experiences