Skip to content
geeksforgeeks
  • Courses
    • DSA to Development
    • Get IBM Certification
    • Newly Launched!
      • Master Django Framework
      • Become AWS Certified
    • For Working Professionals
      • Interview 101: DSA & System Design
      • Data Science Training Program
      • JAVA Backend Development (Live)
      • DevOps Engineering (LIVE)
      • Data Structures & Algorithms in Python
    • For Students
      • Placement Preparation Course
      • Data Science (Live)
      • Data Structure & Algorithm-Self Paced (C++/JAVA)
      • Master Competitive Programming (Live)
      • Full Stack Development with React & Node JS (Live)
    • Full Stack Development
    • Data Science Program
    • All Courses
  • Tutorials
    • Data Structures & Algorithms
    • ML & Data Science
    • Interview Corner
    • Programming Languages
    • Web Development
    • CS Subjects
    • DevOps And Linux
    • School Learning
  • Practice
    • Build your AI Agent
    • GfG 160
    • Problem of the Day
    • Practice Coding Problems
    • GfG SDE Sheet
  • Contests
    • Accenture Hackathon (Ending Soon!)
    • GfG Weekly [Rated Contest]
    • Job-A-Thon Hiring Challenge
    • All Contests and Events
  • Python Tutorial
  • Interview Questions
  • Python Quiz
  • Python Glossary
  • Python Projects
  • Practice Python
  • Data Science With Python
  • Python Web Dev
  • DSA with Python
  • Python OOPs
Open In App
Next Article:
Python - Filtering text using Enchant
Next article icon

Python – Tokenize text using Enchant

Last Updated : 26 May, 2020
Comments
Improve
Suggest changes
Like Article
Like
Report

Enchant is a module in Python which is used to check the spelling of a word, gives suggestions to correct words. Also, gives antonym and synonym of words. It checks whether a word exists in dictionary or not.

Enchant also provides the enchant.tokenize module to tokenize text. Tokenizing involves splitting words from the body of the text.

Some terms that will be frequently used are :

  • Corpus – Body of text, singular. Corpora is the plural of this.
  • Lexicon – Words and their meanings.
  • Token – Each “entity” that is a part of whatever was split up based on rules. For examples, each word is a token when a sentence is “tokenized” into words.

We will be using get_tokenizer() to tokenize the text. It takes a language code as input and returns the appropriate tokenization class. Then we instantiate this class with some text and it will return an iterator which will yield the words contained in that text.
The items produced by the tokenizer are tuples of the form (WORD, POS), where WORD is the tokenized word and POS is the position of string at which that word is located.




# import the module
from enchant.tokenize import get_tokenizer
  
# the text to be tokenized 
text = ("Natural language processing (NLP) is a field " + 
       "of computer science, artificial intelligence " + 
       "and computational linguistics concerned with " +  
       "the interactions between computers and human " +  
       "(natural) languages, and, in particular, " +  
       "concerned with programming computers to " + 
       "fruitfully process large natural language " +  
       "corpora. Challenges in natural language " +  
       "processing frequently involve natural " + 
       "language understanding, natural language" +  
       "generation frequently from formal, machine" +  
       "-readable logical forms), connecting language " +  
       "and machine perception, managing human-" + 
       "computer dialog systems, or some combination " +  
       "thereof.")
  
# getting tokenizer class
tokenizer = get_tokenizer("en_US")
  
token_list =[]
for words in tokenizer(text):
    token_list.append(words)
  
# print the words with POS
print(token_list)
 
 

Output :

[(‘Natural’, 0), (‘language’, 8), (‘processing’, 17), (‘NLP’, 29), (‘is’, 34), (‘a’, 37), (‘field’, 39), (‘of’, 45), (‘computer’, 48), (‘science’, 57), (‘artificial’, 66), (‘intelligence’, 77), (‘and’, 90), (‘computational’, 94), (‘linguistics’, 108), (‘concerned’, 120), (‘with’, 130), (‘the’, 135), (‘interactions’, 139), (‘between’, 152), (‘computers’, 160), (‘and’, 170), (‘human’, 174), (‘natural’, 181), (‘languages’, 190), (‘and’, 201), (‘in’, 206), (‘particular’, 209), (‘concerned’, 221), (‘with’, 231), (‘programming’, 236), (‘computers’, 248), (‘to’, 258), (‘fruitfully’, 261), (‘process’, 272), (‘large’, 280), (‘natural’, 286), (‘language’, 294), (‘corpora’, 303), (‘Challenges’, 312), (‘in’, 323), (‘natural’, 326), (‘language’, 334), (‘processing’, 343), (‘frequently’, 354), (‘involve’, 365), (‘natural’, 373), (‘language’, 381), (‘understanding’, 390), (‘natural’, 405), (‘languagegeneration’, 413), (‘frequently’, 432), (‘from’, 443), (‘formal’, 448), (‘machine’, 456), (‘readable’, 464), (‘logical’, 473), (‘forms’, 481), (‘connecting’, 489), (‘language’, 500), (‘and’, 509), (‘machine’, 513), (‘perception’, 521), (‘managing’, 533), (‘human’, 542), (‘computer’, 548), (‘dialog’, 557), (‘systems’, 564), (‘or’, 573), (‘some’, 576), (‘combination’, 581), (‘thereof’, 593)]

To only print the words, not the POS :




# print only the words
word_list =[]
  
for tokens in token_list:
    word_list.append(tokens[0])
print(word_list)
 
 

Output :

[‘Natural’, ‘language’, ‘processing’, ‘NLP’, ‘is’, ‘a’, ‘field’, ‘of’, ‘computer’, ‘science’, ‘artificial’, ‘intelligence’, ‘and’, ‘computational’, ‘linguistics’, ‘concerned’, ‘with’, ‘the’, ‘interactions’, ‘between’, ‘computers’, ‘and’, ‘human’, ‘natural’, ‘languages’, ‘and’, ‘in’, ‘particular’, ‘concerned’, ‘with’, ‘programming’, ‘computers’, ‘to’, ‘fruitfully’, ‘process’, ‘large’, ‘natural’, ‘language’, ‘corpora’, ‘Challenges’, ‘in’, ‘natural’, ‘language’, ‘processing’, ‘frequently’, ‘involve’, ‘natural’, ‘language’, ‘understanding’, ‘natural’, ‘languagegeneration’, ‘frequently’, ‘from’, ‘formal’, ‘machine’, ‘readable’, ‘logical’, ‘forms’, ‘connecting’, ‘language’, ‘and’, ‘machine’, ‘perception’, ‘managing’, ‘human’, ‘computer’, ‘dialog’, ‘systems’, ‘or’, ‘some’, ‘combination’, ‘thereof’]



Next Article
Python - Filtering text using Enchant

Y

Yash_R
Improve
Article Tags :
  • Python
  • Python Enchant-module
Practice Tags :
  • python

Similar Reads

  • Python - Chunking text using Enchant
    Enchant is a module in Python which is used to check the spelling of a word, gives suggestions to correct words. Also, gives antonym and synonym of words. It checks whether a word exists in dictionary or not. Enchant also provides the enchant.tokenize module to tokenize text. Tokenizing involves spl
    2 min read
  • Python - Filtering text using Enchant
    Enchant is a module in Python which is used to check the spelling of a word, gives suggestions to correct words. Also, gives antonym and synonym of words. It checks whether a word exists in dictionary or not. Enchant also provides the enchant.tokenize module to tokenize text. Tokenizing involves spl
    3 min read
  • Python NLTK | tokenize.regexp()
    With the help of NLTK tokenize.regexp() module, we are able to extract the tokens from string by using regular expression with RegexpTokenizer() method. Syntax : tokenize.RegexpTokenizer() Return : Return array of tokens using regular expression Example #1 : In this example we are using RegexpTokeni
    1 min read
  • Python NLTK | nltk.TweetTokenizer()
    With the help of NLTK nltk.TweetTokenizer() method, we are able to convert the stream of words into small tokens so that we can analyse the audio stream with the help of nltk.TweetTokenizer() method. Syntax : nltk.TweetTokenizer() Return : Return the stream of token Example #1 : In this example when
    1 min read
  • Python - UwU text convertor GUI using Tkinter
    Prerequisites :Introduction to tkinter | UwU text convertor Python offers multiple options for developing a GUI (Graphical User Interface). Out of all the GUI methods, Tkinter is the most commonly used method. It is a standard Python interface to the Tk GUI toolkit shipped with Python. Python with T
    4 min read
  • Python Tokens and Character Sets
    Python is a general-purpose, high-level programming language. It was designed with an emphasis on code readability, and its syntax allows programmers to express their concepts in fewer lines of code, and these codes are known as scripts. These scripts contain character sets, tokens, and identifiers.
    6 min read
  • Python - Spelling checker using Enchant
    Enchant is a module in Python, which is used to check the spelling of a word, gives suggestions to correct words. Also, gives antonym and synonym of words. It checks whether a word exists in the dictionary or not. Enchant can also be used to check the spelling of words. The check() method returns Tr
    2 min read
  • Sentiment Detector GUI using Tkinter - Python
    Prerequisites : Introduction to tkinter | Sentiment Analysis using Vader Python offers multiple options for developing a GUI (Graphical User Interface). Out of all the GUI methods, Tkinter is the most commonly used method. Python with Tkinter outputs the fastest and easiest way to create GUI applica
    4 min read
  • Python NLTK | nltk.tokenize.mwe()
    With the help of NLTK nltk.tokenize.mwe() method, we can tokenize the audio stream into multi_word expression token which helps to bind the tokens with underscore by using nltk.tokenize.mwe() method. Remember it is case sensitive. Syntax : MWETokenizer.tokenize() Return : Return bind tokens as one i
    1 min read
  • Python Tkinter - Text Widget
    Tkinter is a GUI toolkit used in python to make user-friendly GUIs.Tkinter is the most commonly used and the most basic GUI framework available in python. Tkinter uses an object-oriented approach to make GUIs.Note: For more information, refer to Python GUI – tkinter  Text WidgetText Widget is used w
    5 min read
geeksforgeeks-footer-logo
Corporate & Communications Address:
A-143, 7th Floor, Sovereign Corporate Tower, Sector- 136, Noida, Uttar Pradesh (201305)
Registered Address:
K 061, Tower K, Gulshan Vivante Apartment, Sector 137, Noida, Gautam Buddh Nagar, Uttar Pradesh, 201305
GFG App on Play Store GFG App on App Store
Advertise with us
  • Company
  • About Us
  • Legal
  • Privacy Policy
  • In Media
  • Contact Us
  • Advertise with us
  • GFG Corporate Solution
  • Placement Training Program
  • Languages
  • Python
  • Java
  • C++
  • PHP
  • GoLang
  • SQL
  • R Language
  • Android Tutorial
  • Tutorials Archive
  • DSA
  • Data Structures
  • Algorithms
  • DSA for Beginners
  • Basic DSA Problems
  • DSA Roadmap
  • Top 100 DSA Interview Problems
  • DSA Roadmap by Sandeep Jain
  • All Cheat Sheets
  • Data Science & ML
  • Data Science With Python
  • Data Science For Beginner
  • Machine Learning
  • ML Maths
  • Data Visualisation
  • Pandas
  • NumPy
  • NLP
  • Deep Learning
  • Web Technologies
  • HTML
  • CSS
  • JavaScript
  • TypeScript
  • ReactJS
  • NextJS
  • Bootstrap
  • Web Design
  • Python Tutorial
  • Python Programming Examples
  • Python Projects
  • Python Tkinter
  • Python Web Scraping
  • OpenCV Tutorial
  • Python Interview Question
  • Django
  • Computer Science
  • Operating Systems
  • Computer Network
  • Database Management System
  • Software Engineering
  • Digital Logic Design
  • Engineering Maths
  • Software Development
  • Software Testing
  • DevOps
  • Git
  • Linux
  • AWS
  • Docker
  • Kubernetes
  • Azure
  • GCP
  • DevOps Roadmap
  • System Design
  • High Level Design
  • Low Level Design
  • UML Diagrams
  • Interview Guide
  • Design Patterns
  • OOAD
  • System Design Bootcamp
  • Interview Questions
  • Inteview Preparation
  • Competitive Programming
  • Top DS or Algo for CP
  • Company-Wise Recruitment Process
  • Company-Wise Preparation
  • Aptitude Preparation
  • Puzzles
  • School Subjects
  • Mathematics
  • Physics
  • Chemistry
  • Biology
  • Social Science
  • English Grammar
  • Commerce
  • World GK
  • GeeksforGeeks Videos
  • DSA
  • Python
  • Java
  • C++
  • Web Development
  • Data Science
  • CS Subjects
@GeeksforGeeks, Sanchhaya Education Private Limited, All rights reserved
We use cookies to ensure you have the best browsing experience on our website. By using our site, you acknowledge that you have read and understood our Cookie Policy & Privacy Policy
Lightbox
Improvement
Suggest Changes
Help us improve. Share your suggestions to enhance the article. Contribute your expertise and make a difference in the GeeksforGeeks portal.
geeksforgeeks-suggest-icon
Create Improvement
Enhance the article with your expertise. Contribute to the GeeksforGeeks community and help create better learning resources for all.
geeksforgeeks-improvement-icon
Suggest Changes
min 4 words, max Words Limit:1000

Thank You!

Your suggestions are valuable to us.

What kind of Experience do you want to share?

Interview Experiences
Admission Experiences
Career Journeys
Work Experiences
Campus Experiences
Competitive Exam Experiences