Mastering Text Summarization with Sumy: A Python Library Overview

Last Updated : 25 Jul, 2024

Sumy is a Python library for Natural Language Processing tasks, used mainly for the automatic summarization of text. It provides several summarizers based on different algorithms, such as the Luhn, Edmundson, LSA, LexRank, and KL summarizers; we will learn about each of these in the upcoming sections. Sumy requires minimal code to build a summary, integrates easily with other Natural Language Processing workflows, and is well suited to summarizing large documents.

In this article, we will first look at the benefits of using this library for summarization. We will then install it, build a tokenizer and a stemmer, and finally understand and implement each summarizer that Sumy provides.

Benefits

  • Sumy provides many summarization algorithms, allowing users to choose from a wide range of summarizers based on their preferences.
  • This library integrates efficiently with other NLP libraries.
  • The library is easy to install and use, requiring minimal setup.
  • We can summarize lengthy documents using this library.
  • Sumy can be easily customized to fit specific summarization needs.

Installation

Now let’s look at the different ways to install this library on our system.

Via PyPI

To install it via PyPI, paste the command below into your terminal.

pip install sumy

If you are working in a notebook such as Jupyter Notebook, Kaggle, or Google Colab, then add ‘!’ before the above command.
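
For example, in a notebook cell the command becomes:

!pip install sumy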

Via GitHub

There are two ways to install this library from GitHub: the first is to install it directly with pip using the Git URL, and the second is to clone the repository and run its setup script from inside the cloned folder.

pip install git+git://github.com/miso-belica/sumy.git

git clone https://github.com/miso-belica/sumy.git
cd sumy
python3 setup.py install

Tokenizer

Tokenization is one of the most important tasks in text preprocessing. In tokenization, we divide a paragraph into sentences and then break those sentences down into individual words. By tokenizing the text, Sumy can better understand its structure and meaning, which improves the accuracy and quality of the generated summaries.

Now, let's see how to build a tokenizer using the Sumy library. We will first import the Tokenizer module from Sumy and download the 'punkt' tokenizer models from NLTK. We will then create a Tokenizer instance for the English language, split a sample text into sentences, and print the tokenized words for each sentence.

Python
from sumy.nlp.tokenizers import Tokenizer
import nltk

nltk.download('punkt')

tokenizer = Tokenizer("en")

text = ("Hello, this is GeeksForGeeks! We are a computer science portal for geeks, "
        "offering a wide range of articles, tutorials, and resources on various topics "
        "in computer science and programming. Our mission is to provide quality education "
        "and knowledge sharing to help you excel in your career and academic pursuits. "
        "Whether you're a beginner looking to learn the basics of coding or an experienced "
        "developer seeking advanced concepts, GeeksForGeeks has something for everyone.")

sentences = tokenizer.to_sentences(text)

for sentence in sentences:
    print(tokenizer.to_words(sentence))

Output:

('Hello', 'this', 'is', 'GeeksForGeeks')

('We', 'are', 'a', 'computer', 'science', 'portal', 'for', 'geeks', 'offering', 'a', 'wide', 'range', 'of', 'articles', 'tutorials', 'and', 'resources', 'on', 'various', 'topics', 'in', 'computer', 'science', 'and', 'programming')

('Our', 'mission', 'is', 'to', 'provide', 'quality', 'education', 'and', 'knowledge', 'sharing', 'to', 'help', 'you', 'excel', 'in', 'your', 'career', 'and', 'academic', 'pursuits')

('Whether', 'you', 'a', 'beginner', 'looking', 'to', 'learn', 'the', 'basics', 'of', 'coding', 'or', 'an', 'experienced', 'developer', 'seeking', 'advanced', 'concepts', 'GeeksForGeeks', 'has', 'something', 'for', 'everyone')

Stemmer

Stemming is the process of reducing a word to its base or root form. This normalizes words so that different forms of a word are treated as the same term. By doing this, summarization algorithms can more effectively recognize and group similar words, which improves summarization quality. The stemmer is particularly useful for large texts that contain many forms of the same word.

To create a stemmer using the Sumy library, we will first import the `Stemmer` module from Sumy. Then, we will create an object of `Stemmer` for the English language. Next, we will pass a word to the stemmer to reduce it to its root form. Finally, we will print the stemmed word.

Python
from sumy.nlp.stemmers import Stemmer

stemmer = Stemmer("en")
stem = stemmer("Blogging")

print(stem)

Output:

blog

Summarizers

Luhn Summarizer

The Luhn Summarizer is one of the summarization algorithms provided by the Sumy library. It is based on frequency analysis: the importance of a sentence is determined by the frequency of significant words within it. The algorithm filters out common stop words, identifies the words most relevant to the topic of the text, and then ranks sentences accordingly. The Luhn Summarizer is effective for extracting key sentences from a document. Here's how to build the Luhn Summarizer:

Python
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.luhn import LuhnSummarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words
import nltk

nltk.download('punkt')


def summarize_paragraph(paragraph, sentences_count=2):
    # Parse the raw text into a document Sumy can work with
    parser = PlaintextParser.from_string(paragraph, Tokenizer("english"))

    # Luhn summarizer with an English stemmer, ignoring common stop words
    summarizer = LuhnSummarizer(Stemmer("english"))
    summarizer.stop_words = get_stop_words("english")

    # Return the top `sentences_count` sentences
    summary = summarizer(parser.document, sentences_count)
    return summary


if __name__ == "__main__":
    paragraph = """Artificial intelligence (AI) is intelligence demonstrated by machines, in contrast
                   to the natural intelligence displayed by humans and animals. Leading AI textbooks define
                   the field as the study of "intelligent agents": any device that perceives its environment
                   and takes actions that maximize its chance of successfully achieving its goals. Colloquially,
                   the term "artificial intelligence" is often used to describe machines (or computers) that mimic
                   "cognitive" functions that humans associate with the human mind, such as "learning" and "problem solving"."""

    sentences_count = 2
    summary = summarize_paragraph(paragraph, sentences_count)

    for sentence in summary:
        print(sentence)

Output:

Artificial intelligence (AI) is intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans and animals.
Colloquially, the term "artificial intelligence" is often used to describe machines (or computers) that mimic "cognitive" functions that humans associate with the human mind, such as "learning" and "problem solving".

Edmundson Summarizer

The Edmundson Summarizer is another powerful algorithm provided by the Sumy library. Unlike summarizers that rely purely on statistical and frequency-based methods, the Edmundson Summarizer allows a more tailored approach through the use of bonus words, stigma words, and null words: bonus words raise the score of sentences that contain them, stigma words lower it, and null words are treated as insignificant when scoring. Here's how to build the Edmundson Summarizer:

Python
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.edmundson import EdmundsonSummarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words
import nltk

nltk.download('punkt')


def summarize_paragraph(paragraph, sentences_count=2, bonus_words=None,
                        stigma_words=None, null_words=None):
    parser = PlaintextParser.from_string(paragraph, Tokenizer("english"))

    summarizer = EdmundsonSummarizer(Stemmer("english"))
    summarizer.stop_words = get_stop_words("english")

    # Bonus words raise a sentence's score, stigma words lower it,
    # and null words are treated as insignificant
    if bonus_words:
        summarizer.bonus_words = bonus_words
    if stigma_words:
        summarizer.stigma_words = stigma_words
    if null_words:
        summarizer.null_words = null_words

    summary = summarizer(parser.document, sentences_count)
    return summary


if __name__ == "__main__":
    paragraph = """Artificial intelligence (AI) is intelligence demonstrated by machines, in contrast
                   to the natural intelligence displayed by humans and animals. Leading AI textbooks define
                   the field as the study of "intelligent agents": any device that perceives its environment
                   and takes actions that maximize its chance of successfully achieving its goals. Colloquially,
                   the term "artificial intelligence" is often used to describe machines (or computers) that mimic
                   "cognitive" functions that humans associate with the human mind, such as "learning" and "problem solving"."""

    sentences_count = 2
    bonus_words = ["intelligence", "AI"]
    stigma_words = ["contrast"]
    null_words = ["the", "of", "and", "to", "in"]

    summary = summarize_paragraph(paragraph, sentences_count,
                                  bonus_words, stigma_words, null_words)

    for sentence in summary:
        print(sentence)

Output:

Artificial intelligence (AI) is intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans and animals.
Leading AI textbooks define the field as the study of "intelligent agents": any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals.

LSA Summarizer

The LSA (Latent Semantic Analysis) summarizer identifies latent patterns and relationships in the text rather than relying solely on frequency analysis. This often lets it generate more contextually accurate summaries that better reflect the meaning of the input text. Here's how to build the LSA Summarizer:

Python
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words
import nltk

nltk.download('punkt')


def summarize_paragraph(paragraph, sentences_count=2):
    parser = PlaintextParser.from_string(paragraph, Tokenizer("english"))

    summarizer = LsaSummarizer(Stemmer("english"))
    summarizer.stop_words = get_stop_words("english")

    summary = summarizer(parser.document, sentences_count)
    return summary


if __name__ == "__main__":
    paragraph = """Artificial intelligence (AI) is intelligence demonstrated by machines, in contrast
                   to the natural intelligence displayed by humans and animals. Leading AI textbooks define
                   the field as the study of "intelligent agents": any device that perceives its environment
                   and takes actions that maximize its chance of successfully achieving its goals. Colloquially,
                   the term "artificial intelligence" is often used to describe machines (or computers) that mimic
                   "cognitive" functions that humans associate with the human mind, such as "learning" and "problem solving"."""

    sentences_count = 2
    summary = summarize_paragraph(paragraph, sentences_count)

    for sentence in summary:
        print(sentence)

Output:

Leading AI textbooks define the field as the study of "intelligent agents": any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals.
Colloquially, the term "artificial intelligence" is often used to describe machines (or computers) that mimic "cognitive" functions that humans associate with the human mind, such as "learning" and "problem solving".
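
LexRank and KL Summarizers

Sumy exposes the LexRank and KL summarizers mentioned in the introduction through the same interface, so switching algorithms only requires changing the import and the summarizer class. The sketch below reuses the pattern and paragraph from the examples above; the helper name summarize_with is just illustrative, and the sentences each algorithm selects may differ.

Python
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.summarizers.kl import KLSummarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words
import nltk

nltk.download('punkt')


def summarize_with(summarizer_cls, paragraph, sentences_count=2):
    # Same pattern as above: parse the text, build the summarizer, pick the top sentences
    parser = PlaintextParser.from_string(paragraph, Tokenizer("english"))
    summarizer = summarizer_cls(Stemmer("english"))
    summarizer.stop_words = get_stop_words("english")
    return summarizer(parser.document, sentences_count)


if __name__ == "__main__":
    paragraph = """Artificial intelligence (AI) is intelligence demonstrated by machines, in contrast
                   to the natural intelligence displayed by humans and animals. Leading AI textbooks define
                   the field as the study of "intelligent agents": any device that perceives its environment
                   and takes actions that maximize its chance of successfully achieving its goals. Colloquially,
                   the term "artificial intelligence" is often used to describe machines (or computers) that mimic
                   "cognitive" functions that humans associate with the human mind, such as "learning" and "problem solving"."""

    for summarizer_cls in (LexRankSummarizer, KLSummarizer):
        print(summarizer_cls.__name__)
        for sentence in summarize_with(summarizer_cls, paragraph):
            print(sentence)

LexRank scores sentences by their similarity to other sentences using a graph-based ranking, while the KL summarizer greedily picks sentences that keep the summary's word distribution close to that of the original text.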

Conclusion

In conclusion, Sumy is a convenient library for automatic text summarization that also handles supporting tasks like tokenization and stemming. By using different algorithms such as Luhn, Edmundson, and LSA, we can generate concise and meaningful summaries suited to our specific needs. Although we used a short paragraph in the examples, the same code can summarize lengthy documents quickly.



