
How to Scrape Websites with Beautifulsoup and Python ?

Last Updated : 03 Jun, 2024

Have you ever wondered how much data is created on the internet every day, and what happens when you want to work with that data? Unfortunately, most of it is not neatly organized in a CSV or JSON file, but fortunately we can use web scraping to collect data from the internet and use it for our own needs. There are many ways to scrape data, and one of them is BeautifulSoup.

Before diving into BeautifulSoup, let's first look at what web scraping is and whether we should do it at all.

What is Web Scraping?

In layman's terms, web scraping is the process of gathering data from a website. It is like copying and pasting data from a website into your own file, but done automatically. In technical terms, web scraping is an automated method of obtaining large amounts of data from websites. Most of this data is unstructured HTML, which is then converted into structured data in a spreadsheet or database so that it can be used in various applications.

Note: For more information, refer to What is Web Scraping and How to Use It?

Legality of Web Scraping

The legality of web scraping is a sensitive topic: depending on how it is used, it can be either a boon or a bane. On one hand, web scraping by good bots enables search engines to index web content and price comparison services to save customers money. On the other hand, web scraping can be re-targeted toward malicious and abusive ends. It can be combined with other forms of malicious automation, known as "bad bots", which enable harmful activities such as denial-of-service attacks, competitive data mining, account hijacking, data theft, and so on.

Now that we have covered the basics of web scraping, let's not waste any more time and dive straight into BeautifulSoup, starting with the installation.

Installation

To install BeautifulSoup on Windows, Linux, or any other operating system, you need the pip package manager. To see how to install pip on your operating system, check out PIP Installation – Windows || Linux. Then run the command below in the terminal.

pip install beautifulsoup4

[Screenshot: output of pip install beautifulsoup4]
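To quickly verify the installation, you can import the package and print its version (any version string means the install worked):

import bs4

# Printing the installed BeautifulSoup version is a simple sanity check.
print(bs4.__version__)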

Refer to the article below for more ways of installing BeautifulSoup if the above method does not work for you.

  • Beautifulsoup Installation – Python

Inspecting the Website

Before scraping any website, the first thing you need to do is understand its structure. This is necessary in order to select the desired data from the entire page. You can do this by right-clicking on the page you want to scrape and selecting Inspect (inspect element).

Note: We will be scraping the GeeksforGeeks Python Programming page for this tutorial.

[Screenshot: right-clicking the page and selecting Inspect]

After clicking Inspect, the browser's Developer Tools open. Almost all browsers ship with developer tools; we will be using Chrome for this tutorial.

[Screenshot: the page's HTML shown in the browser developer tools]

The developer tools let you see the site's Document Object Model (DOM). If you are not familiar with the DOM, don't worry: just consider the text displayed as the HTML structure of the page.

Getting the HTML of the Page

After inspecting the HTML of the page, we still need to get all of that HTML into our Python code so that we can scrape the desired data. For this, Python provides a module called requests. The requests library is one of the standard tools in Python for making HTTP requests to a specified URL. Its installation depends on the operating system you are using, but the basic approach everywhere is to open a terminal and run:

pip install requests

Now let’s make a simple GET request using the get() method.

Example:

Python
import requests

# Making a GET request
r = requests.get('https://www.geeksforgeeks.org/python-programming-language/')

# check status code for response received
# success code - 200
print(r)

# print content of request
print(r.content)

Output:

[Screenshot: response object and raw HTML content printed by the script]
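The comments in the script mention the 200 success code; here is a minimal sketch (same URL assumed) that checks the status explicitly before using the response body:

import requests

url = 'https://www.geeksforgeeks.org/python-programming-language/'
r = requests.get(url)

# r.status_code holds the numeric HTTP status; 200 means the request succeeded.
print(r.status_code)

# raise_for_status() raises requests.HTTPError for 4xx/5xx responses,
# so the rest of the script only runs after a successful fetch.
r.raise_for_status()
print(len(r.content), 'bytes of HTML received')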

Refer to the tutorial below for detailed, well-explained information about the requests module.

  • Python Requests Tutorial

Parsing the HTML

After getting the HTML of the page, let's see how to parse this raw HTML into useful information. First of all, we create a BeautifulSoup object by specifying the parser we want to use.

Note: The BeautifulSoup library is built on top of HTML parsing libraries such as html5lib, lxml and html.parser, so the BeautifulSoup object and the parser library to use can be specified at the same time.
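For illustration, here is a small sketch showing how the parser is chosen when creating the object; html.parser ships with Python, while lxml and html5lib are optional extras that would need their own pip installs:

from bs4 import BeautifulSoup

html_doc = '<html><head><title>Demo</title></head><body><p>Hello</p></body></html>'

# Built-in parser - no extra installation required.
soup = BeautifulSoup(html_doc, 'html.parser')

# Third-party parsers - require `pip install lxml` / `pip install html5lib`.
# soup = BeautifulSoup(html_doc, 'lxml')
# soup = BeautifulSoup(html_doc, 'html5lib')

print(soup.title.string)  # Demo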

Example 1:

Python
import requests
from bs4 import BeautifulSoup

# Making a GET request
r = requests.get('https://www.geeksforgeeks.org/python-programming-language/')

# check status code for response received
# success code - 200
print(r)

# Parsing the HTML
soup = BeautifulSoup(r.content, 'html.parser')
print(soup.prettify())

Output:

[Screenshot: prettified HTML produced by soup.prettify()]

Example 2:

Python
import requests
from bs4 import BeautifulSoup

# Making a GET request
r = requests.get('https://www.geeksforgeeks.org/python-programming-language/')

# Parsing the HTML
soup = BeautifulSoup(r.content, 'html.parser')

# Getting the title tag
print(soup.title)

# Getting the name of the tag
print(soup.title.name)

# Getting the name of parent tag
print(soup.title.parent.name)

# use the child attribute to get
# the name of the child tag

Output: 

<title>Python Programming Language - GeeksforGeeks</title>
title
meta
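As a small follow-up sketch (reusing the soup object from the example above, and assuming the fetched page has a <head> tag), the .children attribute mentioned in the last comment can be used to walk a tag's direct children:

# Iterate over the direct children of the <head> tag and print element names.
# Plain text/whitespace nodes have name == None, so they are skipped.
for child in soup.head.children:
    if child.name is not None:
        print(child.name)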

Finding Elements

Now we would like to extract some useful data from the HTML content. The soup object contains all of the page's data in a nested structure that can be extracted programmatically. The website we want to scrape contains a lot of text, so let's scrape all of that content.

First let’s inspect the webpage we want to scrape. 

[Screenshot: inspecting the div with class entry-content]


Finding Elements by Class

In the image above we can see that all of the page's content sits under the div with class entry-content. We will store everything found under this class.

Example:

Python
import requests
from bs4 import BeautifulSoup

# Making a GET request
r = requests.get('https://www.geeksforgeeks.org/python-programming-language/')

# check status code for response received
# success code - 200
print(r)

# Parsing the HTML
soup = BeautifulSoup(r.content, 'html.parser')

s = soup.find('div', class_='entry-content')
print(s)

Output: 

[Screenshot: the entry-content div returned by find()]


In the example above we have used the find() method, which returns the first tag matching the given attributes. In our case it finds the div with the class entry-content. We have got all the content from the site, but you can see that the images and links are also scraped. So our next task is to extract only the text content from the parsed HTML.

Let’s again inspect the HTML of our website.
 

[Screenshot: inspecting the <p> tags inside the entry-content div]


We can see that the content of the page lives inside <p> tags. Now we have to find all the p tags present under this class. For this we can use BeautifulSoup's find_all() method.

Example: 

Python
import requests
from bs4 import BeautifulSoup

# Making a GET request
r = requests.get('https://www.geeksforgeeks.org/python-programming-language/')

# Parsing the HTML
soup = BeautifulSoup(r.content, 'html.parser')

s = soup.find('div', class_='entry-content')

lines = s.find_all('p')
print(lines)

 
Output: 

[Screenshot: list of <p> tags returned by find_all()]

We finally get all the content stored under the <p> tags.
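Because find_all() returns a list-like ResultSet, ordinary list operations work on it; here is a quick sketch reusing the lines variable from the example above (assuming at least one paragraph was found):

# Number of <p> tags found inside the entry-content div.
print(len(lines))

# The first paragraph tag, and just its text.
print(lines[0])
print(lines[0].text)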

Finding Elements by ID

In the examples above we found elements by class name; now let's see how to find elements by id. For this task, let's scrape the content of the page's leftbar. The first step is to inspect the page and see which tag the leftbar falls under.

[Screenshot: inspecting the div with id main]


The image above shows that the leftbar falls under the <div> tag with the id main. Now let's get the HTML content under this tag.

Example: 

Python
import requests
from bs4 import BeautifulSoup

# Making a GET request
r = requests.get('https://www.geeksforgeeks.org/python-programming-language/')

# Parsing the HTML
soup = BeautifulSoup(r.content, 'html.parser')

# finding element by id
s = soup.find('div', id='main')

print(s)

Output: 

[Screenshot: the div with id main returned by find()]

Now let's inspect the page further to get the content of the leftbar.

[Screenshot: inspecting the ul with class leftBarList]

We can see that the list in the leftbar is under the <ul> tag with the class leftBarList, and our task is to find all the li elements under this ul.

Example: 

Python
import requests
from bs4 import BeautifulSoup

# Making a GET request
r = requests.get('https://www.geeksforgeeks.org/python-programming-language/')

# Parsing the HTML
soup = BeautifulSoup(r.content, 'html.parser')

# Finding by id
s = soup.find('div', id='main')

# Getting the leftbar
leftbar = s.find('ul', class_='leftBarList')

# All the li under the above ul
content = leftbar.find_all('li')
print(content)

Output: 

[Screenshot: list of <li> tags returned by find_all()]

Refer to the below articles to get detailed information about finding elements. 

  • Python BeautifulSoup – find all class
  • How to extract a div tag and its contents by id with BeautifulSoup?
  • Find the siblings of tags using BeautifulSoup
  • Extracting an attribute value with beautifulsoup in Python
  • BeautifulSoup – Find all <li> in <ul>
  • Find text using beautifulSoup then replace in original soup variable
  • BeautifulSoup – Search by text inside a tag
  • BeautifulSoup – Find tags by CSS class with CSS Selectors
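As a related sketch (reusing the soup object from the examples above), BeautifulSoup's select() method accepts CSS selectors as an alternative to find()/find_all():

# CSS-selector equivalents of the find()/find_all() calls used earlier.
# 'div.entry-content p' matches every <p> inside the div with class entry-content;
# 'ul.leftBarList li' matches every <li> inside the <ul> with class leftBarList.
paragraphs = soup.select('div.entry-content p')
items = soup.select('ul.leftBarList li')

print(len(paragraphs), len(items))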

Extracting Text from the Tags

In the examples above you will have noticed that the tags get scraped along with the data, but what if we want only the text without any tags? That is exactly what this section covers. We will use the text property, which returns only the text inside a tag. We will reuse the examples above and strip all the tags from the results.

Example 1: Removing the tags from the content of the page 

Python
import requests
from bs4 import BeautifulSoup

# Making a GET request
r = requests.get('https://www.geeksforgeeks.org/python-programming-language/')

# Parsing the HTML
soup = BeautifulSoup(r.content, 'html.parser')

s = soup.find('div', class_='entry-content')

lines = s.find_all('p')

for line in lines:
    print(line.text)

Output:

[Screenshot: paragraph text printed without tags]

We have now successfully scraped the content from our first website. This script will keep working on any system unless the HTML of the webpage itself changes.
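Since the script depends on the page keeping this structure, it is safer to check that find() actually matched something before calling further methods on the result; here is a minimal defensive sketch with the same URL and class name:

import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.geeksforgeeks.org/python-programming-language/')
soup = BeautifulSoup(r.content, 'html.parser')

s = soup.find('div', class_='entry-content')

# find() returns None when nothing matches, so guard before using the result.
if s is None:
    print('Page structure changed: entry-content div not found')
else:
    for line in s.find_all('p'):
        print(line.text)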

Example 2: Removing the tags from the content of the leftbar. 

Python
import requests
from bs4 import BeautifulSoup

# Making a GET request
r = requests.get('https://www.geeksforgeeks.org/python-programming-language/')

# Parsing the HTML
soup = BeautifulSoup(r.content, 'html.parser')

# Finding by id
s = soup.find('div', id='main')

# Getting the leftbar
leftbar = s.find('ul', class_='leftBarList')

# All the li under the above ul
lines = leftbar.find_all('li')

for line in lines:
    print(line.text)

 Output:

[Screenshot: leftbar items printed without tags]
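Related to the text property used above, the get_text() method offers a bit more control; here is a short sketch (re-finding the entry-content div from the earlier examples) using its optional separator and strip arguments:

# get_text() concatenates all the text inside a tag.
# separator is placed between text fragments; strip=True trims whitespace.
content_div = soup.find('div', class_='entry-content')
if content_div is not None:
    print(content_div.get_text(separator='\n', strip=True))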

Refer to the below articles to get detailed information about extracting text.

  • Show text inside the tags using BeautifulSoup
  • Find the text of the given tag using BeautifulSoup
  • How to scrape all the text from body tag using Beautifulsoup in Python?

More Topics on BeautifulSoup

  • Beautifulsoup – nextSibling
  • BeautifulSoup – Remove the contents of tag
  • BeautifulSoup – Append to the contents of tag
  • How to delete child element in BeautifulSoup?
  • Pretty-Printing in BeautifulSoup
  • BeautifulSoup – Modifying the tree
  • Converting HTML to Text with BeautifulSoup
  • How to modify HTML using BeautifulSoup ?
  • Change the tag’s contents and replace with the given string using BeautifulSoup
  • Remove all style, scripts, and HTML tags using BeautifulSoup
  • Insert tags or strings immediately before and after specified tags using BeautifulSoup
  • How to parse local HTML file in Python?
  • How to use Xpath with BeautifulSoup ?
  • BeautifulSoup – Wrap an element in a new tag
  • BeautifulSoup – Parsing only section of a document
  • How to write the output to HTML file with Python BeautifulSoup?
  • Encoding in BeautifulSoup
  • How to Scrape Nested Tags using BeautifulSoup?
  • Convert XML structure to DataFrame using BeautifulSoup – Python

BeautifulSoup Exercises and Projects

  • Get all HTML tags with BeautifulSoup
  • Find the title tags from a given html document using BeautifulSoup in Python
  • Extract all the URLs that are nested within <li> tags using BeautifulSoup
  • Get a list of all the heading tags using BeautifulSoup
  • BeautifulSoup – Scraping List from HTML
  • BeautifulSoup – Scraping Paragraphs from HTML
  • How to Scrape all PDF files in a Website?
  • Downloading PDFs with Python using Requests and BeautifulSoup
  • How to Extract Weather Data from Google in Python?
  • How to Scrape Videos using Python ?


 


