Implementing Web Scraping in Python with BeautifulSoup

Last Updated : 02 Aug, 2024

There are mainly two ways to extract data from a website:

  • Use the API of the website (if it exists). For example, Facebook has the Facebook Graph API which allows retrieval of data posted on Facebook.
  • Access the HTML of the webpage and extract useful information/data from it. This technique is called web scraping or web harvesting or web data extraction.

This article discusses the steps involved in web scraping using Python's Beautiful Soup library. Steps involved in web scraping:

  1. Send an HTTP request to the URL of the webpage you want to access. The server responds to the request by returning the HTML content of the webpage. For this task, we will use a third-party HTTP library for Python: requests.
  2. Once we have accessed the HTML content, we are left with the task of parsing the data. Since most of the HTML data is nested, we cannot extract data simply through string processing. We need a parser which can create a nested/tree structure of the HTML data. There are many HTML parser libraries available; here we will use html5lib, which parses pages in the same lenient way a web browser does.
  3. Now, all we need to do is navigate and search the parse tree that we created, i.e. tree traversal. For this task, we will be using another third-party Python library, Beautiful Soup. It is a Python library for pulling data out of HTML and XML files.

Step 1: Installing the required third-party libraries

  • The easiest way to install external libraries in Python is to use pip. pip is a package management system used to install and manage software packages written in Python. All you need to do is:
pip install requests
pip install html5lib
pip install beautifulsoup4
  • Another way is to download them manually from these links:
    • requests
    • html5lib
    • beautifulsoup4
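
Once the packages are installed, a quick way to confirm the setup is to import each one and print its version. This is just a minimal sanity check, not part of the original tutorial:

Python
import requests
import bs4
import html5lib

# Print the installed versions to confirm each library imports correctly
print(requests.__version__)
print(bs4.__version__)
print(html5lib.__version__)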

Step 2: Accessing the HTML content from the webpage

Python
import requests

URL = "https://www.geeksforgeeks.org/data-structures/"
r = requests.get(URL)
print(r.content)

Let us try to understand this piece of code.

  • First of all, import the requests library.
  • Then, specify the URL of the webpage you want to scrape.
  • Send an HTTP request to the specified URL and save the response from the server in a response object called r.
  • Now, print r.content to get the raw HTML content of the webpage. It is of 'bytes' type (r.text gives the decoded string). A quick status check is sketched below.
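
Before parsing, it is also worth confirming that the request actually succeeded. A minimal sketch using the standard attributes of the requests response object (how you handle failures is up to you):

Python
r = requests.get(URL)

# 200 means the server returned the page successfully
if r.status_code == 200:
    html_bytes = r.content   # raw bytes of the page
    html_text = r.text       # decoded string version
else:
    print("Request failed with status code:", r.status_code)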

Note: Sometimes you may get a "Not accepted" error, so try adding a browser user agent like the one below. You can find the user agent for your device and browser here: https://deviceatlas.com/blog/list-of-user-agent-strings

Python
# Here the user agent is for the Edge browser on Windows 10. You can find your
# browser's user agent from the link given above.
headers = {'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246"}

r = requests.get(url=URL, headers=headers)
print(r.content)

Step 3: Parsing the HTML content 

Python
# This will not run on an online IDE
import requests
from bs4 import BeautifulSoup

URL = "http://www.values.com/inspirational-quotes"
r = requests.get(URL)

# If this line causes an error, run 'pip install html5lib' or install html5lib
soup = BeautifulSoup(r.content, 'html5lib')
print(soup.prettify())

A really nice thing about the BeautifulSoup library is that it is built on top of HTML parsing libraries like html5lib, lxml, html.parser, etc., so the BeautifulSoup object can be created and the parser library specified at the same time. In the example above,

soup = BeautifulSoup(r.content, 'html5lib')

We create a BeautifulSoup object by passing two arguments:

  • r.content : It is the raw HTML content.
  • html5lib : Specifying the HTML parser we want to use.
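
Because BeautifulSoup sits on top of several parsers, the parser argument can be swapped without changing the rest of the code. A short sketch: Python's built-in html.parser needs no extra installation (though it is less lenient than html5lib), and lxml is another common option if it is installed:

Python
# Only the second argument changes; the rest of the scraping code stays the same.
soup_builtin = BeautifulSoup(r.content, 'html.parser')  # ships with Python
# soup_lxml = BeautifulSoup(r.content, 'lxml')          # requires 'pip install lxml'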

When soup.prettify() is printed, it gives the visual representation of the parse tree created from the raw HTML content.

Step 4: Searching and navigating through the parse tree

Now, we would like to extract some useful data from the HTML content. The soup object contains all the data in a nested structure which can be extracted programmatically. In our example, we are scraping a webpage consisting of some quotes. So, we would like to create a program to save those quotes (and all relevant information about them).

Python
# Python program to scrape a website
# and save the quotes it contains
import requests
from bs4 import BeautifulSoup
import csv

URL = "http://www.values.com/inspirational-quotes"
r = requests.get(URL)

soup = BeautifulSoup(r.content, 'html5lib')

quotes = []  # a list to store quotes

table = soup.find('div', attrs = {'id': 'all_quotes'})

for row in table.findAll('div',
                         attrs = {'class': 'col-6 col-lg-3 text-center margin-30px-bottom sm-margin-30px-top'}):
    quote = {}
    quote['theme'] = row.h5.text
    quote['url'] = row.a['href']
    quote['img'] = row.img['src']
    quote['lines'] = row.img['alt'].split(" #")[0]
    quote['author'] = row.img['alt'].split(" #")[1]
    quotes.append(quote)

filename = 'inspirational_quotes.csv'
with open(filename, 'w', newline='') as f:
    w = csv.DictWriter(f, ['theme', 'url', 'img', 'lines', 'author'])
    w.writeheader()
    for quote in quotes:
        w.writerow(quote)

Before moving on, we recommend that you go through the HTML content of the webpage, which we printed using the soup.prettify() method, and try to find a pattern or a way to navigate to the quotes.

  • Notice that all the quotes are inside a div container whose id is 'all_quotes'. So, we find that div element (referred to as table in the code above) using the find() method:
table = soup.find('div', attrs = {'id':'all_quotes'}) 
  • The first argument is the HTML tag you want to search for and the second argument is a dictionary to specify the additional attributes associated with that tag. The find() method returns the first matching element. You can try to print table.prettify() to get a sense of what this piece of code does.
  • Now, within the table element, notice that each quote sits inside its own div container. So, we iterate through those div containers. Here, we use the findAll() method, which is similar to find() in terms of arguments but returns a list of all matching elements. Each quote is then iterated over using a variable called row. Now consider this piece of code:
for row in table.findAll('div', attrs = {'class': 'col-6 col-lg-3 text-center margin-30px-bottom sm-margin-30px-top'}):
    quote = {}
    quote['theme'] = row.h5.text
    quote['url'] = row.a['href']
    quote['img'] = row.img['src']
    quote['lines'] = row.img['alt'].split(" #")[0]
    quote['author'] = row.img['alt'].split(" #")[1]
    quotes.append(quote)
  • We create a dictionary to save all information about a quote. The nested structure can be accessed using dot notation. To access the text inside an HTML element, we use .text : 
quote['theme'] = row.h5.text
  • We can add, remove, modify and access a tag’s attributes. This is done by treating the tag as a dictionary:
quote['url'] = row.a['href']
  • Lastly, all the quotes are appended to the list called quotes.
  • Finally, we would like to save all our data in some CSV file.
filename = 'inspirational_quotes.csv'
with open(filename, 'w', newline='') as f:
    w = csv.DictWriter(f, ['theme', 'url', 'img', 'lines', 'author'])
    w.writeheader()
    for quote in quotes:
        w.writerow(quote)
  • Here we create a CSV file called inspirational_quotes.csv and save all the quotes in it for any further use.
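
To double-check the output, the saved file can be read back with the standard library's csv.DictReader. This is just a quick verification sketch, not part of the original program:

Python
import csv

# Read the saved quotes back and print a short summary of the first few rows
with open('inspirational_quotes.csv', newline='') as f:
    for i, row in enumerate(csv.DictReader(f)):
        print(row['theme'], '-', row['author'])
        if i == 4:  # show only the first five rows
            break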

So, this was a simple example of how to create a web scraper in Python. From here, you can try to scrape any other website of your choice. In case of any queries, post them below in the comments section.
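
When adapting this pattern to another site, keep in mind that the tags and attributes the code relies on may be missing on some elements, in which case expressions like row.h5.text or row.img['alt'] raise exceptions. A more defensive variant of the extraction loop is sketched below; the selectors are placeholders that you would replace with whatever you find on your own target page:

Python
for row in table.findAll('div'):  # replace 'div' with the container you identified
    quote = {}
    # find() returns None and get() returns a default when something is missing,
    # so a malformed row is skipped instead of crashing the scraper
    heading = row.find('h5')
    link = row.find('a')
    image = row.find('img')
    if heading is None or link is None or image is None:
        continue
    quote['theme'] = heading.text
    quote['url'] = link.get('href', '')
    quote['img'] = image.get('src', '')
    quotes.append(quote)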

Note: Web scraping may violate a website's terms of service and can be illegal in some cases. It may also cause your IP to be blocked permanently by a website. This blog is contributed by Nikhil Kumar.

