
Web scraping from Wikipedia using Python – A Complete Guide

Last Updated : 09 Jan, 2023

In this article, you will learn the core concepts of web scraping and get comfortable extracting data from a web page. The goal is to scrape the Wikipedia Main Page and parse it step by step. Along the way you will get familiar with Python modules for web scraping and with the processes of data extraction and data processing. Web scraping is the automated extraction of information from the web. This article introduces the technique, the Python libraries commonly used for it, and a worked example on Wikipedia.

Introduction to Web scraping and Python

Web scraping is a technique in which data from one or many websites is passed through scraping software written in a programming language. The result is structured data that can be saved locally, typically as a spreadsheet (Excel or CSV) or as JSON. Instead of copying and pasting data from websites by hand, we let a scraper perform that task for us in a couple of seconds.
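To make "structured data saved locally" concrete, here is a minimal sketch of the last step of a scraper: writing extracted records out as CSV and JSON. The records and the helper names `to_csv` and `to_json` are hypothetical and stand in for whatever a real scraper would produce. Only the standard library is used.

```python
import csv
import io
import json

# Hypothetical records, standing in for data a scraper has already extracted.
records = [
    {"title": "Main Page", "language": "en"},
    {"title": "Python (programming language)", "language": "en"},
]


def to_csv(rows):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["title", "language"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_json(rows):
    """Serialize the same rows to a JSON string."""
    return json.dumps(rows, indent=2)


csv_text = to_csv(records)
json_text = to_json(records)
print(csv_text)
print(json_text)
```

To save the result to disk, the same text could be written to a file opened with `open(path, "w", newline="", encoding="utf-8")`.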

Web scraping is also known as Screen Scraping, Web Data Extraction, Web Harvesting, etc.

Process of Web scraping

Python's readable syntax helps programmers write clear, logical code for small and large-scale projects, and it is one of the most popular languages for web scraping. It is an all-rounder and can handle most scraping and crawling tasks smoothly. Scrapy and Beautiful Soup are among the most widely used Python tools for the job, and they make scraping in this language an easy route to take.

A brief list of Python libraries used for web scraping

Let’s see the web scraping libraries in Python!

  • Requests (HTTP for Humans) Library for Web Scraping – It is used for making various types of HTTP requests like GET, POST, etc. It is the most basic yet the most essential of these libraries.
  • lxml Library for Web Scraping – The lxml library provides fast, high-performance parsing of HTML and XML content from websites. If you are planning to scrape large datasets, this is a good choice.
  • Beautiful Soup Library for Web Scraping – It builds a parse tree from a document so that its content can be navigated and searched. A good starting library for beginners and very easy to work with.
  • Selenium Library for Web Scraping – Originally made for automated testing of web applications, this library drives a real browser, so it can scrape dynamically populated websites, which the libraries above cannot handle on their own. Driving a browser also makes it slower, so it is less suited to large-scale projects.
  • Scrapy for Web Scraping – The BOSS of all libraries, an entire web scraping framework that issues requests asynchronously. This makes it fast and increases efficiency.

Practical Implementation – Scraping Wikipedia

Steps of web scraping

Step 1: How to use python for web scraping?

  • We need a Python IDE and should be familiar with using it.
  • Virtualenv is a tool to create isolated Python environments. With the help of virtualenv, we can create a folder that contains all necessary executables to use the packages that our Python project requires. Here we can add and modify Python modules without affecting any global installation.
  • We need to install the required Python modules and libraries using the pip command. We should also always check whether scraping the target website is permitted, for example by reading its terms of use and its robots.txt file.
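One practical way to act on the last point is to check a site's robots.txt before requesting a page. The sketch below uses the standard library's `urllib.robotparser`. The robots.txt content, the user agent name `my-scraper`, and the `example.org` URLs are all hypothetical and inlined so the example runs offline. A robots.txt check is only one signal and does not replace reading a site's terms of use.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.org/wiki/Main_Page"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.org/private/report"))
```

For a live site, `RobotFileParser.set_url()` followed by `read()` downloads the real robots.txt instead of parsing an inline string.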

Requirements:

  • Requests: It is an efficient HTTP library used for accessing web pages.
  • urllib3: It is an HTTP client library used for retrieving data from URLs.
  • Selenium: It is an open-source automated testing suite for web applications across different browsers and platforms.

Installation:

pip install virtualenv
python -m pip install selenium
python -m pip install requests
python -m pip install urllib3

Sample image during installing

Step 2: Introduction to Requests library

  • Here, we will learn how to use Python modules to fetch data from the web.
  • The Python requests library is used to download the webpage we are trying to scrape.

Requirements:

  • Python IDE
  • Python Modules
  • Requests library

Code Walk-Through:

URL: https://en.wikipedia.org/wiki/Main_Page

Python3




# import required modules
import requests
 
# get URL
page = requests.get("https://en.wikipedia.org/wiki/Main_Page")
 
# display status code
print(page.status_code)
 
# display scraped data
print(page.content)
 
 

Output:

The first thing we need to do to scrape a web page is to download it. We can download pages using the Python requests library. The library makes a GET request to a web server, which returns the HTML contents of the given web page. GET is just one of several request types that requests supports. The URL of our sample page is https://en.wikipedia.org/wiki/Main_Page, and the task is to download it using the requests.get() method. The call returns a Response object. Its status_code property indicates whether the page was downloaded successfully (200 means success), and its content property holds the body of the response, here the HTML of the page, as bytes.
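In practice a download can fail or hang, so it often helps to wrap the request in a small function. The following is a minimal sketch, not the article's original code: the function name `fetch_page`, the `HEADERS` value, and the 10-second timeout are assumptions chosen for illustration. It relies only on documented requests features (`headers`, `timeout`, `raise_for_status()`, and the `RequestException` base class).

```python
import requests

# Hypothetical identifying header; replace the contact details with your own.
HEADERS = {"User-Agent": "example-scraper/0.1 (contact: you@example.org)"}


def fetch_page(url, timeout=10):
    """Return the page body as text, or None if the request fails."""
    try:
        response = requests.get(url, headers=HEADERS, timeout=timeout)
        # Turn 4xx/5xx status codes into exceptions instead of checking manually.
        response.raise_for_status()
    except requests.exceptions.RequestException as error:
        print(f"Request failed: {error}")
        return None
    # .text is the decoded body; .content would give the raw bytes.
    return response.text
```

Calling `fetch_page("https://en.wikipedia.org/wiki/Main_Page")` would then return the HTML as a string, or None if the server refuses the request or the network is unavailable.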

Step 3: Introduction to Beautiful Soup for page parsing

Python offers many modules for data extraction. We are going to use BeautifulSoup for our purpose.

  • BeautifulSoup is a Python library for pulling data out of HTML and XML files.
  • It cannot fetch a web page by itself. It needs the markup as input (a string or an open file) to create a soup object.
  • Other tools, such as regular expressions and lxml, can serve the same purpose.
  • We then store the extracted data in a format such as CSV or JSON, or in a database such as MySQL.

Requirements:

  • Python IDE
  • Python Modules
  • Beautiful Soup library

Installation:

pip install beautifulsoup4

Code Walk-Through:

Python3




# import required modules
from bs4 import BeautifulSoup
import requests
 
# get URL
page = requests.get("https://en.wikipedia.org/wiki/Main_Page")
 
# scrape webpage
soup = BeautifulSoup(page.content, 'html.parser')
 
# display scraped data
print(soup.prettify())
 
 

Output:

As you can see above, we have now downloaded an HTML document. We can use the BeautifulSoup library to parse this document and extract the text from its p tags. We first import the library and create an instance of the BeautifulSoup class to parse the document. We can then print the HTML content of the page, nicely formatted, using the prettify() method of the BeautifulSoup object. Because all the tags are nested, we can move through the structure one level at a time. We can first select all the elements at the top level of the page using the children attribute of soup. Note that children returns an iterator rather than a list, so we need to call the list function on it.
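Moving through the tree one level at a time can be sketched on a small inline document. The HTML string below is a hypothetical miniature page used instead of a live download, so the result does not depend on the current state of Wikipedia.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page, used instead of a live download.
html = "<html><head><title>Demo</title></head><body><p>First</p><p>Second</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# .children is an iterator, so wrap it in list() to index into it.
top_level = list(soup.children)
html_tag = top_level[0]

# One level down: the <html> tag has two children, <head> and <body>.
head, body = list(html_tag.children)

# Another level down: collect the text of each child of <body>.
paragraphs = [child.get_text() for child in body.children]

print([node.name for node in top_level])
print(head.title.get_text())
print(paragraphs)
```

On a real page the top level usually also contains a doctype and whitespace strings, which is why stepping through children by hand quickly becomes tedious.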

Step 4: Digging deep into Beautiful Soup further

Three features that make Beautiful Soup so powerful:

  • Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. It doesn’t take much code to write an application.
  • Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don’t have to think about encodings unless the document doesn’t specify an encoding and Beautiful Soup can’t detect one. Then you just have to specify the original encoding.
  • Beautiful Soup sits on top of popular Python parsers like lxml and html5lib, allowing you to try out different parsing strategies or trade speed for flexibility. After parsing, we just have to store our data in a suitable format such as CSV or JSON, or in a database such as MySQL.
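The encoding behaviour described in the second point can be sketched as follows. The byte string is a hypothetical document fragment. The `from_encoding` argument is used here to state the original encoding explicitly, so the result does not depend on automatic detection.

```python
from bs4 import BeautifulSoup

# Hypothetical document fragment, supplied as raw UTF-8 bytes.
raw = "<p>café</p>".encode("utf-8")

# Incoming bytes are decoded to Unicode; from_encoding states the source encoding.
soup = BeautifulSoup(raw, "html.parser", from_encoding="utf-8")
text = soup.p.get_text()

# Outgoing markup can be encoded back to UTF-8 bytes.
encoded = soup.p.encode("utf-8")

print(text)
print(encoded)
```

Switching parsers is a matter of changing the second argument, for example to "lxml" or "html5lib", provided that package is installed.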

Requirements:

  • Python IDE
  • Python Modules
  • Beautiful Soup library

Code Walk-Through:

Python3




# import required modules
from bs4 import BeautifulSoup
import requests
 
# get URL
page = requests.get("https://en.wikipedia.org/wiki/Main_Page")
 
# scrape webpage
soup = BeautifulSoup(page.content, 'html.parser')
 
# materialize the top-level nodes (children is an iterator)
top_level = list(soup.children)
 
# find all occurrence of p in HTML
# includes HTML tags
print(soup.find_all('p'))
 
print('\n\n')
 
# return only text
# does not include HTML tags
print(soup.find_all('p')[0].get_text())
 
 

Output:

What we did above was useful for figuring out how to navigate a page, but stepping through the tree takes a lot of commands to do something fairly simple. If we want to extract every instance of a tag, we can instead use the find_all() method. Note that find_all() returns a list, so we have to loop through it, or use list indexing, to extract text. If you only want the first instance of a tag, you can use the find() method, which returns a single Tag object, or None if nothing matches.
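The difference between find_all() and find() can be seen on a small inline document. The HTML below is hypothetical and kept short so the results are easy to follow.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two paragraphs and no table.
html = """
<div>
  <p class="intro">Welcome to the demo page.</p>
  <p>It has <b>two</b> paragraphs.</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all() returns every match, so loop over it to extract text.
all_paragraphs = soup.find_all("p")
texts = [p.get_text() for p in all_paragraphs]

# find() returns only the first match ...
first = soup.find("p")

# ... or None when nothing matches, so check before using the result.
missing = soup.find("table")

print(texts)
print(first["class"])
print(missing)
```

Checking for None before calling methods on the result of find() avoids an AttributeError when a page does not contain the expected tag.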

Step 5: Exploring page structure with Chrome Dev tools and extracting information

The first thing we need to do is inspect the page using Chrome DevTools. If you’re using another browser, Firefox and Safari have equivalents. Chrome is recommended for following along.

You can start the developer tools in Chrome by clicking View -> Developer -> Developer Tools. You should end up with a panel at the bottom of the browser. Make sure the Elements panel is selected. The Elements panel shows all the HTML tags on the page and lets you navigate through them. It’s a really handy feature! By right-clicking on the page near one of the section headings and then clicking Inspect, we open the tag that contains that heading in the Elements panel.

Analyzing by Chrome Dev tools

Code Walk-Through:

Python3




# import required modules
from bs4 import BeautifulSoup
import requests
 
# get URL
page = requests.get("https://en.wikipedia.org/wiki/Main_Page")
 
# scrape webpage
soup = BeautifulSoup(page.content, 'html.parser')
 
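# select the element with id "mp-left"
# (named left_column to avoid shadowing the built-in name object)
left_column = soup.find(id="mp-left")
 
# find the nested tags with class "mp-h2"
items = left_column.find_all(class_="mp-h2")
result = items[0]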
 
# display tags
print(result.prettify())
 
 

Output:

Here we select an element by its id and then find the children that share a class. The element with id mp-left is the parent element, and its nested children have the class mp-h2. We take the first nested child and print it, formatted with the prettify() function.
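The same id-then-class lookup can be sketched on an inline document that imitates this structure. The HTML and its heading texts are hypothetical, and the helper name `headings_in` is an assumption for illustration. Unlike the walk-through, the sketch collects every matching child and tolerates a missing parent element.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating a parent with an id and children sharing a class.
html = """
<div id="mp-left">
  <h2 class="mp-h2">Featured article</h2>
  <p>Some text.</p>
  <h2 class="mp-h2">Did you know</h2>
</div>
<div id="mp-right">
  <h2 class="mp-h2">In the news</h2>
</div>
"""


def headings_in(soup, element_id, class_name):
    """Return the text of every class_name tag nested inside the element_id element."""
    container = soup.find(id=element_id)
    if container is None:
        # The id is not on the page; return an empty list instead of failing.
        return []
    return [tag.get_text(strip=True) for tag in container.find_all(class_=class_name)]


soup = BeautifulSoup(html, "html.parser")
left = headings_in(soup, "mp-left", "mp-h2")
absent = headings_in(soup, "no-such-id", "mp-h2")

print(left)
print(absent)
```

Searching inside the parent first keeps the heading from the other column out of the result, even though it has the same class.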

Conclusion and Digging deeper into Web scraping

We learned the core concepts of web scraping, scraped data from the Wikipedia Main Page, and parsed it with requests and Beautiful Soup. Along the way we looked at the main Python libraries for scraping and at how a scraper downloads a page, parses it, and extracts information from it.

Although web scraping serves many legitimate purposes, it can also be misused: unethical practitioners can easily retrieve data from companies and organizations and use it for their own ends. Used responsibly, data scraping in combination with big data can provide a company with market intelligence and help it identify critical trends, patterns, opportunities, and solutions. It is therefore reasonable to expect data scraping tools and practices to keep improving.

Uses of Web scraping



author
garingh128