
Parsing and Processing URL using Python - Regex

Last Updated : 02 Sep, 2020

Prerequisite: Regular Expression in Python

A URL (Uniform Resource Locator) is made up of several parts, such as the protocol, the hostname (domain name), an optional port number, and the path. These parts can be extracted with regular expressions, which Python provides through the built-in re module.

Example:

URL: https://www.geeksforgeeks.org/courses

Parsing the above URL gives:

Hostname: geeksforgeeks.org
Protocol: https

We use the re.findall() function of the re module to search the URL for the required pattern.

Syntax: re.findall(regex, string)  

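Return: all non-overlapping matches of the pattern in the string, as a list. If the pattern has no capture group, each item is the full match; with one capture group, each item is that group's text; with several capture groups, each item is a tuple holding one string per group.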
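How the return value of re.findall() changes with the number of capture groups can be seen in a small sketch. The sample string and the variable names are illustrative assumptions and are not taken from the examples that follow.

```python
import re

# Illustrative sample text (an assumption made for this sketch).
text = 'http://a.org and https://b.org'

# No capture group: each item is the whole match.
whole = re.findall(r'\w+://', text)

# One capture group: each item is only that group's text.
single = re.findall(r'(\w+)://', text)

# Several capture groups: each item is a tuple, one entry per group.
pairs = re.findall(r'(\w+)://([\w.]+)', text)

print(whole)   # ['http://', 'https://']
print(single)  # ['http', 'https']
print(pairs)   # [('http', 'a.org'), ('https', 'b.org')]
```

This is why the examples below wrap the part of interest in parentheses: the surrounding '://' helps locate the match but is left out of the result.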

Now, let's see the examples:

Example 1: In this example, we extract the protocol and the hostname from the given URL.

  • Regular expression for extracting the protocol group: '(\w+)://'.
  • Regular expression for extracting the hostname group: '://www\.([\w\-.]+)'. The dot after www is escaped so that it matches only a literal dot.

Meta characters Used:

  • \w: Matches any word character, that is, a letter, a digit, or an underscore. For ASCII text this is equivalent to the class [a-zA-Z0-9_].
  • +: Matches one or more occurrences of the preceding element.
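One further metacharacter matters in URL patterns: an unescaped '.' matches any character, while '\.' matches only a literal dot. The following sketch shows the difference on a made-up URL; the URL and the variable names are assumptions chosen for illustration.

```python
import re

# Hypothetical URL in which 'X' stands where the dot after 'www' should be.
s = 'https://wwwXexample.org/'

# Unescaped '.' matches any character, so the 'X' is accepted after 'www'.
loose = re.findall(r'://www.([\w\-.]+)', s)

# Escaped '\.' matches only a literal dot, so this URL is rejected.
strict = re.findall(r'://www\.([\w\-.]+)', s)

print(loose)   # ['example.org']
print(strict)  # []
```

Inside a character class such as [\w\-.], the dot is already literal, so it does not need a backslash there.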

Code:

Python3
# import library
import re

# url link
s = 'https://www.geeksforgeeks.org/'

# finding the protocol
obj1 = re.findall(r'(\w+)://', s)
print(obj1)

# finding the hostname, which may
# contain dashes or dots
obj2 = re.findall(r'://www\.([\w\-.]+)', s)
print(obj2)

Output:

['https']
['geeksforgeeks.org']
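The hostname pattern in Example 1 only matches URLs whose host starts with 'www'. One possible variation, given here as a sketch rather than as part of the original example, makes that prefix optional with a non-capturing group. The helper name get_host and the sample URLs are assumptions.

```python
import re

def get_host(url):
    """Return the hostname without a leading 'www.', or None if nothing matches."""
    # (?:www\.)? optionally consumes 'www.' without creating a capture group.
    match = re.search(r'://(?:www\.)?([\w\-.]+)', url)
    return match.group(1) if match else None

print(get_host('https://www.geeksforgeeks.org/'))  # geeksforgeeks.org
print(get_host('https://docs.python.org/3/'))      # docs.python.org
print(get_host('not a url'))                       # None
```

re.search() is used instead of re.findall() because only the first hostname in the string is of interest.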

Example 2: A URL may also carry a port number, as in 'file://localhost:4040/abc_file'. Here the port number '4040' follows the ':' sign and consists only of digits, so it is matched with ':(\d+)'. Because not every URL includes a port number, this part is made optional with the '?' quantifier: '(:(\d+))?'.

Meta characters Used:

  • \d: Matches any decimal digit. For ASCII text this is equivalent to the class [0-9].
  • +: Matches one or more occurrences of the preceding element.
  • ?: Matches zero or one occurrence of the preceding element.

Code:

Python3
# import library
import re

# url link
s = 'file://localhost:4040/abc_file'

# finding the protocol
obj1 = re.findall(r'(\w+)://', s)
print(obj1)

# finding the hostname, which may
# contain dashes or dots
obj2 = re.findall(r'://([\w\-.]+)', s)
print(obj2)

# finding the hostname together with
# the optional port number
obj3 = re.findall(r'://([\w\-.]+)(:(\d+))?', s)
print(obj3)

Output:

['file']
['localhost']
[('localhost', ':4040', '4040')]
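The last result contains both ':4040' and '4040' because the optional part and the digits inside it are each a capture group. A sketch of one way to avoid the redundant item uses a non-capturing group (?:...) together with named groups. The names HOST_PORT and host_and_port are assumptions introduced for illustration.

```python
import re

# Hostname followed by an optional port; (?:...) groups without capturing.
HOST_PORT = re.compile(r'://(?P<host>[\w\-.]+)(?::(?P<port>\d+))?')

def host_and_port(url):
    """Return (host, port) with port as an int or None; None if nothing matches."""
    m = HOST_PORT.search(url)
    if m is None:
        return None
    port = m.group('port')
    return (m.group('host'), int(port) if port is not None else None)

print(host_and_port('file://localhost:4040/abc_file'))  # ('localhost', 4040)
print(host_and_port('https://www.geeksforgeeks.org/'))  # ('www.geeksforgeeks.org', None)
```

When the port is absent, the named group yields None, which makes the optional part easy to test for.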

Example 3: For a URL of the form protocol://hostname/name.extension, a single pattern can extract the protocol, the hostname, and the path elements at once.

Python3
# import library
import re

# url
s = 'http://www.example.com/index.html'

# searching for all capture groups; the dot
# before the file extension is escaped
obj = re.findall(r'(\w+)://([\w\-.]+)/(\w+)\.(\w+)', s)
print(obj)

Output:

[('http', 'www.example.com', 'index', 'html')]
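The pattern in Example 3 expects exactly one path element of the form name.extension. A more tolerant sketch, combining the pieces from the three examples, could use named groups and treat the port and the path as optional. The names URL_PATTERN and parse_url are assumptions, and the pattern is a simplification that does not handle query strings, fragments, or user information.

```python
import re

# Adjacent string literals are joined into a single pattern.
URL_PATTERN = re.compile(
    r'(?P<scheme>\w+)://'     # protocol, e.g. http or file
    r'(?P<host>[\w\-.]+)'     # hostname, which may contain dashes or dots
    r'(?::(?P<port>\d+))?'    # optional port number
    r'(?P<path>/[^?#\s]*)?'   # optional path, up to '?', '#', or whitespace
)

def parse_url(url):
    """Return a dict of URL parts, or None if the string does not match."""
    m = URL_PATTERN.match(url)
    return m.groupdict() if m else None

print(parse_url('http://www.example.com/index.html'))
# {'scheme': 'http', 'host': 'www.example.com', 'port': None, 'path': '/index.html'}
print(parse_url('file://localhost:4040/abc_file'))
# {'scheme': 'file', 'host': 'localhost', 'port': '4040', 'path': '/abc_file'}
```

For production code, the standard library's urllib.parse module is usually a safer choice than a hand-written pattern.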

sangy987