# web scraping framework
import scrapy

# for regular expressions
import re

# for selenium requests
from scrapy_selenium import SeleniumRequest

# for link extraction
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor


class EmailtrackSpider(scrapy.Spider):
    # name of the spider
    name = 'emailtrack'

    # set that stores unique email ids
    uniqueemail = set()

    # start_requests sends a request to https://www.geeksforgeeks.org/
    # and the parse function is called on its response
    def start_requests(self):
        yield SeleniumRequest(
            url="https://www.geeksforgeeks.org/",
            wait_time=3,
            screenshot=True,
            callback=self.parse,
            dont_filter=True
        )

    def parse(self, response):
        # extract all links from the page source
        links = LxmlLinkExtractor(allow=()).extract_links(response)

        # Finallinks contains the URLs of those links
        Finallinks = [str(link.url) for link in links]

        # links list for URLs that may contain email ids
        links = []

        # keep only the URLs we need: "about" and "contact"
        # pages are the ones that usually carry email ids
        for link in Finallinks:
            if ('Contact' in link or 'contact' in link or
                    'About' in link or 'about' in link or
                    'CONTACT' in link or 'ABOUT' in link):
                links.append(link)

        # the current page URL is also added because a few sites
        # list email ids on their main page
        links.append(str(response.url))

        # parse_link is called to extract the email ids
        l = links[0]
        links.pop(0)

        # meta transfers the links list from parse to parse_link
        yield SeleniumRequest(
            url=l,
            wait_time=3,
            screenshot=True,
            callback=self.parse_link,
            dont_filter=True,
            meta={'links': links}
        )

    def parse_link(self, response):
        # response.meta['links'] retrieves the links list
        links = response.meta['links']
        flag = 0

        # links containing any of the following bad words are discarded
        bad_words = ['facebook', 'instagram', 'youtube',
                     'twitter', 'wiki', 'linkedin']

        for word in bad_words:
            # if a bad word is found in the current page URL,
            # flag is set to 1
            if word in str(response.url):
                flag = 1
                break

        # if flag is 1, no emails are extracted from this page
        if flag != 1:
            html_text = str(response.text)

            # regular expression used to match email ids
            email_list = re.findall(r'\w+@\w+\.\w+', html_text)

            # convert to a set to keep only unique email ids
            email_list = set(email_list)
            if len(email_list) != 0:
                for i in email_list:
                    # add the email ids to the final uniqueemail set
                    self.uniqueemail.add(i)

        # parse_link is called again while links remain;
        # once the list is empty, move on to the parsed function
        if len(links) > 0:
            l = links[0]
            links.pop(0)
            yield SeleniumRequest(
                url=l,
                callback=self.parse_link,
                dont_filter=True,
                meta={'links': links}
            )
        else:
            yield SeleniumRequest(
                url=response.url,
                callback=self.parsed,
                dont_filter=True
            )

    def parsed(self, response):
        # emails list built from the uniqueemail set
        emails = list(self.uniqueemail)
        finalemail = []

        for email in emails:
            # filter out garbage matches by checking for '.in', '.com',
            # 'info' and 'org', and append valid email ids to finalemail
            if ('.in' in email or '.com' in email or
                    'info' in email or 'org' in email):
                finalemail.append(email)

        # final unique email ids from the geeksforgeeks site
        print('\n' * 2)
        print("Emails scraped", finalemail)
        print('\n' * 2)
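The spider above only runs if scrapy-selenium is wired into the project settings. As a minimal sketch, the entries below follow the settings documented by the scrapy-selenium package (driver name, driver executable path, browser arguments, and its downloader middleware); the chromedriver path is an assumption and must be adjusted to wherever the driver lives on your machine.

# settings.py -- minimal scrapy-selenium configuration (sketch)
SELENIUM_DRIVER_NAME = 'chrome'

# assumed location of chromedriver; change this to your own path
SELENIUM_DRIVER_EXECUTABLE_PATH = '/usr/local/bin/chromedriver'

# run the browser without a visible window
SELENIUM_DRIVER_ARGUMENTS = ['--headless']

# enable the scrapy-selenium downloader middleware
DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}

With these settings in place, the spider is started from the project root with scrapy crawl emailtrack, and the scraped addresses are printed by parsed() once the crawl finishes.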