
How to run Scrapy spiders in Python

Last Updated : 24 Apr, 2025

In this article, we will discuss how to schedule Scrapy crawl execution programmatically using Python. Scrapy is a powerful web scraping framework, and it is often necessary to run a crawl at specific intervals. Scheduling crawl execution programmatically automates the scraping process and ensures that you always have up-to-date data.

Required Packages

Install Scrapy and the schedule library:

pip install schedule
pip install scrapy

Schedule Scrapy Crawl

In order to schedule Scrapy crawl execution, we will use the schedule library. This library allows us to schedule a task to be executed at a specific time or interval.

Step 1: Create a new folder

Step 2: Inside the folder, start a new project with the following command:

scrapy startproject <project_name>
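
For example, with a hypothetical project name quotes_crawler, the command produces the standard skeleton that Scrapy generates:

scrapy startproject quotes_crawler

quotes_crawler/
    scrapy.cfg            # deploy configuration file
    quotes_crawler/       # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # project middlewares
        pipelines.py      # project pipelines
        settings.py       # project settings
        spiders/          # directory where your spiders live
            __init__.py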

Step 3: Import the schedule library and create a function that runs the Scrapy crawl.

Python3
import schedule
import time
from scrapy import cmdline

def crawl():
    cmdline.execute("scrapy crawl my_spider".split())

Step 4: Use the schedule library to schedule the crawl function to run at a specific interval.

In this example, the crawl function is scheduled to run every 5 minutes. schedule.run_pending() checks whether any scheduled tasks are due to run, and time.sleep(1) keeps the loop from consuming all available CPU. You can also run the task at a specific time of day with schedule.every().day.at("10:30").do(crawl) (see the sketch after the code below), and you can clear all scheduled tasks with schedule.clear().

Python3
schedule.every(5).minutes.do(crawl)

while True:
    schedule.run_pending()
    time.sleep(1)
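
For instance, the daily variant mentioned above would look like the following minimal sketch, which reuses the crawl function from Step 3 (my_spider is a placeholder spider name):

Python3
import schedule
import time
from scrapy import cmdline

def crawl():
    # placeholder spider name; replace my_spider with your own spider
    cmdline.execute("scrapy crawl my_spider".split())

# run the crawl once a day at 10:30 instead of every 5 minutes
schedule.every().day.at("10:30").do(crawl)

while True:
    schedule.run_pending()
    time.sleep(1)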

Example 1

Create a new folder. Inside the folder, start a new project. Create a WikiSpider.py file. This code uses the Scrapy library to create a spider that scrapes data from Wikipedia. The spider, called "WikiSpider", starts at the URL "https://en.wikipedia.org/wiki/Database" and is configured with a number of custom settings, such as the user agent, download delay, and the number of concurrent requests. The spider's parse method is called for each page the spider downloads; it extracts the title and all the paragraphs from the page and writes them to a text file named "wiki.txt". The code also uses the schedule library to run the spider every 30 seconds, with an infinite loop that keeps the scheduler running until it is stopped manually.

Python3
import scrapy
import schedule
import time
from scrapy import cmdline

# This class is a spider for scraping data from Wikipedia
class WikiSpider(scrapy.Spider):
    name = "wiki"
    # the starting URL for the spider to crawl
    start_urls = ["https://en.wikipedia.org/wiki/Database"]
    # settings for the spider such as the user agent, download delay,
    # and number of concurrent requests
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/89.0.4389.82 Safari/537.36',
        'DOWNLOAD_DELAY': 1,
        'CONCURRENT_REQUESTS': 1,
        'RETRY_TIMES': 3,
        'RETRY_HTTP_CODES': [500, 503, 504, 400, 403, 404, 408],
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
            'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
            'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
        }
    }

    # parse method that is called for each response the spider downloads
    def parse(self, response):
        # get the title of the page
        title = response.css("title::text").get()
        # get all the paragraphs from the page
        paragraphs = response.css("p::text").getall()
        print(title)
        print(paragraphs)
        # open the file in write mode and save the scraped text
        with open("wiki.txt", "w") as f:
            f.write(title + "\n")
            for para in paragraphs:
                f.write(para + "\n")

# function to run the spider
def crawl_wiki():
    cmdline.execute("scrapy runspider WikiSpider.py".split())

# schedule the spider to run every 30 seconds
schedule.every(30).seconds.do(crawl_wiki)

# infinite loop to run the scheduled spider
while True:
    schedule.run_pending()
    time.sleep(1)
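
One caveat with the pattern above: scrapy.cmdline.execute raises SystemExit when the command it runs finishes, and because the scheduling loop sits at module level, scrapy runspider re-enters it every time it imports the file. A more robust variant, sketched below as an assumption rather than the article's original code, replaces the bottom of WikiSpider.py so that the spider runs in a child process and the scheduler only starts when the file is executed directly with python WikiSpider.py:

Python3
import subprocess
import schedule
import time

def crawl_wiki():
    # run the spider in a separate process so the scheduler
    # keeps running after each crawl finishes
    subprocess.run(["scrapy", "runspider", "WikiSpider.py"])

if __name__ == "__main__":
    # only start the scheduler when this file is run directly,
    # not when Scrapy imports the module to find the spider class
    schedule.every(30).seconds.do(crawl_wiki)
    while True:
        schedule.run_pending()
        time.sleep(1)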

Output: Run the Spider

scrapy runspider WikiSpider.py

On running WikiSpider.py, a wiki.txt file is created containing the text of https://en.wikipedia.org/wiki/Database, refreshed every 30 seconds as scheduled.

[Output screenshot: wiki.txt is created on running the spider]

Example 2

Here is an example of a Scrapy spider that scrapes quotes from a website and prints the output to the console. The spider is scheduled to run every 30 seconds using the schedule library.

Create a new folder. Inside the folder, start a new project (Quotes). Create a QuotesSpider.py file. This code uses the Scrapy library to create a spider that scrapes quotes from a website. The spider, called "QuotesSpider", starts at the URL "http://quotes.toscrape.com/page/1/". The spider's parse method is called for each page the spider downloads; it extracts the text, author, and tags of each quote and yields them as a dictionary. It also checks for a next page and follows the link if one exists. The code also uses the schedule library to run the spider every 30 seconds, with an infinite loop that keeps the scheduler running until it is stopped manually.

Python3
import scrapy
import schedule
import time
from scrapy import cmdline

# This class is a spider for scraping data from the quotes website
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    # the starting URL for the spider to crawl
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]
    # settings for the spider such as the user agent, download delay,
    # and number of concurrent requests
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/89.0.4389.82 Safari/537.36',
        'DOWNLOAD_DELAY': 1,
        'CONCURRENT_REQUESTS': 1,
        'RETRY_TIMES': 3,
        'RETRY_HTTP_CODES': [500, 503, 504, 400, 403, 404, 408],
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
            'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
            'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
        }
    }

    # parse method that is called for each response the spider downloads
    def parse(self, response):
        # extract the text, author, and tags of each quote on the page
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
        # check for the next page and follow the link if it exists
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

# function to run the spider
def crawl_quotes():
    cmdline.execute("scrapy runspider QuotesSpider.py".split())

# schedule the spider to run every 30 seconds
schedule.every(30).seconds.do(crawl_quotes)

# infinite loop to run the scheduled spider
while True:
    schedule.run_pending()
    time.sleep(1)
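
Scrapy also offers an in-process API, scrapy.crawler.CrawlerProcess, for running a spider from a plain Python script instead of shelling out. Its start() call blocks until the crawl finishes, and the underlying Twisted reactor cannot be restarted, so this suits a single run rather than the repeated scheduling shown above. A minimal sketch, assuming the QuotesSpider class defined in the code above:

Python3
from scrapy.crawler import CrawlerProcess

# run a single in-process crawl of the QuotesSpider class defined above
process = CrawlerProcess()
process.crawl(QuotesSpider)
process.start()  # blocks until the crawl is finished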

Output: Run the Spider

scrapy runspider QuotesSpider.py
[Output screenshot: scraped quotes printed to the console]
This output will be printed to the console every time the spider runs, as specified in the schedule.

