Python Scrapy Tutorial for Beginners

June 28, 2024

Meet the Author : Mr. Bharani Kumar

Bharani Kumar Depuru is a well-known IT professional from Hyderabad. He is the Founder and Director of AiSPRY and 360DigiTMG. An alumnus of IIT and ISB with more than 17 years of experience, he has held prominent positions at leading firms such as HSBC, ITC Infotech, Infosys, and Deloitte. He is an IT consultant specializing in Industrial Revolution 4.0 implementation, Data Analytics practice setup, Artificial Intelligence, Big Data Analytics, Industrial IoT, Business Intelligence, and Business Management. Bharani Kumar is also the chief trainer at 360DigiTMG, with more than ten years of experience, and has been making the IT transition journey easy for his students. 360DigiTMG is at the forefront of delivering quality education, thereby bridging the gap between academia and industry.

History

Scrapy was created at Mydeco, an online aggregation and e-commerce company, and was maintained by Mydeco and Insophia, a web consulting firm based in Montevideo, Uruguay. Its first public release came in 2008 under the BSD license. In 2011, Zyte (formerly known as Scrapinghub) became Scrapy's official maintainer. Another significant milestone was the 1.0 release in June 2015.

What is Scrapy?

Scrapy is a framework for web scraping and web crawling, written in Python. It crawls websites and extracts structured data from their pages.

With web scraping, users can collect data for purposes such as online merchandising, price monitoring, and driving marketing decisions.

Scrapy's crawlers are called spiders: self-contained classes that follow a given set of instructions. Because each spider is self-contained, developers can reuse their code across spiders, which makes large crawling projects easier to build and scale.

The Scrapy framework has many powerful features that help users scrape without being blocked. Examples include AutoThrottle, configurable user agents, and, through middleware, rotating proxies. Scrapy also provides an interactive shell in which users can test their assumptions about a site's behavior.
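As a rough illustration, throttling and the user agent are controlled through a project's settings.py file. The setting names below are real Scrapy settings, but the values and the user-agent string are illustrative assumptions, not recommendations from this article.

```python
# settings.py (fragment): all values are illustrative assumptions

# AutoThrottle adjusts the crawl delay based on how quickly the site responds.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5.0          # initial download delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 60.0           # upper bound used when latency is high
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0   # average parallel requests per remote site

# Identify the crawler; this name and URL are hypothetical.
USER_AGENT = "example-bot/0.1 (+https://example.com/bot-info)"
```

Rotating proxies are not a single built-in setting; they are usually added through a downloader middleware.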

Now let’s see some of the companies using Scrapy:

Sciences Po Medialab,
Sayone Technologies,
Data.gov.uk’s World Government Data site,
Lyst,
Parse.ly, etc.

Why use Scrapy?

Python offers two popular packages for web scraping: Scrapy and Beautiful Soup (BS). Here we compare Scrapy with BS.

Both are free to use; the primary distinction is that BS is only a parser and data extractor. It cannot download pages on its own, so it is typically paired with another library, such as requests, which retrieves the pages for BS to parse.
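A minimal sketch of that division of labour, assuming the bs4 package is installed. The HTML snippet and its class names are made up for illustration; in a real script the HTML would be fetched separately, for example with requests.

```python
from bs4 import BeautifulSoup

# In a real script this HTML would come from a separate HTTP library,
# e.g. html = requests.get(url).text; a literal string keeps the sketch offline.
html = """
<html><body>
  <h1>Product list</h1>
  <ul>
    <li class="product">Lamp</li>
    <li class="product">Chair</li>
  </ul>
</body></html>
"""

# BS only parses the markup it is given; it never downloads anything itself.
soup = BeautifulSoup(html, "html.parser")
title = soup.find("h1").get_text()
products = [li.get_text() for li in soup.select("li.product")]
print(title, products)
```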

Scrapy, by contrast, is a framework that can download, process, and store data all by itself. Using its web crawling capabilities, Scrapy can also automatically follow links on web pages.

When a scraper only needs to do simple content parsing, BS is a great option since it is straightforward and simple to use. For a large project that requires extensive scraping, however, Scrapy is the better choice because of its superior performance and scalability.

Scrapy is a robust web scraping framework. Because it is asynchronous, it can send out several requests concurrently, which gives it an edge over other tools when speed and efficiency are required. It also has excellent built-in support for extracting data from HTML and XML sources using CSS and XPath expressions. It is generally faster than synchronous scraping setups, and it can be readily extended.

The main drawback of Scrapy is that Selenium handles JavaScript far better. Pages that load their content through AJAX/PJAX requests need JavaScript to run, and Scrapy does not execute JavaScript itself. To render such pages, a headless browser or a real browser must be used alongside Scrapy.

Scrapy has built-in support for exporting feeds in multiple formats. It is also considerably faster than BS at selecting and extracting data, although BS can be sped up with multithreading.

Some web pages load their data dynamically once they are opened in a web browser. Selectors cannot reach this data in the initial HTML. In this situation, it is necessary to identify the data source and extract the data straight from it.

With Scrapy, a developer writes custom code that specifies how data will be extracted from a site or collection of sites.

Spiders are fast because Scrapy is built on Twisted, an asynchronous networking framework. As a result, a spider can scrape numerous pages at once and operates quickly.

How do you use Scrapy in Python?

1. Creating a new Scrapy project.

The first step is to create a Scrapy project from the command line, by running scrapy startproject followed by the project name. This generates a new project folder containing the project skeleton. Inside it is a spiders folder where we put our spiders, or web crawlers.

2. Writing a spider to crawl a site and extract data.

Scrapy uses web spiders to extract information.

A web spider (also called a web crawler, spider bot, or simply crawler) is an internet bot that systematically browses the World Wide Web. Search engines typically operate them for web indexing.

Many websites and search engines use web-crawling software to update their web content. Web crawlers copy pages for processing by a search engine, which indexes them so that searches become more efficient.

For spiders, the scraping cycle goes through something like this:

The cycle starts with initial requests to crawl the first URLs, each with a callback function that is called with the response downloaded for that request.
The start_requests() method generates the initial requests for the URLs listed in start_urls, and the parse method acts as the default callback.
In the callback we parse the response and return item objects, request objects, or an iterable of these objects. Those requests can specify their own callbacks to handle the corresponding responses.
In callbacks we generally use Selectors (XPath or CSS) to parse page content, but BS, lxml, or another parsing technique can be used as well.
In the final step, the items returned from the spider are written to a file of a specific format using feed exports.

3. Exporting the scraped data using the command line.

The scraped data is then saved on the system using feed exports, which serialize it into a convenient format such as JSON, JSON Lines, CSV, or XML.
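Besides the command line (for example, scrapy crawl quotes -o quotes.json), feed exports can be configured in settings.py through the FEEDS setting. In this sketch the output file names are illustrative assumptions, and "quotes" is a hypothetical spider name.

```python
# settings.py (fragment): output file names are illustrative assumptions

# Each key is an output location; each value configures that feed.
FEEDS = {
    "quotes.json": {"format": "json", "encoding": "utf8"},
    "quotes.csv": {"format": "csv"},
}
```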

4. Changing spider to recursively follow links.

Spiders also allow us to follow links. To do this, we first look at the page's navigation and find the link that goes to the next page, which is generally a link with text such as "Next". From this link we need the href attribute, which we can pick out using selectors. Once this is done, we can use the response.follow() method to request the next page, and repeat this on every page to navigate through the website.

5. Using spider arguments.

We can use spider arguments to modify a spider's behavior. They are commonly used to define the start URLs or to restrict the crawl to certain sections of a site, but they can configure any other functionality of the spider as well.

Scrapy provides a lot of powerful features for making scraping easy and efficient, such as:

For HTML/XML sources, we can use selectors (XPath, CSS) to extract data.
An interactive shell lets us try out scraping code while writing or debugging spiders.
Feed exports support storing data in various formats like JSON, CSV, and XML, and in multiple backends like FTP, S3, and the local file system.
Robust encoding support and auto-detection let Scrapy deal with foreign, non-standard, or broken encoding declarations.
Strong extensibility support lets us plug in our own functionality through a well-defined API of middlewares, extensions, and pipelines.
A wide range of built-in extensions and middlewares handle:
robots.txt
crawl depth restriction
HTTP features like compression, authentication, caching
cookies and session handling
user-agent spoofing, and more
A Telnet console hooks into a Python console running inside the Scrapy process to debug a crawler.
Reusable spiders are provided to crawl sites from sitemaps.
A media pipeline automatically downloads images associated with scraped data.
A caching DNS resolver and much other functionality are also included.
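Several of the built-in extensions and middlewares listed above are controlled from settings.py. The fragment below is a sketch using real Scrapy setting names; the values and the storage folder are illustrative assumptions.

```python
# settings.py (fragment): all values are illustrative assumptions

ROBOTSTXT_OBEY = True        # respect each site's robots.txt rules
DEPTH_LIMIT = 3              # do not follow links more than 3 levels deep
HTTPCACHE_ENABLED = True     # cache responses to avoid re-downloading pages
COOKIES_ENABLED = True       # keep cookies and sessions between requests
COMPRESSION_ENABLED = True   # accept compressed (e.g. gzip) responses

# Built-in images pipeline: downloads URLs listed in an item's "image_urls" field.
ITEM_PIPELINES = {"scrapy.pipelines.images.ImagesPipeline": 1}
IMAGES_STORE = "downloaded_images"   # hypothetical local folder
```

The images pipeline additionally requires the Pillow library to be installed.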

Conclusion

Now that we are familiar with the fundamentals of web scraping, we also need to understand the guidelines for ethical data collection.

Web scraping behaviour guidelines

Ask politely. Before scraping a certain organization's data, it is usually a good idea to ask. If you're lucky, you might be able to obtain the data straight from the organization without the need for scraping.
Do not download private documents. Never scrape private content.
Check the laws in your area. It is always a good idea to be aware of what can be lawfully scraped.
Don't share illegally downloaded content. Scraping data for purposes covered by the fair use provisions of intellectual property law may be acceptable. However, it might not be lawful to share that data.
Share as much as you can. Sharing free and legal scraped data is a good idea because it may help others who need to complete the same task. GitHub and similar websites make sharing easy.
Avoid damaging the web. Not every website is designed to withstand heavy scraping, and some go down under the load. Scrapy spiders reduce this danger when run with conservative settings, such as the robots.txt compliance that the project template enables by default.
Publish your data in a reusable form. The extracted data should be made available in a form that makes it easy to discover. Include metadata about the data, and organize it logically.