Web Scraping Target Discovery

Most web scrapers are made up of two core parts: finding products on the website and actually scraping them. The former is often referred to as the "target discovery" step. For example, to scrape product data from an e-commerce website, we first need to find the URL of each individual product; only then can we scrape its data.

Discovering targets to scrape is often a challenging and important task. This series of blog posts tagged with #discovery-methods (also see main article) covers common target discovery approaches.

In this series we'll take a look at common discovery methods used in web scraping, each different enough to have its own risks, drawbacks and benefits. We'll target an example clothing store website, https://hm.com, for all of these discovery approaches:

Sitemaps - using website sitemap indexes.
Search API - reverse engineering the website's search API.
Indexes - taking advantage of existing indexes and search engines.
Crawling - recursively scraping the whole website to find what we're looking for.

These are the 4 main approaches to target discovery in web scraping, and knowing them is a great tool in a web scraper developer's utility belt.
To quickly summarize our 4 main articles, the discovery strategies vary quite a bit:

Sitemaps are probably the best approach as they are fast, safe and easy to implement; unfortunately, the sitemaps of many websites are neglected or contain dated links.
Reverse engineering a website's search API, on the other hand, is efficient and produces great results; however, it requires reverse-engineering knowledge and can be difficult and time-consuming to implement.
Crawling is a great general approach, but it's risky, slow and resource intensive.
Finally, taking advantage of existing indexes is a great last resort for websites that don't like to be scraped, since they still want to be indexed by search engines or other indexers.
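
To give a rough feel for why sitemaps are considered the easy option, here is a minimal sketch of pulling product URLs out of a sitemap document. It assumes the sitemap follows the standard sitemaps.org XML schema; the extract_sitemap_urls function, the example.com URLs and the "/products/" filter are all hypothetical and only for illustration, and a real scraper would first download the sitemap (and any nested sitemap indexes) from the target website.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_sitemap_urls(xml_text: str, must_contain: str = "") -> list[str]:
    """Return every <loc> URL in a sitemap that contains `must_contain`.

    Hypothetical helper: the substring filter stands in for whatever
    rule distinguishes product pages on the real target.
    """
    root = ET.fromstring(xml_text)
    urls = []
    for loc in root.iterfind(".//sm:loc", SITEMAP_NS):
        url = (loc.text or "").strip()
        if url and must_contain in url:
            urls.append(url)
    return urls


# Made-up sitemap content, standing in for a downloaded sitemap file.
EXAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/products/red-shirt.html</loc></url>
  <url><loc>https://example.com/about.html</loc></url>
  <url><loc>https://example.com/products/blue-jeans.html</loc></url>
</urlset>"""

product_urls = extract_sitemap_urls(EXAMPLE_SITEMAP, must_contain="/products/")
for url in product_urls:
    print(url)
```

The other strategies need noticeably more machinery (request replay for search APIs, link queues and deduplication for crawling), which is part of the trade-off summarized above.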

So which one to use? It really depends on your target and resources.
Hopefully this extensive blog series can help you determine the right way to find your data targets!

For more web scraping discovery techniques, see #discovery-methods; for other discovery-related subjects, see #discovery.

If you have any questions, come join us on #web-scraping on Matrix, check out #web-scraping on Stack Overflow or leave a comment below!

As always, you can hire me for web scraping consultation over at the hire page. Happy scraping!

image credits: "Telescopes, Mauna Kea, Hawaii" by Gregory Williams is licensed under CC BY-NC-ND 2.0