Scrapecrow - https://scrapecrow.com/
<h1>A Year of Writing about Web Scraping in Review</h1>
<p>2023-01-02 · Bernardas Ališauskas</p>
<p>I'd like to share the favorite things I wrote in a year of professional web scraping blogging at <a href="https://scrapfly.io/">ScrapFly</a> and my key takeaways.</p><p>At the beginning of 2022 I joined <a href="https://scrapfly.io">Scrapfly.io</a> - a small web scraping API company. One of my main roles is to keep <a href="https://scrapfly.io/blog/">our blog</a> running with various tutorials and web scraping highlights.<br/>
So, to wrap up this year, here's a list of my favorite articles I've worked on in 2022 and what I've learned from working on them!</p>
<h2 id="introductions">Introductions</h2>
<p>Web scraping is a very accessible niche. You can write a scraper in almost any programming language as long as it has a viable HTTP client (which is basically any language). </p>
<p><a href="/images/hello-im.webp"><img class="center" loading="lazy" src="/images/hello-im.webp" title="" width="200"/></a><figcaption></figcaption></p>
<p>At ScrapFly, we have a diverse client base using all sorts of programming languages to scrape the web using our API.
So, I wrote multiple in-depth and up-to-date introductions to web scraping in several languages which was quite a journey on its own:</p>
<ul>
<li><a href="https://scrapfly.io/blog/web-scraping-with-python/">Python</a> is the most popular language in this medium for a reason - it's a great data language with strong async support and some of the best community libraries out there.</li>
<li><a href="https://scrapfly.io/blog/web-scraping-with-r/">R language</a> is surprisingly effective. The pipe operator fits into web scraping flows perfectly and is really fun to use.</li>
<li><a href="https://scrapfly.io/blog/web-scraping-with-php-101/">PHP</a> has a lot of great web tools that fit into web scraping though generally it's not a very good data language.</li>
<li><a href="https://scrapfly.io/blog/web-scraping-with-ruby/">Ruby</a> feels a lot like Python just with a smaller community and fewer tools.</li>
<li><a href="https://scrapfly.io/blog/web-scraping-with-nodejs/">Javascript/NodeJS</a> seems to be able to do everything and scraping is no exception. I'd say that it's not a very good data language so it wouldn't be my first choice for scraping.</li>
</ul>
<p>This was a great exercise to understand what makes a good web scraping environment.
Working on these articles provided really valuable horizontal growth as it presented the problem of web scraping from different perspectives. Every language seems to have something unique and special about it and having a general overview did help me to improve my development techniques and workflows.</p>
<h2 id="guides">Guides</h2>
<p>Another way we tried to reach out to scraper developers is by writing hands-on <strong><a href="https://scrapfly.io/blog/tag/scrapeguide/">scrape guides</a></strong> for popular scraping targets like: </p>
<ul>
<li><a href="https://scrapfly.io/blog/how-to-scrape-instagram/">Instagram</a> </li>
<li><a href="https://scrapfly.io/blog/how-to-scrape-bookingcom/">Booking.com</a> </li>
<li><a href="https://scrapfly.io/blog/how-to-scrape-amazon/">Amazon</a></li>
<li><a href="https://scrapfly.io/blog/how-to-scrape-zillow/">Zillow</a></li>
</ul>
<p>And <a href="https://scrapfly.io/blog/tag/scrapeguide/">dozens of others</a>.</p>
<p>For these, I had to choose a standard set of tools and libraries which was a difficult challenge. Using browser automation tools like <a href="https://scrapfly.io/blog/web-scraping-with-selenium-and-python/">Selenium</a>, <a href="https://scrapfly.io/blog/web-scraping-with-playwright-and-python/">Playwright</a> or <a href="https://scrapfly.io/blog/web-scraping-with-puppeteer-and-nodejs/">Puppeteer</a> can often be easier but they don't
fit the tutorial format well as browsers are complicated and work differently on different user machines. </p>
<p><a href="/images/guide.svg"><img class="center" loading="lazy" src="/images/guide.svg" title="" width="200"/></a><figcaption></figcaption></p>
<p>I've settled on <strong>Python</strong> with <a href="https://pypi.org/project/httpx/">httpx</a>, <a href="https://pypi.org/project/parsel/">parsel</a> and <a href="https://scrapfly.io/blog/parse-json-jmespath-python/">jmespath</a> as the de facto scraping environment as it's really powerful and approachable. </p>
<p>With this setup, there's more focus on reverse engineering and web technologies rather than automation but it's easier to figure out how to explain technical concepts than to deal with replication issues of web browsers.</p>
<p>As for the library stack: </p>
<ul>
<li><code>httpx</code> is a brilliant http client which supports http2 and respects RFC standards. It's probably the best HTTP client out there outside of cURL. </li>
<li><code>parsel</code> supports both CSS and XPath selectors for HTML parsing which is by far the best way to parse HTML. </li>
<li><code>jmespath</code> is the most accessible way to parse JSON datasets. Usually, I'd favor <code>jsonpath</code> over <code>jmespath</code> as it allows non-root based selections (e.g. find me any node named <code>foo</code> anywhere), but Python's jsonpath implementations aren't great. </li>
</ul>
<p>Finally, Python itself comes with loads of built-in tools like <code>collections</code> and <code>urllib</code> libraries that really fit into the web scraping world.</p>
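<p>For instance, <code>urllib.parse</code> alone covers everyday chores like resolving relative links, and <code>collections.Counter</code> makes quick work of tallying results. A rough sketch (the URLs here are made up for illustration):</p>

```python
from collections import Counter
from urllib.parse import urljoin, urlparse

# resolve relative links found on a page against the page's own URL
base = "https://example.com/catalog/page1.html"
found_links = ["/product/1", "item/2", "https://other.com/x"]
absolute = [urljoin(base, link) for link in found_links]

# tally how many scraped links point at each domain
domains = Counter(urlparse(link).netloc for link in absolute)
```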
<p>Each popular target is surprisingly different though I noticed how prevalent client-side rendering is in the modern web. </p>
<p>More than half of the scrape guides didn't use much HTML parsing at all! Instead, <a href="https://scrapfly.io/blog/how-to-scrape-hidden-web-data/">hidden web data</a> scraping turned out to be a much easier way to scrape this modern javascript-powered part of the web.</p>
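<p>To illustrate the idea, here's a rough stdlib-only sketch of hidden web data scraping - the page snippet and its <code>id="state"</code> script tag are hypothetical, but modern javascript-powered sites often embed their dataset as JSON in exactly this way:</p>

```python
import json
import re

# a hypothetical page snippet: javascript-heavy sites often embed their
# full dataset as JSON inside a script tag (the id here is made up)
html = """
<html><body>
<script id="state" type="application/json">
{"product": {"name": "Widget", "price": 12.99}}
</script>
</body></html>
"""

# pull the JSON blob out of the script tag and parse it directly
match = re.search(
    r'<script id="state" type="application/json">\s*(.+?)\s*</script>',
    html,
    re.DOTALL,
)
data = json.loads(match.group(1))
```

<p>No HTML parsing is involved at all: the JSON blob is extracted with a regular expression and decoded directly into Python data structures.</p>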
<h2 id="key-subjects">Key Subjects</h2>
<p>Once we started working on scrape guides we noticed a lot of repetition. </p>
<p>Common subjects like <a href="https://scrapfly.io/blog/how-to-scrape-hidden-web-data/">how to scrape hidden web data</a>, <a href="https://scrapfly.io/blog/parse-json-jmespath-python/">how to parse JSON data</a> and how to use <a href="https://scrapfly.io/blog/parsing-html-with-css/">CSS</a> and <a href="https://scrapfly.io/blog/parsing-html-with-xpath/">XPath</a> selectors would have to be repeated in each article. </p>
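<p>As a small taste of the selector articles - Python's standard library even ships a limited XPath subset in <code>xml.etree.ElementTree</code>, enough to sketch the idea on a made-up fragment (a real scraper would use <code>parsel</code> or <code>lxml</code> for full CSS and XPath support):</p>

```python
from xml.etree import ElementTree

# a tiny, well-formed HTML fragment (made up for illustration)
html = """
<html><body>
  <div class="product"><a href="/item/1">First</a></div>
  <div class="product"><a href="/item/2">Second</a></div>
</body></html>
"""

tree = ElementTree.fromstring(html)
# ElementTree's limited XPath subset: select all anchor nodes
# under any div with class="product"
links = tree.findall('.//div[@class="product"]/a')
hrefs = [a.get("href") for a in links]
names = [a.text for a in links]
```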
<p><a href="/images/books.svg"><img class="center" loading="lazy" src="/images/books.svg" title="" width="200"/></a><figcaption></figcaption></p>
<p>So, it seemed appropriate for us to cover each of these subjects individually. This led us even further - what are some other key web scraping subjects that could be referenced anywhere? Here are a few we've settled on:</p>
<ul>
<li><a href="https://scrapfly.io/blog/how-to-ensure-web-scrapped-data-quality/">How to ensure web scraped data quality?</a> - data validation is boring but vital for consistent data delivery.</li>
<li><a href="https://scrapfly.io/blog/how-to-turn-web-scrapers-into-data-apis/">How to turn web scrapers into data APIs?</a> - real time scraping API with FastAPI and async Python.</li>
<li><a href="https://scrapfly.io/blog/how-to-rotate-proxies-in-web-scraping/">How to rotate proxies?</a> - weighted randomization!</li>
<li><a href="https://scrapfly.io/blog/web-scraping-speed/">web scraper scaling in Python</a> - Asyncio, threads and subprocesses from the point of view of web scraping.</li>
<li><a href="https://scrapfly.io/blog/how-to-rate-limit-asynchronous-python-requests/">How to rate limit requests?</a></li>
<li>How to use browser automation tools like <a href="https://scrapfly.io/blog/web-scraping-with-puppeteer-and-nodejs/">Puppeteer</a>, <a href="https://scrapfly.io/blog/web-scraping-with-playwright-and-python/">Playwright</a> and <a href="https://scrapfly.io/blog/web-scraping-with-selenium-and-python/">Selenium</a></li>
</ul>
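<p>To give a taste of the proxy rotation article, here is a minimal sketch of the "weighted randomization" idea - the proxy addresses and weights below are made up; in practice the weights would come from observed proxy health or success rates:</p>

```python
import random

# hypothetical proxy pool: healthy proxies get higher weights so they
# are picked more often; recently-failing ones are deprioritized
proxies = ["http://proxy1:8000", "http://proxy2:8000", "http://proxy3:8000"]
weights = [10, 5, 1]  # e.g. derived from observed success rates

def pick_proxy():
    # weighted random pick - the core of a weighted rotation strategy
    return random.choices(proxies, weights=weights, k=1)[0]

random.seed(42)
picks = [pick_proxy() for _ in range(1000)]
```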
<p>Most of the topics came up naturally as we were reviewing and developing our articles, though some came from our customers and from questions on popular forums like Stack Overflow or Reddit.</p>
<h2 id="favorites">Favorites</h2>
<p>The most difficult and interesting subject in web scraping is definitely <strong>scraper blocking</strong>. To cover it I got a lot of help from my teammates and put together a <a href="https://scrapfly.io/blog/how-to-scrape-without-getting-blocked-tutorial/">definitive guide on web scraper blocking</a> which took a lot of research and turned out to be our readers' favorite!</p>
<p><a href="/images/blue-heart.svg"><img class="center" loading="lazy" src="/images/blue-heart.svg" title="" width="200"/></a><figcaption></figcaption></p>
<p>I thought I was pretty well versed in the blocking subject, but things like the sheer power of <a href="https://scrapfly.io/blog/how-to-avoid-web-scraping-blocking-tls/">TLS fingerprinting</a> really blew my mind. There are just so many ways to fingerprint and identify users based on their connections.</p>
<p>My favorite article, however, is <a href="https://scrapfly.io/blog/search-engine-using-web-scraping/">creating search engines using web scraping</a> where we demo how any website can be quickly indexed with a bit of scraper code, some clever HTML parsing and an index front-end like LunrJS. I've adopted this technique in many other projects, as having a quick data index is really useful!</p>
<h2 id="lessons-learned">Lessons Learned</h2>
<p>We spent a lot of time designing our blog to function as a knowledge graph rather than a series of individual guides. This paid off pretty well as SEO seems to like references. </p>
<p><a href="/images/wizard.svg"><img class="center" loading="lazy" src="/images/wizard.svg" title="" width="200"/></a><figcaption></figcaption></p>
<p>One slight disappointment was how cool features can backfire when it comes to SEO. For our <a href="https://scrapfly.io/blog/parsing-html-with-css/">CSS</a> and <a href="https://scrapfly.io/blog/parsing-html-with-xpath/">XPath</a> articles, we built real-time on-page testing widgets that the search engines didn't like; it took a while to optimize them before they finally paid off.</p>
<p>We also tried structuring our articles with revealable components (like <code>&lt;details&gt;</code>) which increased readability but noticeably worsened SEO.</p>
<p>Keyword diversity was another important challenge as we wanted all sorts of users to find our content - from beginner developers to seniors discovering web scraping. </p>
<p>Since web-scraping is still a small niche there aren't many well-established terms, so diversifying terms made a huge difference. Sometimes increasing monthly readers tenfold!</p>
<p><a href="/images/typewriter.svg"><img class="center" loading="lazy" src="/images/typewriter.svg" title="" width="200"/></a><figcaption></figcaption></p>
<p>I wrote close to 60 articles that are on average 3000 words long. All of them started as ~6000-word drafts and were edited into something denser and more digestible. </p>
<p><strong>Keep things short</strong> is easier said than done, though. The real challenge is making all of these conflicting concerns work together:</p>
<ul>
<li>SEO favors longer text as it allows more keyword diversity.</li>
<li>Keyword repetition is bad for SEO but good for comprehension. </li>
<li>Shorter explanations might miss vital details, but longer ones can be more difficult to comprehend or simply bore the reader.</li>
<li>Which problems deserve to be explained and which don't?</li>
</ul>
<p>To start addressing these I focused more on the structure of the article. For this, a <strong>table of contents</strong> is such a gem of a feature! </p>
<p>A TOC is great for the reader to navigate and get an overview of the article, but it's also great for the editor, as it helps to visualize: is this the right order to approach this problem? What's missing and what shouldn't be there?</p>
<p>For example, in our scraping guide articles, I quickly learned that starting with a short example or how to scrape a single product page is much more approachable than following the traditional project order.</p>
<p>Getting rid of factoids and history lessons helped too. People are probably interested in the fact that the Referer header is officially misspelled - it's a fun thing to know - but do they really need to know more about it?</p>
<p>Media use can also be a major help. A well-placed image or video can save several paragraphs of text and provide some breathing space.</p>
<h2 id="the-great-beyond">The Great Beyond!</h2>
<p>Exploring the web scraping subject is a real pleasure - it's vast and challenging, though the best part is that at the end of the day you see concrete data results. Getting a quality dataset conjured up from nothing is immensely satisfying, which makes all of that work worth it!</p>
<p><a href="/images/rocket.svg"><img class="center" loading="lazy" src="/images/rocket.svg" title="" width="200"/></a><figcaption></figcaption></p>
<p>We still have loads of ideas on educational web scraping content that we will be working on in 2023. We've started work on a collection of <a href="https://scrapfly.io/web-scraping-tools">web scraping tools</a> and a few more secret projects that should pop up soon! :)</p>
<h1>Asynchronous Web Scraping: Scaling For The Moon!</h1>
<p>2021-10-27 · Bernardas Ališauskas</p>
<p>Asynchronous programming can speed up web scrapers astronomically and is by far the most important scaling step when dealing with big projects. What exactly is it, how does it work, and what are the best ways to take advantage of it?</p><p><a href="/images/moon.png"><img class="fullc" loading="lazy" src="/images/moon.png" title=""/></a><figcaption></figcaption></p>
<p>Scaling is one of the biggest challenges in the web scraping niche, and there are a lot of ways to make web scrapers scale better in efficiency, speed and reliability. However, there's one thing that can be done to web scraper code that is by far the biggest bang-per-buck improvement: using asynchronous code.</p>
<p>In this short article, we'll take a look at asynchronous python and how we can take advantage of it to make our scrapers over 100 times faster!</p>
<h2 id="what-is-asynchronous-programming">What is Asynchronous Programming?</h2>
<p>Python supports many async code paradigms; however, the current de facto standard is async/await via <code>asyncio</code> (or its alternatives like <a href="https://trio.readthedocs.io/en/stable/">trio</a>). In this article we'll focus on asyncio as it's currently the most approachable async paradigm in Python.</p>
<p>So what is asynchronous programming?<br/>
At its core, it's essentially just pausable functions (called coroutines) that can pause while inactive and give way to active ones. This often has huge benefits in IO (input/output) bound programs that rely on waiting for some sort of external input or output. </p>
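<p>A minimal illustration of this pausing, with <code>asyncio.sleep()</code> standing in for an IO wait:</p>

```python
import asyncio
import time

async def fetch(name, delay):
    # asyncio.sleep() stands in for an IO wait (e.g. a pending HTTP
    # response); while this coroutine is paused, others are free to run
    await asyncio.sleep(delay)
    return name

async def main():
    # both "requests" wait concurrently: total time is ~0.5s, not 0.9s
    return await asyncio.gather(fetch("first", 0.5), fetch("second", 0.4))

start = time.time()
results = asyncio.run(main())
elapsed = time.time() - start
```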
<p>What is an IO block?<br/>
When programs interact with an <em>external</em> service they often need to wait for the service to respond: this waiting is called an IO block. Some examples of IO blocks:</p>
<ul>
<li>In online apps an IO block would be waiting for the server to respond to requests.</li>
<li>In GUI apps this would be waiting for the app user to click buttons, enter text or interact with other GUI widgets. </li>
<li>In video games this would be waiting for the player to perform some actions.</li>
</ul>
<p>IO blocking is a big deal in web scraping, since the majority of the actual work the program does is communicating with the web server.<br/>
For example, in synchronous Python, when we send a request to a server our code stops - does absolutely nothing - while waiting for the server to respond. In web scraping this wait can be up to several seconds per request, and those seconds add up really quickly!</p>
<p>Say we have 100 URLs to scrape that wait 1 second each - that's over a minute and a half of waiting! Could we make them all wait together, concurrently?</p>
<h2 id="how-does-async-web-scraping-work">How Does Async Web Scraping Work?</h2>
<p class="info">In these examples we'll be using <a href="https://www.python-httpx.org/">httpx</a> http client package which supports both synchronous and asynchronous client APIs</p>
<p>Let's take a look at a quick example. We have 100 urls that we want to scrape, and we have this really simple synchronous scraper code:</p>
<div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal"> 1</span>
<span class="normal"> 2</span>
<span class="normal"> 3</span>
<span class="normal"> 4</span>
<span class="normal"> 5</span>
<span class="normal"> 6</span>
<span class="normal"> 7</span>
<span class="normal"> 8</span>
<span class="normal"> 9</span>
<span class="normal">10</span>
<span class="normal">11</span>
<span class="normal">12</span>
<span class="normal">13</span>
<span class="normal">14</span>
<span class="normal">15</span>
<span class="normal">16</span>
<span class="normal">17</span>
<span class="normal">18</span>
<span class="normal">19</span>
<span class="normal">20</span>
<span class="normal">21</span>
<span class="normal">22</span>
<span class="normal">23</span>
<span class="normal">24</span></pre></div></td><td class="code"><div><pre><span></span><code><span class="kn">from</span> <span class="nn">httpx</span> <span class="kn">import</span> <span class="n">Client</span>
<span class="kn">from</span> <span class="nn">time</span> <span class="kn">import</span> <span class="n">time</span>
<span class="k">def</span> <span class="nf">scrape</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">session</span><span class="p">):</span>
<span class="k">return</span> <span class="n">session</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">run</span><span class="p">():</span>
<span class="n">_start</span> <span class="o">=</span> <span class="n">time</span><span class="p">()</span>
<span class="n">results</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">with</span> <span class="n">Client</span><span class="p">()</span> <span class="k">as</span> <span class="n">session</span><span class="p">:</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">100</span><span class="p">):</span>
<span class="c1"># this url will always take 1 second</span>
<span class="n">url</span> <span class="o">=</span> <span class="s2">"http://httpbin.org/delay/1"</span>
<span class="n">results</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">scrape</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">session</span><span class="o">=</span><span class="n">session</span><span class="p">))</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"finished scraping in: </span><span class="si">{</span><span class="n">time</span><span class="p">()</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">_start</span><span class="si">:</span><span class="s2">.1f</span><span class="si">}</span><span class="s2"> seconds"</span><span class="p">)</span>
<span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s2">"__main__"</span><span class="p">:</span>
<span class="n">run</span><span class="p">()</span>
<span class="c1"># results in:</span>
<span class="c1"># finished scraping in: 135.8 seconds</span>
</code></pre></div></td></tr></table></div>
<p class="info">Here we're taking advantage of <a href="https://httpbin.org">https://httpbin.org</a> delayed response endpoint <code>/delay/<n></code> which will simulates slow server which takes N seconds to respond. This tool is great for testing web scraping scaling!</p>
<p>In this example, we schedule 100 different requests synchronously, each of which should take at least 1 second to execute: a 1 second delay by the server + all the connection overhead our machine does = 100+ seconds.</p>
<p>Now let's try to run those 100 requests at the same time, asynchronously, using the <code>asyncio</code> event loop and the <code>asyncio.gather()</code> function:</p>
<div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal"> 1</span>
<span class="normal"> 2</span>
<span class="normal"> 3</span>
<span class="normal"> 4</span>
<span class="normal"> 5</span>
<span class="normal"> 6</span>
<span class="normal"> 7</span>
<span class="normal"> 8</span>
<span class="normal"> 9</span>
<span class="normal">10</span>
<span class="normal">11</span>
<span class="normal">12</span>
<span class="normal">13</span>
<span class="normal">14</span>
<span class="normal">15</span>
<span class="normal">16</span>
<span class="normal">17</span>
<span class="normal">18</span>
<span class="normal">19</span>
<span class="normal">20</span>
<span class="normal">21</span>
<span class="normal">22</span>
<span class="normal">23</span></pre></div></td><td class="code"><div><pre><span></span><code><span class="kn">import</span> <span class="nn">asyncio</span>
<span class="kn">from</span> <span class="nn">time</span> <span class="kn">import</span> <span class="n">time</span>
<span class="kn">from</span> <span class="nn">httpx</span> <span class="kn">import</span> <span class="n">AsyncClient</span>
<span class="k">async</span> <span class="k">def</span> <span class="nf">scrape</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">session</span><span class="p">):</span>
<span class="k">return</span> <span class="k">await</span> <span class="n">session</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
<span class="k">async</span> <span class="k">def</span> <span class="nf">run</span><span class="p">():</span>
<span class="n">_start</span> <span class="o">=</span> <span class="n">time</span><span class="p">()</span>
<span class="k">async</span> <span class="k">with</span> <span class="n">AsyncClient</span><span class="p">()</span> <span class="k">as</span> <span class="n">session</span><span class="p">:</span>
<span class="n">tasks</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">100</span><span class="p">):</span>
<span class="c1"># this url will always take 1 second</span>
<span class="n">url</span> <span class="o">=</span> <span class="s2">"http://httpbin.org/delay/1"</span>
<span class="n">tasks</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">scrape</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">session</span><span class="o">=</span><span class="n">session</span><span class="p">))</span>
<span class="n">results</span> <span class="o">=</span> <span class="k">await</span> <span class="n">asyncio</span><span class="o">.</span><span class="n">gather</span><span class="p">(</span><span class="o">*</span><span class="n">tasks</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"finished scraping in: </span><span class="si">{</span><span class="n">time</span><span class="p">()</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">_start</span><span class="si">:</span><span class="s2">.1f</span><span class="si">}</span><span class="s2"> seconds"</span><span class="p">)</span>
<span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s2">"__main__"</span><span class="p">:</span>
<span class="n">asyncio</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="n">run</span><span class="p">())</span>
</code></pre></div></td></tr></table></div>
<p>If we run this minor async modification, we'll see a colossal speed boost: </p>
<blockquote>
<p>finished scraping in: 2.4 seconds</p>
</blockquote>
<p>We sped up our scraper <strong>over 50 times</strong> just by converting a few lines of code into asynchronous Python! Let's take a look at a small illustration of the performance of these two programs:</p>
<p><a href="/images/async-vs-sync.png"><img class="" loading="lazy" src="/images/async-vs-sync.png" width='title=""'/></a><figcaption></figcaption></p>
<p>Here, we can easily visualize the difference between synchronous and asynchronous scraper programs: the IO waiting on the left is <em>part</em> of the program, while on the right it doesn't have to be!</p>
<p>Async programs essentially prevent IO waiting from blocking our programs. For wait-intensive tasks such as HTTP connections, async essentially means concurrency and can speed up programs from dozens to thousands of times over!</p>
<p>However, as the illustration shows, there are some negatives to async programming: </p>
<ul>
<li>Ensuring execution order can be difficult. <br/>
Often, request order can be very important in session or cookie bound web scraping. Keeping track of sessions and cookies is much more complicated in an async environment.</li>
<li>Async programs are much harder to design and debug since they are inherently more complex.</li>
<li>Async speed can be a major pain when dealing with scraper bans, which require extra throttling logic.</li>
</ul>
<p>Whether this complexity overhead is worth it for your web scraper entirely depends on your program. For small scripts, it might not be worth tangling with the whole async ecosystem, which is still quite young in Python. However, for big data scrapers, the 100-1000x speed boosts async provides are vital for any reasonable data collection speeds!</p>
<h2 id="were-going-too-fast-throttling">We're Going Too Fast: Throttling</h2>
<p>Using async Python we can make a lot of HTTP requests very quickly, which can become problematic. Web servers often try to protect themselves from high traffic, as it's resource-expensive, and will often block scrapers that connect too quickly.</p>
<p>To avoid bans and blocks, we must either use multiple proxies or, preferably, throttle our connections. Let's take a look at a few common ways to throttle web scrapers.</p>
<h3 id="asyncios-semaphore">Asyncio's Semaphore</h3>
<p>This is the built-in way to throttle coroutine concurrency in Python's asyncio. A Semaphore is essentially a lock object with a limited number of working slots, i.e. it allows us to specify that no more than N coroutines should execute concurrently:</p>
<div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal"> 1</span>
<span class="normal"> 2</span>
<span class="normal"> 3</span>
<span class="normal"> 4</span>
<span class="normal"> 5</span>
<span class="normal"> 6</span>
<span class="normal"> 7</span>
<span class="normal"> 8</span>
<span class="normal"> 9</span>
<span class="normal">10</span>
<span class="normal">11</span>
<span class="normal">12</span>
<span class="normal">13</span>
<span class="normal">14</span>
<span class="normal">15</span>
<span class="normal">16</span>
<span class="normal">17</span>
<span class="normal">18</span>
<span class="normal">19</span>
<span class="normal">20</span>
<span class="normal">21</span>
<span class="normal">22</span>
<span class="normal">23</span>
<span class="normal">24</span>
<span class="normal">25</span>
<span class="normal">26</span>
<span class="normal">27</span>
<span class="normal">28</span></pre></div></td><td class="code"><div><pre><span></span><code><span class="kn">import</span> <span class="nn">asyncio</span>
<span class="kn">from</span> <span class="nn">time</span> <span class="kn">import</span> <span class="n">time</span>
<span class="kn">from</span> <span class="nn">httpx</span> <span class="kn">import</span> <span class="n">AsyncClient</span>
<span class="k">async</span> <span class="k">def</span> <span class="nf">scrape</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">session</span><span class="p">,</span> <span class="n">throttler</span><span class="p">):</span>
<span class="k">async</span> <span class="k">with</span> <span class="n">throttler</span><span class="p">:</span>
<span class="k">return</span> <span class="k">await</span> <span class="n">session</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
<span class="k">async</span> <span class="k">def</span> <span class="nf">run</span><span class="p">():</span>
<span class="n">_start</span> <span class="o">=</span> <span class="n">time</span><span class="p">()</span>
<span class="n">throttler</span> <span class="o">=</span> <span class="n">asyncio</span><span class="o">.</span><span class="n">Semaphore</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
<span class="k">async</span> <span class="k">with</span> <span class="n">AsyncClient</span><span class="p">()</span> <span class="k">as</span> <span class="n">session</span><span class="p">:</span>
<span class="n">tasks</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">100</span><span class="p">):</span>
<span class="c1"># this url will always take 1 second</span>
<span class="n">url</span> <span class="o">=</span> <span class="s2">"http://httpbin.org/delay/1"</span>
<span class="n">tasks</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">scrape</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">session</span><span class="o">=</span><span class="n">session</span><span class="p">,</span> <span class="n">throttler</span><span class="o">=</span><span class="n">throttler</span><span class="p">))</span>
<span class="n">results</span> <span class="o">=</span> <span class="k">await</span> <span class="n">asyncio</span><span class="o">.</span><span class="n">gather</span><span class="p">(</span><span class="o">*</span><span class="n">tasks</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"finished scraping in: </span><span class="si">{</span><span class="n">time</span><span class="p">()</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">_start</span><span class="si">:</span><span class="s2">.1f</span><span class="si">}</span><span class="s2"> seconds"</span><span class="p">)</span>
<span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s2">"__main__"</span><span class="p">:</span>
<span class="n">asyncio</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="n">run</span><span class="p">())</span>
<span class="c1"># results in:</span>
<span class="c1"># finished scraping in: 14.8 seconds</span>
</code></pre></div></td></tr></table></div>
<p>Here we modified our async scraper code with one key addition: an <code>asyncio.Semaphore(10)</code> object, which limits execution to 10 concurrent coroutines. Our original unlimited code took 2.4 seconds to complete; now we've slowed it down 6 times with a few extra lines of code! </p>
<p>Some important things to notice about <code>asyncio.Semaphore</code>:</p>
<p>All throttled coroutines must use the same <code>Semaphore</code> object. This means one shared semaphore needs to be passed around to each individual coroutine, introducing a slight complexity overhead.</p>
<p><code>Semaphore</code> is only aware of concurrency, not time. Often when web-scraping we want to limit the request count within a specific time frame, e.g. a maximum of 60 requests/minute. In other words, our <code>Semaphore(10)</code> could yield 10 requests/second when scraping fast servers, but only 2 requests/second on slower ones.</p>
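<p>To illustrate the shared-object requirement, here's a minimal stdlib-only sketch where <code>asyncio.sleep</code> stands in for an HTTP request (a toy illustration, not production scraping code) - every coroutine receives the same <code>Semaphore</code> instance:</p>

```python
import asyncio
from time import time

async def scrape(i, throttler):
    # every coroutine enters through the SAME semaphore object
    async with throttler:  # at most 2 coroutines run this block at once
        await asyncio.sleep(0.1)  # stand-in for a 0.1 second request
        return i

async def run():
    _start = time()
    throttler = asyncio.Semaphore(2)  # created once, shared by all tasks
    results = await asyncio.gather(*[scrape(i, throttler) for i in range(10)])
    elapsed = time() - _start
    print(f"finished {len(results)} tasks in {elapsed:.1f} seconds")
    return elapsed

elapsed = asyncio.run(run())
```

<p>With 10 tasks of 0.1 seconds each and a concurrency limit of 2, the run takes roughly 0.5 seconds instead of 0.1.</p>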
<p>While <code>Semaphore</code> gives us an easy way to throttle ourselves, it's not an ideal approach as it doesn't let us be as precise as some web servers might require. Web servers often have strict limits like 60 requests/second - for the most efficient web scraper we'd like to stay close to that limit - so let's take a look at how we can accomplish this.</p>
<h3 id="leaky-bucket">Leaky Bucket</h3>
<p>Leaky Bucket is a common throttling algorithm that uses time tracking for throttling rather than just concurrency. Using the Leaky Bucket approach, we can specify how many tasks can be executed in a specific time frame, i.e. we can throttle our scraper to 10 requests/second!</p>
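<p>Before reaching for a library, the core idea can be sketched in plain <code>asyncio</code> (a simplified illustration of the algorithm, not how any particular library actually implements it):</p>

```python
import asyncio
from time import monotonic

class NaiveLeakyBucket:
    """Toy leaky bucket: allow at most `rate` acquisitions per second."""

    def __init__(self, rate: float):
        self.interval = 1.0 / rate  # minimum gap between two "drips"
        # start one interval in the past so the first task runs immediately
        self._next_slot = monotonic() - self.interval

    async def __aenter__(self):
        # reserve the next free time slot and sleep until it arrives
        now = monotonic()
        self._next_slot = max(self._next_slot + self.interval, now)
        await asyncio.sleep(self._next_slot - now)

    async def __aexit__(self, *exc):
        pass

async def task(i, bucket):
    async with bucket:
        return i  # stand-in for an HTTP request

async def run():
    _start = monotonic()
    bucket = NaiveLeakyBucket(rate=10)  # 10 tasks/second
    results = await asyncio.gather(*[task(i, bucket) for i in range(20)])
    return monotonic() - _start, results

elapsed, results = asyncio.run(run())
print(f"finished {len(results)} tasks in {elapsed:.1f} seconds")
```

<p>20 tasks at a rate of 10/second take roughly 2 seconds no matter how fast each individual task completes.</p>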
<p>A popular implementation for python's asyncio is available in the <a href="https://github.com/mjpieters/aiolimiter">aiolimiter</a> package. Let's take a quick look at how we can use it in our web-scraper:</p>
<div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal"> 1</span>
<span class="normal"> 2</span>
<span class="normal"> 3</span>
<span class="normal"> 4</span>
<span class="normal"> 5</span>
<span class="normal"> 6</span>
<span class="normal"> 7</span>
<span class="normal"> 8</span>
<span class="normal"> 9</span>
<span class="normal">10</span>
<span class="normal">11</span>
<span class="normal">12</span>
<span class="normal">13</span>
<span class="normal">14</span>
<span class="normal">15</span>
<span class="normal">16</span>
<span class="normal">17</span>
<span class="normal">18</span>
<span class="normal">19</span>
<span class="normal">20</span>
<span class="normal">21</span>
<span class="normal">22</span>
<span class="normal">23</span>
<span class="normal">24</span>
<span class="normal">25</span>
<span class="normal">26</span>
<span class="normal">27</span>
<span class="normal">28</span>
<span class="normal">29</span></pre></div></td><td class="code"><div><pre><span></span><code><span class="kn">import</span> <span class="nn">asyncio</span>
<span class="kn">from</span> <span class="nn">aiolimiter</span> <span class="kn">import</span> <span class="n">AsyncLimiter</span>
<span class="kn">from</span> <span class="nn">time</span> <span class="kn">import</span> <span class="n">time</span>
<span class="kn">from</span> <span class="nn">httpx</span> <span class="kn">import</span> <span class="n">AsyncClient</span>
<span class="k">async</span> <span class="k">def</span> <span class="nf">scrape</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">session</span><span class="p">,</span> <span class="n">throttler</span><span class="p">):</span>
<span class="k">async</span> <span class="k">with</span> <span class="n">throttler</span><span class="p">:</span>
<span class="k">return</span> <span class="k">await</span> <span class="n">session</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
<span class="k">async</span> <span class="k">def</span> <span class="nf">run</span><span class="p">():</span>
<span class="n">_start</span> <span class="o">=</span> <span class="n">time</span><span class="p">()</span>
<span class="n">throttler</span> <span class="o">=</span> <span class="n">AsyncLimiter</span><span class="p">(</span><span class="n">max_rate</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">time_period</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span> <span class="c1"># 10 tasks/second</span>
<span class="k">async</span> <span class="k">with</span> <span class="n">AsyncClient</span><span class="p">()</span> <span class="k">as</span> <span class="n">session</span><span class="p">:</span>
<span class="n">tasks</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">100</span><span class="p">):</span>
<span class="c1"># this url will always take 1 second</span>
<span class="n">url</span> <span class="o">=</span> <span class="s2">"http://httpbin.org/delay/1"</span>
<span class="n">tasks</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">scrape</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">session</span><span class="o">=</span><span class="n">session</span><span class="p">,</span> <span class="n">throttler</span><span class="o">=</span><span class="n">throttler</span><span class="p">))</span>
<span class="n">results</span> <span class="o">=</span> <span class="k">await</span> <span class="n">asyncio</span><span class="o">.</span><span class="n">gather</span><span class="p">(</span><span class="o">*</span><span class="n">tasks</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"finished scraping in: </span><span class="si">{</span><span class="n">time</span><span class="p">()</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">_start</span><span class="si">:</span><span class="s2">.1f</span><span class="si">}</span><span class="s2"> seconds"</span><span class="p">)</span>
<span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s2">"__main__"</span><span class="p">:</span>
<span class="n">asyncio</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="n">run</span><span class="p">())</span>
<span class="c1"># results in:</span>
<span class="c1"># finished scraping in: 10.4 seconds</span>
</code></pre></div></td></tr></table></div>
<p>Here we merely replaced the <code>Semaphore</code> from the previous code with <code>aiolimiter.AsyncLimiter</code>, which allows us to be much more precise: we completed our 100 scrape tasks in 10.4 seconds, matching our limit of 10 requests/second almost exactly!</p>
<p>Leaky Bucket is a great throttling approach as it allows us to scrape at predictable, controlled speeds. That being said, predictability is not a great web scraper feature when it comes to blocking and banning - a scraper can be identified quite easily if it shows a consistent connection pattern.<br/>
So which throttling pattern to use depends entirely on the scraped targets, but Leaky Bucket is often the best approach!</p>
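<p>One common mitigation for that predictability is to add random jitter to each task's timing so requests don't form a perfectly regular rhythm. Here's a hypothetical sketch (the <code>do_request</code> coroutine is a stand-in for an actual HTTP call, not part of any library):</p>

```python
import asyncio
import random

async def jittered(coro_fn, *args, max_jitter=0.2):
    # delay each task by a random amount to break up the regular pattern
    await asyncio.sleep(random.uniform(0, max_jitter))
    return await coro_fn(*args)

async def do_request(url):
    await asyncio.sleep(0.05)  # stand-in for a network call
    return url

async def run():
    urls = [f"https://example.com/page/{i}" for i in range(5)]
    return await asyncio.gather(*[jittered(do_request, u) for u in urls])

results = asyncio.run(run())
print(results)
```

<p>This wrapper composes with either throttling approach: the limiter caps the average rate while the jitter scrambles the exact timing.</p>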
<h3 id="how-much-to-throttle">How Much To Throttle?</h3>
<p>Finally, with throttling figured out, we need to decide on the actual number. Unfortunately, there are no standard accepted practices other than that it's respectful to keep the connection count low so as not to cause issues for web servers. </p>
<p>Staying in the <strong>10-30 requests/second</strong> range (depending on page and website size) is widely considered a respectful scraping rate - so that's a good starting point. Other than that, it's a good idea to grab some proxies and experiment!</p>
<h2 id="summary-alternatives-and-further-reading_1">Summary, Alternatives and Further Reading</h2>
<p>In this introductory article, we've taken a look at how we can use asynchronous python to speed up our web-scrapers - to the point where we had to figure out how to throttle ourselves! For throttling, we've discovered the Semaphore and Leaky Bucket approaches and how we can apply them in web-scraping. </p>
<p>Asynchronous programming can often be difficult, complex and even ugly. However, when it comes to IO-heavy tasks like the web connections in web-scraping, it is often unavoidable - so it's better to get comfortable with it! </p>
<hr/>
<p>As for further reading and alternatives?</p>
<p>Asynchronous programming is not a new concept by any means, but it's still evolving in many different directions. In this article we focused on Python's async/await approach; however, there are other paradigms that are also used in web-scraping.</p>
<p>One of the biggest web-scraping frameworks, <a href="https://scrapy.org">Scrapy</a>, uses the <a href="https://duckduckgo.com/?q=twisted+python&ia=web">Twisted</a> callback-based async engine, which allows running functions in the background and calling a result callback once they complete. Callbacks and Futures are also often used in Javascript and can be used in Python's asyncio as well! </p>
<p>Other than that, the <a href="https://github.com/celery/celery">Celery</a> task engine is also often used to execute scraping tasks concurrently. <br/>
<a href="http://www.gevent.org/">Gevent</a> is yet another popular asynchronous library that integrates well with the web-scraping ecosystem. </p>
<p>On this blog we'll cover more examples of these alternative approaches in the future; however, do not underestimate python's <code>asyncio</code> - as we've seen in this article, it's extremely fast, surprisingly accessible and, once understood, can be an absolute pleasure to work with!</p>
<hr/>
<p>So stick around for more articles and if you have any questions, come join us on <a href="https://matrix.to/#/%23web-scraping:matrix.org">#web-scraping on matrix</a>, check out <a href="https://stackoverflow.com/questions/tagged/web-scraping">#web-scraping on stackoverflow</a> or leave a comment below! </p>
<p>As always, you can hire me for web-scraping consultation over at the <a href="/pages/hire.html">hire</a> page and happy scraping! </p>Introduction To Reverse Engineering The Web2021-10-08T00:00:00+02:002021-10-08T00:00:00+02:00Bernardas Ališauskastag:scrapecrow.com,2021-10-08:/reverse-engineering-intro.html<p>To efficiently scrape a web resource, understanding how it works and functions is often a vital step. Reverse engineering a website's behavior is often the first step when developing a web-scraper - let's take a look at how!</p><p><a href="/images/banner-machines.png"><img class="fullc" loading="lazy" src="/images/banner-machines.png" title=""/></a><figcaption></figcaption></p>
<p>Unfortunately for web-scrapers, modern websites are much more than html files strung together and served over the http protocol. Often websites are complex structures of multiple execution layers and file formats that are designed specifically for some of the most complex programs in the world - web browsers. </p>
<p>So how do we scrape complex websites without using web-browsers? We can reverse engineer a website's behavior and replicate it in our code!</p>
<p>In this article we'll cover a short introduction to using a web browser's developer tools to reverse engineer a website's behavior. We'll be using <a href="https://food.com">https://food.com</a> as our example and cover some useful tips and tricks. </p>
<p class="info">In this article we'll be using the <code>Chrome</code> web browser. That being said, Chrome-based browsers (like Brave, Electron, Qutebrowser etc.) use the same tools and should function the same. However, Firefox and its browser family use slightly different tools.</p>
<h2 id="browser-developer-tools">Browser Developer Tools</h2>
<p>Fortunately, modern web browsers come with great debugging tools referred to as "Developer Tools". For this article we'll take a look at the Chrome web browser. If you fire up Chrome and press <code>F12</code> (or right-click anywhere on the page and select <code>inspect</code>), the developer tools window will open up:</p>
<p><a href="/images/devtools.png"><img class="bigc" loading="lazy" src="/images/devtools.png" title=""/></a><figcaption></figcaption></p>
<p>As you can see, there's a lot going on here. Let's quickly go through these tools and see what they can do for us when it comes to web-scraping. First, let's take a look at the <strong>available tool tabs</strong>:</p>
<hr/>
<p><code>Elements</code> - this tab allows us to visually explore, search and investigate the html page structure.
<a href="/images/devtools_tab_elements.png"><img class="bigc" loading="lazy" src="/images/devtools_tab_elements.png" title="this tab is really useful for visualizing how page is structured"/></a><figcaption>this tab is really useful for visualizing how page is structured</figcaption></p>
<p><code>Console</code> - this tab functions like a real-time shell or a REPL. You can type javascript expressions here and they will be evaluated against the current page.<br/>
<a href="/images/devtools_tab_console.png"><img class="bigc" loading="lazy" src="/images/devtools_tab_console.png" title="this tool is great for reverse engineering javascript functionality of the page - imagine it as a debugger shell for website's code"/></a><figcaption>this tool is great for reverse engineering javascript functionality of the page - imagine it as a debugger shell for website's code</figcaption></p>
<p><code>Application</code> - contains various application data: from cookies to database entries. This is rarely used by websites but often used by various web-apps. For web-scraping, this tab is not commonly needed.
<a href="/images/devtools_tab_application.png"><img class="bigc" loading="lazy" src="/images/devtools_tab_application.png" title="most useful feature of this tab is often the clear all data button"/></a><figcaption>most useful feature of this tab is often the clear all data button</figcaption></p>
<p><code>Network</code> - probably the most interesting tab: it shows all of the network requests made by the browser. It's the most useful web-scraping tool of the bunch!<br/>
<a href="/images/devtools_tab_network.png"><img class="bigc" loading="lazy" src="/images/devtools_tab_network.png" title="we'll be spending most of our time in this tab!"/></a><figcaption>we'll be spending most of our time in this tab!</figcaption></p>
<hr/>
<p>As you can see, it's a huge suite of web tools! However, the most interesting one when it comes to reverse-engineering for web-scraping purposes has to be the Network tab. Let's take a look at how we can configure it for an optimal experience, along with some examples of how to use it.</p>
<h2 id="network-inspector">Network Inspector</h2>
<p>This browser tool shows us all the requests our browser is making when we're browsing the web. </p>
<p>First, let's take a look at the window itself. Specifically how to read it in the context of reverse-engineering for web-scraping:</p>
<p><a href="/images/devtools_tab_network_details.png"><img class="bigc" loading="lazy" src="/images/devtools_tab_network_details.png" title="This might look a bit different on different browsers but functionality should be the same!"/></a><figcaption>This might look a bit different on different browsers but functionality should be the same!</figcaption></p>
<p>There's a lot going on here, but don't get overwhelmed just yet. We only need to focus on these parts:</p>
<ol>
<li>Contains all requests your browser made to the website. You can click on each individual one to inspect it further (we'll dig into this more below). </li>
<li>Option flags that disable caching and stop data clearing on page load (these are very useful for reverse engineering)</li>
<li>Contains a powerful filtering system. For the most part we'll be using either:<ul>
<li><code>Doc</code> filter which shows all <code>html</code> document requests </li>
<li><code>XHR</code> filter which shows all data requests such as <code>json</code>.</li>
</ul>
</li>
<li>The clear button <code>⍉</code>, which clears current requests for easier tracking of what's going on. </li>
</ol>
<p>Further, we can take a look at an individual request itself and which parts of it are most useful for reverse-engineering. If you click on one of the requests, you should see something like: </p>
<p><a href="/images/devtools_tab_network_row_details.png"><img class="bigc" loading="lazy" src="/images/devtools_tab_network_row_details.png" title=""/></a><figcaption></figcaption></p>
<p>In this window we see several important fields of information:</p>
<ol>
<li>Basic request details<br/>
The most important details here are the URL and the request method.</li>
<li>
<p>Response headers<br/>
Rarely interesting, but can contain important metadata about the response the browser received from the website, such as:</p>
<ul>
<li><code>Set-Cookie</code> header<br/>
contains the cookies the website asks the browser to save</li>
<li><code>Content-Type</code> header<br/>
contains the type of response. Most common values are either <code>text/html</code> for html documents or <code>application/json</code> for json data.</li>
<li><code>X-</code> prefixed headers<br/>
these are non-standard headers that are often used for website functionality, tracking or anti-bot protection.</li>
</ul>
<p class="info">If you'd like to learn more about http headers see <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers">MDN's http header documentation</a></p>
</li>
<li>
<p>Request headers<br/>
These are the headers the browser sent with this request. Often we want to replicate most of these headers in our web-scraper <em>as closely as possible</em>. The most common and vital ones are:</p>
<ul>
<li><code>Content-Type</code> and <code>Accept-</code> prefixed headers<br/>
these are instructions for what sort of content is expected. Http client libraries (like <code>requests</code> for python) often fill them in automatically, so it's important to keep an eye on these as they might be generated differently than in our browser. </li>
<li><code>User-Agent</code><br/>
identifies who is making the request. This is a really important field for avoiding basic bot blocking. Usually we want to set this to a popular browser on a popular OS, like Chrome on Windows.</li>
</ul>
</li>
<li>
<p>Request Payload<br/>
This is mostly used when dealing with <code>POST</code> type requests. It shows what data the browser sends to the website - usually some sort of request parameters in json format. </p>
</li>
</ol>
<p>As you can see, the Network Inspector is a surprisingly powerful and extremely useful reverse engineering tool that shows us, in great detail, what connections our browser is making!</p>
<p>Now that we're somewhat familiar with it, let's take a look at common usage tips and several iconic web-scraping problems that can be solved by using this tool.</p>
<h2 id="tip-replicating-requests-in-python">Tip: Replicating Requests in Python</h2>
<p>There's an easy way to replicate requests seen in the Network Inspector in your python code. <br/>
If you right-click on a request, you can see that the Network Inspector allows exporting it in several formats: </p>
<p><a href="/images/devtools_tab_network_copy_curl.png"><img class="" loading="lazy" src="/images/devtools_tab_network_copy_curl.png" title=""/></a><figcaption></figcaption></p>
<p>While there's no "copy as python" button, there is a "copy as cURL" button which produces a <code>curl</code> command with all of the request details attached. Something like:</p>
<div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal"> 1</span>
<span class="normal"> 2</span>
<span class="normal"> 3</span>
<span class="normal"> 4</span>
<span class="normal"> 5</span>
<span class="normal"> 6</span>
<span class="normal"> 7</span>
<span class="normal"> 8</span>
<span class="normal"> 9</span>
<span class="normal">10</span>
<span class="normal">11</span>
<span class="normal">12</span>
<span class="normal">13</span>
<span class="normal">14</span>
<span class="normal">15</span>
<span class="normal">16</span></pre></div></td><td class="code"><div><pre><span></span><code><span class="nt">curl</span><span class="w"> </span><span class="s1">'https://api.food.com/external/v1/nlp/search'</span><span class="w"> </span><span class="err">\</span>
<span class="w"> </span><span class="nt">-H</span><span class="w"> </span><span class="s1">'Connection: keep-alive'</span><span class="w"> </span><span class="err">\</span>
<span class="w"> </span><span class="nt">-H</span><span class="w"> </span><span class="s1">'Pragma: no-cache'</span><span class="w"> </span><span class="err">\</span>
<span class="w"> </span><span class="nt">-H</span><span class="w"> </span><span class="s1">'Cache-Control: no-cache'</span><span class="w"> </span><span class="err">\</span>
<span class="w"> </span><span class="nt">-H</span><span class="w"> </span><span class="s1">'Accept: application/json, text/javascript, */*; q=0.01'</span><span class="w"> </span><span class="err">\</span>
<span class="w"> </span><span class="nt">-H</span><span class="w"> </span><span class="s1">'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) QtWebEngine/5.15.2 Chrome/87.0.4280.144 Safari/537.36'</span><span class="w"> </span><span class="err">\</span>
<span class="w"> </span><span class="nt">-H</span><span class="w"> </span><span class="s1">'Content-Type: application/json'</span><span class="w"> </span><span class="err">\</span>
<span class="w"> </span><span class="nt">-H</span><span class="w"> </span><span class="s1">'Accept-Language: en-US,en;q=0.9'</span><span class="w"> </span><span class="err">\</span>
<span class="w"> </span><span class="nt">-H</span><span class="w"> </span><span class="s1">'DNT: 1'</span><span class="w"> </span><span class="err">\</span>
<span class="w"> </span><span class="nt">-H</span><span class="w"> </span><span class="s1">'Origin: https://www.food.com'</span><span class="w"> </span><span class="err">\</span>
<span class="w"> </span><span class="nt">-H</span><span class="w"> </span><span class="s1">'Sec-Fetch-Site: same-site'</span><span class="w"> </span><span class="err">\</span>
<span class="w"> </span><span class="nt">-H</span><span class="w"> </span><span class="s1">'Sec-Fetch-Mode: cors'</span><span class="w"> </span><span class="err">\</span>
<span class="w"> </span><span class="nt">-H</span><span class="w"> </span><span class="s1">'Sec-Fetch-Dest: empty'</span><span class="w"> </span><span class="err">\</span>
<span class="w"> </span><span class="nt">-H</span><span class="w"> </span><span class="s1">'Referer: https://www.food.com/'</span><span class="w"> </span><span class="err">\</span>
<span class="w"> </span><span class="nt">--data-raw</span><span class="w"> </span><span class="s1">'{"contexts":</span><span class="cp">[]</span><span class="s1">,"searchTerm":"","pn":4}'</span><span class="w"> </span><span class="err">\</span>
<span class="w"> </span><span class="nt">--compressed</span>
</code></pre></div></td></tr></table></div>
<p>There are several tools that can convert this string into code! </p>
<ul>
<li><a href="https://curl.trillworks.com/">https://curl.trillworks.com/</a> - can convert it to multiple languages, python <code>requests</code> being one of them.</li>
<li><a href="https://github.com/spulec/uncurl">https://github.com/spulec/uncurl</a> - is a library that can convert it to python objects or full python <code>requests</code> code.</li>
</ul>
<p>Using either of these converter tools we can quickly prototype our web-scraper while reverse engineering our target.</p>
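<p>For example, the exported curl command above maps onto python's <code>requests</code> roughly as follows. This sketch only builds and inspects the request via <code>Request(...).prepare()</code> without actually sending it; the header values are copied from the curl export (most of the remaining headers were omitted for brevity):</p>

```python
import requests

# rebuild the exported curl command as a requests.Request object
request = requests.Request(
    method="POST",
    url="https://api.food.com/external/v1/nlp/search",
    headers={
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/87.0.4280.144 Safari/537.36",
        "Accept": "application/json, text/javascript, */*; q=0.01",
        "Origin": "https://www.food.com",
        "Referer": "https://www.food.com/",
    },
    json={"contexts": [], "searchTerm": "", "pn": 4},
)
prepared = request.prepare()  # serializes the headers and the json body
print(prepared.method, prepared.url)
print(prepared.headers["Content-Type"])  # filled in automatically for json=
# to actually send it: requests.Session().send(prepared)
```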
<h2 id="common-case-dynamic-javascript-pagination">Common Case: Dynamic Javascript Pagination</h2>
<p>One of the most commonly encountered web-scraping issues is dynamic content generation powered by javascript.
Modern websites often use javascript to generate web page content on the fly rather than redirecting users to a new page.</p>
<p>Most commonly this is observed in item pagination - instead of sending the user to page 2, its data is requested in the background and injected back into the document's body using javascript.<br/>
This is often referred to as <strong>never-ending or dynamic pagination</strong>.</p>
<p>Common identifiers of dynamic pagination:</p>
<ul>
<li>Instead of flipping pages, users just need to scroll down and more results are loaded</li>
<li>Clicking a page doesn't reload the current page, just the pagination part.</li>
<li>Pagination doesn't work with javascript disabled.</li>
</ul>
<h3 id="scraping-recipes-from-foodcom">Scraping Recipes from Food.com</h3>
<p>For example let's take a look at how <a href="https://food.com">https://food.com</a> does it in their recipe search:</p>
<p><video autoplay="" class="bigc" loop="" muted="" title="you can see the little spinning wheel turning when you scroll the page"><source src="/videos/endless_pagination.mp4" type="video/mp4"/></video><figcaption>you can see the little spinning wheel turning when you scroll the page</figcaption></p>
<p>As you can see, the content of this website loads dynamically every time the user scrolls the page.<br/>
This technique is especially common in Single Page Applications (SPAs), where the whole idea is that the user never needs to switch locations and content is dynamically replaced instead. </p>
<p class="info">For more information on Single Page Applications see <a href="https://developer.mozilla.org/en-US/docs/Glossary/SPA">MDN's documentation on SPA</a></p>
<p>Since our web-scraper is not a browser (unless we use browser emulation), it doesn't execute javascript. Meaning that, to access this dynamic content, we must reverse engineer this behavior so we can replicate it in our code. </p>
<p>Let's fire up devtools' Network Inspector and see what food.com does when we scroll down:</p>
<p><video autoplay="" class="bigc" loop="" muted="" title="XHR filter tab only shows data requests"><source src="/videos/endless_pagination_dev.mp4" type="video/mp4"/></video><figcaption>XHR filter tab only shows data requests</figcaption></p>
<p>We can see that as we continue scrolling, <code>search</code> requests are being made. Each one actually contains the data for a whole page of recipes which, when received, is injected into the html page by javascript.<br/>
Let's take a look at these requests and how we can replicate them in our web-scraper:</p>
<p><a href="/images/foodcom_pagination.png"><img class="bigc" loading="lazy" src="/images/foodcom_pagination.png" title=""/></a><figcaption></figcaption></p>
<p>Here we can see that the request being made is a <code>POST</code> type request to <a href="https://api.food.com/external/v1/nlp/search">https://api.food.com/external/v1/nlp/search</a> and it's sending some JSON data. In return it receives a json document with 10 recipes and loads of meta information - like how many pages there are in total. That's exactly what we're looking for!</p>
<p>Let's take a look at the document we need to send to receive this information.<br/>
Under "Request Payload" we see a json document:</p>
<div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal">1</span>
<span class="normal">2</span>
<span class="normal">3</span>
<span class="normal">4</span>
<span class="normal">5</span>
<span class="normal">6</span>
<span class="normal">7</span></pre></div></td><td class="code"><div><pre><span></span><code><span class="p">{</span>
<span class="w"> </span><span class="nt">"contexts"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span>
<span class="w"> </span><span class="s2">"{\"name\":\"toast\",\"paths\":[\"/~asset/bread\"],\"type\":\"PRIMARY\",\"searchType\":\"NORMAL\",\"degreesSeparation\":0,\"cleanedName\":\"toast\",\"popularityFactor\":0,\"taggedContentCount\":2054,\"userToken\":true,\"searchGuess\":false,\"essenceContext\":false,\"matchingCandidate\":false}"</span>
<span class="w"> </span><span class="p">],</span>
<span class="w"> </span><span class="nt">"searchTerm"</span><span class="p">:</span><span class="w"> </span><span class="s2">"toast"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"pn"</span><span class="p">:</span><span class="w"> </span><span class="mi">14</span>
<span class="p">}</span>
</code></pre></div></td></tr></table></div>
<p>Some context data is being sent, along with the search term <code>"toast"</code> and a <code>pn</code> integer argument which seems to be short for <code>page number</code>. Great, that means we can request any page for any search term! <br/>
Let's replicate this request in Python:</p>
<div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal"> 1</span>
<span class="normal"> 2</span>
<span class="normal"> 3</span>
<span class="normal"> 4</span>
<span class="normal"> 5</span>
<span class="normal"> 6</span>
<span class="normal"> 7</span>
<span class="normal"> 8</span>
<span class="normal"> 9</span>
<span class="normal">10</span>
<span class="normal">11</span>
<span class="normal">12</span>
<span class="normal">13</span>
<span class="normal">14</span>
<span class="normal">15</span>
<span class="normal">16</span>
<span class="normal">17</span>
<span class="normal">18</span>
<span class="normal">19</span>
<span class="normal">20</span>
<span class="normal">21</span>
<span class="normal">22</span></pre></div></td><td class="code"><div><pre><span></span><code><span class="kn">import</span> <span class="nn">requests</span>
<span class="n">headers</span> <span class="o">=</span> <span class="p">{</span>
<span class="c1"># when web scraping we always want to appear as </span>
<span class="c1"># a web browser to prevent being blocked</span>
<span class="s2">"User-Agent"</span><span class="p">:</span> <span class="s2">"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36"</span>
<span class="p">}</span>
<span class="n">data</span> <span class="o">=</span> <span class="p">{</span>
<span class="c1"># our recipe search term</span>
<span class="s2">"searchTerm"</span><span class="p">:</span> <span class="s2">"Toast"</span><span class="p">,</span>
<span class="c1"># page number</span>
<span class="s2">"pn"</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span>
<span class="p">}</span>
<span class="n">url</span> <span class="o">=</span> <span class="s2">"https://api.food.com/external/v1/nlp/search"</span>
<span class="n">response</span> <span class="o">=</span> <span class="n">requests</span><span class="o">.</span><span class="n">post</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">json</span><span class="o">=</span><span class="n">data</span><span class="p">,</span> <span class="n">headers</span><span class="o">=</span><span class="n">headers</span><span class="p">)</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">response</span><span class="o">.</span><span class="n">json</span><span class="p">()</span>
<span class="n">results</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="s2">"response"</span><span class="p">][</span><span class="s2">"results"</span><span class="p">]</span>
<span class="n">total_results_count</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="s2">"response"</span><span class="p">][</span><span class="s2">"totalResultsCount"</span><span class="p">]</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"found </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">results</span><span class="p">)</span><span class="si">}</span><span class="s2"> results from </span><span class="si">{</span><span class="n">total_results_count</span><span class="si">}</span><span class="s2"> total"</span><span class="p">)</span>
<span class="c1"># this will print: "found 10 results from 2246 total"</span>
</code></pre></div></td></tr></table></div>
<p>It works! We've successfully reverse engineered food.com's backend API for its paging function and replicated it in this Python scraper script! </p>
<p>If you noticed, we skipped the <code>contexts</code> object in our POST body entirely. The great part about reverse engineering something is that we can adjust its functions and end up with clean, efficient, beautiful programs. This is great for us, as our scraper uses fewer resources and is easier to follow and maintain, and it's also better for our target - food.com - as we consume fewer of their resources by scraping only specific data targets rather than loading the whole page and all the extras such as images, videos etc.</p>
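<p>Since the only moving part of the request above is the <code>pn</code> field, paging over the full result set reduces to looping that one value. Here's a minimal sketch (the helper names are ours, not food.com's; error handling and rate limiting are omitted, and the <code>requests</code> package is assumed to be installed):</p>

```python
import requests  # same client library as the example above


def search_payload(term: str, page: int) -> dict:
    """Build the minimal POST body for the search endpoint (pn = page number)."""
    return {"searchTerm": term, "pn": page}


def scrape_all_pages(term: str):
    """Yield search results page by page until totalResultsCount is exhausted."""
    url = "https://api.food.com/external/v1/nlp/search"
    page, seen = 1, 0
    while True:
        data = requests.post(url, json=search_payload(term, page)).json()
        results = data["response"]["results"]
        yield from results
        seen += len(results)
        # stop when the API returns an empty page or we've seen every result
        if not results or seen >= data["response"]["totalResultsCount"]:
            return
        page += 1
```

Since the generator requests pages lazily, we can stop consuming it early without wasting requests on pages we never read.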
<h2 id="summary-and-further-reading_1">Summary And Further Reading</h2>
<p>In this short introduction article we covered what browser developer tools are and how we can use them to understand the basic workings of javascript-driven websites. We covered an example case of how <a href="https://food.com">https://food.com</a> uses javascript for its never-ending pagination and how we can replicate it in Python using the <code>requests</code> package. </p>
<p>The reverse engineering story doesn't end here - the web is becoming more complex by the day, and the browser's devtools only scratch the surface of what we can learn about websites. In the future we'll cover advanced reverse engineering topics such as man-in-the-middle monitoring programs like <a href="https://mitmproxy.org/">mitmproxy</a>, <a href="https://docs.telerik.com/fiddler-everywhere/introduction">fiddler</a> and <a href="https://www.wireshark.org/">wireshark</a>, which allow even more detailed inspection as well as various quality-of-life tools such as request interception and scripting. These tools let us reverse engineer not only websites but also desktop and mobile applications!</p>
<hr/>
<p>So stick around for more articles and if you have any questions, come join us on <a href="https://matrix.to/#/%23web-scraping:matrix.org">#web-scraping on matrix</a>, check out <a href="https://stackoverflow.com/questions/tagged/web-scraping">#web-scraping on stackoverflow</a> or leave a comment below! </p>
<p>As always, you can hire me for web-scraping consultation over at the <a href="/pages/hire.html">hire</a> page - and happy scraping! </p>
<hr/>
<figcaption>Banner image by: "ENIGMA cipher machine collection" by brewbooks is licensed under CC BY-SA 2.0</figcaption>Web Scraping Target Discovery2021-09-29T00:00:00+02:002021-09-29T00:00:00+02:00Bernardas Ališauskastag:scrapecrow.com,2021-09-29:/web-scraping-discovery.html<p>Target discovery in web-scraping is how the scraper explores the target website to find scraping targets. For example, to scrape product data of an e-commerce website we would need to find urls to each individual product. This step is called "discovery". What types of discovery methods are there?</p><p><a href="/images/banner-telescopes.png"><img class="fullc" loading="lazy" src="/images/banner-telescopes.png" title=""/></a><figcaption></figcaption></p>
<p>Most web scrapers are made up of two core parts: finding products on the website and actually scraping them. The former is often referred to as the "target discovery" step. For example, to scrape product data of an e-commerce website we would first need to find urls to each individual product - only then can we scrape their data.</p>
<p>Discovering targets to scrape is often a challenging and important task. In this blog series tagged <a href="/tag/discovery-methods.html">#discovery-methods</a> we'll take a look at the common discovery methods used in web-scraping, each different enough to have its own risks, drawbacks and benefits. We'll target an example clothing store website <a href="https://hm.com">https://hm.com</a> for all of these discovery approaches:</p>
<ul>
<li><a href="/web-scraping-discovery-sitemaps.html">Sitemaps</a> - using the website's sitemap indexes.</li>
<li><a href="/web-scraping-discovery-search.html">Search API</a> - reverse engineering the website's search api.</li>
<li><a href="/web-scraping-discovery-indexes.html">Indexes</a> - taking advantage of existing indexes and search engines.</li>
<li><a href="/web-scraping-discovery-crawling.html">Crawling</a> - recursively scraping the whole website to find what we're looking for.</li>
</ul>
<p>These are the 4 main approaches to web scraping target discovery, and knowing them is a great tool in a web scraper developer's utility belt.<br/>
To quickly summarize our 4 main articles, the discovery strategies vary quite a bit:</p>
<ul>
<li><strong>Sitemaps</strong> are probably the best approach as they're fast, safe and easy to implement; unfortunately, many websites' sitemaps are neglected or contain outdated links. </li>
<li>Reverse engineering a website's <strong>search api</strong>, on the other hand, is both efficient and yields great results; however it requires reverse-engineering knowledge and can be difficult and time consuming to implement. </li>
<li><strong>Crawling</strong> is a great general approach but it's risky, slow and resource intensive. </li>
<li>Finally, taking advantage of <strong>existing indexes</strong> is a great last resort for websites that don't like to be scraped, as they still want to be indexed by search engines and other indexers.</li>
</ul>
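<p>To give a taste of the simplest of these, the sitemap approach often reduces discovery to downloading <code>/sitemap.xml</code> and collecting its <code>loc</code> entries. A minimal sketch of the extraction step (the urls here are illustrative; real sitemaps are frequently nested indexes pointing at further sitemap files):</p>

```python
import re


def extract_sitemap_urls(sitemap_xml: str) -> list:
    """Collect all <loc> entries from a sitemap or sitemap index document."""
    return re.findall(r"<loc>(.*?)</loc>", sitemap_xml)


# illustrative sitemap snippet following hm.com's product url pattern
example = """
<urlset>
  <url><loc>https://www2.hm.com/en_us/productpage.09008.html</loc></url>
  <url><loc>https://www2.hm.com/en_us/productpage.09009.html</loc></url>
</urlset>
"""
print(extract_sitemap_urls(example))
```

A proper implementation would use an XML parser rather than a regex, but for the well-formed XML sitemaps are required to be, this shortcut tends to hold up.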
<p>So which one should you use? It really depends on your target and resources. <br/>
Hopefully this extensive blog series can help you determine the right way to find your data targets!</p>
<hr/>
<p>For more web-scraping discovery techniques, see <a href="/tag/discovery-methods.html">#discovery-methods</a>, and <a href="/tag/discovery.html">#discovery</a> for related subjects. </p>
<p>If you have any questions, come join us on <a href="https://matrix.to/#/%23web-scraping:matrix.org">#web-scraping on matrix</a>, check out <a href="https://stackoverflow.com/questions/tagged/web-scraping">#web-scraping on stackoverflow</a> or leave a comment below! </p>
<p>As always, you can hire me for web-scraping consultation over at the <a href="/pages/hire.html">hire</a> page - and happy scraping! </p>
<hr/>
<figcaption>image credits: "Telescopes, Mauna Kea, Hawaii" by Gregory Williams is licensed under CC BY-NC-ND 2.0</figcaption>Web Scraping Target Discovery: Crawling2021-09-28T00:00:00+02:002021-09-28T00:00:00+02:00Bernardas Ališauskastag:scrapecrow.com,2021-09-28:/web-scraping-discovery-crawling.html<p>The most common web scraping target discovery technique: recursive crawling. How does it work? What are the pros and cons and the most optimal execution patterns?</p><p><a href="/images/banner-web.jpg"><img class="fullc" loading="lazy" src="/images/banner-web.jpg" title=""/></a><figcaption></figcaption></p>
<p>Most web scrapers are made up of two core parts: finding products on the website and actually scraping them. The former is often referred to as the "target discovery" step. For example, to scrape product data of an e-commerce website we would first need to find urls to each individual product - only then can we scrape their data.</p>
<p>Discovering scrape targets is often a challenging and important task. This series of blog posts tagged with <a href="/tag/discovery-methods.html">#discovery-methods</a> (also see the <a href="/web-scraping-discovery.html">main article</a>) covers common target discovery approaches.</p>
<p>In this article we'll take a look at web crawling and how we can use it as a discovery strategy in web scraping. </p>
<h2 id="what-is-recursive-crawling-and-how-is-it-used-in-web-scraping">What is recursive crawling and how is it used in web-scraping?</h2>
<p>One of the most common ways to discover web scraping targets is to recursively crawl the website. This technique is usually used by broad scrapers (scrapers that target many different websites) and index crawlers such as Google and other search engine bots.<br/>
In short, crawling is a recursive scraping technique where, given a start url and some crawling rules, the scraper continues exploring the website by visiting <em>all'ish</em> of the links present on it. </p>
<p>To wrap our heads around the crawling concept, let's refer to this small flow chart:</p>
<p><a href="/images/crawl-flow.png"><img class="" loading="lazy" src="/images/crawl-flow.png" title=""/></a><figcaption></figcaption></p>
<p>This flow chart illustrates the simplest domain-bound crawl spider flow: the crawler is given a starting point, which it scrapes and parses for urls present in the html body. It then applies matching rules to the urls and determines whether to save them (for scraping later) or to follow them and repeat the whole process.</p>
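<p>The same flow can be sketched without any http at all - here the "website" is just a dict mapping pages to the links found on them, and the save/follow rules are plain predicates (all names here are illustrative, not part of any real site):</p>

```python
def crawl(site: dict, start: str, should_save, should_follow) -> set:
    """Breadth-first crawl over an in-memory link graph."""
    saved, seen, queue = set(), {start}, [start]
    while queue:
        page = queue.pop(0)
        for url in site.get(page, []):
            if url in seen:
                continue  # duplicate filter: never revisit a url
            seen.add(url)
            if should_save(url):
                saved.add(url)  # a scrape target - store it, don't follow it
            elif should_follow(url):
                queue.append(url)  # a navigation page - explore it later
    return saved


# toy link graph standing in for a real website
site = {
    "/index.html": ["/women.html", "/productpage.1.html"],
    "/women.html": ["/productpage.2.html", "/index.html"],
}
found = crawl(
    site, "/index.html",
    should_save=lambda u: "/productpage." in u,
    should_follow=lambda u: u.endswith(".html"),
)
print(found)  # both product pages
```

Swapping the dict lookup for an http request plus link extraction gives exactly the crawler we'll build below.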
<p>Before using crawling as a web scraping discovery strategy, it's good practice to reflect on the common pros and cons of this technique and see whether it fits your web-scraping project:</p>
<p>Pros: </p>
<ul>
<li><strong>Generic Algorithm</strong>: can be applied to any website with few adjustments. In other words one web scraper can be adapted to any website quite easily. </li>
<li><strong>Good Coverage</strong>: some websites (like e-commerce) are well interlinked thus crawling will have great discovery coverage. </li>
<li><strong>Easy to Develop</strong>: no reverse-engineering skills are required since we're just following the natural website structure. </li>
</ul>
<p>Cons: </p>
<ul>
<li><strong>Inefficient and Slow</strong>: since crawling is a very generic solution it comes with a lot of inefficiencies. Often extracted pages don't contain any product links, so lots of crawl branches end up as dead ends. </li>
<li><strong>Insufficient Coverage</strong>: some websites are not well interlinked (sometimes purposefully to prevent web scrapers). Crawlers can't discover items that are not referenced anywhere. </li>
<li><strong>Risk</strong>: since crawling requests far more pages than other discovery approaches, the scraper's IPs are more likely to be throttled or blocked. </li>
<li><strong>Struggles With Javascript-Heavy Websites</strong>: since crawling is very generic and web scrapers don't execute javascript (unless using browser emulation), some websites might be too complex for a web scraper to follow. </li>
</ul>
<p>We can see that crawling is a smart, generic way to discover scrape targets, however it's not without its faults: it's slower, less accurate and might be hard to accomplish on some javascript-heavy websites.<br/>
Let's take a look at an example target discovery implementation that uses web crawling.</p>
<h2 id="example-use-case-hmcom">Example Use Case: hm.com</h2>
<p>Let's take a look at a popular clothing e-commerce website: <a href="https://hm.com">https://hm.com</a>. We'll be using the crawling approach to find all clothing products on the website. </p>
<p>First, let's establish the essential parts that make up a web crawler:</p>
<ol>
<li>Link extractor - a function/object that can find urls in html body.</li>
<li>Defined link pattern rules - a function/object that determines how to handle extracted links.</li>
<li>Duplicate filter - object that keeps track of links scraper visited.</li>
<li>Limiter - since crawling visits many urls, we need to limit the connection rate so as not to overwhelm the website. </li>
</ol>
<p>These are the 4 components that make up a basic web crawler. Let's see how we can implement them for hm.com.</p>
<h3 id="crawling-rules">Crawling Rules</h3>
<p>First, let's establish our crawling rules. As per the flowchart above, our crawler needs to know which urls to follow up and which to save:</p>
<div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal">1</span>
<span class="normal">2</span>
<span class="normal">3</span>
<span class="normal">4</span>
<span class="normal">5</span>
<span class="normal">6</span></pre></div></td><td class="code"><div><pre><span></span><code><span class="kn">import</span> <span class="nn">re</span>
<span class="k">class</span> <span class="nc">HMScraper</span><span class="p">:</span>
    <span class="n">save_urls</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s2">r"/productpage\."</span><span class="p">)</span> <span class="c1"># e.g ...com/en_us/productpage.09008.html</span>
    <span class="n">follow_urls</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s2">r"\.html"</span><span class="p">)</span>
    <span class="n">follow_saved_urls</span> <span class="o">=</span> <span class="kc">False</span>
</code></pre></div></td></tr></table></div>
<p>Here we defined our crawling rules:</p>
<ul>
<li>We want to save all urls that contain <code>/productpage.</code>, as all hm.com product urls follow this pattern. </li>
<li>We want to follow up any url containing <code>.html</code> </li>
<li>
<p>Do not follow urls that are being saved. </p>
<p class="info">Following saved urls can be useful, as product pages often contain "related products" urls which can help increase discovery coverage. For the hm.com domain this is unnecessary.<br/></p>
</li>
</ul>
<p>These are the 3 rules that define our crawler's routine for the domain <code>hm.com</code>. With that ready, let's take a look at how we can create a link extractor function that uses these rules to extract crawl targets.</p>
<h3 id="crawl-loop">Crawl Loop</h3>
<p>With the crawling rules defined, we need to create a crawl loop that uses them to schedule the whole crawl process. <br/>
In this example we'll be using the <code>httpx</code> python package for http processing and <code>parsel</code> for html parsing. With these two tools we can define a basic crawler skeleton:</p>
<div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal"> 1</span>
<span class="normal"> 2</span>
<span class="normal"> 3</span>
<span class="normal"> 4</span>
<span class="normal"> 5</span>
<span class="normal"> 6</span>
<span class="normal"> 7</span>
<span class="normal"> 8</span>
<span class="normal"> 9</span>
<span class="normal">10</span>
<span class="normal">11</span>
<span class="normal">12</span>
<span class="normal">13</span>
<span class="normal">14</span>
<span class="normal">15</span>
<span class="normal">16</span>
<span class="normal">17</span>
<span class="normal">18</span>
<span class="normal">19</span>
<span class="normal">20</span>
<span class="normal">21</span>
<span class="normal">22</span>
<span class="normal">23</span>
<span class="normal">24</span>
<span class="normal">25</span>
<span class="normal">26</span>
<span class="normal">27</span>
<span class="normal">28</span>
<span class="normal">29</span>
<span class="normal">30</span>
<span class="normal">31</span>
<span class="normal">32</span>
<span class="normal">33</span>
<span class="normal">34</span>
<span class="normal">35</span>
<span class="normal">36</span>
<span class="normal">37</span>
<span class="normal">38</span>
<span class="normal">39</span>
<span class="normal">40</span>
<span class="normal">41</span>
<span class="normal">42</span>
<span class="normal">43</span>
<span class="normal">44</span>
<span class="normal">45</span>
<span class="normal">46</span>
<span class="normal">47</span>
<span class="normal">48</span>
<span class="normal">49</span>
<span class="normal">50</span>
<span class="normal">51</span>
<span class="normal">52</span></pre></div></td><td class="code"><div><pre><span></span><code><span class="kn">import</span> <span class="nn">logging</span>
<span class="kn">import</span> <span class="nn">asyncio</span>
<span class="kn">import</span> <span class="nn">re</span>
<span class="kn">from</span> <span class="nn">urllib.parse</span> <span class="kn">import</span> <span class="n">urlparse</span>
<span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">List</span>
<span class="kn">from</span> <span class="nn">parsel</span> <span class="kn">import</span> <span class="n">Selector</span>
<span class="kn">from</span> <span class="nn">httpx</span> <span class="kn">import</span> <span class="n">AsyncClient</span>
<span class="k">class</span> <span class="nc">HMScraper</span><span class="p">:</span>
<span class="n">save_urls</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s2">"/productpage\."</span><span class="p">)</span> <span class="c1"># e.g ...com/en_us/productpage.09008.html</span>
<span class="n">follow_urls</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s2">"\.html"</span><span class="p">)</span>
<span class="n">follow_saved_urls</span> <span class="o">=</span> <span class="kc">False</span>
<span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">limit</span><span class="o">=</span><span class="mi">5</span><span class="p">):</span>
<span class="c1"># asyncio.Semaphore object allows us to limit coroutine concurrency </span>
<span class="c1"># in our case we can limit how many concurrent requests are being made</span>
<span class="bp">self</span><span class="o">.</span><span class="n">limiter</span> <span class="o">=</span> <span class="n">asyncio</span><span class="o">.</span><span class="n">Semaphore</span><span class="p">(</span><span class="n">limit</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">seen_urls</span> <span class="o">=</span> <span class="nb">set</span><span class="p">()</span>
<span class="k">async</span> <span class="k">def</span> <span class="fm">__aenter__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="w"> </span><span class="sd">"""on scraper creation open http session"""</span>
<span class="bp">self</span><span class="o">.</span><span class="n">session</span> <span class="o">=</span> <span class="n">AsyncClient</span><span class="p">(</span>
<span class="c1"># we should use a browser-like user agent header to avoid being blocked</span>
<span class="n">headers</span><span class="o">=</span><span class="p">{</span>
<span class="s2">"User-Agent"</span><span class="p">:</span> <span class="s2">"Mozilla/5.0 (X11; Linux x86_64; rv:92.0) Gecko/20100101 Firefox/92.0"</span>
<span class="p">}</span>
<span class="p">)</span>
<span class="k">return</span> <span class="bp">self</span>
<span class="k">async</span> <span class="k">def</span> <span class="fm">__aexit__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="o">*</span><span class="n">args</span><span class="p">):</span>
<span class="w"> </span><span class="sd">"""on scraper destruction close http session"""</span>
<span class="k">await</span> <span class="bp">self</span><span class="o">.</span><span class="n">session</span><span class="o">.</span><span class="n">aclose</span><span class="p">()</span>
<span class="k">async</span> <span class="k">def</span> <span class="nf">_request</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">url</span><span class="p">:</span> <span class="nb">str</span><span class="p">):</span>
<span class="w"> </span><span class="sd">"""our http request wrapper function that implements rate limiting"""</span>
<span class="k">async</span> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">limiter</span><span class="p">:</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">resp</span> <span class="o">=</span> <span class="k">await</span> <span class="bp">self</span><span class="o">.</span><span class="n">session</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
<span class="k">except</span> <span class="ne">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
<span class="k">raise</span>
<span class="k">return</span> <span class="n">resp</span>
<span class="k">async</span> <span class="k">def</span> <span class="nf">save</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">url</span><span class="p">):</span>
<span class="c1"># for display purposes lets just print the url</span>
<span class="nb">print</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
<span class="k">async</span> <span class="k">def</span> <span class="nf">scrape</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="o">...</span>
<span class="k">def</span> <span class="nf">find_links</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">resp</span><span class="p">):</span>
<span class="o">...</span>
</code></pre></div></td></tr></table></div>
<p>With this skeleton, we have a basic usage API for our scraper. We can now define our run function:</p>
<div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal"> 1</span>
<span class="normal"> 2</span>
<span class="normal"> 3</span>
<span class="normal"> 4</span>
<span class="normal"> 5</span>
<span class="normal"> 6</span>
<span class="normal"> 7</span>
<span class="normal"> 8</span>
<span class="normal"> 9</span>
<span class="normal">10</span></pre></div></td><td class="code"><div><pre><span></span><code><span class="k">async</span> <span class="k">def</span> <span class="nf">run</span><span class="p">():</span>
    <span class="k">async</span> <span class="k">with</span> <span class="n">HMScraper</span><span class="p">()</span> <span class="k">as</span> <span class="n">scraper</span><span class="p">:</span>
        <span class="n">start_urls</span> <span class="o">=</span> <span class="p">[</span>
            <span class="c1"># homepage for US website</span>
            <span class="s2">"https://www2.hm.com/en_us/index.html"</span>
        <span class="p">]</span>
        <span class="k">await</span> <span class="n">scraper</span><span class="o">.</span><span class="n">scrape</span><span class="p">(</span><span class="n">start_urls</span><span class="p">)</span>
<span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s2">"__main__"</span><span class="p">:</span>
<span class="n">asyncio</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="n">run</span><span class="p">())</span>
</code></pre></div></td></tr></table></div>
<p>Great! Now all we have to do is fill in the interesting bits: link extraction and the scrape loop.<br/>
For the scrape loop, all we need to do is request urls, find links in them, and follow or save the ones that match our rules:</p>
<div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal"> 1</span>
<span class="normal"> 2</span>
<span class="normal"> 3</span>
<span class="normal"> 4</span>
<span class="normal"> 5</span>
<span class="normal"> 6</span>
<span class="normal"> 7</span>
<span class="normal"> 8</span>
<span class="normal"> 9</span>
<span class="normal">10</span>
<span class="normal">11</span>
<span class="normal">12</span>
<span class="normal">13</span>
<span class="normal">14</span>
<span class="normal">15</span>
<span class="normal">16</span>
<span class="normal">17</span>
<span class="normal">18</span>
<span class="normal">19</span>
<span class="normal">20</span>
<span class="normal">21</span>
<span class="normal">22</span>
<span class="normal">23</span>
<span class="normal">24</span>
<span class="normal">25</span></pre></div></td><td class="code"><div><pre><span></span><code> <span class="k">async</span> <span class="k">def</span> <span class="nf">scrape</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">urls</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="nb">str</span><span class="p">]):</span>
<span class="w"> </span><span class="sd">"""Breadth first"""</span>
<span class="k">while</span> <span class="kc">True</span><span class="p">:</span>
<span class="n">to_follow</span> <span class="o">=</span> <span class="nb">set</span><span class="p">()</span>
<span class="k">for</span> <span class="n">resp</span> <span class="ow">in</span> <span class="n">asyncio</span><span class="o">.</span><span class="n">as_completed</span><span class="p">([</span><span class="bp">self</span><span class="o">.</span><span class="n">_request</span><span class="p">(</span><span class="n">url</span><span class="p">)</span> <span class="k">for</span> <span class="n">url</span> <span class="ow">in</span> <span class="n">urls</span><span class="p">]):</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">resp</span> <span class="o">=</span> <span class="k">await</span> <span class="n">resp</span>
<span class="c1"># skip failed requests; ideally this should be retried or logged</span>
<span class="k">except</span> <span class="ne">Exception</span><span class="p">:</span>
<span class="k">continue</span>
<span class="k">if</span> <span class="n">resp</span><span class="o">.</span><span class="n">status_code</span> <span class="o">!=</span> <span class="mi">200</span><span class="p">:</span>
<span class="k">continue</span>
<span class="k">for</span> <span class="n">url</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">find_links</span><span class="p">(</span><span class="n">resp</span><span class="p">):</span>
<span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="n">save_urls</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="n">url</span><span class="p">):</span>
<span class="k">await</span> <span class="bp">self</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
<span class="k">if</span> <span class="ow">not</span> <span class="bp">self</span><span class="o">.</span><span class="n">follow_saved_urls</span><span class="p">:</span>
<span class="k">continue</span>
<span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="n">follow_urls</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="n">url</span><span class="p">):</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">" following </span><span class="si">{</span><span class="n">url</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
<span class="n">to_follow</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
<span class="k">if</span> <span class="n">to_follow</span><span class="p">:</span>
<span class="n">urls</span> <span class="o">=</span> <span class="n">to_follow</span>
<span class="k">else</span><span class="p">:</span>
<span class="k">return</span> <span class="c1"># end of the crawl</span>
</code></pre></div></td></tr></table></div>
<p>Here we've defined an "endless" while loop that does exactly that: fetch htmls, parse them for urls, save some of them and follow up the others. The last remaining piece is our link extraction logic. </p>
<h3 id="link-extracting">Link Extracting</h3>
<p>The link extraction process is the core part of the crawler and can get quite complex in logic. For our example domain <code>hm.com</code> it's relatively simple: we'll find all urls in the page by following <code><a></code> nodes:</p>
<div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal"> 1</span>
<span class="normal"> 2</span>
<span class="normal"> 3</span>
<span class="normal"> 4</span>
<span class="normal"> 5</span>
<span class="normal"> 6</span>
<span class="normal"> 7</span>
<span class="normal"> 8</span>
<span class="normal"> 9</span>
<span class="normal">10</span>
<span class="normal">11</span>
<span class="normal">12</span>
<span class="normal">13</span>
<span class="normal">14</span>
<span class="normal">15</span>
<span class="normal">16</span>
<span class="normal">17</span>
<span class="normal">18</span></pre></div></td><td class="code"><div><pre><span></span><code> <span class="k">def</span> <span class="nf">find_links</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">resp</span><span class="p">,</span> <span class="n">only_unique</span><span class="o">=</span><span class="kc">True</span><span class="p">):</span>
<span class="c1"># build a parsable tree from html body</span>
<span class="n">sel</span> <span class="o">=</span> <span class="n">Selector</span><span class="p">(</span><span class="n">resp</span><span class="o">.</span><span class="n">text</span><span class="p">)</span>
<span class="n">current_url_parts</span> <span class="o">=</span> <span class="n">urlparse</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="n">resp</span><span class="o">.</span><span class="n">url</span><span class="p">))</span>
<span class="c1"># find all <a> nodes and select their href attribute</span>
<span class="n">urls</span> <span class="o">=</span> <span class="n">sel</span><span class="o">.</span><span class="n">xpath</span><span class="p">(</span><span class="s2">"//a/@href"</span><span class="p">)</span><span class="o">.</span><span class="n">extract</span><span class="p">()</span>
<span class="k">for</span> <span class="n">url</span> <span class="ow">in</span> <span class="n">urls</span><span class="p">:</span>
<span class="c1"># convert relative url to absolute</span>
<span class="k">if</span> <span class="n">url</span><span class="o">.</span><span class="n">startswith</span><span class="p">(</span><span class="s2">"/"</span><span class="p">):</span>
<span class="n">url</span> <span class="o">=</span> <span class="n">current_url_parts</span><span class="o">.</span><span class="n">_replace</span><span class="p">(</span><span class="n">path</span><span class="o">=</span><span class="n">url</span><span class="p">)</span><span class="o">.</span><span class="n">geturl</span><span class="p">()</span>
<span class="c1"># skip absolute urls that do not match current domain</span>
<span class="k">if</span> <span class="n">urlparse</span><span class="p">(</span><span class="n">url</span><span class="p">)</span><span class="o">.</span><span class="n">netloc</span> <span class="o">!=</span> <span class="n">current_url_parts</span><span class="o">.</span><span class="n">netloc</span><span class="p">:</span>
<span class="k">continue</span>
<span class="c1"># skip visited urls</span>
<span class="k">if</span> <span class="n">only_unique</span> <span class="ow">and</span> <span class="n">url</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">seen_urls</span><span class="p">:</span>
<span class="k">continue</span>
<span class="bp">self</span><span class="o">.</span><span class="n">seen_urls</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
<span class="k">yield</span> <span class="n">url</span>
</code></pre></div></td></tr></table></div>
<p>Here we first build a tree parser object to get all of those <code><a></code> node links. Then we iterate through them, filtering out anything that isn't a url of this website or has already been visited.</p>
<p class="info">Link extraction can get complicated very quickly, as some websites contain non-html files (e.g. <code>/document.pdf</code>) that need to be filtered out, among many other niche scenarios. </p>
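<p>One way to handle the non-html file case is a small extension filter applied before yielding urls. A sketch (the extension list is illustrative and would grow per target):</p>

```python
from urllib.parse import urlparse

# extensions we never want to crawl; illustrative, extend as needed
SKIP_EXTENSIONS = (".pdf", ".jpg", ".jpeg", ".png", ".zip", ".css", ".js")


def is_crawlable(url: str) -> bool:
    """Reject urls pointing at obvious non-html resources."""
    path = urlparse(url).path.lower()
    return not path.endswith(SKIP_EXTENSIONS)


print(is_crawlable("https://www2.hm.com/en_us/productpage.09008.html"))  # True
print(is_crawlable("https://www2.hm.com/files/document.pdf"))            # False
```

This check slots naturally into the loop inside <code>find_links</code>, right next to the domain and duplicate filters.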
<p>With link extraction complete, we can put our whole crawler together into one piece and see how it performs!</p>
<h3 id="putting-it-all-together">Putting It All Together</h3>
<p>Now that we have all the parts complete - crawl loop, link extraction, link matching and request limiting - let's put them all together and run the crawler:</p>
<div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal"> 1</span>
<span class="normal"> 2</span>
<span class="normal"> 3</span>
<span class="normal"> 4</span>
<span class="normal"> 5</span>
<span class="normal"> 6</span>
<span class="normal"> 7</span>
<span class="normal"> 8</span>
<span class="normal"> 9</span>
<span class="normal"> 10</span>
<span class="normal"> 11</span>
<span class="normal"> 12</span>
<span class="normal"> 13</span>
<span class="normal"> 14</span>
<span class="normal"> 15</span>
<span class="normal"> 16</span>
<span class="normal"> 17</span>
<span class="normal"> 18</span>
<span class="normal"> 19</span>
<span class="normal"> 20</span>
<span class="normal"> 21</span>
<span class="normal"> 22</span>
<span class="normal"> 23</span>
<span class="normal"> 24</span>
<span class="normal"> 25</span>
<span class="normal"> 26</span>
<span class="normal"> 27</span>
<span class="normal"> 28</span>
<span class="normal"> 29</span>
<span class="normal"> 30</span>
<span class="normal"> 31</span>
<span class="normal"> 32</span>
<span class="normal"> 33</span>
<span class="normal"> 34</span>
<span class="normal"> 35</span>
<span class="normal"> 36</span>
<span class="normal"> 37</span>
<span class="normal"> 38</span>
<span class="normal"> 39</span>
<span class="normal"> 40</span>
<span class="normal"> 41</span>
<span class="normal"> 42</span>
<span class="normal"> 43</span>
<span class="normal"> 44</span>
<span class="normal"> 45</span>
<span class="normal"> 46</span>
<span class="normal"> 47</span>
<span class="normal"> 48</span>
<span class="normal"> 49</span>
<span class="normal"> 50</span>
<span class="normal"> 51</span>
<span class="normal"> 52</span>
<span class="normal"> 53</span>
<span class="normal"> 54</span>
<span class="normal"> 55</span>
<span class="normal"> 56</span>
<span class="normal"> 57</span>
<span class="normal"> 58</span>
<span class="normal"> 59</span>
<span class="normal"> 60</span>
<span class="normal"> 61</span>
<span class="normal"> 62</span>
<span class="normal"> 63</span>
<span class="normal"> 64</span>
<span class="normal"> 65</span>
<span class="normal"> 66</span>
<span class="normal"> 67</span>
<span class="normal"> 68</span>
<span class="normal"> 69</span>
<span class="normal"> 70</span>
<span class="normal"> 71</span>
<span class="normal"> 72</span>
<span class="normal"> 73</span>
<span class="normal"> 74</span>
<span class="normal"> 75</span>
<span class="normal"> 76</span>
<span class="normal"> 77</span>
<span class="normal"> 78</span>
<span class="normal"> 79</span>
<span class="normal"> 80</span>
<span class="normal"> 81</span>
<span class="normal"> 82</span>
<span class="normal"> 83</span>
<span class="normal"> 84</span>
<span class="normal"> 85</span>
<span class="normal"> 86</span>
<span class="normal"> 87</span>
<span class="normal"> 88</span>
<span class="normal"> 89</span>
<span class="normal"> 90</span>
<span class="normal"> 91</span>
<span class="normal"> 92</span>
<span class="normal"> 93</span>
<span class="normal"> 94</span>
<span class="normal"> 95</span>
<span class="normal"> 96</span>
<span class="normal"> 97</span>
<span class="normal"> 98</span>
<span class="normal"> 99</span>
<span class="normal">100</span>
<span class="normal">101</span></pre></div></td><td class="code"><div><pre><span></span><code><span class="kn">import</span> <span class="nn">logging</span>
<span class="kn">import</span> <span class="nn">asyncio</span>
<span class="kn">import</span> <span class="nn">re</span>
<span class="kn">from</span> <span class="nn">urllib.parse</span> <span class="kn">import</span> <span class="n">urlparse</span>
<span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">List</span>
<span class="kn">from</span> <span class="nn">parsel</span> <span class="kn">import</span> <span class="n">Selector</span>
<span class="kn">from</span> <span class="nn">httpx</span> <span class="kn">import</span> <span class="n">AsyncClient</span>
<span class="k">class</span> <span class="nc">HMScraper</span><span class="p">:</span>
<span class="n">save_urls</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s2">"/productpage\."</span><span class="p">)</span> <span class="c1"># e.g ...com/en_us/productpage.09008.html</span>
<span class="n">follow_urls</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s2">"\.html"</span><span class="p">)</span>
<span class="n">follow_saved_urls</span> <span class="o">=</span> <span class="kc">False</span>
<span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">limit</span><span class="o">=</span><span class="mi">5</span><span class="p">):</span>
<span class="bp">self</span><span class="o">.</span><span class="n">limiter</span> <span class="o">=</span> <span class="n">asyncio</span><span class="o">.</span><span class="n">Semaphore</span><span class="p">(</span><span class="n">limit</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">log</span> <span class="o">=</span> <span class="n">logging</span><span class="o">.</span><span class="n">getLogger</span><span class="p">(</span><span class="nb">type</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span><span class="o">.</span><span class="vm">__name__</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">seen_urls</span> <span class="o">=</span> <span class="nb">set</span><span class="p">()</span>
<span class="k">async</span> <span class="k">def</span> <span class="fm">__aenter__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="w"> </span><span class="sd">"""on scraper creation open http session"""</span>
<span class="bp">self</span><span class="o">.</span><span class="n">session</span> <span class="o">=</span> <span class="n">AsyncClient</span><span class="p">(</span>
<span class="c1"># we should use a browser-like user agent header to avoid being blocked</span>
<span class="n">headers</span><span class="o">=</span><span class="p">{</span>
<span class="s2">"User-Agent"</span><span class="p">:</span> <span class="s2">"Mozilla/5.0 (X11; Linux x86_64; rv:92.0) Gecko/20100101 Firefox/92.0"</span>
<span class="p">}</span>
<span class="p">)</span>
<span class="k">return</span> <span class="bp">self</span>
<span class="k">async</span> <span class="k">def</span> <span class="fm">__aexit__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="o">*</span><span class="n">args</span><span class="p">):</span>
<span class="w"> </span><span class="sd">"""on scraper destruction close http session"""</span>
<span class="k">await</span> <span class="bp">self</span><span class="o">.</span><span class="n">session</span><span class="o">.</span><span class="n">aclose</span><span class="p">()</span>
<span class="k">async</span> <span class="k">def</span> <span class="nf">_request</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">url</span><span class="p">:</span> <span class="nb">str</span><span class="p">):</span>
<span class="k">async</span> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">limiter</span><span class="p">:</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">resp</span> <span class="o">=</span> <span class="k">await</span> <span class="bp">self</span><span class="o">.</span><span class="n">session</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
<span class="k">except</span> <span class="ne">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
<span class="k">return</span> <span class="n">e</span>
<span class="k">return</span> <span class="n">resp</span>
<span class="k">async</span> <span class="k">def</span> <span class="nf">save</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">url</span><span class="p">):</span>
<span class="nb">print</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
<span class="k">async</span> <span class="k">def</span> <span class="nf">scrape</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">urls</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="nb">str</span><span class="p">]):</span>
<span class="w"> </span><span class="sd">"""Breadth first"""</span>
<span class="k">while</span> <span class="kc">True</span><span class="p">:</span>
<span class="n">to_follow</span> <span class="o">=</span> <span class="nb">set</span><span class="p">()</span>
<span class="k">for</span> <span class="n">resp</span> <span class="ow">in</span> <span class="n">asyncio</span><span class="o">.</span><span class="n">as_completed</span><span class="p">([</span><span class="bp">self</span><span class="o">.</span><span class="n">_request</span><span class="p">(</span><span class="n">url</span><span class="p">)</span> <span class="k">for</span> <span class="n">url</span> <span class="ow">in</span> <span class="n">urls</span><span class="p">]):</span>
<span class="n">resp</span> <span class="o">=</span> <span class="k">await</span> <span class="n">resp</span>
<span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">resp</span><span class="p">,</span> <span class="ne">Exception</span><span class="p">):</span>
<span class="k">continue</span>
<span class="k">if</span> <span class="n">resp</span><span class="o">.</span><span class="n">status_code</span> <span class="o">!=</span> <span class="mi">200</span><span class="p">:</span>
<span class="k">continue</span>
<span class="k">for</span> <span class="n">url</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">find_links</span><span class="p">(</span><span class="n">resp</span><span class="p">):</span>
<span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="n">save_urls</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="n">url</span><span class="p">):</span>
<span class="k">await</span> <span class="bp">self</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
<span class="k">if</span> <span class="ow">not</span> <span class="bp">self</span><span class="o">.</span><span class="n">follow_saved_urls</span><span class="p">:</span>
<span class="k">continue</span>
<span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="n">follow_urls</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="n">url</span><span class="p">):</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">" following </span><span class="si">{</span><span class="n">url</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
<span class="n">to_follow</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
<span class="k">if</span> <span class="n">to_follow</span><span class="p">:</span>
<span class="n">urls</span> <span class="o">=</span> <span class="n">to_follow</span>
<span class="k">else</span><span class="p">:</span>
<span class="k">return</span> <span class="c1"># end of the crawl</span>
<span class="k">def</span> <span class="nf">find_links</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">resp</span><span class="p">,</span> <span class="n">only_unique</span><span class="o">=</span><span class="kc">True</span><span class="p">):</span>
<span class="w"> </span><span class="sd">"""</span>
<span class="sd"> find all relative page links in html link nodes</span>
<span class="sd"> """</span>
<span class="c1"># build a parsable tree from html body</span>
<span class="n">sel</span> <span class="o">=</span> <span class="n">Selector</span><span class="p">(</span><span class="n">resp</span><span class="o">.</span><span class="n">text</span><span class="p">)</span>
<span class="n">current_url_parts</span> <span class="o">=</span> <span class="n">urlparse</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="n">resp</span><span class="o">.</span><span class="n">url</span><span class="p">))</span>
<span class="c1"># find all <a> nodes and select their href attribute</span>
<span class="n">urls</span> <span class="o">=</span> <span class="n">sel</span><span class="o">.</span><span class="n">xpath</span><span class="p">(</span><span class="s2">"//a/@href"</span><span class="p">)</span><span class="o">.</span><span class="n">extract</span><span class="p">()</span>
<span class="k">for</span> <span class="n">url</span> <span class="ow">in</span> <span class="n">urls</span><span class="p">:</span>
<span class="c1"># convert relative url to absolute</span>
<span class="k">if</span> <span class="n">url</span><span class="o">.</span><span class="n">startswith</span><span class="p">(</span><span class="s2">"/"</span><span class="p">):</span>
<span class="n">url</span> <span class="o">=</span> <span class="n">current_url_parts</span><span class="o">.</span><span class="n">_replace</span><span class="p">(</span><span class="n">path</span><span class="o">=</span><span class="n">url</span><span class="p">)</span><span class="o">.</span><span class="n">geturl</span><span class="p">()</span>
<span class="c1"># skip absolute urls that do not match current domain</span>
<span class="k">if</span> <span class="n">urlparse</span><span class="p">(</span><span class="n">url</span><span class="p">)</span><span class="o">.</span><span class="n">netloc</span> <span class="o">!=</span> <span class="n">current_url_parts</span><span class="o">.</span><span class="n">netloc</span><span class="p">:</span>
<span class="k">continue</span>
<span class="c1"># skip visited urls</span>
<span class="k">if</span> <span class="n">only_unique</span> <span class="ow">and</span> <span class="n">url</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">seen_urls</span><span class="p">:</span>
<span class="k">continue</span>
<span class="bp">self</span><span class="o">.</span><span class="n">seen_urls</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
<span class="k">yield</span> <span class="n">url</span>
<span class="k">async</span> <span class="k">def</span> <span class="nf">run</span><span class="p">():</span>
<span class="k">async</span> <span class="k">with</span> <span class="n">HMScraper</span><span class="p">()</span> <span class="k">as</span> <span class="n">scraper</span><span class="p">:</span>
<span class="n">start_urls</span> <span class="o">=</span> <span class="p">[</span>
<span class="c1"># homepage for US website</span>
<span class="s2">"https://www2.hm.com/en_us/index.html"</span>
<span class="p">]</span>
<span class="k">await</span> <span class="n">scraper</span><span class="o">.</span><span class="n">scrape</span><span class="p">(</span><span class="n">start_urls</span><span class="p">)</span>
<span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s2">"__main__"</span><span class="p">:</span>
<span class="n">asyncio</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="n">run</span><span class="p">())</span>
</code></pre></div></td></tr></table></div>
<p>If we run our crawler we'll notice a few things:
- At the time of writing, ~13,800 results are found, which matches well with the other <a href="/tag/discovery-methods.html">#discovery-methods</a> used in this blog series.
- The crawl takes a while to complete - since we are visiting far more pages than other discovery methods require, crawling is notably slower. </p>
<p>Finally, we can see that most of this scraper can easily be reused for other websites - all we need to do is change the url rules! That's the big selling point of crawlers: they're less domain-specific than individual web scrapers.</p>
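<p>For illustration, here's a minimal sketch of that rule-swap idea (the class names and the bookstore patterns are hypothetical): the generic crawl logic is shared while only the url patterns differ per website.</p>

```python
import re

class CrawlRules:
    """Rules matching the H&M crawl from this article."""
    save_urls = re.compile(r"/productpage\.")  # urls to save as results
    follow_urls = re.compile(r"\.html")        # urls to keep crawling

class BookstoreRules(CrawlRules):
    """Adapting to another website = overriding the two patterns."""
    save_urls = re.compile(r"/product/")
    follow_urls = re.compile(r"/(category|product)/")

def classify(rules, url: str) -> str:
    """Decide what the crawl loop should do with a discovered url."""
    if rules.save_urls.search(url):
        return "save"
    if rules.follow_urls.search(url):
        return "follow"
    return "skip"
```

<p>For example, <code>classify(BookstoreRules, "https://shop.example/product/123")</code> saves the url while the same url under <code>CrawlRules</code> is skipped - the crawl loop itself never changes.</p>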
<h2 id="summary-and-further-reading_1">Summary and Further Reading</h2>
<p>To summarize, web crawling is a great discovery technique that lends itself easily to generic/broad scraper development, because the same scrape loop can be applied to many targets with just some rule adjustments. However, it's less efficient - slower and riskier when it comes to blocks - than other discovery techniques like <a href="/web-scraping-discover-search.html">Search Bar</a> or <a href="/web-scraping-discover-sitemaps.html">Sitemaps</a>. </p>
<hr/>
<p>For more web-scraping discovery techniques, see <a href="/tag/discovery-methods.html">#discovery-methods</a>, and see <a href="/tag/discovery.html">#discovery</a> for more discovery-related subjects. </p>
<p>If you have any questions, come join us on <a href="https://matrix.to/#/%23web-scraping:matrix.org">#web-scraping on matrix</a>, check out <a href="https://stackoverflow.com/questions/tagged/web-scraping">#web-scraping on stackoverflow</a> or leave a comment below! </p>
<p>As always, you can hire me for web-scraping consultation over at the <a href="/pages/hire.html">hire</a> page - happy scraping! </p>
<p class="info">The code used in this article can be found on <a href="https://github.com/Granitosaurus/scrapecrow/tree/main/examples">github</a>.<br/></p>Web Scraping Target Discovery: Indexes2021-09-28T00:00:00+02:002021-09-28T00:00:00+02:00Bernardas Ališauskastag:scrapecrow.com,2021-09-28:/web-scraping-discovery-indexes.html<p>The most common web scraping target discovery technique: recursive crawling. How does it work? What are the pros and cons and the most optimal execution patterns?</p><p><a href="/images/banner-books.jpg"><img class="fullc" loading="lazy" src="/images/banner-books.jpg" title=""/></a><figcaption></figcaption></p>
<p>Most web scrapers are made up of two core parts: finding products on the website and actually scraping them. The former is often referred to as the "target discovery" step. For example, to scrape product data from an e-commerce website, we would first need to find the urls of each individual product - only then can we scrape their data.</p>
<p>Discovering targets to scrape in web scraping is often a challenging and important task. This series of blog posts tagged with <a href="/tag/discovery-methods.html">#discovery-methods</a> (also see <a href="/web-scraping-discovery.html">main article</a>) covers common target discovery approaches.</p>
<p>Using various public indexers is an often-viable target discovery strategy. It is mostly used as a last resort or as a supplementary technique for discovery-difficult targets. </p>
<p>Public indexers that crawl the web with more complex scraping rules might pick up hard-to-find targets, and we can take advantage of that in our own web scrapers. In other words, as the spirit of web scraping goes: it's smart to take advantage of existing work! <br/>
In this article we'll take a look at a few common public indexers and how we can use them to discover targets. </p>
<p>First, let's overview the common pros and cons of this discovery strategy:</p>
<p>Pros:</p>
<ul>
<li><strong>Easy</strong>: once understood, taking advantage of public indexers is surprisingly easy.</li>
<li><strong>Efficient</strong>: public indexes function similarly to in-website search bars, or often come as easy-to-parse data dumps - meaning we don't even need a connection to the target website to discover targets.</li>
</ul>
<p>Cons:</p>
<ul>
<li><strong>Insufficient coverage and stale data</strong>: because these indexes are gathered on someone else's schedule and by someone else's rules, coverage and link quality can be poor. For this reason it is best to combine index-based discovery with other discovery techniques.</li>
</ul>
<h2 id="using-search-engines">Using Search Engines</h2>
<p>The most common and richest public indexes are search engines like Google, Bing or DuckDuckGo - anything that lets humans search the web can be a useful tool for web scraping robots as well. </p>
<p>To see how we could use a search engine for web scraping discovery, let's take <a href="https://crunchbase.com">https://crunchbase.com</a> as an example. Let's presume that we want to scrape their company overview data<br/>
(e.g. <a href="https://www.crunchbase.com/organization/linkedin">https://www.crunchbase.com/organization/linkedin</a>). </p>
<p>In this example we'll use <strong>bing.com</strong> to query for crunchbase.com urls. Bing is a great tool for web scrapers as it's easy to scrape (unlike Google, which employs various anti-scraping strategies) and has relatively good result quality and coverage. </p>
<p>If we take a look at an average Crunchbase company page like <a href="https://www.crunchbase.com/organization/linkedin">https://www.crunchbase.com/organization/linkedin</a>, we can determine that all company pages follow a url pattern that looks something like <code>.../organization/&lt;name&gt;</code> </p>
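<p>In scraper code this pattern could be expressed as a simple regex - a sketch, since the exact character set allowed in company slugs is an assumption:</p>

```python
import re

# match crunchbase company pages like .../organization/linkedin
# (slug charset [\w-] is an assumption)
company_page = re.compile(r"crunchbase\.com/organization/[\w-]+/?$")

is_company = bool(company_page.search("https://www.crunchbase.com/organization/linkedin"))
```
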
<p>Knowing this we can write domain specific queries in bing.com search box to find other company pages:</p>
<p><a href="/images/bing.com-crunchbase-search.png"><img class="bigc" loading="lazy" src="/images/bing.com-crunchbase-search.png" title=""/></a><figcaption></figcaption></p>
<p>Here we used the query <code>/organization/ site:crunchbase.com</code>, and bing.com is giving us over a million results - pretty close to the count Crunchbase claims on their website! </p>
<p class="info">For advanced search keywords/options see <a href="https://help.bing.microsoft.com/#apex/bing/en-US/10002/-1">bing's advanced search options</a> and <a href="https://help.bing.microsoft.com/#apex/bing/en-US/10001/-1">bing's advanced search keywords</a></p>
<p>Other search engines like Google, DuckDuckGo etc. also support similar search syntax. The search can be refined even further with more advanced rules to find specific scraping targets.</p>
<p>All that being said, querying search engines is not without its faults.<br/>
They are often built for humans rather than robots and have limited pagination (i.e. a query will only return 10 pages of results even though it says millions are found), which requires splitting a single query into many smaller ones, e.g. searching every letter of the alphabet or particular names. Despite all this, this discovery approach is surprisingly easy and can often work beautifully for small web scrapers!</p>
<h2 id="using-public-index-dumps">Using Public Index Dumps</h2>
<p>There are several public web indexes, but probably the biggest and most well-known one is <a href="https://commoncrawl.org/">https://commoncrawl.org/</a>, which crawls the web and provides data dumps publicly for free. </p>
<blockquote>
<p>Common Crawl is a nonprofit 501(c)(3) organization that crawls the web and freely provides its archives and datasets to the public. Common Crawl's web archive consists of petabytes of data collected since 2011. It completes crawls generally every month.</p>
</blockquote>
<p>Unfortunately, as this is an open and free project, the crawled HTML documents are somewhat stale - though as web scraper engineers, we can instead use Common Crawl as an index feed for our own web scrapers.</p>
<p>You can access Common Crawl's web index here: <a href="http://urlsearch.commoncrawl.org/">http://urlsearch.commoncrawl.org/</a>. The data is grouped by month, and each month's dataset can be queried on the online playground:</p>
<p><a href="/images/commoncrawl-crunchbase-search.png"><img class="bigc" loading="lazy" src="/images/commoncrawl-crunchbase-search.png" title=""/></a><figcaption></figcaption></p>
<p>Crawl coverage by Common Crawl highly depends on the popularity of the source: some smaller websites are harder to find, while bigger targets can have good data coverage.<br/>
Despite the coverage issues, the Common Crawl url dataset is a useful tool to have in the web-scraping tool belt.</p>
<p class="info">If you find Commoncrawl useful it's a non-profit organization <a href="https://commoncrawl.org/donate/">accepting public donations</a>!</p>
<h2 id="using-internet-archive">Using Internet Archive</h2>
<p>Another public indexer is the archive.org project, which aims to archive various internet pages for posterity. </p>
<blockquote>
<p>Internet Archive is a non-profit library of millions of free books, movies, software, music, websites, and more.</p>
</blockquote>
<p>We can use archive.org's website archive as our discovery engine. If we go to <a href="https://archive.org">https://archive.org</a> and type in our source:</p>
<p><a href="/images/archive.org-crunchbase-search.png"><img class="bigc" loading="lazy" src="/images/archive.org-crunchbase-search.png" title=""/></a><figcaption></figcaption></p>
<p>We can see that archive.org has captured a lot of urls! Let's take a quick look at how we could scrape it as a discovery source. If we open up the web inspector and inspect the requests made when we click search, we'll find a backend API url that looks something like this (<a href="https://web.archive.org/web/timemap/?url=crunchbase.com/organization/%2F&matchType=prefix&collapse=urlkey&output=json&fl=original%2Cendtimestamp%2Cgroupcount%2Cuniqcount&filter=!statuscode%3A[45]..&limit=10000&_=1632308317409">clickable</a>):</p>
<div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal">1</span>
<span class="normal">2</span>
<span class="normal">3</span>
<span class="normal">4</span>
<span class="normal">5</span>
<span class="normal">6</span>
<span class="normal">7</span>
<span class="normal">8</span></pre></div></td><td class="code"><div><pre><span></span><code><span class="nv">https</span>:<span class="o">//</span><span class="nv">web</span>.<span class="nv">archive</span>.<span class="nv">org</span><span class="o">/</span><span class="nv">web</span><span class="o">/</span><span class="nv">timemap</span><span class="o">/</span>?
<span class="nv">url</span><span class="o">=</span><span class="nv">crunchbase</span>.<span class="nv">com</span><span class="o">/</span><span class="nv">organization</span><span class="o">/%</span><span class="mi">2</span><span class="nv">F</span><span class="o">&</span><span class="nv">matchType</span><span class="o">=</span><span class="nv">prefix</span><span class="o">&</span>
<span class="nv">collapse</span><span class="o">=</span><span class="nv">urlkey</span><span class="o">&</span>
<span class="nv">output</span><span class="o">=</span><span class="nv">json</span><span class="o">&</span>
<span class="nv">fl</span><span class="o">=</span><span class="nv">original</span><span class="o">%</span><span class="mi">2</span><span class="nv">Cendtimestamp</span><span class="o">%</span><span class="mi">2</span><span class="nv">Cgroupcount</span><span class="o">%</span><span class="mi">2</span><span class="nv">Cuniqcount</span><span class="o">&</span><span class="w"> </span>#<span class="w"> </span><span class="nv">rows</span><span class="w"> </span><span class="nv">to</span><span class="w"> </span><span class="k">return</span>,<span class="w"> </span><span class="nv">we</span><span class="err">'re mostly interested in "original"</span>
<span class="err">filter=!statuscode%3A[45]..& </span>
<span class="err">limit=10000& # result limit</span>
<span class="err">_=1632308317409 # current timestamp</span>
</code></pre></div></td></tr></table></div>
<p>If we submit a GET request to this url, we'll get the whole dataset of matching results:</p>
<div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal"> 1</span>
<span class="normal"> 2</span>
<span class="normal"> 3</span>
<span class="normal"> 4</span>
<span class="normal"> 5</span>
<span class="normal"> 6</span>
<span class="normal"> 7</span>
<span class="normal"> 8</span>
<span class="normal"> 9</span>
<span class="normal">10</span>
<span class="normal">11</span></pre></div></td><td class="code"><div><pre><span></span><code>[["original","endtimestamp","groupcount","uniqcount"],
["http://crunchbase.com:80/organization/","20200218003314","379","100"],
["https://www.crunchbase.com/organization.investors","20171204192453","2","2"],
["https://www.crunchbase.com/organization.similarwebs","20171204192356","1","1"],
["https://crunchbase.com/organization/%22/financial-organization/alsop-louie-partners%22","20190518071052","1","1"],
["https://crunchbase.com/organization/%22/financial-organization/draper-associates-2%22","20190518091831","1","1"],
["https://crunchbase.com/organization/%22/maps/city/San%2520Francisco%22","20190518044213","1","1"],
["https://crunchbase.com/organization/%22/organization/andreessen-horowitz%22","20190517093614","1","1"],
["https://crunchbase.com/organization/%22/organization/canaan-partners%22","20190517071111","1","1"],
["https://crunchbase.com/organization/%22/organization/fog-creek-software%22","20190518102954","1","1"],
...
</code></pre></div></td></tr></table></div>
<p>This query returns thousands of unique urls along with timestamps of the last time each was scraped by the Internet Archive - which can easily be adapted as a target discovery source!</p>
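<p>Since the response is a plain JSON array of rows where the first row is the header, extracting the urls is trivial. A small sketch using the sample rows from above:</p>

```python
def parse_timemap(rows: list) -> list:
    """Extract the "original" url column from a timemap result table."""
    header, *records = rows
    url_index = header.index("original")
    return [record[url_index] for record in records]

# sample rows taken from the response above
sample = [
    ["original", "endtimestamp", "groupcount", "uniqcount"],
    ["http://crunchbase.com:80/organization/", "20200218003314", "379", "100"],
    ["https://www.crunchbase.com/organization.investors", "20171204192453", "2", "2"],
]
urls = parse_timemap(sample)
```
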
<p class="info">If you find Internet Archive useful it's a non-profit organization <a href="https://archive.org/donate/">accepting public donations</a>!</p>
<h2 id="summary-and-further-reading">Summary and Further Reading</h2>
<p>To summarize, using public indexes can be a valid scrape-target discovery technique; however, it comes with a certain level of target quality and coverage uncertainty. It's best used to supplement other discovery techniques or for initial prototyping.</p>
<p>Here's a list of several search engines that can be used by web-scrapers for target discovery:</p>
<ul>
<li><a href="https://bing.com">https://bing.com</a> - great western web coverage, weak anti web-scraping measures.</li>
<li><a href="https://yahoo.com">https://yahoo.com</a> - uses bing.com database but different algorithms. </li>
<li><a href="https://www.onesearch.com/">https://www.onesearch.com/</a> - yahoo's new privacy search engine that is also using bing.com databases but even newer algorithms. Only accessible by US ips.</li>
<li><a href="https://duckduckgo.com">https://duckduckgo.com</a> - similar to bing.com but their own dataset. </li>
<li><a href="https://startpage.com">https://startpage.com</a> - uses google dataset but easier to access by web-scrapers; see <a href="https://github.com/Garee/sp">sp</a> project.</li>
<li><a href="https://yandex.com">https://yandex.com</a> - great russian web coverage and decent western web coverage.</li>
<li><a href="https://boardreader.com/">https://boardreader.com/</a> - brilliant forum/discussion board coverage.</li>
</ul>
<hr/>
<p>For more web-scraping discovery techniques, see <a href="/tag/discovery-methods.html">#discovery-methods</a>, and see <a href="/tag/discovery.html">#discovery</a> for more discovery-related subjects. </p>
<p>If you have any questions, come join us on <a href="https://matrix.to/#/%23web-scraping:matrix.org">#web-scraping on matrix</a>, check out <a href="https://stackoverflow.com/questions/tagged/web-scraping">#web-scraping on stackoverflow</a> or leave a comment below! </p>
<p>As always, you can hire me for web-scraping consultation over at the <a href="/pages/hire.html">hire</a> page - and happy scraping! </p>Web Scraping Target Discovery: Search API2021-09-28T00:00:00+02:002021-09-28T00:00:00+02:00Bernardas Ališauskastag:scrapecrow.com,2021-09-28:/web-scraping-discovery-search.html<p>A fundamental web-scraping reverse-engineering technique is figuring out how a website's search works. Replicating web search is a great target discovery technique. Why, when and how should it be used effectively?</p><p><a href="/images/banner-gears.jpg"><img class="fullc" loading="lazy" src="/images/banner-gears.jpg" title=""/></a><figcaption></figcaption></p>
<p>Most web scrapers are made up of two core parts: finding products on the website and actually scraping them. The former is often referred to as the "target discovery" step. For example, to scrape product data from an e-commerce website, we first need to find the URLs of each individual product - only then can we scrape their data.</p>
<p>Discovering targets to scrape is often a challenging and important task. This series of blog posts tagged with <a href="/tag/discovery-methods.html">#discovery-methods</a> (also see the <a href="/web-scraping-discovery.html">main article</a>) covers common target discovery approaches.</p>
<p>Reverse engineering a website's backend API is a common web-scraping technique - why scrape HTML when the backend data can be fetched directly? In this article, we'll briefly cover the most common reverse-engineering subject: the search API.</p>
<p class="related">For reverse-engineering introduction, see related article <a href="/reverse-engineering-intro.html">Introduction To Reverse Engineering The Web</a></p>
<h2 id="using-search-api-for-web-scraping">Using search API for web-scraping</h2>
<p>One way to discover targets in web scraping is to reverse-engineer the search bar to find the website's search API. It's often one of the best ways to discover targets - let's review the common pros and cons of this approach:</p>
<p>Pros:</p>
<ul>
<li><strong>Fresh Targets</strong>: the search API rarely yields links to outdated targets, as it serves exactly what website users see. </li>
<li><strong>Good Coverage</strong>: search API can lead to all the results a website has to offer - if it's not searchable, it's probably not there! </li>
<li><strong>Efficient</strong>: search API result pagination can yield 10-50 results per page and often can be scraped asynchronously. </li>
</ul>
<p>Cons:</p>
<ul>
<li><strong>Domain bound</strong>: since every website has its own search structure, the code can rarely be reused across targets.</li>
<li><strong>Limited Coverage</strong>: some search APIs return a limited number of pages (e.g. there are 900 results, but after 10 pages the API returns nothing), meaning the scraper has to work around this limit, which can be difficult to implement.</li>
<li><strong>Slow</strong>: rarely, search result pagination cannot be iterated asynchronously - pages must be requested one after another, which slows down the scraping process.</li>
</ul>
<p>As you can see, the pros and cons are very mixed and even contradictory - it all depends on the website's search implementation. Let's cover a few examples and see what search API discovery is all about.</p>
<h2 id="example-hmcom">Example: hm.com</h2>
<p>To understand basic search bar reverse-engineering, let's see how the popular clothing website <a href="https://hm.com">https://hm.com</a> handles its search. </p>
<h3 id="reversing-search-bar">Reversing Search Bar</h3>
<p>If we go to the website, open up our web inspector tools and search something, we can see the search requests being made by the browser:</p>
<p><a href="/images/hm.com-initial-req.png"><img class="bigc" loading="lazy" src="/images/hm.com-initial-req.png" title=""/></a><figcaption></figcaption></p>
<p>However, this returns filtered results, while we want to discover <em>all</em> products on the website. <br/>
For this, we can trick the search API into searching an empty query by using either an empty string or a space character. In this case no results are returned for an empty string <code>""</code>, but we can force the search by using the URL-encoded (also called "percent-encoded") character for a space in the url bar: <code>%20</code></p>
<p class="info">For more on percent encoding see <a href="https://developer.mozilla.org/en-US/docs/Glossary/percent-encoding">MDN's documentation</a>.</p>
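<p>As a quick sanity check (an illustrative snippet, not from the site's API), Python's standard library produces and reverses exactly this encoding:</p>

```python
from urllib.parse import quote, unquote

# a space percent-encodes to "%20"
encoded = quote(" ")
print(encoded)  # → %20

# and "%20" decodes back to a plain space
assert unquote("%20") == " "
```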
<p><a href="/images/hm.com-space-search.png"><img class="bigc" loading="lazy" src="/images/hm.com-space-search.png" title=""/></a><figcaption></figcaption></p>
<p>Success! We got 13780 product results! <br/>
Now, let's figure out how the search works. If you look at the inspector, no data requests are made because the first page's data is embedded in the HTML as a JavaScript variable - a common website optimization that we can ignore.</p>
<p>We could scrape the HTML, but often we don't have to. Modern websites tend to communicate with the backend API in JSON, so let's try to find that. <br/>
If we scroll to the bottom of the page and click next page, we can see the actual JSON data request being made for the second page:</p>
<p><a href="/images/hm.com-xhr-req.png"><img class="bigc" loading="lazy" src="/images/hm.com-xhr-req.png" title=""/></a><figcaption></figcaption></p>
<p>We see a request being made to the backend's search API, and it returns a JSON dataset with product metadata and locations. Let's take a look at the request url so we can replicate it in our scraper:</p>
<div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal">1</span>
<span class="normal">2</span>
<span class="normal">3</span>
<span class="normal">4</span>
<span class="normal">5</span></pre></div></td><td class="code"><div><pre><span></span><code>https://www2.hm.com/en_us/search-results/_jcr_content/search.display.json?
q=%20& # search term
offset=0& # pagination offset
page-size=40& # pagination limit
sort=ascPrice # sort type
</code></pre></div></td></tr></table></div>
<p class="info">Many modern web APIs are very flexible with parameters - we don't have to use all the junk we see in the web inspector. We can always experiment to see which parameters are necessary and how the content changes when they do.<br/>
In this example we stripped off a lot of uninteresting parameters and kept just the query, offset/limit and sort.</p>
<p>This looks like a common offset/limit pagination technique, which is great for web scrapers, as we can fetch multiple pages asynchronously - in other words, we can request slices 0:100, 100:200, ... concurrently.</p>
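<p>To illustrate (an illustrative helper; the parameter names mirror the ones seen in the inspector), offset/limit pagination lets us precompute every page request up front instead of walking pages one by one:</p>

```python
def pagination_params(total: int, page_size: int = 100) -> list:
    """Generate offset/limit parameter pairs covering `total` results."""
    return [
        {"offset": offset, "page-size": page_size}
        for offset in range(0, total, page_size)
    ]

# e.g. 250 results in pages of 100 -> offsets 0, 100, 200
print(pagination_params(250))
```

<p>Each of these parameter dicts can then be passed to an independent, concurrently scheduled request.</p>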
<h3 id="confirming-pagination">Confirming Pagination</h3>
<p>Before we commit to using this API endpoint, we should test it for common coverage pitfalls such as page limits. Often, search APIs limit the number of rows/pages a query can request. If we just click the link in the browser:</p>
<p><a href="https://www2.hm.com/en_us/search-results/_jcr_content/search.display.json?q=%20&offset=0&page-size=40&sort=ascPrice">https://www2.hm.com/en_us/search-results/_jcr_content/search.display.json?q=%20&offset=0&page-size=40&sort=ascPrice</a></p>
<p>We can see a JSON response with a total result count of <code>13_730</code>.<br/>
Let's see if we can get the last page, which at the time of this article would be: <code>offset=13690&page-size=40</code>:</p>
<p><a href="https://www2.hm.com/en_us/search-results/_jcr_content/search.display.json?q=%20&offset=13690&page-size=40&sort=ascPrice">https://www2.hm.com/en_us/search-results/_jcr_content/search.display.json?q=%20&offset=13690&page-size=40&sort=ascPrice</a></p>
<p>Unfortunately, while the request is successful, it contains no product data, indicated by an empty array: <code>"products": []</code> </p>
<p>It's what we feared: this pagination has a page limit. By experimenting with the offset parameter we can find exactly where the pagination ends - at <code>10_000</code> results, a common round number. </p>
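<p>Rather than probing offsets by hand, a binary search over the offset locates the cutoff in a handful of requests. A sketch (hypothetical helper; in practice <code>has_results</code> would wrap a real API request and check whether the <code>products</code> array is non-empty):</p>

```python
def find_pagination_limit(has_results, upper_bound: int = 100_000) -> int:
    """Binary search for the last offset that still returns results."""
    lo, hi = 0, upper_bound
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if has_results(mid):
            lo = mid  # results exist at mid; the limit is at or above it
        else:
            hi = mid - 1  # no results at mid; the limit is below it
    return lo

# simulate an API that stops returning results at offset 10_000
print(find_pagination_limit(lambda offset: offset < 10_000))  # → 9999
```

<p>With a 100_000 upper bound this needs only about 17 probes, which is far cheaper than scanning offsets linearly.</p>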
<p>Let's see a few common ways we could get around this pagination limit:</p>
<ol>
<li>Use multiple search queries - a common brute-force technique is to search many different queries like <code>a</code>, <code>b</code>, <code>c</code>... and hope all the products are found.</li>
<li>Apply more filters - this query allows optional filters such as categories. We can collect all categories, e.g. <code>shoes</code>, <code>dresses</code> etc., and run a query for each of them. </li>
<li>Reverse the sorting - if one query can give us 10_000 results, by reversing the sort order we get a second query with another 10_000 results. That's an easy way to double our reach. </li>
</ol>
<p>For this specific case, approach #3, reversing the sorting, is the best choice: the website has only a bit over 13_000 results and our reach would be 20_000 - a perfect fit. <br/>
We can sort our query by price and reach the results from both ends of the query:</p>
<p><a href="/images/hm.com-double-end-query.png"><img class="bigc" loading="lazy" src="/images/hm.com-double-end-query.png" title=""/></a><figcaption></figcaption></p>
<p>So our first query would get us the 10_000 cheapest items and the second query would pick up the 3_700 most expensive items. With these two queries, we can fully discover all available products.</p>
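<p>The arithmetic of this double-ended plan can be sketched as follows (illustrative helper; the names are mine, not part of the site's API, and the total of 13_730 comes from the count seen earlier):</p>

```python
def split_double_ended(total: int, limit: int = 10_000) -> dict:
    """Split `total` results into two sort-order ranges within a per-query depth limit."""
    ascending = min(total, limit)       # cheapest items, via ascPrice
    descending = max(0, total - limit)  # remaining items, via descPrice
    return {"ascPrice": ascending, "descPrice": descending}

# 13_730 total results against a 10_000 page-depth limit
print(split_double_ended(13_730))  # → {'ascPrice': 10000, 'descPrice': 3730}
```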
<h3 id="implementation">Implementation</h3>
<p>Having reverse engineered how hm.com's search API works, we can develop our scraping algorithm:</p>
<ol>
<li>Get the first page to find the total result count.</li>
<li>Schedule request for first <code>10_000</code> results sorted by <code>ascPrice</code>.</li>
<li>Schedule remaining <code>total - 10_000</code> requests sorted by <code>descPrice</code>.</li>
<li>Collect responses and parse product data.</li>
</ol>
<p>Here's a quick implementation in Python using the asynchronous HTTP client package <code>httpx</code>:</p>
<div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal"> 1</span>
<span class="normal"> 2</span>
<span class="normal"> 3</span>
<span class="normal"> 4</span>
<span class="normal"> 5</span>
<span class="normal"> 6</span>
<span class="normal"> 7</span>
<span class="normal"> 8</span>
<span class="normal"> 9</span>
<span class="normal">10</span>
<span class="normal">11</span>
<span class="normal">12</span>
<span class="normal">13</span>
<span class="normal">14</span>
<span class="normal">15</span>
<span class="normal">16</span>
<span class="normal">17</span>
<span class="normal">18</span>
<span class="normal">19</span>
<span class="normal">20</span>
<span class="normal">21</span>
<span class="normal">22</span>
<span class="normal">23</span>
<span class="normal">24</span>
<span class="normal">25</span>
<span class="normal">26</span>
<span class="normal">27</span>
<span class="normal">28</span>
<span class="normal">29</span>
<span class="normal">30</span>
<span class="normal">31</span>
<span class="normal">32</span>
<span class="normal">33</span>
<span class="normal">34</span>
<span class="normal">35</span>
<span class="normal">36</span>
<span class="normal">37</span>
<span class="normal">38</span>
<span class="normal">39</span>
<span class="normal">40</span>
<span class="normal">41</span>
<span class="normal">42</span>
<span class="normal">43</span>
<span class="normal">44</span>
<span class="normal">45</span>
<span class="normal">46</span>
<span class="normal">47</span>
<span class="normal">48</span>
<span class="normal">49</span>
<span class="normal">50</span>
<span class="normal">51</span>
<span class="normal">52</span>
<span class="normal">53</span>
<span class="normal">54</span>
<span class="normal">55</span>
<span class="normal">56</span>
<span class="normal">57</span>
<span class="normal">58</span>
<span class="normal">59</span>
<span class="normal">60</span>
<span class="normal">61</span>
<span class="normal">62</span>
<span class="normal">63</span>
<span class="normal">64</span>
<span class="normal">65</span>
<span class="normal">66</span></pre></div></td><td class="code"><div><pre><span></span><code><span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Dict</span>
<span class="kn">import</span> <span class="nn">httpx</span>
<span class="kn">import</span> <span class="nn">asyncio</span>
<span class="k">async</span> <span class="k">def</span> <span class="nf">scrape_page</span><span class="p">(</span>
<span class="n">session</span><span class="p">:</span> <span class="n">httpx</span><span class="o">.</span><span class="n">AsyncClient</span><span class="p">,</span>
<span class="n">offset</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span>
<span class="n">page_size</span><span class="o">=</span><span class="mi">500</span><span class="p">,</span> <span class="c1"># note: we can increase this from default 40 to something higher!</span>
<span class="n">sort</span><span class="o">=</span><span class="s2">"ascPrice"</span><span class="p">,</span>
<span class="p">)</span> <span class="o">-></span> <span class="n">Dict</span><span class="p">:</span>
<span class="w"> </span><span class="sd">"""Scrape a single hm.com product query page"""</span>
<span class="n">url</span> <span class="o">=</span> <span class="s2">"https://www2.hm.com/en_us/search-results/_jcr_content/search.display.json"</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"scraping range </span><span class="si">{</span><span class="n">offset</span><span class="si">}</span><span class="s2">:</span><span class="si">{</span><span class="n">offset</span><span class="o">+</span><span class="n">page_size</span><span class="si">}</span><span class="s2"> sorted by: </span><span class="si">{</span><span class="n">sort</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
<span class="n">response</span> <span class="o">=</span> <span class="k">await</span> <span class="n">session</span><span class="o">.</span><span class="n">get</span><span class="p">(</span>
<span class="n">url</span><span class="o">=</span><span class="n">url</span><span class="p">,</span>
<span class="n">timeout</span><span class="o">=</span><span class="n">httpx</span><span class="o">.</span><span class="n">Timeout</span><span class="p">(</span><span class="mi">120</span><span class="p">),</span>
<span class="n">params</span><span class="o">=</span><span class="p">{</span>
<span class="s2">"q"</span><span class="p">:</span> <span class="s2">" "</span><span class="p">,</span> <span class="c1"># note: http client will automatically turn this to "%20"</span>
<span class="s2">"offset"</span><span class="p">:</span> <span class="n">offset</span><span class="p">,</span>
<span class="s2">"page-size"</span><span class="p">:</span> <span class="n">page_size</span><span class="p">,</span>
<span class="s2">"sort"</span><span class="p">:</span> <span class="n">sort</span><span class="p">,</span>
<span class="p">},</span>
<span class="p">)</span>
<span class="k">return</span> <span class="n">response</span><span class="o">.</span><span class="n">json</span><span class="p">()</span>
<span class="k">async</span> <span class="k">def</span> <span class="nf">scrape_hmcom</span><span class="p">():</span>
<span class="c1"># we need to fake any browser User-Agent to get around primitive bot detection</span>
<span class="n">headers</span> <span class="o">=</span> <span class="p">{</span>
<span class="s2">"User-Agent"</span><span class="p">:</span> <span class="s2">"Mozilla/5.0 (X11; Linux x86_64; rv:92.0) Gecko/20100101 Firefox/92.0"</span>
<span class="p">}</span>
<span class="n">products</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">async</span> <span class="k">with</span> <span class="n">httpx</span><span class="o">.</span><span class="n">AsyncClient</span><span class="p">(</span><span class="n">headers</span><span class="o">=</span><span class="n">headers</span><span class="p">)</span> <span class="k">as</span> <span class="n">session</span><span class="p">:</span>
<span class="c1"># lets start by scraping first page</span>
<span class="n">first_page</span> <span class="o">=</span> <span class="k">await</span> <span class="n">scrape_page</span><span class="p">(</span><span class="n">session</span><span class="p">)</span>
<span class="n">products</span><span class="o">.</span><span class="n">extend</span><span class="p">(</span><span class="n">first_page</span><span class="p">[</span><span class="s2">"products"</span><span class="p">])</span>
<span class="c1"># First page contains total amount of results this query contains</span>
<span class="c1"># using this we can create task for each bach of query and</span>
<span class="c1"># execute it concurrently</span>
<span class="n">tasks</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">offset</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span>
<span class="n">first_page</span><span class="p">[</span><span class="s2">"itemsShown"</span><span class="p">],</span> <span class="n">first_page</span><span class="p">[</span><span class="s2">"total"</span><span class="p">],</span> <span class="n">first_page</span><span class="p">[</span><span class="s2">"itemsShown"</span><span class="p">]</span>
<span class="p">):</span>
<span class="c1"># for first 10_000 scrape as usual</span>
<span class="k">if</span> <span class="n">offset</span> <span class="o"><</span> <span class="mi">10_000</span><span class="p">:</span>
<span class="n">sorting</span> <span class="o">=</span> <span class="s2">"ascPrice"</span>
<span class="c1"># for query > 10_000 start over with reversed ordering</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">sorting</span> <span class="o">=</span> <span class="s2">"descPrice"</span>
<span class="n">offset</span> <span class="o">-=</span> <span class="mi">10_000</span>
<span class="n">tasks</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">scrape_page</span><span class="p">(</span><span class="n">session</span><span class="p">,</span> <span class="n">offset</span><span class="o">=</span><span class="n">offset</span><span class="p">,</span> <span class="n">sort</span><span class="o">=</span><span class="n">sorting</span><span class="p">))</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"Scheduling </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">tasks</span><span class="p">)</span><span class="si">}</span><span class="s2"> scrape tasks concurrently"</span><span class="p">)</span>
<span class="c1"># with our scraping tasks in order it's time</span>
<span class="c1"># to execute them concurrently using asyncio.as_completed wrapper</span>
<span class="k">for</span> <span class="n">scrape_task</span> <span class="ow">in</span> <span class="n">asyncio</span><span class="o">.</span><span class="n">as_completed</span><span class="p">(</span><span class="n">tasks</span><span class="p">):</span>
<span class="n">result</span> <span class="o">=</span> <span class="k">await</span> <span class="n">scrape_task</span>
<span class="n">products</span><span class="o">.</span><span class="n">extend</span><span class="p">(</span><span class="n">result</span><span class="p">[</span><span class="s2">"products"</span><span class="p">])</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"found </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">products</span><span class="p">)</span><span class="si">}</span><span class="s2"> products in </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">tasks</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mi">1</span><span class="si">}</span><span class="s2"> page requests"</span><span class="p">)</span>
<span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s2">"__main__"</span><span class="p">:</span>
<span class="n">asyncio</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="n">scrape_hmcom</span><span class="p">())</span>
</code></pre></div></td></tr></table></div>
<p>Here we used asynchronous Python with <code>httpx</code> as our HTTP client library to scrape all 13790 products in very few requests and just a few minutes!</p>
<h2 id="summary-and-further-reading_1">Summary and Further Reading</h2>
<p>To summarize, reverse engineering a website's search API is a brilliant scrape target discovery technique; however, it's more difficult to develop, as it requires reverse-engineering effort and the resulting code is very domain-specific. </p>
<hr/>
<p>For more target discovery techniques, see the <a href="/tag/discovery-methods.html">#discovery-methods</a> and <a href="/tag/discovery.html">#discovery</a> tags. </p>
<p>If you have any questions, come join us on <a href="https://matrix.to/#/%23web-scraping:matrix.org">#web-scraping on matrix</a>, check out <a href="https://stackoverflow.com/questions/tagged/web-scraping">#web-scraping on stackoverflow</a> or leave a comment below! </p>
<p>As always, you can hire me for web-scraping consultation over at the <a href="/pages/hire.html">hire</a> page - and happy scraping! </p>
<hr/>
<figcaption>image credits: "Gears" by Kamal H. is licensed under CC BY-ND 2.0</figcaption>Web Scraping Target Discovery: Sitemaps2021-09-28T00:00:00+02:002021-09-28T00:00:00+02:00Bernardas Ališauskastag:scrapecrow.com,2021-09-28:/web-scraping-discovery-sitemaps.html<p>There are many techniques when it comes to discovering web-scraping targets. One of the most common is to use website sitemap indexes. What are they and how can we take advantage of them in web scraping?</p><p><a href="/images/banner-map.jpg"><img class="fullc" loading="lazy" src="/images/banner-map.jpg" title=""/></a><figcaption></figcaption></p>
<p>Most web scrapers are made up of two core parts: finding products on the website and actually scraping them. The former is often referred to as the "target discovery" step. For example, to scrape product data from an e-commerce website, we first need to find the URLs of each individual product - only then can we scrape their data.</p>
<p>Discovering targets to scrape is often a challenging and important task. This series of blog posts tagged with <a href="/tag/discovery-methods.html">#discovery-methods</a> (also see the <a href="/web-scraping-discovery.html">main article</a>) covers common target discovery approaches.</p>
<p>In this article we'll cover one particular discovery method: using website sitemaps to find our scrape targets.</p>
<h2 id="what-are-sitemaps-and-how-are-they-used-in-web-scraping">What are sitemaps and how are they used in web-scraping?</h2>
<p>A sitemap is an index document generated by websites for web crawlers and indexers. For example, websites that want to be crawled by Google provide an index of their pages so Google's crawlers can index them quicker. </p>
<p>To put it shortly, sitemap files are XML documents (often gzip-compressed) that contain URL locations and some metadata about them:</p>
<div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal">1</span>
<span class="normal">2</span>
<span class="normal">3</span>
<span class="normal">4</span>
<span class="normal">5</span>
<span class="normal">6</span>
<span class="normal">7</span>
<span class="normal">8</span>
<span class="normal">9</span></pre></div></td><td class="code"><div><pre><span></span><code><span class="cp"><?xml version="1.0" encoding="UTF-8"?></span>
<span class="nt"><urlset</span><span class="w"> </span><span class="na">xmlns=</span><span class="s">"http://www.sitemaps.org/schemas/sitemap/0.9"</span><span class="nt">></span>
<span class="w"> </span><span class="nt"><url></span>
<span class="w"> </span><span class="nt"><loc></span>http://www.example.com/<span class="nt"></loc></span>
<span class="w"> </span><span class="nt"><lastmod></span>2005-01-01<span class="nt"></lastmod></span>
<span class="w"> </span><span class="nt"><changefreq></span>monthly<span class="nt"></changefreq></span>
<span class="w"> </span><span class="nt"><priority></span>0.8<span class="nt"></priority></span>
<span class="w"> </span><span class="nt"></url></span>
<span class="nt"></urlset></span><span class="w"> </span>
</code></pre></div></td></tr></table></div>
<p class="info">for more on sitemap structure rules, see <a href="https://www.sitemaps.org/protocol.html">official specification page</a></p>
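<p>As a quick illustration, the example document above can be parsed with Python's standard library (a minimal sketch using <code>xml.etree</code>; real sitemap files may be gzip-compressed but carry the same namespace):</p>

```python
import xml.etree.ElementTree as ET

SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2005-01-01</lastmod>
  </url>
</urlset>"""

# sitemap elements live in the sitemaps.org namespace
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

root = ET.fromstring(SITEMAP_XML)
# every <url> entry carries a <loc> with the page location
locations = [url.find("sm:loc", NS).text for url in root.findall("sm:url", NS)]
print(locations)  # → ['http://www.example.com/']
```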
<p>The documents themselves are usually categorized by name, for example: </p>
<ul>
<li>Blog posts of the website would be contained in <code>sitemap_blogs.xml</code>. </li>
<li>Sold products might be split across multiple files: <code>sitemap_products_1.xml</code>, <code>sitemap_products_2.xml</code> etc. </li>
</ul>
<p>Before using sitemaps as a web scraping discovery strategy, it's good practice to reflect on the common pros and cons of this technique and see whether it fits your web-scraping project:</p>
<p>Pros: </p>
<ul>
<li><strong>Efficiency</strong>: A single sitemap can contain thousands of items, and often an entire catalog can be discovered in just a few requests! </li>
<li><strong>Simplicity</strong>: There's no need for advanced reverse-engineering knowledge to use sitemap-based discovery. </li>
</ul>
<p>Cons: </p>
<ul>
<li><strong>Data Staleness</strong>: Sitemap indexes need to be generated by the website explicitly, and sometimes newer products might not appear in the index for a significant amount of time. </li>
<li><strong>Data Validity</strong>: Because of sitemap staleness, some product links might be expired or invalid, causing unnecessary load on your scraper.</li>
<li><strong>Data Completeness</strong>: Since sitemaps are generated for crawlers and indexers, they might not contain all the data available on the website. For this reason it is important to confirm sitemap coverage during scraper development. </li>
<li><strong>Availability</strong>: Sitemaps are (or used to be) an important part of the web, particularly for SEO; however, they are not always present on modern websites that try to avoid web scraping, use hard-to-index structures, or are simply too big for such indexes. </li>
<li><strong>Risk</strong>: Some websites use sitemaps as honeypots for web scrapers, directing them to invalid data or using them to identify and ban scrapers.</li>
</ul>
<p>As you can see, the sitemap discovery approach is simple and efficient, though not always viable. Generally, when developing a discovery strategy, sitemaps are the first place I look for product data; I then confirm their quality by trying alternative discovery approaches and checking whether the coverage matches. </p>
<h2 id="finding-sitemaps">Finding Sitemaps</h2>
<p>To take advantage of sitemaps, we first need to figure out how to find them. A common way is to check the <code>robots.txt</code> or <code>sitemap.xml</code> file.<br/>
For example, let's take the popular clothing shop <code>hm.com</code>:</p>
<p>First we would go to the <code>/robots.txt</code> page: <a href="https://hm.com/robots.txt">https://hm.com/robots.txt</a>:</p>
<div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal">1</span>
<span class="normal">2</span>
<span class="normal">3</span>
<span class="normal">4</span>
<span class="normal">5</span>
<span class="normal">6</span>
<span class="normal">7</span></pre></div></td><td class="code"><div><pre><span></span><code><span class="n">User-Agent: *</span>
<span class="n">Request-rate: 2/1 0000-0200</span>
<span class="n">Request-rate: 1/2 0200-0900</span>
<span class="n">Disallow: /alive/user</span>
<span class="n">Disallow: /m/</span>
<span class="n">...</span>
<span class="n">Sitemap: http://www2.hm.com/sitemapindex.xml</span>
</code></pre></div></td></tr></table></div>
<p>We see some robot scraping rules and a link to the sitemap index! If we proceed to the sitemap index at <a href="http://www2.hm.com/sitemapindex.xml">http://www2.hm.com/sitemapindex.xml</a> we can see:</p>
<div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal"> 1</span>
<span class="normal"> 2</span>
<span class="normal"> 3</span>
<span class="normal"> 4</span>
<span class="normal"> 5</span>
<span class="normal"> 6</span>
<span class="normal"> 7</span>
<span class="normal"> 8</span>
<span class="normal"> 9</span>
<span class="normal">10</span>
<span class="normal">11</span>
<span class="normal">12</span>
<span class="normal">13</span>
<span class="normal">14</span></pre></div></td><td class="code"><div><pre><span></span><code>...
<span class="nt"><sitemap></span>
<span class="w"> </span><span class="nt"><loc></span>https://www2.hm.com/en_in.sitemap.xml<span class="nt"></loc></span>
<span class="w"> </span><span class="nt"><lastmod></span>2021-09-29<span class="nt"></lastmod></span>
<span class="nt"></sitemap></span>
<span class="nt"><sitemap></span>
<span class="w"> </span><span class="nt"><loc></span>https://www2.hm.com/en_us.sitemap.xml<span class="nt"></loc></span>
<span class="w"> </span><span class="nt"><lastmod></span>2021-09-29<span class="nt"></lastmod></span>
<span class="nt"></sitemap></span>
<span class="nt"><sitemap></span>
<span class="w"> </span><span class="nt"><loc></span>https://www2.hm.com/de_de.sitemap.xml<span class="nt"></loc></span>
<span class="w"> </span><span class="nt"><lastmod></span>2021-09-29<span class="nt"></lastmod></span>
<span class="nt"></sitemap></span>
...
</code></pre></div></td></tr></table></div>
<p>The index is split into localized parts; let's continue to the <code>en_us</code> index (or whichever you prefer - they should function the same): <a href="https://www2.hm.com/en_us.sitemap.xml">https://www2.hm.com/en_us.sitemap.xml</a></p>
<div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal"> 1</span>
<span class="normal"> 2</span>
<span class="normal"> 3</span>
<span class="normal"> 4</span>
<span class="normal"> 5</span>
<span class="normal"> 6</span>
<span class="normal"> 7</span>
<span class="normal"> 8</span>
<span class="normal"> 9</span>
<span class="normal">10</span>
<span class="normal">11</span>
<span class="normal">12</span>
<span class="normal">13</span>
<span class="normal">14</span>
<span class="normal">15</span>
<span class="normal">16</span>
<span class="normal">17</span></pre></div></td><td class="code"><div><pre><span></span><code><span class="nt"><sitemapindex</span><span class="w"> </span><span class="na">xmlns:xsi=</span><span class="s">"https://www.w3.org/2001/XMLSchema-instance"</span><span class="w"> </span><span class="na">xmlns=</span><span class="s">"https://www.sitemaps.org/schemas/sitemap/0.9"</span><span class="w"> </span><span class="na">xsi:schemaLocation=</span><span class="s">"https://www.sitemaps.org/schemas/sitemap/0.9 https://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd"</span><span class="nt">></span>
<span class="nt"><sitemap></span>
<span class="w"> </span><span class="nt"><loc></span>https://www2.hm.com/en_us.pages.0.xml<span class="nt"></loc></span>
<span class="w"> </span><span class="nt"><lastmod></span>2021-09-29<span class="nt"></lastmod></span>
<span class="nt"></sitemap></span>
<span class="nt"><sitemap></span>
<span class="w"> </span><span class="nt"><loc></span>https://www2.hm.com/en_us.store.0.xml<span class="nt"></loc></span>
<span class="w"> </span><span class="nt"><lastmod></span>2021-09-29<span class="nt"></lastmod></span>
<span class="nt"></sitemap></span>
<span class="nt"><sitemap></span>
<span class="w"> </span><span class="nt"><loc></span>https://www2.hm.com/en_us.product.0.xml<span class="nt"></loc></span>
<span class="w"> </span><span class="nt"><lastmod></span>2021-09-29<span class="nt"></lastmod></span>
<span class="nt"></sitemap></span>
<span class="nt"><sitemap></span>
<span class="w"> </span><span class="nt"><loc></span>https://www2.hm.com/en_us.product.1.xml<span class="nt"></loc></span>
<span class="w"> </span><span class="nt"><lastmod></span>2021-09-29<span class="nt"></lastmod></span>
<span class="nt"></sitemap></span>
</code></pre></div></td></tr></table></div>
<p>This is an index of sitemap indexes. We can see there are indexes for articles, pages, categories etc. - but most importantly, the product indexes: <code>...product.N.xml</code>. <br/>
There's some important metadata here as well: the <code><lastmod></code> field shows when each index was last updated. In this case the index is 1 day old, so this discovery approach will not pick up products added in the last few hours.</p>
<p class="info">Every website engine generates sitemaps at different times: some regenerate them once a day or week, often indicated by the <code><changefreq>always|hourly|daily|...</changefreq></code> attribute. Modern, smaller websites often generate them on demand whenever the product index is updated, which is great for web scrapers!</p>
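<p>To act on <code>lastmod</code>, a scraper can skip sitemaps that haven't changed since its last run. Here's a minimal sketch using only Python's standard library - the sitemap snippet and the cutoff date are illustrative stand-ins, not real hm.com data:</p>

```python
from datetime import date
from xml.etree import ElementTree

# illustrative sitemap index snippet, shaped like hm.com's
SITEMAP_INDEX = """
<sitemapindex xmlns="https://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www2.hm.com/en_us.product.0.xml</loc>
    <lastmod>2021-09-29</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://www2.hm.com/en_us.product.1.xml</loc>
    <lastmod>2021-09-01</lastmod>
  </sitemap>
</sitemapindex>
"""

def fresh_sitemaps(xml_text: str, newer_than: date) -> list:
    """return sitemap urls whose <lastmod> is newer than the given date"""
    ns = {"sm": "https://www.sitemaps.org/schemas/sitemap/0.9"}
    tree = ElementTree.fromstring(xml_text)
    fresh = []
    for sitemap in tree.findall("sm:sitemap", ns):
        loc = sitemap.findtext("sm:loc", namespaces=ns)
        lastmod = sitemap.findtext("sm:lastmod", namespaces=ns)
        # keep only sitemaps modified after our cutoff date
        if lastmod and date.fromisoformat(lastmod.strip()) > newer_than:
            fresh.append(loc.strip())
    return fresh

print(fresh_sitemaps(SITEMAP_INDEX, date(2021, 9, 15)))
# only en_us.product.0.xml changed after the cutoff
```

In a real scraper the cutoff would be the timestamp of the previous scrape run, letting us re-scrape only the sitemaps that actually changed.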
<h2 id="example-use-case-hmcom">Example Use Case: HM.com</h2>
<p>Let's write a simple sitemap scraper that will find all product urls on the previously mentioned website <a href="https://hm.com">https://hm.com</a>. For this we'll be using Python with the <code>requests</code> and <code>parsel</code> libraries:</p>
<div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal"> 1</span>
<span class="normal"> 2</span>
<span class="normal"> 3</span>
<span class="normal"> 4</span>
<span class="normal"> 5</span>
<span class="normal"> 6</span>
<span class="normal"> 7</span>
<span class="normal"> 8</span>
<span class="normal"> 9</span>
<span class="normal">10</span>
<span class="normal">11</span>
<span class="normal">12</span>
<span class="normal">13</span>
<span class="normal">14</span>
<span class="normal">15</span>
<span class="normal">16</span>
<span class="normal">17</span>
<span class="normal">18</span>
<span class="normal">19</span>
<span class="normal">20</span>
<span class="normal">21</span>
<span class="normal">22</span>
<span class="normal">23</span>
<span class="normal">24</span>
<span class="normal">25</span>
<span class="normal">26</span>
<span class="normal">27</span>
<span class="normal">28</span>
<span class="normal">29</span>
<span class="normal">30</span>
<span class="normal">31</span>
<span class="normal">32</span>
<span class="normal">33</span>
<span class="normal">34</span>
<span class="normal">35</span>
<span class="normal">36</span>
<span class="normal">37</span>
<span class="normal">38</span>
<span class="normal">39</span>
<span class="normal">40</span></pre></div></td><td class="code"><div><pre><span></span><code><span class="c1"># requires:</span>
<span class="c1"># pip install requests parsel</span>
<span class="kn">import</span> <span class="nn">re</span>
<span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">List</span>
<span class="kn">import</span> <span class="nn">requests</span>
<span class="kn">from</span> <span class="nn">parsel</span> <span class="kn">import</span> <span class="n">Selector</span>
<span class="k">def</span> <span class="nf">parse_sitemap</span><span class="p">(</span><span class="n">url</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-></span> <span class="n">List</span><span class="p">[</span><span class="nb">str</span><span class="p">]:</span>
<span class="w"> </span><span class="sd">"""scrape sitemap and item urls from a sitemap link"""</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"scraping: </span><span class="si">{</span><span class="n">url</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
<span class="n">resp</span> <span class="o">=</span> <span class="n">requests</span><span class="o">.</span><span class="n">get</span><span class="p">(</span>
<span class="n">url</span><span class="p">,</span>
<span class="n">headers</span><span class="o">=</span><span class="p">{</span>
<span class="c1"># we need to fake a browser User-Agent string to get through CDN bot protection</span>
<span class="s2">"User-Agent"</span><span class="p">:</span> <span class="s2">"Mozilla/5.0 (X11; Linux x86_64; rv:92.0) Gecko/20100101 Firefox/92.0"</span>
<span class="p">},</span>
<span class="p">)</span>
<span class="c1"># turn html text to a parsable tree object</span>
<span class="n">doc_tree</span> <span class="o">=</span> <span class="n">Selector</span><span class="p">(</span><span class="n">resp</span><span class="o">.</span><span class="n">text</span><span class="p">)</span>
<span class="c1"># find all <loc> nodes and take their text (which is a url)</span>
<span class="n">urls</span> <span class="o">=</span> <span class="n">doc_tree</span><span class="o">.</span><span class="n">xpath</span><span class="p">(</span><span class="s2">"//loc/text()"</span><span class="p">)</span><span class="o">.</span><span class="n">getall</span><span class="p">()</span>
<span class="k">return</span> <span class="n">urls</span>
<span class="n">product_urls</span> <span class="o">=</span> <span class="nb">set</span><span class="p">()</span>
<span class="n">sitemap_directory</span> <span class="o">=</span> <span class="s2">"https://www2.hm.com/en_us.sitemap.xml"</span>
<span class="k">for</span> <span class="n">url</span> <span class="ow">in</span> <span class="n">parse_sitemap</span><span class="p">(</span><span class="n">sitemap_directory</span><span class="p">):</span>
<span class="k">if</span> <span class="s2">".product."</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">url</span><span class="p">:</span>
<span class="k">continue</span>
<span class="k">for</span> <span class="n">url</span> <span class="ow">in</span> <span class="n">parse_sitemap</span><span class="p">(</span><span class="n">url</span><span class="p">):</span>
<span class="c1"># product urls match pattern com/<some product naming>.html</span>
<span class="c1"># skip non-product urls</span>
<span class="k">if</span> <span class="ow">not</span> <span class="n">re</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="sa">r</span><span class="s2">"hm.com/.+?\.html"</span><span class="p">,</span> <span class="n">url</span><span class="p">):</span>
<span class="k">continue</span>
<span class="n">product_urls</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"</span><span class="se">\n</span><span class="s2">"</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">product_urls</span><span class="p">)[:</span><span class="mi">3</span><span class="p">]))</span>
<span class="nb">print</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">product_urls</span><span class="p">))</span>
</code></pre></div></td></tr></table></div>
<p>If we run this small scraper script, we'll see that the sitemap discovery approach yields 13639 unique results (at the time of writing)! Even when scraping synchronously, the sitemap approach is a really efficient way to discover a large amount of products.</p>
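<p>Since each product sitemap is independent, the second loop above parallelizes well. Here's a minimal sketch using Python's <code>concurrent.futures</code> thread pool - the <code>fetch</code> function and the inline sitemap bodies are made-up stand-ins so the example runs offline; in a real scraper <code>fetch</code> would wrap <code>requests.get</code>:</p>

```python
import re
from concurrent.futures import ThreadPoolExecutor

# made-up sitemap bodies standing in for real HTTP responses
FAKE_PAGES = {
    "https://www2.hm.com/en_us.product.0.xml": (
        "<urlset><url><loc>https://www2.hm.com/en_us/productpage.0001.html</loc></url></urlset>"
    ),
    "https://www2.hm.com/en_us.product.1.xml": (
        "<urlset><url><loc>https://www2.hm.com/en_us/productpage.0002.html</loc></url></urlset>"
    ),
}

def fetch(url: str) -> str:
    """stand-in for an HTTP GET; swap in requests.get(url).text for real scraping"""
    return FAKE_PAGES[url]

def scrape_product_sitemaps(sitemap_urls, fetch=fetch, workers=10):
    """fetch product sitemaps concurrently and collect product page urls"""
    product_urls = set()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map fetches up to `workers` sitemaps at a time
        for body in pool.map(fetch, sitemap_urls):
            for url in re.findall(r"<loc>(.+?)</loc>", body):
                # keep only product page urls, same pattern as before
                if re.search(r"hm\.com/.+?\.html", url):
                    product_urls.add(url)
    return product_urls

print(len(scrape_product_sitemaps(list(FAKE_PAGES))))  # 2
```

With a real HTTP-backed <code>fetch</code>, this turns a few hundred sequential sitemap requests into a handful of concurrent batches.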
<h2 id="confirming-results">Confirming Results</h2>
<p>Finally, we should confirm whether this discovery approach has good coverage by comparing it with other discovery approaches. For that we either need to find the total product count somewhere (some websites mention "total N results available" in their content) or use another discovery strategy to evaluate our coverage against. For this particular website we can take a look at the <a href="/web-scraping-discovery-search.html">search bar discovery approach</a> covered in another Scrapecrow article:</p>
<p><a href="/images/hm.com-space-search.png"><img class="bigc" loading="lazy" src="/images/hm.com-space-search.png" title="13780 results found by searching for url quote space character <code>%20</code>"/></a><figcaption>13780 results found by searching for url quote space character <code>%20</code></figcaption></p>
<p>Using the empty search approach described in the <a href="/web-scraping-discovery-search.html">search discovery article</a>, we can see that our sitemap discovery yields almost the same amount of results:</p>
<ul>
<li>Sitemaps: 13639</li>
<li>Searchbar: 13780</li>
</ul>
<p>The 141 results we're missing are probably an indication that the sitemap index is running slightly behind the product database. This is a good illustration of how discovery techniques differ and why they matter: for important scrapers it's a good idea to diversify.</p>
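<p>This kind of cross-check is easy to automate with Python sets. A small sketch - the helper name and the URL values here are hypothetical, made up for illustration:</p>

```python
def coverage_report(sitemap_urls: set, search_urls: set) -> dict:
    """compare two discovery strategies: overlap and what each one missed"""
    return {
        "both": len(sitemap_urls & search_urls),
        "sitemap_only": sitemap_urls - search_urls,
        "search_only": search_urls - sitemap_urls,
    }

# hypothetical example: search discovery found one url the sitemap misses
sitemaps = {"/product.1.html", "/product.2.html"}
search = {"/product.1.html", "/product.2.html", "/product.3.html"}
print(coverage_report(sitemaps, search))
```

Anything in <code>search_only</code> is a candidate for products the sitemap index hasn't picked up yet, which is exactly the 141-result gap we saw above.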
<h2 id="summary-and-further-reading">Summary and Further Reading</h2>
<p>To summarize, using sitemaps in web scraping is an efficient, effective and quick product discovery technique, with the only real downsides being data staleness, coverage and availability.</p>
<hr/>
<p>For more web scraping discovery techniques, see the <a href="/tag/discovery-methods.html">#discovery-methods</a> and <a href="/tag/discovery.html">#discovery</a> tags.</p>
<p>If you have any questions, come join us on <a href="https://matrix.to/#/%23web-scraping:matrix.org">#web-scraping on matrix</a>, check out <a href="https://stackoverflow.com/questions/tagged/web-scraping">#web-scraping on stackoverflow</a> or leave a comment below! </p>
<p>As always, you can hire me for web scraping consultation over at the <a href="/pages/hire.html">hire</a> page. Happy scraping!</p>
<hr/>
<figcaption>image credits: "Map of North America" by NASA Johnson is licensed under CC BY-NC 2.0</figcaption>