Scrapecrow - https://scrapecrow.com/
<h1>A Year of Writing about Web Scraping in Review</h1>
<p>2023-01-02 · Bernardas Ališauskas</p>
<p>I'd like to share the favorite things I wrote in a year of professional web scraping blogging at <a href="https://scrapfly.io/">ScrapFly</a> and my key takeaways.</p><p>At the beginning of 2022 I joined <a href="https://scrapfly.io">Scrapfly.io</a> - a small web scraping API company. One of my main roles is to keep <a href="https://scrapfly.io/blog/">our blog</a> running with various tutorials and web scraping highlights.<br/>
So, to wrap up this year, here's a list of my favorite articles I've worked on in 2022 and what I've learned from working on them!</p>
<h2 id="introductions">Introductions</h2>
<p>Web scraping is a very accessible niche. You can write a scraper in almost any programming language as long as it has a viable HTTP client (which is basically any language). </p>
<p><a href="/images/hello-im.webp"><img class="center" loading="lazy" src="/images/hello-im.webp" title="" width="200"/></a><figcaption></figcaption></p>
<p>At ScrapFly, we have a diverse client base using all sorts of programming languages to scrape the web using our API.
So, I wrote multiple in-depth and up-to-date introductions to web scraping in several languages which was quite a journey on its own:</p>
<ul>
<li><a href="https://scrapfly.io/blog/web-scraping-with-python/">Python</a> is the most popular language in this medium for a reason - it's a great data language with strong async support and some of the best community libraries out there.</li>
<li><a href="https://scrapfly.io/blog/web-scraping-with-r/">R language</a> is surprisingly effective. The pipe operator fits into web scraping flows perfectly and is really fun to use.</li>
<li><a href="https://scrapfly.io/blog/web-scraping-with-php-101/">PHP</a> has a lot of great web tools that fit into web scraping though generally it's not a very good data language.</li>
<li><a href="https://scrapfly.io/blog/web-scraping-with-ruby/">Ruby</a> feels a lot like Python just with a smaller community and fewer tools.</li>
<li><a href="https://scrapfly.io/blog/web-scraping-with-nodejs/">Javascript/NodeJS</a> seems to be able to do everything and scraping is no exception. I'd say that it's not a very good data language so it wouldn't be my first choice for scraping.</li>
</ul>
<p>This was a great exercise to understand what makes a good web scraping environment.
Working on these articles provided really valuable horizontal growth as it presented the problem of web scraping from different perspectives. Every language seems to have something unique and special about it and having a general overview did help me to improve my development techniques and workflows.</p>
<h2 id="guides">Guides</h2>
<p>Another way we tried to reach out to scraper developers is by writing hands-on <strong><a href="https://scrapfly.io/blog/tag/scrapeguide/">scrape guides</a></strong> for popular scraping targets like: </p>
<ul>
<li><a href="https://scrapfly.io/blog/how-to-scrape-instagram/">Instagram</a> </li>
<li><a href="https://scrapfly.io/blog/how-to-scrape-bookingcom/">Booking.com</a> </li>
<li><a href="https://scrapfly.io/blog/how-to-scrape-amazon/">Amazon</a></li>
<li><a href="https://scrapfly.io/blog/how-to-scrape-zillow/">Zillow</a></li>
</ul>
<p>And <a href="https://scrapfly.io/blog/tag/scrapeguide/">dozens of others</a>.</p>
<p>For these, I had to choose a standard set of tools and libraries which was a difficult challenge. Using browser automation tools like <a href="https://scrapfly.io/blog/web-scraping-with-selenium-and-python/">Selenium</a>, <a href="https://scrapfly.io/blog/web-scraping-with-playwright-and-python/">Playwright</a> or <a href="https://scrapfly.io/blog/web-scraping-with-puppeteer-and-nodejs/">Puppeteer</a> can often be easier but they don't
fit the tutorial format well as browsers are complicated and work differently on different user machines. </p>
<p><a href="/images/guide.svg"><img class="center" loading="lazy" src="/images/guide.svg" title="" width="200"/></a><figcaption></figcaption></p>
<p>I've settled on <strong>Python</strong> with <a href="https://pypi.org/project/httpx/">httpx</a>, <a href="https://pypi.org/project/parsel/">parsel</a> and <a href="https://scrapfly.io/blog/parse-json-jmespath-python/">jmespath</a> as the de facto scraping environment as it's really powerful and approachable. </p>
<p>With this setup, there's more focus on reverse engineering and web technologies rather than automation but it's easier to figure out how to explain technical concepts than to deal with replication issues of web browsers.</p>
<p>As for the library stack: </p>
<ul>
<li><code>httpx</code> is a brilliant http client which supports http2 and respects RFC standards. It's probably the best HTTP client out there outside of cURL. </li>
<li><code>parsel</code> supports both CSS and XPath selectors for HTML parsing which is by far the best way to parse HTML. </li>
<li><code>jmespath</code> is the most accessible way to parse JSON datasets. Usually, I'd favor <code>jsonpath</code> over <code>jmespath</code> as it allows non-root based selections (e.g. find me any node named <code>foo</code> anywhere), but Python's jsonpath implementations aren't great. </li>
</ul>
<p>Finally, Python itself comes with loads of built-in tools like <code>collections</code> and <code>urllib</code> libraries that really fit into the web scraping world.</p>
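<p>For instance, <code>urllib.parse</code> alone covers everyday chores like resolving relative links, and <code>collections.Counter</code> makes quick work of tallying results. A rough sketch (the URLs here are made up for illustration):</p>

```python
from collections import Counter
from urllib.parse import urljoin, urlparse

# resolve relative links found on a page against the page's own URL
base = "https://example.com/catalog/page1.html"
found_links = ["/product/1", "item/2", "https://other.com/x"]
absolute = [urljoin(base, link) for link in found_links]

# tally how many scraped links point at each domain
domains = Counter(urlparse(link).netloc for link in absolute)
```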
<p>Each popular target is surprisingly different though I noticed how prevalent client-side rendering is in the modern web. </p>
<p>More than half of the scrape guides didn't use much HTML parsing at all! Instead, <a href="https://scrapfly.io/blog/how-to-scrape-hidden-web-data/">hidden web data</a> scraping turned out to be a much easier way to scrape this modern javascript-powered part of the web.</p>
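<p>To illustrate the idea, here's a rough stdlib-only sketch of hidden web data scraping - the page snippet and its <code>id="state"</code> script tag are hypothetical, but modern javascript-powered sites often embed their dataset as JSON in exactly this way:</p>

```python
import json
import re

# a hypothetical page snippet: javascript-heavy sites often embed their
# full dataset as JSON inside a script tag (the id here is made up)
html = """
<html><body>
<script id="state" type="application/json">
{"product": {"name": "Widget", "price": 12.99}}
</script>
</body></html>
"""

# pull the JSON blob out of the script tag and parse it directly
match = re.search(
    r'<script id="state" type="application/json">\s*(.+?)\s*</script>',
    html,
    re.DOTALL,
)
data = json.loads(match.group(1))
```

<p>No HTML parsing is involved at all: the JSON blob is extracted with a regular expression and decoded directly into Python data structures.</p>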
<h2 id="key-subjects">Key Subjects</h2>
<p>Once we started working on scrape guides we noticed a lot of repetition. </p>
<p>Common subjects like <a href="https://scrapfly.io/blog/how-to-scrape-hidden-web-data/">how to scrape hidden web data</a>, <a href="https://scrapfly.io/blog/parse-json-jmespath-python/">how to parse JSON data</a> and how to use <a href="https://scrapfly.io/blog/parsing-html-with-css/">CSS</a> and <a href="https://scrapfly.io/blog/parsing-html-with-xpath/">XPath</a> selectors would have to be repeated in each article. </p>
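<p>As a small taste of the selector articles - Python's standard library even ships a limited XPath subset in <code>xml.etree.ElementTree</code>, enough to sketch the idea on a made-up fragment (a real scraper would use <code>parsel</code> or <code>lxml</code> for full CSS and XPath support):</p>

```python
from xml.etree import ElementTree

# a tiny, well-formed HTML fragment (made up for illustration)
html = """
<html><body>
  <div class="product"><a href="/item/1">First</a></div>
  <div class="product"><a href="/item/2">Second</a></div>
</body></html>
"""

tree = ElementTree.fromstring(html)
# ElementTree's limited XPath subset: select all anchor nodes
# under any div with class="product"
links = tree.findall('.//div[@class="product"]/a')
hrefs = [a.get("href") for a in links]
names = [a.text for a in links]
```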
<p><a href="/images/books.svg"><img class="center" loading="lazy" src="/images/books.svg" title="" width="200"/></a><figcaption></figcaption></p>
<p>So, it seemed appropriate for us to cover each of these subjects individually. This led us even further - what are some other key web scraping subjects that could be referenced anywhere? Here are a few we've settled on:</p>
<ul>
<li><a href="https://scrapfly.io/blog/how-to-ensure-web-scrapped-data-quality/">How to ensure web scraped data quality?</a> - data validation is boring but vital for consistent data delivery.</li>
<li><a href="https://scrapfly.io/blog/how-to-turn-web-scrapers-into-data-apis/">How to turn web scrapers into data APIs?</a> - real time scraping API with FastAPI and async Python.</li>
<li><a href="https://scrapfly.io/blog/how-to-rotate-proxies-in-web-scraping/">How to rotate proxies?</a> - weighted randomization!</li>
<li><a href="https://scrapfly.io/blog/web-scraping-speed/">web scraper scaling in Python</a> - Asyncio, threads and subprocesses from the point of view of web scraping.</li>
<li><a href="https://scrapfly.io/blog/how-to-rate-limit-asynchronous-python-requests/">How to rate limit requests?</a></li>
<li>How to use browser automation tools like <a href="https://scrapfly.io/blog/web-scraping-with-puppeteer-and-nodejs/">Puppeteer</a>, <a href="https://scrapfly.io/blog/web-scraping-with-playwright-and-python/">Playwright</a> and <a href="https://scrapfly.io/blog/web-scraping-with-selenium-and-python/">Selenium</a></li>
</ul>
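<p>To give a taste of the proxy rotation article, here is a minimal sketch of the "weighted randomization" idea - the proxy addresses and weights below are made up; in practice the weights would come from observed proxy health or success rates:</p>

```python
import random

# hypothetical proxy pool: healthy proxies get higher weights so they
# are picked more often; recently-failing ones are deprioritized
proxies = ["http://proxy1:8000", "http://proxy2:8000", "http://proxy3:8000"]
weights = [10, 5, 1]  # e.g. derived from observed success rates

def pick_proxy():
    # weighted random pick - the core of a weighted rotation strategy
    return random.choices(proxies, weights=weights, k=1)[0]

random.seed(42)
picks = [pick_proxy() for _ in range(1000)]
```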
<p>Most of the topics came up naturally as we were reviewing and developing our articles, though some came from our customers and from questions on popular forums like Stack Overflow or Reddit.</p>
<h2 id="favorites">Favorites</h2>
<p>The most difficult and interesting subject in web scraping is definitely <strong>scraper blocking</strong>. To cover it I got a lot of help from my teammates and put together a <a href="https://scrapfly.io/blog/how-to-scrape-without-getting-blocked-tutorial/">definitive guide on web scraper blocking</a> which took a lot of research and turned out to be our readers' favorite!</p>
<p><a href="/images/blue-heart.svg"><img class="center" loading="lazy" src="/images/blue-heart.svg" title="" width="200"/></a><figcaption></figcaption></p>
<p>I thought I was pretty well versed in the blocking subject, but things like the sheer power of <a href="https://scrapfly.io/blog/how-to-avoid-web-scraping-blocking-tls/">TLS fingerprinting</a> really blew my mind. There are just so many ways to fingerprint and identify users based on their connections.</p>
<p>My favorite article, however, is <a href="https://scrapfly.io/blog/search-engine-using-web-scraping/">creating search engines using web scraping</a> where we demo how any website can be quickly indexed with a bit of scraper code, some clever HTML parsing and an index front-end like LunrJS. I've adopted this technique in many other projects, as having a quick data index is really useful!</p>
<h2 id="lessons-learned">Lessons Learned</h2>
<p>We spent a lot of time designing our blog to function as a knowledge graph rather than a series of individual guides. This paid off pretty well as SEO seems to like references. </p>
<p><a href="/images/wizard.svg"><img class="center" loading="lazy" src="/images/wizard.svg" title="" width="200"/></a><figcaption></figcaption></p>
<p>One slight disappointment was how cool features can backfire when it comes to SEO. For our <a href="https://scrapfly.io/blog/parsing-html-with-css/">CSS</a> and <a href="https://scrapfly.io/blog/parsing-html-with-xpath/">XPath</a> articles, we built real-time on-page testing widgets that the search engines didn't like; it took a while to optimize them before they finally paid off.</p>
<p>We also tried structuring our articles with revealable components (like <code>&lt;details&gt;</code>) which increased readability but noticeably worsened SEO.</p>
<p>Keyword diversity was another important challenge as we wanted all sorts of users to find our content - from beginner developers to seniors discovering web scraping. </p>
<p>Since web-scraping is still a small niche there aren't many well-established terms, so diversifying terms made a huge difference. Sometimes increasing monthly readers tenfold!</p>
<p><a href="/images/typewriter.svg"><img class="center" loading="lazy" src="/images/typewriter.svg" title="" width="200"/></a><figcaption></figcaption></p>
<p>I wrote close to 60 articles that are on average 3000 words long. All of them started as ~6000-word drafts and were edited into something denser and more digestible. </p>
<p><strong>Keep things short</strong> is easier said than done, though. The real challenge is making all of these conflicting concerns work together:</p>
<ul>
<li>SEO favors longer text as it allows more keyword diversity.</li>
<li>Keyword repetition is bad for SEO but good for comprehension. </li>
<li>Shorter explanations might miss vital details, but longer ones can be more difficult to comprehend or simply bore the reader.</li>
<li>Which problems deserve to be explained and which don't?</li>
</ul>
<p>To start addressing these I focused more on the structure of the article. For this, a <strong>table of contents</strong> is such a gem of a feature! </p>
<p>A TOC is great for the reader to navigate and get an overview of the article, but it's also great for the editor, as it helps to visualize: is this the right order to approach this problem? What's missing and what shouldn't be there?</p>
<p>For example, in our scraping guide articles, I quickly learned that starting with a short example or how to scrape a single product page is much more approachable than following the traditional project order.</p>
<p>Getting rid of factoids and history lessons helped too. People are probably interested in the fact that the Referer header is officially misspelled - it's a fun thing to know - but do they really need to know more about it?</p>
<p>Media use can also be a major help. A well-placed image or video can save several paragraphs of text and provide some breathing space.</p>
<h2 id="the-great-beyond">The Great Beyond!</h2>
<p>Exploring the web scraping subject is a real pleasure - it's vast and challenging, though the best part is that at the end of the day you see concrete data results. Getting a quality dataset conjured up from nothing is immensely satisfying, which makes all of that work worth it!</p>
<p><a href="/images/rocket.svg"><img class="center" loading="lazy" src="/images/rocket.svg" title="" width="200"/></a><figcaption></figcaption></p>
<p>We still have loads of ideas on educational web scraping content that we will be working on in 2023. We've started work on a collection of <a href="https://scrapfly.io/web-scraping-tools">web scraping tools</a> and a few more secret projects that should pop up soon! :)</p>
<h1>Asynchronous Web Scraping: Scaling For The Moon!</h1>
<p>2021-10-27 · Bernardas Ališauskas</p>
<p>Asynchronous programming can speed up web scrapers astronomically and is by far the most important scaling step when dealing with big projects. What exactly is it, how does it work, and what are the best ways to take advantage of it?</p><p><a href="/images/moon.png"><img class="fullc" loading="lazy" src="/images/moon.png" title=""/></a><figcaption></figcaption></p>
<p>Scaling is one of the biggest challenges in the web scraping niche, and there are a lot of ways to make web scrapers scale better in efficiency, speed and reliability. However, there's one thing that can be done to web scraper code that is by far the biggest bang-per-buck improvement: using asynchronous code.</p>
<p>In this short article, we'll take a look at asynchronous python and how we can take advantage of it to make our scrapers over 100 times faster!</p>
<h2 id="what-is-asynchronous-programming">What is Asynchronous Programming?</h2>
<p>Python supports many async code paradigms; however, the current de facto standard is async/await via <code>asyncio</code> (or its alternatives like <a href="https://trio.readthedocs.io/en/stable/">trio</a>). In this article we'll focus on asyncio as it's currently the most approachable async paradigm in Python.</p>
<p>So what is asynchronous programming?<br/>
At its core, it's essentially just pausable functions (called coroutines) that can pause while inactive and give way to active ones. This often has huge benefits in IO (input/output) bound programs that rely on waiting for some sort of external input or output. </p>
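<p>A minimal illustration of this pausing, with <code>asyncio.sleep()</code> standing in for an IO wait:</p>

```python
import asyncio
import time

async def fetch(name, delay):
    # asyncio.sleep() stands in for an IO wait (e.g. a pending HTTP
    # response); while this coroutine is paused, others are free to run
    await asyncio.sleep(delay)
    return name

async def main():
    # both "requests" wait concurrently: total time is ~0.5s, not 0.9s
    return await asyncio.gather(fetch("first", 0.5), fetch("second", 0.4))

start = time.time()
results = asyncio.run(main())
elapsed = time.time() - start
```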
<p>What is an IO block?<br/>
When programs interact with an <em>external</em> service they often need to wait for the service to respond: this waiting is called an IO block. Some examples of IO blocks:</p>
<ul>
<li>In online apps an IO block would be waiting for the server to respond to requests.</li>
<li>In GUI apps this would be waiting for the app user to click buttons, enter text or interact with other GUI widgets. </li>
<li>In video games this would be waiting for the player to perform some actions.</li>
</ul>
<p>IO blocking is a big deal in web scraping, since the majority of the actual work the program does is communicating with the web server.<br/>
For example, in synchronous Python, when we send a request to a server our code stops - does absolutely nothing - while waiting for the server to respond. In web scraping this wait can be up to several seconds per request, and those seconds add up really quickly!</p>
<p>Say we have 100 URLs to scrape that wait 1 second each - that's over a minute and a half of waiting! Could we make them all wait together, concurrently?</p>
<h2 id="how-does-async-web-scraping-work">How Does Async Web Scraping Work?</h2>
<p class="info">In these examples we'll be using <a href="https://www.python-httpx.org/">httpx</a> http client package which supports both synchronous and asynchronous client APIs</p>
<p>Let's take a look at a quick example. We have 100 urls that we want to scrape, and we have this really simple synchronous scraper code:</p>
<div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal"> 1</span>
<span class="normal"> 2</span>
<span class="normal"> 3</span>
<span class="normal"> 4</span>
<span class="normal"> 5</span>
<span class="normal"> 6</span>
<span class="normal"> 7</span>
<span class="normal"> 8</span>
<span class="normal"> 9</span>
<span class="normal">10</span>
<span class="normal">11</span>
<span class="normal">12</span>
<span class="normal">13</span>
<span class="normal">14</span>
<span class="normal">15</span>
<span class="normal">16</span>
<span class="normal">17</span>
<span class="normal">18</span>
<span class="normal">19</span>
<span class="normal">20</span>
<span class="normal">21</span>
<span class="normal">22</span>
<span class="normal">23</span>
<span class="normal">24</span></pre></div></td><td class="code"><div><pre><span></span><code><span class="kn">from</span> <span class="nn">httpx</span> <span class="kn">import</span> <span class="n">Client</span>
<span class="kn">from</span> <span class="nn">time</span> <span class="kn">import</span> <span class="n">time</span>
<span class="k">def</span> <span class="nf">scrape</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">session</span><span class="p">):</span>
<span class="k">return</span> <span class="n">session</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">run</span><span class="p">():</span>
<span class="n">_start</span> <span class="o">=</span> <span class="n">time</span><span class="p">()</span>
<span class="n">results</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">with</span> <span class="n">Client</span><span class="p">()</span> <span class="k">as</span> <span class="n">session</span><span class="p">:</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">100</span><span class="p">):</span>
<span class="c1"># this url will always take 1 second</span>
<span class="n">url</span> <span class="o">=</span> <span class="s2">"http://httpbin.org/delay/1"</span>
<span class="n">results</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">scrape</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">session</span><span class="o">=</span><span class="n">session</span><span class="p">))</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"finished scraping in: </span><span class="si">{</span><span class="n">time</span><span class="p">()</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">_start</span><span class="si">:</span><span class="s2">.1f</span><span class="si">}</span><span class="s2"> seconds"</span><span class="p">)</span>
<span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s2">"__main__"</span><span class="p">:</span>
<span class="n">run</span><span class="p">()</span>
<span class="c1"># results in:</span>
<span class="c1"># finished scraping in: 135.8 seconds</span>
</code></pre></div></td></tr></table></div>
<p class="info">Here we're taking advantage of <a href="https://httpbin.org">https://httpbin.org</a> delayed response endpoint <code>/delay/<n></code> which will simulates slow server which takes N seconds to respond. This tool is great for testing web scraping scaling!</p>
<p>In this example, we schedule 100 different requests synchronously, each of which should take at least 1 second to execute: a 1 second delay by the server + all the connection overhead our machine does = 100+ seconds.</p>
<p>Now let's try to run those 100 requests at the same time, asynchronously, using the <code>asyncio</code> event loop and the <code>asyncio.gather()</code> function:</p>
<div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal"> 1</span>
<span class="normal"> 2</span>
<span class="normal"> 3</span>
<span class="normal"> 4</span>
<span class="normal"> 5</span>
<span class="normal"> 6</span>
<span class="normal"> 7</span>
<span class="normal"> 8</span>
<span class="normal"> 9</span>
<span class="normal">10</span>
<span class="normal">11</span>
<span class="normal">12</span>
<span class="normal">13</span>
<span class="normal">14</span>
<span class="normal">15</span>
<span class="normal">16</span>
<span class="normal">17</span>
<span class="normal">18</span>
<span class="normal">19</span>
<span class="normal">20</span>
<span class="normal">21</span>
<span class="normal">22</span>
<span class="normal">23</span></pre></div></td><td class="code"><div><pre><span></span><code><span class="kn">import</span> <span class="nn">asyncio</span>
<span class="kn">from</span> <span class="nn">time</span> <span class="kn">import</span> <span class="n">time</span>
<span class="kn">from</span> <span class="nn">httpx</span> <span class="kn">import</span> <span class="n">AsyncClient</span>
<span class="k">async</span> <span class="k">def</span> <span class="nf">scrape</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">session</span><span class="p">):</span>
<span class="k">return</span> <span class="k">await</span> <span class="n">session</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
<span class="k">async</span> <span class="k">def</span> <span class="nf">run</span><span class="p">():</span>
<span class="n">_start</span> <span class="o">=</span> <span class="n">time</span><span class="p">()</span>
<span class="k">async</span> <span class="k">with</span> <span class="n">AsyncClient</span><span class="p">()</span> <span class="k">as</span> <span class="n">session</span><span class="p">:</span>
<span class="n">tasks</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">100</span><span class="p">):</span>
<span class="c1"># this url will always take 1 second</span>
<span class="n">url</span> <span class="o">=</span> <span class="s2">"http://httpbin.org/delay/1"</span>
<span class="n">tasks</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">scrape</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">session</span><span class="o">=</span><span class="n">session</span><span class="p">))</span>
<span class="n">results</span> <span class="o">=</span> <span class="k">await</span> <span class="n">asyncio</span><span class="o">.</span><span class="n">gather</span><span class="p">(</span><span class="o">*</span><span class="n">tasks</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"finished scraping in: </span><span class="si">{</span><span class="n">time</span><span class="p">()</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">_start</span><span class="si">:</span><span class="s2">.1f</span><span class="si">}</span><span class="s2"> seconds"</span><span class="p">)</span>
<span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s2">"__main__"</span><span class="p">:</span>
<span class="n">asyncio</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="n">run</span><span class="p">())</span>
</code></pre></div></td></tr></table></div>
<p>If we run this minor async modification, we'll see a colossal speed boost: </p>
<blockquote>
<p>finished scraping in: 2.4 seconds</p>
</blockquote>
<p>We sped up our scraper <strong>over 50 times</strong> just by converting a few lines of code into asynchronous Python! Let's take a look at a small illustration of the performance of these two programs:</p>
<p><a href="/images/async-vs-sync.png"><img class="" loading="lazy" src="/images/async-vs-sync.png" width='title=""'/></a><figcaption></figcaption></p>
<p>Here, we can easily visualize the difference between synchronous and asynchronous scraper programs: the IO waiting on the left is <em>part</em> of the program, while on the right it doesn't have to be!</p>
<p>Async programs essentially prevent IO waiting from blocking our programs. For wait-intensive tasks such as HTTP connections, async essentially means concurrency and can speed up programs from dozens to thousands of times over!</p>
<p>However, as the illustration shows, there are some negatives to async programming: </p>
<ul>
<li>Ensuring execution order can be difficult. <br/>
Often, request order can be very important in session or cookie bound web scraping. Keeping track of sessions and cookies is much more complicated in an async environment.</li>
<li>Async programs are much harder to design and debug since they are inherently more complex.</li>
<li>Async speed can be a major pain when dealing with scraper bans, which require extra throttling logic.</li>
</ul>
<p>Whether this complexity overhead is worth it for your web scraper entirely depends on your program. For small scripts, it might not be worth tangling with the whole async ecosystem, which is still quite young in Python. However, for big data scrapers, the 100-1000x speed boosts async provides are vital for any reasonable data collection speeds!</p>
<h2 id="were-going-too-fast-throttling">We're Going Too Fast: Throttling</h2>
<p>Using async Python we can make a lot of HTTP requests very quickly, which can become problematic. Web servers often try to protect themselves from high traffic, as it's resource-expensive, and will often block scrapers that connect too quickly.</p>
<p>To avoid bans and blocks, we must either use multiple proxies or, preferably, throttle our connections. Let's take a look at a few common ways to throttle web scrapers.</p>
<h3 id="asyncios-semaphore">Asyncio's Semaphore</h3>
<p>This is the built-in way to throttle coroutine concurrency in Python's asyncio. A Semaphore is essentially a lock object with a limited number of working slots, i.e. it allows us to specify that no more than N coroutines should execute concurrently:</p>
<div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal"> 1</span>
<span class="normal"> 2</span>
<span class="normal"> 3</span>
<span class="normal"> 4</span>
<span class="normal"> 5</span>
<span class="normal"> 6</span>
<span class="normal"> 7</span>
<span class="normal"> 8</span>
<span class="normal"> 9</span>
<span class="normal">10</span>
<span class="normal">11</span>
<span class="normal">12</span>
<span class="normal">13</span>
<span class="normal">14</span>
<span class="normal">15</span>
<span class="normal">16</span>
<span class="normal">17</span>
<span class="normal">18</span>
<span class="normal">19</span>
<span class="normal">20</span>
<span class="normal">21</span>
<span class="normal">22</span>
<span class="normal">23</span>
<span class="normal">24</span>
<span class="normal">25</span>
<span class="normal">26</span>
<span class="normal">27</span>
<span class="normal">28</span></pre></div></td><td class="code"><div><pre><span></span><code><span class="kn">import</span> <span class="nn">asyncio</span>
<span class="kn">from</span> <span class="nn">time</span> <span class="kn">import</span> <span class="n">time</span>
<span class="kn">from</span> <span class="nn">httpx</span> <span class="kn">import</span> <span class="n">AsyncClient</span>
<span class="k">async</span> <span class="k">def</span> <span class="nf">scrape</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">session</span><span class="p">,</span> <span class="n">throttler</span><span class="p">):</span>
<span class="k">async</span> <span class="k">with</span> <span class="n">throttler</span><span class="p">:</span>
<span class="k">return</span> <span class="k">await</span> <span class="n">session</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
<span class="k">async</span> <span class="k">def</span> <span class="nf">run</span><span class="p">():</span>
<span class="n">_start</span> <span class="o">=</span> <span class="n">time</span><span class="p">()</span>
<span class="n">throttler</span> <span class="o">=</span> <span class="n">asyncio</span><span class="o">.</span><span class="n">Semaphore</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
<span class="k">async</span> <span class="k">with</span> <span class="n">AsyncClient</span><span class="p">()</span> <span class="k">as</span> <span class="n">session</span><span class="p">:</span>
<span class="n">tasks</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">100</span><span class="p">):</span>
<span class="c1"># this url will always take 1 second</span>
<span class="n">url</span> <span class="o">=</span> <span class="s2">"http://httpbin.org/delay/1"</span>
<span class="n">tasks</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">scrape</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">session</span><span class="o">=</span><span class="n">session</span><span class="p">,</span> <span class="n">throttler</span><span class="o">=</span><span class="n">throttler</span><span class="p">))</span>
<span class="n">results</span> <span class="o">=</span> <span class="k">await</span> <span class="n">asyncio</span><span class="o">.</span><span class="n">gather</span><span class="p">(</span><span class="o">*</span><span class="n">tasks</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"finished scraping in: </span><span class="si">{</span><span class="n">time</span><span class="p">()</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">_start</span><span class="si">:</span><span class="s2">.1f</span><span class="si">}</span><span class="s2"> seconds"</span><span class="p">)</span>
<span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s2">"__main__"</span><span class="p">:</span>
<span class="n">asyncio</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="n">run</span><span class="p">())</span>
<span class="c1"># results in:</span>
<span class="c1"># finished scraping in: 14.8 seconds</span>
</code></pre></div></td></tr></table></div>
<p>Here we modified our async scraper code with one key addition: an <code>asyncio.Semaphore(10)</code> object, which limits execution to 10 concurrent coroutines. Our original unlimited code took 2.4 seconds to complete; now we've slowed it down 6 times with a few extra lines of code! </p>
<p>Some important things to notice about <code>asyncio.Semaphore</code>:</p>
<p>All throttled coroutines must use the same <code>Semaphore</code> object. This means one shared semaphore needs to be passed around to each individual coroutine, introducing a slight complexity overhead.</p>
<p><code>Semaphore</code> is only aware of concurrency, not time. Often when web-scraping we want to limit the request count within a specific time frame, e.g. a maximum of 60 requests/minute. In other words, our <code>Semaphore(10)</code> could yield 10 requests/second when scraping fast servers, but only 2 requests/second on slower ones.</p>
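<p>To illustrate the shared-object requirement, here's a minimal stdlib-only sketch where <code>asyncio.sleep</code> stands in for an HTTP request (a toy illustration, not production scraping code) - every coroutine receives the same <code>Semaphore</code> instance:</p>

```python
import asyncio
from time import time

async def scrape(i, throttler):
    # every coroutine enters through the SAME semaphore object
    async with throttler:  # at most 2 coroutines run this block at once
        await asyncio.sleep(0.1)  # stand-in for a 0.1 second request
        return i

async def run():
    _start = time()
    throttler = asyncio.Semaphore(2)  # created once, shared by all tasks
    results = await asyncio.gather(*[scrape(i, throttler) for i in range(10)])
    elapsed = time() - _start
    print(f"finished {len(results)} tasks in {elapsed:.1f} seconds")
    return elapsed

elapsed = asyncio.run(run())
```

<p>With 10 tasks of 0.1 seconds each and a concurrency limit of 2, the run takes roughly 0.5 seconds instead of 0.1.</p>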
<p>While <code>Semaphore</code> gives us an easy way to throttle ourselves, it's not an ideal approach as it doesn't let us be as precise as some web servers might require. Web servers often have strict limits like 60 requests/second - for the most efficient web scraper we'd like to stay close to that limit - so let's take a look at how we can accomplish this.</p>
<h3 id="leaky-bucket">Leaky Bucket</h3>
<p>Leaky Bucket is a common throttling algorithm that uses time tracking for throttling rather than just concurrency. Using the Leaky Bucket approach, we can specify how many tasks can be executed in a specific time frame, i.e. we can throttle our scraper to 10 requests/second!</p>
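<p>Before reaching for a library, the core idea can be sketched in plain <code>asyncio</code> (a simplified illustration of the algorithm, not how any particular library actually implements it):</p>

```python
import asyncio
from time import monotonic

class NaiveLeakyBucket:
    """Toy leaky bucket: allow at most `rate` acquisitions per second."""

    def __init__(self, rate: float):
        self.interval = 1.0 / rate  # minimum gap between two "drips"
        # start one interval in the past so the first task runs immediately
        self._next_slot = monotonic() - self.interval

    async def __aenter__(self):
        # reserve the next free time slot and sleep until it arrives
        now = monotonic()
        self._next_slot = max(self._next_slot + self.interval, now)
        await asyncio.sleep(self._next_slot - now)

    async def __aexit__(self, *exc):
        pass

async def task(i, bucket):
    async with bucket:
        return i  # stand-in for an HTTP request

async def run():
    _start = monotonic()
    bucket = NaiveLeakyBucket(rate=10)  # 10 tasks/second
    results = await asyncio.gather(*[task(i, bucket) for i in range(20)])
    return monotonic() - _start, results

elapsed, results = asyncio.run(run())
print(f"finished {len(results)} tasks in {elapsed:.1f} seconds")
```

<p>20 tasks at a rate of 10/second take roughly 2 seconds no matter how fast each individual task completes.</p>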
<p>A popular implementation for python's asyncio is available in the <a href="https://github.com/mjpieters/aiolimiter">aiolimiter</a> package. Let's take a quick look at how we can use it in our web-scraper:</p>
<div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal"> 1</span>
<span class="normal"> 2</span>
<span class="normal"> 3</span>
<span class="normal"> 4</span>
<span class="normal"> 5</span>
<span class="normal"> 6</span>
<span class="normal"> 7</span>
<span class="normal"> 8</span>
<span class="normal"> 9</span>
<span class="normal">10</span>
<span class="normal">11</span>
<span class="normal">12</span>
<span class="normal">13</span>
<span class="normal">14</span>
<span class="normal">15</span>
<span class="normal">16</span>
<span class="normal">17</span>
<span class="normal">18</span>
<span class="normal">19</span>
<span class="normal">20</span>
<span class="normal">21</span>
<span class="normal">22</span>
<span class="normal">23</span>
<span class="normal">24</span>
<span class="normal">25</span>
<span class="normal">26</span>
<span class="normal">27</span>
<span class="normal">28</span>
<span class="normal">29</span></pre></div></td><td class="code"><div><pre><span></span><code><span class="kn">import</span> <span class="nn">asyncio</span>
<span class="kn">from</span> <span class="nn">aiolimiter</span> <span class="kn">import</span> <span class="n">AsyncLimiter</span>
<span class="kn">from</span> <span class="nn">time</span> <span class="kn">import</span> <span class="n">time</span>
<span class="kn">from</span> <span class="nn">httpx</span> <span class="kn">import</span> <span class="n">AsyncClient</span>
<span class="k">async</span> <span class="k">def</span> <span class="nf">scrape</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">session</span><span class="p">,</span> <span class="n">throttler</span><span class="p">):</span>
<span class="k">async</span> <span class="k">with</span> <span class="n">throttler</span><span class="p">:</span>
<span class="k">return</span> <span class="k">await</span> <span class="n">session</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
<span class="k">async</span> <span class="k">def</span> <span class="nf">run</span><span class="p">():</span>
<span class="n">_start</span> <span class="o">=</span> <span class="n">time</span><span class="p">()</span>
<span class="n">throttler</span> <span class="o">=</span> <span class="n">AsyncLimiter</span><span class="p">(</span><span class="n">max_rate</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">time_period</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span> <span class="c1"># 10 tasks/second</span>
<span class="k">async</span> <span class="k">with</span> <span class="n">AsyncClient</span><span class="p">()</span> <span class="k">as</span> <span class="n">session</span><span class="p">:</span>
<span class="n">tasks</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">100</span><span class="p">):</span>
<span class="c1"># this url will always take 1 second</span>
<span class="n">url</span> <span class="o">=</span> <span class="s2">"http://httpbin.org/delay/1"</span>
<span class="n">tasks</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">scrape</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">session</span><span class="o">=</span><span class="n">session</span><span class="p">,</span> <span class="n">throttler</span><span class="o">=</span><span class="n">throttler</span><span class="p">))</span>
<span class="n">results</span> <span class="o">=</span> <span class="k">await</span> <span class="n">asyncio</span><span class="o">.</span><span class="n">gather</span><span class="p">(</span><span class="o">*</span><span class="n">tasks</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"finished scraping in: </span><span class="si">{</span><span class="n">time</span><span class="p">()</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">_start</span><span class="si">:</span><span class="s2">.1f</span><span class="si">}</span><span class="s2"> seconds"</span><span class="p">)</span>
<span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s2">"__main__"</span><span class="p">:</span>
<span class="n">asyncio</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="n">run</span><span class="p">())</span>
<span class="c1"># results in:</span>
<span class="c1"># finished scraping in: 10.4 seconds</span>
</code></pre></div></td></tr></table></div>
<p>Here we merely replaced the <code>Semaphore</code> from the previous code with <code>aiolimiter.AsyncLimiter</code>, which allows us to be much more precise: we completed our 100 scrape tasks in 10.4 seconds, matching our limit of 10 requests/second almost exactly!</p>
<p>Leaky Bucket is a great throttling approach as it allows us to scrape at predictable, controlled speeds. That being said, predictability is not a great web scraper feature when it comes to blocking and banning - a scraper can be identified quite easily if it shows a consistent connection pattern.<br/>
So which throttling pattern to use depends entirely on the scraped targets, but Leaky Bucket is often the best approach!</p>
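<p>One common mitigation for that predictability is to add random jitter to each task's timing so requests don't form a perfectly regular rhythm. Here's a hypothetical sketch (the <code>do_request</code> coroutine is a stand-in for an actual HTTP call, not part of any library):</p>

```python
import asyncio
import random

async def jittered(coro_fn, *args, max_jitter=0.2):
    # delay each task by a random amount to break up the regular pattern
    await asyncio.sleep(random.uniform(0, max_jitter))
    return await coro_fn(*args)

async def do_request(url):
    await asyncio.sleep(0.05)  # stand-in for a network call
    return url

async def run():
    urls = [f"https://example.com/page/{i}" for i in range(5)]
    return await asyncio.gather(*[jittered(do_request, u) for u in urls])

results = asyncio.run(run())
print(results)
```

<p>This wrapper composes with either throttling approach: the limiter caps the average rate while the jitter scrambles the exact timing.</p>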
<h3 id="how-much-to-throttle">How Much To Throttle?</h3>
<p>Finally, with throttling figured out, we need to decide on the actual number. Unfortunately, there are no standard accepted practices other than that it's respectful to keep the connection count low so as not to cause issues for web servers. </p>
<p>Staying in the <strong>10-30 requests/second</strong> range (depending on page and website size) is widely considered a respectful scraping rate - so that's a good starting point. Other than that, it's a good idea to grab some proxies and experiment!</p>
<h2 id="summary-alternatives-and-further-reading_1">Summary, Alternatives and Further Reading</h2>
<p>In this introductory article, we've taken a look at how we can use asynchronous python to speed up our web-scrapers - to the point where we had to figure out how to throttle ourselves! For throttling, we've discovered the Semaphore and Leaky Bucket approaches and how we can apply them in web-scraping. </p>
<p>Asynchronous programming can often be difficult, complex and even ugly. However, when it comes to IO-heavy tasks like the web connections in web-scraping, it is often unavoidable - so it's better to get comfortable with it! </p>
<hr/>
<p>As for further reading and alternatives?</p>
<p>Asynchronous programming is not a new concept by any means, but it's still evolving in many different directions. In this article we focused on Python's async/await approach; however, there are other paradigms that are also used in web-scraping.</p>
<p>One of the biggest web-scraping frameworks, <a href="https://scrapy.org">Scrapy</a>, uses the <a href="https://duckduckgo.com/?q=twisted+python&ia=web">Twisted</a> callback-based async engine, which allows running functions in the background and calling a result callback once they complete. Callbacks and Futures are also often used in Javascript and can be used in Python's asyncio as well! </p>
<p>Other than that, the <a href="https://github.com/celery/celery">Celery</a> task engine is also often used to execute scraping tasks concurrently. <br/>
<a href="http://www.gevent.org/">Gevent</a> is yet another popular asynchronous library that integrates well with the web-scraping ecosystem. </p>
<p>On this blog we'll cover more examples of these alternative approaches in the future; however, do not underestimate python's <code>asyncio</code> - as we've seen in this article, it's extremely fast, surprisingly accessible and, once understood, can be an absolute pleasure to work with!</p>
<hr/>
<p>So stick around for more articles and if you have any questions, come join us on <a href="https://matrix.to/#/%23web-scraping:matrix.org">#web-scraping on matrix</a>, check out <a href="https://stackoverflow.com/questions/tagged/web-scraping">#web-scraping on stackoverflow</a> or leave a comment below! </p>
<p>As always, you can hire me for web-scraping consultation over at the <a href="/pages/hire.html">hire</a> page and happy scraping! </p>Introduction To Reverse Engineering The Web2021-10-08T00:00:00+02:002021-10-08T00:00:00+02:00Bernardas Ališauskastag:scrapecrow.com,2021-10-08:/reverse-engineering-intro.html<p>To efficiently scrape a web resource, understanding how it works and functions is often a vital step. Reverse engineering a website's behavior is often the first step when developing a web-scraper - let's take a look at how!</p><p><a href="/images/banner-machines.png"><img class="fullc" loading="lazy" src="/images/banner-machines.png" title=""/></a><figcaption></figcaption></p>
<p>Unfortunately for web-scrapers, modern websites are much more than html files strung together and served over the http protocol. Often websites are complex structures of multiple execution layers and file formats that are designed specifically for some of the most complex programs in the world - web browsers. </p>
<p>So how do we scrape complex websites without using web-browsers? We can reverse engineer a website's behavior and replicate it in our code!</p>
<p>In this article we'll cover a short introduction to using a web browser's developer tools to reverse engineer a website's behavior. We'll be using <a href="https://food.com">https://food.com</a> as our example and cover some useful tips and tricks. </p>
<p class="info">In this article we'll be using the <code>Chrome</code> web browser. That being said, Chrome-based browsers (like Brave, Electron, Qutebrowser etc.) use the same tools and should function the same. However, Firefox and its browser family use slightly different tools.</p>
<h2 id="browser-developer-tools">Browser Developer Tools</h2>
<p>Fortunately, modern web browsers come with great debugging tools referred to as "Developer Tools". For this article we'll take a look at the Chrome web browser. If you fire up Chrome and press <code>F12</code> (or right-click anywhere on the page and select <code>inspect</code>), the developer tools window will open up:</p>
<p><a href="/images/devtools.png"><img class="bigc" loading="lazy" src="/images/devtools.png" title=""/></a><figcaption></figcaption></p>
<p>As you can see, there's a lot going on here. Let's quickly go through these tools and see what they can do for us when it comes to web-scraping. First, let's take a look at the <strong>available tool tabs</strong>:</p>
<hr/>
<p><code>Elements</code> - this tab allows us to visually explore, search and investigate the html page structure.
<a href="/images/devtools_tab_elements.png"><img class="bigc" loading="lazy" src="/images/devtools_tab_elements.png" title="this tab is really useful for visualizing how page is structured"/></a><figcaption>this tab is really useful for visualizing how page is structured</figcaption></p>
<p><code>Console</code> - this tab functions like a real-time shell or a REPL. You can type javascript expressions here and they will be evaluated against the current page.<br/>
<a href="/images/devtools_tab_console.png"><img class="bigc" loading="lazy" src="/images/devtools_tab_console.png" title="this tool is great for reverse engineering javascript functionality of the page - imagine it as a debugger shell for website's code"/></a><figcaption>this tool is great for reverse engineering javascript functionality of the page - imagine it as a debugger shell for website's code</figcaption></p>
<p><code>Application</code> - contains various application data: from cookies to database entries. This is rarely used by websites but often used by various web-apps. For web-scraping, this tab is not commonly needed.
<a href="/images/devtools_tab_application.png"><img class="bigc" loading="lazy" src="/images/devtools_tab_application.png" title="most useful feature of this tab is often the clear all data button"/></a><figcaption>most useful feature of this tab is often the clear all data button</figcaption></p>
<p><code>Network</code> - probably the most interesting tab: it shows all of the network requests made by the browser. It's the most useful web-scraping tool of the bunch!<br/>
<a href="/images/devtools_tab_network.png"><img class="bigc" loading="lazy" src="/images/devtools_tab_network.png" title="we'll be spending most of our time in this tab!"/></a><figcaption>we'll be spending most of our time in this tab!</figcaption></p>
<hr/>
<p>As you can see, it's a huge suite of web tools! However, the most interesting one when it comes to reverse-engineering for web-scraping purposes has to be the Network tab. Let's take a look at how we can configure it for an optimal experience, along with some examples of how to use it.</p>
<h2 id="network-inspector">Network Inspector</h2>
<p>This browser tool shows us all the requests our browser is making when we're browsing the web. </p>
<p>First, let's take a look at the window itself. Specifically how to read it in the context of reverse-engineering for web-scraping:</p>
<p><a href="/images/devtools_tab_network_details.png"><img class="bigc" loading="lazy" src="/images/devtools_tab_network_details.png" title="This might look a bit different on different browsers but functionality should be the same!"/></a><figcaption>This might look a bit different on different browsers but functionality should be the same!</figcaption></p>
<p>There's a lot going on here, but don't get overwhelmed just yet. We only need to focus on these parts:</p>
<ol>
<li>Contains all requests your browser made to the website. You can click on each individual one to inspect it further (we'll dig into this more below). </li>
<li>Option flags that disable caching and stop data clearing on page load (these are very useful for reverse engineering)</li>
<li>Contains a powerful filtering system. For the most part we'll be using either:<ul>
<li><code>Doc</code> filter which shows all <code>html</code> document requests </li>
<li><code>XHR</code> filter which shows all data requests such as <code>json</code>.</li>
</ul>
</li>
<li>The clear button <code>⍉</code>, which clears current requests for easier tracking of what's going on. </li>
</ol>
<p>Further, we can take a look at an individual request itself and which parts of it are most useful for reverse-engineering. If you click on one of the requests, you should see something like: </p>
<p><a href="/images/devtools_tab_network_row_details.png"><img class="bigc" loading="lazy" src="/images/devtools_tab_network_row_details.png" title=""/></a><figcaption></figcaption></p>
<p>In this window we see several important fields of information:</p>
<ol>
<li>Basic request details<br/>
The most important details here are the URL and the request method.</li>
<li>
<p>Response headers<br/>
Rarely interesting, but can contain important metadata about the response the browser received from the website, such as:</p>
<ul>
<li><code>Set-Cookie</code> header<br/>
contains the cookies the website asks the browser to save</li>
<li><code>Content-Type</code> header<br/>
contains the type of response. Most common values are either <code>text/html</code> for html documents or <code>application/json</code> for json data.</li>
<li><code>X-</code> prefixed headers<br/>
these are non-standard headers that are often used for website functionality, tracking or anti-bot protection.</li>
</ul>
<p class="info">If you'd like to learn more about http headers see <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers">MDN's http header documentation</a></p>
</li>
<li>
<p>Request headers<br/>
These are the headers the browser sent with this request. Often we want to replicate most of these headers in our web-scraper <em>as closely as possible</em>. The most common and vital ones are:</p>
<ul>
<li><code>Content-Type</code> and <code>Accept-</code> prefixed headers<br/>
these are instructions for what sort of content is expected. Http client libraries (like <code>requests</code> for python) often fill them in automatically, so it's important to keep an eye on these as they might be generated differently than in our browser. </li>
<li><code>User-Agent</code><br/>
identifies who is making the request. This is a really important field for avoiding basic bot blocking. Usually we want to set this to a popular browser on a popular OS, like Chrome on Windows.</li>
</ul>
</li>
<li>
<p>Request Payload<br/>
This is mostly used when dealing with <code>POST</code> type requests. It shows what data the browser sends to the website - usually some sort of request parameters in json format. </p>
</li>
</ol>
<p>As you can see, the Network Inspector is a surprisingly powerful and extremely useful reverse engineering tool that shows us, in great detail, what connections our browser is making!</p>
<p>Now that we're somewhat familiar with it, let's take a look at common usage tips and several iconic web-scraping problems that can be solved by using this tool.</p>
<h2 id="tip-replicating-requests-in-python">Tip: Replicating Requests in Python</h2>
<p>There's an easy way to replicate requests seen in the Network Inspector in your python code. <br/>
If you right-click on a request, you can see that the Network Inspector allows exporting it in several formats: </p>
<p><a href="/images/devtools_tab_network_copy_curl.png"><img class="" loading="lazy" src="/images/devtools_tab_network_copy_curl.png" title=""/></a><figcaption></figcaption></p>
<p>While there's no "copy as python" button, there is a "copy as cURL" button which produces a <code>curl</code> command with all of the request details attached. Something like:</p>
<div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal"> 1</span>
<span class="normal"> 2</span>
<span class="normal"> 3</span>
<span class="normal"> 4</span>
<span class="normal"> 5</span>
<span class="normal"> 6</span>
<span class="normal"> 7</span>
<span class="normal"> 8</span>
<span class="normal"> 9</span>
<span class="normal">10</span>
<span class="normal">11</span>
<span class="normal">12</span>
<span class="normal">13</span>
<span class="normal">14</span>
<span class="normal">15</span>
<span class="normal">16</span></pre></div></td><td class="code"><div><pre><span></span><code><span class="nt">curl</span><span class="w"> </span><span class="s1">'https://api.food.com/external/v1/nlp/search'</span><span class="w"> </span><span class="err">\</span>
<span class="w"> </span><span class="nt">-H</span><span class="w"> </span><span class="s1">'Connection: keep-alive'</span><span class="w"> </span><span class="err">\</span>
<span class="w"> </span><span class="nt">-H</span><span class="w"> </span><span class="s1">'Pragma: no-cache'</span><span class="w"> </span><span class="err">\</span>
<span class="w"> </span><span class="nt">-H</span><span class="w"> </span><span class="s1">'Cache-Control: no-cache'</span><span class="w"> </span><span class="err">\</span>
<span class="w"> </span><span class="nt">-H</span><span class="w"> </span><span class="s1">'Accept: application/json, text/javascript, */*; q=0.01'</span><span class="w"> </span><span class="err">\</span>
<span class="w"> </span><span class="nt">-H</span><span class="w"> </span><span class="s1">'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) QtWebEngine/5.15.2 Chrome/87.0.4280.144 Safari/537.36'</span><span class="w"> </span><span class="err">\</span>
<span class="w"> </span><span class="nt">-H</span><span class="w"> </span><span class="s1">'Content-Type: application/json'</span><span class="w"> </span><span class="err">\</span>
<span class="w"> </span><span class="nt">-H</span><span class="w"> </span><span class="s1">'Accept-Language: en-US,en;q=0.9'</span><span class="w"> </span><span class="err">\</span>
<span class="w"> </span><span class="nt">-H</span><span class="w"> </span><span class="s1">'DNT: 1'</span><span class="w"> </span><span class="err">\</span>
<span class="w"> </span><span class="nt">-H</span><span class="w"> </span><span class="s1">'Origin: https://www.food.com'</span><span class="w"> </span><span class="err">\</span>
<span class="w"> </span><span class="nt">-H</span><span class="w"> </span><span class="s1">'Sec-Fetch-Site: same-site'</span><span class="w"> </span><span class="err">\</span>
<span class="w"> </span><span class="nt">-H</span><span class="w"> </span><span class="s1">'Sec-Fetch-Mode: cors'</span><span class="w"> </span><span class="err">\</span>
<span class="w"> </span><span class="nt">-H</span><span class="w"> </span><span class="s1">'Sec-Fetch-Dest: empty'</span><span class="w"> </span><span class="err">\</span>
<span class="w"> </span><span class="nt">-H</span><span class="w"> </span><span class="s1">'Referer: https://www.food.com/'</span><span class="w"> </span><span class="err">\</span>
<span class="w"> </span><span class="nt">--data-raw</span><span class="w"> </span><span class="s1">'{"contexts":</span><span class="cp">[]</span><span class="s1">,"searchTerm":"","pn":4}'</span><span class="w"> </span><span class="err">\</span>
<span class="w"> </span><span class="nt">--compressed</span>
</code></pre></div></td></tr></table></div>
<p>There are several tools that can convert this string into code! </p>
<ul>
<li><a href="https://curl.trillworks.com/">https://curl.trillworks.com/</a> - can convert it to multiple languages, python <code>requests</code> being one of them.</li>
<li><a href="https://github.com/spulec/uncurl">https://github.com/spulec/uncurl</a> - is a library that can convert it to python objects or full python <code>requests</code> code.</li>
</ul>
<p>Using either of these converter tools we can quickly prototype our web-scraper while reverse engineering our target.</p>
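<p>For example, the exported curl command above maps onto python's <code>requests</code> roughly as follows. This sketch only builds and inspects the request via <code>Request(...).prepare()</code> without actually sending it; the header values are copied from the curl export (most of the remaining headers were omitted for brevity):</p>

```python
import requests

# rebuild the exported curl command as a requests.Request object
request = requests.Request(
    method="POST",
    url="https://api.food.com/external/v1/nlp/search",
    headers={
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/87.0.4280.144 Safari/537.36",
        "Accept": "application/json, text/javascript, */*; q=0.01",
        "Origin": "https://www.food.com",
        "Referer": "https://www.food.com/",
    },
    json={"contexts": [], "searchTerm": "", "pn": 4},
)
prepared = request.prepare()  # serializes the headers and the json body
print(prepared.method, prepared.url)
print(prepared.headers["Content-Type"])  # filled in automatically for json=
# to actually send it: requests.Session().send(prepared)
```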
<h2 id="common-case-dynamic-javascript-pagination">Common Case: Dynamic Javascript Pagination</h2>
<p>One of the most commonly encountered web-scraping issues is dynamic content generation powered by javascript.
Modern websites often use javascript to generate web page content on the fly rather than redirecting users to a new page.</p>
<p>Most commonly this is observed in item pagination - instead of sending the user to page 2, its data is requested in the background and injected back into the document's body using javascript.<br/>
This is often referred to as <strong>never-ending or dynamic pagination</strong>.</p>
<p>Common identifiers of dynamic pagination:</p>
<ul>
<li>Instead of flipping pages, users just need to scroll down and more results are loaded</li>
<li>Clicking a page doesn't reload the current page, just the pagination part.</li>
<li>Pagination doesn't work with javascript disabled.</li>
</ul>
<h3 id="scraping-recipes-from-foodcom">Scraping Recipes from Food.com</h3>
<p>For example let's take a look at how <a href="https://food.com">https://food.com</a> does it in their recipe search:</p>
<p><video autoplay="" class="bigc" loop="" muted="" title="you can see the little spinning wheel turning when you scroll the page"><source src="/videos/endless_pagination.mp4" type="video/mp4"/></video><figcaption>you can see the little spinning wheel turning when you scroll the page</figcaption></p>
<p>As you can see, the content of this website loads dynamically every time the user scrolls the page.<br/>
This technique is especially common in Single Page Applications (SPAs), where the whole idea is that the user never needs to switch locations and content is dynamically replaced instead. </p>
<p class="info">For more information on Single Page Applications see <a href="https://developer.mozilla.org/en-US/docs/Glossary/SPA">MDN's documentation on SPA</a></p>
<p>Since our web-scraper is not a browser (unless we use browser emulation), it doesn't execute javascript. Meaning that, to access this dynamic content, we must reverse engineer this behavior so we can replicate it in our code. </p>
<p>Let's fire up devtools' Network Inspector and see what food.com does when we scroll down:</p>
<p><video autoplay="" class="bigc" loop="" muted="" title="XHR filter tab only shows data requests"><source src="/videos/endless_pagination_dev.mp4" type="video/mp4"/></video><figcaption>XHR filter tab only shows data requests</figcaption></p>
<p>We can see that as we continue scrolling, <code>search</code> requests are being made. Each one actually contains the data for a whole page of recipes which, when received, is injected into the html page by javascript.<br/>
Let's take a look at these requests and how we can replicate them in our web-scraper:</p>
<p><a href="/images/foodcom_pagination.png"><img class="bigc" loading="lazy" src="/images/foodcom_pagination.png" title=""/></a><figcaption></figcaption></p>
<p>Here we can see that the request being made is a <code>POST</code> type request to <a href="https://api.food.com/external/v1/nlp/search">https://api.food.com/external/v1/nlp/search</a> and it's sending some JSON data. In return it receives a json document with 10 recipes and loads of meta information - like how many pages there are in total. That's exactly what we're looking for!</p>
<p>Let's take a look at the document we need to send to receive this information.<br/>
Under "Request Payload" we see a json document:</p>
<div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal">1</span>
<span class="normal">2</span>
<span class="normal">3</span>
<span class="normal">4</span>
<span class="normal">5</span>
<span class="normal">6</span>
<span class="normal">7</span></pre></div></td><td class="code"><div><pre><span></span><code><span class="p">{</span>
<span class="w"> </span><span class="nt">"contexts"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span>
<span class="w"> </span><span class="s2">"{\"name\":\"toast\",\"paths\":[\"/~asset/bread\"],\"type\":\"PRIMARY\",\"searchType\":\"NORMAL\",\"degreesSeparation\":0,\"cleanedName\":\"toast\",\"popularityFactor\":0,\"taggedContentCount\":2054,\"userToken\":true,\"searchGuess\":false,\"essenceContext\":false,\"matchingCandidate\":false}"</span>
<span class="w"> </span><span class="p">],</span>
<span class="w"> </span><span class="nt">"searchTerm"</span><span class="p">:</span><span class="w"> </span><span class="s2">"toast"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"pn"</span><span class="p">:</span><span class="w"> </span><span class="mi">14</span>
<span class="p">}</span>
</code></pre></div></td></tr></table></div>
<p>Some context data is being sent, along with the search term <code>"toast"</code> and a <code>pn</code> integer argument which seems to be short for <code>page number</code>. Great, that means we can request any page for any search term! <br/>
Let's replicate this request in Python:</p>
<div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal"> 1</span>
<span class="normal"> 2</span>
<span class="normal"> 3</span>
<span class="normal"> 4</span>
<span class="normal"> 5</span>
<span class="normal"> 6</span>
<span class="normal"> 7</span>
<span class="normal"> 8</span>
<span class="normal"> 9</span>
<span class="normal">10</span>
<span class="normal">11</span>
<span class="normal">12</span>
<span class="normal">13</span>
<span class="normal">14</span>
<span class="normal">15</span>
<span class="normal">16</span>
<span class="normal">17</span>
<span class="normal">18</span>
<span class="normal">19</span>
<span class="normal">20</span>
<span class="normal">21</span>
<span class="normal">22</span></pre></div></td><td class="code"><div><pre><span></span><code><span class="kn">import</span> <span class="nn">requests</span>
<span class="n">headers</span> <span class="o">=</span> <span class="p">{</span>
<span class="c1"># when web scraping we always want to appear as </span>
<span class="c1"># a web browser to prevent being blocked</span>
<span class="s2">"User-Agent"</span><span class="p">:</span> <span class="s2">"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36"</span>
<span class="p">}</span>
<span class="n">data</span> <span class="o">=</span> <span class="p">{</span>
<span class="c1"># our recipe search term</span>
<span class="s2">"searchTerm"</span><span class="p">:</span> <span class="s2">"Toast"</span><span class="p">,</span>
<span class="c1"># page number</span>
<span class="s2">"pn"</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span>
<span class="p">}</span>
<span class="n">url</span> <span class="o">=</span> <span class="s2">"https://api.food.com/external/v1/nlp/search"</span>
<span class="n">response</span> <span class="o">=</span> <span class="n">requests</span><span class="o">.</span><span class="n">post</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">json</span><span class="o">=</span><span class="n">data</span><span class="p">,</span> <span class="n">headers</span><span class="o">=</span><span class="n">headers</span><span class="p">)</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">response</span><span class="o">.</span><span class="n">json</span><span class="p">()</span>
<span class="n">results</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="s2">"response"</span><span class="p">][</span><span class="s2">"results"</span><span class="p">]</span>
<span class="n">total_results_count</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="s2">"response"</span><span class="p">][</span><span class="s2">"totalResultsCount"</span><span class="p">]</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"found </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">results</span><span class="p">)</span><span class="si">}</span><span class="s2"> results from </span><span class="si">{</span><span class="n">total_results_count</span><span class="si">}</span><span class="s2"> total"</span><span class="p">)</span>
<span class="c1"># this will print: "found 10 results from 2246 total"</span>
</code></pre></div></td></tr></table></div>
<p>It works! We've successfully reverse engineered food.com's backend API for its paging function and replicated it in this Python scraper script! </p>
<p>If you noticed, we skipped the <code>contexts</code> object in our POST body entirely. The great part about reverse engineering something is that we can adjust its functions and end up with clean, efficient, beautiful programs. This is great for us, as our scraper uses fewer resources and is easier to follow and maintain, and it's also better for our target - food.com - as we consume fewer of their resources by scraping only specific data targets rather than loading the whole page and all the extras such as images, videos etc.</p>
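<p>Since the only moving part of the request above is the <code>pn</code> field, paging over the full result set reduces to looping that one value. Here's a minimal sketch (the helper names are ours, not food.com's; error handling and rate limiting are omitted, and the <code>requests</code> package is assumed to be installed):</p>

```python
import requests  # same client library as the example above


def search_payload(term: str, page: int) -> dict:
    """Build the minimal POST body for the search endpoint (pn = page number)."""
    return {"searchTerm": term, "pn": page}


def scrape_all_pages(term: str):
    """Yield search results page by page until totalResultsCount is exhausted."""
    url = "https://api.food.com/external/v1/nlp/search"
    page, seen = 1, 0
    while True:
        data = requests.post(url, json=search_payload(term, page)).json()
        results = data["response"]["results"]
        yield from results
        seen += len(results)
        # stop when the API returns an empty page or we've seen every result
        if not results or seen >= data["response"]["totalResultsCount"]:
            return
        page += 1
```

Since the generator requests pages lazily, we can stop consuming it early without wasting requests on pages we never read.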
<h2 id="summary-and-further-reading_1">Summary And Further Reading</h2>
<p>In this short introduction article we covered what browser developer tools are and how we can use them to understand the basic workings of javascript-driven websites. We covered an example case of how <a href="https://food.com">https://food.com</a> uses javascript for its never-ending pagination and how we can replicate it in Python using the <code>requests</code> package. </p>
<p>The reverse engineering story doesn't end here - the web is becoming more complex by the day, and the browser's devtools only scratch the surface of what we can learn about websites. In the future we'll cover advanced reverse engineering topics such as man-in-the-middle monitoring programs like <a href="https://mitmproxy.org/">mitmproxy</a>, <a href="https://docs.telerik.com/fiddler-everywhere/introduction">fiddler</a> and <a href="https://www.wireshark.org/">wireshark</a>, which allow even more detailed inspection as well as various quality-of-life tools such as request interception and scripting. These tools let us reverse engineer not only websites but also desktop and mobile applications!</p>
<hr/>
<p>So stick around for more articles and if you have any questions, come join us on <a href="https://matrix.to/#/%23web-scraping:matrix.org">#web-scraping on matrix</a>, check out <a href="https://stackoverflow.com/questions/tagged/web-scraping">#web-scraping on stackoverflow</a> or leave a comment below! </p>
<p>As always, you can hire me for web-scraping consultation over at the <a href="/pages/hire.html">hire</a> page - and happy scraping! </p>
<hr/>
<figcaption>Banner image by: "ENIGMA cipher machine collection" by brewbooks is licensed under CC BY-SA 2.0</figcaption>Web Scraping Target Discovery2021-09-29T00:00:00+02:002021-09-29T00:00:00+02:00Bernardas Ališauskastag:scrapecrow.com,2021-09-29:/web-scraping-discovery.html<p>Target discovery in web-scraping is how the scraper explores the target website to find scraping targets. For example, to scrape product data of an e-commerce website we would need to find urls to each individual product. This step is called "discovery". What types of discovery methods are there?</p><p><a href="/images/banner-telescopes.png"><img class="fullc" loading="lazy" src="/images/banner-telescopes.png" title=""/></a><figcaption></figcaption></p>
<p>Most web scrapers are made up of two core parts: finding products on the website and actually scraping them. The former is often referred to as the "target discovery" step. For example, to scrape product data of an e-commerce website we would first need to find urls to each individual product - only then can we scrape their data.</p>
<p>Discovering targets to scrape is often a challenging and important task. In this blog series tagged <a href="/tag/discovery-methods.html">#discovery-methods</a> we'll take a look at the common discovery methods used in web-scraping, each different enough to have its own risks, drawbacks and benefits. We'll target an example clothing store website <a href="https://hm.com">https://hm.com</a> for all of these discovery approaches:</p>
<ul>
<li><a href="/web-scraping-discovery-sitemaps.html">Sitemaps</a> - using the website's sitemap indexes.</li>
<li><a href="/web-scraping-discovery-search.html">Search API</a> - reverse engineering the website's search api.</li>
<li><a href="/web-scraping-discovery-indexes.html">Indexes</a> - taking advantage of existing indexes and search engines.</li>
<li><a href="/web-scraping-discovery-crawling.html">Crawling</a> - recursively scraping the whole website to find what we're looking for.</li>
</ul>
<p>These are the 4 main approaches to web scraping target discovery, and knowing them is a great tool in a web scraper developer's utility belt.<br/>
To quickly summarize our 4 main articles, the discovery strategies vary quite a bit:</p>
<ul>
<li><strong>Sitemaps</strong> are probably the best approach as they're fast, safe and easy to implement; unfortunately, many websites' sitemaps are neglected or contain outdated links. </li>
<li>Reverse engineering a website's <strong>search api</strong>, on the other hand, is both efficient and yields great results; however it requires reverse-engineering knowledge and can be difficult and time consuming to implement. </li>
<li><strong>Crawling</strong> is a great general approach but it's risky, slow and resource intensive. </li>
<li>Finally, taking advantage of <strong>existing indexes</strong> is a great last resort for websites that don't like to be scraped, as they still want to be indexed by search engines and other indexers.</li>
</ul>
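<p>To give a taste of the simplest of these, the sitemap approach often reduces discovery to downloading <code>/sitemap.xml</code> and collecting its <code>loc</code> entries. A minimal sketch of the extraction step (the urls here are illustrative; real sitemaps are frequently nested indexes pointing at further sitemap files):</p>

```python
import re


def extract_sitemap_urls(sitemap_xml: str) -> list:
    """Collect all <loc> entries from a sitemap or sitemap index document."""
    return re.findall(r"<loc>(.*?)</loc>", sitemap_xml)


# illustrative sitemap snippet following hm.com's product url pattern
example = """
<urlset>
  <url><loc>https://www2.hm.com/en_us/productpage.09008.html</loc></url>
  <url><loc>https://www2.hm.com/en_us/productpage.09009.html</loc></url>
</urlset>
"""
print(extract_sitemap_urls(example))
```

A proper implementation would use an XML parser rather than a regex, but for the well-formed XML sitemaps are required to be, this shortcut tends to hold up.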
<p>So which one should you use? It really depends on your target and resources. <br/>
Hopefully this extensive blog series can help you determine the right way to find your data targets!</p>
<hr/>
<p>For more web-scraping discovery techniques, see <a href="/tag/discovery-methods.html">#discovery-methods</a>, and <a href="/tag/discovery.html">#discovery</a> for related subjects. </p>
<p>If you have any questions, come join us on <a href="https://matrix.to/#/%23web-scraping:matrix.org">#web-scraping on matrix</a>, check out <a href="https://stackoverflow.com/questions/tagged/web-scraping">#web-scraping on stackoverflow</a> or leave a comment below! </p>
<p>As always, you can hire me for web-scraping consultation over at the <a href="/pages/hire.html">hire</a> page - and happy scraping! </p>
<hr/>
<figcaption>image credits: "Telescopes, Mauna Kea, Hawaii" by Gregory Williams is licensed under CC BY-NC-ND 2.0</figcaption>Web Scraping Target Discovery: Crawling2021-09-28T00:00:00+02:002021-09-28T00:00:00+02:00Bernardas Ališauskastag:scrapecrow.com,2021-09-28:/web-scraping-discovery-crawling.html<p>The most common web scraping target discovery technique: recursive crawling. How does it work? What are the pros and cons and the most optimal execution patterns?</p><p><a href="/images/banner-web.jpg"><img class="fullc" loading="lazy" src="/images/banner-web.jpg" title=""/></a><figcaption></figcaption></p>
<p>Most web scrapers are made up of two core parts: finding products on the website and actually scraping them. The former is often referred to as the "target discovery" step. For example, to scrape product data of an e-commerce website we would first need to find urls to each individual product - only then can we scrape their data.</p>
<p>Discovering scrape targets is often a challenging and important task. This series of blog posts tagged with <a href="/tag/discovery-methods.html">#discovery-methods</a> (also see the <a href="/web-scraping-discovery.html">main article</a>) covers common target discovery approaches.</p>
<p>In this article we'll take a look at web crawling and how we can use it as a discovery strategy in web scraping. </p>
<h2 id="what-is-recursive-crawling-and-how-is-it-used-in-web-scraping">What is recursive crawling and how is it used in web-scraping?</h2>
<p>One of the most common ways to discover web scraping targets is to recursively crawl the website. This technique is usually used by broad scrapers (scrapers that target many different websites) and index crawlers such as Google and other search engine bots.<br/>
In short, crawling is a recursive scraping technique where, given a start url and some crawling rules, the scraper continues exploring the website by visiting <em>all'ish</em> of the links present on it. </p>
<p>To wrap our heads around the crawling concept, let's refer to this small flow chart:</p>
<p><a href="/images/crawl-flow.png"><img class="" loading="lazy" src="/images/crawl-flow.png" title=""/></a><figcaption></figcaption></p>
<p>This flow chart illustrates the simplest domain-bound crawl spider flow: the crawler is given a starting point, which it scrapes and parses for urls present in the html body. It then applies matching rules to the urls and determines whether to save them (for scraping later) or to follow them and repeat the whole process.</p>
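<p>The same flow can be sketched without any http at all - here the "website" is just a dict mapping pages to the links found on them, and the save/follow rules are plain predicates (all names here are illustrative, not part of any real site):</p>

```python
def crawl(site: dict, start: str, should_save, should_follow) -> set:
    """Breadth-first crawl over an in-memory link graph."""
    saved, seen, queue = set(), {start}, [start]
    while queue:
        page = queue.pop(0)
        for url in site.get(page, []):
            if url in seen:
                continue  # duplicate filter: never revisit a url
            seen.add(url)
            if should_save(url):
                saved.add(url)  # a scrape target - store it, don't follow it
            elif should_follow(url):
                queue.append(url)  # a navigation page - explore it later
    return saved


# toy link graph standing in for a real website
site = {
    "/index.html": ["/women.html", "/productpage.1.html"],
    "/women.html": ["/productpage.2.html", "/index.html"],
}
found = crawl(
    site, "/index.html",
    should_save=lambda u: "/productpage." in u,
    should_follow=lambda u: u.endswith(".html"),
)
print(found)  # both product pages
```

Swapping the dict lookup for an http request plus link extraction gives exactly the crawler we'll build below.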
<p>Before using crawling as a web scraping discovery strategy, it's good practice to reflect on the common pros and cons of this technique and see whether it fits your web-scraping project:</p>
<p>Pros: </p>
<ul>
<li><strong>Generic Algorithm</strong>: can be applied to any website with few adjustments. In other words one web scraper can be adapted to any website quite easily. </li>
<li><strong>Good Coverage</strong>: some websites (like e-commerce) are well interlinked thus crawling will have great discovery coverage. </li>
<li><strong>Easy to Develop</strong>: no reverse-engineering skills are required since we're just following the natural website structure. </li>
</ul>
<p>Cons: </p>
<ul>
<li><strong>Inefficient and Slow</strong>: since crawling is a very generic solution it comes with a lot of inefficiencies. Often extracted pages don't contain any product links, so lots of crawl branches end up as dead ends. </li>
<li><strong>Insufficient Coverage</strong>: some websites are not well interlinked (sometimes purposefully to prevent web scrapers). Crawlers can't discover items that are not referenced anywhere. </li>
<li><strong>Risk</strong>: since crawling requests far more pages than other discovery approaches, the scraper's IPs are more likely to be throttled or blocked. </li>
<li><strong>Struggles With Javascript-Heavy Websites</strong>: since crawling is very generic and web scrapers don't execute javascript (unless using browser emulation), some websites might be too complex for a web scraper to follow. </li>
</ul>
<p>We can see that crawling is a smart, generic way to discover scrape targets, however it's not without its faults: it's slower, less accurate and might be hard to accomplish on some javascript-heavy websites.<br/>
Let's take a look at an example target discovery implementation that uses web crawling.</p>
<h2 id="example-use-case-hmcom">Example Use Case: hm.com</h2>
<p>Let's take a look at a popular clothing e-commerce website: <a href="https://hm.com">https://hm.com</a>. We'll be using the crawling approach to find all clothing products on the website. </p>
<p>First, let's establish the essential parts that make up a web crawler:</p>
<ol>
<li>Link extractor - a function/object that can find urls in html body.</li>
<li>Defined link pattern rules - a function/object that determines how to handle extracted links.</li>
<li>Duplicate filter - object that keeps track of links scraper visited.</li>
<li>Limiter - since crawling visits many urls, we need to limit the connection rate so as not to overwhelm the website. </li>
</ol>
<p>These are the 4 components that make up a basic web crawler. Let's see how we can implement them for hm.com.</p>
<h3 id="crawling-rules">Crawling Rules</h3>
<p>First, let's establish our crawling rules. As per the flowchart above, our crawler needs to know which urls to follow up and which to save:</p>
<div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal">1</span>
<span class="normal">2</span>
<span class="normal">3</span>
<span class="normal">4</span>
<span class="normal">5</span>
<span class="normal">6</span></pre></div></td><td class="code"><div><pre><span></span><code><span class="kn">import</span> <span class="nn">re</span>
<span class="k">class</span> <span class="nc">HMScraper</span><span class="p">:</span>
    <span class="n">save_urls</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s2">r"/productpage\."</span><span class="p">)</span> <span class="c1"># e.g ...com/en_us/productpage.09008.html</span>
    <span class="n">follow_urls</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s2">r"\.html"</span><span class="p">)</span>
    <span class="n">follow_saved_urls</span> <span class="o">=</span> <span class="kc">False</span>
</code></pre></div></td></tr></table></div>
<p>Here we defined our crawling rules:</p>
<ul>
<li>We want to save all urls that contain <code>/productpage.</code>, as all hm.com product urls follow this pattern. </li>
<li>We want to follow up any url containing <code>.html</code> </li>
<li>
<p>Do not follow urls that are being saved. </p>
<p class="info">Following saved urls can be useful, as product pages often contain "related products" urls which can help increase discovery coverage. For the hm.com domain this is unnecessary.<br/></p>
</li>
</ul>
<p>These are the 3 rules that define our crawler's routine for the domain <code>hm.com</code>. With that ready, let's take a look at how we can create a link extractor function that uses these rules to extract crawl targets.</p>
<h3 id="crawl-loop">Crawl Loop</h3>
<p>With the crawling rules defined, we need to create a crawl loop that uses them to schedule the whole crawl process. <br/>
In this example we'll be using the <code>httpx</code> python package for http processing and <code>parsel</code> for html parsing. With these two tools we can define a basic crawler skeleton:</p>
<div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal"> 1</span>
<span class="normal"> 2</span>
<span class="normal"> 3</span>
<span class="normal"> 4</span>
<span class="normal"> 5</span>
<span class="normal"> 6</span>
<span class="normal"> 7</span>
<span class="normal"> 8</span>
<span class="normal"> 9</span>
<span class="normal">10</span>
<span class="normal">11</span>
<span class="normal">12</span>
<span class="normal">13</span>
<span class="normal">14</span>
<span class="normal">15</span>
<span class="normal">16</span>
<span class="normal">17</span>
<span class="normal">18</span>
<span class="normal">19</span>
<span class="normal">20</span>
<span class="normal">21</span>
<span class="normal">22</span>
<span class="normal">23</span>
<span class="normal">24</span>
<span class="normal">25</span>
<span class="normal">26</span>
<span class="normal">27</span>
<span class="normal">28</span>
<span class="normal">29</span>
<span class="normal">30</span>
<span class="normal">31</span>
<span class="normal">32</span>
<span class="normal">33</span>
<span class="normal">34</span>
<span class="normal">35</span>
<span class="normal">36</span>
<span class="normal">37</span>
<span class="normal">38</span>
<span class="normal">39</span>
<span class="normal">40</span>
<span class="normal">41</span>
<span class="normal">42</span>
<span class="normal">43</span>
<span class="normal">44</span>
<span class="normal">45</span>
<span class="normal">46</span>
<span class="normal">47</span>
<span class="normal">48</span>
<span class="normal">49</span>
<span class="normal">50</span>
<span class="normal">51</span>
<span class="normal">52</span></pre></div></td><td class="code"><div><pre><span></span><code><span class="kn">import</span> <span class="nn">logging</span>
<span class="kn">import</span> <span class="nn">asyncio</span>
<span class="kn">import</span> <span class="nn">re</span>
<span class="kn">from</span> <span class="nn">urllib.parse</span> <span class="kn">import</span> <span class="n">urlparse</span>
<span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">List</span>
<span class="kn">from</span> <span class="nn">parsel</span> <span class="kn">import</span> <span class="n">Selector</span>
<span class="kn">from</span> <span class="nn">httpx</span> <span class="kn">import</span> <span class="n">AsyncClient</span>
<span class="k">class</span> <span class="nc">HMScraper</span><span class="p">:</span>
<span class="n">save_urls</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s2">"/productpage\."</span><span class="p">)</span> <span class="c1"># e.g ...com/en_us/productpage.09008.html</span>
<span class="n">follow_urls</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s2">"\.html"</span><span class="p">)</span>
<span class="n">follow_saved_urls</span> <span class="o">=</span> <span class="kc">False</span>
<span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">limit</span><span class="o">=</span><span class="mi">5</span><span class="p">):</span>
<span class="c1"># asyncio.Semaphore object allows us to limit coroutine concurrency </span>
<span class="c1"># in our case we can limit how many concurrent requests are being made</span>
<span class="bp">self</span><span class="o">.</span><span class="n">limiter</span> <span class="o">=</span> <span class="n">asyncio</span><span class="o">.</span><span class="n">Semaphore</span><span class="p">(</span><span class="n">limit</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">seen_urls</span> <span class="o">=</span> <span class="nb">set</span><span class="p">()</span>
<span class="k">async</span> <span class="k">def</span> <span class="fm">__aenter__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="w"> </span><span class="sd">"""on scraper creation open http session"""</span>
<span class="bp">self</span><span class="o">.</span><span class="n">session</span> <span class="o">=</span> <span class="n">AsyncClient</span><span class="p">(</span>
<span class="c1"># we should use a browser-like user agent header to avoid being blocked</span>
<span class="n">headers</span><span class="o">=</span><span class="p">{</span>
<span class="s2">"User-Agent"</span><span class="p">:</span> <span class="s2">"Mozilla/5.0 (X11; Linux x86_64; rv:92.0) Gecko/20100101 Firefox/92.0"</span>
<span class="p">}</span>
<span class="p">)</span>
<span class="k">return</span> <span class="bp">self</span>
<span class="k">async</span> <span class="k">def</span> <span class="fm">__aexit__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="o">*</span><span class="n">args</span><span class="p">):</span>
<span class="w"> </span><span class="sd">"""on scraper destruction close http session"""</span>
<span class="k">await</span> <span class="bp">self</span><span class="o">.</span><span class="n">session</span><span class="o">.</span><span class="n">aclose</span><span class="p">()</span>
<span class="k">async</span> <span class="k">def</span> <span class="nf">_request</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">url</span><span class="p">:</span> <span class="nb">str</span><span class="p">):</span>
<span class="w"> </span><span class="sd">"""our http request wrapper function that implements rate limiting"""</span>
<span class="k">async</span> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">limiter</span><span class="p">:</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">resp</span> <span class="o">=</span> <span class="k">await</span> <span class="bp">self</span><span class="o">.</span><span class="n">session</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
<span class="k">except</span> <span class="ne">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
<span class="k">raise</span>
<span class="k">return</span> <span class="n">resp</span>
<span class="k">async</span> <span class="k">def</span> <span class="nf">save</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">url</span><span class="p">):</span>
<span class="c1"># for display purposes lets just print the url</span>
<span class="nb">print</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
<span class="k">async</span> <span class="k">def</span> <span class="nf">scrape</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="o">...</span>
<span class="k">def</span> <span class="nf">find_links</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">resp</span><span class="p">):</span>
<span class="o">...</span>
</code></pre></div></td></tr></table></div>
<p>With this skeleton, we have a basic usage API for our scraper. We can now define our run function:</p>
<div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal"> 1</span>
<span class="normal"> 2</span>
<span class="normal"> 3</span>
<span class="normal"> 4</span>
<span class="normal"> 5</span>
<span class="normal"> 6</span>
<span class="normal"> 7</span>
<span class="normal"> 8</span>
<span class="normal"> 9</span>
<span class="normal">10</span></pre></div></td><td class="code"><div><pre><span></span><code><span class="k">async</span> <span class="k">def</span> <span class="nf">run</span><span class="p">():</span>
    <span class="k">async</span> <span class="k">with</span> <span class="n">HMScraper</span><span class="p">()</span> <span class="k">as</span> <span class="n">scraper</span><span class="p">:</span>
        <span class="n">start_urls</span> <span class="o">=</span> <span class="p">[</span>
            <span class="c1"># homepage for US website</span>
            <span class="s2">"https://www2.hm.com/en_us/index.html"</span>
        <span class="p">]</span>
        <span class="k">await</span> <span class="n">scraper</span><span class="o">.</span><span class="n">scrape</span><span class="p">(</span><span class="n">start_urls</span><span class="p">)</span>
<span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s2">"__main__"</span><span class="p">:</span>
<span class="n">asyncio</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="n">run</span><span class="p">())</span>
</code></pre></div></td></tr></table></div>
<p>Great! Now all we have to do is fill in the interesting bits: link extraction and the scrape loop.<br/>
For the scrape loop, all we need to do is request urls, find links in them, and follow or save the ones that match our rules:</p>
<div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal"> 1</span>
<span class="normal"> 2</span>
<span class="normal"> 3</span>
<span class="normal"> 4</span>
<span class="normal"> 5</span>
<span class="normal"> 6</span>
<span class="normal"> 7</span>
<span class="normal"> 8</span>
<span class="normal"> 9</span>
<span class="normal">10</span>
<span class="normal">11</span>
<span class="normal">12</span>
<span class="normal">13</span>
<span class="normal">14</span>
<span class="normal">15</span>
<span class="normal">16</span>
<span class="normal">17</span>
<span class="normal">18</span>
<span class="normal">19</span>
<span class="normal">20</span>
<span class="normal">21</span>
<span class="normal">22</span>
<span class="normal">23</span>
<span class="normal">24</span>
<span class="normal">25</span></pre></div></td><td class="code"><div><pre><span></span><code> <span class="k">async</span> <span class="k">def</span> <span class="nf">scrape</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">urls</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="nb">str</span><span class="p">]):</span>
<span class="w"> </span><span class="sd">"""Breadth first"""</span>
<span class="k">while</span> <span class="kc">True</span><span class="p">:</span>
<span class="n">to_follow</span> <span class="o">=</span> <span class="nb">set</span><span class="p">()</span>
<span class="k">for</span> <span class="n">resp</span> <span class="ow">in</span> <span class="n">asyncio</span><span class="o">.</span><span class="n">as_completed</span><span class="p">([</span><span class="bp">self</span><span class="o">.</span><span class="n">_request</span><span class="p">(</span><span class="n">url</span><span class="p">)</span> <span class="k">for</span> <span class="n">url</span> <span class="ow">in</span> <span class="n">urls</span><span class="p">]):</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">resp</span> <span class="o">=</span> <span class="k">await</span> <span class="n">resp</span>
<span class="c1"># skip failed requests; ideally this should be retried or logged</span>
<span class="k">except</span> <span class="ne">Exception</span><span class="p">:</span>
<span class="k">continue</span>
<span class="k">if</span> <span class="n">resp</span><span class="o">.</span><span class="n">status_code</span> <span class="o">!=</span> <span class="mi">200</span><span class="p">:</span>
<span class="k">continue</span>
<span class="k">for</span> <span class="n">url</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">find_links</span><span class="p">(</span><span class="n">resp</span><span class="p">):</span>
<span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="n">save_urls</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="n">url</span><span class="p">):</span>
<span class="k">await</span> <span class="bp">self</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
<span class="k">if</span> <span class="ow">not</span> <span class="bp">self</span><span class="o">.</span><span class="n">follow_saved_urls</span><span class="p">:</span>
<span class="k">continue</span>
<span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="n">follow_urls</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="n">url</span><span class="p">):</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">" following </span><span class="si">{</span><span class="n">url</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
<span class="n">to_follow</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
<span class="k">if</span> <span class="n">to_follow</span><span class="p">:</span>
<span class="n">urls</span> <span class="o">=</span> <span class="n">to_follow</span>
<span class="k">else</span><span class="p">:</span>
<span class="k">return</span> <span class="c1"># end of the crawl</span>
</code></pre></div></td></tr></table></div>
<p>Here we've defined an "endless" while loop that does exactly that: fetch htmls, parse them for urls, save some of them and follow up the others. The last remaining piece is our link extraction logic. </p>
<h3 id="link-extracting">Link Extracting</h3>
<p>The link extraction process is the core part of the crawler and can get quite complex in logic. For our example domain <code>hm.com</code> it's relatively simple: we'll find all urls in the page by following <code><a></code> nodes:</p>
<div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal"> 1</span>
<span class="normal"> 2</span>
<span class="normal"> 3</span>
<span class="normal"> 4</span>
<span class="normal"> 5</span>
<span class="normal"> 6</span>
<span class="normal"> 7</span>
<span class="normal"> 8</span>
<span class="normal"> 9</span>
<span class="normal">10</span>
<span class="normal">11</span>
<span class="normal">12</span>
<span class="normal">13</span>
<span class="normal">14</span>
<span class="normal">15</span>
<span class="normal">16</span>
<span class="normal">17</span>
<span class="normal">18</span></pre></div></td><td class="code"><div><pre><span></span><code> <span class="k">def</span> <span class="nf">find_links</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">resp</span><span class="p">,</span> <span class="n">only_unique</span><span class="o">=</span><span class="kc">True</span><span class="p">):</span>
<span class="c1"># build a parsable tree from html body</span>
<span class="n">sel</span> <span class="o">=</span> <span class="n">Selector</span><span class="p">(</span><span class="n">resp</span><span class="o">.</span><span class="n">text</span><span class="p">)</span>
<span class="n">current_url_parts</span> <span class="o">=</span> <span class="n">urlparse</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="n">resp</span><span class="o">.</span><span class="n">url</span><span class="p">))</span>
<span class="c1"># find all <a> nodes and select their href attribute</span>
<span class="n">urls</span> <span class="o">=</span> <span class="n">sel</span><span class="o">.</span><span class="n">xpath</span><span class="p">(</span><span class="s2">"//a/@href"</span><span class="p">)</span><span class="o">.</span><span class="n">extract</span><span class="p">()</span>
<span class="k">for</span> <span class="n">url</span> <span class="ow">in</span> <span class="n">urls</span><span class="p">:</span>
<span class="c1"># convert relative url to absolute</span>
<span class="k">if</span> <span class="n">url</span><span class="o">.</span><span class="n">startswith</span><span class="p">(</span><span class="s2">"/"</span><span class="p">):</span>
<span class="n">url</span> <span class="o">=</span> <span class="n">current_url_parts</span><span class="o">.</span><span class="n">_replace</span><span class="p">(</span><span class="n">path</span><span class="o">=</span><span class="n">url</span><span class="p">)</span><span class="o">.</span><span class="n">geturl</span><span class="p">()</span>
<span class="c1"># skip absolute urls that do not match current domain</span>
<span class="k">if</span> <span class="n">urlparse</span><span class="p">(</span><span class="n">url</span><span class="p">)</span><span class="o">.</span><span class="n">netloc</span> <span class="o">!=</span> <span class="n">current_url_parts</span><span class="o">.</span><span class="n">netloc</span><span class="p">:</span>
<span class="k">continue</span>
<span class="c1"># skip visited urls</span>
<span class="k">if</span> <span class="n">only_unique</span> <span class="ow">and</span> <span class="n">url</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">seen_urls</span><span class="p">:</span>
<span class="k">continue</span>
<span class="bp">self</span><span class="o">.</span><span class="n">seen_urls</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
<span class="k">yield</span> <span class="n">url</span>
</code></pre></div></td></tr></table></div>
<p>Here we first build a tree parser object to get all of those <code><a></code> node links. Then we iterate through them, filtering out anything that isn't a url of this website or has already been visited.</p>
<p class="info">Link extraction can get complicated very quickly, as some websites contain non-html files (e.g. <code>/document.pdf</code>) that need to be filtered out, among many other niche scenarios. </p>
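<p>One way to handle the non-html file case is a small extension filter applied before yielding urls. A sketch (the extension list is illustrative and would grow per target):</p>

```python
from urllib.parse import urlparse

# extensions we never want to crawl; illustrative, extend as needed
SKIP_EXTENSIONS = (".pdf", ".jpg", ".jpeg", ".png", ".zip", ".css", ".js")


def is_crawlable(url: str) -> bool:
    """Reject urls pointing at obvious non-html resources."""
    path = urlparse(url).path.lower()
    return not path.endswith(SKIP_EXTENSIONS)


print(is_crawlable("https://www2.hm.com/en_us/productpage.09008.html"))  # True
print(is_crawlable("https://www2.hm.com/files/document.pdf"))            # False
```

This check slots naturally into the loop inside <code>find_links</code>, right next to the domain and duplicate filters.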
<p>With link extraction complete, we can put our whole crawler together into one piece and see how it performs!</p>
<h3 id="putting-it-all-together">Putting It All Together</h3>
<p>Now that we have all the parts complete - crawl loop, link extraction, link matching and request limiting - let's put them all together and run the crawler:</p>
<div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal"> 1</span>
<span class="normal"> 2</span>
<span class="normal"> 3</span>
<span class="normal"> 4</span>
<span class="normal"> 5</span>
<span class="normal"> 6</span>
<span class="normal"> 7</span>
<span class="normal"> 8</span>
<span class="normal"> 9</span>
<span class="normal"> 10</span>
<span class="normal"> 11</span>
<span class="normal"> 12</span>
<span class="normal"> 13</span>
<span class="normal"> 14</span>
<span class="normal"> 15</span>
<span class="normal"> 16</span>
<span class="normal"> 17</span>
<span class="normal"> 18</span>
<span class="normal"> 19</span>
<span class="normal"> 20</span>
<span class="normal"> 21</span>
<span class="normal"> 22</span>
<span class="normal"> 23</span>
<span class="normal"> 24</span>
<span class="normal"> 25</span>
<span class="normal"> 26</span>
<span class="normal"> 27</span>
<span class="normal"> 28</span>
<span class="normal"> 29</span>
<span class="normal"> 30</span>
<span class="normal"> 31</span>
<span class="normal"> 32</span>
<span class="normal"> 33</span>
<span class="normal"> 34</span>
<span class="normal"> 35</span>
<span class="normal"> 36</span>
<span class="normal"> 37</span>
<span class="normal"> 38</span>
<span class="normal"> 39</span>
<span class="normal"> 40</span>
<span class="normal"> 41</span>
<span class="normal"> 42</span>
<span class="normal"> 43</span>
<span class="normal"> 44</span>
<span class="normal"> 45</span>
<span class="normal"> 46</span>
<span class="normal"> 47</span>
<span class="normal"> 48</span>
<span class="normal"> 49</span>
<span class="normal"> 50</span>
<span class="normal"> 51</span>
<span class="normal"> 52</span>
<span class="normal"> 53</span>
<span class="normal"> 54</span>
<span class="normal"> 55</span>
<span class="normal"> 56</span>
<span class="normal"> 57</span>
<span class="normal"> 58</span>
<span class="normal"> 59</span>
<span class="normal"> 60</span>
<span class="normal"> 61</span>
<span class="normal"> 62</span>
<span class="normal"> 63</span>
<span class="normal"> 64</span>
<span class="normal"> 65</span>
<span class="normal"> 66</span>
<span class="normal"> 67</span>
<span class="normal"> 68</span>
<span class="normal"> 69</span>
<span class="normal"> 70</span>
<span class="normal"> 71</span>
<span class="normal"> 72</span>
<span class="normal"> 73</span>
<span class="normal"> 74</span>
<span class="normal"> 75</span>
<span class="normal"> 76</span>
<span class="normal"> 77</span>
<span class="normal"> 78</span>
<span class="normal"> 79</span>
<span class="normal"> 80</span>
<span class="normal"> 81</span>
<span class="normal"> 82</span>
<span class="normal"> 83</span>
<span class="normal"> 84</span>
<span class="normal"> 85</span>
<span class="normal"> 86</span>
<span class="normal"> 87</span>
<span class="normal"> 88</span>
<span class="normal"> 89</span>
<span class="normal"> 90</span>
<span class="normal"> 91</span>
<span class="normal"> 92</span>
<span class="normal"> 93</span>
<span class="normal"> 94</span>
<span class="normal"> 95</span>
<span class="normal"> 96</span>
<span class="normal"> 97</span>
<span class="normal"> 98</span>
<span class="normal"> 99</span>
<span class="normal">100</span>
<span class="normal">101</span></pre></div></td><td class="code"><div><pre><span></span><code><span class="kn">import</span> <span class="nn">logging</span>
<span class="kn">import</span> <span class="nn">asyncio</span>
<span class="kn">import</span> <span class="nn">re</span>
<span class="kn">from</span> <span class="nn">urllib.parse</span> <span class="kn">import</span> <span class="n">urlparse</span>
<span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">List</span>
<span class="kn">from</span> <span class="nn">parsel</span> <span class="kn">import</span> <span class="n">Selector</span>
<span class="kn">from</span> <span class="nn">httpx</span> <span class="kn">import</span> <span class="n">AsyncClient</span>
<span class="k">class</span> <span class="nc">HMScraper</span><span class="p">:</span>
<span class="n">save_urls</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s2">"/productpage\."</span><span class="p">)</span> <span class="c1"># e.g ...com/en_us/productpage.09008.html</span>
<span class="n">follow_urls</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s2">"\.html"</span><span class="p">)</span>
<span class="n">follow_saved_urls</span> <span class="o">=</span> <span class="kc">False</span>
<span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">limit</span><span class="o">=</span><span class="mi">5</span><span class="p">):</span>
<span class="bp">self</span><span class="o">.</span><span class="n">limiter</span> <span class="o">=</span> <span class="n">asyncio</span><span class="o">.</span><span class="n">Semaphore</span><span class="p">(</span><span class="n">limit</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">log</span> <span class="o">=</span> <span class="n">logging</span><span class="o">.</span><span class="n">getLogger</span><span class="p">(</span><span class="nb">type</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span><span class="o">.</span><span class="vm">__name__</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">seen_urls</span> <span class="o">=</span> <span class="nb">set</span><span class="p">()</span>
<span class="k">async</span> <span class="k">def</span> <span class="fm">__aenter__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="w"> </span><span class="sd">"""on scraper creation open http session"""</span>
<span class="bp">self</span><span class="o">.</span><span class="n">session</span> <span class="o">=</span> <span class="n">AsyncClient</span><span class="p">(</span>
<span class="c1"># we should use a browser-like user agent header to avoid being blocked</span>
<span class="n">headers</span><span class="o">=</span><span class="p">{</span>
<span class="s2">"User-Agent"</span><span class="p">:</span> <span class="s2">"Mozilla/5.0 (X11; Linux x86_64; rv:92.0) Gecko/20100101 Firefox/92.0"</span>
<span class="p">}</span>
<span class="p">)</span>
<span class="k">return</span> <span class="bp">self</span>
<span class="k">async</span> <span class="k">def</span> <span class="fm">__aexit__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="o">*</span><span class="n">args</span><span class="p">):</span>
<span class="w"> </span><span class="sd">"""on scraper destruction close http session"""</span>
<span class="k">await</span> <span class="bp">self</span><span class="o">.</span><span class="n">session</span><span class="o">.</span><span class="n">aclose</span><span class="p">()</span>
<span class="k">async</span> <span class="k">def</span> <span class="nf">_request</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">url</span><span class="p">:</span> <span class="nb">str</span><span class="p">):</span>
<span class="k">async</span> <span class="k">with</span> <span class="bp">self</span><span class="o">.</span><span class="n">limiter</span><span class="p">:</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">resp</span> <span class="o">=</span> <span class="k">await</span> <span class="bp">self</span><span class="o">.</span><span class="n">session</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
<span class="k">except</span> <span class="ne">Exception</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
<span class="k">return</span> <span class="n">e</span>
<span class="k">return</span> <span class="n">resp</span>
<span class="k">async</span> <span class="k">def</span> <span class="nf">save</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">url</span><span class="p">):</span>
<span class="nb">print</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
<span class="k">async</span> <span class="k">def</span> <span class="nf">scrape</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">urls</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="nb">str</span><span class="p">]):</span>
<span class="w"> </span><span class="sd">"""Breadth first"""</span>
<span class="k">while</span> <span class="kc">True</span><span class="p">:</span>
<span class="n">to_follow</span> <span class="o">=</span> <span class="nb">set</span><span class="p">()</span>
<span class="k">for</span> <span class="n">resp</span> <span class="ow">in</span> <span class="n">asyncio</span><span class="o">.</span><span class="n">as_completed</span><span class="p">([</span><span class="bp">self</span><span class="o">.</span><span class="n">_request</span><span class="p">(</span><span class="n">url</span><span class="p">)</span> <span class="k">for</span> <span class="n">url</span> <span class="ow">in</span> <span class="n">urls</span><span class="p">]):</span>
<span class="n">resp</span> <span class="o">=</span> <span class="k">await</span> <span class="n">resp</span>
<span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">resp</span><span class="p">,</span> <span class="ne">Exception</span><span class="p">):</span>
<span class="k">continue</span>
<span class="k">if</span> <span class="n">resp</span><span class="o">.</span><span class="n">status_code</span> <span class="o">!=</span> <span class="mi">200</span><span class="p">:</span>
<span class="k">continue</span>
<span class="k">for</span> <span class="n">url</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">find_links</span><span class="p">(</span><span class="n">resp</span><span class="p">):</span>
<span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="n">save_urls</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="n">url</span><span class="p">):</span>
<span class="k">await</span> <span class="bp">self</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
<span class="k">if</span> <span class="ow">not</span> <span class="bp">self</span><span class="o">.</span><span class="n">follow_saved_urls</span><span class="p">:</span>
<span class="k">continue</span>
<span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="n">follow_urls</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="n">url</span><span class="p">):</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">" following </span><span class="si">{</span><span class="n">url</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
<span class="n">to_follow</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
<span class="k">if</span> <span class="n">to_follow</span><span class="p">:</span>
<span class="n">urls</span> <span class="o">=</span> <span class="n">to_follow</span>
<span class="k">else</span><span class="p">:</span>
<span class="k">return</span> <span class="c1"># end of the crawl</span>
<span class="k">def</span> <span class="nf">find_links</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">resp</span><span class="p">,</span> <span class="n">only_unique</span><span class="o">=</span><span class="kc">True</span><span class="p">):</span>
<span class="w"> </span><span class="sd">"""</span>
<span class="sd"> find all relative page links in html link nodes</span>
<span class="sd"> """</span>
<span class="c1"># build a parsable tree from html body</span>
<span class="n">sel</span> <span class="o">=</span> <span class="n">Selector</span><span class="p">(</span><span class="n">resp</span><span class="o">.</span><span class="n">text</span><span class="p">)</span>
<span class="n">current_url_parts</span> <span class="o">=</span> <span class="n">urlparse</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="n">resp</span><span class="o">.</span><span class="n">url</span><span class="p">))</span>
<span class="c1"># find all <a> nodes and select their href attribute</span>
<span class="n">urls</span> <span class="o">=</span> <span class="n">sel</span><span class="o">.</span><span class="n">xpath</span><span class="p">(</span><span class="s2">"//a/@href"</span><span class="p">)</span><span class="o">.</span><span class="n">extract</span><span class="p">()</span>
<span class="k">for</span> <span class="n">url</span> <span class="ow">in</span> <span class="n">urls</span><span class="p">:</span>
<span class="c1"># convert relative url to absolute</span>
<span class="k">if</span> <span class="n">url</span><span class="o">.</span><span class="n">startswith</span><span class="p">(</span><span class="s2">"/"</span><span class="p">):</span>
<span class="n">url</span> <span class="o">=</span> <span class="n">current_url_parts</span><span class="o">.</span><span class="n">_replace</span><span class="p">(</span><span class="n">path</span><span class="o">=</span><span class="n">url</span><span class="p">)</span><span class="o">.</span><span class="n">geturl</span><span class="p">()</span>
<span class="c1"># skip absolute urls that do not match current domain</span>
<span class="k">if</span> <span class="n">urlparse</span><span class="p">(</span><span class="n">url</span><span class="p">)</span><span class="o">.</span><span class="n">netloc</span> <span class="o">!=</span> <span class="n">current_url_parts</span><span class="o">.</span><span class="n">netloc</span><span class="p">:</span>
<span class="k">continue</span>
<span class="c1"># skip visited urls</span>
<span class="k">if</span> <span class="n">only_unique</span> <span class="ow">and</span> <span class="n">url</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">seen_urls</span><span class="p">:</span>
<span class="k">continue</span>
<span class="bp">self</span><span class="o">.</span><span class="n">seen_urls</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
<span class="k">yield</span> <span class="n">url</span>
<span class="k">async</span> <span class="k">def</span> <span class="nf">run</span><span class="p">():</span>
<span class="k">async</span> <span class="k">with</span> <span class="n">HMScraper</span><span class="p">()</span> <span class="k">as</span> <span class="n">scraper</span><span class="p">:</span>
<span class="n">start_urls</span> <span class="o">=</span> <span class="p">[</span>
<span class="c1"># homepage for US website</span>
<span class="s2">"https://www2.hm.com/en_us/index.html"</span>
<span class="p">]</span>
<span class="k">await</span> <span class="n">scraper</span><span class="o">.</span><span class="n">scrape</span><span class="p">(</span><span class="n">start_urls</span><span class="p">)</span>
<span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s2">"__main__"</span><span class="p">:</span>
<span class="n">asyncio</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="n">run</span><span class="p">())</span>
</code></pre></div></td></tr></table></div>
<p>If we run our crawler we'll notice a few things:
- At the time of writing, ~13,800 results are found, which matches well with the other <a href="/tag/discovery-methods.html">#discovery-methods</a> used in this blog series.
- The crawl takes a while to complete - since we are visiting far more pages than other discovery methods require, crawling is notably slower. </p>
<p>Finally, we can see that most of this scraper can easily be reused for other websites - all we need to do is change the url rules! That's the big selling point of crawlers: they're less domain-specific than individual web scrapers.</p>
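<p>For illustration, here's a minimal sketch of that rule-swap idea (the class names and the bookstore patterns are hypothetical): the generic crawl logic is shared while only the url patterns differ per website.</p>

```python
import re

class CrawlRules:
    """Rules matching the H&M crawl from this article."""
    save_urls = re.compile(r"/productpage\.")  # urls to save as results
    follow_urls = re.compile(r"\.html")        # urls to keep crawling

class BookstoreRules(CrawlRules):
    """Adapting to another website = overriding the two patterns."""
    save_urls = re.compile(r"/product/")
    follow_urls = re.compile(r"/(category|product)/")

def classify(rules, url: str) -> str:
    """Decide what the crawl loop should do with a discovered url."""
    if rules.save_urls.search(url):
        return "save"
    if rules.follow_urls.search(url):
        return "follow"
    return "skip"
```

<p>For example, <code>classify(BookstoreRules, "https://shop.example/product/123")</code> saves the url while the same url under <code>CrawlRules</code> is skipped - the crawl loop itself never changes.</p>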
<h2 id="summary-and-further-reading_1">Summary and Further Reading</h2>
<p>To summarize, web crawling is a great discovery technique that lends itself easily to generic/broad scraper development, because the same scrape loop can be applied to many targets with just some rule adjustments. However, it's less efficient - slower and riskier when it comes to blocks - than other discovery techniques like <a href="/web-scraping-discover-search.html">Search Bar</a> or <a href="/web-scraping-discover-sitemaps.html">Sitemaps</a>. </p>
<hr/>
<p>For more web-scraping discovery techniques, see <a href="/tag/discovery-methods.html">#discovery-methods</a>, and see <a href="/tag/discovery.html">#discovery</a> for more discovery-related subjects. </p>
<p>If you have any questions, come join us on <a href="https://matrix.to/#/%23web-scraping:matrix.org">#web-scraping on matrix</a>, check out <a href="https://stackoverflow.com/questions/tagged/web-scraping">#web-scraping on stackoverflow</a> or leave a comment below! </p>
<p>As always, you can hire me for web-scraping consultation over at the <a href="/pages/hire.html">hire</a> page - happy scraping! </p>
<p class="info">The code used in this article can be found on <a href="https://github.com/Granitosaurus/scrapecrow/tree/main/examples">github</a>.<br/></p>Web Scraping Target Discovery: Indexes2021-09-28T00:00:00+02:002021-09-28T00:00:00+02:00Bernardas Ališauskastag:scrapecrow.com,2021-09-28:/web-scraping-discovery-indexes.html<p>The most common web scraping target discovery technique: recursive crawling. How does it work? What are the pros and cons and the most optimal execution patterns?</p><p><a href="/images/banner-books.jpg"><img class="fullc" loading="lazy" src="/images/banner-books.jpg" title=""/></a><figcaption></figcaption></p>
<p>Most web scrapers are made up of two core parts: finding products on the website and actually scraping them. The former is often referred to as the "target discovery" step. For example, to scrape product data from an e-commerce website, we would first need to find the urls of each individual product - only then can we scrape their data.</p>
<p>Discovering targets to scrape in web scraping is often a challenging and important task. This series of blog posts tagged with <a href="/tag/discovery-methods.html">#discovery-methods</a> (also see <a href="/web-scraping-discovery.html">main article</a>) covers common target discovery approaches.</p>
<p>Using various public indexers is an often-viable target discovery strategy. It is mostly used as a last resort or as a supplementary technique for discovery-difficult targets. </p>
<p>Public indexers that crawl the web with more complex scraping rules might pick up hard-to-find targets, and we can take advantage of that in our own web scrapers. In other words, as the spirit of web scraping goes: it's smart to take advantage of existing work! <br/>
In this article we'll take a look at a few common public indexers and how we can use them to discover targets. </p>
<p>First, let's overview the common pros and cons of this discovery strategy:</p>
<p>Pros:</p>
<ul>
<li><strong>Easy</strong>: once understood, taking advantage of public indexers is surprisingly easy.</li>
<li><strong>Efficient</strong>: public indexes function similarly to in-website search bars, or often come as easy-to-parse data dumps - meaning we don't even need a connection to the target website to discover targets.</li>
</ul>
<p>Cons:</p>
<ul>
<li><strong>Insufficient coverage and stale data</strong>: because these indexes are gathered on someone else's schedule and by someone else's rules, coverage and link quality can be poor. For this reason it is best to combine index-based discovery with other discovery techniques.</li>
</ul>
<h2 id="using-search-engines">Using Search Engines</h2>
<p>The most common and richest public indexes are search engines like Google, Bing or DuckDuckGo - anything that lets humans search the web can be a useful tool for web scraping robots as well. </p>
<p>To see how we could use a search engine for web scraping discovery, let's take <a href="https://crunchbase.com">https://crunchbase.com</a> as an example. Let's presume that we want to scrape their company overview data<br/>
(e.g. <a href="https://www.crunchbase.com/organization/linkedin">https://www.crunchbase.com/organization/linkedin</a>). </p>
<p>In this example we'll use <strong>bing.com</strong> to query for crunchbase.com urls. Bing is a great tool for web scrapers as it's easy to scrape (unlike Google, which employs various anti-scraping strategies) and has relatively good result quality and coverage. </p>
<p>If we take a look at an average Crunchbase company page like <a href="https://www.crunchbase.com/organization/linkedin">https://www.crunchbase.com/organization/linkedin</a>, we can determine that all company pages follow a url pattern that looks something like <code>.../organization/&lt;name&gt;</code> </p>
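<p>In scraper code this pattern could be expressed as a simple regex - a sketch, since the exact character set allowed in company slugs is an assumption:</p>

```python
import re

# match crunchbase company pages like .../organization/linkedin
# (slug charset [\w-] is an assumption)
company_page = re.compile(r"crunchbase\.com/organization/[\w-]+/?$")

is_company = bool(company_page.search("https://www.crunchbase.com/organization/linkedin"))
```
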
<p>Knowing this we can write domain specific queries in bing.com search box to find other company pages:</p>
<p><a href="/images/bing.com-crunchbase-search.png"><img class="bigc" loading="lazy" src="/images/bing.com-crunchbase-search.png" title=""/></a><figcaption></figcaption></p>
<p>Here we used the query <code>/organization/ site:crunchbase.com</code>, and bing.com is giving us over a million results - pretty close to the count Crunchbase claims on their website! </p>
<p class="info">For advanced search keywords/options see <a href="https://help.bing.microsoft.com/#apex/bing/en-US/10002/-1">bing's advanced search options</a> and <a href="https://help.bing.microsoft.com/#apex/bing/en-US/10001/-1">bing's advanced search keywords</a></p>
<p>Other search engines like Google, DuckDuckGo etc. also support similar search syntax. The search can be refined even further with more advanced rules to find specific scraping targets.</p>
<p>All that being said, querying search engines is not without its faults.<br/>
They are often built for humans rather than robots and have limited pagination (i.e. a query will only return 10 pages of results even though it says millions are found), which requires splitting a single query into many smaller ones, e.g. searching every letter of the alphabet or particular names. Despite all this, this discovery approach is surprisingly easy and can often work beautifully for small web scrapers!</p>
<h2 id="using-public-index-dumps">Using Public Index Dumps</h2>
<p>There are several public web indexes, but probably the biggest and most well-known one is <a href="https://commoncrawl.org/">https://commoncrawl.org/</a>, which crawls the web and provides data dumps publicly for free. </p>
<blockquote>
<p>Common Crawl is a nonprofit 501(c)(3) organization that crawls the web and freely provides its archives and datasets to the public. Common Crawl's web archive consists of petabytes of data collected since 2011. It completes crawls generally every month.</p>
</blockquote>
<p>Unfortunately, as this is an open and free project, the crawled HTML documents are somewhat stale - though as web scraper engineers, we can instead use Common Crawl as an index feed for our own web scrapers.</p>
<p>You can access Common Crawl's web index here: <a href="http://urlsearch.commoncrawl.org/">http://urlsearch.commoncrawl.org/</a>. The data is grouped by month, and each month's dataset can be queried on the online playground:</p>
<p><a href="/images/commoncrawl-crunchbase-search.png"><img class="bigc" loading="lazy" src="/images/commoncrawl-crunchbase-search.png" title=""/></a><figcaption></figcaption></p>
<p>Crawl coverage by Common Crawl highly depends on the popularity of the source: some smaller websites are harder to find, while bigger targets can have good data coverage.<br/>
Despite the coverage issues, the Common Crawl url dataset is a useful tool to have in the web-scraping tool belt.</p>
<p class="info">If you find Commoncrawl useful it's a non-profit organization <a href="https://commoncrawl.org/donate/">accepting public donations</a>!</p>
<h2 id="using-internet-archive">Using Internet Archive</h2>
<p>Another public indexer is the archive.org project, which aims to archive various internet pages for posterity. </p>
<blockquote>
<p>Internet Archive is a non-profit library of millions of free books, movies, software, music, websites, and more.</p>
</blockquote>
<p>We can use archive.org's website archive as our discovery engine. If we go to <a href="https://archive.org">https://archive.org</a> and type in our source:</p>
<p><a href="/images/archive.org-crunchbase-search.png"><img class="bigc" loading="lazy" src="/images/archive.org-crunchbase-search.png" title=""/></a><figcaption></figcaption></p>
<p>We can see that archive.org has captured a lot of urls! Let's take a quick look at how we could scrape it as a discovery source. If we open up the web inspector and inspect the requests made when we click search, we'll find a backend API url that looks something like this (<a href="https://web.archive.org/web/timemap/?url=crunchbase.com/organization/%2F&matchType=prefix&collapse=urlkey&output=json&fl=original%2Cendtimestamp%2Cgroupcount%2Cuniqcount&filter=!statuscode%3A[45]..&limit=10000&_=1632308317409">clickable</a>):</p>
<div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal">1</span>
<span class="normal">2</span>
<span class="normal">3</span>
<span class="normal">4</span>
<span class="normal">5</span>
<span class="normal">6</span>
<span class="normal">7</span>
<span class="normal">8</span></pre></div></td><td class="code"><div><pre><span></span><code><span class="nv">https</span>:<span class="o">//</span><span class="nv">web</span>.<span class="nv">archive</span>.<span class="nv">org</span><span class="o">/</span><span class="nv">web</span><span class="o">/</span><span class="nv">timemap</span><span class="o">/</span>?
<span class="nv">url</span><span class="o">=</span><span class="nv">crunchbase</span>.<span class="nv">com</span><span class="o">/</span><span class="nv">organization</span><span class="o">/%</span><span class="mi">2</span><span class="nv">F</span><span class="o">&</span><span class="nv">matchType</span><span class="o">=</span><span class="nv">prefix</span><span class="o">&</span>
<span class="nv">collapse</span><span class="o">=</span><span class="nv">urlkey</span><span class="o">&</span>
<span class="nv">output</span><span class="o">=</span><span class="nv">json</span><span class="o">&</span>
<span class="nv">fl</span><span class="o">=</span><span class="nv">original</span><span class="o">%</span><span class="mi">2</span><span class="nv">Cendtimestamp</span><span class="o">%</span><span class="mi">2</span><span class="nv">Cgroupcount</span><span class="o">%</span><span class="mi">2</span><span class="nv">Cuniqcount</span><span class="o">&</span><span class="w"> </span>#<span class="w"> </span><span class="nv">rows</span><span class="w"> </span><span class="nv">to</span><span class="w"> </span><span class="k">return</span>,<span class="w"> </span><span class="nv">we</span><span class="err">'re mostly interested in "original"</span>
<span class="err">filter=!statuscode%3A[45]..& </span>
<span class="err">limit=10000& # result limit</span>
<span class="err">_=1632308317409 # current timestamp</span>
</code></pre></div></td></tr></table></div>
<p>If we submit a GET request to this url, we'll get the whole dataset of matching results:</p>
<div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal"> 1</span>
<span class="normal"> 2</span>
<span class="normal"> 3</span>
<span class="normal"> 4</span>
<span class="normal"> 5</span>
<span class="normal"> 6</span>
<span class="normal"> 7</span>
<span class="normal"> 8</span>
<span class="normal"> 9</span>
<span class="normal">10</span>
<span class="normal">11</span></pre></div></td><td class="code"><div><pre><span></span><code>[["original","endtimestamp","groupcount","uniqcount"],
["http://crunchbase.com:80/organization/","20200218003314","379","100"],
["https://www.crunchbase.com/organization.investors","20171204192453","2","2"],
["https://www.crunchbase.com/organization.similarwebs","20171204192356","1","1"],
["https://crunchbase.com/organization/%22/financial-organization/alsop-louie-partners%22","20190518071052","1","1"],
["https://crunchbase.com/organization/%22/financial-organization/draper-associates-2%22","20190518091831","1","1"],
["https://crunchbase.com/organization/%22/maps/city/San%2520Francisco%22","20190518044213","1","1"],
["https://crunchbase.com/organization/%22/organization/andreessen-horowitz%22","20190517093614","1","1"],
["https://crunchbase.com/organization/%22/organization/canaan-partners%22","20190517071111","1","1"],
["https://crunchbase.com/organization/%22/organization/fog-creek-software%22","20190518102954","1","1"],
...
</code></pre></div></td></tr></table></div>
<p>This query returns thousands of unique urls along with timestamps of the last time each was scraped by the Internet Archive - which can easily be adapted as a target discovery source!</p>
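<p>Since the response is a plain JSON array of rows where the first row is the header, extracting the urls is trivial. A small sketch using the sample rows from above:</p>

```python
def parse_timemap(rows: list) -> list:
    """Extract the "original" url column from a timemap result table."""
    header, *records = rows
    url_index = header.index("original")
    return [record[url_index] for record in records]

# sample rows taken from the response above
sample = [
    ["original", "endtimestamp", "groupcount", "uniqcount"],
    ["http://crunchbase.com:80/organization/", "20200218003314", "379", "100"],
    ["https://www.crunchbase.com/organization.investors", "20171204192453", "2", "2"],
]
urls = parse_timemap(sample)
```
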
<p class="info">If you find Internet Archive useful it's a non-profit organization <a href="https://archive.org/donate/">accepting public donations</a>!</p>
<h2 id="summary-and-further-reading">Summary and Further Reading</h2>
<p>To summarize, using public indexes can be a valid scrape-target discovery technique; however, it comes with a certain level of target quality and coverage uncertainty. It's best used to supplement other discovery techniques or for initial prototyping.</p>
<p>Here's a list of several search engines that can be used by web-scrapers for target discovery:</p>
<ul>
<li><a href="https://bing.com">https://bing.com</a> - great western web coverage, weak anti web-scraping measures.</li>
<li><a href="https://yahoo.com">https://yahoo.com</a> - uses bing.com database but different algorithms. </li>
<li><a href="https://www.onesearch.com/">https://www.onesearch.com/</a> - yahoo's new privacy search engine that is also using bing.com databases but even newer algorithms. Only accessible by US ips.</li>
<li><a href="https://duckduckgo.com">https://duckduckgo.com</a> - similar to bing.com but their own dataset. </li>
<li><a href="https://startpage.com">https://startpage.com</a> - uses google dataset but easier to access by web-scrapers; see <a href="https://github.com/Garee/sp">sp</a> project.</li>
<li><a href="https://yandex.com">https://yandex.com</a> - great russian web coverage and decent western web coverage.</li>
<li><a href="https://boardreader.com/">https://boardreader.com/</a> - brilliant forum/discussion board coverage.</li>
</ul>
<hr/>
<p>For more web-scraping discovery techniques, see <a href="/tag/discovery-methods.html">#discovery-methods</a>, and see <a href="/tag/discovery.html">#discovery</a> for more discovery-related subjects. </p>
<p>If you have any questions, come join us on <a href="https://matrix.to/#/%23web-scraping:matrix.org">#web-scraping on matrix</a>, check out <a href="https://stackoverflow.com/questions/tagged/web-scraping">#web-scraping on stackoverflow</a> or leave a comment below! </p>
<p>As always, you can hire me for web-scraping consultation over at the <a href="/pages/hire.html">hire</a> page - and happy scraping! </p>Web Scraping Target Discovery: Search API2021-09-28T00:00:00+02:002021-09-28T00:00:00+02:00Bernardas Ališauskastag:scrapecrow.com,2021-09-28:/web-scraping-discovery-search.html<p>A fundamental web-scraping reverse-engineering technique is figuring out how a website's search works. Replicating web search is a great target discovery technique. Why, when and how should it be used effectively?</p><p><a href="/images/banner-gears.jpg"><img class="fullc" loading="lazy" src="/images/banner-gears.jpg" title=""/></a><figcaption></figcaption></p>
<p>Most web scrapers are made up of two core parts: finding products on the website and actually scraping them. The former is often referred to as the "target discovery" step. For example, to scrape product data from an e-commerce website, we first need to find the URLs of each individual product - only then can we scrape their data.</p>
<p>Discovering targets to scrape is often a challenging and important task. This series of blog posts tagged with <a href="/tag/discovery-methods.html">#discovery-methods</a> (also see the <a href="/web-scraping-discovery.html">main article</a>) covers common target discovery approaches.</p>
<p>Reverse engineering a website's backend API is a common web-scraping technique - why scrape HTML when the backend data can be fetched directly? In this article, we'll briefly cover the most common reverse-engineering subject: the search API.</p>
<p class="related">For reverse-engineering introduction, see related article <a href="/reverse-engineering-intro.html">Introduction To Reverse Engineering The Web</a></p>
<h2 id="using-search-api-for-web-scraping">Using search API for web-scraping</h2>
<p>One way to discover targets in web scraping is to reverse-engineer the search bar to find the website's search API. It's often one of the best ways to discover targets - let's review the common pros and cons of this approach:</p>
<p>Pros:</p>
<ul>
<li><strong>Fresh Targets</strong>: the search API rarely yields links to outdated targets, as it serves exactly what website users see. </li>
<li><strong>Good Coverage</strong>: search API can lead to all the results a website has to offer - if it's not searchable, it's probably not there! </li>
<li><strong>Efficient</strong>: search API result pagination can yield 10-50 results per page and often can be scraped asynchronously. </li>
</ul>
<p>Cons:</p>
<ul>
<li><strong>Domain bound</strong>: since every website has its own search structure, the code can rarely be reused across targets.</li>
<li><strong>Limited Coverage</strong>: some search APIs return a limited number of pages (e.g. there are 900 results, but after 10 pages the API returns nothing), meaning the scraper has to work around this limit, which can be difficult to implement.</li>
<li><strong>Slow</strong>: rarely, search result pagination cannot be iterated asynchronously - pages must be requested one after another, which slows down the scraping process.</li>
</ul>
<p>As you can see, the pros and cons are very mixed and even contradictory - it all depends on the website's search implementation. Let's cover a few examples and see what search API discovery is all about.</p>
<h2 id="example-hmcom">Example: hm.com</h2>
<p>To understand basic search bar reverse-engineering, let's see how the popular clothing website <a href="https://hm.com">https://hm.com</a> handles its search. </p>
<h3 id="reversing-search-bar">Reversing Search Bar</h3>
<p>If we go to the website, open up our web inspector tools and search something, we can see the search requests being made by the browser:</p>
<p><a href="/images/hm.com-initial-req.png"><img class="bigc" loading="lazy" src="/images/hm.com-initial-req.png" title=""/></a><figcaption></figcaption></p>
<p>However, this returns filtered results, while we want to discover <em>all</em> products on the website. <br/>
For this, we can trick the search API into searching an empty query by using either an empty string or a space character. In this case no results are returned for an empty string <code>""</code>, but we can force the search by using the URL-encoded (also called "percent-encoded") character for a space in the url bar: <code>%20</code></p>
<p class="info">For more on percent encoding see <a href="https://developer.mozilla.org/en-US/docs/Glossary/percent-encoding">MDN's documentation</a>.</p>
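<p>As a quick sanity check (an illustrative snippet, not from the site's API), Python's standard library produces and reverses exactly this encoding:</p>

```python
from urllib.parse import quote, unquote

# a space percent-encodes to "%20"
encoded = quote(" ")
print(encoded)  # → %20

# and "%20" decodes back to a plain space
assert unquote("%20") == " "
```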
<p><a href="/images/hm.com-space-search.png"><img class="bigc" loading="lazy" src="/images/hm.com-space-search.png" title=""/></a><figcaption></figcaption></p>
<p>Success! We got 13780 product results! <br/>
Now, let's figure out how the search works. If you look at the inspector, no data requests are made because the first page's data is embedded in the HTML as a JavaScript variable - a common website optimization that we can ignore.</p>
<p>We could scrape the HTML, but often we don't have to. Modern websites tend to communicate with the backend API in JSON, so let's try to find that. <br/>
If we scroll to the bottom of the page and click next page, we can see the actual JSON data request being made for the second page:</p>
<p><a href="/images/hm.com-xhr-req.png"><img class="bigc" loading="lazy" src="/images/hm.com-xhr-req.png" title=""/></a><figcaption></figcaption></p>
<p>We see a request being made to the backend's search API, and it returns a JSON dataset with product metadata and locations. Let's take a look at the request url so we can replicate it in our scraper:</p>
<div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal">1</span>
<span class="normal">2</span>
<span class="normal">3</span>
<span class="normal">4</span>
<span class="normal">5</span></pre></div></td><td class="code"><div><pre><span></span><code>https://www2.hm.com/en_us/search-results/_jcr_content/search.display.json?
q=%20& # search term
offset=0& # pagination offset
page-size=40& # pagination limit
sort=ascPrice # sort type
</code></pre></div></td></tr></table></div>
<p class="info">Many modern web APIs are very flexible with parameters - we don't have to use all the junk we see in the web inspector. We can always experiment to see which parameters are necessary and how the content changes when they do.<br/>
In this example we stripped off a lot of uninteresting parameters and kept just the query, offset/limit and sort.</p>
<p>This looks like a common offset/limit pagination technique, which is great for web scrapers, as we can fetch multiple pages asynchronously - in other words, we can request slices 0:100, 100:200, ... concurrently.</p>
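<p>To illustrate (an illustrative helper; the parameter names mirror the ones seen in the inspector), offset/limit pagination lets us precompute every page request up front instead of walking pages one by one:</p>

```python
def pagination_params(total: int, page_size: int = 100) -> list:
    """Generate offset/limit parameter pairs covering `total` results."""
    return [
        {"offset": offset, "page-size": page_size}
        for offset in range(0, total, page_size)
    ]

# e.g. 250 results in pages of 100 -> offsets 0, 100, 200
print(pagination_params(250))
```

<p>Each of these parameter dicts can then be passed to an independent, concurrently scheduled request.</p>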
<h3 id="confirming-pagination">Confirming Pagination</h3>
<p>Before we commit to using this API endpoint, we should test it for common coverage pitfalls such as page limits. Often, search APIs limit the number of rows/pages a query can request. If we just click the link in the browser:</p>
<p><a href="https://www2.hm.com/en_us/search-results/_jcr_content/search.display.json?q=%20&offset=0&page-size=40&sort=ascPrice">https://www2.hm.com/en_us/search-results/_jcr_content/search.display.json?q=%20&offset=0&page-size=40&sort=ascPrice</a></p>
<p>We can see a JSON response with a total result count of <code>13_730</code>.<br/>
Let's see if we can get the last page, which at the time of this article would be: <code>offset=13690&page-size=40</code>:</p>
<p><a href="https://www2.hm.com/en_us/search-results/_jcr_content/search.display.json?q=%20&offset=13690&page-size=40&sort=ascPrice">https://www2.hm.com/en_us/search-results/_jcr_content/search.display.json?q=%20&offset=13690&page-size=40&sort=ascPrice</a></p>
<p>Unfortunately, while the request is successful, it contains no product data, indicated by an empty array: <code>"products": []</code> </p>
<p>It's what we feared: this pagination has a page limit. By experimenting with the offset parameter we can find exactly where the pagination ends - at <code>10_000</code> results, a common round number. </p>
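<p>Rather than probing offsets by hand, a binary search over the offset locates the cutoff in a handful of requests. A sketch (hypothetical helper; in practice <code>has_results</code> would wrap a real API request and check whether the <code>products</code> array is non-empty):</p>

```python
def find_pagination_limit(has_results, upper_bound: int = 100_000) -> int:
    """Binary search for the last offset that still returns results."""
    lo, hi = 0, upper_bound
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if has_results(mid):
            lo = mid  # results exist at mid; the limit is at or above it
        else:
            hi = mid - 1  # no results at mid; the limit is below it
    return lo

# simulate an API that stops returning results at offset 10_000
print(find_pagination_limit(lambda offset: offset < 10_000))  # → 9999
```

<p>With a 100_000 upper bound this needs only about 17 probes, which is far cheaper than scanning offsets linearly.</p>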
<p>Let's see a few common ways we could get around this pagination limit:</p>
<ol>
<li>Use multiple search queries - a common brute-force technique is to search many different queries like <code>a</code>, <code>b</code>, <code>c</code>... and hope all the products are found.</li>
<li>Apply more filters - this query allows optional filters such as categories. We can collect all categories, e.g. <code>shoes</code>, <code>dresses</code> etc., and run a query for each of them. </li>
<li>Reverse the sorting - if one query can give us 10_000 results, by reversing the sort order we get a second query with another 10_000 results. That's an easy way to double our reach. </li>
</ol>
<p>For this specific case, approach #3, reversing the sorting, is the best choice: the website has only a bit over 13_000 results and our reach would be 20_000 - a perfect fit. <br/>
We can sort our query by price and reach the results from both ends of the query:</p>
<p><a href="/images/hm.com-double-end-query.png"><img class="bigc" loading="lazy" src="/images/hm.com-double-end-query.png" title=""/></a><figcaption></figcaption></p>
<p>So our first query would get us the 10_000 cheapest items and the second query would pick up the 3_700 most expensive items. With these two queries, we can fully discover all available products.</p>
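<p>The arithmetic of this double-ended plan can be sketched as follows (illustrative helper; the names are mine, not part of the site's API, and the total of 13_730 comes from the count seen earlier):</p>

```python
def split_double_ended(total: int, limit: int = 10_000) -> dict:
    """Split `total` results into two sort-order ranges within a per-query depth limit."""
    ascending = min(total, limit)       # cheapest items, via ascPrice
    descending = max(0, total - limit)  # remaining items, via descPrice
    return {"ascPrice": ascending, "descPrice": descending}

# 13_730 total results against a 10_000 page-depth limit
print(split_double_ended(13_730))  # → {'ascPrice': 10000, 'descPrice': 3730}
```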
<h3 id="implementation">Implementation</h3>
<p>Having reverse engineered how hm.com's search API works, we can develop our scraping algorithm:</p>
<ol>
<li>Get the first page to find the total result count.</li>
<li>Schedule request for first <code>10_000</code> results sorted by <code>ascPrice</code>.</li>
<li>Schedule remaining <code>total - 10_000</code> requests sorted by <code>descPrice</code>.</li>
<li>Collect responses and parse product data.</li>
</ol>
<p>Here's a quick implementation in Python using the asynchronous HTTP client package <code>httpx</code>:</p>
<div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal"> 1</span>
<span class="normal"> 2</span>
<span class="normal"> 3</span>
<span class="normal"> 4</span>
<span class="normal"> 5</span>
<span class="normal"> 6</span>
<span class="normal"> 7</span>
<span class="normal"> 8</span>
<span class="normal"> 9</span>
<span class="normal">10</span>
<span class="normal">11</span>
<span class="normal">12</span>
<span class="normal">13</span>
<span class="normal">14</span>
<span class="normal">15</span>
<span class="normal">16</span>
<span class="normal">17</span>
<span class="normal">18</span>
<span class="normal">19</span>
<span class="normal">20</span>
<span class="normal">21</span>
<span class="normal">22</span>
<span class="normal">23</span>
<span class="normal">24</span>
<span class="normal">25</span>
<span class="normal">26</span>
<span class="normal">27</span>
<span class="normal">28</span>
<span class="normal">29</span>
<span class="normal">30</span>
<span class="normal">31</span>
<span class="normal">32</span>
<span class="normal">33</span>
<span class="normal">34</span>
<span class="normal">35</span>
<span class="normal">36</span>
<span class="normal">37</span>
<span class="normal">38</span>
<span class="normal">39</span>
<span class="normal">40</span>
<span class="normal">41</span>
<span class="normal">42</span>
<span class="normal">43</span>
<span class="normal">44</span>
<span class="normal">45</span>
<span class="normal">46</span>
<span class="normal">47</span>
<span class="normal">48</span>
<span class="normal">49</span>
<span class="normal">50</span>
<span class="normal">51</span>
<span class="normal">52</span>
<span class="normal">53</span>
<span class="normal">54</span>
<span class="normal">55</span>
<span class="normal">56</span>
<span class="normal">57</span>
<span class="normal">58</span>
<span class="normal">59</span>
<span class="normal">60</span>
<span class="normal">61</span>
<span class="normal">62</span>
<span class="normal">63</span>
<span class="normal">64</span>
<span class="normal">65</span>
<span class="normal">66</span></pre></div></td><td class="code"><div><pre><span></span><code><span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Dict</span>
<span class="kn">import</span> <span class="nn">httpx</span>
<span class="kn">import</span> <span class="nn">asyncio</span>
<span class="k">async</span> <span class="k">def</span> <span class="nf">scrape_page</span><span class="p">(</span>
<span class="n">session</span><span class="p">:</span> <span class="n">httpx</span><span class="o">.</span><span class="n">AsyncClient</span><span class="p">,</span>
<span class="n">offset</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span>
<span class="n">page_size</span><span class="o">=</span><span class="mi">500</span><span class="p">,</span> <span class="c1"># note: we can increase this from default 40 to something higher!</span>
<span class="n">sort</span><span class="o">=</span><span class="s2">"ascPrice"</span><span class="p">,</span>
<span class="p">)</span> <span class="o">-></span> <span class="n">Dict</span><span class="p">:</span>
<span class="w"> </span><span class="sd">"""Scrape a single hm.com product query page"""</span>
<span class="n">url</span> <span class="o">=</span> <span class="s2">"https://www2.hm.com/en_us/search-results/_jcr_content/search.display.json"</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"scraping range </span><span class="si">{</span><span class="n">offset</span><span class="si">}</span><span class="s2">:</span><span class="si">{</span><span class="n">offset</span><span class="o">+</span><span class="n">page_size</span><span class="si">}</span><span class="s2"> sorted by: </span><span class="si">{</span><span class="n">sort</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
<span class="n">response</span> <span class="o">=</span> <span class="k">await</span> <span class="n">session</span><span class="o">.</span><span class="n">get</span><span class="p">(</span>
<span class="n">url</span><span class="o">=</span><span class="n">url</span><span class="p">,</span>
<span class="n">timeout</span><span class="o">=</span><span class="n">httpx</span><span class="o">.</span><span class="n">Timeout</span><span class="p">(</span><span class="mi">120</span><span class="p">),</span>
<span class="n">params</span><span class="o">=</span><span class="p">{</span>
<span class="s2">"q"</span><span class="p">:</span> <span class="s2">" "</span><span class="p">,</span> <span class="c1"># note: http client will automatically turn this to "%20"</span>
<span class="s2">"offset"</span><span class="p">:</span> <span class="n">offset</span><span class="p">,</span>
<span class="s2">"page-size"</span><span class="p">:</span> <span class="n">page_size</span><span class="p">,</span>
<span class="s2">"sort"</span><span class="p">:</span> <span class="n">sort</span><span class="p">,</span>
<span class="p">},</span>
<span class="p">)</span>
<span class="k">return</span> <span class="n">response</span><span class="o">.</span><span class="n">json</span><span class="p">()</span>
<span class="k">async</span> <span class="k">def</span> <span class="nf">scrape_hmcom</span><span class="p">():</span>
<span class="c1"># we need to fake any browser User-Agent to get around primitive bot detection</span>
<span class="n">headers</span> <span class="o">=</span> <span class="p">{</span>
<span class="s2">"User-Agent"</span><span class="p">:</span> <span class="s2">"Mozilla/5.0 (X11; Linux x86_64; rv:92.0) Gecko/20100101 Firefox/92.0"</span>
<span class="p">}</span>
<span class="n">products</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">async</span> <span class="k">with</span> <span class="n">httpx</span><span class="o">.</span><span class="n">AsyncClient</span><span class="p">(</span><span class="n">headers</span><span class="o">=</span><span class="n">headers</span><span class="p">)</span> <span class="k">as</span> <span class="n">session</span><span class="p">:</span>
<span class="c1"># lets start by scraping first page</span>
<span class="n">first_page</span> <span class="o">=</span> <span class="k">await</span> <span class="n">scrape_page</span><span class="p">(</span><span class="n">session</span><span class="p">)</span>
<span class="n">products</span><span class="o">.</span><span class="n">extend</span><span class="p">(</span><span class="n">first_page</span><span class="p">[</span><span class="s2">"products"</span><span class="p">])</span>
<span class="c1"># First page contains total amount of results this query contains</span>
<span class="c1"># using this we can create task for each bach of query and</span>
<span class="c1"># execute it concurrently</span>
<span class="n">tasks</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">offset</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span>
<span class="n">first_page</span><span class="p">[</span><span class="s2">"itemsShown"</span><span class="p">],</span> <span class="n">first_page</span><span class="p">[</span><span class="s2">"total"</span><span class="p">],</span> <span class="n">first_page</span><span class="p">[</span><span class="s2">"itemsShown"</span><span class="p">]</span>
<span class="p">):</span>
<span class="c1"># for first 10_000 scrape as usual</span>
<span class="k">if</span> <span class="n">offset</span> <span class="o"><</span> <span class="mi">10_000</span><span class="p">:</span>
<span class="n">sorting</span> <span class="o">=</span> <span class="s2">"ascPrice"</span>
<span class="c1"># for query > 10_000 start over with reversed ordering</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">sorting</span> <span class="o">=</span> <span class="s2">"descPrice"</span>
<span class="n">offset</span> <span class="o">-=</span> <span class="mi">10_000</span>
<span class="n">tasks</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">scrape_page</span><span class="p">(</span><span class="n">session</span><span class="p">,</span> <span class="n">offset</span><span class="o">=</span><span class="n">offset</span><span class="p">,</span> <span class="n">sort</span><span class="o">=</span><span class="n">sorting</span><span class="p">))</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"Scheduling </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">tasks</span><span class="p">)</span><span class="si">}</span><span class="s2"> scrape tasks concurrently"</span><span class="p">)</span>
<span class="c1"># with our scraping tasks in order it's time</span>
<span class="c1"># to execute them concurrently using asyncio.as_completed wrapper</span>
<span class="k">for</span> <span class="n">scrape_task</span> <span class="ow">in</span> <span class="n">asyncio</span><span class="o">.</span><span class="n">as_completed</span><span class="p">(</span><span class="n">tasks</span><span class="p">):</span>
<span class="n">result</span> <span class="o">=</span> <span class="k">await</span> <span class="n">scrape_task</span>
<span class="n">products</span><span class="o">.</span><span class="n">extend</span><span class="p">(</span><span class="n">result</span><span class="p">[</span><span class="s2">"products"</span><span class="p">])</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"found </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">products</span><span class="p">)</span><span class="si">}</span><span class="s2"> products in </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">tasks</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mi">1</span><span class="si">}</span><span class="s2"> page requests"</span><span class="p">)</span>
<span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s2">"__main__"</span><span class="p">:</span>
<span class="n">asyncio</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="n">scrape_hmcom</span><span class="p">())</span>
</code></pre></div></td></tr></table></div>
<p>Here we used asynchronous Python with <code>httpx</code> as our HTTP client library to scrape all 13790 products in very few requests and just a few minutes!</p>
<h2 id="summary-and-further-reading_1">Summary and Further Reading</h2>
<p>To summarize, reverse engineering a website's search API is a brilliant scrape target discovery technique; however, it's more difficult to develop, as it requires reverse-engineering effort and the resulting code is very domain-specific. </p>
<hr/>
<p>For more target discovery techniques, see the <a href="/tag/discovery-methods.html">#discovery-methods</a> and <a href="/tag/discovery.html">#discovery</a> tags. </p>
<p>If you have any questions, come join us on <a href="https://matrix.to/#/%23web-scraping:matrix.org">#web-scraping on matrix</a>, check out <a href="https://stackoverflow.com/questions/tagged/web-scraping">#web-scraping on stackoverflow</a> or leave a comment below! </p>
<p>As always, you can hire me for web-scraping consultation over at the <a href="/pages/hire.html">hire</a> page - and happy scraping! </p>
<hr/>
<figcaption>image credits: "Gears" by Kamal H. is licensed under CC BY-ND 2.0</figcaption>Web Scraping Target Discovery: Sitemaps2021-09-28T00:00:00+02:002021-09-28T00:00:00+02:00Bernardas Ališauskastag:scrapecrow.com,2021-09-28:/web-scraping-discovery-sitemaps.html<p>There are many techniques when it comes to discovering web-scraping targets. One of the most common is to use website sitemap indexes. What are they and how can we take advantage of them in web scraping?</p><p><a href="/images/banner-map.jpg"><img class="fullc" loading="lazy" src="/images/banner-map.jpg" title=""/></a><figcaption></figcaption></p>
<p>Most web scrapers are made up of two core parts: finding products on the website and actually scraping them. The former is often referred to as the "target discovery" step. For example, to scrape product data from an e-commerce website, we first need to find the URLs of each individual product - only then can we scrape their data.</p>
<p>Discovering targets to scrape is often a challenging and important task. This series of blog posts tagged with <a href="/tag/discovery-methods.html">#discovery-methods</a> (also see the <a href="/web-scraping-discovery.html">main article</a>) covers common target discovery approaches.</p>
<p>In this article we'll cover one particular discovery method: using website sitemaps to find our scrape targets.</p>
<h2 id="what-are-sitemaps-and-how-are-they-used-in-web-scraping">What are sitemaps and how are they used in web-scraping?</h2>
<p>A sitemap is an index document generated by websites for web crawlers and indexers. For example, websites that want to be crawled by Google provide an index of their pages so Google's crawlers can index them quicker. </p>
<p>To put it shortly, sitemap files are XML documents (often gzip-compressed) that contain URL locations and some metadata about them:</p>
<div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal">1</span>
<span class="normal">2</span>
<span class="normal">3</span>
<span class="normal">4</span>
<span class="normal">5</span>
<span class="normal">6</span>
<span class="normal">7</span>
<span class="normal">8</span>
<span class="normal">9</span></pre></div></td><td class="code"><div><pre><span></span><code><span class="cp"><?xml version="1.0" encoding="UTF-8"?></span>
<span class="nt"><urlset</span><span class="w"> </span><span class="na">xmlns=</span><span class="s">"http://www.sitemaps.org/schemas/sitemap/0.9"</span><span class="nt">></span>
<span class="w"> </span><span class="nt"><url></span>
<span class="w"> </span><span class="nt"><loc></span>http://www.example.com/<span class="nt"></loc></span>
<span class="w"> </span><span class="nt"><lastmod></span>2005-01-01<span class="nt"></lastmod></span>
<span class="w"> </span><span class="nt"><changefreq></span>monthly<span class="nt"></changefreq></span>
<span class="w"> </span><span class="nt"><priority></span>0.8<span class="nt"></priority></span>
<span class="w"> </span><span class="nt"></url></span>
<span class="nt"></urlset></span><span class="w"> </span>
</code></pre></div></td></tr></table></div>
<p class="info">for more on sitemap structure rules, see <a href="https://www.sitemaps.org/protocol.html">official specification page</a></p>
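<p>As a quick illustration, the example document above can be parsed with Python's standard library (a minimal sketch using <code>xml.etree</code>; real sitemap files may be gzip-compressed but carry the same namespace):</p>

```python
import xml.etree.ElementTree as ET

SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2005-01-01</lastmod>
  </url>
</urlset>"""

# sitemap elements live in the sitemaps.org namespace
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

root = ET.fromstring(SITEMAP_XML)
# every <url> entry carries a <loc> with the page location
locations = [url.find("sm:loc", NS).text for url in root.findall("sm:url", NS)]
print(locations)  # → ['http://www.example.com/']
```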
<p>The documents themselves are usually categorized by name, for example: </p>
<ul>
<li>Blog posts of the website would be contained in <code>sitemap_blogs.xml</code>. </li>
<li>Sold products might be split across multiple files: <code>sitemap_products_1.xml</code>, <code>sitemap_products_2.xml</code> etc. </li>
</ul>
<p>Before using sitemaps as a web scraping discovery strategy, it's good practice to reflect on the common pros and cons of this technique and see whether it fits your web-scraping project:</p>
<p>Pros: </p>
<ul>
<li><strong>Efficiency</strong>: A single sitemap can contain thousands of items, and often an entire catalog can be discovered in just a few requests! </li>
<li><strong>Simplicity</strong>: There's no need for advanced reverse-engineering knowledge to use sitemap-based discovery. </li>
</ul>
<p>Cons: </p>
<ul>
<li><strong>Data Staleness</strong>: Sitemap indexes need to be generated by the website explicitly, and sometimes newer products might not appear in the index for a significant amount of time. </li>
<li><strong>Data Validity</strong>: Because of sitemap staleness, some product links might be expired or invalid, causing unnecessary load on your scraper.</li>
<li><strong>Data Completeness</strong>: Since sitemaps are generated for crawlers and indexers, they might not contain all the data available on the website. For this reason it is important to confirm sitemap coverage during scraper development. </li>
<li><strong>Availability</strong>: Sitemaps are (or used to be) an important part of the web, particularly for SEO; however, they are not always present on modern websites that try to avoid web scraping, use hard-to-index structures, or are simply too big for such indexes. </li>
<li><strong>Risk</strong>: Some websites use sitemaps as honeypots for web scrapers, directing them to invalid data or using them to identify and ban scrapers.</li>
</ul>
<p>As you can see, the sitemap discovery approach is simple and efficient, though not always viable. Generally, when developing a discovery strategy, sitemaps are the first place I look for product data; I then confirm their quality by trying alternative discovery approaches and checking whether the coverage matches. </p>
<h2 id="finding-sitemaps">Finding Sitemaps</h2>
<p>To take advantage of sitemaps, we first need to figure out how to find them. A common way is to check the <code>robots.txt</code> or <code>sitemap.xml</code> file.<br/>
For example, let's take the popular clothing shop <code>hm.com</code>:</p>
<p>First we would go to the <code>/robots.txt</code> page: <a href="https://hm.com/robots.txt">https://hm.com/robots.txt</a>:</p>
<div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal">1</span>
<span class="normal">2</span>
<span class="normal">3</span>
<span class="normal">4</span>
<span class="normal">5</span>
<span class="normal">6</span>
<span class="normal">7</span></pre></div></td><td class="code"><div><pre><span></span><code><span class="n">User-Agent: *</span>
<span class="n">Request-rate: 2/1 0000-0200</span>
<span class="n">Request-rate: 1/2 0200-0900</span>
<span class="n">Disallow: /alive/user</span>
<span class="n">Disallow: /m/</span>
<span class="n">...</span>
<span class="n">Sitemap: http://www2.hm.com/sitemapindex.xml</span>
</code></pre></div></td></tr></table></div>
<p>We see some robot scraping rules and a link to the sitemap index! If we proceed to the sitemap index at <a href="http://www2.hm.com/sitemapindex.xml">http://www2.hm.com/sitemapindex.xml</a> we can see:</p>
<div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal"> 1</span>
<span class="normal"> 2</span>
<span class="normal"> 3</span>
<span class="normal"> 4</span>
<span class="normal"> 5</span>
<span class="normal"> 6</span>
<span class="normal"> 7</span>
<span class="normal"> 8</span>
<span class="normal"> 9</span>
<span class="normal">10</span>
<span class="normal">11</span>
<span class="normal">12</span>
<span class="normal">13</span>
<span class="normal">14</span></pre></div></td><td class="code"><div><pre><span></span><code>...
<span class="nt"><sitemap></span>
<span class="w"> </span><span class="nt"><loc></span>https://www2.hm.com/en_in.sitemap.xml<span class="nt"></loc></span>
<span class="w"> </span><span class="nt"><lastmod></span>2021-09-29<span class="nt"></lastmod></span>
<span class="nt"></sitemap></span>
<span class="nt"><sitemap></span>
<span class="w"> </span><span class="nt"><loc></span>https://www2.hm.com/en_us.sitemap.xml<span class="nt"></loc></span>
<span class="w"> </span><span class="nt"><lastmod></span>2021-09-29<span class="nt"></lastmod></span>
<span class="nt"></sitemap></span>
<span class="nt"><sitemap></span>
<span class="w"> </span><span class="nt"><loc></span>https://www2.hm.com/de_de.sitemap.xml<span class="nt"></loc></span>
<span class="w"> </span><span class="nt"><lastmod></span>2021-09-29<span class="nt"></lastmod></span>
<span class="nt"></sitemap></span>
...
</code></pre></div></td></tr></table></div>
<p>The index is split into localized parts; let's continue to the <code>en_us</code> index (or whichever you prefer - they should function the same): <a href="https://www2.hm.com/en_us.sitemap.xml">https://www2.hm.com/en_us.sitemap.xml</a></p>
<div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal"> 1</span>
<span class="normal"> 2</span>
<span class="normal"> 3</span>
<span class="normal"> 4</span>
<span class="normal"> 5</span>
<span class="normal"> 6</span>
<span class="normal"> 7</span>
<span class="normal"> 8</span>
<span class="normal"> 9</span>
<span class="normal">10</span>
<span class="normal">11</span>
<span class="normal">12</span>
<span class="normal">13</span>
<span class="normal">14</span>
<span class="normal">15</span>
<span class="normal">16</span>
<span class="normal">17</span></pre></div></td><td class="code"><div><pre><span></span><code><span class="nt"><sitemapindex</span><span class="w"> </span><span class="na">xmlns:xsi=</span><span class="s">"https://www.w3.org/2001/XMLSchema-instance"</span><span class="w"> </span><span class="na">xmlns=</span><span class="s">"https://www.sitemaps.org/schemas/sitemap/0.9"</span><span class="w"> </span><span class="na">xsi:schemaLocation=</span><span class="s">"https://www.sitemaps.org/schemas/sitemap/0.9 https://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd"</span><span class="nt">></span>
<span class="nt"><sitemap></span>
<span class="w"> </span><span class="nt"><loc></span>https://www2.hm.com/en_us.pages.0.xml<span class="nt"></loc></span>
<span class="w"> </span><span class="nt"><lastmod></span>2021-09-29<span class="nt"></lastmod></span>
<span class="nt"></sitemap></span>
<span class="nt"><sitemap></span>
<span class="w"> </span><span class="nt"><loc></span>https://www2.hm.com/en_us.store.0.xml<span class="nt"></loc></span>
<span class="w"> </span><span class="nt"><lastmod></span>2021-09-29<span class="nt"></lastmod></span>
<span class="nt"></sitemap></span>
<span class="nt"><sitemap></span>
<span class="w"> </span><span class="nt"><loc></span>https://www2.hm.com/en_us.product.0.xml<span class="nt"></loc></span>
<span class="w"> </span><span class="nt"><lastmod></span>2021-09-29<span class="nt"></lastmod></span>
<span class="nt"></sitemap></span>
<span class="nt"><sitemap></span>
<span class="w"> </span><span class="nt"><loc></span>https://www2.hm.com/en_us.product.1.xml<span class="nt"></loc></span>
<span class="w"> </span><span class="nt"><lastmod></span>2021-09-29<span class="nt"></lastmod></span>
<span class="nt"></sitemap></span>
</code></pre></div></td></tr></table></div>
<p>This is an index of sitemap indexes. We can see there are indexes for articles, pages, categories etc. - but most importantly, the product indexes: <code>...product.N.xml</code>. <br/>
There's some important metadata here as well: the <code><lastmod></code> field shows when each index was last updated. In this case the index is 1 day old, so this discovery approach will not pick up products added in the last few hours.</p>
<p class="info">Every website engine generates sitemaps at different times: some regenerate them once a day or week, often indicated by the <code><changefreq>always|hourly|daily|...</changefreq></code> attribute. Modern, smaller websites often generate them on demand whenever the product index is updated, which is great for web scrapers!</p>
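<p>To act on <code>lastmod</code>, a scraper can skip sitemaps that haven't changed since its last run. Here's a minimal sketch using only Python's standard library - the sitemap snippet and the cutoff date are illustrative stand-ins, not real hm.com data:</p>

```python
from datetime import date
from xml.etree import ElementTree

# illustrative sitemap index snippet, shaped like hm.com's
SITEMAP_INDEX = """
<sitemapindex xmlns="https://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www2.hm.com/en_us.product.0.xml</loc>
    <lastmod>2021-09-29</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://www2.hm.com/en_us.product.1.xml</loc>
    <lastmod>2021-09-01</lastmod>
  </sitemap>
</sitemapindex>
"""

def fresh_sitemaps(xml_text: str, newer_than: date) -> list:
    """return sitemap urls whose <lastmod> is newer than the given date"""
    ns = {"sm": "https://www.sitemaps.org/schemas/sitemap/0.9"}
    tree = ElementTree.fromstring(xml_text)
    fresh = []
    for sitemap in tree.findall("sm:sitemap", ns):
        loc = sitemap.findtext("sm:loc", namespaces=ns)
        lastmod = sitemap.findtext("sm:lastmod", namespaces=ns)
        # keep only sitemaps modified after our cutoff date
        if lastmod and date.fromisoformat(lastmod.strip()) > newer_than:
            fresh.append(loc.strip())
    return fresh

print(fresh_sitemaps(SITEMAP_INDEX, date(2021, 9, 15)))
# only en_us.product.0.xml changed after the cutoff
```

In a real scraper the cutoff would be the timestamp of the previous scrape run, letting us re-scrape only the sitemaps that actually changed.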
<h2 id="example-use-case-hmcom">Example Use Case: HM.com</h2>
<p>Let's write a simple sitemap scraper that will find all product urls on the previously mentioned website <a href="https://hm.com">https://hm.com</a>. For this we'll be using Python with the <code>requests</code> and <code>parsel</code> libraries:</p>
<div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal"> 1</span>
<span class="normal"> 2</span>
<span class="normal"> 3</span>
<span class="normal"> 4</span>
<span class="normal"> 5</span>
<span class="normal"> 6</span>
<span class="normal"> 7</span>
<span class="normal"> 8</span>
<span class="normal"> 9</span>
<span class="normal">10</span>
<span class="normal">11</span>
<span class="normal">12</span>
<span class="normal">13</span>
<span class="normal">14</span>
<span class="normal">15</span>
<span class="normal">16</span>
<span class="normal">17</span>
<span class="normal">18</span>
<span class="normal">19</span>
<span class="normal">20</span>
<span class="normal">21</span>
<span class="normal">22</span>
<span class="normal">23</span>
<span class="normal">24</span>
<span class="normal">25</span>
<span class="normal">26</span>
<span class="normal">27</span>
<span class="normal">28</span>
<span class="normal">29</span>
<span class="normal">30</span>
<span class="normal">31</span>
<span class="normal">32</span>
<span class="normal">33</span>
<span class="normal">34</span>
<span class="normal">35</span>
<span class="normal">36</span>
<span class="normal">37</span>
<span class="normal">38</span>
<span class="normal">39</span>
<span class="normal">40</span></pre></div></td><td class="code"><div><pre><span></span><code><span class="c1"># requires:</span>
<span class="c1"># pip install requests parsel</span>
<span class="kn">import</span> <span class="nn">re</span>
<span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">List</span>
<span class="kn">import</span> <span class="nn">requests</span>
<span class="kn">from</span> <span class="nn">parsel</span> <span class="kn">import</span> <span class="n">Selector</span>
<span class="k">def</span> <span class="nf">parse_sitemap</span><span class="p">(</span><span class="n">url</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-></span> <span class="n">List</span><span class="p">[</span><span class="nb">str</span><span class="p">]:</span>
<span class="w"> </span><span class="sd">"""scrape sitemap and item urls from a sitemap link"""</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"scraping: </span><span class="si">{</span><span class="n">url</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
<span class="n">resp</span> <span class="o">=</span> <span class="n">requests</span><span class="o">.</span><span class="n">get</span><span class="p">(</span>
<span class="n">url</span><span class="p">,</span>
<span class="n">headers</span><span class="o">=</span><span class="p">{</span>
<span class="c1"># we need to fake a browser User-Agent string to get through CDN bot protection</span>
<span class="s2">"User-Agent"</span><span class="p">:</span> <span class="s2">"Mozilla/5.0 (X11; Linux x86_64; rv:92.0) Gecko/20100101 Firefox/92.0"</span>
<span class="p">},</span>
<span class="p">)</span>
<span class="c1"># turn html text to a parsable tree object</span>
<span class="n">doc_tree</span> <span class="o">=</span> <span class="n">Selector</span><span class="p">(</span><span class="n">resp</span><span class="o">.</span><span class="n">text</span><span class="p">)</span>
<span class="c1"># find all <loc> nodes and take their text (which is a url)</span>
<span class="n">urls</span> <span class="o">=</span> <span class="n">doc_tree</span><span class="o">.</span><span class="n">xpath</span><span class="p">(</span><span class="s2">"//loc/text()"</span><span class="p">)</span><span class="o">.</span><span class="n">getall</span><span class="p">()</span>
<span class="k">return</span> <span class="n">urls</span>
<span class="n">product_urls</span> <span class="o">=</span> <span class="nb">set</span><span class="p">()</span>
<span class="n">sitemap_directory</span> <span class="o">=</span> <span class="s2">"https://www2.hm.com/en_us.sitemap.xml"</span>
<span class="k">for</span> <span class="n">url</span> <span class="ow">in</span> <span class="n">parse_sitemap</span><span class="p">(</span><span class="n">sitemap_directory</span><span class="p">):</span>
<span class="k">if</span> <span class="s2">".product."</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">url</span><span class="p">:</span>
<span class="k">continue</span>
<span class="k">for</span> <span class="n">url</span> <span class="ow">in</span> <span class="n">parse_sitemap</span><span class="p">(</span><span class="n">url</span><span class="p">):</span>
<span class="c1"># product urls match pattern com/<some product naming>.html</span>
<span class="c1"># skip non-product urls</span>
<span class="k">if</span> <span class="ow">not</span> <span class="n">re</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="sa">r</span><span class="s2">"hm.com/.+?\.html"</span><span class="p">,</span> <span class="n">url</span><span class="p">):</span>
<span class="k">continue</span>
<span class="n">product_urls</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"</span><span class="se">\n</span><span class="s2">"</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">product_urls</span><span class="p">)[:</span><span class="mi">3</span><span class="p">]))</span>
<span class="nb">print</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">product_urls</span><span class="p">))</span>
</code></pre></div></td></tr></table></div>
<p>If we run this small scraper script, we'll see that the sitemap discovery approach yields 13639 unique results (at the time of writing)! Even when scraping synchronously, the sitemap approach is a really efficient way to discover a large amount of products.</p>
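<p>Since each product sitemap is independent, the second loop above parallelizes well. Here's a minimal sketch using Python's <code>concurrent.futures</code> thread pool - the <code>fetch</code> function and the inline sitemap bodies are made-up stand-ins so the example runs offline; in a real scraper <code>fetch</code> would wrap <code>requests.get</code>:</p>

```python
import re
from concurrent.futures import ThreadPoolExecutor

# made-up sitemap bodies standing in for real HTTP responses
FAKE_PAGES = {
    "https://www2.hm.com/en_us.product.0.xml": (
        "<urlset><url><loc>https://www2.hm.com/en_us/productpage.0001.html</loc></url></urlset>"
    ),
    "https://www2.hm.com/en_us.product.1.xml": (
        "<urlset><url><loc>https://www2.hm.com/en_us/productpage.0002.html</loc></url></urlset>"
    ),
}

def fetch(url: str) -> str:
    """stand-in for an HTTP GET; swap in requests.get(url).text for real scraping"""
    return FAKE_PAGES[url]

def scrape_product_sitemaps(sitemap_urls, fetch=fetch, workers=10):
    """fetch product sitemaps concurrently and collect product page urls"""
    product_urls = set()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map fetches up to `workers` sitemaps at a time
        for body in pool.map(fetch, sitemap_urls):
            for url in re.findall(r"<loc>(.+?)</loc>", body):
                # keep only product page urls, same pattern as before
                if re.search(r"hm\.com/.+?\.html", url):
                    product_urls.add(url)
    return product_urls

print(len(scrape_product_sitemaps(list(FAKE_PAGES))))  # 2
```

With a real HTTP-backed <code>fetch</code>, this turns a few hundred sequential sitemap requests into a handful of concurrent batches.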
<h2 id="confirming-results">Confirming Results</h2>
<p>Finally, we should confirm whether this discovery approach has good coverage by comparing it with other discovery approaches. For that we either need to find the total product count somewhere (some websites mention "total N results available" in their content) or use another discovery strategy to evaluate our coverage against. For this particular website we can take a look at the <a href="/web-scraping-discovery-search.html">search bar discovery approach</a> covered in another Scrapecrow article:</p>
<p><a href="/images/hm.com-space-search.png"><img class="bigc" loading="lazy" src="/images/hm.com-space-search.png" title="13780 results found by searching for url quote space character <code>%20</code>"/></a><figcaption>13780 results found by searching for url quote space character <code>%20</code></figcaption></p>
<p>Using the empty search approach described in the <a href="/web-scraping-discovery-search.html">search discovery article</a>, we can see that our sitemap discovery yields almost the same amount of results:</p>
<ul>
<li>Sitemaps: 13639</li>
<li>Searchbar: 13780</li>
</ul>
<p>The 141 results we're missing are probably an indication that the sitemap index is running slightly behind the product database. This is a good illustration of how discovery techniques differ and why they matter: for important scrapers it's a good idea to diversify.</p>
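<p>This kind of cross-check is easy to automate with Python sets. A small sketch - the helper name and the URL values here are hypothetical, made up for illustration:</p>

```python
def coverage_report(sitemap_urls: set, search_urls: set) -> dict:
    """compare two discovery strategies: overlap and what each one missed"""
    return {
        "both": len(sitemap_urls & search_urls),
        "sitemap_only": sitemap_urls - search_urls,
        "search_only": search_urls - sitemap_urls,
    }

# hypothetical example: search discovery found one url the sitemap misses
sitemaps = {"/product.1.html", "/product.2.html"}
search = {"/product.1.html", "/product.2.html", "/product.3.html"}
print(coverage_report(sitemaps, search))
```

Anything in <code>search_only</code> is a candidate for products the sitemap index hasn't picked up yet, which is exactly the 141-result gap we saw above.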
<h2 id="summary-and-further-reading">Summary and Further Reading</h2>
<p>To summarize, using sitemaps in web scraping is an efficient, effective and quick product discovery technique, with the only real downsides being data staleness, coverage and availability.</p>
<hr/>
<p>For more web scraping discovery techniques, see the <a href="/tag/discovery-methods.html">#discovery-methods</a> and <a href="/tag/discovery.html">#discovery</a> tags.</p>
<p>If you have any questions, come join us on <a href="https://matrix.to/#/%23web-scraping:matrix.org">#web-scraping on matrix</a>, check out <a href="https://stackoverflow.com/questions/tagged/web-scraping">#web-scraping on stackoverflow</a> or leave a comment below! </p>
<p>As always, you can hire me for web scraping consultation over at the <a href="/pages/hire.html">hire</a> page. Happy scraping!</p>
<hr/>
<figcaption>image credits: "Map of North America" by NASA Johnson is licensed under CC BY-NC 2.0</figcaption>