At the beginning of 2022 I joined Scrapfly.io - a small web scraping API company. One of my main roles is keeping our blog running with various tutorials and web scraping highlights.
So, to wrap up this year, here's a list of my favorite articles I've worked on in 2022 and what I've learned from working on them!
Web scraping is a very accessible niche. You can write a scraper in almost any programming language as long as it has a viable HTTP client (which is basically any language).
At ScrapFly, we have a diverse client base using all sorts of programming languages to scrape the web using our API. So, I wrote multiple in-depth and up-to-date introductions to web scraping in several languages which was quite a journey on its own:
- Python is the most popular language in this medium for a reason - it's a great data language with strong async support and some of the best community libraries out there.
- R language is surprisingly effective. The pipe operator fits into web scraping flows perfectly and is really fun to use.
- PHP has a lot of great web tools that fit into web scraping though generally it's not a very good data language.
- Ruby feels a lot like Python just with a smaller community and fewer tools.
This was a great exercise to understand what makes a good web scraping environment. Working on these articles provided really valuable horizontal growth as it presented the problem of web scraping from different perspectives. Every language seems to have something unique and special about it and having a general overview did help me to improve my development techniques and workflows.
Another way we tried to reach out to scraper developers is by writing hands-on scrape guides for popular scraping targets like:
And dozens of others.
For these, I had to choose a standard set of tools and libraries which was a difficult challenge. Using browser automation tools like Selenium, Playwright or Puppeteer can often be easier but they don't fit the tutorial format well as browsers are complicated and work differently on different user machines.
With this setup, there's more focus on reverse engineering and web technologies rather than automation but it's easier to figure out how to explain technical concepts than to deal with replication issues of web browsers.
As for the library stack:

- `httpx` is a brilliant HTTP client which supports HTTP/2 and respects RFC standards. It's probably the best HTTP client out there outside of cURL.
- `parsel` supports both CSS and XPath selectors for HTML parsing, which are by far the best ways to parse HTML.
- `jmespath` is the most accessible way to parse JSON datasets. Usually, I'd favor jsonpath as it allows non-root based selections (e.g. find me any node named `foo` anywhere) but jsonpath implementations aren't very great.

Finally, Python itself comes with loads of built-in tools like the `urllib` libraries that really fit into the web scraping world.
Each popular target is surprisingly different though I noticed how prevalent client-side rendering is in the modern web.
Once we started working on scrape guides we noticed a lot of repetition.
So, it seemed appropriate to cover each of these subjects individually. This led us even further - what are some other key web scraping subjects that could be referenced anywhere? Here are a few we've settled on:
- How to ensure web scraped data quality? - data validation is boring but vital for consistent data delivery.
- How to turn web scrapers into data APIs? - real time scraping API with FastAPI and async Python.
- How to rotate proxies? - weighted randomization!
- web scraper scaling in Python - Asyncio, threads and subprocesses from the point of view of web scraping.
- How to rate limit requests?
- How to use browser automation tools like Puppeteer, Playwright and Selenium
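To illustrate the "weighted randomization" idea from the proxy rotation article, here's a minimal pure-Python sketch. The proxy URLs and weights are hypothetical; the point is only that `random.choices` biases the selection towards healthier proxies:

```python
import random

# Hypothetical proxy pool: better proxies get bigger weights
proxies = {
    "http://proxy-1.example.com:8000": 10,  # fast residential proxy
    "http://proxy-2.example.com:8000": 5,   # slower datacenter proxy
    "http://proxy-3.example.com:8000": 1,   # flaky backup proxy
}

def pick_proxy(pool: dict) -> str:
    """Pick a proxy at random, biased towards higher-weighted entries."""
    urls = list(pool)
    weights = list(pool.values())
    return random.choices(urls, weights=weights, k=1)[0]

# Over many picks, proxy-1 gets selected roughly 10/16 of the time
counts = {url: 0 for url in proxies}
for _ in range(1000):
    counts[pick_proxy(proxies)] += 1
print(counts)
```

A real rotator would also adjust the weights at runtime, e.g. penalizing proxies that start returning blocks.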
Most of the topics came up naturally as we were reviewing and developing our articles, though some came from our customers and from questions on popular forums like Stack Overflow or Reddit.
The most difficult and interesting subject in web scraping is definitely scraper blocking. To cover it I got a lot of help from my teammates and put together a definitive guide on web scraper blocking, which took a lot of research and turned out to be our readers' favorite!
I thought I was pretty well versed in the blocking subject, but stuff like how powerful TLS fingerprinting is really blew my mind. There are just so many ways to fingerprint and identify users based on their connections.
My favorite article, however, is creating search engines using web scraping where we demo how any website can be quickly indexed with a bit of scraper code, some clever HTML parsing and an index front-end like LunrJS. I've adopted this technique in many other different projects as having a quick data index is really useful!
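The core of that technique is reducing each scraped page to a small searchable document and feeding the collection to an index. The article uses LunrJS as the front-end index; as a toy stand-in for the idea, here's a tiny inverted index in pure Python (the document contents are made up):

```python
from collections import defaultdict

# Hypothetical scraped documents: url -> text extracted by HTML parsing
docs = {
    "/docs/install": "install the package with pip",
    "/docs/usage": "basic usage of the package",
}

# Inverted index: word -> set of urls containing it
index = defaultdict(set)
for url, text in docs.items():
    for word in text.lower().split():
        index[word].add(url)

def search(query: str) -> set:
    """Return urls containing every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

print(search("package"))      # matches both urls
print(search("pip install"))  # matches only /docs/install
```

In practice the scraped documents are serialized to JSON and handed to LunrJS, which does the indexing client-side, but the principle is the same.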
We spent a lot of time designing our blog to function as a knowledge graph rather than a series of individual guides. This paid off pretty well as SEO seems to like references.
One slight disappointment was how cool features can backfire when it comes to SEO. For our CSS and XPath articles, we built real-time on-page testing widgets that the search engines didn't like; it took a while to optimize them before they finally paid off.
We also tried structuring our articles with revealable components (like `<details>`) which increased readability but noticeably worsened SEO.
Keyword diversity was another important challenge as we wanted all sorts of users to find our content - from beginner developers to seniors discovering web scraping.
Since web-scraping is still a small niche there aren't many well-established terms, so diversifying terms made a huge difference. Sometimes increasing monthly readers tenfold!
I wrote close to 60 articles that are on average 3,000 words long. All of them started as ~6,000-word drafts and got edited into something more dense and digestible.
Keeping things short is easier said than done, though. The real challenge is making all of these conflicting concerns work together:
- SEO keywords favor longer text allowing more keyword diversity.
- Keyword repetition is bad for SEO but good for comprehension.
- Shorter explanations might miss vital details but longer ones can be more difficult to comprehend or just bore the reader.
- Which problems deserve to be explained and which don't?
To start addressing these, I focused more on the structure of the article. For this, the table of contents is such a gem of a feature!
TOC is great for the reader to navigate and overview the article but it's also great for the editor as it helps to visualize: is this the right order to approach this problem? What's missing and what shouldn't be there?
For example, in our scraping guide articles, I quickly learned that starting with a short example or how to scrape a single product page is much more approachable than following the traditional project order.
Getting rid of factoids and history lessons helped too. People are probably interested in the fact that the Referer header is officially misspelled and it's a good thing to know but do they really need to know more about this?
Media use can also be a major help. A well-placed image or video can save several paragraphs of text and provide some breathing space.
The Great Beyond!
Exploring the web scraping subject is a real pleasure - it's vast and challenging, though the best part is that at the end of the day you see concrete data results. Getting a quality dataset conjured up from nothing is immensely satisfying, which makes all of that work worth it!
We still have loads of ideas on educational web scraping content that we will be working on in 2023. We've started work on a collection of web scraping tools and a few more secret projects that should pop up soon! :)