The Web Scraping Insider #3

Your high-signal, zero-fluff roundup of what’s happening in the world of professional web scraping.

👋 Hello there!

Welcome to the third edition of The Web Scraping Insider, by the ScrapeOps.io team.

Each week, this newsletter delivers a zero-fluff roundup of the most interesting content and news from the world of professional web scraping, straight to your inbox.

We hope you enjoy it, and we’re always open to your feedback and suggestions.

Let’s get to it!

🧮💸 Domain-Level Proxy Pricing: Are You Overpaying for Scraping Hard Sites?

Over the last few years, we’ve seen the emergence of domain-level proxy pricing, the basic idea being that you should pay more to scrape harder sites and less for easy ones. Sounds great in theory, but in practice? Users are unpredictably charged more for harder domains while getting no discounts on easy ones.

The end result is that it is now even more difficult to find the best & cheapest proxies for your use case. 

💡 Let’s Dive Into The Data

At ScrapeOps, we have billions of requests’ worth of historical data on all the providers, so we pulled recent June 2025 data on the top 10 Proxy API providers for scraping Amazon.com (pricing is based on a 1 million credit plan).

Disclaimer: This data is fully independent. No provider paid for inclusion or had prior knowledge. All data was taken from real-world production traffic, not synthetic benchmarks.

📈 Chart TL;DR: The chart plots the inverse of price against the inverse of average success latency, so top right is best: lowest cost and lowest latency.

Here is a breakdown of the results.

| Provider | CPM ($/1M requests) | Success Rate | Avg Success Latency | Notes |
| --- | --- | --- | --- | --- |
| Scrape.do | $99 | 100% | 6.62s | Top value performer |
| ScrapingBee | $99 | 100% | 8.74s | Solid balance, low cost |
| Scrapfly | $600 | 100% | 5.73s | Fast, but 6x more expensive |
| Bright Data Unlocker | $1,120 | 100% | 4.93s | Fastest, but 11x more expensive |
| ScraperAPI | $745 | 100% | 34.29s | High cost, low throughput |
| Smartproxy Unlocker | $1,400 | 0% | 0.00s | Complete failure on Amazon |
| Scrapingant | $98 | 70% | 15.24s | Cheapest, but unreliable |
| Infatica Web Scraper API | $900 | 90% | 11.39s | Expensive, but avg. performance |
| Scrapingfish | $2,000 | 100% | 29.04s | Most expensive, very slow |
| Zyte API | $175 | 100% | 22.54s | Low-ish cost, but slow latencies |

📊 Price ≠ Performance. Despite prices ranging from $98 to $2,000 per million requests, the price you pay (CPM, cost per million) has very little correlation with the performance you get (as measured by the average latency to achieve a successful response).
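You can sanity-check that claim directly from the table above. A quick sketch using the nine providers that returned successful requests (a tiny sample, so treat the number as directional only):

```python
# Pearson correlation between CPM and average success latency, using
# the figures from the table above (Python 3.10+ for statistics.correlation).
# Smartproxy is excluded: 0% success means no latency to measure.
import statistics

cpm = [99, 99, 600, 1120, 745, 98, 900, 2000, 175]
avg_latency = [6.62, 8.74, 5.73, 4.93, 34.29, 15.24, 11.39, 29.04, 22.54]

r = statistics.correlation(cpm, avg_latency)
print(f"Pearson r between CPM and avg success latency: {r:.2f}")
# On this sample r comes out weakly positive: the pricier providers
# were, if anything, slightly slower.
```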

💸 Cheap is sometimes the best. We are conditioned to believe you get what you pay for, but with proxies this is often not the case. Scrape.do is one of the cheapest yet has the 3rd best performance, with ScrapingBee not far behind, while most high-end providers fail to justify their cost.

🔻 Most premium proxies fail to justify their cost. Outside of Bright Data and Scrapfly, the premium proxy solutions offer worse performance than some of the cheapest. Even then, few could justify spending 11x more on Bright Data over Scrape.do just to shave ~1.7 seconds off each request.

🎭 Hidden price multipliers. Numerous proxy providers apply dynamic credit multipliers for scraping Amazon. ScraperAPI (5 credits), Zyte API (1.75 credits), Infatica Web Scraper API (10 credits), and Scrapfly (6 credits) all applied extra charges for this test, distorting the true cost of using their proxies to scrape Amazon.
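To see how much these multipliers distort headline pricing, here is a back-of-the-envelope sketch. The formula is simply effective CPM = price per 1M credits × credits charged per request; the base per-credit prices below are inferred from the table (effective CPM divided by the multiplier) and are illustrative only.

```python
# Sketch: how domain-level credit multipliers inflate the real cost per
# million requests (CPM) of scraping a hard domain like Amazon.
# Base prices are inferred (table CPM / multiplier), illustrative only.
providers = {
    # name: (usd_per_1M_credits, credits_per_amazon_request)
    "ScraperAPI": (149, 5),
    "Zyte API": (100, 1.75),
    "Infatica Web Scraper API": (90, 10),
    "Scrapfly": (100, 6),
}

for name, (base_cpm, credits) in providers.items():
    effective_cpm = base_cpm * credits
    print(f"{name:25s} ${base_cpm:>7.2f} headline -> ${effective_cpm:>8.2f} effective CPM")
```

The headline price you compare providers on is the left number; the price you actually pay on Amazon is the right one.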

🌐 It’s not just Amazon. We see the exact same situation play out across pretty much every domain (Walmart, Google, Shopee, etc.): hidden credit charges, wildly varying performance, and price being a very poor indicator of performance. Worse for users, the optimal proxy provider for each domain can vary significantly.

🚨 Bottom Line:
The price a proxy provider charges is often a very poor indicator of the performance it delivers, and dynamic domain pricing makes this even harder to untangle. Until providers offer full transparency and pricing that consistently tracks performance, you’ll need to benchmark providers for your use case yourself (or let ScrapeOps manage it for you), especially on difficult domains where dynamic pricing is applied.
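If you want to run this kind of benchmark yourself, the core loop is small: fire a batch of requests at your target domain through each provider and record success rate and average success latency. A minimal sketch; the endpoints, keys, and success heuristic below are placeholders, since every provider has its own request format and block pages need real validation:

```python
import statistics
import time

import requests

TARGET = "https://www.amazon.com/dp/B0EXAMPLE"  # hypothetical test URL
N_REQUESTS = 50

# Placeholder endpoints/keys: adapt to each provider's actual API, and
# URL-encode the target when building real requests.
PROVIDERS = {
    "provider_a": "https://api.provider-a.example/scrape?api_key=KEY&url={url}",
    "provider_b": "https://api.provider-b.example/v1?token=KEY&target={url}",
}

def benchmark(name: str, template: str) -> None:
    latencies, successes = [], 0
    for _ in range(N_REQUESTS):
        start = time.monotonic()
        try:
            resp = requests.get(template.format(url=TARGET), timeout=60)
            # Naive success check: 200 plus a plausibly sized body. Real
            # validation should also detect captcha/block pages.
            if resp.status_code == 200 and len(resp.text) > 10_000:
                successes += 1
                latencies.append(time.monotonic() - start)
        except requests.RequestException:
            pass
    avg = statistics.mean(latencies) if latencies else 0.0
    print(f"{name}: {successes / N_REQUESTS:.0%} success, {avg:.2f}s avg success latency")

for name, template in PROVIDERS.items():
    benchmark(name, template)
```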

🧾 Pay-Per-Crawl: Innovation or Illusion?

Cloudflare recently launched Pay-Per-Crawl, a native monetization layer that lets websites charge bots for crawling. It generated a lot of buzz online, and yes, technically, it's impressive: billing, tracking, and access control at the edge. And with Cloudflare sitting in front of ~20% of the internet, they’re uniquely positioned to enforce it.

But will it change anything for the scraping industry? Probably not.

💰 Who it really targets. This isn’t aimed at stealth scrapers using proxies and fingerprint spoofing. It’s targeting the tiny number of bots (OpenAI, Google, etc.) that identify themselves, respect robots.txt, and play by the rules. This move is about removing free access for previously "acceptable" crawlers, not blocking stealthy ones.

🕵️ What happens next. Those "good citizen" bots are now faced with a choice: pay, or go underground. Many will likely route traffic through proxies and mask their identity, just like everyone else, especially if the cost of access exceeds the cost of bypassing.

💡 It's opt-in, not default. Worth noting: this isn't a blanket Cloudflare rule. Site owners must explicitly enable it, so the impact will be limited to the domains most concerned about AI content ingestion: news sites, high-value publishers, and so on.

🧠 Tech leap, not an industry shift. It’s a major leap in infrastructure, request metering, and control at scale. But unless Cloudflare can force all crawlers through this system (unlikely), it doesn’t disrupt most production-grade scrapers. You’ll just get a paywall page instead of a ban page, same cat, different hat.
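In practice, that paywall shows up as an HTTP 402 Payment Required response at the edge. A minimal sketch of how a crawler might surface it; the crawler-price header name is our reading of Cloudflare's announcement, not a verified spec, so check their docs before relying on it:

```python
import requests

def fetch_with_paywall_check(url: str) -> str | None:
    """Fetch a page and flag Cloudflare Pay-Per-Crawl paywalls."""
    resp = requests.get(url, timeout=30)
    if resp.status_code == 402:
        # "crawler-price" is an assumption based on Cloudflare's
        # announcement; verify the actual header name in their docs.
        price = resp.headers.get("crawler-price", "unknown")
        print(f"Pay-Per-Crawl paywall on {url}: quoted price {price}")
        return None  # decide here: pay, skip, or route differently
    resp.raise_for_status()
    return resp.text
```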

🚨 The real threat is Labyrinth. While Pay-Per-Crawl is noisy and visible, a system like Cloudflare’s AI Labyrinth is many times more dangerous. It serves obfuscated, fake, or poisoned content without alerting you. Detecting it is non-trivial, and it makes data validation far harder. That’s the real arms race.

Bottom line: Pay-Per-Crawl is clever, but it only works if bots play nice. The serious scrapers, those already bypassing Cloudflare, won’t flinch. This is more about signaling than security.

🤖 What Is GenAI Useful For In Web Scraping?

The team behind GenAI for Data Scraping released an interesting paper comparing three GenAI-powered web scraping methods (AI-assisted code generation, direct HTML extraction with LLMs, and vision-based analysis) against traditional scraping and a naive LLM approach, evaluating them across 3,000 real-world web pages (Amazon, Cars.com, Upwork) and stress-testing stability over 9,000 extractions.

| Method | Accuracy | Cost/Page | Time/Page | Highlights |
| --- | --- | --- | --- | --- |
| AI Code Gen (Method 1) | 100% | $0 | Instant | Site-specific, reliable, but brittle |
| Cleaned HTML + LLM (Method 2) | 98.8% | ~$0.00075 | ~30s | Generalizable across sites, flexible |
| Vision-based (Method 3) | 98.4% | ~$0.0004 | ~17s | Visual extraction, ignores DOM quirks |

Here are some of the most interesting insights:

📸 Screenshot Parsing > HTML Parsing (on Cost). Even with heavy HTML compression, parsing HTML with an LLM was nearly twice as expensive as extracting the same data from screenshots on large pages.
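The intuition: HTML token counts grow with page size, while a screenshot's token cost is roughly fixed by its resolution and tiling. An illustrative calculation; the token counts and price below are hypothetical, not figures from the paper:

```python
# Hypothetical numbers, chosen only to show the scaling behaviour.
PRICE_PER_1M_INPUT_TOKENS = 0.15  # USD, hypothetical model pricing

def cost(tokens: int) -> float:
    return tokens / 1_000_000 * PRICE_PER_1M_INPUT_TOKENS

html_tokens = 60_000       # compressed HTML still grows with page size
screenshot_tokens = 1_600  # image tokens roughly fixed by resolution

print(f"HTML parse:       ${cost(html_tokens):.5f}/page")
print(f"Screenshot parse: ${cost(screenshot_tokens):.5f}/page")
```

Past whatever page size makes the HTML token count overtake the fixed screenshot cost, vision wins on price.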

🧠 LLMs Are Non-Deterministic and Painful to Debug. With traditional parsers, you detect issues, tweak selectors, and move on. With LLMs:

  • Same input can yield different output.

  • Prompt tweaks lead to regressions elsewhere.

  • You need full output snapshotting + semantic diffing to do QA properly (see the sketch below).
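Here is what that snapshot-plus-diff loop can look like: persist each run's parsed output, then compare field by field rather than byte by byte, so cosmetic changes (key order, whitespace) stay quiet while real value drift gets flagged. The comparison rules are ours, for illustration:

```python
import json
from pathlib import Path

SNAPSHOT_DIR = Path("snapshots")

def semantic_diff(old: dict, new: dict) -> list[str]:
    """Field-level diff that tolerates cosmetic changes but flags drift."""
    issues = []
    for key in old.keys() | new.keys():
        a, b = old.get(key), new.get(key)
        if isinstance(a, str) and isinstance(b, str):
            a, b = a.strip(), b.strip()  # ignore whitespace-only changes
        if a != b:
            issues.append(f"{key}: {a!r} -> {b!r}")
    return issues

def check_extraction(page_id: str, output: dict) -> None:
    snap_file = SNAPSHOT_DIR / f"{page_id}.json"
    if snap_file.exists():
        old = json.loads(snap_file.read_text())
        for issue in semantic_diff(old, output):
            print(f"[{page_id}] possible regression: {issue}")
    SNAPSHOT_DIR.mkdir(exist_ok=True)
    snap_file.write_text(json.dumps(output, indent=2, sort_keys=True))
```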

💬 Prompt-based LLM parsing is a mirage. Their tests showed <70% accuracy, high variance, and frequent hallucinations. Some LLMs "couldn't find" fields that were clearly present. Others just made stuff up.

💥 Wrong Data > No Data. At ScrapeOps, we’re actively productizing an AI codegen system for scrapers. The biggest issue isn’t that LLMs fail to build working parsers; it’s that they succeed and return polluted, incomplete, or subtly wrong data.

That’s way worse than returning nothing: no data is easy to detect. Wrong data silently corrupts your pipelines.
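The mitigation we keep coming back to is aggressive validation between the parser and the pipeline: type, range, and cross-field sanity checks that make wrong data fail as loudly as missing data does. A minimal sketch; the fields and thresholds are illustrative, not from any particular product schema:

```python
def validate_product(item: dict) -> list[str]:
    """Sanity-check an LLM-extracted product record; tune per domain."""
    errors = []
    title = item.get("title")
    if not isinstance(title, str) or len(title) < 3:
        errors.append("title missing or implausibly short")

    price = item.get("price")
    if not isinstance(price, (int, float)) or not 0 < price < 100_000:
        errors.append(f"price out of sane range: {price!r}")

    rating = item.get("rating")
    if rating is not None and not 0 <= rating <= 5:
        errors.append(f"rating outside 0-5 scale: {rating!r}")
    return errors

record = {"title": "USB-C Cable", "price": -4.99, "rating": 4.6}
print(validate_product(record) or "record looks sane")
# -> ['price out of sane range: -4.99']
```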

🕸️ Some other noteworthy web scraping content and news…

🔒 Lesson From ProxyCrawl Shutdown: Be careful what you scrape!

Nothing really new here in terms of advice, but the news that ProxyCrawl has shut down as a result of the legal case LinkedIn took against them reaffirms the status quo on what is safe and unsafe to scrape. Key takeaways:

  • 🚪 Beware scraping behind logins. Creating thousands of accounts to scrape behind a login is a recipe for getting caught, sued, and dealing with serious legal consequences.

  • 💰 The size of your pockets matters. This case highlights the power that giant companies have over smaller web scrapers. If you get on the wrong side of Meta, Google, LinkedIn, etc., they don’t need to win in court; they just need to threaten you with their legal resources and force you to change/close. In contrast, a large company like Bright Data was able to fight LinkedIn and win. However, it is important to note that Bright Data wasn't scraping behind the login.

⚒️ Building Effective Tooling For Scraping Teams

Interesting piece from The Scraper’s Journal: Building Internal Tools That Your Scraping Team Will Actually Use. It highlights how most internal scraping tools go unused, not because they’re bad, but because they don’t fit how operators actually work, and goes into detail about what it takes to build tools your team will adopt.

🧭 Analysing The Web Scraping Market Through YC’s Picks & Shovels Lens

With the rapid growth of data analytics and now the explosion of GenAI, web scraping is quickly becoming foundational infrastructure for AI and real-world applications. 

This article looks into YC’s recent uptick in web scraping investments using their “picks and shovels” investment model, and hypothesizes what the future might hold for the web scraping market.

🔮📊 What web scraping topics are you most interested in? (POLL)

Which topic are you most interested in and want to see more content & insights about?


👉 Click above to vote & see what others think in our upcoming #4 Newsletter.

🚀 Until Next Time...

That’s a wrap for issue #3 of The Web Scraping Insider.

If you found this edition interesting, forward it to a fellow scraper, and hit the vote above to shape newsletter #4!

We’ll be back soon with more deep dives, scraped truths, and tactical guides from the front lines of the data extraction world.

Ian from ScrapeOps