When AI Crawlers Break the Rules: Lessons from Cloudflare’s Battle with Perplexity
Cloudflare Accuses AI Startup Perplexity of Bypassing Scraping Blocks on Tens of Thousands of Websites
Understanding how AI-powered web scrapers interact with website defenses is crucial in today’s data-driven world. The recent showdown between Cloudflare and Perplexity exposes both the technical tactics used to bypass site restrictions and the broader implications for ethics, security, and the future of web data access.
Published Aug 5, 2025 • 4 minute read

Introduction
Web scraping has become a cornerstone of AI research and product development. Yet as more sites deploy robots.txt directives and firewall protections, AI companies keep devising new ways to collect data. Cloudflare’s public call-out of Perplexity for stealth crawling offers a rare window into this escalating tug-of-war between content owners and AI startups.
The Battle Unveiled
Cloudflare’s Investigation
Cloudflare first noticed anomalies after customers reported that their content was still surfacing in Perplexity’s results despite explicit blocks in their robots.txt files and Web Application Firewall rules. To verify these claims, Cloudflare spun up private domains disallowed to all bots and watched Perplexity’s crawler keep knocking on their digital door.
Stealth Tactics Employed by Perplexity
Cloudflare’s research uncovered a multi-step evasion strategy:
User-Agent switching when “PerplexityBot” was blocked, masquerading as a Chrome browser on macOS
Rotating IP addresses instead of sticking to declared IP ranges to dodge IP-based blocks
ASN hopping by switching autonomous system numbers to further obscure the crawler’s origin
These techniques played out on tens of thousands of domains, generating millions of requests per day while pretending to be a benign human browser.
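One way defenders can surface this kind of behavior is to look for identity churn: many distinct User-Agent strings or ASNs attached to what is otherwise the same client. The toy heuristic below illustrates the idea on parsed access-log records; the field names (client_key, user_agent, asn) and thresholds are assumptions for illustration, not Cloudflare’s actual detection logic.

```python
# Toy heuristic for spotting identity churn in access logs. Field names
# and thresholds are illustrative assumptions; real bot detection relies
# on far richer behavioral signals.
from collections import defaultdict

def flag_identity_churn(records, max_uas=2, max_asns=1):
    """Group requests by a stable client key (e.g. a TLS fingerprint) and
    flag keys that rotate through too many User-Agents or ASNs."""
    uas = defaultdict(set)
    asns = defaultdict(set)
    for r in records:
        uas[r["client_key"]].add(r["user_agent"])
        asns[r["client_key"]].add(r["asn"])
    return {
        key for key in uas
        if len(uas[key]) > max_uas or len(asns[key]) > max_asns
    }

records = [
    {"client_key": "ja3:abc123", "user_agent": "PerplexityBot", "asn": 64501},
    {"client_key": "ja3:abc123", "user_agent": "Mozilla/5.0 ... Chrome", "asn": 64501},
    {"client_key": "ja3:abc123", "user_agent": "Mozilla/5.0 ... Safari", "asn": 64502},
]
print(flag_identity_churn(records))  # {'ja3:abc123'}
```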
Implications for the AI and Web Ecosystem
Ethical and Legal Ramifications
Perplexity’s alleged disregard for site preferences doesn’t just violate web etiquette—it raises real legal risks. Publishers like Dow Jones and the BBC are already scrambling to protect copyrighted content. The case underscores a looming clash between AI’s hunger for data and traditional intellectual-property frameworks.
Technical Arms Race
Website operators face pressure to deploy ever more sophisticated defenses. The old reliance on Robots.txt is proving insufficient. As AI startups weaponize IP rotation and UA spoofing, defenders must lean on behavioral fingerprinting and machine-learning-driven bot detection to reclaim control.
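To give a flavor of what behavioral fingerprinting can look like, here is a deliberately simple scoring sketch that combines a few per-client signals into a single bot score. The signals and weights are assumptions chosen for illustration; production systems learn such weights from labeled traffic rather than hand-tuning them.

```python
# Deliberately simple behavioral scoring sketch. Signal names and weights
# are illustrative assumptions, not a production model.
from dataclasses import dataclass

@dataclass
class ClientBehavior:
    requests_per_minute: float   # sustained request rate
    executed_javascript: bool    # did the client run the page's JS challenge?
    header_order_typical: bool   # do headers match the claimed browser?
    fetched_robots_txt: bool     # polite crawlers fetch robots.txt first

def bot_score(b: ClientBehavior) -> float:
    """Return a 0..1 score; higher means more bot-like."""
    score = 0.0
    if b.requests_per_minute > 60:
        score += 0.4             # humans rarely sustain this rate
    if not b.executed_javascript:
        score += 0.3             # headless scrapers often skip JS
    if not b.header_order_typical:
        score += 0.3             # UA string doesn't match observed behavior
    if b.fetched_robots_txt:
        score -= 0.1             # weak signal of a declared, polite crawler
    return max(0.0, min(1.0, score))

suspect = ClientBehavior(120, False, False, False)
print(f"bot score: {bot_score(suspect):.1f}")  # bot score: 1.0
```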
Lessons Learned
Ignoring robots.txt rules → Enforce robots policy at the network edge with WAF rules and CAPTCHA challenges
User-Agent spoofing → Analyze request headers and JavaScript execution patterns to verify real browsers
IP rotation → Leverage rate limiting and IP reputation scoring (see the sketch after this list)
ASN hopping → Monitor ASN changes and flag anomalous traffic shifts
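To make the rate-limiting countermeasure concrete, here is a minimal sliding-window limiter with a naive reputation penalty. The window size, request cap, and penalty values are assumed for illustration and would be tuned per site in practice.

```python
# Minimal sliding-window rate limiter with a naive reputation penalty.
# Window size, request cap, and penalty values are illustrative assumptions.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS = 100               # per IP per window
history = defaultdict(deque)     # ip -> timestamps of recent requests
reputation = defaultdict(float)  # ip -> penalty score

def allow_request(ip: str, now: float | None = None) -> bool:
    """Allow the request unless the IP exceeds the window cap or has
    accumulated too much bad reputation."""
    now = time.monotonic() if now is None else now
    q = history[ip]
    while q and now - q[0] > WINDOW_SECONDS:   # drop events outside the window
        q.popleft()
    if len(q) >= MAX_REQUESTS or reputation[ip] >= 1.0:
        reputation[ip] += 0.1                  # repeat offenders sink faster
        return False
    q.append(now)
    return True

# The 101st request inside one window is rejected.
for i in range(101):
    ok = allow_request("203.0.113.7", now=float(i) * 0.1)
print("last request allowed?", ok)  # False
```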
Best Practices for Ethical AI Crawling
Adopt a transparent crawler identity by publishing your user agents and IP ranges publicly
Honor robots.txt directives without exception
Implement rate limiting to avoid overwhelming sites
Establish opt-in data partnerships when feasible—seek permissions rather than stealth
Maintain an audit trail of all crawling activities to demonstrate compliance (these practices are sketched in code below)
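Pulling these practices together, the following sketch shows what a minimal well-behaved fetcher might look like: a declared User-Agent, a robots.txt check before every fetch, a delay between requests, and a log line per decision as a rudimentary audit trail. The user-agent string, crawl delay, and log format are assumptions for illustration.

```python
# Minimal well-behaved fetcher sketch: declared identity, robots.txt
# compliance, rate limiting, and an audit log. The UA string, crawl delay,
# and log format are illustrative assumptions.
import logging
import time
import urllib.robotparser
import urllib.request

USER_AGENT = "ExampleResearchBot/1.0 (+https://example.org/bot)"  # hypothetical
CRAWL_DELAY_SECONDS = 5

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")

def polite_fetch(url: str) -> bytes | None:
    """Fetch url only if robots.txt allows it, and log the decision."""
    parts = url.split("/", 3)                    # scheme, '', host, rest
    robots_url = f"{parts[0]}//{parts[2]}/robots.txt"
    rp = urllib.robotparser.RobotFileParser(robots_url)
    rp.read()                                    # fetch and parse robots.txt
    if not rp.can_fetch(USER_AGENT, url):
        logging.info("SKIP %s (disallowed by robots.txt)", url)
        return None
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req) as resp:
        body = resp.read()
    logging.info("FETCH %s (%d bytes)", url, len(body))
    time.sleep(CRAWL_DELAY_SECONDS)              # never hammer the host
    return body

if __name__ == "__main__":
    polite_fetch("https://example.org/")
```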
Conclusion
Cloudflare’s exposé of Perplexity’s stealth tactics serves as a wake-up call for both website operators and AI innovators. Site owners must upgrade defenses beyond simple “no-crawl” protocols, while AI firms should recommit to transparent, ethical data-collection methods. Collaboration between infrastructure providers, content publishers, and AI companies will be essential to balance open data access with respect for digital property.