Cloudflare Uncovers Perplexity’s Sneaky Data Tactics

Perplexity is reportedly accessing content on websites that have explicitly prohibited it from doing so. Cloudflare, a leading global web security firm, investigated and confirmed the evasive tactics employed by the answer engine company. Researchers found that Perplexity’s crawler bots not only ignored website directives but also concealed their identity, making it difficult for website owners to monitor their activity. Cloudflare says it has since been able to block the AI firm’s stealth scraping.

Cloudflare Uncovers Perplexity’s Evasive Strategies

In a recent blog post, Cloudflare stated that Perplexity has been involved in “stealth crawling.” The company said it sees continued evidence that Perplexity repeatedly changes its user agent and its source ASNs to hide its crawling activity, and that it ignores, or sometimes does not even fetch, robots.txt files.

To fully understand Perplexity’s actions, it is essential to grasp the functioning of web crawling. Website owners publish content that is subsequently accessed by third-party services like search engines for indexing, aimed at making relevant content discoverable to users. Some applications and websites also scrape data from other websites, either to display it within their own platforms or to gather information with proper authorization.

For this ecosystem to function effectively, a foundation of trust is essential. This trust rests on established protocols that web crawlers are expected to follow. These protocols ensure that crawling activities are transparent, serve specific purposes, and comply with website preferences. Hence, if a website blocks a specific bot, that bot should refrain from crawling that site.

According to Cloudflare researchers, Perplexity is violating this trust framework by employing stealth methods to scrape data from websites that have explicitly prohibited access to its known bots, namely PerplexityBot and Perplexity-User. This was verified through the creation of new test domains.

The newly established domains were not indexed by any search engine and were not publicly discoverable. Additionally, the researchers published a robots.txt file that disallowed all bots from accessing any part of each domain.
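As a rough illustration of what such a blanket directive looks like and how a well-behaved crawler would interpret it, here is a minimal Python sketch using the standard library's robots.txt parser. The robots.txt text, the helper name `is_allowed`, and the `test-domain.example` URL are assumptions made for the example, not Cloudflare's actual test setup.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content: ask every crawler to stay out of the whole site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A compliant crawler would check this before requesting the page.
print(is_allowed("PerplexityBot", "https://test-domain.example/page"))
```

The file is only a request: nothing in it technically prevents a crawler from fetching the page anyway, which is why compliance depends on the trust described above.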

Following this setup, Cloudflare researchers asked Perplexity questions about these specially created domains. Although the domains used the standard mechanisms for restricting crawlers, Perplexity was still able to return detailed information about their content.

Cloudflare indicated that Perplexity’s crawlers use several techniques to circumvent website directives. When access is denied via robots.txt, the bots ignore the instructions and continue to scrape data. When a web application firewall (WAF) blocks their declared user agents, they switch to a generic browser user agent that mimics Google Chrome on macOS.
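To see why a rule keyed on the declared bot names stops working once the crawler presents itself as an ordinary browser, consider this minimal Python sketch. The function name `blocked_by_user_agent_rule` and both user-agent strings are illustrative assumptions; real WAF rules are written in the vendor's own rule language rather than in Python.

```python
# Declared crawler names from the article, matched case-insensitively.
DECLARED_BOT_TOKENS = ("perplexitybot", "perplexity-user")


def blocked_by_user_agent_rule(user_agent: str) -> bool:
    """Hypothetical WAF-style rule: block if the UA names a declared bot."""
    ua = user_agent.lower()
    return any(token in ua for token in DECLARED_BOT_TOKENS)


# Illustrative user-agent strings (assumed, not captured traffic).
declared_ua = "Mozilla/5.0 (compatible; PerplexityBot/1.0)"
generic_ua = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)

print(blocked_by_user_agent_rule(declared_ua))  # the declared bot is caught
print(blocked_by_user_agent_rule(generic_ua))   # the browser-like UA is not
```

Because the User-Agent header is supplied by the client, a rule like this only catches crawlers that identify themselves honestly.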

Additionally, the undeclared bots reportedly use multiple IP addresses outside Perplexity’s official IP range. To further obscure their activity, the crawlers also rotate through different autonomous system numbers (ASNs). Cloudflare observed that once these undeclared bots were successfully blocked, the quality of Perplexity’s responses diminished, forcing it to lean on alternative data sources for answers.
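One simple check a site operator might run is whether a request's source address falls inside the ranges a crawler operator has published. The sketch below uses Python's standard `ipaddress` module. The ranges shown are reserved documentation prefixes used as placeholders, not Perplexity's real ranges, and `in_declared_range` is a hypothetical helper.

```python
import ipaddress

# Placeholder ranges (reserved documentation prefixes), NOT real crawler ranges.
DECLARED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def in_declared_range(ip: str) -> bool:
    """Return True if ip belongs to one of the published crawler ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in network for network in DECLARED_RANGES)


print(in_declared_range("192.0.2.17"))    # inside a declared range
print(in_declared_range("198.51.100.9"))  # outside every declared range
```

A check like this can confirm that traffic claiming to be a declared bot really comes from its operator, but it cannot by itself attribute traffic from unlisted addresses, which is why the article's account relies on Cloudflare's broader bot-detection signals.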

Cloudflare’s bot management system has successfully tracked the undeclared crawling activities linked to Perplexity’s hidden user agents, providing automatic protections for all its bot management clients. Moreover, the company has included signature matches for the stealth crawlers within its managed rules, effectively curbing AI crawling activity. This updated feature is accessible to all Cloudflare users, including those on the free tier.
