Cloudflare Exposes Perplexity’s Sneaky Web Scraping

Perplexity is reportedly accessing content from websites that have explicitly prohibited it from doing so. Cloudflare, a prominent web security services provider, investigated and reported that the answer engine company's crawler bots not only disregard website directives but also disguise their identity, making them difficult for website owners to detect. Cloudflare says it has identified ways to block this activity.

Cloudflare Highlights Perplexity’s Stealth Activities

In a blog post, Cloudflare accused Perplexity of engaging in “stealth crawling.” The post stated, “We see continued evidence that Perplexity is repeatedly modifying their user agent and changing their source ASNs to hide their crawling activity, as well as ignoring — or sometimes failing even to fetch — robots.txt files.”

To understand Perplexity’s actions, it helps to understand how web crawling works. Website owners publish content, and third-party services such as search engines fetch it to index the site and surface it in response to relevant queries. Some applications and platforms also scrape website content, with or without permission, to display it in their own interface or to collect data.

For this dynamic between websites and crawlers to function effectively, a level of trust must exist. This trust is founded on protocols which dictate that crawlers must be transparent, serve specific purposes, and adhere to website instructions and preferences. Consequently, if a site blocks a particular bot, that bot ought to refrain from crawling that site.

According to Cloudflare researchers, Perplexity is undermining this trust by employing stealth tactics to accumulate data from websites that have specifically denied access to its declared bots, namely PerplexityBot and Perplexity-User. Researchers confirmed this behavior by establishing new test domains.

These domains were isolated from indexing by search engines and remained undiscoverable to the public. Furthermore, the researchers implemented a robots.txt file aimed at preventing all bots from accessing any section of the website.
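As a rough illustration of how a compliant crawler is expected to treat such a file, the sketch below parses a block-all robots.txt with Python's standard `urllib.robotparser` and checks whether a given user agent may fetch a URL. The file contents, the `may_fetch` helper, and the `example.test` URL are assumptions made for this example. They are not Cloudflare's actual test setup.

```python
from urllib.robotparser import RobotFileParser

# Assumed contents of a robots.txt that disallows every crawler
# from every path, in the spirit of the test domains described above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # The first two names are the declared bots mentioned in the article;
    # "ExampleBot" and the URL are placeholders.
    for agent in ("PerplexityBot", "Perplexity-User", "ExampleBot"):
        allowed = may_fetch(ROBOTS_TXT, agent, "https://example.test/page")
        print(f"{agent}: allowed={allowed}")
```

A well-behaved crawler would run a check like this before every request and skip the page when the result is false. Nothing in robots.txt enforces that behavior, which is why the arrangement depends on trust.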

When Cloudflare then asked Perplexity's answer engine questions about these new domains, it found that Perplexity could still return detailed information about the sites, even though they used established internet protocols to prohibit crawling.

Cloudflare asserts that Perplexity’s web crawlers use several strategies to circumvent website directives and gain unauthorized access to data. When a declared user agent is disallowed in robots.txt, the crawler reportedly ignores the restriction and continues to scrape content. When a website uses a web application firewall (WAF) to block the declared bot, Perplexity reportedly switches to a generic browser user agent that impersonates Google Chrome on macOS.

This undeclared crawler also reportedly operates from IP addresses outside Perplexity’s official range, and it rotates across different autonomous system numbers (ASNs) to further conceal its activity. Significantly, Cloudflare noted that when the stealth crawler was successfully blocked, Perplexity had to rely on other data sources to generate its answers, and the quality of those answers decreased.
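To see why these tactics defeat simple blocking rules, consider a minimal sketch of a filter that relies only on the declared user agent names and a published IP range. This is not Cloudflare's detection method. The `classify` function, the labels it returns, the documentation-only network `192.0.2.0/24` standing in for an official range, and the sample Chrome user agent string are all assumptions made for illustration.

```python
import ipaddress

# Declared bot names mentioned in the article.
DECLARED_AGENTS = ("PerplexityBot", "Perplexity-User")

# Hypothetical "official" range: a documentation-only network (RFC 5737),
# not a real Perplexity address block.
PUBLISHED_RANGES = [ipaddress.ip_network("192.0.2.0/24")]


def classify(user_agent: str, source_ip: str) -> str:
    """Label a request using only its user agent string and source IP."""
    ua = user_agent.lower()
    declared = any(name.lower() in ua for name in DECLARED_AGENTS)
    ip = ipaddress.ip_address(source_ip)
    in_range = any(ip in net for net in PUBLISHED_RANGES)
    if declared and in_range:
        return "declared"
    if declared:
        return "declared-name-unlisted-ip"
    return "undeclared"


if __name__ == "__main__":
    # Illustrative generic browser string, not a captured request.
    chrome_on_macos = (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/124.0.0.0 Safari/537.36"
    )
    print(classify("Mozilla/5.0 (compatible; PerplexityBot/1.0)", "192.0.2.10"))
    print(classify(chrome_on_macos, "198.51.100.7"))
```

A request that presents a generic browser string from an unlisted address falls into the same bucket as ordinary human traffic, so a rule keyed on names and ranges cannot single it out. Identifying it requires other signals, which is the role of the bot management signatures described in the article.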

Cloudflare said its bot management system detected all of the activity from Perplexity’s undeclared user agents, so its bot management customers are protected automatically. The company has also added signature matches for this stealth crawler to its managed rule that blocks AI crawling activity, a feature available to all Cloudflare users, including those on the free tier.
