Blocking AI Crawler Bots with robots.txt

As generative AI continues to rise in prevalence, many content creators are concerned about how their work is being used for training without consent. One widely supported way to protect your content is the robots.txt file, which lets you manage the behavior of web crawlers, including AI bots. This article covers how to use the robots.txt file to block unwanted AI crawlers from accessing your website.

What is a robots.txt file?

A robots.txt file is a plain text file that resides in your website’s root directory and provides instructions to web crawlers. These instructions tell specific bots which parts of your site they may or may not crawl. The fundamental syntax for blocking a particular bot from your entire site is:

User-agent: {BOT-NAME}
Disallow: /

Conversely, to permit a bot to crawl your site, the syntax is:

User-agent: {BOT-NAME}
Allow: /

Where to Place Your robots.txt File

To take effect, the robots.txt file must sit in the root directory of your website, so that it is accessible at:

https://yourwebsite.com/robots.txt
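
You can verify that the file is being served from the right place by fetching it directly. A quick check from the command line (assuming curl is available):

curl -s https://yourwebsite.com/robots.txt

If the command prints your rules, crawlers can find them too.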

For more in-depth guidance, consider checking resources like Google's Introduction to robots.txt or Cloudflare's explanation of how a robots.txt file operates.

How to Block AI Crawler Bots

The syntax for blocking an AI bot with robots.txt is the same as for any other crawler:

User-agent: {AI-BOT-NAME}
Disallow: /

Blocking OpenAI Bots

To prevent OpenAI crawlers like GPTBot and ChatGPT-User from accessing your content, add these lines to your robots.txt file:

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

OpenAI also publishes the IP address ranges its crawlers use, so you can go a step further and block them at the firewall level; on Linux, for example, UFW can deny traffic from those ranges.
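
As a minimal sketch, assuming you have looked up OpenAI's currently published ranges (the address below is a documentation placeholder, not a real OpenAI range):

# deny traffic from one of OpenAI's published ranges (placeholder address shown)
sudo ufw deny from 192.0.2.0/24 to any

# confirm the rule was added
sudo ufw status numbered

Repeat the deny rule for each published range, and re-check the list periodically, since the ranges can change.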

Blocking Google AI Bots

To opt out of having your content used by Google's AI products, such as Bard and the Vertex AI generative APIs, use the Google-Extended token in your robots.txt file:

User-agent: Google-Extended
Disallow: /

Google does not publish separate IP ranges for Google-Extended; it is a control token honored by Google's existing crawlers rather than a distinct bot, so the user agent rule above is the appropriate mechanism.

Blocking Common Crawl

Common Crawl's crawler, known as CCBot, builds a public archive of the web that is widely used as training data for AI models. To block it, include this in your robots.txt:

User-agent: CCBot
Disallow: /

Blocking Perplexity AI

To prevent Perplexity AI from crawling your site, specify the following in your robots.txt:

User-agent: PerplexityBot
Disallow: /

Blocking Anthropic AI

Anthropic has used several user agents over time (ClaudeBot is its current crawler; anthropic-ai and Claude-Web are older tokens). To cover them all, use the following lines:

User-agent: anthropic-ai
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: ClaudeBot
Disallow: /
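
Putting it all together, a robots.txt that blocks every AI bot covered in this article looks like this:

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: ClaudeBot
Disallow: /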

Do AI Bots Always Follow robots.txt?

Most reputable companies, including Google and OpenAI, honor the directives in a robots.txt file. However, robots.txt is a voluntary convention, and poorly designed or malicious crawlers may ignore it entirely, which is why firewall-level blocking is a useful backstop.

Can You Block AI Bots Using Cloudflare or AWS WAF?

Cloudflare has recently rolled out features that allow users to block AI bots through its Web Application Firewall (WAF), and AWS WAF can achieve a similar effect by matching on the User-Agent request header. While these measures can be effective, they demand care: test any rule thoroughly to ensure it does not mistakenly block legitimate visitors.
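
As an illustration, a Cloudflare WAF custom rule with a Block action could match on the user agent with an expression like the one below (written in Cloudflare's Rules language; verify the field names against Cloudflare's current documentation):

(http.user_agent contains "GPTBot") or (http.user_agent contains "CCBot") or (http.user_agent contains "ClaudeBot")

Keep in mind that user agent matching, like robots.txt, only stops bots that identify themselves honestly.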

Conclusion

As generative AI firms increasingly train on data gathered from the web, the ability to manage and restrict access to your material has become essential for content creators and website owners. By combining a well-crafted robots.txt file with firewall or WAF configurations, you can safeguard your hard work while deciding who can and cannot access your data.