Key Takeaways
- AI crawlers are harvesting your content — Over 21% of the top 1,000 websites now block GPTBot, with OpenAI's crawl-to-referral ratio reaching 1,700:1 compared to Google's 14:1
- Robots.txt alone isn't enough — Some AI crawlers ignore these directives; Cloudflare's enterprise bot management provides network-level blocking
- Microsoft 365 has hidden AI settings — "Connected Experiences" processes your documents through AI and requires manual opt-out via deeply nested menus
- Google Workspace defaults favor AI access — Smart features are enabled by default in the U.S., allowing Gemini to access emails, documents, and attachments
- A layered defense strategy is essential — Combine robots.txt, server-level blocking, and Cloudflare's AI Crawl Control for comprehensive protection
Your business website, cloud documents, and collaborative workspaces are constantly being scanned by AI training bots—often without your knowledge or consent. These crawlers harvest your proprietary content, customer data, and intellectual property to train large language models, returning almost nothing in exchange. Unlike traditional search engines that drove traffic back to your site, AI companies like OpenAI, Anthropic, and Meta use your content to power their own products, effectively monetizing your work while bypassing your digital presence entirely.
The statistics paint a sobering picture: Cloudflare's research reveals that OpenAI's GPTBot crawls websites at a ratio of 1,700 crawls for every single referral it sends back. Anthropic's ClaudeBot is even more aggressive, with a crawl-to-referral ratio of 73,000:1. Compare that to Google's traditional search crawler at 14:1, and the asymmetry becomes clear. AI companies are consuming vast amounts of content while providing almost zero traffic in return—fundamentally breaking the implicit contract that has governed the web for decades.
This guide provides a comprehensive framework for protecting your organization's digital assets across three critical domains: your public-facing websites, Microsoft 365 environments, and Google Workspace tenants. We'll explore both the technical controls available and how partnering with a managed IT services provider can help you implement enterprise-grade protections efficiently.
Understanding the AI Training Bot Landscape
Before implementing defenses, it's crucial to understand what you're protecting against. AI training bots come in several categories, each with different behaviors and implications for your content.
Types of AI Crawlers
Training Data Crawlers are the primary concern for most organizations. These bots systematically harvest web content to build datasets for training large language models. GPTBot (OpenAI), ClaudeBot (Anthropic), CCBot (Common Crawl), and Google-Extended all fall into this category. Once your content enters their training corpus, it becomes part of the model's knowledge base—potentially forever.
AI Search and Assistant Crawlers operate differently. ChatGPT-User, Perplexity-User, and similar bots fetch content in real-time when users ask questions, providing citations and potentially driving some referral traffic. These represent a gray area—blocking them removes your visibility in AI-powered search results, but allowing them means your content is used without traditional compensation.
Corporate AI Crawlers from major tech companies serve multiple purposes. ByteDance's Bytespider, Amazon's Amazonbot, Meta's FacebookBot, and Apple's Applebot-Extended collect data for everything from voice assistants to recommendation algorithms. These companies rarely disclose exactly how your content is used across their product ecosystem.
Major AI Training Bots: Quick Reference
| Bot Name | Operator | Primary Purpose | Respects robots.txt |
|---|---|---|---|
| GPTBot | OpenAI | Model training data collection | Yes |
| ClaudeBot | Anthropic | Model training data collection | Yes |
| CCBot | Common Crawl | Open web archive (used by many AI labs) | Yes |
| Google-Extended | Google | Gemini AI training | Yes |
| Bytespider | ByteDance | LLM training (Doubao) | Partially |
| PerplexityBot | Perplexity AI | AI search indexing | Inconsistent |
| Amazonbot | Amazon | Alexa and AI services | Yes |
| Meta-ExternalAgent | Meta | AI model training | Yes |
Protecting Your Public Website from AI Crawlers
Website protection requires a layered approach. Robots.txt provides the foundation, but since compliance is voluntary, you need additional network-level controls to enforce your preferences against non-compliant crawlers.
Layer 1: Robots.txt Configuration
The robots.txt file has been the standard mechanism for communicating with web crawlers since 1994. While it's technically just a set of preferences that crawlers should respect, major AI companies have committed to honoring these directives—at least on paper. As of July 2025, 94% of the top 12 million websites maintain a robots.txt file, and approximately 21% of the top 1,000 websites specifically include rules for GPTBot.
Here's a comprehensive robots.txt configuration that blocks all major AI training bots while preserving your search engine visibility:
```
# Block AI Training Crawlers

# OpenAI
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: OAI-SearchBot
Disallow: /

# Anthropic (Claude)
User-agent: anthropic-ai
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Claude-Web
Disallow: /

# Google AI Training
User-agent: Google-Extended
Disallow: /

# Common Crawl (widely used for AI training)
User-agent: CCBot
Disallow: /

# Meta AI
User-agent: FacebookBot
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: Meta-ExternalFetcher
Disallow: /

# ByteDance
User-agent: Bytespider
Disallow: /

# Perplexity
User-agent: PerplexityBot
Disallow: /

# Apple AI
User-agent: Applebot-Extended
Disallow: /

# Amazon
User-agent: Amazonbot
Disallow: /

# Other AI Crawlers
User-agent: Diffbot
Disallow: /
User-agent: ImagesiftBot
Disallow: /
User-agent: Omgilibot
Disallow: /
User-agent: webzio-extended
Disallow: /
User-agent: YouBot
Disallow: /
User-agent: cohere-ai
Disallow: /
User-agent: AI2Bot
Disallow: /

# Allow standard search engines
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /

User-agent: *
Allow: /
```
Upload this file to your website's root directory so it's accessible at https://yourdomain.com/robots.txt. Keep in mind that this list requires regular updates as new AI bots emerge—Cloudflare has identified over 226 known AI crawlers, and more appear regularly.
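Before deploying, it's worth verifying that the rules behave as intended. A quick sketch using Python's standard-library robots.txt parser, run against a condensed sample of the rules above (the full file can be tested the same way):

```python
from urllib.robotparser import RobotFileParser

# Condensed sample of the robots.txt rules shown above
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Googlebot
Allow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Confirm AI crawlers are blocked while search engines stay allowed
for agent in ("GPTBot", "ClaudeBot", "Googlebot"):
    allowed = parser.can_fetch(agent, "https://yourdomain.com/blog/post")
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

The same check can be pointed at your live file by calling `set_url("https://yourdomain.com/robots.txt")` and `read()` instead of `parse()`.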
Layer 2: Cloudflare's AI Bot Management
Robots.txt has a fundamental limitation: it relies on voluntary compliance. Some AI crawlers have been documented ignoring these directives entirely, and malicious actors can spoof legitimate user-agent strings. This is where Cloudflare's infrastructure becomes invaluable.
On July 1, 2025, Cloudflare announced it would become the first major Internet infrastructure provider to block AI crawlers by default. For new domains, customers are now explicitly asked whether they want to allow AI crawlers—a significant shift from the previous opt-out model. More than one million Cloudflare customers have already chosen to block AI bots, demonstrating widespread demand for these protections.
Cloudflare's AI Crawl Control provides several key capabilities:
- One-Click Blocking: Enable "Block AI Bots" in the Security settings to immediately stop verified AI crawlers and unverified bots exhibiting similar behavior
- Granular Crawler Management: Allow, block, or charge individual AI crawlers based on your content strategy. Allow search-focused bots while blocking training crawlers
- Managed Robots.txt: Cloudflare can automatically create and maintain a robots.txt file with AI crawler directives, ensuring your preferences stay current
- Monetized Content Protection: Automatically block AI bots only on pages with advertising, preserving potential revenue while protecting monetized content
- Pay-Per-Crawl: For organizations open to licensing their content, Cloudflare's beta program enables direct payments from AI companies per crawl request
The combination of Cloudflare's global network visibility, machine learning models trained on bot behavior, and experience mitigating DDoS attacks makes their platform uniquely effective at identifying and blocking AI crawlers—even those that attempt to disguise themselves.
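For teams that prefer explicit rules over the one-click toggle, the same policy can be written as a custom WAF rule. A sketch in Cloudflare's rule expression language (the `cf.verified_bot_category` field and its "AI Crawler" value should be verified against current Cloudflare documentation and your plan tier; the user-agent clause is a fallback for crawlers not on the verified-bot list):

```
(cf.verified_bot_category eq "AI Crawler") or (http.user_agent contains "Bytespider")
```

Pair the expression with the Block action in the dashboard; additional `http.user_agent contains` clauses can be chained with `or` for any crawler you want to name explicitly.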
How ITECS Implements Cloudflare Protection
As a managed IT services provider, ITECS helps businesses leverage Cloudflare's comprehensive security suite—not just for AI bot management, but as part of an integrated web security strategy.
Proxy and Caching Services: Cloudflare's reverse proxy sits between your web servers and incoming traffic, providing DDoS protection, SSL/TLS termination, and intelligent caching that improves site performance while reducing server load. This architecture gives you visibility into all traffic hitting your domain, including AI crawlers.
Security Layers: Beyond bot management, Cloudflare provides Web Application Firewall (WAF) capabilities, rate limiting, and browser integrity checking. These tools work together to protect against both AI scraping and traditional cyber threats.
Analytics and Reporting: Cloudflare's dashboard provides detailed insights into bot traffic patterns, showing which AI crawlers are attempting to access your site, how often, and which pages they're targeting. This visibility enables data-driven decisions about your content access policies.
Our cybersecurity consulting team works with your organization to develop a comprehensive bot management strategy that balances content protection with legitimate business needs like search engine visibility and potential AI partnership opportunities.
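For origin servers that are not fronted by Cloudflare, the same blocklist can be enforced at the web server itself. A minimal sketch for nginx, using bot names from the robots.txt above (the `map` block belongs in the `http` context; extend the pattern as new crawlers appear):

```nginx
# Flag requests whose User-Agent matches a known AI crawler (~* = case-insensitive)
map $http_user_agent $is_ai_crawler {
    default 0;
    "~*(GPTBot|ClaudeBot|CCBot|Bytespider|PerplexityBot|Meta-ExternalAgent|Amazonbot)" 1;
}

server {
    listen 443 ssl;
    server_name yourdomain.com;

    # Refuse flagged crawlers before any content is served
    if ($is_ai_crawler) {
        return 403;
    }

    # ... existing site configuration ...
}
```

Server-level rules catch crawlers that ignore robots.txt, but unlike Cloudflare's behavioral detection they only match declared user-agent strings, so they remain one layer among several.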
Securing Your Microsoft 365 Environment
While website protection focuses on external crawlers, Microsoft 365 presents a different challenge: internal features that may share your data with AI systems. The "Connected Experiences" functionality has been part of Microsoft 365 since April 2019, but recent concerns about AI training have brought renewed scrutiny to these settings.
Understanding Connected Experiences
Microsoft's Connected Experiences encompass a wide range of cloud-powered features within Word, Excel, PowerPoint, Outlook, and other Microsoft 365 applications. These include real-time grammar suggestions, co-authoring capabilities, translation services, and content recommendations. When enabled, these features send document content to Microsoft's cloud services for processing.
Microsoft has stated explicitly: "Microsoft does not use customer data from Microsoft 365 consumer and commercial applications to train large language models." This distinction is important—while Connected Experiences do process your content, Microsoft claims this processing is for feature functionality rather than model training.
However, Microsoft's privacy statement includes language about using data "to develop and train our AI models," creating ambiguity that concerns privacy-conscious organizations. The company has clarified that enterprise customers can negotiate specific terms about data usage, but the default settings and documentation remain complex.
How to Disable Connected Experiences
For individual users, disabling Connected Experiences requires navigating through multiple layers of settings menus. Here's the path:
Individual User Settings (Word, Excel, PowerPoint, etc.)
- Open any Microsoft 365 application (Word, Excel, etc.)
- Click File in the top-left corner
- Select Options from the sidebar
- Navigate to the Trust Center tab
- Click Trust Center Settings
- Select Privacy Options
- Click Privacy Settings
- Find Optional Connected Experiences and uncheck the box
Note: This process must be repeated for each Microsoft 365 application on each device.
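Rather than repeating those clicks on every device, the setting can be pushed centrally. As an illustration, the Group Policy setting "Allow the use of additional optional connected experiences in Office" is backed by a per-user registry value along these lines (path and value name should be verified against Microsoft's current ADMX documentation before deployment):

```
Windows Registry Editor Version 5.00

; 2 = disabled; deploy per user via Group Policy or Intune
[HKEY_CURRENT_USER\Software\Policies\Microsoft\office\16.0\common\privacy]
"controllerconnectedservicesenabled"=dword:00000002
```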
Enterprise Administrator Controls
For organizations using Microsoft 365 enterprise deployments, administrators have access to centralized policy controls that can enforce Connected Experiences settings across the entire organization.
Using Microsoft Intune, Group Policy, or the Microsoft 365 Apps admin center, IT administrators can:
- Disable all optional connected experiences organization-wide
- Control which specific connected experiences are available to users
- Prevent users from overriding organizational settings
- Configure different policies for different user groups based on data sensitivity
Microsoft Purview's Data Security Posture Management (DSPM) for AI provides additional monitoring capabilities, offering insights into AI interactions involving sensitive information and allowing organizations to create policies around Copilot for Microsoft 365 usage.
Copilot Considerations
Microsoft 365 Copilot represents a separate but related concern. While Copilot does access your organizational data through Microsoft Graph to provide AI-powered assistance, Microsoft maintains that "prompts, responses, and data accessed through Microsoft Graph aren't used to train foundation LLMs." The data processing occurs within your tenant boundary and is subject to your existing Microsoft 365 compliance configurations.
Organizations subject to regulatory requirements such as HIPAA or CMMC should work with their compliance teams and IT consultants to evaluate whether Copilot deployment aligns with their data handling requirements.
Controlling AI Access in Google Workspace
Google Workspace presents similar challenges to Microsoft 365, with AI-powered "Smart Features" enabled by default in many regions. A November 2025 California lawsuit alleges that Google changed its policies in October 2025 to give Gemini default access to private content including emails and attachments—content that previously required explicit user consent.
Understanding Google's Smart Features
Google's Smart Features encompass functionality across Gmail, Drive, Calendar, Meet, and other Workspace applications. When enabled, these features allow Google's AI to process your content for purposes including:
- Smart Reply and Smart Compose suggestions in Gmail
- Automatic event creation from email content
- Gemini-powered summaries and content analysis
- Intelligent search across your Workspace content
- Personalization features in Maps, Wallet, and Google Assistant
Google states: "We do not use your Workspace data to train or improve the underlying generative AI and large language models that power Bard, Search, and other systems outside of Workspace without permission." However, the data is used to improve Workspace-specific AI features, and the distinction between "improving features" and "training models" can be unclear.
Disabling Smart Features in Gmail
Google provides granular controls for Smart Features, but the settings are spread across multiple locations:
Gmail Smart Features Settings
- Open Gmail and click the gear icon (Settings)
- Click See all settings
- Navigate to the General tab
- Scroll to Smart features and personalization
- Uncheck Smart features and personalization
- Click Manage Workspace smart feature settings
- Disable both checkboxes:
- Smart features in Google Workspace
- Smart features in other Google products
- Click Save Changes
Note: Disabling these features will remove functionality like Smart Compose, categorized inbox, and automatic calendar event detection.
Google Workspace Administrator Controls
For organizations using Google Workspace Business, Enterprise, or Education editions, administrators can enforce AI feature restrictions organization-wide through the Google Admin console:
- Log in to admin.google.com
- Navigate to Apps → Google Workspace → Settings for Gmail → User settings
- Locate options for Gemini and Smart Features
- Configure restrictions for the entire organization or specific organizational units
The Gemini for Workspace feature can be completely disabled or restricted to specific user groups, giving organizations control over who can use AI-powered capabilities and under what circumstances.
Managing Gemini App Permissions
Even with Workspace Smart Features disabled, the standalone Gemini app has separate permissions that must be managed. Google's documentation indicates that Gemini learns from user chats by default, with sample conversations contributing to AI model training unless explicitly disabled.
To use Gemini without contributing to model training, users can rely on temporary chats or interact with Gemini without signing in to a Google account. Organizations should include Gemini app policies in their broader AI governance strategy.
Building an Enterprise AI Data Protection Strategy
Effective protection against AI data harvesting requires more than configuring individual settings—it demands a comprehensive strategy that addresses policy, technology, and ongoing governance.
Policy Development
Organizations should develop clear policies addressing:
- Content Classification: Which content categories (proprietary research, customer data, public marketing materials) require AI crawler protection?
- AI Tool Usage: Under what circumstances can employees use AI tools that may process company data? What approvals are required?
- Third-Party Agreements: Do vendor contracts include provisions about AI training on your data? What due diligence is required before adopting new SaaS tools?
- Incident Response: How will you respond if you discover unauthorized AI training on your content? What documentation and legal resources are available?
Technical Implementation Checklist
Ongoing Governance
The AI crawler landscape evolves rapidly. New bots emerge regularly, existing bots change their behavior, and platform policies shift. Effective protection requires ongoing vigilance:
- Regular Audits: Review server logs and Cloudflare analytics monthly to identify new AI crawlers attempting to access your content
- Policy Updates: Keep robots.txt and firewall rules current as new AI bots are documented
- Platform Monitoring: Track Microsoft and Google announcements about AI feature changes that might affect your data handling
- Employee Training: Ensure staff understand policies about AI tool usage and recognize signs of unauthorized data access
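To make the monthly log review concrete, a small audit script can tally AI-crawler hits in a standard access log. A sketch using bot names from the reference table above (the log lines here are illustrative; point the function at your real log file):

```python
from collections import Counter

AI_CRAWLERS = ["GPTBot", "ClaudeBot", "CCBot", "Bytespider",
               "PerplexityBot", "Amazonbot", "Meta-ExternalAgent"]

def count_ai_hits(log_lines):
    """Tally access-log lines whose User-Agent names a known AI crawler."""
    hits = Counter()
    for line in log_lines:
        lowered = line.lower()
        for bot in AI_CRAWLERS:
            if bot.lower() in lowered:
                hits[bot] += 1
    return hits

# Illustrative combined-format log lines
sample = [
    '203.0.113.7 - - [01/Jul/2025:10:00:00 +0000] "GET /blog HTTP/1.1" 200 512 "-" "Mozilla/5.0; GPTBot/1.1"',
    '198.51.100.2 - - [01/Jul/2025:10:01:00 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0; ClaudeBot/1.0"',
    '192.0.2.9 - - [01/Jul/2025:10:02:00 +0000] "GET / HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (ordinary browser)"',
]
print(count_ai_hits(sample))
```

A sudden spike for one bot, or hits from a bot you have already blocked in robots.txt, is a signal to tighten firewall rules or escalate to network-level blocking.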
Related Resources
Meta and Yandex Betrayed User Trust
How major tech companies have exploited user data for AI training without adequate consent
Quick Tips for Cybersecurity Hygiene
Foundational security practices that complement AI data protection efforts
Microsoft 365 MSP Guide
Comprehensive guide to managing Microsoft 365 environments securely
Claude vs ChatGPT: Business Comparison
Understanding the AI tools that may be crawling your content
Take Control of Your Digital Content
The battle over AI training data is fundamentally about control—control over your intellectual property, your customer data, and your competitive advantages. While AI companies have benefited enormously from freely scraping the web, businesses are increasingly asserting their right to decide how their content is used.
Implementing comprehensive AI data protection requires expertise across web infrastructure, cloud platforms, and enterprise security. ITECS helps businesses navigate this complex landscape, implementing layered defenses that protect your content without sacrificing legitimate functionality.
Our cybersecurity services team can audit your current exposure to AI crawlers, implement Cloudflare's enterprise bot management, configure Microsoft 365 and Google Workspace privacy settings, and establish ongoing governance processes to keep your protections current.
Protect Your Business from Unauthorized AI Training
Don't let AI companies profit from your content without your consent. ITECS provides comprehensive AI data protection services including Cloudflare implementation, Microsoft 365 and Google Workspace configuration, and ongoing bot management.
Schedule a Consultation Today →