Key Takeaways
- AI crawlers are harvesting your content — Over 21% of the top 1,000 websites now block GPTBot, with OpenAI's crawl-to-referral ratio reaching 1,700:1 compared to Google's 14:1
- Robots.txt alone isn't enough — Some AI crawlers ignore these directives; Cloudflare's enterprise bot management provides network-level blocking
- Microsoft 365 has hidden AI settings — "Connected Experiences" processes your documents through AI and requires manual opt-out via deeply nested menus
- Google Workspace defaults favor AI access — Smart features are enabled by default in the U.S., allowing Gemini to access emails, documents, and attachments
- A layered defense strategy is essential — Combine robots.txt, server-level blocking, and Cloudflare's AI Crawl Control for comprehensive protection
Your business website, cloud documents, and collaborative workspaces are constantly being scanned by AI training bots—often without your knowledge or consent. These crawlers harvest your proprietary content, customer data, and intellectual property to train large language models, returning almost nothing in exchange. Unlike traditional search engines that drove traffic back to your site, AI companies like OpenAI, Anthropic, and Meta use your content to power their own products, effectively monetizing your work while bypassing your digital presence entirely.
The statistics paint a sobering picture: Cloudflare's research reveals that OpenAI's GPTBot crawls websites at a ratio of 1,700 crawls for every single referral it sends back. Anthropic's ClaudeBot is even more aggressive, with a crawl-to-referral ratio of 73,000:1. Compare that to Google's traditional search crawler at 14:1, and the asymmetry becomes clear. AI companies are consuming vast amounts of content while providing almost zero traffic in return—fundamentally breaking the implicit contract that has governed the web for decades.
This guide provides a comprehensive framework for protecting your organization's digital assets across three critical domains: your public-facing websites, Microsoft 365 environments, and Google Workspace tenants. We'll explore both the technical controls available and how partnering with a managed IT services provider can help you implement enterprise-grade protections efficiently.
Understanding the AI Training Bot Landscape
Before implementing defenses, it's crucial to understand what you're protecting against. AI training bots come in several categories, each with different behaviors and implications for your content.
Types of AI Crawlers
Training Data Crawlers are the primary concern for most organizations. These bots systematically harvest web content to build datasets for training large language models. GPTBot (OpenAI), ClaudeBot (Anthropic), CCBot (Common Crawl), and Google-Extended all fall into this category. Once your content enters their training corpus, it becomes part of the model's knowledge base—potentially forever.
AI Search and Assistant Crawlers operate differently. ChatGPT-User, Perplexity-User, and similar bots fetch content in real-time when users ask questions, providing citations and potentially driving some referral traffic. These represent a gray area—blocking them removes your visibility in AI-powered search results, but allowing them means your content is used without traditional compensation.
Corporate AI Crawlers from major tech companies serve multiple purposes. ByteDance's Bytespider, Amazon's Amazonbot, Meta's FacebookBot, and Apple's Applebot-Extended collect data for everything from voice assistants to recommendation algorithms. These companies rarely disclose exactly how your content is used across their product ecosystem.
Major AI Training Bots: Quick Reference
| Bot Name | Operator | Primary Purpose | Respects robots.txt |
|---|---|---|---|
| GPTBot | OpenAI | Model training data collection | Yes |
| ClaudeBot | Anthropic | Model training data collection | Yes |
| CCBot | Common Crawl | Open web archive (used by many AI labs) | Yes |
| Google-Extended | Google | Gemini AI training | Yes |
| Bytespider | ByteDance | LLM training (Doubao) | Partially |
| PerplexityBot | Perplexity AI | AI search indexing | Inconsistent |
| Amazonbot | Amazon | Alexa and AI services | Yes |
| Meta-ExternalAgent | Meta | AI model training | Yes |
Protecting Your Public Website from AI Crawlers
Website protection requires a layered approach. Robots.txt provides the foundation, but since compliance is voluntary, you need additional network-level controls to enforce your preferences against non-compliant crawlers.
Layer 1: Robots.txt Configuration
The robots.txt file has been the standard mechanism for communicating with web crawlers since 1994. While it's technically just a set of preferences that crawlers should respect, major AI companies have committed to honoring these directives—at least on paper. As of July 2025, 94% of the top 12 million websites maintain a robots.txt file, and approximately 21% of the top 1,000 websites specifically include rules for GPTBot.
Here's a comprehensive robots.txt configuration that blocks all major AI training bots while preserving your search engine visibility:
```
# Block AI Training Crawlers

# OpenAI
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: OAI-SearchBot
Disallow: /

# Anthropic (Claude)
User-agent: anthropic-ai
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Claude-Web
Disallow: /

# Google AI Training
User-agent: Google-Extended
Disallow: /

# Common Crawl (widely used for AI training)
User-agent: CCBot
Disallow: /

# Meta AI
User-agent: FacebookBot
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: Meta-ExternalFetcher
Disallow: /

# ByteDance
User-agent: Bytespider
Disallow: /

# Perplexity
User-agent: PerplexityBot
Disallow: /

# Apple AI
User-agent: Applebot-Extended
Disallow: /

# Amazon
User-agent: Amazonbot
Disallow: /

# Other AI Crawlers
User-agent: Diffbot
Disallow: /
User-agent: ImagesiftBot
Disallow: /
User-agent: Omgilibot
Disallow: /
User-agent: webzio-extended
Disallow: /
User-agent: YouBot
Disallow: /
User-agent: cohere-ai
Disallow: /
User-agent: AI2Bot
Disallow: /

# Allow standard search engines
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /

User-agent: *
Allow: /
```
Upload this file to your website's root directory so it's accessible at https://yourdomain.com/robots.txt. Keep in mind that this list requires regular updates as new AI bots emerge—Cloudflare has identified over 226 known AI crawlers, and more appear regularly.
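Before deploying, it's worth verifying that the rules behave as intended. A quick sketch using Python's standard-library robots.txt parser, run against a condensed sample of the rules above (the full file can be tested the same way):

```python
from urllib.robotparser import RobotFileParser

# Condensed sample of the robots.txt rules shown above
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Googlebot
Allow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Confirm AI crawlers are blocked while search engines stay allowed
for agent in ("GPTBot", "ClaudeBot", "Googlebot"):
    allowed = parser.can_fetch(agent, "https://yourdomain.com/blog/post")
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

The same check can be pointed at your live file by calling `set_url("https://yourdomain.com/robots.txt")` and `read()` instead of `parse()`.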
Layer 2: Cloudflare's AI Bot Management
Robots.txt has a fundamental limitation: it relies on voluntary compliance. Some AI crawlers have been documented ignoring these directives entirely, and malicious actors can spoof legitimate user-agent strings. This is where Cloudflare's infrastructure becomes invaluable.
On July 1, 2025, Cloudflare announced it would become the first major Internet infrastructure provider to block AI crawlers by default. For new domains, customers are now explicitly asked whether they want to allow AI crawlers—a significant shift from the previous opt-out model. More than one million Cloudflare customers have already chosen to block AI bots, demonstrating widespread demand for these protections.
Cloudflare's AI Crawl Control provides several key capabilities:
- One-Click Blocking: Enable "Block AI Bots" in the Security settings to immediately stop verified AI crawlers and unverified bots exhibiting similar behavior
- Granular Crawler Management: Allow, block, or charge individual AI crawlers based on your content strategy. Allow search-focused bots while blocking training crawlers
- Managed Robots.txt: Cloudflare can automatically create and maintain a robots.txt file with AI crawler directives, ensuring your preferences stay current
- Monetized Content Protection: Automatically block AI bots only on pages with advertising, preserving potential revenue while protecting monetized content
- Pay-Per-Crawl: For organizations open to licensing their content, Cloudflare's beta program enables direct payments from AI companies per crawl request
The combination of Cloudflare's global network visibility, machine learning models trained on bot behavior, and experience mitigating DDoS attacks makes their platform uniquely effective at identifying and blocking AI crawlers—even those that attempt to disguise themselves.
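For teams that prefer explicit rules over the one-click toggle, the same policy can be written as a custom WAF rule. A sketch in Cloudflare's rule expression language (the `cf.verified_bot_category` field and its "AI Crawler" value should be verified against current Cloudflare documentation and your plan tier; the user-agent clause is a fallback for crawlers not on the verified-bot list):

```
(cf.verified_bot_category eq "AI Crawler") or (http.user_agent contains "Bytespider")
```

Pair the expression with the Block action in the dashboard; additional `http.user_agent contains` clauses can be chained with `or` for any crawler you want to name explicitly.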
How ITECS Implements Cloudflare Protection
As a managed IT services provider, ITECS helps businesses leverage Cloudflare's comprehensive security suite—not just for AI bot management, but as part of an integrated web security strategy.
Proxy and Caching Services: Cloudflare's reverse proxy sits between your web servers and incoming traffic, providing DDoS protection, SSL/TLS termination, and intelligent caching that improves site performance while reducing server load. This architecture gives you visibility into all traffic hitting your domain, including AI crawlers.
Security Layers: Beyond bot management, Cloudflare provides Web Application Firewall (WAF) capabilities, rate limiting, and browser integrity checking. These tools work together to protect against both AI scraping and traditional cyber threats.
Analytics and Reporting: Cloudflare's dashboard provides detailed insights into bot traffic patterns, showing which AI crawlers are attempting to access your site, how often, and which pages they're targeting. This visibility enables data-driven decisions about your content access policies.
Our cybersecurity consulting team works with your organization to develop a comprehensive bot management strategy that balances content protection with legitimate business needs like search engine visibility and potential AI partnership opportunities.
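For origin servers that are not fronted by Cloudflare, the same blocklist can be enforced at the web server itself. A minimal sketch for nginx, using bot names from the robots.txt above (the `map` block belongs in the `http` context; extend the pattern as new crawlers appear):

```nginx
# Flag requests whose User-Agent matches a known AI crawler (~* = case-insensitive)
map $http_user_agent $is_ai_crawler {
    default 0;
    "~*(GPTBot|ClaudeBot|CCBot|Bytespider|PerplexityBot|Meta-ExternalAgent|Amazonbot)" 1;
}

server {
    listen 443 ssl;
    server_name yourdomain.com;

    # Refuse flagged crawlers before any content is served
    if ($is_ai_crawler) {
        return 403;
    }

    # ... existing site configuration ...
}
```

Server-level rules catch crawlers that ignore robots.txt, but unlike Cloudflare's behavioral detection they only match declared user-agent strings, so they remain one layer among several.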
Securing Your Microsoft 365 Environment
While website protection focuses on external crawlers, Microsoft 365 presents a different challenge: internal features that may share your data with AI systems. The "Connected Experiences" functionality has been part of Microsoft 365 since April 2019, but recent concerns about AI training have brought renewed scrutiny to these settings.
Understanding Connected Experiences
Microsoft's Connected Experiences encompass a wide range of cloud-powered features within Word, Excel, PowerPoint, Outlook, and other Microsoft 365 applications. These include real-time grammar suggestions, co-authoring capabilities, translation services, and content recommendations. When enabled, these features send document content to Microsoft's cloud services for processing.
Microsoft has stated explicitly: "Microsoft does not use customer data from Microsoft 365 consumer and commercial applications to train large language models." This distinction is important—while Connected Experiences do process your content, Microsoft claims this processing is for feature functionality rather than model training.
However, Microsoft's privacy statement includes language about using data "to develop and train our AI models," creating ambiguity that concerns privacy-conscious organizations. The company has clarified that enterprise customers can negotiate specific terms about data usage, but the default settings and documentation remain complex.
How to Disable Connected Experiences
For individual users, disabling Connected Experiences requires navigating through multiple layers of settings menus. Here's the path:
Individual User Settings (Word, Excel, PowerPoint, etc.)
- Open any Microsoft 365 application (Word, Excel, etc.)
- Click File in the top-left corner
- Select Options from the sidebar
- Navigate to the Trust Center tab
- Click Trust Center Settings
- Select Privacy Options
- Click Privacy Settings
- Find Optional Connected Experiences and uncheck the box
Note: This process must be repeated for each Microsoft 365 application on each device.
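Rather than repeating those clicks on every device, the setting can be pushed centrally. As an illustration, the Group Policy setting "Allow the use of additional optional connected experiences in Office" is backed by a per-user registry value along these lines (path and value name should be verified against Microsoft's current ADMX documentation before deployment):

```
Windows Registry Editor Version 5.00

; 2 = disabled; deploy per user via Group Policy or Intune
[HKEY_CURRENT_USER\Software\Policies\Microsoft\office\16.0\common\privacy]
"controllerconnectedservicesenabled"=dword:00000002
```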
Enterprise Administrator Controls
For organizations using Microsoft 365 enterprise deployments, administrators have access to centralized policy controls that can enforce Connected Experiences settings across the entire organization.
Using Microsoft Intune, Group Policy, or the Microsoft 365 Apps admin center, IT administrators can:
- Disable all optional connected experiences organization-wide
- Control which specific connected experiences are available to users
- Prevent users from overriding organizational settings
- Configure different policies for different user groups based on data sensitivity
Microsoft Purview's Data Security Posture Management (DSPM) for AI provides additional monitoring capabilities, offering insights into AI interactions involving sensitive information and allowing organizations to create policies around Copilot for Microsoft 365 usage.
Copilot Considerations
Microsoft 365 Copilot represents a separate but related concern. While Copilot does access your organizational data through Microsoft Graph to provide AI-powered assistance, Microsoft maintains that "prompts, responses, and data accessed through Microsoft Graph aren't used to train foundation LLMs." The data processing occurs within your tenant boundary and is subject to your existing Microsoft 365 compliance configurations.
Organizations subject to regulatory requirements such as HIPAA or CMMC should work with their compliance teams and IT consultants to evaluate whether Copilot deployment aligns with their data handling requirements.
Controlling AI Access in Google Workspace
Google Workspace presents similar challenges to Microsoft 365, with AI-powered "Smart Features" enabled by default in many regions. A November 2025 California lawsuit alleges that Google changed its policies in October 2025 to give Gemini default access to private content including emails and attachments—content that previously required explicit user consent.
Understanding Google's Smart Features
Google's Smart Features encompass functionality across Gmail, Drive, Calendar, Meet, and other Workspace applications. When enabled, these features allow Google's AI to process your content for purposes including:
- Smart Reply and Smart Compose suggestions in Gmail
- Automatic event creation from email content
- Gemini-powered summaries and content analysis
- Intelligent search across your Workspace content
- Personalization features in Maps, Wallet, and Google Assistant
Google states: "We do not use your Workspace data to train or improve the underlying generative AI and large language models that power Bard, Search, and other systems outside of Workspace without permission." However, the data is used to improve Workspace-specific AI features, and the distinction between "improving features" and "training models" can be unclear.
Disabling Smart Features in Gmail
Google provides granular controls for Smart Features, but the settings are spread across multiple locations:
Gmail Smart Features Settings
- Open Gmail and click the gear icon (Settings)
- Click See all settings
- Navigate to the General tab
- Scroll to Smart features and personalization
- Uncheck Smart features and personalization
- Click Manage Workspace smart feature settings
- Disable both checkboxes:
- Smart features in Google Workspace
- Smart features in other Google products
- Click Save Changes
Note: Disabling these features will remove functionality like Smart Compose, categorized inbox, and automatic calendar event detection.
Google Workspace Administrator Controls
For organizations using Google Workspace Business, Enterprise, or Education editions, administrators can enforce AI feature restrictions organization-wide through the Google Admin console:
- Log in to admin.google.com
- Navigate to Apps → Google Workspace → Settings for Gmail → User settings
- Locate options for Gemini and Smart Features
- Configure restrictions for the entire organization or specific organizational units
The Gemini for Workspace feature can be completely disabled or restricted to specific user groups, giving organizations control over who can use AI-powered capabilities and under what circumstances.
Managing Gemini App Permissions
Even with Workspace Smart Features disabled, the standalone Gemini app has separate permissions that must be managed. Google's documentation indicates that Gemini learns from user chats by default, with sample conversations contributing to AI model training unless explicitly disabled.
To use Gemini without contributing to model training, users can rely on temporary chats or interact with Gemini without signing in to a Google account. Organizations should include Gemini app policies in their broader AI governance strategy.
Building an Enterprise AI Data Protection Strategy
Effective protection against AI data harvesting requires more than configuring individual settings—it demands a comprehensive strategy that addresses policy, technology, and ongoing governance.
Policy Development
Organizations should develop clear policies addressing:
- Content Classification: Which content categories (proprietary research, customer data, public marketing materials) require AI crawler protection?
- AI Tool Usage: Under what circumstances can employees use AI tools that may process company data? What approvals are required?
- Third-Party Agreements: Do vendor contracts include provisions about AI training on your data? What due diligence is required before adopting new SaaS tools?
- Incident Response: How will you respond if you discover unauthorized AI training on your content? What documentation and legal resources are available?
Technical Implementation Checklist
Ongoing Governance
The AI crawler landscape evolves rapidly. New bots emerge regularly, existing bots change their behavior, and platform policies shift. Effective protection requires ongoing vigilance:
- Regular Audits: Review server logs and Cloudflare analytics monthly to identify new AI crawlers attempting to access your content
- Policy Updates: Keep robots.txt and firewall rules current as new AI bots are documented
- Platform Monitoring: Track Microsoft and Google announcements about AI feature changes that might affect your data handling
- Employee Training: Ensure staff understand policies about AI tool usage and recognize signs of unauthorized data access
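To make the monthly log review concrete, a small audit script can tally AI-crawler hits in a standard access log. A sketch using bot names from the reference table above (the log lines here are illustrative; point the function at your real log file):

```python
from collections import Counter

AI_CRAWLERS = ["GPTBot", "ClaudeBot", "CCBot", "Bytespider",
               "PerplexityBot", "Amazonbot", "Meta-ExternalAgent"]

def count_ai_hits(log_lines):
    """Tally access-log lines whose User-Agent names a known AI crawler."""
    hits = Counter()
    for line in log_lines:
        lowered = line.lower()
        for bot in AI_CRAWLERS:
            if bot.lower() in lowered:
                hits[bot] += 1
    return hits

# Illustrative combined-format log lines
sample = [
    '203.0.113.7 - - [01/Jul/2025:10:00:00 +0000] "GET /blog HTTP/1.1" 200 512 "-" "Mozilla/5.0; GPTBot/1.1"',
    '198.51.100.2 - - [01/Jul/2025:10:01:00 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0; ClaudeBot/1.0"',
    '192.0.2.9 - - [01/Jul/2025:10:02:00 +0000] "GET / HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (ordinary browser)"',
]
print(count_ai_hits(sample))
```

A sudden spike for one bot, or hits from a bot you have already blocked in robots.txt, is a signal to tighten firewall rules or escalate to network-level blocking.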
Related Resources
Meta and Yandex Betrayed User Trust
How major tech companies have exploited user data for AI training without adequate consent
Quick Tips for Cybersecurity Hygiene
Foundational security practices that complement AI data protection efforts
Microsoft 365 MSP Guide
Comprehensive guide to managing Microsoft 365 environments securely
Claude vs ChatGPT: Business Comparison
Understanding the AI tools that may be crawling your content
Take Control of Your Digital Content
The battle over AI training data is fundamentally about control—control over your intellectual property, your customer data, and your competitive advantages. While AI companies have benefited enormously from freely scraping the web, businesses are increasingly asserting their right to decide how their content is used.
Implementing comprehensive AI data protection requires expertise across web infrastructure, cloud platforms, and enterprise security. ITECS helps businesses navigate this complex landscape, implementing layered defenses that protect your content without sacrificing legitimate functionality.
Our cybersecurity services team can audit your current exposure to AI crawlers, implement Cloudflare's enterprise bot management, configure Microsoft 365 and Google Workspace privacy settings, and establish ongoing governance processes to keep your protections current.
Protect Your Business from Unauthorized AI Training
Don't let AI companies profit from your content without your consent. ITECS provides comprehensive AI data protection services including Cloudflare implementation, Microsoft 365 and Google Workspace configuration, and ongoing bot management.
Schedule a Consultation Today →