{"id":981,"date":"2026-01-08T15:45:51","date_gmt":"2026-01-08T07:45:51","guid":{"rendered":"http:\/\/longzhuplatform.com\/?p=981"},"modified":"2026-01-08T15:45:51","modified_gmt":"2026-01-08T07:45:51","slug":"most-major-news-publishers-block-ai-training-retrieval-bots-via-sejournal-mattgsouthern","status":"publish","type":"post","link":"http:\/\/longzhuplatform.com\/?p=981","title":{"rendered":"Most Major News Publishers Block AI Training &amp; Retrieval Bots via @sejournal, @MattGSouthern"},"content":{"rendered":"<p><\/p> <div id=\"narrow-cont\"> <p>Most top news publishers block AI training bots via robots.txt, but they\u2019re also blocking the retrieval bots that determine whether sites appear in AI-generated answers.<\/p> <p>BuzzStream analyzed the robots.txt files of 100 top news sites across the US and UK and found <strong>79%<\/strong> block at least one training bot. More notably, <strong>71%<\/strong> also block at least one retrieval or live search bot.<\/p> <p>Training bots gather content to build AI models, while retrieval bots fetch content in real time when users ask questions. Sites blocking retrieval bots may not appear when AI tools try to cite sources, even if the underlying model was trained on their content.<\/p> <h2>What The Data Shows<\/h2> <p>BuzzStream examined the top 50 news sites in each market based on SimilarWeb traffic share, then deduplicated the list. The study grouped bots into three categories: training, retrieval\/live search, and indexing.<\/p> <h3>Training Bot Blocks<\/h3> <p>Among training bots, Common Crawl\u2019s CCBot was the most frequently blocked at 75%, followed by Anthropic-ai at 72%, ClaudeBot at 69%, and GPTBot at 62%.<\/p> <p>Google-Extended, which trains Gemini, was the least blocked training bot at 46% overall. 
US publishers blocked it at 58%, nearly double the 29% rate among UK publishers.<\/p> <p>Harry Clarkson-Bennett, SEO Director at The Telegraph, told BuzzStream:<\/p> <blockquote> <p>\u201cPublishers are blocking AI bots using the robots.txt because there\u2019s almost no value exchange. LLMs are not designed to send referral traffic and publishers (still!) need traffic to survive.\u201d<\/p> <\/blockquote> <h3>Retrieval Bot Blocks<\/h3> <p>The study found 71% of sites block at least one retrieval or live search bot.<\/p> <p>Claude-Web was blocked by 66% of sites, while OpenAI\u2019s OAI-SearchBot, which powers ChatGPT\u2019s live search, was blocked by 49%. ChatGPT-User was blocked by 40%.<\/p> <p>Perplexity-User, which handles user-initiated retrieval requests, was the least blocked at 17%.<\/p> <h3>Indexing Blocks<\/h3> <p>PerplexityBot, which Perplexity uses to index pages for its search corpus, was blocked by 67% of sites.<\/p> <p>Only 14% of sites blocked all AI bots tracked in the study, while 18% blocked none.<\/p> <h2>The Enforcement Gap<\/h2> <p>The study acknowledges that robots.txt is a directive, not a barrier, and bots can ignore it.<\/p> <p>We covered this enforcement gap when Google\u2019s Gary Illyes confirmed robots.txt can\u2019t prevent unauthorized access. It functions more like a \u201cplease keep out\u201d sign than a locked door.<\/p> <p>Clarkson-Bennett raised the same point in BuzzStream\u2019s report:<\/p> <blockquote> <p>\u201cThe robots.txt file is a directive. It\u2019s like a sign that says please keep out, but doesn\u2019t stop a disobedient or maliciously wired robot. Lots of them flagrantly ignore these directives.\u201d<\/p> <\/blockquote> <p>Cloudflare documented that Perplexity used stealth crawling behavior to bypass robots.txt restrictions. The company rotated IP addresses, changed ASNs, and spoofed its user agent to appear as a browser.<\/p> <p>Cloudflare delisted Perplexity as a verified bot and now actively blocks it. 
Perplexity disputed Cloudflare\u2019s claims and published a response.<\/p> <p>For publishers serious about blocking AI crawlers, CDN-level blocking or bot fingerprinting may be necessary beyond robots.txt directives.<\/p> <h2>Why This Matters<\/h2> <p>The retrieval-blocking numbers warrant attention here. In addition to opting out of AI training, many publishers are opting out of the citation and discovery layer that AI search tools use to surface sources.<\/p> <p>OpenAI separates its crawlers by function: GPTBot gathers training data, while OAI-SearchBot powers live search in ChatGPT. Blocking one doesn\u2019t block the other. Perplexity makes a similar distinction between PerplexityBot for indexing and Perplexity-User for retrieval.<\/p> <p>These blocking choices affect where AI tools can pull citations from. If a site blocks retrieval bots, it may not appear when users ask AI assistants for sourced answers, even if the model already contains that site\u2019s content from training.<\/p> <p>The Google-Extended pattern is worth watching. US publishers block it at nearly twice the UK rate, though whether that reflects different risk calculations around Gemini\u2019s growth or different business relationships with Google isn\u2019t clear from the data.<\/p> <h2>Looking Ahead<\/h2> <p>The robots.txt method has limits, and sites that want to block AI crawlers may find CDN-level restrictions more effective than robots.txt alone.<\/p> <p>Cloudflare\u2019s Year in Review found GPTBot, ClaudeBot, and CCBot had the highest number of full disallow directives across top domains. The report also noted that most publishers use partial blocks for Googlebot and Bingbot rather than full blocks, reflecting the dual role Google\u2019s crawler plays in search indexing and AI training.<\/p> <p>For those tracking AI visibility, the retrieval bot category is what to watch. 
Training blocks affect future models, while retrieval blocks affect whether your content shows up in AI answers right now.<\/p> <hr\/> <p><em>Featured Image: Kitinut Jinapuck\/Shutterstock<\/em><\/p> <\/div> <p>Generative AI,News#Major #News #Publishers #Block #Training #amp #Retrieval #Bots #sejournal #MattGSouthern1767858351<\/p> ","protected":false},"excerpt":{"rendered":"<p>Most top news publishers block AI training bots via robots.txt, but they\u2019re also blocking the retrieval bots that determine whether sites appear in AI-generated answers. BuzzStream analyzed the robots.txt files of 100 top news sites across the US and UK and found 79% block at least one training bot. More notably, 71% also block at [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":982,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[16],"tags":[87,85,89,82,90,83,84,88,80,86],"class_list":["post-981","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-accessibility","tag-amp","tag-block","tag-bots","tag-major","tag-mattgsouthern","tag-news","tag-publishers","tag-retrieval","tag-sejournal","tag-training"],"acf":[],"_links":{"self":[{"href":"http:\/\/longzhuplatform.com\/index.php?rest_route=\/wp\/v2\/posts\/981","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/longzhuplatform.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/longzhuplatform.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/longzhuplatform.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/longzhuplatform.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=981"}],"version-history":[{"count":0,"href":"http:\/\/longzhuplatform.com\/index.php?rest_route=\/wp\/v2\/posts\/981\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"http:\/\/longzhuplatform.co
m\/index.php?rest_route=\/wp\/v2\/media\/982"}],"wp:attachment":[{"href":"http:\/\/longzhuplatform.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=981"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/longzhuplatform.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=981"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/longzhuplatform.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=981"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}
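<p>For reference, the training-versus-retrieval split the study measured can be sketched as a robots.txt along these lines. This is an illustrative example, not a recommendation: it disallows the training crawlers named above while explicitly allowing the retrieval fetchers, using the user-agent tokens cited in the study. Each vendor&#8217;s documentation should be checked for its current tokens.</p>

```
# Block AI training crawlers (illustrative; verify tokens with each vendor)
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Leave retrieval / live-search fetchers able to access pages
User-agent: OAI-SearchBot
Allow: /

User-agent: Perplexity-User
Allow: /
```

<p>Because robots.txt is advisory, directives like these only affect bots that choose to honor them, as the enforcement-gap section above notes.</p>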
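<p>A check like the one BuzzStream ran, testing whether specific user-agent tokens may fetch a page under a site&#8217;s robots.txt, can be approximated with Python&#8217;s standard-library <code>urllib.robotparser</code>. This is a minimal sketch under stated assumptions, not the study&#8217;s actual tooling: the sample robots.txt and the <code>example.com</code> URL are placeholders, and the bot roles simply mirror the categories described in the article.</p>

```python
from urllib import robotparser

# Hypothetical robots.txt, mirroring the pattern the study measured:
# training crawlers disallowed, retrieval/live-search fetchers untouched.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
"""

# User-agent tokens cited in the study, grouped by role.
AI_BOTS = {
    "GPTBot": "training",
    "CCBot": "training",
    "ClaudeBot": "training",
    "OAI-SearchBot": "retrieval",
    "Perplexity-User": "retrieval",
}


def blocked_bots(robots_txt: str, url: str = "https://example.com/") -> dict:
    """Return the AI bots that may not fetch `url` under this robots.txt."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # Bots with no matching record and no wildcard record are allowed by default.
    return {bot: role for bot, role in AI_BOTS.items()
            if not parser.can_fetch(bot, url)}


print(blocked_bots(SAMPLE_ROBOTS_TXT))
# With the sample above, only the three training bots are reported as blocked.
```

<p>Running the same function against a live site would mean fetching its <code>/robots.txt</code> first; a real audit would also need to handle per-path rules rather than testing a single URL.</p>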