{"id":9689,"date":"2026-06-10T09:03:13","date_gmt":"2026-06-10T01:03:13","guid":{"rendered":"http:\/\/longzhuplatform.com\/?p=9689"},"modified":"2026-06-10T09:03:13","modified_gmt":"2026-06-10T01:03:13","slug":"us-publishers-demand-common-crawl-stop-scraping-their-content-via-sejournal-mattgsouthern","status":"publish","type":"post","link":"http:\/\/longzhuplatform.com\/?p=9689","title":{"rendered":"US Publishers Demand Common Crawl Stop Scraping Their Content via @sejournal, @MattGSouthern"},"content":{"rendered":"<p><\/p> <div id=\"narrow-cont\"> <p>Digital Content Next, a trade body representing US digital publishers, has sent a cease and desist letter to the Common Crawl Foundation.<\/p> <p>The letter demands Common Crawl stop collecting publisher content and remove material already in its datasets.<\/p> <p>DCN CEO Jason Kint announced the legal notice in a blog post, and Press Gazette reported additional details from the letter this week.<\/p> <p>Common Crawl has crawled several billion new pages each month since 2007 to build a free public archive. That archive has been used to train many of the AI models in use today. OpenAI\u2019s GPT-3 paper listed filtered Common Crawl as 60% of the model\u2019s training mix.<\/p> <p>The dispute matters for any site that blocks AI crawlers. 
Blocking Common Crawl\u2019s crawler, CCBot, stops future collection but doesn\u2019t touch content already in the archive, which anyone can still download.<\/p> <h2>What DCN Demands<\/h2> <p>The letter calls on Common Crawl to stop \u201cscraping, retaining, or sharing copyrighted, paywalled, subscriber-only, or otherwise protected content from DCN member companies in its datasets,\u201d and to remove member content it has already collected.<\/p> <p>DCN claims Common Crawl has \u201cflagrantly infringed\u201d copyrighted content by creating its datasets and sharing them with AI companies.<\/p> <p>The letter argues \u201ccopyright law is not an opt-out regime.\u201d In other words, DCN\u2019s position is that publishers shouldn\u2019t have to ask to be excluded. Common Crawl should need permission to include them.<\/p> <p>Kint wrote that the notice:<\/p> <blockquote> <p>\u201cchallenges a growing assumption that content created through substantial investment can be collected, stored, repurposed, and monetized simply because it is technically accessible.\u201d<\/p> <\/blockquote> <h2>Why DCN Doubts The Removal Process<\/h2> <p>The DCN letter questions whether Common Crawl follows opt-out instructions and whether it removes content when asked. Per Press Gazette, DCN\u2019s lawyers are examining whether Common Crawl\u2019s statements to publishers \u201cmay have been inaccurate or misleading.\u201d<\/p> <p>Common Crawl publishes a public registry of websites that have asked not to be scraped. It includes entries for the Associated Press, the BBC, and a large News\/Media Alliance submission covering hundreds of domains. Press Gazette reports the list also includes other major publishers.<\/p> <p>This isn\u2019t the first time the removal process has been questioned. 
The Atlantic reported in November that content from The New York Times and Danish publishers was still available after Common Crawl agreed to remove it.<\/p> <h2>Common Crawl\u2019s Response<\/h2> <p>Common Crawl executive director Rich Skrenta declined to comment on the letter when contacted by Press Gazette.<\/p> <p>He has pushed back on similar claims before. In a November blog post responding to The Atlantic, Skrenta denied that the organization lied to publishers or scrapes paywalled material.<\/p> <p>He said the archive\u2019s file format can\u2019t be edited after publication without breaking its integrity. Instead, Common Crawl says it removes or filters affected URLs from subsequent crawls and makes them inaccessible through its public tools and indices:<\/p> <blockquote> <p>\u201cWhen a publisher asks us to remove previously crawled material, we respond promptly and initiate a removal process that reflects the technical design of our dataset.\u201d<\/p> <\/blockquote> <p>He added:<\/p> <blockquote> <p>\u201cNo one at Common Crawl has ever claimed this work was instantaneous or complete; rather, we have been open about its complexity and ongoing nature.\u201d<\/p> <\/blockquote> <p>In a forum post this week, Skrenta said Common Crawl is contributing to open standards work on how websites express AI scraping preferences.<\/p> <h2>Why This Matters<\/h2> <p>The DCN letter targets the stored archive, not just future crawling, and argues the burden should not fall on publishers to opt out in the first place.<\/p> <p>Most publishers in BuzzStream\u2019s sample have already made the blocking decision, with 79% of the 100 news sites it checked blocking at least one training bot. Cloudflare\u2019s Year in Review data we covered in January found CCBot among the bots with the most full disallow directives across top domains. 
The question DCN raises is what those blocks accomplish if years of content stay available for training anyway.<\/p> <h2>Looking Ahead<\/h2> <p>Whether DCN escalates depends on how Common Crawl responds, and Common Crawl hasn\u2019t said how it will. The two sides want different rules for who acts first.<\/p> <p>Skrenta is backing standards work that would let sites state their scraping preferences, which keeps opting out as the model. The UK\u2019s CMA took a similar path when it required Google to let publishers opt out of AI search features.<\/p> <p>DCN argues scrapers should need permission first. If more trade groups take up that argument, the pressure moves from individual robots.txt files to the archives themselves.<\/p> <hr\/> <p><em>Featured Image: <span class=\"MuiBox-root mui-16qd35q-centeredContent-avatarContainer\"><span class=\"MuiTypography-root MuiTypography-body1 mui-1w8ttpd-contributorLabel-linkAvatarLabel\">Andre Boukreev<\/span><\/span>\/Shutterstock<\/em><\/p> <\/div> ","protected":false},"excerpt":{"rendered":"<p>Digital Content Next, a trade body representing US digital publishers, has sent a cease and desist letter to the Common Crawl Foundation. The letter demands Common Crawl stop collecting publisher content and remove material already in its datasets. 
DCN CEO Jason Kint announced the legal notice in a blog post, and Press Gazette reported additional [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":9690,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[16],"tags":[1744,185,8581,1578,90,84,13153,80,6301],"class_list":["post-9689","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-accessibility","tag-common","tag-content","tag-crawl","tag-demand","tag-mattgsouthern","tag-publishers","tag-scraping","tag-sejournal","tag-stop"],"acf":[],"_links":{"self":[{"href":"http:\/\/longzhuplatform.com\/index.php?rest_route=\/wp\/v2\/posts\/9689","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/longzhuplatform.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/longzhuplatform.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/longzhuplatform.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/longzhuplatform.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=9689"}],"version-history":[{"count":0,"href":"http:\/\/longzhuplatform.com\/index.php?rest_route=\/wp\/v2\/posts\/9689\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"http:\/\/longzhuplatform.com\/index.php?rest_route=\/wp\/v2\/media\/9690"}],"wp:attachment":[{"href":"http:\/\/longzhuplatform.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=9689"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/longzhuplatform.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=9689"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/longzhuplatform.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=9689"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}