Information Retrieval Part 2: How To Get Into Model Training Data

There has never been a more important time in your career to spend time learning and understanding information retrieval. Not because AI search differs drastically from traditional search, but because everyone else thinks it does.

Every C-suite in the country is desperate to get this right. Decision-makers need to feel confident that you and I are the right people to lead us into the new frontier.

We need to learn the fundamentals of information retrieval. Even if your business shouldn’t be doing anything differently.

Here, that starts with understanding the basics of model training data: what it is, how it works and, crucially, how you get into it.

TL;DR

  1. AI is the product of its training data. The quality (and quantity) of the data a model trains on is key to its success.
  2. The web-sourced AI data commons is rapidly becoming more restricted. This will skew data representativeness, freshness, and scaling laws.
  3. The more consistent, accurate brand mentions you have in training data, the less ambiguous you are.
  4. Quality SEO, combined with better product and traditional marketing, will improve your presence in training data and, eventually, in real-time RAG/retrieval.

What Is Training Data?

Training data is the foundational dataset used in training LLMs to predict the most appropriate next word, sentence, and answer. The data can be labeled, where models are taught the right answer, or unlabeled, where they have to figure it out for themselves.

Without high-quality training data, models are completely useless.

From semi-libelous tweets to videos of cats and great works of art and literature that stand the test of time, nothing is off limits. Nothing. It’s not just words either. Speech-to-text models need to be trained to respond to different speech patterns and accents. Emotions even.

Image Credit: Harry Clarkson-Bennett

How Does It Work?

The models don’t memorize, they compress. LLMs process billions of data points, adjusting internal weights through a mechanism known as backpropagation.

If the next word predicted in a string of training examples is correct, it moves on. If not, it gets the machine equivalent of Pavlovian conditioning.

Bopped on the head with a stick or a “good boy.”

The model is then able to vectorize, creating a map of associations by term, phrase, and sentence (a minimal sketch follows the list below):

  • Converting text into numerical vectors (in the simplest case, a Bag of Words count).
  • Capturing semantic meaning of words and sentences, preserving wider context and meaning (word and sentence embeddings).
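
To make that concrete, here is a minimal bag-of-words sketch in Python. It is the crudest possible version of turning text into numbers; real LLMs learn dense embeddings instead, but the underlying move (text in, numbers out) is the same.

```python
# A minimal bag-of-words sketch: turning text into numerical vectors.
# Real LLMs learn dense embeddings rather than raw counts, but the
# underlying move (text in, numbers out) is the same.
from collections import Counter

docs = [
    "the dog sat on the mat",
    "the cat sat on the dog",
]

# Build a shared vocabulary across all documents.
vocab = sorted({word for doc in docs for word in doc.split()})

def bag_of_words(text: str) -> list[int]:
    """Count how often each vocabulary word appears in the text."""
    counts = Counter(text.split())
    return [counts[word] for word in vocab]

for doc in docs:
    print(doc, "->", bag_of_words(doc))
```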

Rules and nuances are encoded as a set of semantic relationships; this is known as parametric memory. “Knowledge” baked directly into the architecture. The more refined a model’s knowledge on a topic, the less it has to use a form of grounding to verify its twaddle.

Worth noting that models with a high parametric memory are faster at retrieving accurate information (if it's in there), but their knowledge base is static: anything after the training cutoff simply isn't there.

RAG and live web search are examples of a model using non-parametric memory. Infinite scale, but slower. Much better for news and when results require grounding.
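
As a rough illustration of non-parametric memory, this is what a retrieval-augmented loop looks like stripped to the bone. The `search_index` and `llm_generate` helpers are hypothetical stand-ins for a live index and a model call, not any particular vendor's API.

```python
# A rough sketch of non-parametric memory: fetch fresh documents at query time
# and ground the answer in them, instead of relying on baked-in weights.
# `search_index` and `llm_generate` are hypothetical stand-ins, not a real API.

def retrieve(query: str, search_index, k: int = 3) -> list[str]:
    # Pull the k most relevant documents from a live source
    # (web search, vector database, news index, etc.).
    return search_index.top_k(query, k)

def answer_with_rag(query: str, search_index, llm_generate) -> str:
    context = "\n\n".join(retrieve(query, search_index, k=3))
    prompt = (
        "Answer using only the sources below and cite them.\n\n"
        f"Sources:\n{context}\n\nQuestion: {query}"
    )
    return llm_generate(prompt)
```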

Crafting Better Quality Algorithms

When it comes to training data, crafting better-quality algorithms relies on three elements:

  1. Quality.
  2. Quantity.
  3. Removal of bias.

Quality of data matters for obvious reasons. If you train a model on poorly labeled, purely synthetic data, you can't expect its performance to mirror real-world problems or complexities.

Quantity of data is a problem, too. Mainly because these companies have eaten everything in sight and done a runner on the bill.

Leveraging synthetic data to solve issues of scale isn’t necessarily a problem. The days of accessing high-quality, free-to-air content on the internet for these guys are largely gone. For two main reasons:

  1. Unless you want diabolical racism, mean comments, conspiracy theories, and plagiarized BS, I’m not sure the internet is your guy anymore.
  2. That's if they respect companies' robots.txt directives at all. Eight in 10 of the world's biggest news websites now block AI training bots. I don't know how effective their CDN-level blocking is, but it makes quality training data harder to come by.

Bias and diversity (or lack of it) is a huge problem too. People have their own inherent biases. Even the ones building these models.

Shocking I know…

If models are fed data unfairly weighted towards certain characteristics or brands, it can reinforce societal issues. It can further discrimination.

Remember, LLMs are neither intelligent nor databases of facts. They analyze patterns from ingested data. Billions or trillions of numerical weights that determine the next word (token) following another in any given context.

How Is Training Data Collected?

Like every good SEO answer, it depends.

  1. If you built an AI model explicitly to identify pictures of dogs, you need pictures of dogs in every conceivable position. Every type of dog. Every emotion the pooch shows. You need to create or procure a dataset of millions, maybe billions, of canine images.
  2. Then it must be cleaned. Think of it as structuring data into a consistent format. In said dog scenario, maybe a feline friend nefariously added pictures of cats dressed up as dogs to mess you around. Those must be identified.
  3. Then labeled (for supervised learning). Data labeling (with some human annotation) ensures we have a sentient being somewhere in the loop. Hopefully an expert, adding relevant labels to a tiny portion of the data so that the model can learn. For example, a dachshund sitting on a box looking melancholic.
  4. Pre-processing. Responding to issues like cats masquerading as dogs. Ensuring you minimize potential biases in the dataset like specific dog breeds being mentioned far more frequently than others.
  5. Partitioned. A portion of the data is kept back so the model can't simply memorize the outputs. This is the final validation stage. Kind of like a final exam the model hasn't seen before (a quick sketch of this split follows the list).
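
Here is a quick sketch of that partitioning step, assuming scikit-learn and a stand-in dataset: a slice of the labeled data is held back from training so you can check the model generalizes rather than memorizes.

```python
# Sketch of the partitioning step, assuming scikit-learn. A held-back slice of
# labeled data is never shown to the model during training, so you can check
# that it generalizes rather than memorizes.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Stand-in for your cleaned, labeled dog-vs-not-dog data.
X, y = make_classification(n_samples=1_000, n_features=20, random_state=42)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
# Train only on X_train/y_train; report accuracy on X_val/y_val.
```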

This is, obviously, expensive and time-consuming. It’s not feasible to take up hundreds of thousands of hours of expertise from real people in fields that matter.

Think of this. You’ve just broken your arm, and you’re waiting in the ER for six hours. You finally get seen, only to be told you had to wait because all the doctors have been processing images for OpenAI’s new model.

“Yes sir, I know you’re in excruciating pain, but I’ve got a hell of a lot of sad looking dogs to label.”

Data labeling is a time-consuming and tedious process. To combat this, many businesses hire large teams of human data annotators (aka humans in the loop, you know, actual experts), assisted by automated weak-labeling models. In supervised learning, they handle the initial labeling.

For perspective, one hour of video data can take humans up to 800 hours to annotate.

Micro Models

So, companies build micro-models. Models that don’t require as much training or data to run. The humans in the loop (I’m sure they have names) can start training micro-models after annotating a few examples.

The models learn. They train themselves.

So over time, human input decreases, and we’re only needed to validate the outputs. And to make sure the models aren’t trying to undress children, celebrities, and your coworkers on the internet.
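
A toy version of that loop, assuming scikit-learn and purely illustrative thresholds: humans label a small slice, the micro-model pseudo-labels whatever it is confident about, and the pool of labeled data grows with less and less human input.

```python
# A toy "annotate a few, let the model label the rest" loop, assuming
# scikit-learn. The model, thresholds, and data are illustrative only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)

# Humans label only a tiny slice; everything else starts out unlabeled.
labeled = np.zeros(len(y), dtype=bool)
labeled[:50] = True

model = LogisticRegression(max_iter=1_000)
for _ in range(5):  # a few rounds of self-training
    model.fit(X[labeled], y[labeled])
    unlabeled_idx = np.where(~labeled)[0]
    if len(unlabeled_idx) == 0:
        break
    probs = model.predict_proba(X[unlabeled_idx]).max(axis=1)
    confident = unlabeled_idx[probs > 0.95]
    if len(confident) == 0:
        break
    # Treat the model's own high-confidence predictions as weak labels.
    y[confident] = model.predict(X[confident])
    labeled[confident] = True

print(f"{labeled.sum()} of {len(y)} examples now labeled, 50 of them by humans")
```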

But who cares about that in the face of “progress.”

Image Credit: Harry Clarkson-Bennett

Types Of Training Data

Training data is usually categorized by how much guidance is provided or required (supervision) and the role it plays in the model’s lifecycle (function).

Ideally a model is largely trained on real data.

Once a model is ready, it can be trained and fine-tuned on synthetic data. But synthetic data alone is unlikely to create high-quality models.

  • Supervised (or labeled): Where every input is annotated with the “right” answer.
  • Unsupervised (or unlabeled): Work it out yourself, robots, I’m off for a beer.
  • Semi-supervised: Where a small amount of the data is properly labeled and the model “understands” the rules. More like, I’ll have a beer in the office.
  • RLHF (Reinforcement Learning from Human Feedback): Humans are shown two options and asked to pick the “right” one (preference data). Or a person demonstrates the task at hand for the model to imitate (demonstration data). A sketch of both shapes follows this list.
  • Pre-training and fine-tuning data: Massive datasets allow for broad information acquisition, and fine-tuning is used to turn the model into a category expert.
  • Multi-modal: Images, videos, text, etc.
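
For the RLHF bullet above, the records roughly look like this. The field names are assumptions for the example, not any vendor's actual schema.

```python
# Illustrative shapes of RLHF training records. The field names are
# assumptions for the example, not any vendor's actual schema.

# Preference data: a human picks the better of two model responses.
preference_example = {
    "prompt": "Explain what training data is in one sentence.",
    "chosen": "Training data is the text and media a model learns its patterns from.",
    "rejected": "Training data is the model's built-in database of facts.",
}

# Demonstration data: a human shows the model how the task should be done.
demonstration_example = {
    "prompt": "Summarize this article for a busy executive.",
    "completion": "A three-bullet summary written by a human expert goes here.",
}
```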

Then there’s what’s known as edge-case data: data designed to “trick” the model to make it more robust.

In light of the, let’s call it “burgeoning,” market for AI training data, there are obvious issues of “fair use” surrounding it.

“We find that 23% of supervised training datasets are published under research or non-commercial licenses.”

So pay people.

The Spectrum Of Supervision

In supervised learning, the AI algorithm is given labeled data. These labels define the outputs and are fundamental to the algorithm being able to improve over time on its own.

Let’s say you’re training a model to identify colors. There are dozens of shades of each color. Hundreds, even. So while this is an easy example, it requires accurate labeling. The problem with accurate labeling is that it’s time-consuming and potentially costly.

In unsupervised learning, the AI model is given unlabeled data. You chuck millions of rows, images, or videos at a machine, sit down for a coffee, and then kick it when it hasn’t worked out what to do.

It allows for more exploratory “pattern recognition.” Not learning.

While this approach has obvious drawbacks, it’s incredibly useful at identifying patterns a human might miss. The model can essentially define its own labels and pathway.
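
Unsupervised learning in miniature, assuming scikit-learn: no labels go in, the model groups the data itself, and a human still has to decide what those groups mean.

```python
# Unsupervised learning in miniature, assuming scikit-learn: no labels go in,
# the model defines its own groupings, and the cluster count is an
# illustrative choice.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, random_state=1)  # labels deliberately ignored

clusters = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)
# The model has grouped the data into three clusters it defined itself;
# a human still has to decide what those clusters actually mean.
print(clusters[:20])
```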

Models can and do train themselves, and they will find things a human never could. They’ll also miss things. It’s like a driverless car. Driverless cars may have fewer accidents than when a human is in the loop. But when they do, we find it far more unpalatable.

We don’t trust tech autonomy. (Image Credit: Harry Clarkson-Bennett)

It’s the technology that scares us. And rightly so.

Combatting Bias

Bias in training data is very real and potentially very damaging. There are three phases:

  1. Origin bias.
  2. Development bias.
  3. Deployment bias.

Origin bias references the validity and fairness of the dataset. Is the data all-encompassing? Is there any obvious systemic, implicit, or confirmation bias present?

Development bias includes the features or tenets of the data the model is being trained on. Does algorithmic bias occur because of the training data?

Then we have deployment bias. Where the evaluation and processing of the data leads to flawed outputs and automated/feedback loop bias.

You can really see why we need a human in the loop. And why AI models training on synthetic or inappropriately chosen data would be a disaster.

In healthcare, data collection activities influenced by human bias can lead to the training of algorithms that replicate historical inequalities. Yikes.

Leading to a pretty bleak cycle of reinforcement.

The Most Frequently Used Training Data Sources

Training data sources are wide-ranging in both quality and structure. You’ve got the open web, which is obviously a bit mental. X, if you want to train something to be racist. Reddit, if you’re looking for the Incel Bot 5000.

Or highly structured academic and literary repositories if you want to build something, you know, good … Obviously then you have to pay something.

Common Crawl

Common Crawl is a public web repository, a free, open-source storehouse of historical and current web crawl data available to pretty much anyone on the internet.

The full Common Crawl Web Graph currently contains around 607 million domain records across all datasets, with each monthly release covering 94 to 163 million domains.

In the Mozilla Foundation’s 2024 report, Training Data for the Price of a Sandwich, 64% of the 47 LLMs analyzed used at least one filtered version of Common Crawl data.

If you aren’t in the training data, you’re very unlikely to be cited and referenced. The Common Crawl Index Server lets you search any URL pattern against their crawl archives and Metehan’s Web Graph helps you see how “centered you are.”
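
If you want to check for yourself, the Common Crawl Index Server exposes a CDX-style API you can query for your own domain. A minimal sketch, assuming the `requests` library; the crawl ID below is just an example, so swap in a current release listed on index.commoncrawl.org.

```python
# Check whether URLs from your domain appear in a Common Crawl archive.
# The crawl ID below is an example; pick a current one from index.commoncrawl.org.
import json
import requests

CRAWL_ID = "CC-MAIN-2024-33"  # example crawl ID, swap in a recent release

resp = requests.get(
    f"https://index.commoncrawl.org/{CRAWL_ID}-index",
    params={"url": "example.com/*", "output": "json"},
    timeout=30,
)

if resp.status_code == 404:
    # The index returns 404 when no captures match the URL pattern.
    print("No captures found in this crawl.")
else:
    resp.raise_for_status()
    # The API returns one JSON object per line, one line per captured URL.
    records = [json.loads(line) for line in resp.text.splitlines() if line.strip()]
    print(f"{len(records)} captures found")
    for record in records[:5]:
        print(record.get("url"), record.get("status"))
```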

Wikipedia (And Wikidata)

The default English Wikipedia dataset contains 19.88 GB of complete articles that help with language modeling tasks. And Wikidata is an enormous, incredibly comprehensive knowledge graph. Immensely structured data.

While representing only a small percentage of the total tokens, Wikipedia is perhaps the most influential source for entity resolution and factual consensus. It is one of the most factually accurate, up-to-date, and well-structured repositories of content in existence.

Some of the biggest guys have just signed deals with Wikipedia.

Publishers

OpenAI, Gemini, etc., have multi-million dollar licensing deals with a number of publishers.

The list goes on, but only for a bit … and not recently. I’ve heard things have clammed shut. Which, given the state of their finances, may not be surprising.

Media & Libraries

This is mainly for multi-modal content training. Shutterstock (images/video), Getty Images (which has a deal with Perplexity), and Disney (a 2026 partner for the Sora video platform) provide the visual grounding for multi-modal models.

As part of this three-year licensing agreement with Disney, Sora will be able to generate short, user-prompted social videos based on Disney characters.

As part of the agreement, Disney will make a $1 billion equity investment in OpenAI, and receive warrants to purchase additional equity.

Books

BookCorpus turned scraped data of 11,000 unpublished books into a 985 million-word dataset.

We cannot write books fast enough for models to continually learn from. It’s part of the looming model collapse problem.

Code Repositories

Coding has become one of the most influential and valuable features of LLMs. Coding-specific tools like Cursor or Claude Code are incredible. GitHub and Stack Overflow data have built these models.

They’ve built the vibe-engineering revolution.

Public Web Data

Diverse (but relevant) web data results in faster convergence during training, which in turn reduces computational requirements. It’s dynamic. Ever-changing. But, unfortunately, a bit nuts and messy.

But, if you need vast swathes of data, maybe in real-time, then public web data is the way forward. Ditto for real opinions and reviews of products and services. Public web data, review platforms, UGC, and social media sites are great.

Why Models Aren’t Getting (Much) Better

While there’s no shortage of data in the world, most of it is unlabeled and, thus, can’t actually be used in supervised machine learning models. Every incorrect label has a negative impact on a model’s performance.

According to most estimates, we’re only a few years away from running out of quality data. Inevitably, this will lead to a time when those genAI tools start consuming their own garbage.

This is a known problem that will cause model collapse.

  • They are being blocked by companies that do not want their data used pro bono to train the models.
  • Robots.txt protocols (a directive, not something directly enforceable), CDN-level blocking, and terms of service pages have been updated to tell these guys to get lost (see the sketch after this list).
  • They consume data quicker than we can produce it.
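
If you want to see what your own robots.txt is telling the training bots (remembering it is a request, not enforcement), Python's standard library can parse it. A minimal sketch with an example file; GPTBot, CCBot, and Google-Extended are real AI crawler user agents, the rest is illustrative.

```python
# Quick check of what a robots.txt file tells AI training bots. Directives are
# requests, not enforcement, but they are the first signal crawlers look at.
from urllib.robotparser import RobotFileParser

robots_txt = """
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
""".strip().splitlines()

parser = RobotFileParser()
parser.parse(robots_txt)

# example.com and the article path are placeholders for your own URLs.
for bot in ["GPTBot", "CCBot", "Google-Extended", "Googlebot"]:
    print(bot, "allowed:", parser.can_fetch(bot, "https://example.com/article"))
```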

Frankly, as more publishers and websites are forced into paywalling (a smart business decision), the quality of these models only gets worse.

So, How Do You Get In The Training Data?

There are two obvious approaches I can think of.

  1. To identify the seed data sets of models that matter and find ways into them.
  2. To forgo the specifics and just do great SEO and wider marketing. Make a tangible impact in your industry.

I can see pros and cons to both. Finding ways into specific models is probably highly unnecessary for most brands. To me, this smells more like grey hat SEO. Most brands will be better off just doing some really good marketing and getting shared, cited, and, you know, talked about.

These models are not trained on up-to-date data. This is important because you cannot retroactively get into a specific model’s training data. You have to plan ahead.

If you’re an individual, you should be:

  • Creating and sharing content.
  • Going on podcasts.
  • Attending industry events.
  • Sharing other people’s content.
  • Doing webinars.
  • Getting yourself in front of relevant publishers, publications, and people.

There are some pretty obvious sources of highly structured data that models have paid for in recent times. I know, they’ve actually paid for it. I don’t know what the guys at Reddit and Wikipedia had to do to get money from these guys, and maybe I don’t want to.

How Can I Tell What Datasets Models Use?

Everyone has become a lot more closed off with what they do and don’t use for training data. I suspect this is both legally and financially motivated. So, you’ll need to do some digging.

And there are some massive “open source” datasets I suspect they all use:

  • Common Crawl.
  • Wikipedia.
  • Wikidata.
  • Coding repositories.

Fortunately, most deals are public, and it’s safe to assume that models use data from these platforms.

Google has a partnership with Reddit and access to an insane amount of transcripts from YouTube. They almost certainly have more valuable, well-structured data at their fingertips than any other company.

Grok trained almost exclusively on real-time data from X. Hence why it acts like a pre-pubescent school shooter and undresses everyone.

Worth noting that AI companies use third-party vendors: factories where data is scraped, cleaned, and structured to create supervised datasets. Scale AI is the data engine that the big players use. Bright Data specializes in web data collection.

A Checklist

OK, so we’re trying to feature in parametric memory. To appear in an LLM’s training data so the model recognizes you and you’re more likely to be used for RAG/retrieval. That means we need to:

  1. Manage the multi-bot ecosystem of training, indexing, and browsing.
  2. Entity optimization. Well-structured, well-connected content, consistent NAPs, sameAs schema properties, and Knowledge Graph presence. In Google and Wikidata. (A sameAs markup sketch follows this list.)
  3. Make sure your content is rendered on the server side. Google has become very adept at rendering content on the client side, but bots like GPTBot only see the HTML response. JavaScript is still clunky.
  4. Well-structured, machine-readable content in relevant formats. Tables, lists, properly structured semantic HTML.
  5. Get. Yourself. Out. There. Share your stuff. Make noise.
  6. Be ultra, ultra clear on your website about who you are. Answer the relevant questions. Own your entities.
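
For point 2, here is a minimal sketch of what sameAs markup can look like on an organization's homepage. The names, URLs, and Wikidata ID are placeholders; point yours at the profiles and Knowledge Graph entries that actually describe you.

```html
<!-- A minimal sketch of Organization markup with sameAs; all values are placeholders. -->
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Organization",
  "name": "Example Brand",
  "url": "https://www.example.com/",
  "sameAs": [
    "https://en.wikipedia.org/wiki/Example_Brand",
    "https://www.wikidata.org/wiki/Q00000000",
    "https://www.linkedin.com/company/example-brand"
  ]
}
</script>
```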

You have to balance direct associations (what you say) with semantic associations (what others say about you). Make your brand the obvious next word.

Modern SEO, with better marketing.

More Resources:


Read Leadership In SEO, subscribe now.


Featured Image: Collagery/Shutterstock

