{"id":2814,"date":"2026-02-04T22:48:40","date_gmt":"2026-02-04T14:48:40","guid":{"rendered":"http:\/\/longzhuplatform.com\/?p=2814"},"modified":"2026-02-04T22:48:40","modified_gmt":"2026-02-04T14:48:40","slug":"information-retrieval-part-2-how-to-get-into-model-training-data","status":"publish","type":"post","link":"http:\/\/longzhuplatform.com\/?p=2814","title":{"rendered":"Information Retrieval Part 2: How To Get Into Model Training Data"},"content":{"rendered":"<p><\/p> <div id=\"narrow-cont\"> <p>There has never been a more important time in your career to spend time learning and understanding. Not because AI search differs drastically from traditional search. But because everyone else thinks it does.<\/p> <p>Every C-suite in the country is desperate to get this right. Decision-makers need to feel confident that you and I are the right people to lead us into the new frontier.<\/p> <p>We need to learn the fundamentals of information retrieval. Even if your business shouldn\u2019t be doing anything differently.<\/p> <p>Here, that starts with understanding the basics of model training data. What is it, how does it work and \u2013 crucially \u2013 <em>how do I get in it<\/em>.<\/p> <h2><strong>TL;DR<\/strong><\/h2> <ol> <li>AI is the product of its training data. The quality (and quantity) the model trains on is key to its success.<\/li> <li>The web-sourced AI data commons is rapidly becoming more restricted. This will skew data representativity, freshness, and scaling laws.<\/li> <li>The more consistent, accurate brand mentions you have that appear in training data,\u00a0the less ambiguous you are.<\/li> <li>Quality SEO, with better product and traditional marketing, will improve your appearance in the training and data, and eventually with real-time RAG\/retrieval.<\/li> <\/ol> <h2>What Is Training Data?<\/h2> <p>Training data is the foundational dataset used in training LLMs to predict the most appropriate next word, sentence, and answer. 
The data can be labeled, where models are taught the right answer, or unlabeled, where they have to figure it out for themselves.<\/p> <p>Without high-quality training data, models are completely useless.<\/p> <p>From semi-libelous tweets to videos of cats and great works of art and literature that stand the test of time, nothing is off limits. Nothing. It\u2019s not just words either. Speech-to-text models need to be trained to respond to different speech patterns and accents. Emotions even.<\/p> <figure id=\"attachment_566373\" class=\"wp-caption aligncenter\" style=\"width: 800px\"><img decoding=\"async\" src=\"https:\/\/cdn.searchenginejournal.com\/wp-content\/uploads\/2026\/02\/retrieval-pt2-2-804.webp\"  width=\"800\" height=\"533\" class=\"wp-image-566373 size-full\" srcset=\"https:\/\/cdn.searchenginejournal.com\/wp-content\/uploads\/2026\/02\/retrieval-pt2-2-804-384x256.webp 384w, https:\/\/cdn.searchenginejournal.com\/wp-content\/uploads\/2026\/02\/retrieval-pt2-2-804-425x283.webp 425w, https:\/\/cdn.searchenginejournal.com\/wp-content\/uploads\/2026\/02\/retrieval-pt2-2-804-480x320.webp 480w, https:\/\/cdn.searchenginejournal.com\/wp-content\/uploads\/2026\/02\/retrieval-pt2-2-804-680x453.webp 680w, https:\/\/cdn.searchenginejournal.com\/wp-content\/uploads\/2026\/02\/retrieval-pt2-2-804-768x512.webp 768w, https:\/\/cdn.searchenginejournal.com\/wp-content\/uploads\/2026\/02\/retrieval-pt2-2-804.webp 800w\" sizes=\"auto, (max-width: 800px) 100vw, 800px\" loading=\"lazy\" title=\"Information Retrieval Part 2: How To Get Into Model Training Data\" alt=\"Information Retrieval Part 2: How To Get Into Model Training Data\" \/><figcaption class=\"wp-caption-text\">Image Credit: Harry Clarkson-Bennett<\/figcaption><\/figure> <h2>How Does It Work?<\/h2> <p>The models don\u2019t memorize, they compress. 
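<\/p> <p>A minimal, hedged sketch of that compression, using a bag-of-words vector. The vocabulary and counts here are invented; real models use learned embeddings rather than raw counts, but the idea of mapping text onto numbers is the same.<\/p>

```python
from collections import Counter

# Bag-of-words sketch: text is 'compressed' into a fixed-length numeric
# vector by counting occurrences of vocabulary words.
vocab = ['cat', 'dog', 'sat', 'mat']

def bag_of_words(text):
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]

print(bag_of_words('The cat sat on the mat'))  # [1, 0, 1, 1]
```

<p>Embeddings go a step further: instead of counting words, they place similar words near each other in the vector space, which is what preserves context and meaning.<\/p> <p>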
LLMs process billions of data points, adjusting internal weights through a mechanism known as backpropagation.<\/p> <p>If the next word predicted in a string of training examples is correct, it moves on. If not, it gets the machine equivalent of Pavlovian conditioning.<\/p> <p>Bopped on the head with a stick or a \u201cgood boy.\u201d<\/p> <p>The model is then able to vectorize. Creating a map of associations by term, phrase, and sentence.<\/p> <ul> <li>Converting text into numerical vectors, aka Bag of Words.<\/li> <li>Capturing semantic meaning of words and sentences, preserving wider context and meaning (word and sentence embeddings).<\/li> <\/ul> <p>Rules and nuances are encoded as a set of semantic relationships; this is known as parametric memory. \u201cKnowledge\u201d baked directly into the architecture. The more refined a model\u2019s knowledge on a topic, the less it has to use a form of grounding to verify its twaddle.<\/p> <blockquote> <p>Worth noting that models with a high parametric memory are faster at retrieving accurate information (if available), but have a static knowledge base and literally forget things.<\/p> <p>RAG and live web search are examples of a model using non-parametric memory. Infinite scale, but slower. Much better for news and for results that require grounding.<\/p> <\/blockquote> <h3>Crafting Better Quality Algorithms<\/h3> <p>When it comes to the training data, crafting better-quality algorithms relies on three elements:<\/p> <ol> <li>Quality.<\/li> <li>Quantity.<\/li> <li>Removal of bias.<\/li> <\/ol> <p><strong>Quality of data <\/strong>matters for obvious reasons. If you train a model on poorly labeled, solely synthetic data, its performance cannot be expected to mirror real-world problems or complexities.<\/p> <p><strong>Quantity of data <\/strong>is a problem, too. 
Mainly because these companies have eaten everything in sight and done a runner on the bill.<\/p> <p>Leveraging synthetic data to solve issues of scale isn\u2019t necessarily a problem. But the days of accessing high-quality, free-to-air content on the internet are largely gone for these guys. For two main reasons:<\/p> <ol> <li>Unless you want diabolical racism, mean comments, conspiracy theories, and plagiarized BS, I\u2019m not sure the internet is your guy anymore.<\/li> <li>That\u2019s if they respect companies\u2019 robots.txt directives at all. Eight in 10 of the world\u2019s biggest news websites now block AI training bots. I don\u2019t know how effective their CDN-level blocking is, but it makes quality training data harder to come by.<\/li> <\/ol> <p><strong>Bias and diversity<\/strong> (or lack of it) is a huge problem too. People have their own inherent biases. Even the ones building these models.<\/p> <p>Shocking, I know\u2026<\/p> <p>If models are fed data unfairly weighted towards certain characteristics or brands, it can reinforce societal issues. It can further entrench discrimination.<\/p> <blockquote> <p>Remember, LLMs are neither intelligent nor databases of facts. They analyze patterns from ingested data. Billions or trillions of numerical weights that determine the next word (token) following another in any given context.<\/p> <\/blockquote> <h2>How Is Training Data Collected?<\/h2> <p>Like every good SEO will tell you, it depends.<\/p> <ol> <li>If you built an AI model explicitly to identify pictures of dogs, you need pictures of dogs in every conceivable position. Every type of dog. Every emotion the pooch shows. You need to create or procure a dataset of millions, maybe billions, of canine images.<\/li> <li>Then it must be cleaned. Think of it as structuring data into a consistent format. In said dog scenario, maybe a feline friend nefariously added pictures of cats dressed up as dogs to mess you around. 
Those must be identified.<\/li> <li>Then labeled (for supervised learning). Data labeling (with some human annotation) ensures we have a sentient being somewhere in the loop. Hopefully, an expert to add relevant labels to a tiny portion of the data, so that a model can learn. For example, a dachshund sitting on a box looking melancholic.<\/li> <li>Pre-processing. Responding to issues like cats masquerading as dogs. Ensuring you minimize potential biases in the dataset, like specific dog breeds being mentioned far more frequently than others.<\/li> <li>Partitioned. A portion of the data is kept back so the model can\u2019t memorize the outputs. This is the final validation stage. Kind of like an exam the model hasn\u2019t seen.<\/li> <\/ol> <p>This is, obviously, expensive and time-consuming. It\u2019s not feasible to take up hundreds of thousands of hours of expertise from real people in fields that matter.<\/p> <p>Think of it this way. You\u2019ve just broken your arm, and you\u2019re waiting in the ER for six hours. You finally get seen, only to be told you had to wait because all the doctors have been processing images for OpenAI\u2019s new model.<\/p> <div> <p>\u201cYes sir, I know you\u2019re in excruciating pain, but I\u2019ve got a hell of a lot of sad looking dogs to label.\u201d<\/p> <\/div> <blockquote> <p>Data labeling is a time-consuming and tedious process. To combat this, many businesses hire large teams of human data annotators (aka humans in the loop, you know, actual experts), assisted by automated weak labeling models. In supervised learning, they handle the initial labeling.<\/p> <p>For perspective, one hour of video data can take humans up to 800 hours to annotate.<\/p> <\/blockquote> <h3>Micro Models<\/h3> <p>So, companies build <strong>micro-models<\/strong>. Models that don\u2019t require as much training or data to run. The humans in the loop (I\u2019m sure they have names) can start training micro-models after annotating a few examples.<\/p> <p>The models learn. 
They train themselves.<\/p> <p>So over time, human input decreases, and we\u2019re only needed to validate the outputs. And to make sure the models aren\u2019t trying to undress children, celebrities, and your coworkers on the internet.<\/p> <p>But who cares about that in the face of \u201cprogress.\u201d<\/p> <figure id=\"attachment_566374\" class=\"wp-caption aligncenter\" style=\"width: 800px\"><img decoding=\"async\" src=\"https:\/\/cdn.searchenginejournal.com\/wp-content\/uploads\/2026\/02\/retrieval-pt2-3-301.webp\"  width=\"800\" height=\"2000\" class=\"size-full wp-image-566374\" srcset=\"https:\/\/cdn.searchenginejournal.com\/wp-content\/uploads\/2026\/02\/retrieval-pt2-3-301-384x960.webp 384w, https:\/\/cdn.searchenginejournal.com\/wp-content\/uploads\/2026\/02\/retrieval-pt2-3-301-425x1063.webp 425w, https:\/\/cdn.searchenginejournal.com\/wp-content\/uploads\/2026\/02\/retrieval-pt2-3-301-480x1200.webp 480w, https:\/\/cdn.searchenginejournal.com\/wp-content\/uploads\/2026\/02\/retrieval-pt2-3-301-614x1536.webp 614w, https:\/\/cdn.searchenginejournal.com\/wp-content\/uploads\/2026\/02\/retrieval-pt2-3-301-680x1700.webp 680w, https:\/\/cdn.searchenginejournal.com\/wp-content\/uploads\/2026\/02\/retrieval-pt2-3-301-768x1920.webp 768w, https:\/\/cdn.searchenginejournal.com\/wp-content\/uploads\/2026\/02\/retrieval-pt2-3-301.webp 800w\" sizes=\"auto, (max-width: 800px) 100vw, 800px\" loading=\"lazy\" title=\"Information Retrieval Part 2: How To Get Into Model Training Data\" alt=\"Information Retrieval Part 2: How To Get Into Model Training Data\" \/><figcaption class=\"wp-caption-text\">Image Credit: Harry Clarkson-Bennett<\/figcaption><\/figure> <h2>Types Of Training Data<\/h2> <p>Training data is usually categorized by how much guidance is provided or required (supervision) and the role it plays in the model\u2019s lifecycle (function).<\/p> <p>Ideally, a model is largely trained on real data.<\/p> <p>Once a model is ready, it can 
be trained and fine-tuned on synthetic data. But synthetic data alone is unlikely to create high-quality models.<\/p> <ul> <li><strong>Supervised (or labeled):<\/strong> Where every input is annotated with the \u201cright\u201d answer.<\/li> <li><strong>Unsupervised (or unlabeled):<\/strong> Work it out yourself, robots, I\u2019m off for a beer.<\/li> <li><strong>Semi-supervised:<\/strong> Where a small amount of the data is properly labeled and the model \u201cunderstands\u201d the rules. Meanwhile, I\u2019ll have a beer in the office.<\/li> <li><strong>RLHF (Reinforcement Learning from Human Feedback):<\/strong> Humans are shown two options and asked to pick the \u201cright\u201d one (preference data). Or a person demonstrates the task at hand for the model to imitate (demonstration data).<\/li> <li><strong>Pre-training and fine-tuning data:<\/strong> Massive datasets allow for broad information acquisition, and fine-tuning is used to turn the model into a category expert.<\/li> <li><strong>Multi-modal:<\/strong> Images, videos, text, etc.<\/li> <\/ul> <p>Then there\u2019s what\u2019s known as edge case data. Data designed to \u201ctrick\u201d the model to make it more robust.<\/p> <blockquote> <p>In light of the, let\u2019s call it, \u201cburgeoning\u201d market for AI training data, there are obvious issues of \u201cfair use\u201d surrounding it.<\/p> <p><em>\u201cWe find that 23% of supervised training datasets are published under research or non-commercial licenses.\u201d<\/em><\/p> <p>So pay people.<\/p> <\/blockquote> <h3>The Spectrum Of Supervision<\/h3> <p>In supervised learning, the AI algorithm is given labeled data. These labels define the outputs and are fundamental to the algorithm being able to improve over time on its own.<\/p> <p>Let\u2019s say you\u2019re training a model to identify colors. There are dozens of shades of each color. Hundreds even. So while this is an easy example, it requires accurate labeling. 
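<\/p> <p>A hedged sketch of what that supervised, labeled data looks like for the color example. The RGB values and labels are invented, and a trivial nearest-neighbor lookup stands in for the actual learning step.<\/p>

```python
# Each supervised training example pairs an input with a human-provided label.
labeled_data = [
    ((255, 0, 0), 'red'),
    ((200, 30, 30), 'red'),
    ((0, 0, 255), 'blue'),
    ((30, 30, 200), 'blue'),
]

def classify(rgb):
    # Nearest-neighbor stand-in for a trained model: find the labeled
    # example closest to the input and return its label.
    def dist(example):
        return sum((a - b) ** 2 for a, b in zip(rgb, example[0]))
    return min(labeled_data, key=dist)[1]

print(classify((240, 10, 10)))  # the closest labeled examples are 'red'
```

<p>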
The problem with accurate labeling is that it\u2019s time-consuming and potentially costly.<\/p> <p>In unsupervised learning, the AI model is given unlabeled data. You chuck millions of rows, images, or videos at a machine, sit down for a coffee, and then kick it when it hasn\u2019t worked out what to do.<\/p> <p>It allows for more exploratory \u201cpattern recognition.\u201d Not learning.<\/p> <p>While this approach has obvious drawbacks, it\u2019s incredibly useful at identifying patterns a human might miss. The model can essentially define its own labels and pathway.<\/p> <p>Models can and do train themselves, and they will find things a human never could. They\u2019ll also miss things. It\u2019s like a driverless car. Driverless cars may have fewer accidents than when a human is in the loop. But when they do crash, we find it far more unpalatable.<\/p> <figure id=\"attachment_566375\" class=\"wp-caption aligncenter\" style=\"width: 1013px\"><img decoding=\"async\" src=\"https:\/\/cdn.searchenginejournal.com\/wp-content\/uploads\/2026\/02\/retrieval-pt2-4-495.webp\"  width=\"1013\" height=\"343\" class=\"size-full wp-image-566375\" srcset=\"https:\/\/cdn.searchenginejournal.com\/wp-content\/uploads\/2026\/02\/retrieval-pt2-4-495-384x130.webp 384w, https:\/\/cdn.searchenginejournal.com\/wp-content\/uploads\/2026\/02\/retrieval-pt2-4-495-425x144.webp 425w, https:\/\/cdn.searchenginejournal.com\/wp-content\/uploads\/2026\/02\/retrieval-pt2-4-495-480x163.webp 480w, https:\/\/cdn.searchenginejournal.com\/wp-content\/uploads\/2026\/02\/retrieval-pt2-4-495-680x230.webp 680w, https:\/\/cdn.searchenginejournal.com\/wp-content\/uploads\/2026\/02\/retrieval-pt2-4-495-768x260.webp 768w, https:\/\/cdn.searchenginejournal.com\/wp-content\/uploads\/2026\/02\/retrieval-pt2-4-495-850x288.webp 850w, https:\/\/cdn.searchenginejournal.com\/wp-content\/uploads\/2026\/02\/retrieval-pt2-4-495.webp 1013w\" sizes=\"auto, (max-width: 1013px) 100vw, 1013px\" loading=\"lazy\" title=\"Information Retrieval Part 2: How To Get Into Model Training Data\" alt=\"Information Retrieval Part 2: How To Get Into Model Training Data\" \/><figcaption class=\"wp-caption-text\">We don\u2019t trust tech autonomy. (Image Credit: Harry Clarkson-Bennett)<\/figcaption><\/figure> <p>It\u2019s the technology that scares us. And rightly so.<\/p> <h2>Combatting Bias<\/h2> <p>Bias in training data is very real and potentially very damaging. There are three phases:<\/p> <ol> <li>Origin bias.<\/li> <li>Development bias.<\/li> <li>Deployment bias.<\/li> <\/ol> <p><strong>Origin bias<\/strong> references the validity and fairness of the dataset. Is the data all-encompassing? Is there any obvious systemic, implicit, or confirmation bias present?<\/p> <p><strong>Development bias<\/strong> includes the features or tenets of the data the model is being trained on. Does algorithmic bias occur <em>because of<\/em> the training data?<\/p> <p>Then we have <strong>deployment bias<\/strong>. Where the evaluation and processing of the data leads to flawed outputs and automated\/feedback loop bias.<\/p> <p>You can really see why we need a human in the loop. And why AI models training on synthetic or inappropriately chosen data would be a disaster.<\/p> <blockquote> <p>In healthcare, data collection activities influenced by human bias can lead to the training of algorithms that replicate historical inequalities. Yikes.<\/p> <p>Leading to a pretty bleak cycle of reinforcement.<\/p> <\/blockquote> <h2>The Most Frequently Used Training Data Sources<\/h2> <p>Training data sources are wide-ranging in both quality and structure. You\u2019ve got the open web, which is obviously a bit mental. X, if you want to train something to be racist. 
Reddit, if you\u2019re looking for the Incel Bot 5000.<\/p> <p>Or highly structured academic and literary repositories if you want to build something, you know, good \u2026 Obviously, then you have to pay for it.<\/p> <h3>Common Crawl<\/h3> <p>Common Crawl is a public web repository, a free, open-source storehouse of historical and current web crawl data available to pretty much anyone on the internet.<\/p> <p>The full Common Crawl Web Graph currently contains around 607 million domain records across all datasets, with each monthly release covering 94 to 163 million domains.<\/p> <p>In the Mozilla Foundation\u2019s 2024 report, Training Data for the Price of a Sandwich, 64% of the 47 LLMs analyzed used at least one filtered version of Common Crawl data.<\/p> <blockquote> <p>If you aren\u2019t in the training data, you\u2019re very unlikely to be cited and referenced. The Common Crawl Index Server lets you search any URL pattern against their crawl archives, and Metehan\u2019s Web Graph helps you see how \u201ccentered\u201d you are.<\/p> <\/blockquote> <h3>Wikipedia (And Wikidata)<\/h3> <p>The default English Wikipedia dataset contains 19.88 GB of complete articles that help with language modeling tasks. And Wikidata is an enormous, incredibly comprehensive knowledge graph. Immensely structured data.<\/p> <p>While representing only a small percentage of the total tokens, Wikipedia is perhaps the most influential source for entity resolution and factual consensus. It is one of the most factually accurate, up-to-date, and well-structured repositories of content in existence.<\/p> <p>Some of the biggest players have signed deals with Wikipedia.<\/p> <h3>Publishers<\/h3> <p>OpenAI, Google, and others have multi-million-dollar licensing deals with a number of publishers.<\/p> <p>The list goes on, but only for a bit \u2026 and not recently. I\u2019ve heard things have clammed shut. 
Which, given the state of their finances, may not be surprising.<\/p> <h3>Media &amp; Libraries<\/h3> <p>This is mainly for multi-modal content training. Shutterstock (images\/video) has licensing deals in place, Getty Images has one with Perplexity, and Disney (a 2026 partner for the Sora video platform) provides the visual grounding for multi-modal models.<\/p> <blockquote> <p>As part of this three-year licensing agreement with Disney, Sora will be able to generate short, user-prompted social videos based on Disney characters.<\/p> <p>As part of the agreement, Disney will make a $1 billion equity investment in OpenAI, and receive warrants to purchase additional equity.<\/p> <\/blockquote> <h3>Books<\/h3> <p>BookCorpus turned scraped data from 11,000 unpublished books into a 985 million-word dataset.<\/p> <blockquote> <p>We cannot write books fast enough for models to continually learn on. It\u2019s part of the looming model collapse.<\/p> <\/blockquote> <h3>Code Repositories<\/h3> <p>Coding has become one of the most influential and valuable features of LLMs. Purpose-built coding tools like Cursor and Claude Code are incredible. GitHub and Stack Overflow data have built these models.<\/p> <p>They\u2019ve built the vibe-engineering revolution.<\/p> <h3>Public Web Data<\/h3> <p>Diverse (but relevant) web data results in faster convergence during training, which in turn reduces computational requirements. It\u2019s dynamic. Ever-changing. But, unfortunately, a bit nuts and messy.<\/p> <p>But, if you need vast swathes of data, maybe in real-time, then public web data is the way forward. Ditto for real opinions and reviews of products and services. Public web data, review platforms, UGC, and social media sites are great.<\/p> <h2>Why Models Aren\u2019t Getting (Much) Better<\/h2> <p>While there\u2019s no shortage of data in the world, most of it is unlabeled and, thus, can\u2019t actually be used in supervised machine learning models. 
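<\/p> <p>On the blocking theme this section raises: whether a site disallows an AI training crawler can be checked with the Python standard library. A minimal sketch follows; the robots.txt rules here are invented, though GPTBot is a real crawler user agent that many publishers block.<\/p>

```python
from urllib import robotparser

# Parse an invented robots.txt that blocks an AI training bot but
# allows everyone else - the pattern many publishers now use.
rp = robotparser.RobotFileParser()
rp.parse([
    'User-agent: GPTBot',
    'Disallow: /',
    '',
    'User-agent: *',
    'Allow: /',
])

print(rp.can_fetch('GPTBot', 'https://example.com/article'))     # False
print(rp.can_fetch('Googlebot', 'https://example.com/article'))  # True
```

<p>As the article notes elsewhere, robots.txt is a directive, not an enforcement mechanism: compliant crawlers honor it, others may not.<\/p> <p>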
Every incorrect label has a negative impact on a model\u2019s performance.<\/p> <p>According to most estimates, we\u2019re only a few years away from running out of quality data. Inevitably, this will lead to a time when those genAI tools start consuming their own garbage.<\/p> <p>This is a known problem that will cause model collapse.<\/p> <ul> <li>They are being blocked by companies that do not want their data used pro bono to train the models.<\/li> <li>Robots.txt protocols (a directive, not something directly enforceable), CDN-level blocking, and terms of service pages have been updated to tell these guys to get lost.<\/li> <li>They consume data quicker than we can produce it.<\/li> <\/ul> <p>Frankly, as more publishers and websites are forced into paywalling (a smart business decision), the quality of these models only gets worse.<\/p> <h2>So, How Do You Get In The Training Data?<\/h2> <p>There are two obvious approaches.<\/p> <ol> <li>To identify the seed datasets of models that matter and find ways into them.<\/li> <li>To forgo the specifics and just do great SEO and wider marketing. Make a tangible impact in your industry.<\/li> <\/ol> <p>I can see pros and cons to both. Finding ways into specific models is probably unnecessary for most brands. To me, this smells more like grey hat SEO. Most brands will be better off just doing some really good marketing and getting shared, cited, and, you know, talked about.<\/p> <p>These models are not trained on up-to-date data. This is important because you cannot retroactively get into a specific model\u2019s training data. 
You have to plan ahead.<\/p> <p>If you\u2019re an individual, you should be:<\/p> <ul> <li>Creating and sharing content.<\/li> <li>Going on podcasts.<\/li> <li>Attending industry events.<\/li> <li>Sharing other people\u2019s content.<\/li> <li>Doing webinars.<\/li> <li>Getting yourself in front of relevant publishers, publications, and people.<\/li> <\/ul> <p>There are some pretty obvious sources of highly structured data that models have paid for in recent times. I know, <em>they\u2019ve actually paid for it<\/em>. I don\u2019t know what the guys at Reddit and Wikipedia had to do to get money from these guys, and maybe I don\u2019t want to.<\/p> <h3>How Can I Tell What Datasets Models Use?<\/h3> <p>Everyone has become a lot more closed off about what they do and don\u2019t use for training data. I suspect this is both legally and financially motivated. So, you\u2019ll need to do some digging.<\/p> <p>And there are some massive \u201copen source\u201d datasets I suspect they all use:<\/p> <ul> <li>Common Crawl.<\/li> <li>Wikipedia.<\/li> <li>Wikidata.<\/li> <li>Coding repositories.<\/li> <\/ul> <p>Fortunately, most deals are public, and it\u2019s safe to assume that models use data from these platforms.<\/p> <p>Google has a partnership with Reddit and access to an insane amount of transcripts from YouTube. They almost certainly have more valuable, well-structured data at their fingertips than any other company.<\/p> <p>Grok trained almost exclusively on real-time data from X. Hence why it acts the way it does and undresses everyone.<\/p> <blockquote> <p>Worth noting that AI companies use third-party vendors. Factories where data is scraped, cleaned, and structured to create supervised datasets. Scale AI is the data engine that the big players use. Bright Data specializes in web data collection.<\/p> <\/blockquote> <h2>A Checklist<\/h2> <p>OK, so we\u2019re trying to feature in parametric memory. 
To appear in an LLM\u2019s training data so the model recognizes you, and so you\u2019re more likely to be used for RAG\/retrieval. That means we need to:<\/p> <ol> <li>Manage the multi-bot ecosystem of training, indexing, and browsing.<\/li> <li>Optimize your entities. Well-structured, well-connected content, consistent NAPs, sameAs schema properties, and Knowledge Graph presence, in both Google and Wikidata.<\/li> <li>Make sure your content is rendered on the server side. Google has become very adept at rendering content on the client side, but bots like GPTBot only see the initial HTML response. JavaScript is still clunky.<\/li> <li>Publish well-structured, machine-readable content in relevant formats. Tables, lists, properly structured semantic HTML.<\/li> <li>Get. Yourself. Out. There. Share your stuff. Make noise.<\/li> <li>Be ultra, ultra clear on your website about who you are. Answer the relevant questions. Own your entities.<\/li> <\/ol> <p>You have to balance direct associations (what <em>you say<\/em>) with semantic associations (what <em>others say<\/em> about you). Make your brand the obvious next word.<\/p> <p>Modern SEO, with better marketing.<\/p> <hr\/> <p><em>Read Leadership In SEO, subscribe now.<\/em><\/p> <hr\/> <p><em>Featured Image: Collagery\/Shutterstock<\/em><\/p> <\/div> ","protected":false},"excerpt":{"rendered":"<p>There has never been a more important time in your career to invest in learning and understanding. Not because AI search differs drastically from traditional search. But because everyone else thinks it does. Every C-suite in the country is desperate to get this right. 
Decision-makers need to feel confident that you and I are the [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":2815,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[16],"tags":[450,6657,411,3800,88,86],"class_list":["post-2814","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-accessibility","tag-data","tag-information","tag-model","tag-part","tag-retrieval","tag-training"],"acf":[],"_links":{"self":[{"href":"http:\/\/longzhuplatform.com\/index.php?rest_route=\/wp\/v2\/posts\/2814","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/longzhuplatform.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/longzhuplatform.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/longzhuplatform.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/longzhuplatform.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=2814"}],"version-history":[{"count":0,"href":"http:\/\/longzhuplatform.com\/index.php?rest_route=\/wp\/v2\/posts\/2814\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"http:\/\/longzhuplatform.com\/index.php?rest_route=\/wp\/v2\/media\/2815"}],"wp:attachment":[{"href":"http:\/\/longzhuplatform.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=2814"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/longzhuplatform.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=2814"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/longzhuplatform.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=2814"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}