How AI Reads, Chunks and Cites the Web

Search Digest Issue #003 — week of 3 Mar 2026, AI content parsing and citation pipelines

This week gave us an unusually clear window into the mechanics of how AI systems actually process and use content. Chris Long's breakdown of Google's Vertex AI documentation is the kind of primary-source find that belongs on every technical SEO's reading list. The 500-token chunk size, the three parser types, the layout-aware chunking — this is not speculation about how AI ingests content; it is Google's own documentation laying it out.

Jason Barnard's 10-gate pipeline piece sits alongside it as the clearest framework I have seen for understanding why some brands appear reliably in AI recommendations and others do not. The cascading confidence model — where 90% confidence at every gate yields only 35% end-to-end, and slipping to 80% at each gate cuts that to 11% — changes how you think about the whole optimisation problem.

The WebMCP story is the one that could define the next phase of technical SEO, and most people are not paying attention to it yet. The Google AI Mode self-citation study is useful and slightly uncomfortable. Nearly one in five citations in AI Mode answers now points back to Google itself. The implications for anyone who thought AI search would be a more open playing field are worth sitting with.

Chris Long / LinkedIn

Google's Vertex AI Documentation Reveals How AI Systems Parse and Chunk Your Content

Chris Long, co-founder of Nectiv, surfaced a gem this week: Google's Vertex AI Search documentation includes a detailed walkthrough of how the system understands and processes documents. This is primary-source material from Google about how AI content ingestion actually works, and it confirms several things that SEOs have been inferring from indirect evidence.

The 500-token default chunk size is significant. It translates to roughly 350-400 words, which closely matches the research from Dan Petrovic on optimal content segment length. The layout-aware chunking setting is even more interesting — when enabled, the system ensures it does not split tables or bullet lists mid-element, keeping the chunk semantically coherent. That is a direct argument for using proper HTML structure rather than wall-to-wall prose.
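To make the layout-aware idea concrete, here is a minimal sketch of chunking that treats block elements (paragraphs, lists, tables) as indivisible. This is a hypothetical illustration, not Google's implementation; the 0.75 words-per-token figure is a rough rule of thumb for English text.

```python
# Hypothetical sketch of layout-aware chunking: whole blocks are
# packed into ~500-token chunks and never split mid-element.
# Illustration only — NOT Google's actual implementation.

CHUNK_BUDGET_TOKENS = 500
WORDS_PER_TOKEN = 0.75  # rough rule of thumb for English text

def estimate_tokens(text: str) -> int:
    """Approximate token count from word count."""
    return round(len(text.split()) / WORDS_PER_TOKEN)

def chunk_blocks(blocks: list[str]) -> list[list[str]]:
    """Pack whole blocks into chunks of at most ~500 tokens.

    A block that alone exceeds the budget becomes its own chunk,
    mirroring the idea that a table or list is kept intact.
    """
    chunks, current, current_tokens = [], [], 0
    for block in blocks:
        tokens = estimate_tokens(block)
        if current and current_tokens + tokens > CHUNK_BUDGET_TOKENS:
            chunks.append(current)
            current, current_tokens = [], 0
        current.append(block)
        current_tokens += tokens
    if current:
        chunks.append(current)
    return chunks

# Three ~200-word blocks: each lands in its own chunk.
blocks = ["word " * 200] * 3
print([len(c) for c in chunk_blocks(blocks)])  # [1, 1, 1]
```

The practical consequence is the same one the documentation implies: a 900-word wall of prose gets cut at an arbitrary point, while well-structured blocks keep their boundaries intact.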

The boilerplate exclusion feature is the detail I found most actionable. The documentation references directives like "excludeHtmlElements" and "excludeHtmlClasses" that let you filter out navigation, footers, and other non-content elements from what the AI reads. It is confirmation that Google's AI layer is not treating every word on a page equally — and that how you structure your HTML genuinely affects what gets ingested.
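For a sense of what this looks like in practice, here is an illustrative parsing configuration. The excludeHtmlElements and excludeHtmlClasses directives and the 500-token chunk size come from the documentation Chris surfaced; the surrounding request structure and the specific element/class values are simplified assumptions, so check the current Vertex AI Search docs before using this shape.

```python
# Illustrative Vertex AI Search document-processing config.
# excludeHtmlElements / excludeHtmlClasses are directives named in
# Google's documentation; the surrounding structure and the example
# values here are simplified assumptions, not a verified request body.
parsing_config = {
    "defaultParsingConfig": {
        "layoutParsingConfig": {
            # Strip boilerplate before the content is chunked
            "excludeHtmlElements": ["nav", "footer", "aside"],
            "excludeHtmlClasses": ["cookie-banner", "sidebar"],
        }
    },
    "chunkingConfig": {
        "layoutBasedChunkingConfig": {
            "chunkSize": 500,  # tokens — the documented default
        }
    },
}
```

The point is less the exact field names than the model they imply: what sits inside a nav or a repeated footer simply never reaches the chunker.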

Key points

  • Three parser types in Vertex AI: Digital Parser for machine-readable text, OCR Parser for PDFs, and Layout Parser for structured content with headings, tables, and lists
  • Default chunk size is 500 tokens — approximately 350-400 words — consistent with independent research on optimal content segment length
  • Layout-aware chunking prevents the system from splitting tables or lists mid-element, preserving semantic context within each chunk
  • Boilerplate exclusion can filter out navigation, footer, and repeated UI elements — meaning structural HTML directly affects what content gets processed
  • Content is read in context windows of chunks, not as a complete page — which explains why position within the document affects citation probability

Key takeaway

Structure your content for chunk boundaries, not just for readers. Each section of approximately 350-400 words should be able to stand alone and answer a specific question. If a chunk gets pulled into an AI context window in isolation, it should still make sense and be attributable to your brand.

Also worth considering

The layout parser confirms that semantic HTML — proper heading hierarchy, lists in list elements, tables with headers — is not just accessibility best practice. It directly affects how AI systems segment and understand your content. Pages built with div soup and inline styling are harder for AI to parse correctly, and harder to chunk coherently.

What I'm testing

Reviewing several long-form pages to identify where natural chunk boundaries fall and whether the content at each boundary is self-contained enough to earn a citation in isolation. Adding explicit subheadings at 350-word intervals where they are currently missing.
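The audit above can be roughed out with a few lines of code — a hypothetical helper I am describing, not a tool from the article — that flags any section between subheadings exceeding the ~400-word chunk target:

```python
# Hypothetical helper for the audit described above: flag sections
# (keyed by their subheading) whose body exceeds the ~350-400 word
# target and therefore likely spans multiple chunks.
def audit_sections(sections: dict[str, str], max_words: int = 400) -> list[str]:
    """Return headings whose section body likely spans multiple chunks."""
    return [
        heading for heading, body in sections.items()
        if len(body.split()) > max_words
    ]

page = {
    "What is chunking?": "word " * 250,        # fits in one chunk
    "Implementation details": "word " * 900,   # needs new subheadings
}
print(audit_sections(page))  # ['Implementation details']
```

Anything the audit flags is a candidate for an explicit subheading at roughly 350-word intervals.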

Read the full post

Jason Barnard / Search Engine Land

The AI Engine Pipeline: 10 Gates That Decide Whether You Win the Recommendation

This is the clearest framework I have seen for understanding why AI recommendations are so inconsistent across brands. Barnard's argument is that every piece of content passes through 10 gates — Discovered, Selected, Crawled, Rendered, Indexed, Annotated, Recruited, Grounded, Displayed, Won — before it becomes an AI recommendation. Drop significantly at any one of them and you fail, regardless of how well you perform at the others.

The cascading confidence model is the part that makes this more than just a taxonomy. If you achieve 90% confidence at each of ten gates, your end-to-end success rate is only 35%. Drop one gate to 50% — say, rendering fails due to heavy JavaScript — and the total drops to 19% even if every other gate is near-perfect. The compounding effect means that SEOs who focus exclusively on content quality and ignore technical rendering, entity recognition, or crawl selection are solving a small part of a much larger problem.
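The arithmetic behind the cascade is just multiplication across gates, which makes the compounding easy to verify:

```python
# The compounding arithmetic behind the cascading confidence model:
# end-to-end success is the product of the pass rates at all ten gates.
from math import prod

def end_to_end(gate_confidences: list[float]) -> float:
    """Probability of surviving every gate in the pipeline."""
    return prod(gate_confidences)

uniform = end_to_end([0.9] * 10)          # 90% at every gate
one_weak = end_to_end([0.9] * 9 + [0.5])  # one gate (e.g. Rendered) at 50%

print(f"{uniform:.0%}")   # 35%
print(f"{one_weak:.0%}")  # 19%
```

The same function also shows why the weakest gate is the highest-leverage fix: raising the 50% gate back to 90% nearly doubles the end-to-end rate, while polishing an already-strong gate barely moves it.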

Key points

  • Ten gates in the pipeline: Discovered, Selected, Crawled, Rendered, Indexed, Annotated, Recruited, Grounded, Displayed, Won
  • Confidence compounds across gates — 90% at each gives only 35% end-to-end; one weak gate at 50% drops the total to 19%
  • Most AI tracking tools measure at the Display stage — but by then, all the decisions have already been made upstream
  • The "nested audience model" — bot, then algorithm, then person — means every optimisation should start by asking which of the three audiences it is serving
  • Brands that appear inconsistently in AI answers typically have a specific gate failing, not a general visibility problem

Key takeaway

Before adding more content or schema markup, audit which gate in the pipeline you are most likely failing at. If your content is not rendering correctly due to JavaScript, fixing that single gate may move results more than six months of content production. The weakest gate is always your highest-impact fix.

Also worth considering

The distinction between measuring at Display and understanding what happened upstream is genuinely important. Most visibility tools tell you whether you appeared in an AI answer. They do not tell you why you did not appear, or at which gate you were filtered out. Building a way to audit the earlier gates — crawl, render, entity annotation — is where the diagnostic value actually lives.

Read the full article

Search Engine Land

WebMCP Explained: Inside Chrome 146's Agent-Ready Web Preview

This is the story most SEOs are not paying attention to and probably should be. WebMCP is a proposed web standard that lets websites explicitly publish a structured list of actions that AI agents can take on them — search, book, buy, compare — rather than forcing agents to reverse-engineer the page by guessing what each button does. Chrome 146 launched a behind-a-flag preview of it, and the implications for how AI agents interact with websites are substantial.

The problem it solves is real. Right now, an AI agent trying to book a flight on your site has to identify the right input fields, guess the correct data format, and hope nothing breaks. WebMCP lets you declare "here is the bookFlight tool, here are its parameters, here is how to call it." The agent can then execute the action reliably. Standard HTML forms can become agent-compatible with a few attribute additions — no separate API required.
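As a sketch of what that declaration might look like in markup: the attribute names below (toolname, tooldescription, toolautosubmit) are the ones the article reports, but this is illustrative only — the final WebMCP syntax may change as the proposal moves through standardisation.

```html
<!-- Illustrative only: attribute names follow the article's report;
     the final WebMCP syntax may differ as the proposal evolves. -->
<form action="/flights/book" method="post"
      toolname="bookFlight"
      tooldescription="Book a flight given origin, destination and date"
      toolautosubmit="true">
  <input name="origin" type="text" required>
  <input name="destination" type="text" required>
  <input name="date" type="date" required>
  <button type="submit">Book</button>
</form>
```

The notable design choice is that the form itself is the tool contract: the agent reads the declared name, description, and parameters instead of inferring them from labels and button text.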

Dan Petrovic called it the biggest shift in technical SEO since structured data. I think that is right directionally even if the timeline is uncertain. If AI agents are going to take actions on behalf of users rather than just answering questions, the websites that define their actions clearly will be the ones those agents choose.

Key points

  • WebMCP lets websites publish structured "Tool Contracts" — explicit declarations of what actions AI agents can take and how to execute them
  • Chrome 146 launched a behind-a-flag preview, running on a new browser API called navigator.modelContext
  • Standard HTML forms can become agent-compatible by adding toolname, tooldescription, and toolautosubmit attributes — no separate API build required
  • Addresses the core problem with current agent-website interaction: agents reverse-engineer pages by guessing, WebMCP removes the guessing
  • If an agent cannot find or verify a structured action on your site, it will route to a competitor that has defined one clearly

Key takeaway

Watch this closely but do not panic-build yet. WebMCP is in preview behind a flag, and it is likely to take 12-18 months to reach a level of browser support that makes implementation urgent. But understanding it now means you will be ready to move quickly when it matters rather than catching up under pressure.

Also worth considering

This is structured data for the agentic web. Just as schema.org markup helps search engines understand what your content is about, WebMCP helps AI agents understand what your site can do. The sites that defined schema early had an advantage. The same pattern is likely to repeat here.

Read the full article

Danny Goodwin / Search Engine Land

Google's AI Mode Is Citing Google More Than Any Other Site: Study

SE Ranking analysed 68,313 keywords across 20 industries and over 1.3 million AI Mode citations to find that Google.com now accounts for 17.42% of all citations in AI Mode answers — nearly one in five. In June 2025 that figure was 5.7%, so it has roughly tripled in nine months. Google is not just the host of AI Mode; it is now the top-cited source within it, by a significant margin.

The six next-most-cited sources — YouTube, Facebook, Reddit, Amazon, Indeed, and Zillow — do not collectively match Google's self-citation share. Including YouTube, Google-controlled properties account for roughly 20% of all AI Mode citations. For travel queries specifically, Google self-citation hits 53%, almost entirely routing users to another Google search rather than to an external publisher.

This is worth paying attention to beyond the obvious "Google favours Google" observation. It tells you something about how AI Mode is being designed: as a surface that keeps users within Google's ecosystem rather than sending them outward. The implications for publishers who assumed AI search would democratise citation opportunities are uncomfortable.

Key points

  • Google.com now accounts for 17.42% of all AI Mode citations — up from 5.7% in June 2025, a near tripling in nine months
  • Google-controlled properties (Google.com + YouTube) represent roughly 20% of all AI Mode citations
  • Travel queries show 53% Google self-citation — almost entirely routing users to another Google Search rather than to external publishers
  • The only category where Google is not the top citation source is careers and jobs, where Indeed dominates
  • The pattern suggests AI Mode is being designed to retain users within the Google ecosystem, not to distribute traffic outward

Key takeaway

If your current GEO strategy assumes AI Mode will be a meaningful traffic referrer for informational and research-stage queries, this data should prompt a reassessment. AI Mode is increasingly keeping users in Google. The more valuable opportunity may be in the commercial and transactional queries where Google's self-citation rate is lower.

Also worth considering

Being cited in AI Mode does not necessarily mean receiving a click. The traffic model for AI search looks increasingly like brand exposure and consideration-stage influence rather than direct click-through. Measuring AI visibility purely through traffic referrals will undercount its commercial value — and overcount the opportunity in certain query categories.

Read the full article

Search Engine Journal / SISTRIX

Google AI Overviews Cut Germany's Top Organic CTR by 59%: SISTRIX Analysis of 100M Keywords

SISTRIX founder Johannes Beus published what is probably the largest dataset analysis of AI Overview click impact to date: 100 million keywords in the German market, one year after AI Overviews became broadly active there. The headline finding — position-one click rate drops from 27% to 11% when an AI Overview is present — is a 59% relative decline. For sites that depend on position-one rankings, that is a serious number.

The total monthly click loss across the German market is estimated at 265 million organic clicks. That sounds enormous but works out to a 6.6% average loss across all keywords, because AI Overviews only appear on roughly 20% of queries. The impact is concentrated. Informational queries take the biggest hits; transactional searches are largely spared because AI cannot replace the action of actually purchasing something.
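The headline figures are internally consistent, which is worth checking. The 33% figure below is a back-of-envelope implication I derived from the reported numbers, not a figure from the study itself:

```python
# Sanity-checking the SISTRIX figures reported above.
ctr_without_aio, ctr_with_aio = 0.27, 0.11
relative_decline = (ctr_without_aio - ctr_with_aio) / ctr_without_aio
print(f"{relative_decline:.0%}")  # 59% — the headline decline

# If ~20% of queries show an AI Overview and the average loss across
# ALL keywords is 6.6%, then affected queries lose ~33% of clicks on
# average (derived here, not a figure from the study).
aio_share, avg_loss_all = 0.20, 0.066
loss_on_affected = avg_loss_all / aio_share
print(f"{loss_on_affected:.0%}")  # 33%
```

That derived ~33% average loss on affected queries is the number to carry into forecasting: the damage is concentrated, not spread thin.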

The industry-level breakdown is where the data gets useful for planning. Parenting and baby content sites lost over 24% of organic clicks. Health information sites lost significantly above average. Recipe sites like Chefkoch lost approximately 1%. The pattern is consistent: the more your traffic comes from queries where a summary can satisfy the user's need, the more exposure you have to AI Overview displacement.

Key points

  • Position-one CTR drops from 27% to 11% when an AI Overview is present — a 59% relative decline
  • AI Overviews appear on approximately 20% of German queries, giving a weighted average click loss of 6.6% across all keywords
  • Informational queries are hit hardest; transactional queries see minimal AI Overview impact
  • Industry variation is large — parenting content sites lost over 24% of clicks; recipe sites lost approximately 1%
  • Ahrefs and Seer Interactive found similar figures globally: 58-61% CTR reduction at position one for affected queries

Key takeaway

Segment your keyword portfolio by whether AI Overviews currently appear for those queries. The click loss is real but it is not evenly distributed. Doubling down on transactional and commercial intent content — where AI Overviews are rare and position-one CTR is largely intact — is the most direct response to this data.

Also worth considering

Being cited in an AI Overview that reduces clicks might still be valuable if it drives brand recognition in the zero-click moment. The brands that appear as the cited source in an AI Overview summary are building top-of-funnel awareness they cannot easily measure but also cannot afford to ignore. Track both citation frequency and click-through separately.

Read the full article

That is issue #003. The theme this week is mechanics — how AI systems actually read, segment, and decide what to cite. The Vertex AI documentation and Barnard's pipeline framework together give you a more complete picture of the process than most GEO advice starts from. Most optimisation guidance talks about what to write. This week's reading is about how the machine reads what you wrote.

If any of this changed how you are approaching content structure or your technical setup, I would be interested to hear what you are doing differently.

Free Consultation

Let's Talk


Tell me what you're working on. I'll give you an honest assessment and we'll explore if working together makes sense — no hard sell, just a free, no-obligation call.