Chris Long / LinkedIn
Google's Vertex AI Documentation Reveals How AI Systems Parse and Chunk Your Content
Chris Long, co-founder of Nectiv, surfaced a gem this week: Google's Vertex AI Search documentation includes a detailed walkthrough of how the system understands and processes documents. This is primary-source material from Google about how AI content ingestion actually works, and it confirms several things that SEOs have been inferring from indirect evidence.
The 500-token default chunk size is significant. It translates to roughly 350-400 words, which closely matches the research from Dan Petrovic on optimal content segment length. The layout-aware chunking setting is even more interesting — when enabled, the system ensures it does not split tables or bullet lists mid-element, keeping the chunk semantically coherent. That is a direct argument for using proper HTML structure rather than wall-to-wall prose.
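For reference, the settings described above map to a data store's document processing configuration in the Discovery Engine API. The sketch below shows roughly what that looks like; field names are from my reading of the API reference, not from Chris Long's post, so verify them against the current docs before relying on them:

```python
# Sketch of a Vertex AI Search (Discovery Engine) documentProcessingConfig.
# Field names follow the REST API as I understand it; verify before use.
chunking_config = {
    "chunkingConfig": {
        "layoutBasedChunkingConfig": {
            "chunkSize": 500,                 # tokens per chunk; 500 is the documented default
            "includeAncestorHeadings": True,  # prepend section headings so chunks keep context
        }
    },
    "defaultParsingConfig": {
        # Layout-aware chunking requires the Layout Parser
        "layoutParsingConfig": {}
    },
}
```

Note `includeAncestorHeadings`: if it behaves as the name suggests, it is a direct answer to the attribution problem, since each chunk carries its parent headings with it.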
The boilerplate exclusion feature is the detail I found most actionable. The documentation references directives like "excludeHtmlElements" and "excludeHtmlClasses" that let you filter out navigation, footers, and other non-content elements from what the AI reads. It is confirmation that Google's AI layer is not treating every word on a page equally — and that how you structure your HTML genuinely affects what gets ingested.
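A minimal sketch of those exclusion directives as a parsing config, assuming the list-of-strings shape the documentation implies; the specific tag and class names here are placeholders for whatever your own templates use:

```python
# Sketch of Layout Parser boilerplate exclusion. "excludeHtmlElements" and
# "excludeHtmlClasses" are the directives the docs reference; the values
# below are example placeholders, not Google defaults.
layout_parsing_config = {
    "layoutParsingConfig": {
        "excludeHtmlElements": ["nav", "footer", "aside"],   # skip whole tags
        "excludeHtmlClasses": ["sidebar", "cookie-banner"],  # skip by CSS class
    }
}
```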
Key points
- Three parser types in Vertex AI: Digital Parser for machine-readable text, OCR Parser for scanned and image-based PDFs, and Layout Parser for structured content with headings, tables, and lists
- Default chunk size is 500 tokens — approximately 350-400 words — consistent with independent research on optimal content segment length
- Layout-aware chunking prevents the system from splitting tables or lists mid-element, preserving semantic context within each chunk
- Boilerplate exclusion can filter out navigation, footer, and repeated UI elements — meaning structural HTML directly affects what content gets processed
- Content reaches the model as a window of retrieved chunks, not as a complete page, which explains why position within the document affects citation probability
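To make the layout-aware idea in the points above concrete, here is an illustrative packer that fills chunks with whole blocks (paragraphs, lists, tables) and never splits a block mid-element. This is my own toy sketch of the principle, not Google's algorithm, and it uses word count as a crude stand-in for tokens:

```python
def chunk_blocks(blocks, max_tokens=500):
    """Greedily pack whole blocks into chunks of at most max_tokens,
    never splitting a block mid-element. Words approximate tokens."""
    chunks, current, used = [], [], 0
    for block in blocks:
        tokens = len(block.split())  # crude proxy: 1 word ~ 1 token
        # Flush the current chunk if this whole block would overflow it
        if current and used + tokens > max_tokens:
            chunks.append("\n\n".join(current))
            current, used = [], 0
        current.append(block)
        used += tokens
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

The key property is the flush-before-append step: a table or bullet list either fits in the current chunk or starts the next one, so no chunk ever contains half an element.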
Key takeaway
Structure your content for chunk boundaries, not just for readers. Each section of approximately 350-400 words should be able to stand alone and answer a specific question. If a chunk gets pulled into an AI context window in isolation, it should still make sense and be attributable to your brand.
Also worth considering
The layout parser confirms that semantic HTML — proper heading hierarchy, lists in list elements, tables with headers — is not just accessibility best practice. It directly affects how AI systems segment and understand your content. Pages built with div soup and inline styling are harder for AI to parse correctly, and harder to chunk coherently.
What I'm testing
Reviewing several long-form pages to identify where natural chunk boundaries fall and whether the content at each boundary is self-contained enough to earn a citation in isolation. Adding explicit subheadings at 350-word intervals where they are currently missing.
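For anyone running the same audit, a stdlib-only sketch that counts words between h2/h3 boundaries is enough to spot sections drifting past the ~350-400 word band. The class and heuristics here are mine, and word counts only approximate tokens:

```python
from html.parser import HTMLParser

class SectionWordCounter(HTMLParser):
    """Tally words under each h2/h3 so you can see where natural
    chunk boundaries fall. Rough heuristic, not a tokenizer."""
    def __init__(self):
        super().__init__()
        self.sections = []        # list of (heading, word_count)
        self._heading = "(intro)" # words before the first heading
        self._in_heading = False
        self._words = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("h2", "h3"):
            # Close out the previous section before starting a new one
            self.sections.append((self._heading, self._words))
            self._heading, self._words = "", 0
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag in ("h2", "h3"):
            self._in_heading = False

    def handle_data(self, data):
        if self._in_heading:
            self._heading += data.strip()
        else:
            self._words += len(data.split())

    def close(self):
        super().close()
        self.sections.append((self._heading, self._words))  # flush last section

counter = SectionWordCounter()
counter.feed("<h2>Intro</h2><p>" + "word " * 420 + "</p><h2>Next</h2><p>short text</p>")
counter.close()
for heading, words in counter.sections:
    print(heading, words)
```

Sections well over ~400 words are candidates for the explicit subheadings mentioned above.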