- This checklist covers every technical step for answer engine optimization (AEO) implementation: schema types, llms.txt configuration, and AI crawler management
- Schema markup tells AI systems “this is a question, this is the answer” but most sites miss author disambiguation, the single most overlooked JSON-LD pattern
- llms.txt is a proposed standard that no major AI platform actually uses yet, so don’t prioritize it over proven technical work
- PerplexityBot drives real referral traffic. GPTBot trains models. ClaudeBot has variable compliance. Your robots.txt should reflect those differences.
- Print this. Work through it section by section. Skip what doesn’t apply.
The Full Technical AEO Checklist (With Opinions)
Most AEO guides give you the same watered-down advice: “add schema markup” and “optimize for AI.” That’s like telling someone to “do exercise” and expecting them to run a marathon.
This checklist is different. It’s the exact technical implementation sequence we follow when onboarding clients at Metronyx. Every item has a reason, a priority level, and context about whether it actually matters or just looks good in a blog post. Some of these items are genuinely urgent. Others are nice-to-haves that the SEO industry has inflated into must-dos.
I’ll tell you which is which.
Schema Markup Implementation
Foundation schema (Organization, Article, Author), content-type schema (FAQ, HowTo, Product), and advanced schema (ProfilePage, BreadcrumbList, Image metadata).
llms.txt Configuration
Create the proposed llms.txt file with an H1 header, blockquote summary, and curated page lists. Low priority and low-risk.
AI Crawler Management
Configure robots.txt per-bot: always allow PerplexityBot, selective GPTBot access, evaluate ClaudeBot, understand Google-Extended.
Content Accessibility
Server-side render critical content, remove nosnippet tags, fix canonical tags, whitelist AI bot IP ranges.
Content Structure for Retrieval
Standalone sections, BLUF method, question-based headings, answer capsules, and sourced statistics.
Monitoring & Verification
Monthly AI platform testing, hallucinated URL monitoring, sentiment tracking, schema validation, and server log review.
Part 1: Schema Markup Implementation
Structured data is how you translate your content from “text on a page” into something AI systems can parse, classify, and cite. Google and Microsoft both confirm that schema markup helps LLMs correctly interpret page content. Our schema generator handles the JSON-LD automatically, but understanding what each type does will help you audit what’s already on your site.
Priority 1: Foundation Schema (Do This Week)
- Organization schema on your homepage
- WebSite schema with SearchAction
- Article or BlogPosting schema on every content page
- Author markup with url AND sameAs properties
- datePublished and dateModified in ISO 8601 format with timezone
☐ Organization schema on your homepage
This defines your brand as an entity in Google’s Knowledge Graph. Without it, AI systems have to guess what your company is. Include: name, url, logo, sameAs (linking to your social profiles), contactPoint, and description. Use the sameAs property to connect your entity to LinkedIn, Twitter, Wikipedia, and any other official profiles.
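As a rough sketch of the properties listed above, the following Python builds an Organization object and wraps it in the script tag that goes in the page head. The helper names (`organization_jsonld`, `to_script_tag`), the `contactType` value, and every URL and email are placeholders assumed for illustration, not values from a real site.

```python
import json


def organization_jsonld(name, url, logo, description, same_as, contact_email):
    """Build an Organization JSON-LD object with the properties listed above."""
    return {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,
        "url": url,
        "logo": logo,
        "description": description,
        "sameAs": list(same_as),
        "contactPoint": {
            "@type": "ContactPoint",
            "contactType": "customer support",  # assumed value for illustration
            "email": contact_email,
        },
    }


def to_script_tag(data):
    """Wrap a JSON-LD object in the script tag that goes in the page head."""
    body = json.dumps(data, indent=2)
    return f'<script type="application/ld+json">\n{body}\n</script>'


# All values below are placeholders, not real profiles.
org = organization_jsonld(
    name="Your Company",
    url="https://yoursite.com/",
    logo="https://yoursite.com/logo.png",
    description="One-sentence description of what the company does.",
    same_as=[
        "https://www.linkedin.com/company/yourcompany",
        "https://twitter.com/yourcompany",
    ],
    contact_email="support@yoursite.com",
)
print(to_script_tag(org))
```

Generating the object from one function keeps the homepage markup and any audit script working from the same set of fields.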
☐ WebSite schema with SearchAction
Tells search engines your site has internal search functionality. Less about AI citations, more about appearing in sitelinks search boxes. Low effort, set it once.
☐ Article or BlogPosting schema on every content page
This is where most sites stop. They add Article schema and think they’re done. They’re not. The schema itself is table stakes. What matters is what goes inside it.
☐ Author markup with url AND sameAs properties
This is the single most impactful schema pattern most sites miss. Don’t just add an author name as a text string. Link to a dedicated author profile page using the url property, and connect to external profiles using sameAs. Google’s documentation says this helps “uniquely identify and disambiguate the exact author.” If AI can’t verify who wrote your content, it treats the content as less trustworthy.
{
"@type": "Person",
"name": "Sarah Chen",
"url": "https://yoursite.com/team/sarah-chen/",
"sameAs": [
"https://linkedin.com/in/sarahchen",
"https://twitter.com/sarahchen"
],
"jobTitle": "Head of SEO"
}
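To audit existing pages for this pattern, a small check along these lines may help. It is a sketch, assuming the Article JSON-LD has already been parsed into a Python dict; `author_is_disambiguated` is a hypothetical helper name, and the headline and page URL are placeholders.

```python
def person(name, url, same_as, job_title):
    """Author object that carries both a profile URL and external sameAs links."""
    return {
        "@type": "Person",
        "name": name,
        "url": url,
        "sameAs": list(same_as),
        "jobTitle": job_title,
    }


def author_is_disambiguated(article):
    """True only if every author is an object with non-empty url and sameAs."""
    author = article.get("author")
    authors = author if isinstance(author, list) else [author]
    return all(
        isinstance(a, dict) and a.get("url") and a.get("sameAs")
        for a in authors
    )


good_article = {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "Example headline",  # placeholder
    "author": person(
        "Sarah Chen",
        "https://yoursite.com/team/sarah-chen/",
        ["https://linkedin.com/in/sarahchen", "https://twitter.com/sarahchen"],
        "Head of SEO",
    ),
}

# The pattern to avoid: the author as a bare text string.
weak_article = {"@type": "Article", "headline": "Example headline", "author": "Sarah Chen"}

print(author_is_disambiguated(good_article))
print(author_is_disambiguated(weak_article))
```

The check treats a missing author, a plain string, and an object without `url` or `sameAs` the same way, since none of them gives a crawler anything to verify against.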
☐ datePublished and dateModified in ISO 8601 format with timezone
AI platforms weight recency. Google’s documentation says that if you don’t provide a timezone, it defaults to the timezone used by Googlebot (Pacific Time). Always include the timezone offset.
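A minimal sketch of what “with timezone” means in practice, using only Python’s standard library. The date and the UTC-5 offset are example values, and both function names are made up for this illustration.

```python
from datetime import datetime, timedelta, timezone


def iso_with_offset(naive_local, utc_offset_hours):
    """Attach a fixed UTC offset to a local time and format it as ISO 8601."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return naive_local.replace(tzinfo=tz).isoformat()


def has_timezone(value):
    """True if an ISO 8601 string carries an explicit UTC offset."""
    return datetime.fromisoformat(value).utcoffset() is not None


# Example values: published at 09:30 local time, five hours behind UTC.
published = iso_with_offset(datetime(2025, 1, 15, 9, 30), -5)
print(published)  # 2025-01-15T09:30:00-05:00

# Date-only and offset-less values are the ones to look for in an audit.
print(has_timezone("2025-01-15T09:30:00"))
print(has_timezone(published))
```

A fixed offset keeps the example dependency-free; a site that observes daylight saving time would more likely derive the offset from a named timezone.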
Priority 2: Content-Type Schema (Do This Month)
☐ FAQPage schema on pages with Q&A sections
AI is designed to answer questions. FAQs give models ready-made question-answer pairs they can extract directly. According to Google’s FAQ schema docs, this structured labeling improves the reliability of answer extraction.
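For pages that already hold their questions and answers in a CMS or a template, the markup can be generated rather than hand-written. A sketch, with `faq_page` as a hypothetical helper and placeholder Q&A text:

```python
import json


def faq_page(pairs):
    """Build FAQPage JSON-LD from (question, answer) pairs."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }


# Placeholder content; use the same wording that is visible on the page.
faq = faq_page([
    ("Example question one?", "Example answer one."),
    ("Example question two?", "Example answer two."),
])
print(json.dumps(faq, indent=2))
```

Building the markup from the same source as the visible Q&A section helps keep the two from drifting apart.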
☐ HowTo schema on step-by-step guides
Research shows AI models reproduce instructions with higher accuracy when content is segmented into labeled steps rather than free-flowing paragraphs. Use numbered steps with H3 labels.
☐ Product or Review schema on commercial pages
Google’s Shopping Graph has over 35 billion product listings that feed AI-generated shopping responses. If you sell products and don’t have Product schema, you don’t exist in that dataset.
☐ Speakable schema on news/article content (if applicable)
The speakable property identifies sections best suited for text-to-speech playback. When Google Assistant reads these sections aloud, it explicitly attributes the source and sends the full article URL to the user’s mobile device. Still in beta, but worth implementing on high-value pages.
Priority 3: Advanced Schema (When Resources Allow)
☐ ProfilePage schema on author bio pages
Reinforces author identity and connects back to Article schema author markup.
☐ BreadcrumbList schema for site navigation
Helps AI understand your site hierarchy and content relationships.
☐ Image metadata with C2PA and IPTC
Google extracts C2PA metadata and IPTC photo metadata to display citations in the “About this image” feature. If you produce original images, this is how you get credited.
Part 2: llms.txt Configuration
I need to be upfront about this: llms.txt is a proposed standard that no major AI platform currently uses. Google’s John Mueller has explicitly called it “unnecessary.” Search Engine Journal’s Roger Montti documented how the SEO industry has turned llms.txt into a self-reinforcing loop of misunderstanding, where tools check for it and flag its absence as a “risk” even though no AI platform reads it.
So should you skip it entirely? Not necessarily. But put it dead last on your priority list.
If You Choose to Implement llms.txt
☐ Create /llms.txt in your site root
The file follows a specific Markdown structure proposed by Jeremy Howard in September 2024. Place it at yoursite.com/llms.txt.
☐ Required: H1 header with your project/site name
This is the only strictly required element.
☐ Optional: Blockquote summary of your site
A 2-3 sentence description providing key background.
☐ Optional: Markdown sections with URLs to key content
Format as: [Resource Name](https://yoursite.com/page): Brief description
☐ Optional: “Optional” H2 section for lower-priority URLs
Content the LLM can skip if its context window is tight.
Here’s what a minimal llms.txt looks like:
# Your Company Name
> Brief description of what your company does
> and what kind of content lives on your site.
## Core Resources
- [Product Overview](https://yoursite.com/product): Main product page
- [Documentation](https://yoursite.com/docs): Technical documentation
- [Blog](https://yoursite.com/blog): Industry insights and guides
## Optional
- [Changelog](https://yoursite.com/changelog): Product updates
- [Team](https://yoursite.com/team): About our team
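If the file is generated from a content inventory, something like the following could produce the structure shown above and confirm the one required element is present. This is a sketch: `build_llms_txt` and `has_required_h1` are names invented here, not part of the proposal.

```python
def build_llms_txt(title, summary, sections):
    """Render the H1 / blockquote / H2 link-list layout described above."""
    lines = [f"# {title}", ""]
    if summary:
        lines += [f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        for name, url, description in links:
            lines.append(f"- [{name}]({url}): {description}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


def has_required_h1(text):
    """The first non-empty line must be an H1, the only required element."""
    for line in text.splitlines():
        if line.strip():
            return line.startswith("# ")
    return False


llms_txt = build_llms_txt(
    "Your Company Name",
    "Brief description of what your company does.",
    {
        "Core Resources": [
            ("Product Overview", "https://yoursite.com/product", "Main product page"),
            ("Documentation", "https://yoursite.com/docs", "Technical documentation"),
        ],
        "Optional": [
            ("Changelog", "https://yoursite.com/changelog", "Product updates"),
        ],
    },
)
print(llms_txt)
```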
The Security Risk Nobody Talks About
A 2024 research paper on Adversarial Search Engine Optimization for LLMs demonstrated that Markdown files like llms.txt can be manipulated through “Preference Manipulation Attacks.” Attackers inject hidden text or prompts to trick LLMs into unfairly promoting their content. In testing, these attacks made targeted products 2.5x more likely to be recommended by LLMs.
This means llms.txt creates a new attack surface. Something to weigh if you’re in a competitive space where someone might try to sabotage your AI presence.
Part 3: AI Crawler Management
This is where most technical AEO conversations start. It should be where they end, because crawler management only works when schema and content structure are already in place. But it’s still a dealbreaker if you get it wrong.
robots.txt Configuration for AI Crawlers
Stop thinking about AI crawlers as one category. They have different purposes and different compliance records. Here’s what each major bot does and how to handle it.
- PerplexityBot: ALLOW (drives real referral traffic with clickable citations)
- GPTBot: SELECTIVE (allow public content, block proprietary material)
- ClaudeBot/Claude-SearchBot: EVALUATE (variable compliance with robots.txt)
- Google-Extended: ALLOW (controls Gemini visibility)
- CCBot (Common Crawl): ALLOW (low-risk, broad AI training data)
☐ PerplexityBot: ALLOW
This is the one AI crawler you should never block. PerplexityBot retrieves real-time content for Perplexity’s search engine and provides direct, clickable source citations. It drives highly qualified referral traffic. The Perplexity documentation confirms it respects robots.txt directives.
☐ GPTBot: SELECTIVE
OpenAI’s crawler collects data to train models like ChatGPT. It provides brand awareness through model training but zero direct referral traffic. Allow it for public thought leadership content. Block it for proprietary content or competitive intelligence.
User-agent: GPTBot
Allow: /blog/
Allow: /guides/
Disallow: /internal/
Disallow: /pricing/
☐ ClaudeBot/Claude-SearchBot: EVALUATE
Anthropic runs two bots. ClaudeBot crawls for model training. Claude-SearchBot crawls for real-time search. ClaudeBot has shown variable compliance with robots.txt delay rules, so blocking the training bot while allowing the search bot is a reasonable default.
☐ Googlebot and Google-Extended: UNDERSTAND THE DIFFERENCE
Googlebot handles standard Google Search indexing. Google-Extended controls whether your content trains Gemini and Vertex AI. Blocking Google-Extended doesn’t affect your standard search rankings at all, but it stops your content from appearing in Gemini’s responses. If AI visibility matters to you, allow it.
☐ CCBot (Common Crawl): ALLOW
Non-profit crawler that builds a public web archive. Generally honors robots.txt directives. Many AI models use Common Crawl data in their training sets, so allowing it is a low-risk way to increase your training data footprint.
Recommended robots.txt Template
# AI Search Crawlers (provide citations + traffic)
User-agent: PerplexityBot
Allow: /
# AI Training Crawlers (selective access)
User-agent: GPTBot
Allow: /blog/
Allow: /guides/
Allow: /resources/
Disallow: /internal/
Disallow: /client-portal/
# Claude Search (allow) vs Training (block)
User-agent: Claude-SearchBot
Allow: /
User-agent: ClaudeBot
Disallow: /
# Google AI (allow for Gemini visibility)
User-agent: Google-Extended
Allow: /
# Common Crawl (allow for broad AI training)
User-agent: CCBot
Allow: /
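Before deploying a template like this, it can be worth checking that each bot gets the access you intended. The sketch below uses Python’s standard `urllib.robotparser`; the test URLs are placeholders and `is_allowed` is a helper name chosen here. One caveat: the standard-library parser applies the first matching rule in file order, while some crawlers use the longest matching path, so results could differ for files with overlapping rules. This template has none.

```python
from urllib.robotparser import RobotFileParser

# The template from this section, without the comments.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Allow: /

User-agent: GPTBot
Allow: /blog/
Allow: /guides/
Allow: /resources/
Disallow: /internal/
Disallow: /client-portal/

User-agent: Claude-SearchBot
Allow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Allow: /

User-agent: CCBot
Allow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Parse robots.txt text and ask whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Placeholder URLs covering each rule group.
CHECKS = [
    ("PerplexityBot", "https://yoursite.com/blog/post"),
    ("GPTBot", "https://yoursite.com/blog/post"),
    ("GPTBot", "https://yoursite.com/client-portal/login"),
    ("GPTBot", "https://yoursite.com/pricing/"),
    ("Claude-SearchBot", "https://yoursite.com/guides/aeo"),
    ("ClaudeBot", "https://yoursite.com/guides/aeo"),
]

for agent, url in CHECKS:
    verdict = "allow" if is_allowed(ROBOTS_TXT, agent, url) else "block"
    print(f"{agent:18} {verdict:6} {url}")
```

Note what the `/pricing/` check shows: any path not matched by a rule is allowed by default, so the GPTBot group only blocks the two directories it names.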
Beyond robots.txt: Enforcement Layers
Robots.txt compliance is voluntary. Most commercial crawlers respect it, but some don’t. Treat robots.txt as Layer 1. If you need stricter control, add the layers below:
☐ Layer 2: Rate limiting via Cloudflare or nginx
For crawlers that respect robots.txt but ignore crawl-delay directives.
☐ Layer 3: User-agent filtering at the server level
Hard blocks for crawlers with documented compliance problems.
☐ Layer 4: WAF rules for sophisticated violators
Catches bots that spoof user-agents or rotate IPs.
☐ Layer 5: Behavioral analysis
Traffic pattern monitoring that identifies bot-like behavior regardless of declared user-agent.
This isn’t “set and forget.” It’s continuous compliance monitoring. We track actual crawler behavior against declared policies, identify violators through log analysis, and adjust enforcement as patterns change.
Part 4: Content Accessibility for AI Systems
☐ Server-side render all critical content
Not all AI systems render JavaScript. If your product descriptions, FAQ answers, or key content loads via client-side JS, LLM scrapers may never see it. Pre-render or SSR everything that matters.
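One way to approximate what a non-rendering crawler sees is to look only at the text in the raw HTML response and ignore anything inside script tags. The sketch below does that with the standard-library HTML parser. The sample markup and phrases are invented, and fetching the page is left out so the example stays self-contained.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes from raw HTML, skipping script and style content."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def missing_from_raw_html(html, phrases):
    """Return the phrases that do not appear in the server-delivered text."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split()).lower()
    return [phrase for phrase in phrases if phrase.lower() not in text]


# Invented sample: the heading is server-rendered, the FAQ answer is injected by JS.
RAW_HTML = """<html><body>
<h1>Acme Widget</h1>
<div id="faq"></div>
<script>document.getElementById("faq").textContent = "Ships in 2 days";</script>
</body></html>"""

print(missing_from_raw_html(RAW_HTML, ["Acme Widget", "Ships in 2 days"]))
```

Any phrase the function returns exists only after JavaScript runs, which makes it a candidate for pre-rendering or SSR.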
☐ Remove nosnippet from high-value pages
The nosnippet meta robots rule prevents content from being used as a direct input for AI Overviews and AI Mode. If you have this on informational content pages, remove it.
☐ Set self-referencing canonical tags
AI systems need to know which version of a URL to retrieve for synthesis. Without canonical tags, duplicate content signals confuse which page gets cited.
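The nosnippet and canonical checks can both be run against a page’s HTML in one pass. A sketch, assuming the HTML has already been fetched; `HeadAudit` and `audit_head` are hypothetical names, and treating a trailing slash as insignificant is a simplification made here.

```python
from html.parser import HTMLParser


class HeadAudit(HTMLParser):
    """Record robots meta directives and the canonical link from a page."""

    def __init__(self):
        super().__init__()
        self.robots_directives = []
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        if tag == "meta" and name in ("robots", "googlebot"):
            content = attributes.get("content") or ""
            self.robots_directives += [d.strip().lower() for d in content.split(",")]
        rel = (attributes.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel:
            self.canonical = attributes.get("href")


def audit_head(html, page_url):
    """Flag nosnippet and check that the canonical points back at page_url."""
    parser = HeadAudit()
    parser.feed(html)
    parser.close()
    self_canonical = (
        parser.canonical is not None
        and parser.canonical.rstrip("/") == page_url.rstrip("/")
    )
    return {"nosnippet": "nosnippet" in parser.robots_directives,
            "self_canonical": self_canonical}


# Invented sample page head.
SAMPLE = """<head>
<meta name="robots" content="max-image-preview:large, nosnippet">
<link rel="canonical" href="https://yoursite.com/guide/">
</head>"""

print(audit_head(SAMPLE, "https://yoursite.com/guide"))
```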
☐ Use descriptive anchor text for internal links
AI systems use anchor text to understand semantic relationships between pages. “Click here” tells them nothing. “Learn how crawlability impacts SEO performance” gives them context.
☐ Whitelist AI bot IP ranges in your firewall/CDN
Some Cloudflare and WAF configurations block AI bots by default under “bot protection” settings. Check your bot management rules to confirm PerplexityBot and GPTBot aren’t getting caught in blanket blocks.
☐ Implement image accessibility for AI multimodal retrieval
AI search is increasingly pulling images, charts, and tables. Serve images via clean HTML, use descriptive alt text, add captions, and never use images of tables. AI can parse HTML tables but can’t read screenshots of spreadsheets.
Part 5: Content Structure for Chunk-Level Retrieval
☐ Structure every section as a standalone answer
AI engines break pages into passages and retrieve the single most relevant chunk. Each section should be understandable without reading anything else on the page.
☐ Use the BLUF method (Bottom Line Up Front)
Answer the question in the first sentence of every section. Expand after. AI grabs the opening, not the conclusion.
☐ Write question-based H2/H3 headings
People ask AI full questions: “What’s the best CRM for small ecommerce businesses?” Your headings should mirror that exact phrasing.
☐ Add an answer capsule after every question heading
A concise 1-2 sentence direct answer in the first 40-60 words. Research from Seer Interactive found brands cited in AI Overviews earn 35% more organic clicks and 91% more paid clicks compared to uncited brands.
☐ Replace vague claims with sourced statistics
Content with specific, sourced stats gets cited more often than generalizations. “Many businesses struggle with email marketing” loses to “Email marketing generates $42 for every $1 spent, according to Litmus’s 2024 research.”
☐ Maintain heading frequency (one every 150-200 words)
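Heading frequency is easy to measure mechanically. The sketch below counts the words that follow each H1 to H3 heading in a page’s HTML and reports sections over a limit. The 200-word default mirrors the upper bound above; `long_sections` is a name chosen for this example, and script or navigation text is not filtered out, so treat the counts as approximate.

```python
from html.parser import HTMLParser


class SectionWords(HTMLParser):
    """Count the words that follow each h1-h3 heading."""

    HEADINGS = {"h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.sections = []  # each entry is [heading_text, word_count]
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._in_heading = True
            self.sections.append(["", 0])

    def handle_endtag(self, tag):
        if tag in self.HEADINGS:
            self._in_heading = False

    def handle_data(self, data):
        if not self.sections:
            return  # text before the first heading is ignored
        if self._in_heading:
            self.sections[-1][0] += data
        else:
            self.sections[-1][1] += len(data.split())


def long_sections(html, max_words=200):
    """Return (heading, word_count) for sections longer than max_words."""
    parser = SectionWords()
    parser.feed(html)
    parser.close()
    return [(h.strip(), n) for h, n in parser.sections if n > max_words]


# Invented sample: one 250-word section and one short one.
SAMPLE = (
    "<h2>What is AEO?</h2><p>" + "word " * 250 + "</p>"
    "<h2>How do I start?</h2><p>Short answer here.</p>"
)
print(long_sections(SAMPLE))
```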
Part 6: Monitoring & Verification
☐ Test your brand across AI platforms monthly
Search your brand name and key topics on ChatGPT, Perplexity, Gemini, and Claude. Document which sources get cited.
☐ Monitor for hallucinated URLs
AI systems sometimes cite pages on your domain that don’t exist. Check your analytics for pages receiving AI referral traffic that resolve to 404 errors. 301 redirect those hallucinated URLs to relevant live pages.
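A sketch of that check, assuming your analytics or log export can be reduced to records with a path, status code, and referrer. The record format is invented for this example, and the referrer host list is an assumption to adjust against what actually shows up in your own data.

```python
from collections import Counter
from urllib.parse import urlparse

# Assumed referrer hosts for AI platforms; verify against your own analytics.
AI_REFERRER_HOSTS = {
    "chatgpt.com",
    "perplexity.ai",
    "www.perplexity.ai",
    "gemini.google.com",
    "claude.ai",
}


def hallucinated_paths(records):
    """Count 404 responses whose referrer is an AI platform, most frequent first."""
    hits = Counter()
    for record in records:
        host = urlparse(record["referrer"]).hostname or ""
        if record["status"] == 404 and host in AI_REFERRER_HOSTS:
            hits[record["path"]] += 1
    return hits.most_common()


# Invented sample records.
RECORDS = [
    {"path": "/guides/aeo-pricing", "status": 404, "referrer": "https://chatgpt.com/"},
    {"path": "/guides/aeo-pricing", "status": 404, "referrer": "https://www.perplexity.ai/search"},
    {"path": "/blog/real-post", "status": 200, "referrer": "https://chatgpt.com/"},
    {"path": "/old-page", "status": 404, "referrer": "https://www.google.com/"},
    {"path": "/typo", "status": 404, "referrer": ""},
]

for path, count in hallucinated_paths(RECORDS):
    print(count, path)
```

The paths this returns are the candidates for 301 redirects to the closest live page.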
☐ Track brand sentiment across AI responses
Tag mentions as positive, neutral, or negative. Note which third-party sources the models cite when framing your brand. If AI keeps repeating an outdated complaint, update the upstream sources it’s pulling from.
☐ Validate schema with Google’s Rich Results Test
Run every major page template through Google’s testing tool. Fix errors before moving to the next implementation item.
☐ Review server logs for AI crawler activity
Are the bots you’ve allowed actually visiting? How frequently? Which pages do they hit most? This data tells you what’s working and what’s being ignored.
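Those questions can be answered with a short script over the access log. The sketch below assumes the common “combined” log format and matches crawler names as substrings of the user-agent field. The sample lines and their user-agent strings are simplified inventions, not the bots’ real strings. Google-Extended is left out because it is a robots.txt control token rather than a separate crawler user agent.

```python
import re
from collections import Counter

# Crawler names from this checklist, matched as substrings of the user agent.
AI_BOT_TOKENS = ["PerplexityBot", "GPTBot", "ClaudeBot", "Claude-SearchBot", "CCBot"]

# Request, status, size, referrer, and user agent from a combined-format line.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def crawler_hits(lines):
    """Count requests per (bot, path) for the AI crawlers listed above."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue
        agent = match.group("agent").lower()
        for token in AI_BOT_TOKENS:
            if token.lower() in agent:
                hits[(token, match.group("path"))] += 1
                break
    return hits


# Invented sample lines with simplified user-agent strings.
SAMPLE_LOG = [
    '203.0.113.5 - - [10/Jan/2025:10:00:00 +0000] "GET /blog/aeo-checklist HTTP/1.1" 200 5123 "-" "ExampleAgent/1.0 (compatible; GPTBot)"',
    '203.0.113.9 - - [10/Jan/2025:10:05:00 +0000] "GET /blog/aeo-checklist HTTP/1.1" 200 5123 "-" "ExampleAgent/1.0 (compatible; PerplexityBot)"',
    '203.0.113.5 - - [10/Jan/2025:11:00:00 +0000] "GET /blog/aeo-checklist HTTP/1.1" 200 5123 "-" "ExampleAgent/1.0 (compatible; GPTBot)"',
    '198.51.100.7 - - [10/Jan/2025:11:30:00 +0000] "GET / HTTP/1.1" 200 900 "-" "Mozilla/5.0"',
]

for (bot, path), count in crawler_hits(SAMPLE_LOG).most_common():
    print(f"{count:3}  {bot:16} {path}")
```

Since user-agent strings can be spoofed, counts like these are a starting point for the review, not proof of which crawler made the request.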
Implementation Sequence
Don’t do everything at once. This is the order that produces the fastest results based on what we’ve seen across hundreds of AI visibility audits:
- Week 1: Fix robots.txt and firewall/CDN bot rules (Part 3)
- Week 2: Add Foundation Schema to homepage and top 10 content pages (Part 1, Priority 1)
- Week 3: Restructure top 5 content pages for chunk-level retrieval (Part 5)
- Week 4: Add FAQ and HowTo schema, fix content accessibility issues (Part 1, Priority 2; Part 4)
- Ongoing: Monthly monitoring and verification (Part 6)
If your entity architecture is already solid and you just need the technical layer, this checklist covers it. If you’re starting from scratch and want someone to run through the whole thing with you, that’s what our audit is for.
Frequently Asked Questions
Organization schema on your homepage, Article/BlogPosting schema with proper author markup on content pages, and FAQPage schema on pages with Q&A sections. Author disambiguation using the url and sameAs properties is the most overlooked pattern. It lets AI systems verify who wrote the content, which directly impacts citation trustworthiness.
No major AI platform currently reads llms.txt files. Google’s John Mueller has called it “unnecessary.” It’s a proposed standard, not an adopted one. Focus on schema markup, robots.txt configuration, and content structure first. If you have time after everything else is done, implementing llms.txt is low-risk but likely low-reward.
Always allow PerplexityBot because it drives real referral traffic with clickable citations. Allow GPTBot selectively for public content. Allow Google-Extended if you want Gemini visibility. Consider blocking ClaudeBot (training) while allowing Claude-SearchBot (search). Each crawler serves a different purpose, so blanket allow or blanket block policies are both wrong.
Search your brand name and key topics on ChatGPT, Perplexity, and Claude. If your content never appears as a source, check robots.txt for AI crawler blocks, review your CDN and WAF settings for bot-blocking rules, and confirm critical content isn’t rendered via client-side JavaScript only.
Check and fix your robots.txt file to allow PerplexityBot and GPTBot, then add Organization schema to your homepage and Article schema with proper author markup to your top 5 content pages. These two changes take under an hour and cover the highest-impact items on this checklist.