How we measure AI search visibility
Every metric in this dashboard is backed by primary research. Here's exactly how the numbers are calculated - and why we do it differently.
Why repeated sampling?
“Single-run visibility metrics provide a misleadingly precise picture.”
LLMs are non-deterministic - the same prompt can produce different outputs on every call. A single snapshot confuses random variation with real signal.
We run each query N=5 times per AI engine and report bootstrap 95% confidence intervals - so you see not just the point estimate, but the uncertainty range. See also: arXiv:2604.07585 - “Don't Measure Once”, which establishes N=5 as the practical minimum for stable LLM visibility estimates.
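As a rough illustration of the repeated-sampling idea, the sketch below computes a percentile bootstrap 95% interval from five runs of one query on one engine. The function name `bootstrap_ci`, the resample count, and the binary encoding (1 = brand mentioned) are assumptions made for this example; the dashboard's actual resampling scheme, including how it pools runs across prompts, is not specified here.

```python
import random

def bootstrap_ci(outcomes, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-run outcomes.

    outcomes: list of 0/1 values, one per run (1 = brand mentioned).
    Returns (point_estimate, lower_bound, upper_bound).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(outcomes)
    # Resample the runs with replacement and record each resample's mean.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(outcomes) / n, lower, upper

# Five runs of the same prompt: mentioned in 3 of 5.
runs = [1, 0, 1, 1, 0]
point, lower, upper = bootstrap_ci(runs)
print(f"selection rate {point:.0%}, 95% CI [{lower:.0%}, {upper:.0%}]")
```

With only five runs of a single prompt the interval is wide, which is the point: the width shows how far a single snapshot could have landed from the estimate.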
The four metrics we track
Metric definitions from arXiv:2604.25707 (per-engine citation divergence + selection rate):
Selection Rate
% of prompts where the AI mentions your brand in its response. The primary visibility signal.
AI Share of Voice
Your mention count divided by total brand mentions across you + competitors. Competitive positioning.
Citation Rate
% of responses where your brand is cited with a URL reference, not just mentioned by name.
Discovery Gap
Difference between your branded query recognition rate and your open-discovery appearance rate.
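The four definitions above can be sketched in a few lines. This is a minimal illustration, not the production pipeline: the `Response` record, its field names, and the "branded" / "discovery" labels are hypothetical, and the sketch counts one response per prompt.

```python
from dataclasses import dataclass, field

@dataclass
class Response:
    query_type: str                                        # "branded" or "discovery" (hypothetical labels)
    brands_mentioned: list = field(default_factory=list)   # brands named in the answer
    brands_cited: list = field(default_factory=list)       # brands cited with a URL reference

def rate(responses, predicate):
    """Percentage of responses for which predicate is true (0.0 if empty)."""
    if not responses:
        return 0.0
    hits = sum(1 for r in responses if predicate(r))
    return 100.0 * hits / len(responses)

def visibility_metrics(responses, brand, competitors):
    def mentioned(r):
        return brand in r.brands_mentioned

    tracked = {brand, *competitors}
    all_mentions = [b for r in responses for b in r.brands_mentioned if b in tracked]
    own_mentions = sum(1 for b in all_mentions if b == brand)

    branded = [r for r in responses if r.query_type == "branded"]
    discovery = [r for r in responses if r.query_type == "discovery"]

    return {
        "selection_rate": rate(responses, mentioned),
        "share_of_voice": 100.0 * own_mentions / len(all_mentions) if all_mentions else 0.0,
        "citation_rate": rate(responses, lambda r: brand in r.brands_cited),
        "discovery_gap": rate(branded, mentioned) - rate(discovery, mentioned),
    }

# Made-up example data for a brand called "Acme".
responses = [
    Response("branded", ["Acme"], ["Acme"]),
    Response("branded", ["Acme", "Globex"], []),
    Response("discovery", ["Globex"], ["Globex"]),
    Response("discovery", ["Acme", "Globex", "Initech"], []),
]
m = visibility_metrics(responses, "Acme", ["Globex", "Initech"])
print(m)
```

In this toy data the brand is mentioned in 3 of 4 responses, cited with a URL in 1 of 4, and holds 3 of the 7 tracked brand mentions.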
Vertical-specific scoring
The 8 KPIs combine into the composite under a vertical-specific weighting calibrated to how buyers actually discover that kind of business. A SaaS company is evaluated against named alternatives, so comparison and share of voice carry more weight; a publisher monetises referral traffic, so being cited matters more than being named. The weighting is normalised so scores stay directly comparable on the 0 to 100 scale.
Current score model: v2. Scores rendered under earlier models are tagged with the model they were computed under and are not silently re-priced. See your report header for the model badge.
Hotels
Discovery and share of voice dominate; open-ended "best hotel in city" prompts win the booking.
SaaS
Comparison and share of voice dominate; "X vs Y" buyer prompts decide the shortlist.
Ecommerce
Share of voice and product-level comparison dominate the cart-winning slot.
Publishers
Citation rate and branded recognition lead; being linked matters more than being named.
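To show what a normalised vertical weighting looks like in practice, here is a small sketch. The weight values and KPI names below are invented for illustration only (the real v2 weights and the full KPI list are not published on this page); the part that mirrors the text is that each vertical's weights are divided by their sum, so the composite stays on the same 0 to 100 scale.

```python
# HYPOTHETICAL raw weights, chosen only to illustrate the idea.
RAW_WEIGHTS = {
    "saas":      {"comparison": 3, "share_of_voice": 3, "discovery": 1, "citation_rate": 1},
    "publisher": {"comparison": 1, "share_of_voice": 1, "discovery": 1, "citation_rate": 3},
}

def composite(kpis, vertical):
    """Weighted average of 0-100 KPI scores, rounded to an integer."""
    raw = RAW_WEIGHTS[vertical]
    total = sum(raw.values())  # normalise so the weights sum to 1
    return round(sum(kpis[name] * weight / total for name, weight in raw.items()))

kpis = {"comparison": 80, "share_of_voice": 60, "discovery": 20, "citation_rate": 40}
saas_score = composite(kpis, "saas")
publisher_score = composite(kpis, "publisher")
print(saas_score, publisher_score)
```

The same KPI values produce different composites under the two weightings, but a brand scoring 100 on every KPI scores 100 in either vertical.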
The discovery gap
Research (arXiv:2601.00912) found a stark asymmetry in how AI assistants handle branded vs. open queries: 99.4% named-query recognition versus 3.32% discovery surfacing.
AI assistants are reliable at answering questions about your brand, but rarely surface you unprompted in category discovery. The Agentic Commerce Readiness Score focuses on closing this discovery gap.
Agent Readiness
On top of measuring what AI engines say about your brand, we check whether your domain is set up to be read by AI agents in the first place. Each scan fetches five public-web signals from your site. Every signal is weighted equally (0.20) and the composite is rounded to a 0–100 integer. Each fetch has its own 3-second timeout; a timeout, network error, or non-200 response counts as “absent” rather than failing the report.
llms.txt
GET /llms.txt returns 200 with non-empty body, an emerging site-level instructions file AI coding agents already read.
AI bot directives in robots.txt
robots.txt names at least one of GPTBot, CCBot, ClaudeBot, anthropic-ai, PerplexityBot, or Google-Extended as a User-agent. Plain User-agent: * does NOT count.
Schema.org JSON-LD
Your homepage HTML contains a <script type="application/ld+json"> block with a schema.org @type. Machine-readable entity facts engines can merge.
MCP manifest
GET /.well-known/mcp.json returns 200 and parses as JSON, the emerging convention for declaring agent endpoints.
Sitemap
GET /sitemap.xml returns 200 with non-empty body, the canonical list of pages every major crawler honours.
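The five checks above could be approximated as follows. This is a sketch under stated assumptions, not the scanner itself: the helper names (`http_get`, `names_ai_bot`, `agent_readiness`), the use of HTTPS, and the simple regex used to spot JSON-LD are choices made for this example. The fetcher is injectable, so the demo at the bottom runs against canned pages rather than the network.

```python
import json
import re
import urllib.request

AI_BOTS = ("GPTBot", "CCBot", "ClaudeBot", "anthropic-ai", "PerplexityBot", "Google-Extended")
TIMEOUT_SECONDS = 3   # per-fetch timeout, as described above
WEIGHT = 0.20         # five equally weighted signals

def http_get(url):
    """Return the body on HTTP 200, otherwise None (any failure means absent)."""
    try:
        with urllib.request.urlopen(url, timeout=TIMEOUT_SECONDS) as resp:
            if resp.status != 200:
                return None
            return resp.read().decode("utf-8", errors="replace")
    except Exception:  # timeout, network error, HTTP error status, bad URL
        return None

def non_empty(body):
    return bool(body.strip())

def names_ai_bot(robots_txt):
    """True if some User-agent line names a listed AI bot; '*' does not count."""
    wanted = {bot.lower() for bot in AI_BOTS}
    for line in robots_txt.splitlines():
        name, _, value = line.partition(":")
        if name.strip().lower() == "user-agent":
            if value.split("#", 1)[0].strip().lower() in wanted:
                return True
    return False

JSON_LD = re.compile(
    r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.IGNORECASE | re.DOTALL,
)

def has_json_ld(html):
    # Simplification: looks for an "@type" key inside any JSON-LD script block.
    return any('"@type"' in block for block in JSON_LD.findall(html))

def is_json(body):
    try:
        json.loads(body)
        return True
    except ValueError:
        return False

def agent_readiness(domain, fetch=http_get):
    base = f"https://{domain}"

    def present(path, check):
        body = fetch(base + path)
        return body is not None and check(body)

    signals = {
        "llms_txt": present("/llms.txt", non_empty),
        "robots_ai_bots": present("/robots.txt", names_ai_bot),
        "json_ld": present("/", has_json_ld),
        "mcp_manifest": present("/.well-known/mcp.json", is_json),
        "sitemap": present("/sitemap.xml", non_empty),
    }
    return round(100 * WEIGHT * sum(signals.values())), signals

# Offline demo: canned pages stand in for a real site (no MCP manifest, wildcard-only robots.txt).
pages = {
    "https://example.com/llms.txt": "# Example\n",
    "https://example.com/robots.txt": "User-agent: *\nDisallow:\n",
    "https://example.com/": '<script type="application/ld+json">{"@context": "https://schema.org", "@type": "Organization"}</script>',
    "https://example.com/sitemap.xml": "<urlset></urlset>",
}
score, signals = agent_readiness("example.com", fetch=pages.get)
print(score, signals)
```

In the demo, three of the five signals are present, so the composite is 60. Passing `fetch=http_get` (the default) would run the same checks against a live domain.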
Why we measure all 4 engines
Different AI engines exhibit meaningfully different citation behaviour. ChatGPT tends to cite fewer sources per response but exercises deeper influence on each cited page. Perplexity surfaces more sources per response but the per-source influence is shallower. Gemini and Claude follow distinct retrieval patterns again. No single engine represents your overall AI visibility - you need all four.
Our consistency score penalises high variance across engines, rewarding brands that are visible everywhere rather than just dominant on one platform.
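The exact consistency formula is not given on this page. Purely as an illustration of "penalising variance across engines", the sketch below uses one possible form: 100 minus twice the population standard deviation of the per-engine selection rates. The function name and the formula are assumptions for this example, not the dashboard's definition.

```python
from statistics import pstdev

def consistency(per_engine_rates):
    """Hypothetical consistency score on a 0-100 scale.

    per_engine_rates: selection rates (0-100), one per engine.
    The population std dev of values in [0, 100] is at most 50,
    so 100 - 2 * pstdev stays within 0 to 100.
    """
    return 100.0 - 2.0 * pstdev(per_engine_rates)

even = consistency([60, 55, 65, 60])    # similar visibility on all four engines
skewed = consistency([90, 10, 5, 5])    # dominant on one engine only
print(round(even, 1), round(skewed, 1))
```

Under this assumed formula the evenly visible brand scores far higher than the one that dominates a single platform. Note that it measures spread only, so it would need to be read alongside the visibility metrics themselves.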
How we compare
A confidence interval is the uncertainty range you get from asking the same question repeatedly. Competitors' single-run snapshots give you one point estimate and no indication of how much it would move on the next run.
| Tool | Sampling | Confidence intervals | Engines | Rigour |
|---|---|---|---|---|
| Semrush AI Toolkit | Single snapshot ($99 add-on) | None | ChatGPT, Perplexity, Gemini | Point estimate only |
| Superlines | Single daily snapshot | None | ChatGPT, Perplexity, Gemini | Point estimate only |
| Evertune | Single daily snapshot | None | Not disclosed | Point estimate only |
| AthenaHQ | Single daily snapshot | None | Varies by plan | Point estimate only |
| Bluerails Discovery | N=5 per prompt per engine | Bootstrap 95% CI | ChatGPT, Perplexity, Gemini, Claude | arXiv-cited bootstrap methodology |
Primary research cited
- arXiv:2603.08924: Repeated sampling and statistical confidence in LLM visibility measurement
- arXiv:2604.07585: "Don't Measure Once" - N=5 minimum for stable LLM visibility estimates
- arXiv:2604.25707: Selection rate definition and per-engine citation divergence
- arXiv:2601.00912: 99.4% named-query recognition vs 3.32% discovery surfacing in AI responses