How we measure AI search visibility

Every metric in this dashboard is backed by primary research. Here's exactly how the numbers are calculated - and why we do it differently.

Why repeated sampling?

“Single-run visibility metrics provide a misleadingly precise picture.”
- arXiv:2603.08924

LLMs are non-deterministic - the same prompt can produce different outputs on every call. A single snapshot confuses random variation with real signal.

We run each query N=5 times per AI engine and report bootstrap 95% confidence intervals - so you see not just the point estimate, but the uncertainty range. See also: arXiv:2604.07585 - “Don't Measure Once”, which establishes N=5 as the practical minimum for stable LLM visibility estimates.
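As an illustration of the idea, here is a minimal sketch of a percentile bootstrap over five runs of one prompt on one engine. The function name `bootstrap_ci`, the resample count, and the choice of the percentile variant are assumptions made for the example; the text above only states that bootstrap 95% confidence intervals are reported, not which variant is used or how runs are pooled across prompts.

```python
import random

def bootstrap_ci(samples, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of `samples` (illustrative)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(samples)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(samples, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    point = sum(samples) / n
    return point, lo, hi

# Five runs of the same prompt: 1 = brand mentioned, 0 = not mentioned.
runs = [1, 0, 1, 1, 0]
point, lo, hi = bootstrap_ci(runs)
print(f"selection rate {point:.0%}, 95% CI [{lo:.0%}, {hi:.0%}]")
```

With only five binary outcomes the interval is necessarily wide, which is the point: it shows how much a single run could mislead.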

The four metrics we track

Metric definitions from arXiv:2604.25707 (per-engine citation divergence + selection rate):

Selection Rate

% of prompts where the AI mentions your brand in its response. The primary visibility signal.

AI Share of Voice

Your mention count divided by total brand mentions across you + competitors. Competitive positioning.

Citation Rate

% of responses where your brand is cited with a URL reference, not just mentioned by name.

Discovery Gap

Difference between your branded query recognition rate and your open-discovery appearance rate.
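The four definitions above can be sketched as plain ratios. This is an illustration only: the `Response` record, its field names, and the "branded" / "discovery" labels are hypothetical, and the sketch counts each response as one unit, since how repeated runs of the same prompt are aggregated is not specified here.

```python
from dataclasses import dataclass, field

@dataclass
class Response:
    query_type: str                      # "branded" or "discovery" (hypothetical labels)
    brands_mentioned: list[str]          # brands named in the answer text
    brands_cited: list[str] = field(default_factory=list)  # brands cited with a URL

def rate(responses, predicate):
    """Percentage of responses for which `predicate` holds."""
    if not responses:
        return 0.0
    return 100 * sum(1 for r in responses if predicate(r)) / len(responses)

def selection_rate(responses, brand):
    return rate(responses, lambda r: brand in r.brands_mentioned)

def citation_rate(responses, brand):
    return rate(responses, lambda r: brand in r.brands_cited)

def share_of_voice(responses, brand, competitors):
    """Own mentions divided by mentions of the brand plus tracked competitors."""
    tracked = {brand, *competitors}
    mentions = [b for r in responses for b in r.brands_mentioned if b in tracked]
    return 100 * mentions.count(brand) / len(mentions) if mentions else 0.0

def discovery_gap(responses, brand):
    """Branded-query recognition rate minus open-discovery appearance rate."""
    branded = [r for r in responses if r.query_type == "branded"]
    discovery = [r for r in responses if r.query_type == "discovery"]
    return selection_rate(branded, brand) - selection_rate(discovery, brand)

# Made-up sample data with placeholder brand names.
responses = [
    Response("branded", ["Acme"], ["Acme"]),
    Response("branded", ["Acme", "Globex"]),
    Response("discovery", ["Globex"]),
    Response("discovery", ["Acme", "Globex"], ["Globex"]),
]
print(selection_rate(responses, "Acme"), discovery_gap(responses, "Acme"))
```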

Vertical-specific scoring

The 8 KPIs combine into the composite under a vertical-specific weighting calibrated to how buyers actually discover that kind of business. A SaaS company is evaluated against named alternatives, so comparison and share of voice carry more weight; a publisher monetises referral traffic, so being cited matters more than being named. The weighting is normalised so scores stay directly comparable on the 0 to 100 scale.

Current score model: v2. Scores rendered under earlier models are tagged with the model they were computed under and are not silently re-priced. See your report header for the model badge.

Hotels

Discovery and share of voice dominate; open-ended "best hotel in city" prompts win the booking.

SaaS

Comparison and share of voice dominate; "X vs Y" buyer prompts decide the shortlist.

Ecommerce

Share of voice and product-level comparison dominate the cart-winning slot.

Publishers

Citation rate and branded recognition lead; being linked matters more than being named.
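To make the normalisation concrete, here is a rough sketch of a vertical-weighted composite. The weights below are placeholders invented for the example (3 for the KPIs a vertical is described as emphasising, 1 otherwise), and the five KPI names are taken from the descriptions above rather than from the full KPI list; the actual v2 weights are not published here.

```python
# Hypothetical raw weights per vertical; NOT the real score model.
RAW_WEIGHTS = {
    "hotels":     {"discovery": 3, "share_of_voice": 3, "comparison": 1,
                   "citation_rate": 1, "branded_recognition": 1},
    "saas":       {"discovery": 1, "share_of_voice": 3, "comparison": 3,
                   "citation_rate": 1, "branded_recognition": 1},
    "ecommerce":  {"discovery": 1, "share_of_voice": 3, "comparison": 3,
                   "citation_rate": 1, "branded_recognition": 1},
    "publishers": {"discovery": 1, "share_of_voice": 1, "comparison": 1,
                   "citation_rate": 3, "branded_recognition": 3},
}

def composite(kpis, vertical):
    """Weighted average of 0-100 KPI scores, rounded to a 0-100 integer."""
    raw = RAW_WEIGHTS[vertical]
    total = sum(raw.values())  # dividing by the total normalises the weights
    return round(sum(kpis[name] * w for name, w in raw.items()) / total)

# Made-up KPI scores: strong on comparison and share of voice, weak elsewhere.
kpis = {"discovery": 20, "share_of_voice": 80, "comparison": 80,
        "citation_rate": 20, "branded_recognition": 20}
for vertical in RAW_WEIGHTS:
    print(vertical, composite(kpis, vertical))
```

Because each vertical's weights are divided by their own total, a brand scoring 100 on every KPI gets 100 under any vertical, which is what keeps the scale comparable.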

The discovery gap

Research (arXiv:2601.00912) found a stark asymmetry in how AI assistants handle branded vs. open queries:

99.4%

Named-query recognition

“Tell me about [Brand]”

3.32%

Discovery surfacing rate

“Best tools for [category]”

AI assistants are reliable at answering questions about your brand, but rarely surface you unprompted in category discovery: in the cited study, the difference between the two rates is roughly 96 percentage points. The Agentic Commerce Readiness Score focuses on closing this discovery gap.

Agent Readiness

On top of measuring what AI engines say about your brand, we check whether your domain is set up to be read by AI agents in the first place. Each scan fetches five public-web signals from your site. Every signal is weighted equally (0.20) and the composite is rounded to a 0–100 integer. Each fetch has its own 3-second timeout; a timeout, network error, or non-200 response counts as “absent” rather than failing the report.
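The scoring rule described above can be sketched as follows. The signal keys and function names are hypothetical; the equal 0.20 weights, the 3-second timeout, and the "absent rather than failed" behaviour come from the description. The fetch helper uses only the Python standard library and follows redirects by default, which may or may not match the production scanner.

```python
import http.client
import urllib.request

SIGNALS = ("llms_txt", "robots_ai_directives", "json_ld", "mcp_manifest", "sitemap")
WEIGHT = 0.20  # every signal is weighted equally

def fetch_body(url, timeout=3.0):
    """Return the body of a 200 response, or None if the signal counts as absent."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read() if resp.status == 200 else None
    except (OSError, http.client.HTTPException):
        # Timeouts, network errors and HTTP error statuses all land here.
        return None

def agent_readiness(present):
    """Composite of the five signals, rounded to a 0-100 integer."""
    return round(100 * sum(WEIGHT for s in SIGNALS if present.get(s, False)))

# Made-up scan result: three of five signals found.
scan = {"llms_txt": False, "robots_ai_directives": True, "json_ld": True,
        "mcp_manifest": False, "sitemap": True}
print(agent_readiness(scan))
```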

llms.txt

GET /llms.txt returns 200 with a non-empty body. llms.txt is an emerging site-level instructions file that AI coding agents already read.

AI bot directives in robots.txt

robots.txt names at least one of GPTBot, CCBot, ClaudeBot, anthropic-ai, PerplexityBot, or Google-Extended as a User-agent. Plain User-agent: * does NOT count.

Schema.org JSON-LD

Your homepage HTML contains a <script type="application/ld+json"> block with a schema.org @type. This gives engines machine-readable entity facts they can merge.

MCP manifest

GET /.well-known/mcp.json returns 200 and parses as JSON. This is the emerging convention for declaring agent endpoints.

Sitemap

GET /sitemap.xml returns 200 with a non-empty body. The sitemap is the canonical list of pages that every major crawler honours.
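Three of the checks listed above depend on the content of the response rather than just its status. A minimal sketch of how such checks might look, operating on text that has already been fetched, is shown below. The function names are hypothetical, the regular expression is a simplification rather than a full HTML parser, and the JSON-LD check only looks for a top-level @type (it does not descend into @graph).

```python
import json
import re

AI_BOTS = {"gptbot", "ccbot", "claudebot", "anthropic-ai",
           "perplexitybot", "google-extended"}

def names_ai_bot(robots_txt):
    """True if any User-agent line names a listed AI bot; '*' does not count."""
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0]              # drop trailing comments
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "user-agent":
            if value.strip().lower() in AI_BOTS:
                return True
    return False

JSON_LD_RE = re.compile(
    r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.IGNORECASE | re.DOTALL,
)

def has_json_ld_type(html):
    """True if any JSON-LD script block parses and carries a top-level @type."""
    for block in JSON_LD_RE.findall(html):
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue                               # malformed block: skip it
        items = data if isinstance(data, list) else [data]
        if any(isinstance(item, dict) and "@type" in item for item in items):
            return True
    return False

def parses_as_json(body):
    """True if `body` is valid JSON, as required for the MCP manifest check."""
    try:
        json.loads(body)
        return True
    except json.JSONDecodeError:
        return False

print(names_ai_bot("User-agent: GPTBot\nDisallow: /private"))
```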

Why we measure all 4 engines

Different AI engines exhibit meaningfully different citation behaviour. ChatGPT tends to cite fewer sources per response but exercises deeper influence on each cited page. Perplexity surfaces more sources per response, but the per-source influence is shallower. Gemini and Claude each follow retrieval patterns of their own. No single engine represents your overall AI visibility - you need all four.

Our consistency score penalises high variance across engines, rewarding brands that are visible everywhere rather than just dominant on one platform.
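The exact formula is not given here, so the following is only one plausible shape for such a score, offered as an assumption: it scales the population standard deviation of per-engine rates against its maximum possible value for rates between 0 and 100 (which is 50) and subtracts that from a perfect score.

```python
from statistics import pstdev

MAX_STDEV = 50.0  # largest possible population stdev for values in [0, 100]

def consistency_score(engine_rates):
    """100 when all engines agree; lower as per-engine rates spread apart.

    Hypothetical formula, not the production one.
    """
    values = list(engine_rates.values())
    return round(100 * (1 - pstdev(values) / MAX_STDEV))

# Made-up selection rates (in %) per engine.
even = {"ChatGPT": 40, "Perplexity": 40, "Gemini": 40, "Claude": 40}
lopsided = {"ChatGPT": 100, "Perplexity": 100, "Gemini": 0, "Claude": 0}
print(consistency_score(even), consistency_score(lopsided))
```

Under this sketch, two brands with the same average rate score differently if one of them is visible on only some of the engines.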

How we compare

Confidence intervals are the uncertainty range you get from asking the same question repeatedly - competitors' single-run snapshots give you one point estimate with no idea how much it would move on the next run.

Tool	Sampling	Confidence intervals	Engines	Rigour
Semrush AI Toolkit	Single snapshot ($99 add-on)	None	ChatGPT, Perplexity, Gemini	Point estimate only
Superlines	Single daily snapshot	None	ChatGPT, Perplexity, Gemini	Point estimate only
Evertune	Single daily snapshot	None	Not disclosed	Point estimate only
AthenaHQ	Single daily snapshot	None	Varies by plan	Point estimate only
Bluerails Discovery	N=5 per prompt per engine	Bootstrap 95% CI	ChatGPT, Perplexity, Gemini, Claude	arXiv-cited bootstrap methodology

Primary research cited

arXiv:2603.08924 - Repeated sampling and statistical confidence in LLM visibility measurement
arXiv:2604.07585 - "Don't Measure Once": N=5 minimum for stable LLM visibility estimates
arXiv:2604.25707 - Selection rate definition and per-engine citation divergence
arXiv:2601.00912 - 99.4% named-query recognition vs 3.32% discovery surfacing in AI responses