How we measure AI visibility.
A short whitepaper on prompt sourcing, sampling, statistical thresholds, and how we handle model drift. Read this before deciding to buy or build.
Document version: May 5, 2026 · v1.2. Reproducibility code & validation dataset available under NDA on request.
0. Glossary
Three terms appear repeatedly. Defining them up front prevents most misreadings.
Citation rate. The proportion of independent runs of a prompt, on a given engine, in which a target brand is named at least once. Reported as a percentage with a 95% Wilson confidence interval. Not a measure of how often a real user sees the brand — engines do not expose impression data.
Drift. A statistically significant change over time in either (a) citation rate (quantitative drift) or (b) the descriptor language used about the brand when cited (qualitative drift). Quantitative drift via two-proportion z-test against a 7-day rolling baseline; qualitative drift via cosine distance against a customer-curated reference embedding.
Canary panel. A fixed set of 50 brand-agnostic reference prompts (trivia, definitions, factual lookups) queried every 6 hours across all engines. Used to detect silent vendor-side model updates independently of any customer signal. When canary embeddings drift, we flag a "model event" and rebaseline customer trends — preventing a vendor patch from being misread as brand-side movement.
1. Why methodology matters
Three properties of LLMs make naïve "monitoring" unreliable:
- Non-determinism. Same prompt → different answers across runs, even at temperature 0 (batching and vLLM order effects; sampler stochasticity at higher temperatures).
- Silent model updates. Vendors push minor model changes without bumping the public version string. Outputs can shift overnight without notice.
- Persona sensitivity. "What's the best CRM?" returns different answers depending on system-prompt persona context (founder vs enterprise IT vs SMB).
A spreadsheet that queries each engine once a week measures noise. A statistically valid trend signal requires adequate sample size, replication, and explicit drift accounting. That is what BrandMirror provides.
2. Prompt sourcing
We do not write prompts in a vacuum. The 340 prompts in a Pilot subscription come from three sources:
- Search-volume mining. We start from the keyword universe of your category (Ahrefs, Semrush, Google Search Console export). High-volume buyer-intent queries are converted into natural-language prompts.
- Buyer interview synthesis. During onboarding we synthesize 10–15 prompts from your sales-call recordings or buyer interviews — the actual phrasing your buyers use.
- Competitor-backed comparison prompts. "Best X for Y", "alternative to Z", "X vs Y vs Z" — derived from your top 5 competitors' brand searches.
Each prompt is tagged with: category, buyer-stage, persona, and engine-suitability. You see and approve the full set before tracking begins.
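As a minimal sketch of what a tagged, approved prompt record might look like — the field names and engine identifiers below are illustrative, not our production schema:

```python
from dataclasses import dataclass

@dataclass
class TrackedPrompt:
    """One approved prompt in the tracking set (illustrative schema)."""
    text: str            # natural-language prompt
    category: str        # product category, e.g. "crm"
    buyer_stage: str     # e.g. "awareness" | "evaluation" | "decision"
    persona: str         # e.g. "founder" | "enterprise-it" | "smb"
    engines: list[str]   # engines the prompt is suitable for

prompt = TrackedPrompt(
    text="What is the best CRM for a 10-person startup?",
    category="crm",
    buyer_stage="evaluation",
    persona="founder",
    engines=["chatgpt", "claude", "gemini", "perplexity"],
)
```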
3. Sampling design
Each prompt is queried multiple times per cycle to control for non-determinism:
- n = 8 runs per prompt per engine per day, distributed over the 24-hour window.
- Total daily queries on a Pilot plan: 340 prompts × 4 engines × 8 runs = 10,880 queries/day.
- Temperature: fixed at 0.7 (closer to default user experience than 0).
- Persona system prompts: rotated across 3 buyer-persona variants to capture answer-space breadth.
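A rough sketch of the sampling arithmetic and run scheduling, using the Pilot numbers above; the jitter amount and persona labels are assumptions for illustration, not our exact scheduler:

```python
import random

PROMPTS = 340                                   # Pilot prompt set
ENGINES = 4
RUNS_PER_DAY = 8                                # n = 8 runs per prompt per engine
TEMPERATURE = 0.7
PERSONAS = ["founder", "enterprise-it", "smb"]  # illustrative persona labels

daily_queries = PROMPTS * ENGINES * RUNS_PER_DAY
print(f"{daily_queries} queries/day")           # 10880 queries/day

def schedule_runs(runs: int = RUNS_PER_DAY, jitter_min: float = 45.0) -> list[dict]:
    """Spread one prompt's runs across the 24-hour window and rotate personas."""
    spacing = 24 * 60 / runs                    # minutes between runs
    return [
        {
            "minute_of_day": i * spacing + random.uniform(-jitter_min, jitter_min),
            "persona": PERSONAS[i % len(PERSONAS)],
            "temperature": TEMPERATURE,
        }
        for i in range(runs)
    ]
```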
4. Citation rate as a frequency
We do not report "your brand is/isn't cited". We report citation rate as a percentage of runs — e.g. "Acme is mentioned in 68% of runs for prompt P". This is statistically tractable:
- Standard error: SE = √(p(1−p)/n). For n=8, p=0.68 → SE ≈ 0.16. We aggregate across multiple prompts in a category to tighten the CI.
- Reported with 95% confidence intervals on the dashboard. A "68%" headline is an interval like [57%, 78%].
- Weekly deltas: we use a two-proportion z-test on the pooled standard error of the difference. Alerts fire only when |Δp| / SE > 2.0, i.e. statistically significant at α ≈ 0.05.
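A minimal sketch of the statistics described above — the standard Wilson score interval and a pooled two-proportion z-test; the thresholds mirror the numbers in this section, and the function names and example counts are ours for illustration:

```python
from math import sqrt

Z = 1.96  # two-sided 95%

def wilson_ci(k: int, n: int, z: float = Z) -> tuple[float, float]:
    """95% Wilson score interval for k citations observed in n runs."""
    p = k / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

def week_over_week_alert(k1: int, n1: int, k2: int, n2: int, z_crit: float = 2.0) -> bool:
    """Two-proportion z-test on the pooled SE; alert only if |z| > ~2 (alpha ≈ 0.05)."""
    p1, p2 = k1 / n1, k2 / n2
    p_pool = (k1 + k2) / (n1 + n2)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return abs(p1 - p2) / se > z_crit if se > 0 else False

# Example: 68% of 224 aggregated runs this week vs 54% last week
print(wilson_ci(152, 224))                       # roughly (0.61, 0.74)
print(week_over_week_alert(152, 224, 121, 224))  # True: the change is significant
```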
5. Handling silent model updates
To detect when a model has shifted under us — independently of customer-specific drift — we run a fixed canary panel:
- 50 reference prompts (public domain: trivia, definitions, factual lookups) queried every 6 hours across all engines.
- Canary outputs are embedded with text-embedding-3-large and the cosine similarity to the prior 7-day baseline is tracked.
- When canary similarity drops below 0.92, we flag a "model event" on the customer dashboard and adjust trend-baselines automatically — so a vendor model change doesn't trigger spurious customer alerts.
Events are published to our public model-event log with a timestamp and signature.
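A sketch of the canary check under stated assumptions: text-embedding-3-large is the embedding model named above, but the baseline construction (a per-prompt mean over the prior 7 days) and the helper names are placeholders for illustration:

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.92  # below this, flag a model event and rebaseline

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def check_canary(current_embeddings: list[np.ndarray],
                 baseline_embeddings: list[np.ndarray]) -> bool:
    """Compare today's canary answers to the 7-day baseline, prompt by prompt.

    Each element is the embedding of one canary prompt's answer; the baseline
    element is the mean embedding of that prompt's answers over the prior 7 days.
    Returns True if a model event should be flagged for this engine.
    """
    sims = [cosine_similarity(cur, base)
            for cur, base in zip(current_embeddings, baseline_embeddings)]
    return float(np.mean(sims)) < SIMILARITY_THRESHOLD
```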
6. Position and ordinal tracking
When a brand is cited, we track where. Position metrics:
- Ordinal rank within the answer (1st, 2nd, 3rd brand named).
- Token offset (early in answer vs buried after > 200 tokens).
- Sentiment of the surrounding sentence (3-class: positive / neutral / negative) using a fine-tuned classifier validated at 89% accuracy on our hand-labeled benchmark.
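A rough sketch of how ordinal rank and token offset could be extracted from a single answer; the whitespace tokenisation and substring brand matching here are deliberately naive and only for illustration:

```python
def position_metrics(answer: str, brands: list[str], target: str) -> dict | None:
    """Ordinal rank of the target among named brands, plus its token offset."""
    tokens = answer.split()  # naive whitespace tokenisation, illustration only
    first_token = {}
    for brand in brands:
        for i, tok in enumerate(tokens):
            if brand.lower() in tok.lower():
                first_token[brand] = i
                break
    if target not in first_token:
        return None  # target not cited in this run
    ranked = sorted(first_token, key=first_token.get)
    return {
        "ordinal_rank": ranked.index(target) + 1,   # 1st, 2nd, 3rd brand named
        "token_offset": first_token[target],        # early vs buried
        "buried": first_token[target] > 200,        # > 200 tokens in
    }
```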
7. Drift detection
Beyond citation rate, we track positioning drift: how the brand is described.
- For each citation, we extract the descriptor span (the sentence or clause about the brand).
- We embed the span and compare to a customer-curated reference embedding ("how we want to be described").
- Cosine distance growing beyond a 7-day moving baseline triggers a Drift Alert with the offending phrasing surfaced verbatim.
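A sketch of the descriptor-drift check, assuming the descriptor span has already been extracted and embedded; the moving-baseline logic is simplified and the two-sigma alert threshold is illustrative, not our tuned value:

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def descriptor_drift_alert(span_embedding: np.ndarray,
                           reference_embedding: np.ndarray,
                           baseline_distances_7d: list[float],
                           sigma: float = 2.0) -> bool:
    """Alert when today's distance to the reference exceeds the 7-day baseline
    by more than `sigma` standard deviations."""
    d = cosine_distance(span_embedding, reference_embedding)
    mu = float(np.mean(baseline_distances_7d))
    sd = float(np.std(baseline_distances_7d)) or 1e-9
    return (d - mu) / sd > sigma
```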
8. What we do not claim
To be honest about limits:
- We do not measure the impressions of AI answers (no engine exposes this; nobody can claim this honestly).
- We do not attribute pipeline directly to AI citations. We provide a leading indicator; revenue attribution requires UTM-tagged onward clicks where the engine supports them.
- Google AI Overviews exposure is partial: we sample via a public-search proxy, with regional limits. We disclose coverage levels per customer at onboarding.
9. Reproducibility
The full sampling code, the canary panel definitions, the sentiment classifier benchmark, and our weekly report-generation pipeline are documented in an internal whitepaper available under NDA to qualified prospects. Email hello@brandmirror.dev.
10. Build vs. buy: a calculator
The honest version of the question. Engineering costs assume one staff-level engineer at a fully-loaded $18,000/month (US blended rate, including benefits and overhead).
One-time build cost (v1)
- Prompt sourcing pipeline + tagging UI: 1 engineer-week
- Multi-engine query orchestrator with retries, rate limits, vendor SDK churn: 2 engineer-weeks
- Statistical aggregation layer (CIs, two-proportion z-tests, alert thresholds): 1 engineer-week
- Canary panel + silent-update detection: 1 engineer-week
- Sentiment classifier training + benchmark labeling (~2,000 hand-labels): 2 engineer-weeks
- Dashboard, alerting, basic reporting: 2 engineer-weeks
- Total v1: ~9 engineer-weeks ≈ $40,500 in labor
Monthly run cost
- LLM API spend (340 prompts × 4 engines × 8 runs × 30 days at blended vendor rates): ~$2,800/mo
- Maintenance (vendor API breakage, canary retuning, sentiment classifier drift, prompt-set refresh): conservatively 0.5 engineer-week/mo ≈ $2,250/mo
- Infra (compute, vector DB, alerting, on-call): ~$400/mo
- Total run cost: ~$5,450/mo
Payback math against a $1,200/mo Pilot subscription
- Month 1: build + run − buy = $40,500 + $5,450 − $1,200 = $44,750 in the hole
- Per-month delta after build: $5,450 − $1,200 = $4,250/month worse
- The build never pays back at this price point. The math only turns positive if the subscription you are avoiding costs more than the ~$5,450/mo run cost, and even then the upfront ~$40,500 still has to be amortized or justified by a strategic reason to own the build.
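The same arithmetic as a small script, using the figures above; it only compares cumulative cost of building versus buying, nothing else:

```python
BUILD_COST = 40_500            # ~9 engineer-weeks at ~$4,500/week fully loaded
RUN_COST_MONTHLY = 5_450       # API spend + maintenance + infra
SUBSCRIPTION_MONTHLY = 1_200   # Pilot plan

def cumulative_delta(months: int) -> int:
    """Positive = building has cost more than buying after this many months."""
    build_total = BUILD_COST + RUN_COST_MONTHLY * months
    buy_total = SUBSCRIPTION_MONTHLY * months
    return build_total - buy_total

for m in (1, 12, 36):
    print(m, cumulative_delta(m))
# Month 1: 44,750 in the hole; the gap then grows by $4,250 every month.
```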
This is why most of our customers — even the ones with engineering teams that could build this — buy. The build is interesting; the maintenance is not. We've already eaten that across hundreds of customers.
If you have a strategic reason to own the IP (regulatory, sovereignty, a research thesis, a horizontal product play), that math doesn't care about subscription pricing — and we'll happily share our methodology under NDA so you don't repeat our wrong turns.
11. Open peer review
This methodology is not finished. It will not be finished. Three things make us nervous and we'd like more eyes on them:
- The n=8 sample size per prompt per day is a budget compromise. Standard error at p=0.5, n=8 is 0.18. We tighten by aggregating across prompts, but a per-prompt headline number is wider than we'd prefer. We're piloting n=16 on Growth-tier accounts.
- Sentiment classification at 89% accuracy means 11% of drift alerts are sentiment misreads. We surface the verbatim sentence so customers can sanity-check, but a quieter classifier — even at lower coverage — might be a better default.
- Our canary similarity threshold (cosine similarity 0.92) was tuned mostly on GPT-4 class models. We don't yet know whether it generalizes to reasoning models with longer chain-of-thought outputs. Early signal: it overfires on o-series models. We're recalibrating.
If you're an engineer, statistician, or applied researcher who wants to challenge any of this, email hello@brandmirror.dev with subject "methodology". Substantive critiques that change the methodology get credited — by name, with permission — in the next published revision.
If you'd rather build this yourself: see §10 for the math. Most customers conclude the service is cheaper than the maintenance. If you decide to build anyway, we'll happily share our wrong turns under NDA.