AI Agents

Your agent talks to your customers.
Do you know how it sounds?

Your AI is representing your brand in every interaction — writing emails to leads, responding to support tickets, answering phone calls. You're probably evaluating it by reading logs and hoping for the best. Candor lets you put real people in front of your agent's outputs and get actual signal on tone, trust, and clarity.

The Problem

You're flying blind on the thing that matters most

You've never tested this with real people

You read the logs and they look fine — but you built the thing. You have no idea how a stranger perceives the tone, clarity, or trustworthiness of what your agent writes.

Your feedback loop is lagging

You find out an agent response was bad when a customer churns, a lead ghosts, or a support ticket escalates. By then you've sent that same bad response to hundreds of people.

Vibes don't scale

You and your cofounder read through outputs and gut-check them. That works at 10 conversations a day. It doesn't work at 10,000.

Use Cases

Three ways to eval your agent with real people

Which response variant converts?

Your agent can respond to an inbound lead five different ways. Which one makes a real person most likely to reply, book, or buy? Run a pairwise comparison across response variants with real people who match your customer profile. Get a ranked leaderboard with win rates.

Participant view (pairwise choice: A / Tie / B)

“Which of these two responses would make you more likely to book an appointment?”

How you'd run it
$ claude "compare these 5 lead response variants —
which one would make you most likely to book
an appointment?"
What you get back
Ranked by win rate (30 participants, pairwise):
#1 Variant C — "friendly + specific" 72% win rate
#2 Variant A — "short + direct" 61% win rate
#3 Variant E — "formal + detailed" 48% win rate
#4 Variant B — "casual + emoji" 35% win rate
#5 Variant D — "templated" 24% win rate
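Candor computes the leaderboard for you, but the win-rate math is easy to check. A minimal sketch with made-up vote data (the variant names and the tie-counts-as-half-a-win convention are assumptions, not Candor's exact scoring):

```python
from collections import defaultdict

# Hypothetical pairwise votes: (left variant, right variant, choice).
# choice is "left", "right", or "tie", mirroring the A / Tie / B buttons.
votes = [
    ("C", "A", "left"),
    ("C", "A", "left"),
    ("A", "B", "left"),
    ("C", "B", "tie"),
    ("B", "C", "right"),
]

def win_rates(votes):
    """Win rate per variant: wins / comparisons, ties worth half a win."""
    wins = defaultdict(float)
    total = defaultdict(int)
    for left, right, choice in votes:
        total[left] += 1
        total[right] += 1
        if choice == "left":
            wins[left] += 1
        elif choice == "right":
            wins[right] += 1
        else:  # tie: half a win to each side
            wins[left] += 0.5
            wins[right] += 0.5
    return {v: wins[v] / total[v] for v in total}

leaderboard = sorted(win_rates(votes).items(), key=lambda kv: -kv[1])
for rank, (variant, rate) in enumerate(leaderboard, start=1):
    print(f"#{rank} Variant {variant} {rate:.0%} win rate")
```

With enough participants per pair, these rates stabilize into the kind of ranked leaderboard shown above.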

Does your agent sound trustworthy?

Have real people rate your agent's outputs on trust, professionalism, clarity, and helpfulness. Get per-dimension scores with standard deviations so you know where you're strong and where the AI sounds off.

Participant view (1–5 rating scale)

“Rate this response on trust, professionalism, clarity, and helpfulness.”

How you'd run it
$ claude "have 20 people rate these agent responses
on trust and professionalism"
What you get back
Ratings across 20 participants (1-5 scale):
Trust 3.8 ±0.6
Professionalism 4.2 ±0.4
Clarity 4.5 ±0.3
Helpfulness 3.4 ±0.9 ← high variance
Finding: helpfulness scores diverge — 6 participants
rated it 2/5 or below, citing "generic" and "didn't
answer my actual question."
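The ± figures above are just the mean and sample standard deviation of the 1–5 ratings for each dimension. A sketch with made-up ratings (the data and the 0.8 variance threshold are assumptions for illustration):

```python
from statistics import mean, stdev

# Hypothetical 1-5 ratings from a small panel, keyed by dimension.
ratings = {
    "Trust":       [4, 4, 3, 5, 4, 3],
    "Helpfulness": [5, 2, 4, 1, 5, 3],
}

for dim, scores in ratings.items():
    m, s = mean(scores), stdev(scores)
    # Flag dimensions where participants disagree sharply.
    flag = " <- high variance" if s > 0.8 else ""
    print(f"{dim:<14} {m:.1f} ±{s:.1f}{flag}")
```

A high standard deviation is the signal to dig into individual responses, as in the helpfulness finding above: the average hides a split panel.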

Watch real users interact with your agent

Put real people in front of your agent experience. An AI voice moderator observes and asks follow-up questions as they interact — probing on moments of confusion, trust breakdown, or friction. Get transcripts and themes, not just satisfaction scores.

Participant view (browsing the agent in a live voice session with an AI moderator)

“Talk me through what just happened — did you expect that response from the agent?”

How you'd run it
$ claude "run a 5-person usability test of our
customer-facing chat agent with voice interviews"
What you get back
Themes across 5 sessions:
Trust breakdown on pricing (4/5 sessions)
Participants hesitated when the agent quoted a price
without explaining how it was calculated.
Tone mismatch after complaints (3/5 sessions)
Agent stayed upbeat when participants expressed
frustration — read as dismissive, not helpful.
Positive: fast response time (5/5 sessions)
Every participant noted the agent felt responsive.

Why It Matters

Agent companies have a unique eval problem

Traditional A/B testing requires massive traffic and measures lagging indicators — conversion, churn, escalation rates. Human eval gives you leading indicators: you know a response sounds wrong before it costs you a customer. For agent companies where every output is a customer touchpoint, this isn't nice-to-have research — it's QA.

Run your first agent eval

Describe what you want to learn. Candor handles the rest.

$ claude "compare our 3 support response templates — which one do customers trust most?"
$ curl -fsSL https://candor.sh | bash
Or talk to us about setting up your first agent eval →