For Founders

You're shipping an AI product.
Do you know if it works?

Candor puts real people in front of what your AI produces — and tells you what's working, what's broken, and what to fix next. No eval infrastructure to build. One command from your terminal.

Terminal → Candor → Results

Every evaluation on Candor is completed by a real person. Not an LLM. Not a synthetic label. Human judgment.

The Problem

You're guessing whether your product works

👤

You're evaluating your own work

You and your cofounder test your AI by using it yourselves. But you built it — you know the right answers, you interpret ambiguous outputs charitably, you unconsciously avoid the edge cases. You need strangers.

🎯

You don't know what “good” means yet

Is your output accurate? Trustworthy? Clear? Useful? You have a gut sense but no data. Ten real people rating your outputs on a 1-5 scale will tell you more in an afternoon than a month of internal debate.

🔄

You're guessing what to improve

Is the problem your AI's accuracy, your UX, your copy, your onboarding? Without structured feedback from real users, every sprint is a coin flip on what to work on next.

Use Cases

Four questions, one command each

Is your AI's output correct?

Have real people — or domain experts like lawyers, clinicians, or analysts — judge whether what your AI produces is accurate. Works for any output type: text, images, audio, decisions, recommendations. Get per-item accuracy scores so you know exactly which outputs are strong and which are failing.

Participant view
1 · 2 · 3 · 4 · 5
1 = mostly wrong · 5 = completely accurate

“How accurate is this AI-generated output?”

How you'd run it
$ claude "have 10 people rate whether these AI responses are accurate"
What you get back
Accuracy ratings across 10 human evaluators:
output_001 4.6 ±0.5 accurate
output_002 4.2 ±0.7 accurate
output_003 1.8 ±0.9 ← failing, reviewers flagged
"incorrect conclusion"
output_004 4.4 ±0.4 accurate
output_005 2.3 ±1.1 ← high variance, ambiguous
2 outputs need immediate attention.
Results written to accuracy_eval.json
Learn more about Rating Scale →
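
Curious how those per-item numbers come together? Each score is just the mean and spread of the raw 1-5 ratings for that output. A minimal Python sketch, using made-up ratings rather than Candor's actual output format:

from statistics import mean, stdev

# Made-up ratings from 10 evaluators per output, chosen to mirror the
# sample report above (the real accuracy_eval.json schema may differ).
ratings = {
    "output_001": [5, 4, 5, 5, 4, 5, 4, 5, 4, 5],
    "output_003": [1, 1, 1, 1, 1, 2, 2, 3, 3, 3],
}

for output_id, scores in ratings.items():
    avg, spread = mean(scores), stdev(scores)
    flag = ""
    if avg < 3.0:
        flag = "← failing"          # most evaluators rated it wrong
    elif spread > 1.0:
        flag = "← high variance"    # evaluators disagree; likely ambiguous
    print(f"{output_id}  {avg:.1f} ±{spread:.1f} {flag}")

A low mean with low variance means evaluators agree the output is wrong; a middling mean with high variance usually means the output is ambiguous and worth a closer look.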

Which version is better?

You have two approaches, two prompts, two models, two designs. Stop debating internally. Put both in front of real people and get a ranked winner with agreement metrics. Pairwise comparison — the simplest, most decisive eval you can run.

Participant view
A vs B · Tie

“Which output is better? Consider accuracy, clarity, and usefulness.”

How you'd run it
$ claude "compare version A and version B — which do users prefer?"
What you get back
Preference results (20 human evaluators):
Version A wins 65%
Version B wins 25%
Tie 10%
Agreement: strong (α = 0.72)
Top reason for A: "clearer and more specific"
Top reason against B: "felt generic, not tailored"
Results written to comparison.json
Learn more about Pairwise Comparison →
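
The percentages are a straight tally of votes. The same arithmetic as a quick Python sketch, with made-up votes standing in for Candor's results:

from collections import Counter

# Made-up votes from 20 evaluators: each picked A, B, or Tie.
votes = ["A"] * 13 + ["B"] * 5 + ["Tie"] * 2

counts = Counter(votes)
for choice in ("A", "B", "Tie"):
    print(f"{choice:>3}  {counts[choice] / len(votes):.0%}")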

Do users trust it?

Your AI might be technically correct but feel wrong to users. Have real people interact with your product while an AI voice moderator asks questions about trust, clarity, and confidence. Voice interview sessions that surface the “I don't believe this” moments you can't find in analytics.

Participant view
LIVE SESSION
your-product.com/results
Moderator: Talk me through what you just saw — did you trust that result?
Participant: It sounded confident but I'm not sure it's right—
Moderator: What would make you more confident?
AI Moderator · 1:52
How you'd run it
$ claude "run 5 voice interview sessions testing our product with real users"
What you get back
Themes across 5 human sessions:
Trust breaks on confident-sounding errors (4/5)
When the AI was wrong but stated it confidently,
participants said they would have acted on it
without checking.
Users want to see sources (5/5)
Every participant asked "where did this come from?"
at least once during the session.
First impression is strong (4/5)
Participants described the product as "fast" and
"impressive" in the first 30 seconds.
Transcripts: study/trust-eval/transcripts
Learn more about Voice Interviews →

What's confusing?

Show your product to 15 people and collect open-ended reactions. What do they think it does? Where do they get stuck? What surprises them? Free text feedback that tells you what your landing page, onboarding, or core experience actually communicates vs. what you intended.

Participant view
Free text

“After looking at this, what do you think this product does? What's clear and what's confusing?”

How you'd run it
$ claude "show 15 people our landing page and ask what's clear and what's confusing"
What you get back
Themes across 15 human evaluators:
Clear: core value prop (12/15)
Most people accurately described what the product
does within 10 seconds.
Confusing: pricing (9/15)
"I couldn't tell if there's a free tier or not."
Confusing: AI vs human involvement (7/15)
"I wasn't sure if a real person is ever involved
or if it's fully automated."
Surprising: speed (11/15)
"I didn't expect results that fast" — positive signal
on time-to-value.
Results written to landing_page_feedback.json
Learn more about Free Text →

Under the hood, every Candor study uses randomized ordering to prevent bias, attention checks to filter disengaged participants, and inter-rater agreement metrics so you know whether your evaluators agree. You don't need to configure any of this — it's built into every study automatically.
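
If you want to see what an agreement number like the α = 0.72 above actually measures, here's a from-scratch Python sketch of one standard statistic, Krippendorff's alpha, for categorical judgments such as A / B / Tie votes. It uses made-up data and illustrates the statistic itself, not Candor's internal implementation; 1-5 ratings use an ordinal-weighted variant of the same idea.

from collections import Counter
from itertools import permutations

def nominal_alpha(units):
    """units: one label list per item, each rated by the same number of evaluators."""
    m = len(units[0])                    # evaluators per item
    n = sum(len(u) for u in units)       # total pairable judgments
    # Observed disagreement: mismatched pairs of judgments within each item.
    d_o = sum(
        sum(a != b for a, b in permutations(u, 2)) / (m - 1)
        for u in units
    ) / n
    # Expected disagreement: mismatched pairs if labels were shuffled across all items.
    totals = Counter(label for u in units for label in u)
    d_e = sum(totals[c] * totals[k] for c, k in permutations(totals, 2)) / (n * (n - 1))
    return 1 - d_o / d_e

# Made-up data: 4 items, 3 evaluators each.
votes = [["A", "A", "A"], ["B", "B", "B"], ["A", "A", "A"], ["A", "A", "Tie"]]
print(f"alpha = {nominal_alpha(votes):.2f}")

Values near 1 mean your evaluators consistently agree; values near 0 mean their judgments look close to random relative to one another.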

Find out if your AI product works

Real people. Real feedback. Results in hours.

$ curl -fsSL https://candor.sh | bash
Or talk to us about your use case →