Candor documentation

Candor runs AI-moderated user studies from your terminal. Install with one command, tell your coding agent what you want to learn, and get prioritized findings from real user sessions.
Candor integrates with Claude Code via a skill file, so your agent can create studies, recruit participants, and surface results — all without leaving the terminal.
Getting started

Quickstart

1. Install

bash
$curl -fsSL https://candor.sh | bash

This installs the Candor CLI via npm, authenticates you via the browser, and installs the Candor skill for Claude Code.

2. Run a study

bash
$claude "test the onboarding flow with 5 users"

Your agent reads your codebase for context, drafts a study script, shows you the cost estimate, and waits for your approval before recruiting.

3. Get results

bash
$candor study show study_a1b2c3

View prioritized findings (P0-P3) with key quotes and suggested fixes. Or ask your agent to open GitHub issues for the top findings.

How it works

When you ask your coding agent to run a study, it uses the Candor CLI to create the study and recruit real participants from a research panel. Each participant lands on a Candor session page and interacts with whatever stimulus is configured — items to evaluate, a product to test, or a topic to discuss.

For moderated studies (session or study scope), an AI voice moderator conducts the interview — asking questions, assigning tasks, and adapting follow-ups in real time. For item-based studies, participants complete structured tasks (compare, rate, or label) in a lightweight UI with no moderator.

After sessions complete, Candor synthesizes the data into rankings and agreement metrics (item-based) or prioritized findings with severity ratings, key quotes, and suggested actions (moderated). Your agent can then act on the results — opening issues, proposing fixes, or running follow-up studies.

candor init
Set up Candor: authenticate and configure Claude Code integration.
bash
$candor init

This command:

  • Opens your browser to authenticate with candor.sh
  • Saves your API key to ~/.candor/config.json
  • Installs the Candor skill to ~/.claude/skills/candor/
If Claude Code is not installed, the skill step is skipped. You can still use the CLI directly.
candor login
Re-authenticate with Candor. Use this if your API key has expired or you need to switch accounts.
bash
$candor login
candor study
Create, view, and manage user studies. Running candor study with no subcommand lists all studies.
bash
$candor study

Output

text
Your Studies

  Onboarding flow test [completed] id:study_a1b2c3
    https://example.com | 5 participants

  Pricing page clarity [recruiting] id:study_d4e5f6
    https://example.com/pricing | 3 participants
candor study list
Explicit alias for listing all studies. Supports --json.
bash
$candor study list
candor study create
Create a new study. Supports three stimulus types: items (compare/rate/label), url (usability test), or topic (interview).
bash
$candor study create --goal "which sounds better" --items "*.mp3"
bash
$candor study create --goal "test onboarding" --url https://example.com
bash
$candor study create --goal "coaching experience" --topic "health coaching"
Options
--goal (string, required)
What you want to learn, in plain English.
--items (string)
Items to evaluate: text labels, file paths, globs, or a CSV file.
--url (string)
Product URL to test (moderated usability study).
--topic (string)
Discussion topic (moderated interview).
--task (string)
Task type: compare, rate, label, use, share. Auto-inferred if omitted.
--labels (string)
Comma-separated labels for label tasks.
--moderator (string, default: inferred)
Moderator scope: none (no moderator), session (live interview per session), or study (adaptive cross-session intelligence). Pair with --moderator-output to choose voice or text output.
--audience (string)
Participant recruitment criteria in natural language.
--participants (number, default: 5)
Number of participants to recruit.
--json (flag)
Output as JSON (used by Claude Code).
--task and --moderator are auto-inferred from the stimulus type, so you usually don't need to specify them. --items infers compare/none, --url infers use/session, --topic infers share/session. Advanced options like --workers and --reward are available for item-based studies — run candor study create --help for details.
candor study show <id>
View study details and findings.
bash
$candor study show study_a1b2c3
candor study status <id>
Check progress for a study. Use --live for real-time monitoring with activity feed.
bash
$candor study status study_a1b2c3
bash
$candor study status study_a1b2c3 --live
candor study results <id>
View computed results for item-based studies: rankings, label distributions, or rating averages with agreement metrics.
bash
$candor study results study_a1b2c3
candor study cancel <id>
Cancel an active study and stop recruiting. Responses already collected are preserved and credits refunded proportionally.
bash
$candor study cancel study_a1b2c3
candor study findings <id>
Get prioritized findings (P0–P3) for a study.
bash
$candor study findings study_a1b2c3
json
{
  "findings": [
    {
      "priority": "P0",
      "title": "\"Add guests\" button hidden on mobile",
      "category": "usability",
      "description": "Users could not find the primary CTA on small screens",
      "timesMentioned": 3,
      "keyQuotes": ["I kept scrolling but couldn't find how to add people"],
      "suggestedAction": "Move CTA above the fold on mobile viewports"
    }
  ]
}
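Because the findings payload is plain JSON, turning the top findings into GitHub issues (as the quickstart suggests) is a small transformation. A sketch, assuming the payload shape above; the `findings_to_issues` helper and the issue layout are hypothetical conventions, not something the CLI emits:

```python
def findings_to_issues(findings, max_priority="P1"):
    """Format Candor findings as GitHub-issue-ready dicts.

    findings: the parsed findings payload (shape shown above).
    Only findings at or above max_priority are kept; P0 is the
    most severe. The title/body layout is a made-up convention.
    """
    cutoff = int(max_priority[1])
    issues = []
    for f in findings["findings"]:
        if int(f["priority"][1]) > cutoff:
            continue  # lower-priority finding, skip
        quotes = "\n".join(f"> {q}" for q in f.get("keyQuotes", []))
        issues.append({
            "title": f"[{f['priority']}] {f['title']}",
            "body": f"{f['description']}\n\n{quotes}\n\n"
                    f"**Suggested action:** {f['suggestedAction']}",
            "labels": ["candor", f["category"]],
        })
    return issues
```

From there, each dict maps directly onto the fields of a GitHub issue-creation call.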
candor study coverage <id>
View the adaptive coverage report for study-scoped studies: themes explored, consensus strength, and gaps needing more sessions.
bash
$candor study coverage study_a1b2c3
candor study approve <id>
Approve a study to begin recruiting real participants. This is the human-in-the-loop gate — studies do not recruit until explicitly approved.
bash
$candor study approve study_a1b2c3
Your agent should always show the cost estimate and wait for your explicit confirmation before running this command.
candor doctor
Run diagnostic checks on your Candor installation.
bash
$candor doctor

Checks for:

  • Candor CLI installed and up to date
  • API key configured
  • Claude Code detected
  • Skill file installed
candor update
Update the Candor CLI to the latest version.
bash
$candor update
You can also re-run the install script to update: curl -fsSL https://candor.sh | bash
Concepts

Studies

A study is the core primitive in Candor. Every study is defined by seven orthogonal dimensions: goal, stimulus (items, url, or topic), task (compare, rate, label, use, share), moderator scope (none, session, or study), output mode (voice or text), audience, and participant count.

Stimulus types

items
Compare, rate, or label items (files, text, media). Workers complete structured tasks. Results in minutes.
url
Moderated usability test on a product URL. AI moderator guides participants through the product.
topic
Moderated interview about a topic. AI moderator conducts voice interviews with participants.

Task types

compare
Workers compare pairs and pick a winner. Produces stack rankings with win rates.
rate
Workers rate items on a numeric scale (1–5).
label
Workers assign a label from predefined categories.
use
Participants use a product URL with AI moderator guidance.
share
Participants discuss a topic in a moderated interview.

Study lifecycle

Every study moves through four states:
draft
Study created, script generated. Awaiting approval.
recruiting
Approved. Participants being recruited from research panel.
active
Sessions in progress. Participants are being interviewed.
completed
All sessions done. Findings synthesized and ready.

AI moderator

The AI moderator is a voice agent that conducts live interviews with study participants. It follows your study script, asks open-ended questions, assigns tasks, and adapts follow-up questions based on participant responses — probing on confusion, frustration, or unexpected behavior in real time.

For moderated studies (session and study scope), the moderator records transcripts and audio, which are later synthesized into findings. See Moderator scope for the none/session/study distinction and voice vs. text output modes.


Moderator scope

The moderator scope is the axis that determines how Candor participates in your study. It's what makes Candor more than a recruitment platform — the scope controls whether participants complete tasks on their own, are interviewed individually, or are part of an adaptive research program that builds understanding across sessions.
none
No moderator. Participants complete structured tasks (compare, rate, label) in a lightweight UI. Fast and asynchronous — results in minutes.
session
AI moderator conducts a live interview within each session. Follows the study script but adapts follow-up questions based on participant responses. Sessions capped at 35 minutes.
study
AI moderator operates across multiple sessions. Learns from prior sessions to adaptively focus future ones on gaps, emerging themes, or areas needing deeper exploration.

When to use each scope

none — Best for quantitative evaluation: A/B comparisons, ratings, labeling. No moderator overhead, results in minutes. Use this when you have concrete items to evaluate and need fast, structured signal.

session — Best for qualitative feedback on a specific product or topic. Each session is independent — the moderator follows the same script but adapts to each participant. Use this for usability tests, concept validation, or exploratory interviews.

study — Best for deep research. The moderator builds understanding across sessions, adaptively probing gaps and emerging themes. Use candor study coverage to see exploration state and decide when you have enough signal.

Output modes

For session and study scopes, the --moderator-output flag controls how the AI communicates with participants:

voice (default) — The AI speaks aloud. Full two-way voice conversation with the participant.

text — The AI monitors audio and sends text-based interventions only. Useful for think-aloud studies where you want the participant to narrate freely while the AI observes and prompts via on-screen messages.


Findings

After all sessions complete, Candor analyzes the transcripts and produces prioritized findings:
P0
Critical finding — blocks core functionality or causes task failure
P1
Major finding — significant confusion or friction, affects most users
P2
Minor finding — noticeable issue but users can work around it
P3
Suggestion — nice-to-have improvement mentioned by users

Each finding includes the title, description, category, affected feature/page, number of times mentioned, key participant quotes, and a suggested action.


Pricing

Moderated studies (url/topic): ~$14.50 per participant session, covering recruitment, AI-moderated interview, transcripts, and findings synthesis. A typical 5-participant study costs ~$65–$75.

Item-based studies (compare/rate/label): ~$0.05–0.15 per task assignment. A typical pairwise study with 6 items (15 pairs, 3 workers each) costs ~$4.50–$6.75.
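The pairwise arithmetic is worth sanity-checking before you scale up the item count, since pairs grow as n·(n−1)/2. A quick sketch reproducing the example above; the per-assignment cents are back-solved from the quoted $4.50–$6.75 range, not an official rate card:

```python
from itertools import combinations

def pairwise_estimate(n_items, workers_per_pair=3,
                      cents_low=10, cents_high=15):
    """Rough cost range for a pairwise (compare) study.

    cents_low / cents_high are assumptions derived from the
    worked example in the text, not published pricing.
    """
    pairs = len(list(combinations(range(n_items), 2)))  # n*(n-1)/2
    assignments = pairs * workers_per_pair  # one per worker per pair
    return pairs, assignments, (assignments * cents_low / 100,
                                assignments * cents_high / 100)

pairs, assignments, (low, high) = pairwise_estimate(6)
# 6 items: 15 pairs, 45 assignments, $4.50 to $6.75
```

Doubling the item count to 12 yields 66 pairs, more than quadrupling the cost, which is why pairwise is best kept to small item sets.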

Your agent will always show the cost estimate before you approve.

Guide

Evaluating AI agents


This guide walks through using Candor to evaluate the output of an AI agent against a rubric. It covers how to decompose a multi-criteria eval into a single study, how to pick the right task type, how to write a rubric that produces reliable labels, how content renders to your labelers, and how to preview cost before anything charges.

The worked example is a research assistant agent that reads source documents and writes summaries — but the same pattern applies to any multi-source AI agent: research assistants, customer support triage bots, GTM tools, code review agents, or anything else where you want human judgement on whether the output is good.

Scope. This guide focuses on item-based studies (your agent produced some outputs, and you want humans to label them). For moderated interviews where humans talk to a Candor AI moderator about a product, see the Studies concept section above.

Breaking down a complex eval

Most real agent evals are multi-dimensional. You don't just want to know "is this output good?" — you want to know why it's good or bad across several orthogonal axes. Take a research assistant that reads a set of source documents and produces a summary. Four reasonable things to measure:

  • Did the agent retrieve the right source documents for the question?
  • For every claim in the output, is the source correctly attributed?
  • Is the summary faithful to what the sources actually say?
  • Does the output surface the most important findings — or bury them?

The wrong instinct here is to run four separate studies, one per question. That costs 4× the labeler time (each labeler has to re-read the same output four times) and gives you four disjoint score streams that are hard to join.

The right pattern is one item per agent output, scored along multiple dimensions in a single pass. Each labeler sees one output, answers all four questions on it at once, and moves on. In Candor this is a scorecard task with four criteria.

If your dimensions are all orthogonal (measuring different things), fold them into one scorecard. If they aren't — if two of them are really asking the same thing — drop one.

Picking a task type

Candor supports five item-based task types. The choice is not about the mechanic — it's about the question you want answered.

  • "Is this output good across several dimensions?" → scorecard — Multiple weighted criteria, behavior-anchored levels, a weighted overall score. The right tool for almost every agent-eval scenario.
  • "Is version A of my agent better than version B?" → compare (pairwise) — Participant sees two outputs side-by-side, picks the winner. Use when absolute judgement is hard but relative is easy — e.g. comparing two model versions, two prompts, or two generation strategies.
  • "Which category does this output fall into?" → label — Participant picks one label from a fixed set. Good for taxonomies (e.g. intent classification, sentiment, failure-mode tagging).
  • "On a 1–5 scale, how good is this?" → rate — Single-dimension rating. Simpler than a scorecard, but you lose the per-dimension breakdown. Use only when you genuinely only care about one axis.
  • "What did you notice? Freeform." → describe (free text) — Participant writes an open-ended response. Great as a companion task to catch failure modes you didn't anticipate in your rubric — but hard to aggregate, so don't use it as your primary signal.

For the research-assistant example — and for essentially any multi-criteria eval — scorecard is the right choice. The rest of this guide assumes scorecard.

A quick word on pairwise, since it's the least obvious one: pairwise is useful when you're comparing two versions of something (two model variants, two prompt templates, two retrieval strategies) and absolute quality is subjective. People are much better at saying "A is better than B" than at saying "A is a 4 out of 5". But pairwise only gives you relative rankings — it can't tell you whether any of the options are actually good.

Designing a rubric

A scorecard is a set of criteria. Each criterion has a name, a weight, and an ordered list of levels from worst to best. Good rubrics produce reliable labels; sloppy rubrics produce inter-rater disagreement and findings you can't act on.

Keep dimensions to 3–5

More than five and labelers get fatigued; their later answers get sloppier. Fewer than three and you lose the whole point of a multi-dim eval. If you catch yourself wanting six dimensions, usually two of them are measuring the same thing.

Use behavior-anchored levels

A level description should tell a labeler exactly what they're looking for — ideally something they can observe or verify without having to make a judgment call. Compare:

text
BAD:  "Summary is accurate"
GOOD: "Summary correctly names every key point in the source documents"

BAD:  "Good attribution"
GOOD: "Every claim in the output is traceable to a specific source document"

BAD:  "Useful"
GOOD: "Surfaces at least 3 actionable findings with concrete next steps"

When two labelers look at the same output and pick different levels, it's almost always because the level wording was ambiguous — not because the output was genuinely borderline. Invest in the wording.

Use 4 levels when you can

An even number of levels forces labelers off the fence. Three or five levels give them a comfortable middle option ("Satisfactory", "Neutral") that they'll reach for whenever they're unsure — which means your middle bucket absorbs noise and tells you nothing. Four levels makes them pick a side.

Weights are relative

Weights only matter in ratio to each other. [5, 3, 3, 1] and [10, 6, 6, 2] produce identical overall scores. Use small integers and pick weights that reflect how you'd actually explain the decision to a stakeholder: "accuracy is way more important than formatting" becomes weight: 5 for accuracy and weight: 1 for formatting.
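Because weights only matter in ratio, the claim is easy to verify directly. A minimal sketch, assuming the overall score is a weight-proportional average of per-dimension scores (the `weighted_score` helper and the example scores are hypothetical):

```python
def weighted_score(scores, weights):
    """Weighted average of per-dimension scores, each in 0.0-1.0."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

# Hypothetical per-dimension scores for one item.
scores = [1.0, 0.5, 0.75, 0.25]

a = weighted_score(scores, [5, 3, 3, 1])
b = weighted_score(scores, [10, 6, 6, 2])
assert a == b == 0.75  # same ratios, identical overall score
```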

Here's the criteria JSON for the research-assistant example:

json
[
  {
    "name": "Source retrieval",
    "weight": 4,
    "levels": [
      "Wrong sources entirely",
      "Missing critical sources",
      "Most sources retrieved with minor gaps",
      "All expected sources retrieved"
    ]
  },
  {
    "name": "Source attribution",
    "weight": 3,
    "levels": [
      "No attribution at all",
      "Some claims cited, others unsupported",
      "Most claims cited, one or two wrong",
      "Every claim correctly attributed to its source"
    ]
  },
  {
    "name": "Summary accuracy",
    "weight": 5,
    "levels": [
      "Inaccurate or fabricated details",
      "Partial or selective — omits key points",
      "Mostly accurate with minor detail errors",
      "Summary correctly names every key point in the sources"
    ]
  },
  {
    "name": "Findings utility",
    "weight": 3,
    "levels": [
      "Misleading findings that would confuse the reader",
      "Generic summary with no surfaced findings",
      "Some findings surfaced, no prioritization",
      "Clearly prioritized findings with next-step guidance"
    ]
  }
]

Note that levels are ordered worst to best, and accuracy has the highest weight because this team cares most about whether they can trust the summary. Your weights should reflect your own priorities.
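The rubric advice above (3–5 criteria, even level counts, small integer weights) is mechanical enough to lint before you create a study. A sketch of a local pre-flight checker; these checks mirror this guide's guidelines, not Candor API validation rules:

```python
def check_rubric(criteria):
    """Warn about rubric shapes this guide advises against.

    criteria: list of {"name", "weight", "levels"} dicts as in
    the JSON above. Returns a list of warning strings; an empty
    list means the rubric follows the guidelines.
    """
    warnings = []
    if not 3 <= len(criteria) <= 5:
        warnings.append(f"{len(criteria)} criteria; aim for 3-5")
    for c in criteria:
        if len(c["levels"]) % 2 != 0:
            warnings.append(f'"{c["name"]}": odd level count leaves '
                            "labelers a fence-sitting middle option")
        if not isinstance(c["weight"], int) or c["weight"] <= 0:
            warnings.append(f'"{c["name"]}": use small positive '
                            "integer weights")
    return warnings
```

Running it over the four-criterion rubric above should return no warnings; a single three-level criterion would trip two of the checks.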

Content formats — what labelers actually see

Before you prepare items, you need to know what your labelers will see on screen. Each item has three display slots:

  • label — short text. Plain text only — no Markdown or HTML.
  • assetUrl + mimeType — media. Supports image/*, audio/*, and video/* only. Not HTML, not text files.
  • metadata.description — optional small-gray caption rendered below the main content. Good for subtitles like "Q3 2026 run — v2.4 of the agent".

When you set both label and assetUrl, both render — the media shows prominently and the label renders as a caption right next to it. It's not one or the other.

Here's exactly what labelers see for each combination:

Image + label + description

Layout
  ┌──────────────────────────────┐
  │        ┌──────────┐          │
  │        │          │          │
  │        │   IMAGE  │   ← assetUrl (image/*)
  │        │          │          │
  │        └──────────┘          │
  │       item.label             │   ← small caption below
  │    metadata.description      │   ← smaller gray caption
  │      View product →          │   ← if metadata.url
  └──────────────────────────────┘

Text only (no assetUrl)

Layout
  ┌──────────────────────────────┐
  │  ┌────────────────────────┐  │
  │  │      item.label        │  │   ← in a gray bordered box
  │  └────────────────────────┘  │
  │    metadata.description      │
  └──────────────────────────────┘

Text-only items are fine for short strings (a headline, a single sentence, a filename) but not for anything longer. There's no wrapping control, no formatting, no line breaks — just a single styled span inside a gray box. A long paragraph will render as one ugly run-on.

Audio

Layout
  ┌──────────────────────────────┐
  │        item.label            │   ← label above the player
  │  ┌────────────────────────┐  │
  │  │  ▶ ═══════════  0:32   │  │   ← native <audio controls>
  │  └────────────────────────┘  │
  │    metadata.description      │
  └──────────────────────────────┘

Video

Layout
  ┌──────────────────────────────┐
  │  ┌────────────────────────┐  │
  │  │                        │  │
  │  │       VIDEO            │  │   ← native <video controls>
  │  │                        │  │
  │  └────────────────────────┘  │
  │       item.label             │   ← caption below
  │    metadata.description      │
  └──────────────────────────────┘

Pairwise (two items side-by-side)

Layout
  ┌──────────────────────────────────────────────┐
  │  Which one is better?                        │
  │  ┌────────────┐       ┌────────────┐         │
  │  │  Option A  │       │  Option B  │         │   ← subtitle
  │  │            │       │            │         │
  │  │  (item)    │       │  (item)    │         │   ← same ItemRenderer
  │  │            │       │            │         │
  │  └────────────┘       └────────────┘         │
  │   ( A is better )  ( Tie )  ( B is better )  │
  └──────────────────────────────────────────────┘

Pairwise is just ItemRenderer rendered twice in a grid with "Option A" / "Option B" labels on top. Whatever an item looks like on its own, it looks the same inside a pairwise comparison.

Rendering long-form text output

This is the practical gotcha for agent evals: your agent produces long structured text, and the item renderer has no way to show it cleanly. There is no HTML path, no Markdown path, no long-text path.

The reliable workaround is to render your agent's output to HTML yourself, screenshot the rendered page, host the PNG, and pass the PNG as an assetUrl. You get full control over typography, syntax highlighting, sections, and wrapping, and it renders exactly the same way to every labeler. Use label as a short caption identifying which run/version the screenshot came from.

The item label field does not support Markdown or HTML. If your content has any formatting — headings, bullets, code blocks, tables — you must screenshot it before uploading, or labelers will see a run-on string in a gray box.
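If your agent doesn't already render to HTML, a minimal wrapper is enough to get clean typography and wrapping before you screenshot. A sketch, assuming plain-text agent output; the page template and styles are arbitrary, so swap in your own:

```python
import html

# Hypothetical minimal page template; pre-wrap preserves the
# agent output's line breaks without any Markdown rendering.
PAGE = """<!doctype html>
<meta charset="utf-8">
<style>
body {{ max-width: 720px; margin: 2rem auto;
        font: 16px/1.6 system-ui; white-space: pre-wrap; }}
</style>
<body>{content}</body>"""

def render_for_screenshot(agent_output, path):
    """Write agent output into a minimal styled HTML page.

    Screenshot the resulting file with your usual tooling
    (a headless browser, etc.), then host the PNG for assetUrl.
    """
    with open(path, "w", encoding="utf-8") as f:
        f.write(PAGE.format(content=html.escape(agent_output)))
```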

Hosting your assets

Candor does not proxy your media. Whatever URL you pass as assetUrl must be publicly reachable — from both Candor's servers and every labeler's browser. There are two fetch paths to keep in mind.

Server-side probing (audio/video only)

When you create a study with audio or video items, Candor's server immediately calls fetch(assetUrl) to measure duration. This feeds the reward estimator so the auto-calculated payout matches the actual task length. If the URL isn't reachable from Candor's servers, the probe silently fails and your reward estimate will be off — but the study still creates. Image-only studies skip this step entirely.

Client-side loading (all media)

When a labeler actually takes the task, their browser loads assets directly via <img src=...>, <audio><source>, and <video><source>. There is no proxy. If the URL requires auth headers, a session cookie, a VPN, or is behind your corporate firewall, the labeler sees a broken-image icon and your data for that item is garbage.

Quick check: can you open the asset URL in an incognito browser window with no cookies and no VPN? If yes, labelers can see it. If no, they can't.

Practical hosting choices

  • Public CDN path — simplest and most reliable. Cloudflare, CloudFront, Bunny, Fastly, anything that returns your file on a plain GET.
  • S3 presigned URLs — work fine, but pick an expiration longer than your study duration. A study can run for days; presign for at least 7 days.
  • Google Drive / Dropbox share links — often don't work, because the share URL redirects through a preview page that <img> can't render. If you must use them, find the direct-download URL and use that instead.
  • Localhost / intranet / VPN-only URLs — will never work. Candor's servers can't reach them and neither can outside labelers.
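Some of the failure cases above can be caught before you create the study. A heuristic sketch: it flags only obviously unreachable hosts (localhost, private addresses, non-HTTP schemes) and cannot detect auth walls or firewalls, so the incognito-window test remains the real check:

```python
import ipaddress
from urllib.parse import urlparse

def obviously_unreachable(asset_url):
    """Return a reason string if labelers can never load this URL,
    or None if it passes the (heuristic) checks."""
    parts = urlparse(asset_url)
    if parts.scheme not in ("http", "https"):
        return f"scheme {parts.scheme!r} will not load in a browser tag"
    host = parts.hostname or ""
    if host == "localhost" or host.endswith(".local"):
        return "localhost / mDNS host is unreachable from outside"
    try:
        ip = ipaddress.ip_address(host)
        if ip.is_private or ip.is_loopback:
            return f"{host} is a private address"
    except ValueError:
        pass  # a normal DNS hostname; can't tell without resolving
    return None
```

Running it over every assetUrl in your items list before study creation is cheap insurance against an entire batch of broken-image responses.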

Screenshots of your agent's output may contain sensitive data (customer names, internal documents, API keys that leaked into logs). Sanitize before you upload, or keep the study internal by using platform: "direct" and sharing only with your team. Once a URL is in an assetUrl, anyone who receives a participant link can open it.

Getting reliable labels

A scorecard is only as good as the people filling it out. Agent evals are usually specialist tasks — "is this retrieval correct?" or "is this attribution complete?" requires someone who understands the domain. A crowd worker recruited in 10 minutes almost certainly can't answer these correctly.

For agent evals, the right default is direct-link studies labeled by your own team. Set platform: "direct" in the create payload, then share the resulting URLs with a small group of domain experts — engineers, product folks, or whoever knows what a good output looks like. Candor's managed recruitment is better for broader-audience questions (does a marketing message resonate?) than for technical quality judgements.

How many labelers per item?

The participants field controls redundancy — how many people see each item. One labeler is cheapest but gives you zero agreement signal: if someone makes a mistake, you'll never notice. Three labelers is the de facto standard — enough to compute agreement and catch outliers without being wildly expensive. Five or more for decisions that really matter (model launch / kill calls).

A sensible rollout is: start with 1–2 internal labelers for a dry run, make sure the rubric is sane, then scale up to 3 labelers for the real evaluation.

End-to-end walkthrough

Here's the whole flow for the research-assistant example — from rubric to results — using real API calls.

Step 1: Prepare items

For each agent output you want evaluated, render it to HTML (whatever rendering your agent already produces), screenshot it to a PNG, and upload the PNG to a public CDN. You end up with one URL per output. The label should be a short identifier the labeler can reference if they need to report a problem.
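Once the PNGs are hosted, the items array for the create payload in Step 2 can be assembled programmatically. A sketch with a hypothetical helper; the label and metadata shapes follow the payload below:

```python
def build_items(screenshot_urls, run_id, descriptions=None):
    """Assemble the items array for the study-create payload.

    screenshot_urls: hosted PNG URLs, one per agent output.
    run_id: short identifier baked into each label so labelers
    can reference a specific item when reporting a problem.
    descriptions: optional {1-based index: caption} overrides.
    """
    descriptions = descriptions or {}
    items = []
    for i, url in enumerate(screenshot_urls, start=1):
        item = {
            "label": f"Run {run_id} · query #{i}",
            "assetUrl": url,
            "mimeType": "image/png",
        }
        if i in descriptions:
            item["metadata"] = {"description": descriptions[i]}
        items.append(item)
    return items
```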

Step 2: Create a draft study

bash
curl https://candor.sh/api/studies \
  -H "Authorization: Bearer $CANDOR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "goal": "Evaluate research assistant summary quality",
    "task": "scorecard",
    "platform": "direct",
    "participants": 3,
    "items": [
      {
        "label": "Run 2026-04-12 · query #1",
        "assetUrl": "https://cdn.example.com/evals/run_2026-04-12/q1.png",
        "mimeType": "image/png",
        "metadata": { "description": "Agent v2.4 — 6 sources retrieved" }
      },
      {
        "label": "Run 2026-04-12 · query #2",
        "assetUrl": "https://cdn.example.com/evals/run_2026-04-12/q2.png",
        "mimeType": "image/png"
      }
    ],
    "criteria": [
      { "name": "Source retrieval",   "weight": 4,
        "levels": ["Wrong sources entirely", "Missing critical sources",
                   "Most sources retrieved with minor gaps",
                   "All expected sources retrieved"] },
      { "name": "Source attribution", "weight": 3,
        "levels": ["No attribution at all", "Some claims cited, others unsupported",
                   "Most claims cited, one or two wrong",
                   "Every claim correctly attributed to its source"] },
      { "name": "Summary accuracy",   "weight": 5,
        "levels": ["Inaccurate or fabricated details",
                   "Partial or selective — omits key points",
                   "Mostly accurate with minor detail errors",
                   "Summary correctly names every key point in the sources"] },
      { "name": "Findings utility",   "weight": 3,
        "levels": ["Misleading findings that would confuse the reader",
                   "Generic summary with no surfaced findings",
                   "Some findings surfaced, no prioritization",
                   "Clearly prioritized findings with next-step guidance"] }
    ]
  }'

A few things to notice about that payload: task: "scorecard" and criteria go together; platform: "direct" keeps this study internal (no external recruitment); participants: 3 gives you three labelers per item for agreement signal.

Step 3: Inspect the estimate

The create response gives you everything you need to decide whether to go ahead. Nothing has charged yet — this is a draft.

json
{
  "study": {
    "id": "study_a1b2c3",
    "name": "Evaluate research assistant summary quality",
    "status": "draft",
    "platform": "direct",
    ...
  },
  "items": [
    { "id": 101, "label": "Run 2026-04-12 · query #1" },
    { "id": 102, "label": "Run 2026-04-12 · query #2" }
  ],
  "estimate": {
    "totalTasks": 2,
    "totalAssignments": 1,
    "estimatedCostCents": 180,
    "totalCostCents": 252,
    "estimatedMinutes": 3
  },
  "message": "Study created in draft mode. Run candor study approve study_a1b2c3 to start."
}
  • totalTasks (number) — How many individual item scorings will happen. items.length × participants gives the upper bound; Candor may bundle multiple tasks into one assignment.
  • totalAssignments (number) — How many distinct labeler sessions. Divided by your participants field this tells you how many share URLs the approve step will give you.
  • estimatedCostCents (number) — Lower-bound cost estimate, mostly useful as context. Prefer totalCostCents when budgeting.
  • totalCostCents (number) — This is the number you actually pay. For platform: "direct" item studies with internal labelers, this is typically very small — no recruitment fees apply.
  • estimatedMinutes (number) — How long Candor expects a labeler to take per assignment. Useful for setting expectations with your team.

Step 4: Preview what labelers will see

Before you approve, open the study in a browser as if you were a labeler:

text
https://candor.sh/study/preview/study_a1b2c3

Walk through the task end-to-end, check that your asset URLs load, that the rubric reads the way you expected, and that the levels are unambiguous. This works on drafts and doesn't consume a real assignment. See Previewing a task for the programmatic flow if you want to embed the preview URL in your own tooling.

Step 5: Approve, or delete and retry

If the estimate looks fine and the preview checks out, approve the study. This is the only endpoint that debits your balance. For direct-link studies it costs nothing at approve time; for managed recruitment it deducts the full totalCostCents.

bash
curl -X POST https://candor.sh/api/studies/study_a1b2c3/approve \
  -H "Authorization: Bearer $CANDOR_KEY"

If the estimate is wrong, too expensive, or you want to tweak the rubric, delete the draft and create a new one. Drafts are free to create and destroy — iterate as much as you want before approving.

bash
curl -X DELETE https://candor.sh/api/studies/study_a1b2c3/delete \
  -H "Authorization: Bearer $CANDOR_KEY"

Step 6: Share links with labelers

For direct-link studies, the approve response includes a shareUrls array — one URL per labeler slot. Send each URL to someone on your team; they follow the link, complete the tasks, and submit. No accounts, no signup, no Candor-side recruitment.

json
{
  "message": "Study ready! Share these links with your team.",
  "shareUrls": [
    "https://candor.sh/study/a/asgn_f0e1d2c3b4a5",
    "https://candor.sh/study/a/asgn_a5b4c3d2e1f0",
    "https://candor.sh/study/a/asgn_9a8b7c6d5e4f"
  ],
  ...
}

Step 7: Read the results

Once labelers have submitted, GET /api/studies/:id/results returns a per-item, per-dimension breakdown. Each item shows its weighted overall score and the per-dimension detail.

json
{
  "status": "completed",
  "results": {
    "items": [
      {
        "label": "Run 2026-04-12 · query #1",
        "overallWeightedScore": 0.72,
        "totalResponses": 3,
        "dimensions": [
          {
            "name": "Source retrieval",
            "weight": 4,
            "meanScore": 0.92,
            "levelDistribution": {
              "All expected sources retrieved": 2,
              "Most sources retrieved with minor gaps": 1
            }
          },
          {
            "name": "Source attribution",
            "weight": 3,
            "meanScore": 0.50,
            "levelDistribution": {
              "Some claims cited, others unsupported": 2,
              "Most claims cited, one or two wrong": 1
            }
          },
          {
            "name": "Summary accuracy",
            "weight": 5,
            "meanScore": 0.83,
            "levelDistribution": {
              "Summary correctly names every key point in the sources": 2,
              "Mostly accurate with minor detail errors": 1
            }
          },
          {
            "name": "Findings utility",
            "weight": 3,
            "meanScore": 0.58,
            "levelDistribution": {
              "Some findings surfaced, no prioritization": 2,
              "Clearly prioritized findings with next-step guidance": 1
            }
          }
        ]
      }
    ]
  }
}

Three things to know about these numbers:

  • meanScore is normalized to 0.0 – 1.0, not 0–5. A 4-level rubric with unanimous top picks gives 1.0; unanimous worst gives 0.0.
  • overallWeightedScore is the weighted average of all meanScore values across dimensions, using the weight field. It's the single number you'll quote in dashboards.
  • levelDistribution is a raw count of how many labelers picked each level — useful for spotting disagreement (a 1/1/1 split across three labelers is a flashing red light).
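These relationships can be sketched in a few lines of Python. This is a reconstruction from the field descriptions above, not Candor's server-side code, and a score rebuilt from rounded meanScore values may differ slightly from the API's overallWeightedScore, which is computed from raw responses.

```python
# Sketch: rebuild the weighted overall score from per-dimension means and
# flag dimensions where no rubric level won a majority of the votes.
# Reconstructed from the field descriptions, not Candor's actual code.

def overall_weighted_score(dimensions):
    total_weight = sum(d["weight"] for d in dimensions)
    return sum(d["weight"] * d["meanScore"] for d in dimensions) / total_weight

def split_dimensions(dimensions):
    """Names of dimensions where labelers split with no majority level."""
    flagged = []
    for d in dimensions:
        votes = list(d["levelDistribution"].values())
        if max(votes) * 2 <= sum(votes):  # e.g. a 1/1/1 split
            flagged.append(d["name"])
    return flagged

dims = [
    {"name": "Source retrieval", "weight": 4, "meanScore": 0.92,
     "levelDistribution": {"All expected sources retrieved": 2,
                           "Most sources retrieved with minor gaps": 1}},
    {"name": "Source attribution", "weight": 3, "meanScore": 0.50,
     "levelDistribution": {"Some claims cited, others unsupported": 2,
                           "Most claims cited, one or two wrong": 1}},
]
print(round(overall_weighted_score(dims), 2))  # 0.74
print(split_dimensions(dims))                  # []
```

A 2-1 split still has a majority, so neither dimension here gets flagged; a three-way 1/1/1 split across three labelers would be.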

In the example above, the agent is strong on retrieval and accuracy but weak on attribution and prioritizing findings. That's an actionable signal: fix the attribution chain and the findings-ranking step, then re-run with a new batch of items to see if the scores improve.

Previewing a task before you share it

Before you approve a study and send links to labelers, you almost always want to open it yourself and walk through the task the way a labeler would. Candor gives you a preview URL for this — it works on drafts, so you can see exactly what the task looks like before anything charges and without using up a real assignment slot.

The easy path: one URL, no API calls

Open this in any browser:

text
$https://candor.sh/study/preview/study_a1b2c3

That page creates a preview assignment, redirects you to the real participant UI, and appends ?preview=true so submissions aren't recorded. You see every task in order — the full rubric, the actual item renders, the submit button, everything. Share the URL with a teammate and they can review without needing a Candor account.

The programmatic path

If you want to embed the preview URL in your own tooling (e.g. show it next to the estimate in a custom dashboard), call the preview endpoint directly:

bash
curl -X POST https://candor.sh/api/studies/study_a1b2c3/preview \
  -H "Authorization: Bearer $CANDOR_KEY"

The response gives you a relative URL you can prefix with the API base:

json
{
  "type": "items",
  "url": "/study/a/preview_9f8e7d6c5b4a",
  "assignmentId": "preview_9f8e7d6c5b4a"
}

Append ?preview=true and hand it to whoever needs to see it. Moderated studies return { type: "moderated", url, sessionToken } instead — the URL drops you into a full moderated session so you can walk the interview script end-to-end.
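If you're wiring this into your own tooling, the URL assembly is the only fiddly part. A minimal sketch, assuming the response shape shown above (build_preview_url is a hypothetical helper, not part of any Candor SDK):

```python
# Sketch: turn the preview response into an absolute, shareable URL.
# Assumes the response shape documented above; the helper name is ours.

def build_preview_url(response, base="https://candor.sh"):
    url = base + response["url"]
    if response["type"] == "items":
        url += "?preview=true"   # item previews need the flag appended
    return url                   # moderated previews also carry a sessionToken

resp = {"type": "items",
        "url": "/study/a/preview_9f8e7d6c5b4a",
        "assignmentId": "preview_9f8e7d6c5b4a"}
print(build_preview_url(resp))
# https://candor.sh/study/a/preview_9f8e7d6c5b4a?preview=true
```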

What preview does and doesn't do

  • Works on drafts — you don't need to approve first.
  • Shows the exact same ItemRenderer your real labelers will see, loading your real asset URLs. If something renders weird in preview, it will render weird for labelers too.
  • Creates a real assignment row tagged preview_*. These are filtered out of progress, results, and the activity feed, so they won't pollute your real data.
  • Multiple calls create multiple preview assignments — harmless but they accumulate.

To clean up, delete them whenever you want:

bash
curl -X DELETE https://candor.sh/api/studies/study_a1b2c3/preview \
  -H "Authorization: Bearer $CANDOR_KEY"

This removes every preview_* assignment (and its tasks) for that study. Real participant data is never touched.

Previewing cost before you commit

To recap the money flow so you can build it into whatever tooling you use to drive Candor:

  • POST /api/studies — always free. Creates a draft, returns estimate. Nothing charges.
  • DELETE /api/studies/:id/delete — also free. Deletes the draft; no residual state.
  • POST /api/studies/:id/approve — this is the only endpoint that debits your balance. For direct-link item studies it costs nothing (platform-internal); for managed recruitment it deducts the full totalCostCents you saw in the estimate.
  • POST /api/studies/:id/publish — also free. Just flips the study from ready-to-publish into active recruitment on the provider.

The safe iteration loop is:

Flow
create draft  →  inspect estimate  →  delete if too expensive
              ↑                                      │
              └──── adjust rubric / items ───────────┘

create draft  →  inspect estimate  →  approve  (money moves)

You can create and destroy as many drafts as you want while tuning your rubric — nothing charges until you explicitly approve. That's the whole cost-preview story.
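As one possible shape for that loop in client code, here is a sketch with the three API calls injected as plain functions (the signatures are illustrative, not an SDK):

```python
# Sketch of the create -> inspect -> delete/approve loop. The create,
# delete, and approve callables stand in for POST /api/studies,
# DELETE /api/studies/:id/delete, and POST /api/studies/:id/approve.

def tune_until_affordable(configs, create, delete, approve, budget_cents):
    """Draft each candidate config in order; approve the first one in budget."""
    for config in configs:
        draft = create(config)                # free: creates draft, returns estimate
        if draft["totalCostCents"] <= budget_cents:
            approve(draft["id"])              # the only step where money moves
            return draft
        delete(draft["id"])                   # free: discard draft and retry
    return None                               # nothing fit the budget
```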

API reference

API Reference


Every feature the CLI uses is available over HTTPS. Create an API key from the API Keys page in the dashboard, then call the same endpoints the CLI does. All responses are JSON.

Base URL: https://candor.sh/api

API reference

Authentication

Pass your API key as a bearer token in the Authorization header.

bash
curl https://candor.sh/api/studies \
  -H "Authorization: Bearer ck_your_api_key_here"

Keys look like ck_<48-hex>. Treat them like passwords — anyone with a key can read and modify your studies. Revoke compromised keys from the dashboard.

API reference

Studies

The three study types

Every study has a stimulus type that decides what participants see and how they respond. You pick a stimulus by sending exactly one of items, url, or topic in the create payload — Candor infers the rest.

  • Item studies — triggered by items. Participants evaluate a discrete set of things (images, audio clips, copy variants, model outputs). No voice — everything happens in a structured browser UI. Use when you have alternatives to compare or want to label/rate a collection.
  • URL studies — triggered by url. Participants visit a live website or product and talk through their experience with an AI moderator. You can pass an interview guide; Candor generates one from your goal if you don't. Use when you're testing a real product and want qualitative feedback on specific flows.
  • Topic studies — triggered by topic. Same moderated-interview format as URL studies, but without a product to test. Participants discuss a subject with the AI moderator. Use for discovery research, concept testing, or any interview that isn't tied to a live UI.

Create a study

POST /api/studies

goal is the only universally required field. Everything else depends on the stimulus type. Candor fills in sensible defaults (task, moderator scope, reward, batch size) based on the combination you pass, so a minimal request is usually enough to start.

Parameters shared by all study types

  • goal (string, required) — Plain-language description of what you want to learn. Used as the study name if name is not provided, and fed to the moderator / task generator to shape the participant experience.
  • participants (number, default: 5) — For moderated studies (URL/topic): the number of sessions to run. For item studies on the direct platform: the number of independent shareable links. For item studies on a managed platform: the number of respondents per task batch.
  • audience (string) — Natural-language audience description used when Candor is recruiting for you (e.g. "US designers aged 25-40 who use Figma daily"). Ignored for recruitment: "direct".
  • platform ("direct" | "managed", default: "direct") — Controls how participants are sourced. direct gives you a URL you share yourself (free, no recruitment fees). managed has Candor recruit from a vetted pool using your audience string.
  • reward (number (cents)) — Per-session (moderated) or per-assignment (items) payout. If omitted, Candor auto-calculates based on task type, media duration, and platform fees. Required only if the auto-estimate feels wrong.
  • rewardMultiplier (number) — Multiplier applied to the auto-calculated reward. Use for hard-to-recruit audiences where you want to pay above-market without picking an exact number.

Item studies

Send a non-empty items array. Each item needs a label and optionally an assetUrl (for images, audio, video) plus mimeType. Candor probes audio and video durations at creation time to size the reward and batch correctly — if you pass assetUrl, make it publicly reachable.

bash
curl https://candor.sh/api/studies \
  -H "Authorization: Bearer $CANDOR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "goal": "Which onboarding screenshot feels more trustworthy?",
    "items": [
      { "label": "Variant A", "assetUrl": "https://cdn.example.com/a.png", "mimeType": "image/png" },
      { "label": "Variant B", "assetUrl": "https://cdn.example.com/b.png", "mimeType": "image/png" },
      { "label": "Variant C", "assetUrl": "https://cdn.example.com/c.png", "mimeType": "image/png" }
    ],
    "participants": 10
  }'

The task field decides what participants do with each item. If you omit it, Candor picks one by reading your goal (and whether you passed labels).

Task types for item studies

  • task: "compare" (or rank — 20 pairs per assignment) — Shows two items side by side; participant picks the winner (or ties). No extra fields required. Candor generates all pairs for small item sets, or samples for >100 items.
  • task: "rate" (or score — 20 items per assignment) — Shows one item; participant rates 1–5. No extra fields required.
  • task: "label" (or categorize — 20 items per assignment) — Shows one item; participant picks from a fixed label set. Pass labels: string[] — e.g. ["positive", "neutral", "negative"]. Required.
  • task: "describe" (or transcribe / respond / review — 20 items per assignment) — Shows one item; participant writes an open-ended response. No extra fields. Pick the verb that best matches your instructions: describe for observations, transcribe for audio/video, review for evaluative writing.
  • task: "scorecard" (10 items per assignment) — Shows one item; participant evaluates it across multiple rubric dimensions. Pass criteria: { name, weight, levels[] }[]. Required. Each criterion produces a dimension score; overall score is weighted.

Batch size (batchSize) is how many tasks go into a single worker's assignment — defaults above. Override it if you want shorter or longer sessions.

For very large item sets (more than a few hundred), post the first chunk in the create call and append the rest via POST /api/studies/:id/items in batches of up to 500.
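In client code, that chunking might look like the following sketch (the 500-item cap comes from the text above; the helper name is ours):

```python
# Sketch: the first chunk goes in the POST /api/studies body, the rest
# are appended via POST /api/studies/:id/items, up to 500 items per call.

def batches(items, size=500):
    for start in range(0, len(items), size):
        yield items[start:start + size]

items = [{"label": f"clip {i}"} for i in range(1200)]
first, *rest = batches(items)
print(len(first), [len(b) for b in rest])
# 500 [500, 200]
```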

URL studies

Send a url to the page you want tested. Candor automatically generates a short interview guide from your goal — or pass your own in interviewGuide (plain text / Markdown; Candor converts it to the internal script format via Claude).

bash
curl https://candor.sh/api/studies \
  -H "Authorization: Bearer $CANDOR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "goal": "Test whether new users understand the pricing page",
    "url": "https://example.com/pricing",
    "participants": 5,
    "durationMinutes": 8,
    "platform": "managed",
    "audience": "US adults 25-45 who subscribe to at least one SaaS product"
  }'

  • url (string, required) — The page participants will interact with.
  • displayMode ("iframe" | "tab", default: "iframe") — iframe embeds the page inside Candor's session UI alongside the moderator panel. tab opens it in a new browser tab — use this for sites that refuse to be iframed.
  • durationMinutes (number, default: 5) — Target session length. Drives the auto-estimated reward and the generated interview script's section timing.
  • interviewGuide (string) — Your own interview script as plain text. Sections, questions, and tasks are preserved verbatim. If omitted, Candor generates a 2–3 section script from your goal.
  • moderator ("none" | "session" | "study", default: "session") — session runs one moderator per participant. study runs one moderator across all sessions with adaptive coverage — cheaper but participants don't each get a full interview. none disables the moderator entirely (rarely useful for URL studies).
  • moderatorOutput ("voice" | "text", default: "voice") — voice — AI moderator speaks aloud via TTS and expects participants to reply with their voice (a full spoken interview). text — AI monitors participant speech via STT but responds only with text prompts on screen. Default is voice for all moderated studies (URL, topic, and items follow-ups). Pass text to opt out of TTS.
  • inputModes (string[]) — Explicit list of input channels participants can use, e.g. ["voice", "text"]. Default is voice only.

Topic studies

Send a topic string. Everything else works the same as URL studies — same moderator options, same interview guide handling, same recruitment choices. The only difference is participants don't see a website during the session.

bash
curl https://candor.sh/api/studies \
  -H "Authorization: Bearer $CANDOR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "goal": "Understand how indie devs decide whether to adopt a new LLM",
    "topic": "Choosing an LLM for a side project",
    "participants": 8,
    "platform": "managed",
    "audience": "Software engineers who ship side projects"
  }'

Direct vs managed recruitment

Every study has a platform that controls how participants are sourced. It's a creation-time choice — you can't switch a study between modes later.

  • platform: "direct" — Candor gives you one shareable URL per participant slot. You send the links to whoever you want — teammates, your own user list, a beta group. No recruitment fees. The audience field is ignored.
  • platform: "managed" — Candor handles recruitment against a vetted participant pool using the audience string you provide. The total cost per participant is visible on the create response under totalCostCents — nothing charges until you approve the study.

After creating a study

Every create returns status: "draft". Nothing charges, nothing recruits, nothing is visible to participants until you explicitly approve and publish:

bash
POST /api/studies/:id/approve   # draft -> ready_to_publish (or active for direct studies)
POST /api/studies/:id/publish   # ready_to_publish -> live recruitment

Direct (link) studies skip the publish step — approve activates them and returns the shareable URLs inline.

List studies

GET /api/studies

Returns an array of studies. Append ?archived=true to include archived studies.

Get a study

GET /api/studies/:id

Append ?include=findings,participants to include related entities in the response. Example response body:

json
{
  "study": {
    "id": "study_a1b2c3",
    "name": "Onboarding walkthrough test",
    "goal": "Find friction in the signup flow",
    "stimulus": { "type": "url", "value": "https://example.com/signup" },
    "task": "use",
    "moderatorScope": "session",
    "moderatorOutput": "voice",
    "participants": 5,
    "status": "active",
    "platform": "managed",
    "estimatedCostCents": 6750,
    "createdAt": "2026-04-12T10:30:00.000Z",
    "updatedAt": "2026-04-12T10:45:00.000Z"
  },
  "findings": [],
  "participants": [],
  "activity": [
    { "event": "launched", "message": "Launching study...", "at": "..." },
    { "event": "published", "message": "Recruitment live", "at": "..." }
  ]
}

The platform field is direct for self-shared studies and managed when Candor handles recruitment.

Add items to a study

POST /api/studies/:id/items

Append additional items to an existing item-based study. Useful for large studies where the initial POST /api/studies body would exceed request size limits — send the first batch on create, then stream the rest here in chunks of 500.

Preview a task

bash
POST   /api/studies/:id/preview  # create a preview assignment
DELETE /api/studies/:id/preview  # clean up preview assignments

The POST endpoint creates a disposable assignment and returns a url that opens the real participant UI in preview mode (no responses saved, preview rows filtered out of results). Works on drafts — you can preview before approving. For the browser-friendly shortcut, just visit https://candor.sh/study/preview/:id and the page handles the POST and redirect for you. See Previewing a task in the guide for the full flow.

Lifecycle transitions

bash
POST   /api/studies/:id/approve   # move draft to ready-to-publish
POST   /api/studies/:id/publish   # go live with recruitment
POST   /api/studies/:id/pause     # temporarily stop recruiting
POST   /api/studies/:id/cancel    # permanently stop
POST   /api/studies/:id/archive   # archive
DELETE /api/studies/:id/delete    # permanent delete (billing records preserved)

The pause and archive endpoints also handle the inverse operation. Pause auto-toggles — calling it on a paused study resumes it — or pass { "action": "resume" } in the body to force it. Archive takes { "action": "unarchive" } to restore a study.

API reference

Results & findings

The output shape depends on the study type. Item studies return computed results that depend on the task; moderated (URL/topic) studies return prioritized findings extracted from session transcripts.

bash
GET /api/studies/:id/results       # computed results (item studies)
GET /api/studies/:id/findings      # synthesized findings (moderated)
GET /api/studies/:id/demographics  # participant demographics (managed recruitment only)
GET /api/studies/:id/coverage      # thematic coverage (moderated)

Append ?format=csv to any of these to download as CSV. Worker IDs in results and demographics are exposed as participantId (JSON) or external_participant_id (CSV) so you can correlate responses across calls without depending on any recruitment provider's ID format.
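Because both endpoints expose the same participantId, correlating them is a plain dictionary join. A sketch with illustrative row shapes:

```python
# Sketch: attach a demographics row to each response via the shared
# participantId key. Row fields here are illustrative, not a schema.

def join_by_participant(responses, demographics):
    demo_by_id = {d["participantId"]: d for d in demographics}
    return [{**r, "demographics": demo_by_id.get(r["participantId"])}
            for r in responses]

responses = [{"participantId": "p_3f9e", "text": "Felt cluttered"}]
demographics = [{"participantId": "p_3f9e", "age": "25-34"}]
print(join_by_participant(responses, demographics)[0]["demographics"]["age"])
# 25-34
```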

Item-study results — shape per task type

GET /api/studies/:id/results is valid for item studies only. The results object shape depends on the task you picked at creation time:

Pairwise comparison (task: "compare")

json
{
  "results": {
    "rankings": [
      { "rank": 1, "label": "Variant A", "winRate": 0.72, "totalWins": 18, "totalComparisons": 25 },
      { "rank": 2, "label": "Variant B", "winRate": 0.48, "totalWins": 12, "totalComparisons": 25 }
    ],
    "agreement": {
      "pairwiseAgreementRate": 0.84,
      "krippendorphAlpha": 0.68,
      "disagreedPairs": [ { "itemALabel": "Variant A", "itemBLabel": "Variant C" } ]
    },
    "totalResponses": 50,
    "totalPairs": 3
  },
  "status": "completed",
  "progress": { "totalTasks": 15, "completedTasks": 15, "totalResponses": 50,
                "uniqueParticipants": 10, "respondentsPerTask": 5 }
}

Rating scale (task: "rate")

json
{
  "results": {
    "items": [
      { "label": "Variant A", "meanRating": 4.2, "stdDev": 0.74, "median": 4, "totalRatings": 12 },
      { "label": "Variant B", "meanRating": 3.6, "stdDev": 0.92, "median": 4, "totalRatings": 12 }
    ]
  }
}

Categorical label (task: "label")

json
{
  "results": {
    "items": [
      {
        "label": "Screenshot 1",
        "assignedLabel": "positive",
        "confidence": 0.80,
        "totalVotes": 10,
        "labelDistribution": { "positive": 8, "neutral": 1, "negative": 1 }
      }
    ]
  }
}
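The payload is consistent with confidence being the top label's vote share, though the exact server-side formula isn't documented here. A sketch that reproduces the numbers above under that assumption:

```python
# Sketch: derive assignedLabel and confidence from labelDistribution,
# assuming confidence = top label's share of total votes.

def summarize_labels(label_distribution):
    total = sum(label_distribution.values())
    top = max(label_distribution, key=label_distribution.get)
    return top, label_distribution[top] / total

label, confidence = summarize_labels({"positive": 8, "neutral": 1, "negative": 1})
print(label, confidence)
# positive 0.8
```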

Free text (task: "describe", "review", etc.)

json
{
  "results": {
    "items": [
      {
        "label": "Variant A",
        "responses": [
          { "text": "Felt cluttered — I didn't know where to look first.", "participantId": "p_3f9e..." },
          { "text": "Clean and direct. Would click.", "participantId": "p_7a22..." }
        ]
      }
    ]
  }
}

Scorecard (task: "scorecard")

json
{
  "results": {
    "items": [
      {
        "label": "Model output A",
        "overallWeightedScore": 0.74,
        "totalResponses": 8,
        "dimensions": [
          { "name": "Accuracy",     "meanScore": 0.85, "weight": 5 },
          { "name": "Helpfulness",  "meanScore": 0.68, "weight": 3 }
        ]
      }
    ]
  }
}

Moderated-study findings

GET /api/studies/:id/findings returns prioritized findings (P0–P3) that Candor extracts from session transcripts after sessions complete. It uses the same P0–P3 scale as the CLI's findings command — see the Findings concept section above for what each priority means.

json
{
  "findings": [
    {
      "id": 42,
      "priority": "P0",
      "title": "Pricing page hides the free tier below the fold",
      "description": "4 out of 5 participants scrolled past the paid tiers...",
      "category": "information-architecture",
      "affectedFeature": "pricing",
      "timesMentioned": 4,
      "keyQuotes": ["I thought everything cost money — I was about to leave."],
      "suggestedAction": "Move the free-tier card above the comparison table.",
      "status": "open",
      "createdAt": "2026-04-12T11:05:00.000Z"
    }
  ]
}

Coverage (moderated studies only)

GET /api/studies/:id/coverage returns the themes participants have explored so far and which expected themes are still missing — useful mid-study to decide whether to keep running sessions or stop early.

Demographics (managed recruitment only)

GET /api/studies/:id/demographics works only for studies with managed recruitment. Returns one row per participant with demographic fields reported by the recruitment provider.

Account balance

GET /api/billing/balance

Returns your organization's prepaid balance in cents and whether a payment method is on file. Useful for showing a top-up prompt before creating an expensive study.

json
{
  "balanceCents": 12500,
  "hasPaymentMethod": true
}

API reference

Webhooks

Instead of polling, subscribe to events and Candor will POST them to your endpoint as they happen. Create an endpoint from the Webhooks page — you'll get back a signing secret that you should store securely.

Payload format

json
{
  "id": "evt_a1b2c3d4e5f6",
  "type": "study.completed",
  "createdAt": "2026-04-12T10:30:00.000Z",
  "data": {
    "studyId": "study_a1b2c3",
    "message": "All participants have submitted — study complete"
  }
}

Webhook bodies are intentionally minimal. Use GET /api/studies/:id to fetch the full state — this keeps payloads small and lets you process events in any order.

Event types

Events are grouped into four namespaces. Subscribe to the ones you need.

  • study.* — lifecycle transitions (launched, published, paused, completed, cancelled)
  • participant.* — participant events (joined, session_started, submitted, no_show)
  • interaction.* — clicks, media playback, responses. High volume.
  • transcript.* — per-utterance events from the AI moderator and participant. Very high volume.

Verifying the signature

Every request includes an X-Candor-Signature header with an HMAC-SHA256 of the raw body using your endpoint secret. Verify it before trusting the payload.

javascript
// Node.js
import { createHmac, timingSafeEqual } from "crypto";

function verify(rawBody, headerSignature, secret) {
  const expected = "sha256=" + createHmac("sha256", secret)
    .update(rawBody)
    .digest("hex");
  const a = Buffer.from(headerSignature);
  const b = Buffer.from(expected);
  return a.length === b.length && timingSafeEqual(a, b);
}
python
# Python
import hmac, hashlib

def verify(raw_body: bytes, header_signature: str, secret: str) -> bool:
    expected = "sha256=" + hmac.new(
        secret.encode(), raw_body, hashlib.sha256
    ).hexdigest()
    return hmac.compare_digest(header_signature, expected)

Delivery & retries

Your endpoint must respond with 2xx within 10 seconds. Non-2xx responses and timeouts are retried with exponential backoff (up to 3 retries = 4 attempts total). Persistently failing endpoints are flagged in the dashboard as Failing but not automatically disabled — you'll see the state change on the Webhooks page and can pause or delete the endpoint from there.

Webhook delivery is at-least-once. Use X-Candor-Delivery-Id to deduplicate on your side if you're doing anything non-idempotent.
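A minimal dedup sketch, with an in-memory set standing in for whatever store (with a TTL) you'd use in production:

```python
# Sketch: process each webhook delivery at most once, keyed on the
# X-Candor-Delivery-Id header. The in-memory set is for illustration;
# use a shared store with expiry in production.

seen_deliveries = set()

def handle_once(headers, process):
    delivery_id = headers["X-Candor-Delivery-Id"]
    if delivery_id in seen_deliveries:
        return False              # duplicate redelivery: skip side effects
    seen_deliveries.add(delivery_id)
    process()                     # run the idempotency-sensitive work once
    return True
```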

Other headers you'll see on each delivery:

  • X-Candor-Event — event type (e.g. study.completed)
  • X-Candor-Delivery-Id — unique delivery ID, useful for idempotency
  • X-Candor-Timestamp — when the event was emitted

API reference

Errors

All error responses use standard HTTP status codes and return a JSON body with an error field.

json
{ "error": "Study not found" }

  • 400 — Bad request — missing or invalid parameters
  • 401 — Unauthorized — missing or invalid API key
  • 402 — Payment required — insufficient account balance
  • 404 — Not found — study, endpoint, or resource does not exist
  • 422 — Unprocessable — validation failed (e.g. pre-flight check)
  • 500 — Server error — try again or contact support