# Evaluating AI agents

This guide walks through using Candor to evaluate the output of an AI agent against a rubric. It covers how to decompose a multi-criteria eval into a single study, how to pick the right task type, how to write a rubric that produces reliable labels, how content renders to your labelers, and how to preview cost before anything charges.

The worked example is a research assistant agent that reads source documents and writes summaries — but the same pattern applies to any multi-source AI agent: research assistants, customer support triage bots, GTM tools, code review agents, or anything else where you want human judgement on whether the output is good.

**Scope.** This guide focuses on *item-based* studies (your agent produced some outputs, and you want humans to label them). For moderated interviews where humans talk to a Candor AI moderator about a product, see the [Studies](#studies) concept section above.

## Breaking down a complex eval

Most real agent evals are multi-dimensional. You don't just want to know *"is this output good?"* — you want to know *why* it's good or bad across several orthogonal axes. Take a research assistant that reads a set of source documents and produces a summary. Four reasonable things to measure:

- Did the agent retrieve the right source documents for the question?
- For every claim in the output, is the source correctly attributed?
- Is the summary faithful to what the sources actually say?
- Does the output surface the most important findings — or bury them?

The wrong instinct here is to run four separate studies, one per question. That costs 4× the labeler time (each labeler has to re-read the same output four times) and gives you four disjoint score streams that are hard to join.

The right pattern is **one item per agent output, scored along multiple dimensions in a single pass**. Each labeler sees one output, answers all four questions on it at once, and moves on. In Candor this is a `scorecard` task with four `criteria`.

> [!INFO]
> If your dimensions are all orthogonal (measuring different things), fold them into one scorecard. If they aren't — if two of them are really asking the same thing — drop one.

## Picking a task type

Candor supports five item-based task types. The choice is not about the mechanic — it's about the *question* you want answered.

- *"Is this output good across several dimensions?"* — `scorecard` — Multiple weighted criteria, behavior-anchored levels, a weighted overall score. The right tool for almost every agent-eval scenario.
- *"Is version A of my agent better than version B?"* — `compare (pairwise)` — Participant sees two outputs side-by-side, picks the winner. Use when absolute judgement is hard but relative is easy — e.g. comparing two model versions, two prompts, or two generation strategies.
- *"Which category does this output fall into?"* — `label` — Participant picks one label from a fixed set. Good for taxonomies (e.g. intent classification, sentiment, failure-mode tagging).
- *"On a 1–5 scale, how good is this?"* — `rate` — Single-dimension rating. Simpler than a scorecard, but you lose the per-dimension breakdown. Use only when you genuinely only care about one axis.
- *"What did you notice? Freeform."* — `describe (free text)` — Participant writes an open-ended response. Great as a companion task to catch failure modes you didn't anticipate in your rubric — but hard to aggregate, so don't use it as your primary signal.

For the research-assistant example — and for essentially any multi-criteria eval — **scorecard** is the right choice. The rest of this guide assumes scorecard.

A quick word on pairwise, since it's the least obvious one: pairwise is useful when you're comparing two versions of something (two model variants, two prompt templates, two retrieval strategies) and absolute quality is subjective. People are much better at saying *"A is better than B"* than at saying *"A is a 4 out of 5"*. But pairwise only gives you relative rankings — it can't tell you whether any of the options are actually good.

## Designing a rubric

A scorecard is a set of `criteria`. Each criterion has a `name`, a `weight`, and an ordered list of `levels` from worst to best. Good rubrics produce reliable labels; sloppy rubrics produce inter-rater disagreement and findings you can't act on.

### Keep dimensions to 3–5

More than five and labelers get fatigued; their later answers get sloppier. Fewer than three and you lose the whole point of a multi-dimensional eval. If you catch yourself wanting six dimensions, usually two of them are measuring the same thing.

### Use behavior-anchored levels

A level description should tell a labeler *exactly* what they're looking for — ideally something they can observe or verify without having to make a judgment call. Compare:

```text
BAD:  "Summary is accurate"
GOOD: "Summary correctly names every key point in the source documents"

BAD:  "Good attribution"
GOOD: "Every claim in the output is traceable to a specific source document"

BAD:  "Useful"
GOOD: "Surfaces at least 3 actionable findings with concrete next steps"
```

When two labelers look at the same output and pick different levels, it's almost always because the level wording was ambiguous — not because the output was genuinely borderline. Invest in the wording.

### Use 4 levels when you can

An even number of levels forces labelers off the fence. Three or five levels give them a comfortable middle option ("Satisfactory", "Neutral") that they'll reach for whenever they're unsure — which means your middle bucket absorbs noise and tells you nothing. Four levels makes them pick a side.

### Weights are relative

Weights only matter in ratio to each other. `[5, 3, 3, 1]` and `[10, 6, 6, 2]` produce identical overall scores. Use small integers and pick weights that reflect how you'd actually explain the decision to a stakeholder: "accuracy is way more important than formatting" becomes `weight: 5` for accuracy and `weight: 1` for formatting.
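A quick sketch in plain Python (local math, not part of the API) showing that only the ratios matter: both weight vectors produce the same weighted average.

```python
def overall_score(means, weights):
    """Weighted average of per-dimension mean scores (each in 0.0-1.0)."""
    total = sum(weights)
    return sum(m * w for m, w in zip(means, weights)) / total

means = [0.9, 0.5, 0.8, 0.6]  # hypothetical per-dimension meanScore values

a = overall_score(means, [5, 3, 3, 1])
b = overall_score(means, [10, 6, 6, 2])
assert abs(a - b) < 1e-12  # identical: weights only matter in ratio
```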

Here's the criteria JSON for the research-assistant example:

```json
[
  {
    "name": "Source retrieval",
    "weight": 4,
    "levels": [
      "Wrong sources entirely",
      "Missing critical sources",
      "Most sources retrieved with minor gaps",
      "All expected sources retrieved"
    ]
  },
  {
    "name": "Source attribution",
    "weight": 3,
    "levels": [
      "No attribution at all",
      "Some claims cited, others unsupported",
      "Most claims cited, one or two wrong",
      "Every claim correctly attributed to its source"
    ]
  },
  {
    "name": "Summary accuracy",
    "weight": 5,
    "levels": [
      "Inaccurate or fabricated details",
      "Partial or selective — omits key points",
      "Mostly accurate with minor detail errors",
      "Summary correctly names every key point in the sources"
    ]
  },
  {
    "name": "Findings utility",
    "weight": 3,
    "levels": [
      "Misleading findings that would confuse the reader",
      "Generic summary with no surfaced findings",
      "Some findings surfaced, no prioritization",
      "Clearly prioritized findings with next-step guidance"
    ]
  }
]
```

Note that levels are ordered *worst to best*, and accuracy has the highest weight because this team cares most about whether they can trust the summary. Your weights should reflect your own priorities.
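Before sending criteria over the wire, a lightweight local lint catches the common rubric mistakes. This helper is a sketch of this guide's advice (3-5 dimensions, positive integer weights, distinct level wording), not a Candor API; the worst-to-best ordering itself can't be checked mechanically.

```python
def lint_criteria(criteria):
    """Return a list of rubric problems (empty list = looks fine)."""
    problems = []
    if not 3 <= len(criteria) <= 5:
        problems.append("aim for 3-5 dimensions")
    for c in criteria:
        name = c.get("name") or "<unnamed>"
        if not c.get("name"):
            problems.append("criterion missing a name")
        w = c.get("weight")
        if not isinstance(w, int) or w <= 0:
            problems.append(f"{name}: weight must be a positive integer")
        levels = c.get("levels", [])
        if len(levels) < 2:
            problems.append(f"{name}: needs at least 2 levels")
        if len(levels) != len(set(levels)):
            problems.append(f"{name}: duplicate level wording")
    return problems
```

Run it on the JSON above before every create call; an empty list means the structure is sound, and anything else is a rubric bug you'd otherwise pay labelers to discover.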

## Content formats — what labelers actually see

Before you prepare items, you need to know what your labelers will see on screen. Each item has three display slots:

- `label` — short text. **Plain text only — no Markdown or HTML.**
- `assetUrl` + `mimeType` — media. Supports `image/*`, `audio/*`, and `video/*` only. **Not HTML, not text files.**
- `metadata.description` — optional small-gray caption rendered below the main content. Good for subtitles like "Q3 2026 run — v2.4 of the agent".

When you set both `label` and `assetUrl`, both render — the media shows prominently and the label renders as a caption right next to it. It's not one or the other.

Here's exactly what labelers see for each combination:

**Image + label + description**

```diagram
  ┌──────────────────────────────┐
  │        ┌──────────┐          │
  │        │          │          │
  │        │   IMAGE  │   ← assetUrl (image/*)
  │        │          │          │
  │        └──────────┘          │
  │       item.label             │   ← small caption below
  │    metadata.description      │   ← smaller gray caption
  │      View product →          │   ← if metadata.url
  └──────────────────────────────┘
```

**Text only (no assetUrl)**

```diagram
  ┌──────────────────────────────┐
  │  ┌────────────────────────┐  │
  │  │      item.label        │  │   ← in a gray bordered box
  │  └────────────────────────┘  │
  │    metadata.description      │
  └──────────────────────────────┘
```

Text-only items are fine for short strings (a headline, a single sentence, a filename) but not for anything longer. There's no wrapping control, no formatting, no line breaks — just a single styled span inside a gray box. A long paragraph will render as one ugly run-on.

**Audio**

```diagram
  ┌──────────────────────────────┐
  │        item.label            │   ← label above the player
  │  ┌────────────────────────┐  │
  │  │  ▶ ═══════════  0:32   │  │   ← native <audio controls>
  │  └────────────────────────┘  │
  │    metadata.description      │
  └──────────────────────────────┘
```

**Video**

```diagram
  ┌──────────────────────────────┐
  │  ┌────────────────────────┐  │
  │  │                        │  │
  │  │       VIDEO            │  │   ← native <video controls>
  │  │                        │  │
  │  └────────────────────────┘  │
  │       item.label             │   ← caption below
  │    metadata.description      │
  └──────────────────────────────┘
```

**Pairwise (two items side-by-side)**

```diagram
  ┌──────────────────────────────────────────────┐
  │  Which one is better?                        │
  │  ┌────────────┐       ┌────────────┐         │
  │  │  Option A  │       │  Option B  │         │   ← subtitle
  │  │            │       │            │         │
  │  │  (item)    │       │  (item)    │         │   ← same ItemRenderer
  │  │            │       │            │         │
  │  └────────────┘       └────────────┘         │
  │   ( A is better )  ( Tie )  ( B is better )  │
  └──────────────────────────────────────────────┘
```

Pairwise is just *ItemRenderer* rendered twice in a grid with "Option A" / "Option B" labels on top. Whatever an item looks like on its own, it looks the same inside a pairwise comparison.

### Rendering long-form text output

This is the practical gotcha for agent evals: your agent produces long structured text, and the item renderer has no way to show it cleanly. There is no HTML path, no Markdown path, no long-text path.

The reliable workaround is to **render your agent's output to HTML yourself, screenshot the rendered page, host the PNG, and pass the PNG as an `assetUrl`**. You get full control over typography, syntax highlighting, sections, and wrapping, and it renders exactly the same way to every labeler. Use `label` as a short caption identifying which run/version the screenshot came from.
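One way to script that pipeline. The HTML-rendering half below is a minimal, self-contained sketch; the screenshot half assumes Playwright is available (any headless browser that can dump a full-page PNG works just as well) and is shown in comments.

```python
import html
import pathlib

# Minimal page template -- swap in whatever typography your team prefers.
PAGE = """<!doctype html>
<html><head><style>
  body {{ max-width: 720px; margin: 2rem auto; font: 16px/1.5 system-ui; }}
  pre  {{ background: #f6f8fa; padding: 1rem; overflow-x: auto; }}
</style></head><body><h1>{title}</h1><pre>{body}</pre></body></html>"""

def render_output(title: str, agent_output: str, out: str = "item.html") -> str:
    """Write the agent output to a styled, escaped HTML page; return the path."""
    doc = PAGE.format(title=html.escape(title), body=html.escape(agent_output))
    pathlib.Path(out).write_text(doc, encoding="utf-8")
    return out

# Screenshot step -- assumes Playwright (`pip install playwright`):
#
#   from playwright.sync_api import sync_playwright
#   with sync_playwright() as p:
#       page = p.chromium.launch().new_page()
#       page.goto(f"file://{pathlib.Path('item.html').resolve()}")
#       page.screenshot(path="item.png", full_page=True)
#
# then upload item.png to your CDN and pass that URL as assetUrl.
```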

> [!WARNING]
> The item label field does not support Markdown or HTML. If your content has any formatting — headings, bullets, code blocks, tables — you must screenshot it before uploading, or labelers will see a run-on string in a gray box.

## Hosting your assets

Candor does not proxy your media. Whatever URL you pass as `assetUrl` must be publicly reachable — from both Candor's servers and every labeler's browser. There are two fetch paths to keep in mind.

### Server-side probing (audio/video only)

When you create a study with audio or video items, Candor's server immediately calls `fetch(assetUrl)` to measure duration. This feeds the reward estimator so the auto-calculated payout matches the actual task length. If the URL isn't reachable from Candor's servers, the probe silently fails and your reward estimate will be off — but the study is still created. Image-only studies skip this step entirely.

### Client-side loading (all media)

When a labeler actually takes the task, their browser loads assets directly via `<img src=...>`, `<audio><source>`, and `<video><source>`. There is no proxy. If the URL requires auth headers, a session cookie, a VPN, or is behind your corporate firewall, the labeler sees a broken-image icon and your data for that item is garbage.

> [!INFO]
> Quick check: can you open the asset URL in an incognito browser window with no cookies and no VPN? If yes, labelers can see it. If no, they can't.
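The same check can be scripted: an unauthenticated HEAD request with no cookies, plus a test that the response is a media type the renderer supports. A stdlib-only sketch (the helper names are mine, not Candor's):

```python
from urllib.error import URLError
from urllib.request import Request, urlopen

SUPPORTED = ("image/", "audio/", "video/")

def renderable(ctype: str) -> bool:
    """True if the item renderer can display this Content-Type."""
    return ctype.startswith(SUPPORTED)

def check_asset(url: str, timeout: float = 10.0) -> tuple[bool, str]:
    """Return (ok, reason). No cookies, no auth: the same view a labeler gets."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as resp:
            ctype = resp.headers.get("Content-Type", "")
    except URLError as e:
        return False, f"unreachable: {e.reason}"
    if not renderable(ctype):
        return False, f"unsupported Content-Type: {ctype or '<none>'}"
    return True, ctype
```

Run it over every `assetUrl` in your items list before creating the study; a `text/html` Content-Type is the classic Google Drive share-link symptom.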

### Practical hosting choices

- **Public CDN path** — simplest and most reliable. Cloudflare, CloudFront, Bunny, Fastly, anything that returns your file on a plain GET.
- **S3 presigned URLs** — work fine, but pick an expiration longer than your study duration. A study can run for days; presign for at least 7 days.
- **Google Drive / Dropbox share links** — often don't work, because the share URL redirects through a preview page that `<img>` can't render. If you must use them, find the direct-download URL and use that instead.
- **Localhost / intranet / VPN-only URLs** — will never work. Candor's servers can't reach them and neither can outside labelers.

> [!WARNING]
> Screenshots of your agent's output may contain sensitive data (customer names, internal documents, API keys that leaked into logs). Sanitize before you upload, or keep the study internal by using `platform: "direct"` and sharing only with your team. Once a URL is in an `assetUrl`, anyone who receives a participant link can open it.

## Getting reliable labels

A scorecard is only as good as the people filling it out. Agent evals are usually specialist tasks — "is this retrieval correct?" or "is this attribution complete?" requires someone who understands the domain. A crowd worker recruited in 10 minutes almost certainly can't answer these correctly.

For agent evals, the right default is **direct-link studies labeled by your own team**. Set `platform: "direct"` in the create payload, then share the resulting URLs with a small group of domain experts — engineers, product folks, or whoever knows what a good output looks like. Candor's managed recruitment is better for broader-audience questions (does a marketing message resonate?) than for technical quality judgements.

### How many labelers per item?

The `participants` field controls redundancy — how many people see each item. One labeler is cheapest but gives you zero agreement signal: if someone makes a mistake, you'll never notice. Three labelers is the de facto standard — enough to compute agreement and catch outliers without being wildly expensive. Five or more for decisions that really matter (model launch / kill calls).

A sensible rollout is: start with 1–2 internal labelers for a dry run, make sure the rubric is sane, then scale up to 3 labelers for the real evaluation.

## End-to-end walkthrough

Here's the whole flow for the research-assistant example — from rubric to results — using real API calls.

### Step 1: Prepare items

For each agent output you want evaluated, render it to HTML (whatever rendering your agent already produces), screenshot it to a PNG, and upload the PNG to a public CDN. You end up with one URL per output. The `label` should be a short identifier the labeler can reference if they need to report a problem.

### Step 2: Create a draft study

```bash
curl https://candor.sh/api/studies \
  -H "Authorization: Bearer $CANDOR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "goal": "Evaluate research assistant summary quality",
    "task": "scorecard",
    "platform": "direct",
    "participants": 3,
    "items": [
      {
        "label": "Run 2026-04-12 · query #1",
        "assetUrl": "https://cdn.example.com/evals/run_2026-04-12/q1.png",
        "mimeType": "image/png",
        "metadata": { "description": "Agent v2.4 — 6 sources retrieved" }
      },
      {
        "label": "Run 2026-04-12 · query #2",
        "assetUrl": "https://cdn.example.com/evals/run_2026-04-12/q2.png",
        "mimeType": "image/png"
      }
    ],
    "criteria": [
      { "name": "Source retrieval",   "weight": 4,
        "levels": ["Wrong sources entirely", "Missing critical sources",
                   "Most sources retrieved with minor gaps",
                   "All expected sources retrieved"] },
      { "name": "Source attribution", "weight": 3,
        "levels": ["No attribution at all", "Some claims cited, others unsupported",
                   "Most claims cited, one or two wrong",
                   "Every claim correctly attributed to its source"] },
      { "name": "Summary accuracy",   "weight": 5,
        "levels": ["Inaccurate or fabricated details",
                   "Partial or selective — omits key points",
                   "Mostly accurate with minor detail errors",
                   "Summary correctly names every key point in the sources"] },
      { "name": "Findings utility",   "weight": 3,
        "levels": ["Misleading findings that would confuse the reader",
                   "Generic summary with no surfaced findings",
                   "Some findings surfaced, no prioritization",
                   "Clearly prioritized findings with next-step guidance"] }
    ]
  }'
```

A few things to notice about that payload: `task: "scorecard"` and `criteria` go together; `platform: "direct"` keeps this study internal (no external recruitment); `participants: 3` gives you three labelers per item for agreement signal.
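If you drive study creation from code rather than curl, assembling the payload in one place keeps items and criteria in sync across runs. A stdlib-only sketch using the endpoint and fields shown above; `create_draft` performs the actual POST, so only call it with a real key.

```python
import json
from urllib.request import Request, urlopen

API = "https://candor.sh/api/studies"

def scorecard_payload(goal, items, criteria, participants=3):
    """Assemble a direct-link scorecard draft payload."""
    return {
        "goal": goal,
        "task": "scorecard",
        "platform": "direct",   # internal labelers, no recruitment
        "participants": participants,
        "items": items,
        "criteria": criteria,
    }

def create_draft(payload, api_key):
    """POST the draft; the response carries study, items, and estimate."""
    req = Request(API, data=json.dumps(payload).encode(),
                  headers={"Authorization": f"Bearer {api_key}",
                           "Content-Type": "application/json"})
    with urlopen(req) as resp:
        return json.load(resp)
```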

### Step 3: Inspect the estimate

The create response gives you everything you need to decide whether to go ahead. Nothing has charged yet — this is a draft.

```json
{
  "study": {
    "id": "study_a1b2c3",
    "name": "Evaluate research assistant summary quality",
    "status": "draft",
    "platform": "direct",
    ...
  },
  "items": [
    { "id": 101, "label": "Run 2026-04-12 · query #1" },
    { "id": 102, "label": "Run 2026-04-12 · query #2" }
  ],
  "estimate": {
    "totalTasks": 2,
    "totalAssignments": 1,
    "estimatedCostCents": 180,
    "totalCostCents": 252,
    "estimatedMinutes": 3
  },
  "message": "Study created in draft mode. Run candor study approve study_a1b2c3 to start."
}
```

- **`totalTasks`** *(number)* — How many individual item scorings will happen. `items.length × participants` gives the upper bound; Candor may bundle multiple tasks into one assignment.
- **`totalAssignments`** *(number)* — How many distinct labeler sessions. Divided by your `participants` field, this tells you how many share URLs the approve step will give you.
- **`estimatedCostCents`** *(number)* — Lower-bound cost estimate, mostly useful as context. Prefer `totalCostCents` when budgeting.
- **`totalCostCents`** *(number)* — **This is the number you actually pay.** For `platform: "direct"` item studies with internal labelers, this is typically very small — no recruitment fees apply.
- **`estimatedMinutes`** *(number)* — How long Candor expects a labeler to take per assignment. Useful for setting expectations with your team.
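Because nothing has charged yet, a budget gate on the draft response is cheap insurance: approve only when `totalCostCents` is within bounds, otherwise delete and iterate. A minimal sketch of that decision:

```python
def decide(estimate: dict, budget_cents: int) -> str:
    """Return 'approve' or 'delete' based on the draft's total cost.

    Uses totalCostCents (the amount actually charged), not the
    lower-bound estimatedCostCents.
    """
    return "approve" if estimate["totalCostCents"] <= budget_cents else "delete"

decide({"totalCostCents": 252}, budget_cents=500)   # -> 'approve'
decide({"totalCostCents": 252}, budget_cents=100)   # -> 'delete'
```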

### Step 4: Preview what labelers will see

Before you approve, open the study in a browser as if you were a labeler:

```text
https://candor.sh/study/preview/study_a1b2c3
```

Walk through the task end-to-end, check that your asset URLs load, that the rubric reads the way you expected, and that the levels are unambiguous. This works on drafts and doesn't consume a real assignment. See [Previewing a task](#guide-preview) for the programmatic flow if you want to embed the preview URL in your own tooling.

### Step 5: Approve, or delete and retry

If the estimate looks fine and the preview checks out, approve the study. **This is the only endpoint that debits your balance.** For direct-link studies it costs nothing at approve time; for managed recruitment it deducts the full `totalCostCents`.

```bash
curl -X POST https://candor.sh/api/studies/study_a1b2c3/approve \
  -H "Authorization: Bearer $CANDOR_KEY"
```

If the estimate is wrong, too expensive, or you want to tweak the rubric, delete the draft and create a new one. Drafts are free to create and destroy — iterate as much as you want before approving.

```bash
curl -X DELETE https://candor.sh/api/studies/study_a1b2c3/delete \
  -H "Authorization: Bearer $CANDOR_KEY"
```

### Step 6: Share links with labelers

For direct-link studies, the approve response includes a `shareUrls` array — one URL per labeler slot. Send each URL to someone on your team; they follow the link, complete the tasks, and submit. No accounts, no signup, no Candor-side recruitment.

```json
{
  "message": "Study ready! Share these links with your team.",
  "shareUrls": [
    "https://candor.sh/study/a/asgn_f0e1d2c3b4a5",
    "https://candor.sh/study/a/asgn_a5b4c3d2e1f0",
    "https://candor.sh/study/a/asgn_9a8b7c6d5e4f"
  ],
  ...
}
```

### Step 7: Read the results

Once labelers have submitted, `GET /api/studies/:id/results` returns a per-item, per-dimension breakdown. Each item shows its weighted overall score and the per-dimension detail.

```json
{
  "status": "completed",
  "results": {
    "items": [
      {
        "label": "Run 2026-04-12 · query #1",
        "overallWeightedScore": 0.72,
        "totalResponses": 3,
        "dimensions": [
          {
            "name": "Source retrieval",
            "weight": 4,
            "meanScore": 0.92,
            "levelDistribution": {
              "All expected sources retrieved": 2,
              "Most sources retrieved with minor gaps": 1
            }
          },
          {
            "name": "Source attribution",
            "weight": 3,
            "meanScore": 0.50,
            "levelDistribution": {
              "Some claims cited, others unsupported": 2,
              "Most claims cited, one or two wrong": 1
            }
          },
          {
            "name": "Summary accuracy",
            "weight": 5,
            "meanScore": 0.83,
            "levelDistribution": {
              "Summary correctly names every key point in the sources": 2,
              "Mostly accurate with minor detail errors": 1
            }
          },
          {
            "name": "Findings utility",
            "weight": 3,
            "meanScore": 0.58,
            "levelDistribution": {
              "Some findings surfaced, no prioritization": 2,
              "Clearly prioritized findings with next-step guidance": 1
            }
          }
        ]
      }
    ]
  }
}
```

Three things to know about these numbers:

- `meanScore` is normalized to **0.0 – 1.0**, not 0–5. A 4-level rubric with unanimous top picks gives 1.0; unanimous worst gives 0.0.
- `overallWeightedScore` is the weighted average of all `meanScore` values across dimensions, using the `weight` field. It's the single number you'll quote in dashboards.
- `levelDistribution` is a raw count of how many labelers picked each level — useful for spotting disagreement (a 1/1/1 split across three labelers is a flashing red light).
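That flashing-red-light check is easy to automate. Modal agreement, the fraction of labelers who picked the most common level, flags dimensions whose wording needs work. A small local sketch over the `levelDistribution` shape shown above:

```python
def modal_agreement(level_distribution: dict) -> float:
    """Fraction of labelers who picked the modal level (1.0 = unanimous)."""
    counts = list(level_distribution.values())
    return max(counts) / sum(counts)

def flag_disagreement(dimensions, threshold=0.5):
    """Names of dimensions where no level got a majority: rewording candidates."""
    return [d["name"] for d in dimensions
            if modal_agreement(d["levelDistribution"]) <= threshold]

modal_agreement({"A": 2, "B": 1})          # 2 of 3 agree: soft consensus
modal_agreement({"A": 1, "B": 1, "C": 1})  # 1/1/1 split: flashing red light
```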

In the example above, the agent is strong on retrieval and accuracy but weak on attribution and prioritizing findings. That's an actionable signal: fix the attribution chain and the findings-ranking step, then re-run with a new batch of items to see if the scores improve.

## Previewing a task before you share it

Before you approve a study and send links to labelers, you almost always want to open it yourself and walk through the task the way a labeler would. Candor gives you a preview URL for this — it works on drafts, so you can see exactly what the task looks like *before* anything charges and without using up a real assignment slot.

### The easy path: one URL, no API calls

Open this in any browser:

```text
https://candor.sh/study/preview/study_a1b2c3
```

That page creates a preview assignment, redirects you to the real participant UI, and appends `?preview=true` so submissions aren't recorded. You see every task in order — the full rubric, the actual item renders, the submit button, everything. Share the URL with a teammate and they can review without needing a Candor account.

### The programmatic path

If you want to embed the preview URL in your own tooling (e.g. show it next to the estimate in a custom dashboard), call the preview endpoint directly:

```bash
curl -X POST https://candor.sh/api/studies/study_a1b2c3/preview \
  -H "Authorization: Bearer $CANDOR_KEY"
```

The response gives you a relative URL you can prefix with the API base:

```json
{
  "type": "items",
  "url": "/study/a/preview_9f8e7d6c5b4a",
  "assignmentId": "preview_9f8e7d6c5b4a"
}
```

Append `?preview=true` and hand it to whoever needs to see it. Moderated studies return `{ type: "moderated", url, sessionToken }` instead — the URL drops you into a full moderated session so you can walk the interview script end-to-end.
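Joining that response into a shareable link is a one-liner, but the `?preview=true` suffix is easy to forget in tooling. A tiny helper, assuming `https://candor.sh` as the base to prefix:

```python
def preview_link(resp: dict, base: str = "https://candor.sh") -> str:
    """Build the full browser URL from a preview-endpoint response."""
    return f"{base}{resp['url']}?preview=true"

preview_link({"type": "items", "url": "/study/a/preview_9f8e7d6c5b4a"})
# -> 'https://candor.sh/study/a/preview_9f8e7d6c5b4a?preview=true'
```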

### What preview does and doesn't do

- Works on drafts — you don't need to approve first.
- Shows the exact same *ItemRenderer* your real labelers will see, loading your real asset URLs. If something renders weird in preview, it will render weird for labelers too.
- Creates a real assignment row tagged `preview_*`. These are filtered out of `progress`, `results`, and the activity feed, so they won't pollute your real data.
- Multiple calls create multiple preview assignments — harmless but they accumulate.

To clean up, delete them whenever you want:

```bash
curl -X DELETE https://candor.sh/api/studies/study_a1b2c3/preview \
  -H "Authorization: Bearer $CANDOR_KEY"
```

This removes every `preview_*` assignment (and its tasks) for that study. Real participant data is never touched.

## Previewing cost before you commit

A recap of the money flow, so you can build it into whatever tooling you use to drive Candor:

- `POST /api/studies` — **always free.** Creates a draft, returns `estimate`. Nothing charges.
- `DELETE /api/studies/:id/delete` — also free. Deletes the draft; no residual state.
- `POST /api/studies/:id/approve` — **this is the only endpoint that debits your balance.** For direct-link item studies it costs nothing (platform-internal); for managed recruitment it deducts the full `totalCostCents` you saw in the estimate.
- `POST /api/studies/:id/publish` — also free. Just flips the study from ready-to-publish into active recruitment on the provider.

The safe iteration loop is:

```diagram-flow
create draft  →  inspect estimate  →  delete if too expensive
              ↑                                      │
              └──── adjust rubric / items ───────────┘

create draft  →  inspect estimate  →  approve  (money moves)
```

You can create and destroy as many drafts as you want while tuning your rubric — nothing charges until you explicitly approve. That's the whole cost-preview story.
