Compound AI Systems

Your pipeline has 4 stages.
Which one is failing?

You built a multi-stage AI system, and when the output is wrong you don't know which stage broke. Was it bad transcription? Irrelevant retrieval? A hallucinated conclusion? Without per-stage human eval, you're debugging a black box. Real humans evaluate each stage of your pipeline independently, so you know exactly which component to fix.

Input → Perception (accurate?) → Retrieval (relevant?) → Reasoning (correct?) → Output

Every evaluation on Candor is completed by a real person. Not an LLM. Not a synthetic label. Human judgment.

The Problem

You're optimizing blind

🔍

End-to-end testing hides per-stage failures

Your pipeline looks great on your 10 favorite test cases. But you have no idea if your retrieval is surfacing the right documents or if your LLM is reasoning correctly over the context it receives. A correct final output can mask a broken intermediate step.

👥

You're two people. You can't evaluate everything yourselves.

You and your cofounder review outputs, but you're biased — you built the prompts, you know the expected answers, you unconsciously fill in gaps. You need strangers evaluating each stage to find the failures you can't see.

🎯

You don't know which stage to improve next

Should you fine-tune your transcription model, re-rank your retrieval, or rewrite your generation prompt? Without per-stage quality metrics, you're guessing where to spend your engineering time. One Candor study per stage tells you exactly where the ceiling is.

Use Cases

One study per stage. Fix the right one.

Is your perception stage accurate?

Your pipeline starts by converting raw input into structured data — audio to text, images to data, documents to fields. Have human evaluators rate the accuracy of this first stage in isolation. Catch errors before they cascade downstream.

Participant view
1 · 2 · 3 · 4 · 5
1 = inaccurate · 5 = perfect

“Rate the accuracy of this transcription compared to the original audio.”

How you'd run it
$ claude "have 20 people rate the accuracy of these transcription outputs on a 1-5 scale"
What you get back
Accuracy ratings across 20 human evaluators:
Item                     Mean  Std  n
call_recording_041.wav   4.3   0.7  20
call_recording_042.wav   4.6   0.5  20
call_recording_043.wav   2.1   1.2  20  ⚠ high variance
call_recording_044.wav   4.5   0.6  20
call_recording_045.wav   1.8   0.9  20  ⚠ low score
2 items flagged: rater disagreement > 1.0 std dev.
Common failure: speaker overlap and accented speech.
Results written to perception_eval.json
Learn more about Rating Scale →
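The per-item statistics above are easy to recompute or extend yourself. A minimal sketch, assuming a hypothetical `perception_eval.json` layout of items with raw rating lists (the real schema may differ), that flags high-variance and low-score items the same way the report does:

```python
import json
import statistics

# Hypothetical schema -- the real perception_eval.json layout may differ.
results = json.loads("""[
  {"item": "call_recording_041.wav", "ratings": [4, 5, 4, 4, 5]},
  {"item": "call_recording_043.wav", "ratings": [1, 3, 4, 1, 2]}
]""")

flagged = {}
for r in results:
    mean = statistics.mean(r["ratings"])
    std = statistics.stdev(r["ratings"])
    flags = []
    if std > 1.0:
        flags.append("high variance")  # raters disagree: ambiguous or noisy item
    if mean < 2.5:
        flags.append("low score")      # raters agree it's bad: a real model failure
    flagged[r["item"]] = flags
    print(f"{r['item']}  mean={mean:.1f}  std={std:.1f}  {' '.join(flags)}")
```

The two flags mean different things: high variance points at ambiguous items worth re-reviewing, while a low score with low variance points at a genuine model failure.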

Is your retrieval surfacing the right context?

Your system retrieves context from a document store before generating an answer. But is it pulling the right documents? Have human evaluators look at query-document pairs and categorize each retrieved result. Find out if your retrieval is the bottleneck before you blame the LLM.

Participant view
query → doc
relevant · partial · irrelevant

“Is this retrieved document relevant to the query?”

How you'd run it
$ claude "categorize these 50 retrieval results as relevant, partial, or irrelevant for each query"
What you get back
Relevance ratings across 50 query-document pairs:
Overall distribution:
Relevant 58% | Partial 24% | Irrelevant 18%
Queries where top-3 results all irrelevant:
"side effects of metformin with renal impairment"
"contraindications for elderly patients on warfarin"
"dosage adjustment for pediatric amoxicillin"
Your retrieval is failing silently on medical queries
with compound conditions. Consider re-ranking.
Results written to retrieval_eval.json
Learn more about Categorization →
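The "top-3 all irrelevant" check in the report is a simple aggregation you can rerun on your own data. A sketch under assumed field names (query, rank, human label per judged pair; the real `retrieval_eval.json` schema may differ):

```python
from collections import Counter, defaultdict

# Hypothetical schema: (query, rank, label) per judged pair.
labels = [
    ("metformin + renal impairment", 1, "irrelevant"),
    ("metformin + renal impairment", 2, "irrelevant"),
    ("metformin + renal impairment", 3, "irrelevant"),
    ("warfarin in elderly patients", 1, "relevant"),
    ("warfarin in elderly patients", 2, "partial"),
    ("warfarin in elderly patients", 3, "irrelevant"),
]

# Overall label distribution across all judged pairs
dist = Counter(label for _, _, label in labels)

# Queries whose top-3 results were all judged irrelevant:
# retrieval is failing outright before the LLM ever sees context.
top3 = defaultdict(list)
for query, rank, label in labels:
    if rank <= 3:
        top3[query].append(label)
failing = sorted(q for q, ls in top3.items() if all(l == "irrelevant" for l in ls))

print(dist)
print("failing queries:", failing)
```

Queries in the `failing` list are the ones where no amount of prompt engineering will help, because the model never receives usable context.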

Is your reasoning stage correct?

Your LLM receives context and produces a conclusion — a summary, a classification, a detected contradiction, a recommended action. Have human evaluators judge whether the conclusion is correct given the provided context. Pairwise comparison works well here: show two different model outputs and ask which is more accurate.

Participant view
A vs B · Tie

“Given the source context, which conclusion is more accurate? Explain your reasoning.”

How you'd run it
$ claude "compare these two model outputs for correctness given the source context — collect winner + rationale"
What you get back
Pairwise correctness judgments (30 human comparisons):
Model A preferred 63% (19/30)
Model B preferred 27% (8/30)
Tie 10% (3/30)
Top rationale themes:
"Model A correctly identified the contradiction" (12x)
"Model B cited the wrong supporting evidence" (7x)
"Model A hedged appropriately on uncertain cases" (5x)
Krippendorff's α = 0.74 (acceptable agreement)
Results written to reasoning_eval.json
Learn more about Pairwise Comparison →
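Tallying pairwise verdicts into preference shares is a one-liner once you have the winner column. A minimal sketch, with verdict labels assumed and counts mirroring the example report above (19 A, 8 B, 3 ties):

```python
from collections import Counter

# Hypothetical: one verdict per human comparison.
verdicts = ["A"] * 19 + ["B"] * 8 + ["tie"] * 3

tally = Counter(verdicts)
n = len(verdicts)
# Preference share per outcome, rounded to whole percent
summary = {v: round(100 * tally[v] / n) for v in ("A", "B", "tie")}
print(summary)
```

For agreement statistics like Krippendorff's α you'd want a dedicated library rather than a hand-rolled tally.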

Does the full pipeline actually help your user?

Once you've evaluated individual stages, test the complete experience. Put real domain experts — lawyers, doctors, analysts, whoever your actual users are — in front of the full pipeline output and run an AI-moderated voice session. The moderator probes on trust, usefulness, and moments of confusion.

Participant view
LIVE SESSION
legal-review.app/contradictions
Moderator: How would you use this output in your workflow?
Participant: I'd need to see the source docs side by side—
Moderator: What would you need to trust this flag?
AI Moderator
3:12
How you'd run it
$ claude "run 5 voice interview sessions with attorneys reviewing our contradiction detection output"
What you get back
Themes across 5 human voice sessions (attorneys):
Trusted: contradiction flagging (4/5)
Participants found the flags accurate and useful
for prioritizing document review.
Friction: source attribution (3/5)
"I need to see the source documents side by side,
not summarized. I can't cite a summary."
Unexpected: workflow integration (3/5)
Attorneys wanted to export flags directly to their
case management system, not copy-paste.
Full transcripts and theme analysis attached.
Results written to e2e_eval.json
Learn more about Voice Interviews →
Methodology

The eval stack for compound AI systems

Compound AI systems need compound evaluation. You wouldn't ship a traditional software product by only testing the final output — you test units, integration, and end-to-end. AI pipelines deserve the same discipline. Candor makes it possible to run per-stage human eval without building internal tooling for each step. One command per stage. Results in hours, not weeks.
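Per-stage eval only works if you can get at each stage's output. One way to make that cheap is to snapshot every intermediate result as the pipeline runs; a minimal sketch (the stage functions here are hypothetical stand-ins, not a real pipeline):

```python
# Run the pipeline once, snapshot every intermediate output, and you have
# one artifact per stage ready to send out for human eval.
def run_pipeline(raw_input, stages):
    trace = {"input": raw_input}
    value = raw_input
    for name, stage in stages:
        value = stage(value)
        trace[name] = value  # per-stage snapshot for independent eval
    return value, trace

stages = [
    ("perception", str.strip),                    # stand-in for audio -> text
    ("retrieval", lambda text: [text]),           # stand-in for text -> documents
    ("reasoning", lambda docs: f"summary of {len(docs)} doc(s)"),
]
output, trace = run_pipeline("  raw transcript  ", stages)
print(trace)
```

Each entry in `trace` is exactly the artifact a per-stage study needs: transcriptions for the perception study, query-document pairs for the retrieval study, context-conclusion pairs for the reasoning study.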

Find the weak link in your pipeline

Fix the right stage. Ship with confidence.

$ curl -fsSL https://candor.sh | bash
Or talk to us about your use case →