ML Evaluation

Human evaluation infrastructure for ML teams. Collect RLHF preference data, detect hallucinations, evaluate instruction following, assess reasoning chains, and benchmark multimodal models. Scale from internal reviewers to hundreds of recruited evaluators.

LLM Output Comparison

You have outputs from two or more LLMs and want to know which one people prefer. Human reviewers pick winners on criteria like helpfulness, accuracy, or tone. Produces rankings with win rates and agreement scores — essential for RLHF data collection and model selection.

Participant view: A vs B with a Tie option.
Learn more about Pairwise Comparison
How you'd run it
$ candor study create --goal "compare GPT-4o vs Claude responses" \
--items "eval-pairs.csv" --task compare \
--recruit --participants 15
What you get back
Model Win Rate Ties Comparisons
Claude 61.3% 12.0% 92/150
GPT-4o 26.7% — 40/150
Claude preferred on: accuracy (71%), helpfulness (64%)
GPT-4o preferred on: conciseness (58%)
Agreement: 0.69 (substantial) · 15 evaluators
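
Want to recompute the leaderboard yourself, say after dropping an evaluator or slicing by criterion? The arithmetic is simple. A minimal Python sketch, assuming the study exports one row per comparison with model_a, model_b, and choice columns; the file and column names here are illustrative, not a documented export format.

import csv
from collections import Counter

wins, ties, total = Counter(), 0, 0
with open("comparisons_export.csv", newline="") as f:    # hypothetical export file
    for row in csv.DictReader(f):                         # assumed columns: model_a, model_b, choice
        total += 1
        if row["choice"] == "tie":
            ties += 1
        else:                                              # "a" or "b"
            wins[row["model_a"] if row["choice"] == "a" else row["model_b"]] += 1

for model, w in wins.most_common():
    print(f"{model}: {w / total:.1%} win rate ({w}/{total})")
print(f"ties: {ties / total:.1%} ({ties}/{total})")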

Generated Image Evaluation

You have AI-generated images and want human quality scores. Recruit evaluators to rate outputs from diffusion models, GANs, or image editors on realism, prompt adherence, or any criteria you define. Get per-image scores with distributions and inter-rater reliability.

Participant view: 1–5 rating scale.
Learn more about Rating Scale
How you'd run it
$ claude "rate these diffusion model outputs on realism and prompt adherence"
What you get back
Image Realism Prompt Adherence Overall
gen_001.png 4.2 3.8 4.0
gen_002.png 3.6 4.5 4.1
gen_003.png 4.8 4.7 4.8
gen_004.png 2.1 3.2 2.7
50 images · 12 evaluators · ICC: 0.81 (excellent)
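
The ICC line is the inter-rater reliability of those scores. If you want to sanity-check it on your own export, ICC(2,1), the two-way random effects, absolute agreement, single-rater form, fits in a few lines; whether candor reports exactly this variant isn't stated here, so treat the sketch below as an approximation.

def icc2_1(matrix):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.
    matrix: one row per image, one column per evaluator, complete ratings."""
    n, k = len(matrix), len(matrix[0])
    grand = sum(map(sum, matrix)) / (n * k)
    row_means = [sum(row) / k for row in matrix]
    col_means = [sum(matrix[i][j] for i in range(n)) / n for j in range(k)]
    ms_rows = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)    # between images
    ms_cols = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)    # between evaluators
    ms_err = sum((matrix[i][j] - row_means[i] - col_means[j] + grand) ** 2
                 for i in range(n) for j in range(k)) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

# Three images rated by three evaluators (illustrative numbers only):
print(round(icc2_1([[4, 5, 4], [2, 2, 3], [5, 4, 5]]), 2))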

TTS Model Benchmarking

You want to know which TTS model sounds most natural to real listeners. Recruit evaluators to compare voice samples in randomized pairs and produce a ranked leaderboard with statistical confidence and agreement metrics.

Participant view: A vs B with a Tie option.
Learn more about Pairwise Comparison
How you'd run it
$ candor study create --goal "benchmark TTS models" \
--items "tts-samples/" --task compare \
--recruit --participants 20
What you get back
Rank Model Win Rate 95% CI Elo
#1 ElevenLabs v2 72.4% [68.1, 76.7] 1247
#2 PlayHT 3.0 61.8% [57.2, 66.4] 1183
#3 OpenAI TTS-1 48.2% [43.5, 52.9] 1102
#4 Google WaveNet 17.6% [13.9, 21.3] 968
20 evaluators · 480 comparisons · κ: 0.77
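
The Elo column comes from treating every comparison as a game and nudging ratings toward each result. Exactly how candor seeds and orders those updates isn't documented here, so the sketch below is an approximation: the K-factor of 32, the 1000-point starting rating, and scoring a tie as half a win are all assumptions.

def elo_expected(r_a, r_b):
    """Probability A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_ratings(comparisons, k=32, start=1000):
    """comparisons: (model_a, model_b, score) tuples, where score is 1.0 if A won,
    0.0 if B won, and 0.5 for a tie. Ratings depend on update order."""
    ratings = {}
    for a, b, score in comparisons:
        ra, rb = ratings.setdefault(a, start), ratings.setdefault(b, start)
        ea = elo_expected(ra, rb)
        ratings[a] = ra + k * (score - ea)
        ratings[b] = rb + k * ((1 - score) - (1 - ea))
    return ratings

# Illustrative comparisons only, not the study data:
print(elo_ratings([("ElevenLabs v2", "OpenAI TTS-1", 1.0),
                   ("PlayHT 3.0", "Google WaveNet", 1.0),
                   ("ElevenLabs v2", "PlayHT 3.0", 0.5)]))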

Model Selection & RLHF

You need pairwise preference data for reward model training. Compare model outputs at scale with recruited evaluators, producing structured preference datasets ready for your RLHF pipeline. Agreement metrics let you filter noisy labels before training.

Participant view: A vs B with a Tie option.
Learn more about Pairwise Comparison
How you'd run it
$ claude "collect RLHF preference data on 50 response pairs from 20 evaluators"
What you get back
Export: preferences_study_ms7k1.jsonl
1,000 preference pairs (50 prompts × 20 evaluators)
Format: {prompt, chosen, rejected, annotator_id}
Quality metrics:
Agreement: 0.74 (substantial)
Avg response time: 34s per pair
Flagged as low-effort: 2.1% (auto-filtered)
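
That JSONL drops straight into most preference-tuning pipelines, and the per-prompt vote counts make it easy to filter further before training. A minimal sketch that keeps only prompts where at least 70% of evaluators agreed on the chosen response, assuming the {prompt, chosen, rejected, annotator_id} format above; the threshold and the output file name are my choices, not product defaults.

import json
from collections import Counter, defaultdict

votes = defaultdict(Counter)      # prompt -> how often each response was chosen
records = defaultdict(list)       # prompt -> raw preference records
with open("preferences_study_ms7k1.jsonl") as f:
    for line in f:
        rec = json.loads(line)    # {prompt, chosen, rejected, annotator_id}
        votes[rec["prompt"]][rec["chosen"]] += 1
        records[rec["prompt"]].append(rec)

kept = []
for prompt, counter in votes.items():
    winner, n = counter.most_common(1)[0]
    if n / sum(counter.values()) >= 0.7:                   # high-agreement prompts only
        kept.extend(r for r in records[prompt] if r["chosen"] == winner)

with open("preferences_filtered.jsonl", "w") as f:         # hypothetical output name
    f.writelines(json.dumps(r) + "\n" for r in kept)
print(f"kept {len(kept)} of {sum(map(len, records.values()))} preference pairs")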

Hallucination Detection

You want to measure how often your model makes things up. Evaluators flag factual errors, fabricated citations, and unsupported claims, labeling each response as grounded, partially hallucinated, or fully hallucinated. Critical for measuring factuality in production LLMs.

Participant view: single-choice labels (e.g., Positive / Neutral / Negative).
Learn more about Categorization
How you'd run it
$ candor study create --goal "flag hallucinations" \
--items "responses.csv" --task categorize \
--labels "grounded,partial,hallucinated" --recruit --participants 8
What you get back
Distribution (200 responses):
Grounded: 62.0% (124) ████████████░░░░░░░░
Partial: 23.5% (47) █████░░░░░░░░░░░░░░░
Hallucinated: 14.5% (29) ███░░░░░░░░░░░░░░░░░
Most hallucinated topics: citations (38%), dates (24%)
Fleiss' κ: 0.72 (substantial) · 8 evaluators
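
Fleiss' κ is the agreement figure above; recomputing it on a slice of the data (say, only citation-heavy prompts) takes a short function. A sketch assuming every response was labeled by the same number of evaluators; the list-of-lists input shape is my choice, not an export format.

from collections import Counter

def fleiss_kappa(labels_per_item):
    """labels_per_item: one inner list per response, holding every evaluator's label.
    Assumes the same number of evaluators labeled each response."""
    n = len(labels_per_item[0])                                    # evaluators per item
    categories = sorted({lab for item in labels_per_item for lab in item})
    counts = [[Counter(item)[c] for c in categories] for item in labels_per_item]
    N = len(counts)
    p_j = [sum(row[j] for row in counts) / (N * n) for j in range(len(categories))]
    p_e = sum(p ** 2 for p in p_j)                                 # chance agreement
    p_i = [(sum(c ** 2 for c in row) - n) / (n * (n - 1)) for row in counts]
    p_bar = sum(p_i) / N                                           # observed agreement
    return (p_bar - p_e) / (1 - p_e)

# Two responses, three evaluators each (illustrative labels only):
print(fleiss_kappa([["grounded", "grounded", "partial"],
                    ["hallucinated", "hallucinated", "hallucinated"]]))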

Instruction Following Evaluation

You want to know how well your model follows complex instructions. Recruited evaluators rate responses on constraint satisfaction, format adherence, and completeness. Essential for post-training evaluation at frontier labs.

Participant view: 1–5 rating scale.
Learn more about Rating Scale
How you'd run it
$ claude "evaluate instruction following on 30 test cases, rate 1-5"
What you get back
Constraint Mean Score Pass Rate (≥4)
Format adherence 4.3 87%
Length compliance 3.9 72%
Tone matching 4.1 81%
All constraints 3.6 63%
30 test cases · 10 evaluators · ICC: 0.76 (good)
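
The pass rate is just the share of ratings at or above 4, computed per constraint. A quick sketch, assuming a flat export with one row per (test case, evaluator, constraint) rating; the file and column names are placeholders.

import csv
from collections import defaultdict

by_constraint = defaultdict(list)
with open("instruction_ratings.csv", newline="") as f:        # hypothetical export file
    for row in csv.DictReader(f):                              # assumed columns: case_id, constraint, rating
        by_constraint[row["constraint"]].append(int(row["rating"]))

for constraint, ratings in sorted(by_constraint.items()):
    mean = sum(ratings) / len(ratings)
    pass_rate = sum(r >= 4 for r in ratings) / len(ratings)    # the "Pass Rate (>=4)" column
    print(f"{constraint}: mean {mean:.1f}, pass rate {pass_rate:.0%}")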

Reasoning Chain Assessment

You have step-by-step reasoning traces and want domain experts to evaluate them. Compare reasoning quality across model checkpoints or prompting strategies for logical correctness, completeness, and coherence — so you know what to ship.

Participant view: A vs B with a Tie option.
Learn more about Pairwise Comparison
How you'd run it
$ candor study create --goal "compare reasoning quality" \
--items "traces.csv" --task compare \
--recruit --participants 10
What you get back
Model Win Rate Logical Errors Completeness
checkpoint-7 64.2% 8% of traces 91%
checkpoint-6 35.8% 19% of traces 84%
Biggest gains: multi-step math (+22%), code logic (+18%)
Regression: common-sense reasoning (-3%)
10 expert evaluators · 40 trace pairs · κ: 0.71
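
With only 40 trace pairs, it's worth checking that a 64% vs 36% split isn't noise before you ship a checkpoint. A sign test on the per-pair majority votes is one quick way; the counts below are illustrative, and using majority votes as the unit (rather than all 400 individual judgments, which aren't independent) is my choice rather than anything candor reports.

from math import comb

# Illustrative: checkpoint-7 wins the majority vote on 25 of 39 decided pairs (1 tie excluded).
wins, decided = 25, 39
p_value = sum(comb(decided, k) for k in range(wins, decided + 1)) / 2 ** decided
print(f"win rate {wins / decided:.1%}, one-sided sign-test p = {p_value:.4f}")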