ML Evaluation

Human evaluation infrastructure for ML teams. Collect RLHF preference data, detect hallucinations, evaluate instruction following, assess reasoning chains, and benchmark multimodal models. Scale from internal reviewers to hundreds of recruited evaluators.

LLM Output Comparison

You have outputs from two or more LLMs and want to know which one people prefer. Human reviewers pick winners on criteria like helpfulness, accuracy, or tone. Produces rankings with win rates and agreement scores — essential for RLHF data collection and model selection.

Participant view: A vs B with a Tie option.
Learn more about Pairwise Comparison
How you'd run it
$ candor study create --goal "compare GPT-4o vs Claude responses" \
--items "eval-pairs.csv" --task compare \
--recruit --participants 15
What you get back
Model Win Rate Ties Comparisons
Claude 61.3% 12.0% 92/150
GPT-4o 26.7% — 40/150
Claude preferred on: accuracy (71%), helpfulness (64%)
GPT-4o preferred on: conciseness (58%)
Agreement: 0.69 (substantial) · 15 evaluators
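
Want to recompute the leaderboard yourself, say after dropping an evaluator or slicing by criterion? The arithmetic is simple. A minimal Python sketch, assuming the study exports one row per comparison with model_a, model_b, and choice columns; the file and column names here are illustrative, not a documented export format.

import csv
from collections import Counter

wins, ties, total = Counter(), 0, 0
with open("comparisons_export.csv", newline="") as f:    # hypothetical export file
    for row in csv.DictReader(f):                         # assumed columns: model_a, model_b, choice
        total += 1
        if row["choice"] == "tie":
            ties += 1
        else:                                              # "a" or "b"
            wins[row["model_a"] if row["choice"] == "a" else row["model_b"]] += 1

for model, w in wins.most_common():
    print(f"{model}: {w / total:.1%} win rate ({w}/{total})")
print(f"ties: {ties / total:.1%} ({ties}/{total})")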

Generated Image Evaluation

You have AI-generated images and want human quality scores. Recruit evaluators to rate outputs from diffusion models, GANs, or image editors on realism, prompt adherence, or any criteria you define. Get per-image scores with distributions and inter-rater reliability.

Participant view: 1–5 rating scale.
Learn more about Rating Scale
How you'd run it
$ claude "rate these diffusion model outputs on realism and prompt adherence"
What you get back
Image Realism Prompt Adherence Overall
gen_001.png 4.2 3.8 4.0
gen_002.png 3.6 4.5 4.1
gen_003.png 4.8 4.7 4.8
gen_004.png 2.1 3.2 2.7
50 images · 12 evaluators · ICC: 0.81 (excellent)
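
The ICC line is the inter-rater reliability of those scores. If you want to sanity-check it on your own export, ICC(2,1), the two-way random effects, absolute agreement, single-rater form, fits in a few lines; whether candor reports exactly this variant isn't stated here, so treat the sketch below as an approximation.

def icc2_1(matrix):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.
    matrix: one row per image, one column per evaluator, complete ratings."""
    n, k = len(matrix), len(matrix[0])
    grand = sum(map(sum, matrix)) / (n * k)
    row_means = [sum(row) / k for row in matrix]
    col_means = [sum(matrix[i][j] for i in range(n)) / n for j in range(k)]
    ms_rows = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)    # between images
    ms_cols = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)    # between evaluators
    ms_err = sum((matrix[i][j] - row_means[i] - col_means[j] + grand) ** 2
                 for i in range(n) for j in range(k)) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

# Three images rated by three evaluators (illustrative numbers only):
print(round(icc2_1([[4, 5, 4], [2, 2, 3], [5, 4, 5]]), 2))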

TTS Model Benchmarking

You want to know which TTS model sounds most natural to real listeners. Recruit evaluators to compare voice samples in randomized pairs and produce a ranked leaderboard with statistical confidence and agreement metrics.

Participant view: A vs B with a Tie option.
Learn more about Pairwise Comparison
How you'd run it
$ candor study create --goal "benchmark TTS models" \
--items "tts-samples/" --task compare \
--recruit --participants 20
What you get back
Rank Model Win Rate 95% CI Elo
#1 ElevenLabs v2 72.4% [68.1, 76.7] 1247
#2 PlayHT 3.0 61.8% [57.2, 66.4] 1183
#3 OpenAI TTS-1 48.2% [43.5, 52.9] 1102
#4 Google WaveNet 17.6% [13.9, 21.3] 968
20 evaluators · 480 comparisons · κ: 0.77
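
The Elo column comes from treating every comparison as a game and nudging ratings toward each result. Exactly how candor seeds and orders those updates isn't documented here, so the sketch below is an approximation: the K-factor of 32, the 1000-point starting rating, and scoring a tie as half a win are all assumptions.

def elo_expected(r_a, r_b):
    """Probability A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_ratings(comparisons, k=32, start=1000):
    """comparisons: (model_a, model_b, score) tuples, where score is 1.0 if A won,
    0.0 if B won, and 0.5 for a tie. Ratings depend on update order."""
    ratings = {}
    for a, b, score in comparisons:
        ra, rb = ratings.setdefault(a, start), ratings.setdefault(b, start)
        ea = elo_expected(ra, rb)
        ratings[a] = ra + k * (score - ea)
        ratings[b] = rb + k * ((1 - score) - (1 - ea))
    return ratings

# Illustrative comparisons only, not the study data:
print(elo_ratings([("ElevenLabs v2", "OpenAI TTS-1", 1.0),
                   ("PlayHT 3.0", "Google WaveNet", 1.0),
                   ("ElevenLabs v2", "PlayHT 3.0", 0.5)]))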

Model Selection & RLHF

You need pairwise preference data for reward model training. Compare model outputs at scale with recruited evaluators, producing structured preference datasets ready for your RLHF pipeline. Agreement metrics let you filter noisy labels before training.

Participant view: A vs B with a Tie option.
Learn more about Pairwise Comparison
How you'd run it
$ claude "collect RLHF preference data on 50 response pairs from 20 evaluators"
What you get back
Export: preferences_study_ms7k1.jsonl
1,000 preference pairs (50 prompts × 20 evaluators)
Format: {prompt, chosen, rejected, annotator_id}
Quality metrics:
Agreement: 0.74 (substantial)
Avg response time: 34s per pair
Flagged as low-effort: 2.1% (auto-filtered)
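
That JSONL drops straight into most preference-tuning pipelines, and the per-prompt vote counts make it easy to filter further before training. A minimal sketch that keeps only prompts where at least 70% of evaluators agreed on the chosen response, assuming the {prompt, chosen, rejected, annotator_id} format above; the threshold and the output file name are my choices, not product defaults.

import json
from collections import Counter, defaultdict

votes = defaultdict(Counter)      # prompt -> how often each response was chosen
records = defaultdict(list)       # prompt -> raw preference records
with open("preferences_study_ms7k1.jsonl") as f:
    for line in f:
        rec = json.loads(line)    # {prompt, chosen, rejected, annotator_id}
        votes[rec["prompt"]][rec["chosen"]] += 1
        records[rec["prompt"]].append(rec)

kept = []
for prompt, counter in votes.items():
    winner, n = counter.most_common(1)[0]
    if n / sum(counter.values()) >= 0.7:                   # high-agreement prompts only
        kept.extend(r for r in records[prompt] if r["chosen"] == winner)

with open("preferences_filtered.jsonl", "w") as f:         # hypothetical output name
    f.writelines(json.dumps(r) + "\n" for r in kept)
print(f"kept {len(kept)} of {sum(map(len, records.values()))} preference pairs")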

Hallucination Detection

You want to measure how often your model makes things up. Evaluators flag factual errors, fabricated citations, and unsupported claims, labeling each response as grounded, partially hallucinated, or fully hallucinated. Critical for measuring factuality in production LLMs.

Participant view: single-choice labels (e.g., Positive / Neutral / Negative).
Learn more about Categorization
How you'd run it
$ candor study create --goal "flag hallucinations" \
--items "responses.csv" --task categorize \
--labels "grounded,partial,hallucinated" --recruit --participants 8
What you get back
Distribution (200 responses):
Grounded: 62.0% (124) ████████████░░░░░░░░
Partial: 23.5% (47) █████░░░░░░░░░░░░░░░
Hallucinated: 14.5% (29) ███░░░░░░░░░░░░░░░░░
Most hallucinated topics: citations (38%), dates (24%)
Fleiss' κ: 0.72 (substantial) · 8 evaluators
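
Fleiss' κ is the agreement figure above; recomputing it on a slice of the data (say, only citation-heavy prompts) takes a short function. A sketch assuming every response was labeled by the same number of evaluators; the list-of-lists input shape is my choice, not an export format.

from collections import Counter

def fleiss_kappa(labels_per_item):
    """labels_per_item: one inner list per response, holding every evaluator's label.
    Assumes the same number of evaluators labeled each response."""
    n = len(labels_per_item[0])                                    # evaluators per item
    categories = sorted({lab for item in labels_per_item for lab in item})
    counts = [[Counter(item)[c] for c in categories] for item in labels_per_item]
    N = len(counts)
    p_j = [sum(row[j] for row in counts) / (N * n) for j in range(len(categories))]
    p_e = sum(p ** 2 for p in p_j)                                 # chance agreement
    p_i = [(sum(c ** 2 for c in row) - n) / (n * (n - 1)) for row in counts]
    p_bar = sum(p_i) / N                                           # observed agreement
    return (p_bar - p_e) / (1 - p_e)

# Two responses, three evaluators each (illustrative labels only):
print(fleiss_kappa([["grounded", "grounded", "partial"],
                    ["hallucinated", "hallucinated", "hallucinated"]]))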

Instruction Following Evaluation

You want to know how well your model follows complex instructions. Recruited evaluators rate responses on constraint satisfaction, format adherence, and completeness. Essential for post-training evaluation at frontier labs.

Participant view: 1–5 rating scale.
Learn more about Rating Scale
How you'd run it
$ claude "evaluate instruction following on 30 test cases, rate 1-5"
What you get back
Constraint Mean Score Pass Rate (≥4)
Format adherence 4.3 87%
Length compliance 3.9 72%
Tone matching 4.1 81%
All constraints 3.6 63%
30 test cases · 10 evaluators · ICC: 0.76 (good)
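
The pass rate is just the share of ratings at or above 4, computed per constraint. A quick sketch, assuming a flat export with one row per (test case, evaluator, constraint) rating; the file and column names are placeholders.

import csv
from collections import defaultdict

by_constraint = defaultdict(list)
with open("instruction_ratings.csv", newline="") as f:        # hypothetical export file
    for row in csv.DictReader(f):                              # assumed columns: case_id, constraint, rating
        by_constraint[row["constraint"]].append(int(row["rating"]))

for constraint, ratings in sorted(by_constraint.items()):
    mean = sum(ratings) / len(ratings)
    pass_rate = sum(r >= 4 for r in ratings) / len(ratings)    # the "Pass Rate (>=4)" column
    print(f"{constraint}: mean {mean:.1f}, pass rate {pass_rate:.0%}")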

Reasoning Chain Assessment

You have step-by-step reasoning traces and want domain experts to evaluate them. Compare reasoning quality across model checkpoints or prompting strategies for logical correctness, completeness, and coherence — so you know what to ship.

Participant view: A vs B with a Tie option.
Learn more about Pairwise Comparison
How you'd run it
$ candor study create --goal "compare reasoning quality" \
--items "traces.csv" --task compare \
--recruit --participants 10
What you get back
Model Win Rate Logical Errors Completeness
checkpoint-7 64.2% 8% of traces 91%
checkpoint-6 35.8% 19% of traces 84%
Biggest gains: multi-step math (+22%), code logic (+18%)
Regression: common-sense reasoning (-3%)
10 expert evaluators · 40 trace pairs · κ: 0.71
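
With only 40 trace pairs, it's worth checking that a 64% vs 36% split isn't noise before you ship a checkpoint. A sign test on the per-pair majority votes is one quick way; the counts below are illustrative, and using majority votes as the unit (rather than all 400 individual judgments, which aren't independent) is my choice rather than anything candor reports.

from math import comb

# Illustrative: checkpoint-7 wins the majority vote on 25 of 39 decided pairs (1 tie excluded).
wins, decided = 25, 39
p_value = sum(comb(decided, k) for k in range(wins, decided + 1)) / 2 ** decided
print(f"win rate {wins / decided:.1%}, one-sided sign-test p = {p_value:.4f}")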