ML Evaluation
Human evaluation infrastructure for ML teams. Collect RLHF preference data, detect hallucinations, evaluate instruction following, assess reasoning chains, and benchmark multimodal models. Scale from internal reviewers to hundreds of recruited evaluators.
LLM Output Comparison
You have outputs from two or more LLMs and want to know which one people prefer. Human reviewers pick winners on criteria like helpfulness, accuracy, or tone. Produces rankings with win rates and agreement scores — essential for RLHF data collection and model selection.
$ candor study create --goal "compare GPT-4o vs Claude responses" \
--items "eval-pairs.csv" --task compare \
--recruit --participants 15
What you get back
Model Win Rate Ties Comparisons
Claude 61.3% 12.0% 92/150
GPT-4o 26.7% — 40/150
Claude preferred on: accuracy (71%), helpfulness (64%)
GPT-4o preferred on: conciseness (58%)
Agreement: 0.69 (substantial) · 15 evaluators
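Win and tie rates like these come from straightforward aggregation over individual preference votes. A minimal sketch of that computation (the vote encoding is an assumption for illustration, not the tool's actual export format):

```python
from collections import Counter

def win_rates(votes):
    """Aggregate pairwise votes into per-model win/tie rates.

    Each vote is 'A', 'B', or 'tie' for one comparison between
    two fixed models. Returns rates as fractions of all votes.
    """
    counts = Counter(votes)
    total = len(votes)
    return {
        "A": counts["A"] / total,
        "B": counts["B"] / total,
        "tie": counts["tie"] / total,
    }

# Matches the table above: 92 wins, 40 losses, 18 ties in 150 comparisons
votes = ["A"] * 92 + ["B"] * 40 + ["tie"] * 18
rates = win_rates(votes)
print(rates)  # A: ~0.613, B: ~0.267, tie: 0.12
```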
Generated Image Evaluation
You have AI-generated images and want human quality scores. Recruit evaluators to rate outputs from diffusion models, GANs, or image editors on realism, prompt adherence, or any criteria you define. Get per-image scores with distributions and inter-rater reliability.
$ claude "rate these diffusion model outputs on realism and prompt adherence"
What you get back
Image Realism Prompt Adherence Overall
gen_001.png 4.2 3.8 4.0
gen_002.png 3.6 4.5 4.1
gen_003.png 4.8 4.7 4.8
gen_004.png 2.1 3.2 2.7
50 images · 12 evaluators · ICC: 0.81 (excellent)
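Per-image scores of this shape are just per-criterion means across evaluators, with an overall column averaging the criteria. A sketch of that aggregation (the input structure is a hypothetical example, not the tool's data model):

```python
from statistics import mean

def score_table(ratings):
    """ratings: {image: {criterion: [one score per evaluator]}}.
    Returns per-image criterion means plus an overall mean."""
    table = {}
    for image, crits in ratings.items():
        means = {c: mean(s) for c, s in crits.items()}
        means["overall"] = mean(means.values())  # average of criterion means
        table[image] = {c: round(v, 1) for c, v in means.items()}
    return table

ratings = {
    "gen_001.png": {"realism": [4, 5, 4, 4], "adherence": [4, 4, 3, 4]},
    "gen_004.png": {"realism": [2, 2, 2, 3], "adherence": [3, 3, 4, 3]},
}
print(score_table(ratings))
```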
TTS Model Benchmarking
You want to know which TTS model sounds most natural to real listeners. Recruit evaluators to compare voice samples in randomized pairs and produce a ranked leaderboard with statistical confidence and agreement metrics.
$ candor study create --goal "benchmark TTS models" \
--items "tts-samples/" --task compare \
--recruit --participants 20
What you get back
Rank Model Win Rate 95% CI Elo
#1 ElevenLabs v2 72.4% [68.1, 76.7] 1247
#2 PlayHT 3.0 61.8% [57.2, 66.4] 1183
#3 OpenAI TTS-1 48.2% [43.5, 52.9] 1102
#4 Google WaveNet 17.6% [13.9, 21.3] 968
20 evaluators · 480 comparisons · κ: 0.77
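The Elo column is built by treating each pairwise preference as a "match" and updating ratings after every comparison. A minimal sketch of the standard Elo update (starting rating and K-factor are assumptions; the leaderboard's exact parameters aren't specified above):

```python
def elo_update(r_a, r_b, score_a, k=32):
    """One Elo update after a single comparison.

    score_a: 1.0 if A was preferred, 0.0 if B was, 0.5 for a tie.
    Returns the new (r_a, r_b); rating points are zero-sum.
    """
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Both models start equal; repeated wins for A pull the ratings apart.
ratings = {"model_a": 1000.0, "model_b": 1000.0}
for _ in range(3):
    ratings["model_a"], ratings["model_b"] = elo_update(
        ratings["model_a"], ratings["model_b"], 1.0
    )
print(ratings)
```

Because each update is zero-sum, the ratings always sum to the starting total, which makes drift easy to sanity-check.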
Model Selection & RLHF
You need pairwise preference data for reward model training. Compare model outputs at scale with recruited evaluators, producing structured preference datasets ready for your RLHF pipeline. Agreement metrics let you filter noisy labels before training.
$ claude "collect RLHF preference data on 50 response pairs from 20 evaluators"
What you get back
Export: preferences_study_ms7k1.jsonl
1,000 preference pairs (50 prompts × 20 evaluators)
Format: {prompt, chosen, rejected, annotator_id}
Quality metrics:
Agreement: 0.74 (substantial)
Avg response time: 34s per pair
Flagged as low-effort: 2.1% (auto-filtered)
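A JSONL export in that shape drops directly into a reward-model pipeline, and the per-prompt agreement lets you filter noisy labels before training. A sketch of one plausible filtering rule, majority agreement per prompt (the field names follow the format line above; the threshold and rule are assumptions):

```python
import json
from collections import Counter, defaultdict

def filter_by_agreement(lines, min_agreement=0.7):
    """Keep preference records only for prompts where at least
    min_agreement of annotators chose the same response."""
    by_prompt = defaultdict(list)
    for line in lines:
        rec = json.loads(line)
        by_prompt[rec["prompt"]].append(rec)

    kept = []
    for recs in by_prompt.values():
        counts = Counter(r["chosen"] for r in recs)
        top, n = counts.most_common(1)[0]
        if n / len(recs) >= min_agreement:
            # Drop the minority votes; keep the majority-consistent pairs.
            kept.extend(r for r in recs if r["chosen"] == top)
    return kept

rows = [
    {"prompt": "p1", "chosen": "a", "rejected": "b", "annotator_id": i}
    for i in range(8)
] + [{"prompt": "p1", "chosen": "b", "rejected": "a", "annotator_id": 8}]
lines = [json.dumps(r) for r in rows]
print(len(filter_by_agreement(lines)))  # 8 of 9 records survive
```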
Hallucination Detection
You want to measure how often your model makes things up. Evaluators flag factual errors, fabricated citations, and unsupported claims, labeling each response as grounded, partially hallucinated, or fully hallucinated. Critical for measuring factuality in production LLMs.
$ candor study create --goal "flag hallucinations" \
--items "responses.csv" --task categorize \
--labels "grounded,partial,hallucinated" --recruit --participants 8
What you get back
Distribution (200 responses):
Grounded: 62.0% (124) ████████████░░░░░░░░
Partial: 23.5% (47) █████░░░░░░░░░░░░░░░
Hallucinated: 14.5% (29) ███░░░░░░░░░░░░░░░░░
Most hallucinated topics: citations (38%), dates (24%)
Fleiss' κ: 0.72 (substantial) · 8 evaluators
Instruction Following Evaluation
You want to know how well your model follows complex instructions. Recruited evaluators rate responses on constraint satisfaction, format adherence, and completeness. Essential for post-training evaluation at frontier labs.
$ claude "evaluate instruction following on 30 test cases, rate 1-5"
What you get back
Constraint Mean Score Pass Rate (≥4)
Format adherence 4.3 87%
Length compliance 3.9 72%
Tone matching 4.1 81%
All constraints 3.6 63%
30 test cases · 10 evaluators · ICC: 0.76 (good)
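The mean-score and pass-rate columns reduce to a simple per-constraint summary over the 1-5 ratings. A sketch (the input shape and example scores are assumptions for illustration):

```python
from statistics import mean

def constraint_summary(scores, threshold=4):
    """scores: {constraint: [1-5 ratings across cases and evaluators]}.
    Returns the mean score and the fraction of ratings >= threshold."""
    return {
        c: {
            "mean": round(mean(s), 1),
            "pass_rate": sum(v >= threshold for v in s) / len(s),
        }
        for c, s in scores.items()
    }

scores = {"format": [5, 4, 4, 3, 5], "length": [3, 4, 2, 5, 4]}
print(constraint_summary(scores))
```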
Reasoning Chain Assessment
You have step-by-step reasoning traces and want domain experts to evaluate them. Compare reasoning quality across model checkpoints or prompting strategies for logical correctness, completeness, and coherence — so you know what to ship.
$ candor study create --goal "compare reasoning quality" \
--items "traces.csv" --task compare \
--recruit --participants 10
What you get back
Model Win Rate Logical Errors Completeness
checkpoint-7 64.2% 8% of traces 91%
checkpoint-6 35.8% 19% of traces 84%
Biggest gains: multi-step math (+22%), code logic (+18%)
Regression: common-sense reasoning (-3%)
10 expert evaluators · 40 trace pairs · κ: 0.71