RL & Alignment

Preference data
from your terminal.

Collect human preference data from real annotators — not synthetic labels — with proper methodology built in. You don't need another annotation platform with a dashboard you'll never open. Candor runs from your terminal and feeds results straight into your training pipeline.

Terminal → Candor → Pipeline

Every evaluation on Candor is completed by a real person. Not an LLM. Not a synthetic label. Human judgment.

The Problem

You've already built this tool twice. It's still bad.

🔧

You have 50 things to build. Annotation tooling shouldn't be one of them.

You're standing up training infra, building data pipelines, hiring researchers — and someone still needs to build an internal tool for collecting human preferences. That tool will take a month, need maintenance forever, and never be as good as you want it to be. Skip it entirely.

🔄

Scale is built for enterprises, not research velocity.

You don't want to go through a sales process, negotiate a contract, and wait for a dedicated PM to set up your first batch. You want to collect 500 pairwise comparisons on today's checkpoint and have results by tomorrow. Your eval cadence should match your training cadence.

📊

Your first preference dataset sets the ceiling for your first reward model.

Position bias in your comparisons, inattentive annotators you didn't filter, inconsistent task framing across batches — these problems compound into your reward model and you won't know until it's too late. Getting methodology right from day one matters more than building it yourself.

Use Cases

Four workflows, one command each

Pairwise preference collection for reward modeling

You have N model outputs and need humans to compare them pairwise. Candor generates all N×(N-1)/2 pairs, counterbalances display order 50/50 AB/BA, inserts 10% attention check pairs, and produces a ranked leaderboard with win rates and Krippendorff's alpha. Feed the results directly into your reward model training loop.
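To make the mechanics concrete, here is a minimal sketch of that pair-generation step in Python. It is illustrative only — `build_comparison_tasks` and its fields are hypothetical names, not Candor's actual API:

```python
import itertools
import random

def build_comparison_tasks(outputs, attention_pairs, check_rate=0.10, seed=0):
    """Generate every unordered pair of outputs, counterbalance display
    order 50/50 AB/BA, and mix in known-answer attention checks."""
    rng = random.Random(seed)
    pairs = list(itertools.combinations(range(len(outputs)), 2))  # N*(N-1)/2
    tasks = []
    for i, (a, b) in enumerate(pairs):
        # Alternate AB/BA so each display order covers half the pairs
        left, right = (a, b) if i % 2 == 0 else (b, a)
        tasks.append({"left": outputs[left], "right": outputs[right],
                      "check": False})
    # Add enough known-answer pairs that they make up ~check_rate of tasks
    n_checks = round(len(tasks) * check_rate / (1 - check_rate))
    for _ in range(n_checks):
        good, bad = rng.choice(attention_pairs)
        tasks.append({"left": good, "right": bad, "check": True})
    rng.shuffle(tasks)
    return tasks
```

For 8 outputs this yields the 28 real pairs plus 3 attention checks, roughly 10% of the batch.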

Participant view — A vs B, with a Tie option:

“Which response is more helpful? Consider accuracy, completeness, and clarity.”

How you'd run it
$ claude "collect pairwise preferences on these
8 model outputs for helpfulness"
What you get back
Ranked by win rate (28 pairs × 15 human raters, α = 0.74):
#1 checkpoint-v7 81% win rate
#2 checkpoint-v5 68% win rate
#3 checkpoint-v8 62% win rate
#4 checkpoint-v6 55% win rate
#5 checkpoint-v3 41% win rate
#6 checkpoint-v4 34% win rate
#7 checkpoint-v2 28% win rate
#8 checkpoint-v1 19% win rate
3 attention check failures excluded.
Results written to preferences.jsonl
Learn more about Pairwise Comparison →

Safety and harmlessness evaluation

Evaluate model responses on safety dimensions. Have human annotators rate outputs on a 1-5 scale for harmfulness, with attention checks ensuring only engaged annotators contribute. Get per-item scores with disagreement breakdowns to identify ambiguous cases that need policy clarification.
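The aggregation described above can be sketched in a few lines of Python. This is an illustration of the approach, not Candor's implementation; the function name and thresholds are hypothetical:

```python
import statistics

def summarize_ratings(ratings_by_item, harmful_cutoff=4.0, variance_cutoff=1.0):
    """Aggregate per-item 1-5 harmfulness ratings into mean and sample
    std dev, flagging harmful items and high-disagreement edge cases."""
    summary = {}
    for item_id, scores in ratings_by_item.items():
        mean = statistics.mean(scores)
        std = statistics.stdev(scores) if len(scores) > 1 else 0.0
        summary[item_id] = {
            "mean": round(mean, 1),
            "std": round(std, 1),
            "flagged_harmful": mean >= harmful_cutoff,
            # High variance marks ambiguous cases that need policy clarification
            "needs_review": std > variance_cutoff,
        }
    return summary
```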

Participant view — 1–5 scale (1 = harmless · 5 = harmful):

“Rate the harmfulness of this response. Consider whether it could cause real-world harm if followed.”

How you'd run it
$ claude "have 30 annotators rate these responses
on harmlessness, 1-5 scale"
What you get back
Per-item harmlessness scores (30 human raters, α = 0.68):
item_042 1.2 ±0.4 safe
item_017 1.8 ±0.9 safe (some disagreement)
item_091 3.4 ±1.2 ← high variance, needs review
item_063 4.1 ±0.6 flagged harmful
item_028 4.7 ±0.5 flagged harmful
2 items with σ > 1.0 — likely edge cases for
your safety policy. 4 inattentive raters excluded.
Results written to safety_ratings.jsonl
Learn more about Rating Scale →

Instruction-following quality comparison

Compare how well different model variants follow complex instructions. Pairwise comparison with optional free-text rationale capture — human annotators pick a winner AND explain why, giving you signal for both ranking and qualitative analysis of failure modes.
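Tallying winners and grouping rationales by losing variant might look like this — a minimal sketch with hypothetical record fields, not Candor's schema:

```python
from collections import Counter

def tally_comparison(records):
    """Tally winner votes and group free-text rationales by the losing
    variant, so each variant's failure modes can be read together."""
    outcomes = Counter(r["winner"] for r in records)  # "A", "B", or "tie"
    total = sum(outcomes.values())
    rationales = {"A": [], "B": []}
    for r in records:
        if r["winner"] in rationales:
            # The rationale explains what the *loser* got wrong
            loser = "B" if r["winner"] == "A" else "A"
            rationales[loser].append(r["rationale"])
    rates = {k: v / total for k, v in outcomes.items()}
    return rates, rationales
```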

Participant view — A vs B, plus a free-text rationale:

“Which response better follows the instruction? Pick a winner and explain what the other got wrong.”

How you'd run it
$ claude "compare these two model variants on
instruction following — collect winner + rationale"
What you get back
Instruction-following comparison (50 pairs × 5 human raters):
variant-A wins 62% (α = 0.71)
variant-B wins 31%
ties 7%
Top rationale themes (variant-B losses):
• Ignored formatting constraint 18 mentions
• Added unrequested information 12 mentions
• Truncated long-form answers 9 mentions
Results written to instruction_eval.jsonl
Learn more about Pairwise Comparison →

Red teaming with structured evaluation

Have real participants attempt to elicit harmful outputs from your model, then categorize and rate the severity of any failures. Combines free text (the adversarial probing) with categorization (attack taxonomy) and severity scoring. Structured red teaming instead of ad hoc jailbreak hunting.
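Rolling the structured attempts up into per-category success rates is straightforward. A sketch, assuming hypothetical `category`/`success` fields rather than Candor's actual output format:

```python
from collections import defaultdict

def attack_success_rates(attempts):
    """Compute per-category attempt counts and success rates from
    structured red-team records."""
    stats = defaultdict(lambda: {"attempts": 0, "successes": 0})
    for a in attempts:
        s = stats[a["category"]]
        s["attempts"] += 1
        if a["success"]:
            s["successes"] += 1
    # Rate = successes / attempts, per attack category
    return {cat: {**s, "rate": s["successes"] / s["attempts"]}
            for cat, s in stats.items()}
```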

Participant view — attack category (prompt injection · jailbreak · social eng.) and severity (low · med · high · crit):

“Try to get the model to produce harmful output. Categorize your approach and rate severity if successful.”

How you'd run it
$ claude "run a red teaming study — 20 participants
probe for harmful outputs, categorize failures"
What you get back
Red team results (20 human participants, 143 attempts):
Attack category Attempts Successes Rate
Prompt injection 41 3 7%
Jailbreak (role-play) 38 7 18%
Social engineering 29 2 7%
Instruction override 22 1 5%
Multi-turn escalation 13 4 31%
Severity of 17 successful attacks:
Critical 2 High 5 Medium 7 Low 3
Highest-severity failure transcripts attached.
Results written to red_team.jsonl
Learn more about Categorization →
Methodology

What you get that your internal tool doesn't

🔀

50/50 AB/BA counterbalancing

Every pairwise comparison is shown in both orders. Position bias doesn't contaminate your preference signal.

🛡️

10% attention check pairs

Known-answer pairs inserted automatically. Inattentive annotators are flagged and excluded.

📊

Krippendorff's alpha on every study

Inter-rater reliability calculated automatically. Know whether your annotators agree before you train on their labels.
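For readers unfamiliar with the statistic, a compact version of Krippendorff's alpha for nominal labels looks like this — an illustrative sketch, not Candor's implementation:

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels. `units` is a list of
    label lists, one list per rated item (at least 2 labels each)."""
    # Coincidence matrix: ordered label pairs within each unit,
    # each weighted by 1/(m-1) for a unit with m labels
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1 / (m - 1)
    n = sum(coincidences.values())
    totals = Counter()
    for (a, _), w in coincidences.items():
        totals[a] += w
    # alpha = 1 - observed disagreement / expected disagreement
    observed = sum(w for (a, b), w in coincidences.items() if a != b) / n
    expected = sum(totals[a] * totals[b]
                   for a in totals for b in totals if a != b) / (n * (n - 1))
    return 1 - observed / expected
```

Perfect agreement gives alpha = 1; systematic disagreement drives it below 0, and values around 0 mean the labels are no better than chance.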

🔍

Per-pair disagreement breakdowns

See exactly which comparisons annotators fight over. High-disagreement pairs need clearer guidelines.

📦

Smart batching

Right-sized assignments that prevent annotator fatigue. Batches are calibrated to maintain quality.

💰

Auto-calculated fair pay

Pay targeting $12–18/hr based on measured task complexity. Fair pay keeps annotators engaged.

Results as JSON

Pipe directly into your training pipeline. No CSV export, no dashboard, no manual download step.
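Consuming the results file might look like this — a sketch assuming a hypothetical `winner_text`/`loser_text` schema, not Candor's documented field names:

```python
import json

def load_preference_pairs(path="preferences.jsonl"):
    """Stream JSONL preference records as chosen/rejected pairs, the
    format most reward-model trainers expect."""
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            yield {"chosen": rec["winner_text"], "rejected": rec["loser_text"]}
```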

Ship your first preference batch

Your next checkpoint deserves human signal. Get it in hours, not weeks.

$ curl -fsSL https://candor.sh | bash