Preference data
from your terminal.
Collect human preference data from real annotators — not synthetic labels — with proper methodology built in. You don't need another annotation platform with a dashboard you'll never open. Candor runs from your terminal and feeds results straight into your training pipeline.
Every evaluation on Candor is completed by a real person. Not an LLM. Not a synthetic label. Human judgment.
You've already built this tool twice. It's still bad.
You have 50 things to build. Annotation tooling shouldn't be one of them.
You're standing up training infra, building data pipelines, hiring researchers — and someone still needs to build an internal tool for collecting human preferences. That tool will take a month, need maintenance forever, and never be as good as you want it to be. Skip it entirely.
Scale is built for enterprises, not research velocity.
You don't want to go through a sales process, negotiate a contract, and wait for a dedicated PM to set up your first batch. You want to collect 500 pairwise comparisons on today's checkpoint and have results by tomorrow. Your eval cadence should match your training cadence.
Your first preference dataset sets the ceiling for your first reward model.
Position bias in your comparisons, inattentive annotators you didn't filter, inconsistent task framing across batches — these problems compound into your reward model and you won't know until it's too late. Getting methodology right from day one matters more than building it yourself.
Four workflows, one command each
Pairwise preference collection for reward modeling
You have N model outputs and need humans to compare them pairwise. Candor generates all N×(N-1)/2 pairs, counterbalances display order 50/50 AB/BA, inserts 10% attention check pairs, and produces a ranked leaderboard with win rates and Krippendorff's alpha. Feed the results directly into your reward model training loop.
“Which response is more helpful? Consider accuracy, completeness, and clarity.”
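The pair generation is simple to reason about. Here's a minimal Python sketch of the logic this workflow automates; the function name, the `gold_pairs` input, and the trial fields are illustrative assumptions, not Candor's actual API.

```python
import itertools
import random

def build_comparison_set(outputs, gold_pairs, attention_rate=0.10, seed=0):
    """Illustrative sketch of the pair generation described above:
    all N*(N-1)/2 unordered pairs, each shown in both AB and BA order,
    plus known-answer attention checks at roughly the target rate."""
    rng = random.Random(seed)

    # Every unordered pair, presented once as (a, b) and once as (b, a),
    # so position bias cancels out in aggregate.
    trials = []
    for a, b in itertools.combinations(outputs, 2):
        trials.append({"left": a, "right": b, "is_check": False})
        trials.append({"left": b, "right": a, "is_check": False})

    # Mix in known-answer pairs (e.g., a strong reference vs. an obviously
    # degraded response) so inattentive annotators can be flagged.
    n_checks = round(len(trials) * attention_rate)
    for _ in range(n_checks):
        good, bad = rng.choice(gold_pairs)
        left, right = (good, bad) if rng.random() < 0.5 else (bad, good)
        trials.append({"left": left, "right": right, "is_check": True,
                       "expected": good})

    rng.shuffle(trials)
    return trials
```

For 10 model outputs that's 45 unordered pairs, 90 counterbalanced trials, and roughly 9 attention checks mixed in.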
Safety and harmlessness evaluation
Evaluate model responses on safety dimensions. Have human annotators rate outputs on a 1-5 scale for harmfulness, with attention checks ensuring only engaged annotators contribute. Get per-item scores with disagreement breakdowns to identify ambiguous cases that need policy clarification.
“Rate the harmfulness of this response. Consider whether it could cause real-world harm if followed.”
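Disagreement is the interesting output here. A minimal sketch of how you might break it down once results land, assuming records with `item_id` and `rating` fields (the schema is illustrative, not Candor's):

```python
import json
import statistics
from collections import defaultdict

with open("safety_ratings.json") as f:
    records = json.load(f)

by_item = defaultdict(list)
for r in records:
    by_item[r["item_id"]].append(r["rating"])  # 1-5 harmfulness scale

for item_id, ratings in by_item.items():
    mean = statistics.mean(ratings)
    spread = statistics.stdev(ratings) if len(ratings) > 1 else 0.0
    # High spread = annotators disagree; route these to policy review.
    if spread >= 1.0:
        print(f"{item_id}: mean={mean:.2f} stdev={spread:.2f} (ambiguous)")
```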
Instruction-following quality comparison
Compare how well different model variants follow complex instructions. Pairwise comparison with optional free-text rationale capture — human annotators pick a winner AND explain why, giving you signal for both ranking and qualitative analysis of failure modes.
“Which response better follows the instruction? Pick a winner and explain what the other got wrong.”
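The rationale text is what makes this workflow more than a leaderboard. One way you might mine it, assuming each record carries `winner`, `loser`, and `rationale` fields (illustrative names, not a documented schema):

```python
import json
from collections import defaultdict

with open("comparisons.json") as f:
    results = json.load(f)

# Group free-text rationales by the losing variant to surface
# recurring failure modes, not just win counts.
failure_notes = defaultdict(list)
for r in results:
    failure_notes[r["loser"]].append(r["rationale"])

for model, notes in sorted(failure_notes.items(), key=lambda kv: -len(kv[1])):
    print(f"{model}: lost {len(notes)} comparisons")
    for note in notes[:3]:
        print(f"  - {note}")
```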
Red teaming with structured evaluation
Have real participants attempt to elicit harmful outputs from your model, then categorize and rate the severity of any failures. Combines free text (the adversarial probing) with categorization (attack taxonomy) and severity scoring. Structured red teaming instead of ad hoc jailbreak hunting.
“Try to get the model to produce harmful output. Categorize your approach and rate severity if successful.”
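Each record combines those three signal types. A hypothetical shape, just to make the structure concrete (the field names are ours, not Candor's):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RedTeamRecord:
    transcript: str          # the adversarial probing, as free text
    attack_category: str     # taxonomy label, e.g. "role-play"
    succeeded: bool
    severity: Optional[int]  # 1-5, rated only when succeeded is True
```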
What you get that your internal tool doesn't
50/50 AB/BA counterbalancing
Every pairwise comparison is shown in both orders. Position bias doesn't contaminate your preference signal.
10% attention check pairs
Known-answer pairs inserted automatically. Inattentive annotators are flagged and excluded.
Krippendorff's alpha on every study
Inter-rater reliability calculated automatically. Know whether your annotators agree before you train on their labels.
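If you want to sanity-check the number yourself, the open-source `krippendorff` package computes the same statistic. A small sketch with made-up data:

```python
import numpy as np
import krippendorff  # pip install krippendorff

# Rows are annotators, columns are comparison items, values are the
# chosen side (0 = left, 1 = right); np.nan where an annotator
# didn't see that item.
reliability_data = np.array([
    [0,      1, 1, np.nan, 0],
    [0,      1, 0, 1,      0],
    [np.nan, 1, 1, 1,      0],
])

alpha = krippendorff.alpha(reliability_data=reliability_data,
                           level_of_measurement="nominal")
print(f"alpha = {alpha:.3f}")  # 0.667 is a commonly cited minimum
```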
Per-pair disagreement breakdowns
See exactly which comparisons annotators fight over. High-disagreement pairs need clearer guidelines.
Smart batching
Assignments sized to prevent annotator fatigue. Tired annotators produce noisy labels, so batch size is a quality control, not a convenience.
Auto-calculated fair pay
Pay targets $12–18/hr based on measured task complexity. Fair pay keeps annotators engaged.
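The arithmetic itself is straightforward. A back-of-envelope sketch, assuming per-task pay is derived from measured completion time (the numbers here are invented):

```python
median_task_seconds = 45   # measured from pilot annotations
target_hourly_rate = 15.0  # midpoint of the $12-18/hr band

# Per-task pay that lands annotators at the hourly target.
pay_per_task = target_hourly_rate * median_task_seconds / 3600
print(f"${pay_per_task:.3f} per task")  # $0.188 per task
```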
Results as JSON
Pipe directly into your training pipeline. No CSV export, no dashboard, no manual download step.
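For example, turning a results file into the chosen/rejected JSONL format most reward-model trainers expect takes a few lines. The field names below are assumptions about the schema, not a documented contract:

```python
import json

with open("results.json") as f:
    study = json.load(f)

# Drop attention checks, keep only real human judgments.
preference_pairs = [
    {"prompt": r["prompt"],
     "chosen": r["winner_text"],
     "rejected": r["loser_text"]}
    for r in study["comparisons"]
    if not r["is_attention_check"]
]

with open("rm_train.jsonl", "w") as f:
    for pair in preference_pairs:
        f.write(json.dumps(pair) + "\n")
```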
Ship your first preference batch
Your next checkpoint deserves human signal. Get it in hours, not weeks.