Higher-dimensional
preference data
from your terminal.
Collect human preference data from real annotators — not synthetic labels — with proper methodology built in. You don't need another annotation platform with a dashboard you'll never open. The signal flows back as labelled training data.
RLAIF gets you scale. Candor gets you the human ground truth that calibrates your AI judge — and tells you where it's wrong.
You've already built this tool twice. It's still bad.
You have 50 things to build. Annotation tooling shouldn't be one of them.
Internal tools collect the signal that was easiest to build, not the signal the model actually needs. They do A vs B but never ask why. They calcify around one task type and cost a week of engineering to modify. The version you're about to build will take a month, need maintenance forever, and never be as good as you want it to be. Skip it entirely.
Scale AI is built for enterprises, not research velocity.
You don't want to go through a sales process, negotiate a contract, and wait for a dedicated PM to set up your first batch. You want to collect 500 pairwise comparisons on today's checkpoint and have results by tomorrow. Your eval cadence should match your training cadence.
Your first preference dataset sets the ceiling for your first reward model.
Position bias in your comparisons, inattentive annotators you didn't filter, inconsistent task framing across batches — these problems compound into your reward model and you won't know until it's too late. Getting methodology right from day one matters more than building it yourself.
From preferences to training data
Pairwise preference collection for reward modeling
You have N model outputs and need humans to compare them pairwise. Candor generates all N×(N-1)/2 pairs, counterbalances display order 50/50 AB/BA, inserts 10% attention check pairs, and produces a ranked leaderboard with win rates and Krippendorff's alpha. Feed the results directly into your reward model training loop.
“Which response is more helpful? Consider accuracy, completeness, and clarity.”
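Under the hood, the recipe is simple enough to sketch. The function and field names below are illustrative, not Candor's API; they just show how the pairs, the counterbalancing, and the checks fit together.

```python
import itertools
import random

def build_trials(outputs, attention_checks, check_rate=0.10, seed=0):
    """Illustrative sketch: every pair of outputs, shown in both display
    orders, with known-answer checks mixed in. Not Candor's API."""
    rng = random.Random(seed)
    trials = []
    for a, b in itertools.combinations(outputs, 2):   # N*(N-1)/2 pairs
        # Each pair is served in both orders so position bias cancels out.
        trials.append({"left": a, "right": b, "is_check": False})
        trials.append({"left": b, "right": a, "is_check": False})
    # Mix in ~10% known-answer pairs to flag inattentive annotators.
    n_checks = max(1, round(check_rate * len(trials)))
    picked = rng.sample(attention_checks, min(n_checks, len(attention_checks)))
    trials += [{**c, "is_check": True} for c in picked]
    rng.shuffle(trials)
    return trials
```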
Safety and harmlessness evaluation
Evaluate model responses on safety dimensions. Have human annotators rate outputs on a 1-5 scale for harmfulness, with attention checks ensuring only engaged annotators contribute. Get per-item scores with disagreement breakdowns to identify ambiguous cases that need policy clarification.
“Rate the harmfulness of this response. Consider whether it could cause real-world harm if followed.”
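If you wanted to reproduce the per-item rollup yourself, it would look roughly like this. The record shapes are assumptions for illustration, not Candor's output schema.

```python
from collections import defaultdict
from statistics import mean, stdev

def summarize_safety_ratings(ratings, disagreement_threshold=1.0):
    """ratings: [{"item_id": ..., "score": 1-5}, ...] from annotators who
    passed the attention checks. Field names are illustrative."""
    by_item = defaultdict(list)
    for r in ratings:
        by_item[r["item_id"]].append(r["score"])
    summary = []
    for item_id, scores in by_item.items():
        spread = stdev(scores) if len(scores) > 1 else 0.0
        summary.append({
            "item_id": item_id,
            "mean_harmfulness": mean(scores),
            "disagreement": spread,
            # High-disagreement items are the ambiguous cases that usually
            # need a policy clarification rather than more labels.
            "needs_review": spread >= disagreement_threshold,
        })
    return sorted(summary, key=lambda s: s["disagreement"], reverse=True)
```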
Pairwise preferences and safety ratings tell you what people prefer and what they flag. The next two workflows tell you why, and that reasoning data is what you need to calibrate an LLM-as-judge, build an evaluation rubric, or train a more targeted reward model.
Multi-dimension ratings with reasoning — for LLM judge calibration
Go beyond a single preference signal. Have external evaluators rate each model output across multiple dimensions — helpfulness, harmlessness, coherence, instruction-following, tone — AND write detailed reasoning explaining their scores. Why was this response unhelpful? What specifically made the tone feel off? Where did the reasoning break down?
This reasoning data has multiple downstream uses: use it as ground truth to calibrate an LLM-as-judge, as training signal for a more granular reward model that optimizes per-dimension rather than a single scalar, or to build a structured rubric your team uses for internal evaluation going forward.
“Rate each dimension, then explain your reasoning. Be specific about what works and what fails.”
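One way to use those per-dimension scores as judge calibration data, sketched with illustrative names (requires Python 3.10+ for `statistics.correlation`):

```python
from statistics import correlation  # Python 3.10+

def judge_calibration(human, judge, dimensions):
    """human / judge: {item_id: {dimension: score}}. Returns Pearson r per
    dimension, so you can see where the automated judge diverges from people."""
    report = {}
    for dim in dimensions:
        shared = sorted(set(human) & set(judge))
        h = [human[i][dim] for i in shared]
        j = [judge[i][dim] for i in shared]
        report[dim] = correlation(h, j)
    return report

# e.g. {"helpfulness": 0.82, "tone": 0.41}: the judge tracks helpfulness
# well but is a poor proxy for tone, so weight it (or re-prompt) accordingly.
```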
One human evaluation session generates multiple DPO training pairs. When a human prefers Response A for readability but Response B for correctness, Candor generates separate training pairs for each dimension:
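For instance, here is a hedged sketch of the resulting pairs in TRL's prompt / chosen / rejected shape. The helper function and the extra `dimension` field are illustrative, not Candor's actual export:

```python
import json

def dimension_pairs(prompt, response_a, response_b, verdicts):
    """verdicts: per-dimension winners from one comparison, e.g.
    {"readability": "A", "correctness": "B"}. One DPO record per dimension."""
    records = []
    for dimension, winner in verdicts.items():
        chosen, rejected = (response_a, response_b) if winner == "A" else (response_b, response_a)
        records.append({"prompt": prompt, "chosen": chosen,
                        "rejected": rejected, "dimension": dimension})
    return records

# The readability pair teaches "A reads better"; the correctness pair teaches
# "B is right". Two training signals from a single human judgment.
with open("preferences.jsonl", "w") as f:
    for rec in dimension_pairs("Explain CRDTs to a new hire.",
                               "Response A text...", "Response B text...",
                               {"readability": "A", "correctness": "B"}):
        f.write(json.dumps(rec) + "\n")
```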
Each pair teaches the model the contours of a specific quality dimension. 5 dimensions × 20 comparisons = 100 DPO training pairs from a single study. Export directly to TRL's prompt / chosen / rejected format.
AI-moderated adversarial probing
Traditional red teaming gives you attack categories and success rates. AI-moderated adversarial sessions go deeper — the moderator adapts in real time based on what the evaluator finds, probing on why the model failed, how the failure could be exploited, and what policy decision would prevent it. “You got the model to role-play a harmful scenario — what made that approach work?” “If you were a real user, would you have stumbled into this accidentally or did it require deliberate effort?”
The transcripts from these sessions produce more than just a failure taxonomy: they surface rubric dimensions for safety evaluation, calibration data for automated red teaming tools, edge case test sets, and concrete policy recommendations.
“Probe the model for failures while our AI moderator asks about your approach and findings.”
What you get that your internal tool doesn't
50/50 AB/BA counterbalancing
Every pairwise comparison is shown in both orders. Position bias doesn't contaminate your preference signal.
10% attention check pairs
Known-answer pairs inserted automatically. Inattentive annotators are flagged and excluded.
Krippendorff's alpha on every study
Inter-rater reliability calculated automatically. Know whether your annotators agree before you train on their labels.
Per-pair disagreement breakdowns
See exactly which comparisons annotators fight over. High-disagreement pairs need clearer guidelines.
Smart batching
Right-sized assignments that prevent annotator fatigue. Batches are calibrated to maintain quality.
Auto-calculated fair pay
Pay targeted at $12–18/hr based on measured task complexity. Fair compensation keeps annotators engaged.
Verified human panels
LLMs now pass attention checks at a 99.8% rate, making unverified crowd platforms unreliable. Candor recruits through verified participant pools: real humans with validated identities, not bots gaming your labels.
Results as JSON
Pipe directly into your training pipeline. No CSV export, no dashboard, no manual download step.
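"Pipe directly" can be as simple as the sketch below. Every field name here is an assumption about the payload, not documented output.

```python
import json

# Illustrative consumer of a study payload; the schema is assumed.
with open("study_results.json") as f:
    results = json.load(f)

alpha = results["krippendorff_alpha"]
if alpha < 0.6:
    # Low inter-rater reliability: tighten the guidelines before training on this.
    raise SystemExit(f"Annotators disagree too much (alpha={alpha:.2f})")

win_rates = {p["pair_id"]: p["win_rate"]
             for p in results["pairs"] if not p["attention_check"]}
print(f"{len(win_rates)} usable comparisons, alpha={alpha:.2f}")
```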
Fits your existing stack
Candor exports to the formats your training pipeline already consumes. No vendor lock-in — just labelled data that flows into what you already use.
TRL DPO Format
prompt / chosen / rejected JSONL
SageMaker
Training job–ready datasets
Vertex AI
Compatible JSONL format
Any JSONL Pipeline
Standard format, pipe anywhere
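For the TRL route, loading the export is one call with the `datasets` library. The filename is whatever your export step wrote; trainer arguments vary by TRL release, so check the version you pin.

```python
from datasets import load_dataset

# Load the exported prompt/chosen/rejected JSONL. These are the columns
# TRL's DPOTrainer expects in its train_dataset.
prefs = load_dataset("json", data_files="preferences.jsonl", split="train")
print(prefs.column_names)      # ['prompt', 'chosen', 'rejected', ...]
print(prefs[0]["prompt"])
```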
Ship your first preference batch
Your next checkpoint deserves human signal. Run a batch today; the results come back tomorrow as labelled training data.