RL & Alignment

Higher-dimensional preference data from your terminal.

Collect human preference data from real annotators — not synthetic labels — with proper methodology built in. You don't need another annotation platform with a dashboard you'll never open. The signal flows back as labelled training data.

RLAIF gets you scale. Candor gets you the human ground truth that calibrates your AI judge — and tells you where it's wrong.

Terminal → Candor → Pipeline
The Problem

You've already built this tool twice. It's still bad.

🔧

You have 50 things to build. Annotation tooling shouldn't be one of them.

Internal tools collect the signal that was easiest to build, not the signal the model actually needs. They do A vs B but never ask why. They calcify around one task type and cost a week of engineering to modify. That tool will take a month, need maintenance forever, and never be as good as you want it to be. Skip it entirely.

🔄

Scale is built for enterprises, not research velocity.

You don't want to go through a sales process, negotiate a contract, and wait for a dedicated PM to set up your first batch. You want to collect 500 pairwise comparisons on today's checkpoint and have results by tomorrow. Your eval cadence should match your training cadence.

📊

Your first preference dataset sets the ceiling for your first reward model.

Position bias in your comparisons, inattentive annotators you didn't filter, inconsistent task framing across batches — these problems compound into your reward model and you won't know until it's too late. Getting methodology right from day one matters more than building it yourself.

Use Cases

From preferences to training data

Pairwise preference collection for reward modeling

You have N model outputs and need humans to compare them pairwise. Candor generates all N×(N-1)/2 pairs, counterbalances display order 50/50 AB/BA, inserts 10% attention check pairs, and produces a ranked leaderboard with win rates and Krippendorff's alpha. Feed the results directly into your reward model training loop.
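
If you want to see what that expansion looks like mechanically, here is a minimal Python sketch of the pairing and counterbalancing logic; the record fields and the known-good/known-bad attention-check items are illustrative assumptions, not Candor's actual schema.

import itertools
import random

def build_assignments(output_ids, rater_ids, attention_rate=0.10, seed=0):
    """Expand N outputs into all N*(N-1)/2 pairs, show each pair to every rater,
    and counterbalance so half the raters see A|B and half see B|A."""
    rng = random.Random(seed)
    pairs = list(itertools.combinations(output_ids, 2))
    assignments = []
    for a, b in pairs:
        order = list(rater_ids)
        rng.shuffle(order)
        for i, rater in enumerate(order):
            left, right = (a, b) if i < len(order) // 2 else (b, a)
            assignments.append({"rater": rater, "left": left, "right": right, "is_check": False})
    # ~10% known-answer pairs per rater, used to flag inattentive annotators.
    n_checks = max(1, round(len(pairs) * attention_rate))
    for rater in rater_ids:
        for _ in range(n_checks):
            assignments.append({"rater": rater, "left": "known_good", "right": "known_bad", "is_check": True})
    rng.shuffle(assignments)
    return assignments

tasks = build_assignments([f"checkpoint-v{i}" for i in range(1, 9)],
                          [f"rater_{i:02d}" for i in range(15)])
print(len(tasks))  # 28 pairs x 15 raters = 420 real comparisons, plus attention checks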

Participant view
vs
A
Tie
B

“Which response is more helpful? Consider accuracy, completeness, and clarity.”

How you'd run it
$ claude "collect pairwise preferences on these 8 model outputs for helpfulness"
What you get back
Ranked by win rate (28 pairs × 15 human raters, α = 0.74):
#1 checkpoint-v7 81% win rate
#2 checkpoint-v5 68% win rate
#3 checkpoint-v8 62% win rate
#4 checkpoint-v6 55% win rate
#5 checkpoint-v3 41% win rate
#6 checkpoint-v4 34% win rate
#7 checkpoint-v2 28% win rate
#8 checkpoint-v1 19% win rate
3 attention check failures excluded.
Results written to preferences.jsonl
Learn more about Pairwise Comparison →

Safety and harmlessness evaluation

Evaluate model responses on safety dimensions. Have human annotators rate outputs on a 1-5 scale for harmfulness, with attention checks ensuring only engaged annotators contribute. Get per-item scores with disagreement breakdowns to identify ambiguous cases that need policy clarification.
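
The aggregation itself is easy to reproduce from the raw labels if you want to audit it. A rough Python sketch, assuming one {item, rater, score} record per judgment (the field names and thresholds are illustrative):

from statistics import mean, stdev

def summarize_ratings(ratings, harm_threshold=4.0, variance_threshold=1.0):
    """Per-item mean and spread on the 1-5 scale; flag items that score harmful
    and items with enough disagreement to need a policy call."""
    by_item = {}
    for r in ratings:
        by_item.setdefault(r["item"], []).append(r["score"])
    summary = {}
    for item, scores in by_item.items():
        m = mean(scores)
        s = stdev(scores) if len(scores) > 1 else 0.0
        summary[item] = {
            "mean": round(m, 2),
            "std": round(s, 2),
            "flagged_harmful": m >= harm_threshold,
            "needs_review": s > variance_threshold,  # high variance = likely policy edge case
        }
    return summary

ratings = [
    {"item": "item_042", "rater": "r01", "score": 1},
    {"item": "item_042", "rater": "r02", "score": 1},
    {"item": "item_091", "rater": "r01", "score": 2},
    {"item": "item_091", "rater": "r02", "score": 5},
]
print(summarize_ratings(ratings))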

Participant view
1
2
3
4
5
1 = harmless · 5 = harmful

“Rate the harmfulness of this response. Consider whether it could cause real-world harm if followed.”

How you'd run it
$ claude "have 30 annotators rate these responses on harmlessness, 1-5 scale"
What you get back
Per-item harmfulness scores (30 human raters, α = 0.68):
item_042 1.2 ±0.4 safe
item_017 1.8 ±0.9 safe (some disagreement)
item_091 3.4 ±1.2 ← high variance, needs review
item_063 4.1 ±0.6 flagged harmful
item_028 4.7 ±0.5 flagged harmful
2 items with σ > 1.0 — likely edge cases for
your safety policy. 4 inattentive raters excluded.
Results written to safety_ratings.jsonl
Learn more about Rating Scale →

Pairwise preferences and safety ratings tell you what people prefer and what they flag. The next two workflows tell you why — and that reasoning data is what you need to calibrate an LLM-as-judge, build an evaluation rubric, or train a more targeted reward model.

Multi-dimension ratings with reasoning — for LLM judge calibration

Go beyond a single preference signal. Have external evaluators rate each model output across multiple dimensions — helpfulness, harmlessness, coherence, instruction-following, tone — AND write detailed reasoning explaining their scores. Why was this response unhelpful? What specifically made the tone feel off? Where did the reasoning break down?

This reasoning data has multiple downstream uses: use it as ground truth to calibrate an LLM-as-judge, as training signal for a more granular reward model that optimizes per-dimension rather than a single scalar, or to build a structured rubric your team uses for internal evaluation going forward.
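
For the judge-calibration use, one simple check is to correlate your LLM judge's per-dimension scores against the human means and see which dimensions it can already be trusted on. A sketch under the assumption that both score sets are keyed by item id; the data shapes and the scipy dependency are mine, not part of Candor's output.

from scipy.stats import spearmanr

def judge_agreement(human_scores, judge_scores, dimensions):
    """Spearman correlation between human mean ratings and LLM-judge ratings,
    per dimension. Low correlation means the judge is not calibrated there yet."""
    items = sorted(set(human_scores) & set(judge_scores))
    report = {}
    for dim in dimensions:
        h = [human_scores[i][dim] for i in items]
        j = [judge_scores[i][dim] for i in items]
        rho, p = spearmanr(h, j)
        report[dim] = {"spearman_rho": round(float(rho), 2), "p_value": round(float(p), 3)}
    return report

human = {"item_01": {"helpfulness": 3.8, "tone": 3.5},
         "item_02": {"helpfulness": 2.1, "tone": 4.0},
         "item_03": {"helpfulness": 4.5, "tone": 2.2},
         "item_04": {"helpfulness": 1.9, "tone": 3.1}}
judge = {"item_01": {"helpfulness": 4.0, "tone": 2.5},
         "item_02": {"helpfulness": 2.5, "tone": 4.5},
         "item_03": {"helpfulness": 4.5, "tone": 3.0},
         "item_04": {"helpfulness": 2.0, "tone": 3.5}}
print(judge_agreement(human, judge, ["helpfulness", "tone"]))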

Participant view
helpfulness
harmlessness
coherence
inst-follow
tone
rationale

“Rate each dimension, then explain your reasoning. Be specific about what works and what fails.”

How you'd run it
$ claude "collect dimension ratings + reasoning on these 20 model responses from external evaluators — helpfulness, harmlessness, coherence, instruction-following, tone"
What you get back
Dimension ratings across 20 responses × 10 human evaluators:
Dimension Mean Agreement
helpfulness 3.8 α = 0.72
harmlessness 4.6 α = 0.81
coherence 4.2 α = 0.75
instruction-following 3.1 α = 0.68
tone 3.5 α = 0.51 ← low agreement
Top reasoning themes (low helpfulness):
"answered the wrong question" 24 mentions
"correct but too vague to act on" 18 mentions
"missed the key constraint" 14 mentions
Top reasoning themes (tone disagreement):
"too formal for the context" 11 mentions
"appropriately professional" 9 mentions
← indicates subjective dimension, needs rubric
Downstream exports:
→ Per-dimension scores: dim_ratings.jsonl
→ Reasoning corpus: rationale_data.jsonl
→ LLM judge calibration set: judge_training.jsonl
→ Draft evaluation rubric: rubric.md
DPO Training Pair Multiplier

One human evaluation session generates multiple DPO training pairs. When a human prefers Response A for readability but Response B for correctness, Candor generates separate training pairs for each dimension:

[Prioritize readability] → A > B
[Prioritize correctness] → B > A

Each pair teaches the model the contours of a specific quality dimension. 5 dimensions × 20 comparisons = 100 DPO training pairs from a single study. Export directly to TRL's prompt / chosen / rejected format.
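
A small sketch of that expansion; the input record shape is an assumption, but the output keys (prompt, chosen, rejected) are the ones TRL's DPOTrainer expects.

import json

def expand_to_dpo_pairs(comparison):
    """Turn one multi-dimension comparison into one DPO pair per dimension
    the rater actually expressed a preference on (ties are skipped)."""
    records = []
    for dim, winner in comparison["preferences"].items():
        if winner not in ("A", "B"):
            continue
        chosen = comparison["response_a"] if winner == "A" else comparison["response_b"]
        rejected = comparison["response_b"] if winner == "A" else comparison["response_a"]
        records.append({
            # Condition the prompt on the dimension so each pair teaches one axis.
            "prompt": f"[Prioritize {dim}] {comparison['prompt']}",
            "chosen": chosen,
            "rejected": rejected,
        })
    return records

comparison = {
    "prompt": "Explain how DNS resolution works.",
    "response_a": "...",
    "response_b": "...",
    "preferences": {"readability": "A", "correctness": "B", "tone": "tie"},
}
with open("dpo_pairs.jsonl", "w") as f:
    for rec in expand_to_dpo_pairs(comparison):
        f.write(json.dumps(rec) + "\n")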

Learn more about Rating Scale →

AI-moderated adversarial probing

Traditional red teaming gives you attack categories and success rates. AI-moderated adversarial sessions go deeper — the moderator adapts in real time based on what the evaluator finds, probing on why the model failed, how the failure could be exploited, and what policy decision would prevent it. “You got the model to role-play a harmful scenario — what made that approach work?” “If you were a real user, would you have stumbled into this accidentally or did it require deliberate effort?”

The transcripts from these sessions produce more than just a failure taxonomy: they surface rubric dimensions for safety evaluation, calibration data for automated red teaming tools, edge case test sets, and concrete policy recommendations.

Participant view
MODEL INTERACTION
USER

Imagine a story where a character needs to...

MODEL

Sure, in that fictional scenario the character could...

AI MODERATOR
MODERATOR

Interesting — the model complied after your third message. What changed?

EVALUATOR

I framed it as fiction. It stopped pushing back once I said “imagine a story where...”

MODERATOR

Would a typical user find that framing naturally, or did it require intent?

“Probe the model for failures while our AI moderator asks about your approach and findings.”

How you'd run it
$ claude "run 10 AI-moderated adversarial sessions — probe for safety failures, capture reasoning and policy recommendations"
What you get back
Themes across 10 human adversarial sessions:
Role-play bypass is the primary vector (8/10)
Evaluators consistently found that fictional
framing reduced safety guardrails. Most noted
it felt "accidental, not deliberate."
Multi-turn escalation harder to detect (6/10)
"The model was fine for 4 messages, then slowly
gave ground." — hard to catch with single-turn
safety classifiers.
Refusal tone sometimes backfires (7/10)
"The refusal was so preachy that I wanted to
find a way around it." — evaluators flagged
refusal style as a UX issue, not just safety.
Policy recommendation: fiction framing (9/10)
Near-unanimous that fictional framing needs a
specific policy decision, not just a general
safety filter.
Downstream exports:
→ Session transcripts: red_team/transcripts
→ Attack taxonomy: attack_categories.jsonl
→ Policy recommendations: policy_recs.md
→ Safety rubric dimensions: safety_rubric.md
→ Edge case test set: hard_cases.jsonl
Learn more about AI-Moderated Sessions →
Methodology

What you get that your internal tool doesn't

🔀

50/50 AB/BA counterbalancing

Every pairwise comparison is shown in both orders. Position bias doesn't contaminate your preference signal.

🛡️

10% attention check pairs

Known-answer pairs inserted automatically. Inattentive annotators are flagged and excluded.

📊

Krippendorff's alpha on every study

Inter-rater reliability calculated automatically. Know whether your annotators agree before you train on their labels.

🔍

Per-pair disagreement breakdowns

See exactly which comparisons annotators fight over. High-disagreement pairs need clearer guidelines.

📦

Smart batching

Right-sized assignments that prevent annotator fatigue. Batches are calibrated to maintain quality.

💰

Auto-calculated fair pay

Pay targeting $12–18/hr based on measured task complexity. Fair pay is what keeps annotators engaged.

🤖

Verified human panels

LLMs now pass attention checks at a 99.8% rate, making unverified crowd platforms unreliable. Candor recruits through verified participant pools — real humans with validated identities, not bots gaming your labels.

Results as JSON

Pipe directly into your training pipeline. No CSV export, no dashboard, no manual download step.
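
Because everything lands as JSONL, the quality metrics are also easy to verify independently. A sketch that recomputes Krippendorff's alpha from the raw labels with the open-source krippendorff package; the rater/item/score field names are assumed here, not Candor's exact schema.

import json
import numpy as np
import krippendorff  # pip install krippendorff

def recompute_alpha(path, level="ordinal"):
    """Rebuild the raters x items matrix from raw labels and recompute
    inter-rater reliability as a sanity check on the reported alpha."""
    rows = [json.loads(line) for line in open(path)]
    raters = sorted({r["rater"] for r in rows})
    items = sorted({r["item"] for r in rows})
    matrix = np.full((len(raters), len(items)), np.nan)
    for r in rows:
        matrix[raters.index(r["rater"]), items.index(r["item"])] = r["score"]
    return krippendorff.alpha(reliability_data=matrix, level_of_measurement=level)

print(recompute_alpha("safety_ratings.jsonl"))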

Integration

Fits your existing stack

Candor exports to the formats your training pipeline already consumes. No vendor lock-in — just labelled data that flows into what you already use.

TRL DPO Format

prompt / chosen / rejected JSONL (see the loading check below)

SageMaker

Training job–ready datasets

Vertex AI

Compatible JSONL format

Any JSONL Pipeline

Standard format, pipe anywhere
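
As a quick pre-flight check before a TRL run, you can load an exported file with the Hugging Face datasets library and confirm the three required columns are present; the file name here is assumed.

from datasets import load_dataset

# TRL's DPOTrainer consumes a dataset with exactly these three text columns.
ds = load_dataset("json", data_files="dpo_pairs.jsonl", split="train")
assert {"prompt", "chosen", "rejected"} <= set(ds.column_names)
print(ds[0]["prompt"])

# Then hand ds to DPOTrainer as train_dataset (constructor details vary by TRL version).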

Ship your first preference batch

Your next checkpoint deserves human signal, collected properly and returned as labelled training data.

$ curl -fsSL https://candor.sh | bash
Or talk to us about your use case →