Generative Video

Human preference data and quality evaluation for teams training generative video models. Collect A/B preference pairs for DPO training, compare model checkpoints for go/no-go decisions, diagnose quality across dimensions like motion and realism, and bootstrap VLM judges with richly annotated human reasoning — all exported as structured data your training pipeline can consume.

Preference Data for Model Training

You have generated video clips and want A/B preference data for your training pipeline. Evaluators watch two clips and pick the better one, producing structured JSONL preference pairs ready for DPO or RLHF. Agreement metrics let you filter noisy labels before they hit your reward model.

Participant view: two clips side by side, with A / Tie / B response options (Pairwise Comparison task)
How you'd run it
$ claude "collect A/B preferences on generated videos for DPO training"
What you get back
Export: preferences_study_vp4d2.jsonl
360 preference pairs (24 prompts × 15 evaluators)
Format: {prompt, chosen_video, rejected_video, annotator_id}
Quality metrics:
Agreement: 0.78 (substantial)
Flagged as low-effort: 1.4% (auto-filtered)
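
The export schema above maps directly onto the (prompt, chosen, rejected) triples most DPO trainers expect. A minimal sketch of that conversion, assuming the field names shown in the export format (the file path and helper names are illustrative, not part of the tool):

```python
import json

def load_preferences(path):
    """Read a JSONL preference export, one JSON object per line."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def to_dpo_records(pairs):
    """Map exported pairs onto the (prompt, chosen, rejected) triples
    that DPO training pipelines typically consume."""
    return [
        {
            "prompt": p["prompt"],
            "chosen": p["chosen_video"],
            "rejected": p["rejected_video"],
        }
        for p in pairs
    ]
```

Annotator IDs are preserved in the raw export, so per-annotator filtering (for example, dropping labels from evaluators flagged as low-effort) can happen before the conversion step.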

Model Checkpoint Evaluation

You want to know if your new model checkpoint is better than the last one. Evaluators judge randomized pairs across your prompt set, producing win rates with statistical confidence. Detect regressions on specific prompt categories before they reach production.

Participant view: two clips side by side, with A / Tie / B response options (Pairwise Comparison task)
How you'd run it
$ candor study create --goal "compare checkpoint-v2 vs v3" \
--items-a "v2/*.mp4" --items-b "v3/*.mp4" \
--task compare --recruit --participants 10
What you get back
Checkpoint   Win rate   95% CI          Verdict
v3           63.7%      [58.2, 69.1]    ✓ Ship
v2           36.3%      [30.9, 41.8]
Regressions detected:
Slow-motion prompts: v2 preferred in 58% of comparisons (p=0.04)
All other categories: v3 preferred · 10 evaluators
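
The tool reports the confidence interval for you; for intuition, a comparable interval can be computed from raw win counts with a Wilson score interval. A minimal sketch (ties are assumed to be dropped before counting; the function name is illustrative):

```python
import math

def wilson_ci(wins, total, z=1.96):
    """Wilson score confidence interval for a binomial win rate.
    z=1.96 gives a 95% interval."""
    if total == 0:
        return (0.0, 0.0)
    p = wins / total
    denom = 1 + z**2 / total
    centre = p + z**2 / (2 * total)
    margin = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2))
    return ((centre - margin) / denom, (centre + margin) / denom)
```

If the lower bound of the new checkpoint's interval sits above 0.5, the win rate is distinguishable from a coin flip at that confidence level, which is the kind of evidence a ship/no-ship verdict rests on.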

Quality Dimension Diagnostics

You want to know exactly where your video model is weakest. Rate generated videos on individual quality axes — motion smoothness, visual realism, temporal consistency, aesthetics, and prompt adherence. Per-dimension score distributions tell you what to prioritize in your next training run.

Participant view: 1–5 rating scale per quality dimension (Rating Scale task)
How you'd run it
$ claude "rate videos on motion, realism, aesthetics, prompt adherence"
What you get back
Dimension           Mean   Std dev   Weakest prompt category
Motion smoothness   3.2    1.1       Fast action scenes
Visual realism      4.1    0.7       Close-up faces
Aesthetics          4.3    0.6       (none; consistent)
Prompt adherence    3.8    0.9       Multi-object scenes
20 videos · 12 evaluators · Priority: motion smoothness
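
The per-dimension aggregation behind a table like this is straightforward to reproduce from raw ratings. A minimal sketch, assuming each rating record carries a dimension name and a 1–5 score (the record shape and function name are illustrative):

```python
from collections import defaultdict
from statistics import mean, stdev

def dimension_summary(ratings):
    """Group raw ratings by dimension, compute mean and std dev per
    dimension, and flag the lowest-scoring dimension as the priority.
    ratings: list of {"dimension": str, "score": int} records."""
    by_dim = defaultdict(list)
    for r in ratings:
        by_dim[r["dimension"]].append(r["score"])
    summary = {
        d: (mean(scores), stdev(scores) if len(scores) > 1 else 0.0)
        for d, scores in by_dim.items()
    }
    weakest = min(summary, key=lambda d: summary[d][0])
    return summary, weakest
```

Slicing the same records by prompt category instead of dimension yields the "weakest prompt category" column.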

Bootstrapping an AI Judge

You want to train a VLM that evaluates video quality at scale. Reviewers pick winners and explain their reasoning, producing annotated preference data you can use to fine-tune or prompt-engineer an automated judge. Validate judge accuracy against the human baseline with built-in agreement metrics.

Participant view: two clips side by side, with A / Tie / B response options (Pairwise Comparison task)
How you'd run it
$ claude "collect preferences with reasoning to calibrate a VLM judge"
What you get back
Export: annotated_prefs_study_jb3w5.jsonl
800 annotated pairs (40 prompts × 20 evaluators)
Fields: {prompt, chosen, rejected, reasoning_text}
Reasoning quality: avg 38 words, 94% mention specific criteria
Top cited criteria: motion (67%), realism (54%), coherence (41%)
Ready for VLM fine-tuning or few-shot prompt calibration
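
For few-shot prompt calibration, one common tactic is to select exemplar pairs whose reasoning cites concrete criteria, since those teach the judge what to look for. A minimal sketch under that assumption (the criterion list echoes the top-cited criteria above; the function name and selection heuristic are illustrative):

```python
CRITERIA = ("motion", "realism", "coherence")

def select_exemplars(pairs, k=4):
    """Pick up to k annotated pairs as few-shot exemplars for a VLM
    judge prompt, preferring reasoning that cites more named criteria.
    pairs: list of {prompt, chosen, rejected, reasoning_text} records."""
    def n_criteria(p):
        text = p["reasoning_text"].lower()
        return sum(c in text for c in CRITERIA)
    cited = [p for p in pairs if n_criteria(p) > 0]
    cited.sort(key=n_criteria, reverse=True)
    return cited[:k]
```

The held-out pairs then serve as the human baseline: run the calibrated judge over them and measure how often its verdict matches the human-chosen side.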