Generative Video

Human preference data and quality evaluation for teams training generative video models. Collect A/B preference pairs for DPO training, compare model checkpoints for go/no-go decisions, diagnose quality across dimensions like motion and realism, and bootstrap VLM judges with richly annotated human reasoning — all exported as structured data your training pipeline can consume.

Preference Data for Model Training

You have generated video clips and want A/B preference data for your training pipeline. Evaluators watch two clips and pick the better one, producing structured JSONL preference pairs ready for DPO or RLHF. Agreement metrics let you filter noisy labels before they hit your reward model.

Participant view: two clips side by side, with A / Tie / B response options (Pairwise Comparison task)
How you'd run it
$ claude "collect A/B preferences on generated videos for DPO training"
What you get back
Export: preferences_study_vp4d2.jsonl
360 preference pairs (24 prompts × 15 evaluators)
Format: {prompt, chosen_video, rejected_video, annotator_id}
Quality metrics:
Agreement: 0.78 (substantial)
Flagged as low-effort: 1.4% (auto-filtered)
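
The export schema above maps directly onto the (prompt, chosen, rejected) triples most DPO trainers expect. A minimal sketch of that conversion, assuming the field names shown in the export format (the file path and helper names are illustrative, not part of the tool):

```python
import json

def load_preferences(path):
    """Read a JSONL preference export, one JSON object per line."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def to_dpo_records(pairs):
    """Map exported pairs onto the (prompt, chosen, rejected) triples
    that DPO training pipelines typically consume."""
    return [
        {
            "prompt": p["prompt"],
            "chosen": p["chosen_video"],
            "rejected": p["rejected_video"],
        }
        for p in pairs
    ]
```

Annotator IDs are preserved in the raw export, so per-annotator filtering (for example, dropping labels from evaluators flagged as low-effort) can happen before the conversion step.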

Model Checkpoint Evaluation

You want to know if your new model checkpoint is better than the last one. Evaluators judge randomized pairs across your prompt set, producing win rates with statistical confidence. Detect regressions on specific prompt categories before they reach production.

Participant view: two clips side by side, with A / Tie / B response options (Pairwise Comparison task)
How you'd run it
$ candor study create --goal "compare checkpoint-v2 vs v3" \
--items-a "v2/*.mp4" --items-b "v3/*.mp4" \
--task compare --recruit --participants 10
What you get back
Checkpoint   Win rate   95% CI          Verdict
v3           63.7%      [58.2, 69.1]    ✓ Ship
v2           36.3%      [30.9, 41.8]
Regressions detected:
Slow-motion prompts: v2 preferred in 58% of comparisons (p=0.04)
All other categories: v3 preferred · 10 evaluators
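
The tool reports the confidence interval for you; for intuition, a comparable interval can be computed from raw win counts with a Wilson score interval. A minimal sketch (ties are assumed to be dropped before counting; the function name is illustrative):

```python
import math

def wilson_ci(wins, total, z=1.96):
    """Wilson score confidence interval for a binomial win rate.
    z=1.96 gives a 95% interval."""
    if total == 0:
        return (0.0, 0.0)
    p = wins / total
    denom = 1 + z**2 / total
    centre = p + z**2 / (2 * total)
    margin = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2))
    return ((centre - margin) / denom, (centre + margin) / denom)
```

If the lower bound of the new checkpoint's interval sits above 0.5, the win rate is distinguishable from a coin flip at that confidence level, which is the kind of evidence a ship/no-ship verdict rests on.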

Quality Dimension Diagnostics

You want to know exactly where your video model is weakest. Rate generated videos on individual quality axes — motion smoothness, visual realism, temporal consistency, aesthetics, and prompt adherence. Per-dimension score distributions tell you what to prioritize in your next training run.

Participant view: 1–5 rating scale per quality dimension (Rating Scale task)
How you'd run it
$ claude "rate videos on motion, realism, aesthetics, prompt adherence"
What you get back
Dimension           Mean   Std dev   Weakest prompt category
Motion smoothness   3.2    1.1       Fast action scenes
Visual realism      4.1    0.7       Close-up faces
Aesthetics          4.3    0.6       (none; consistent)
Prompt adherence    3.8    0.9       Multi-object scenes
20 videos · 12 evaluators · Priority: motion smoothness
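
The per-dimension aggregation behind a table like this is straightforward to reproduce from raw ratings. A minimal sketch, assuming each rating record carries a dimension name and a 1–5 score (the record shape and function name are illustrative):

```python
from collections import defaultdict
from statistics import mean, stdev

def dimension_summary(ratings):
    """Group raw ratings by dimension, compute mean and std dev per
    dimension, and flag the lowest-scoring dimension as the priority.
    ratings: list of {"dimension": str, "score": int} records."""
    by_dim = defaultdict(list)
    for r in ratings:
        by_dim[r["dimension"]].append(r["score"])
    summary = {
        d: (mean(scores), stdev(scores) if len(scores) > 1 else 0.0)
        for d, scores in by_dim.items()
    }
    weakest = min(summary, key=lambda d: summary[d][0])
    return summary, weakest
```

Slicing the same records by prompt category instead of dimension yields the "weakest prompt category" column.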

Bootstrapping an AI Judge

You want to train a VLM that evaluates video quality at scale. Reviewers pick winners and explain their reasoning, producing annotated preference data you can use to fine-tune or prompt-engineer an automated judge. Validate judge accuracy against the human baseline with built-in agreement metrics.

Participant view: two clips side by side, with A / Tie / B response options (Pairwise Comparison task)
How you'd run it
$ claude "collect preferences with reasoning to calibrate a VLM judge"
What you get back
Export: annotated_prefs_study_jb3w5.jsonl
800 annotated pairs (40 prompts × 20 evaluators)
Fields: {prompt, chosen, rejected, reasoning_text}
Reasoning quality: avg 38 words, 94% mention specific criteria
Top cited criteria: motion (67%), realism (54%), coherence (41%)
Ready for VLM fine-tuning or few-shot prompt calibration
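
For few-shot prompt calibration, one common tactic is to select exemplar pairs whose reasoning cites concrete criteria, since those teach the judge what to look for. A minimal sketch under that assumption (the criterion list echoes the top-cited criteria above; the function name and selection heuristic are illustrative):

```python
CRITERIA = ("motion", "realism", "coherence")

def select_exemplars(pairs, k=4):
    """Pick up to k annotated pairs as few-shot exemplars for a VLM
    judge prompt, preferring reasoning that cites more named criteria.
    pairs: list of {prompt, chosen, rejected, reasoning_text} records."""
    def n_criteria(p):
        text = p["reasoning_text"].lower()
        return sum(c in text for c in CRITERIA)
    cited = [p for p in pairs if n_criteria(p) > 0]
    cited.sort(key=n_criteria, reverse=True)
    return cited[:k]
```

The held-out pairs then serve as the human baseline: run the calibrated judge over them and measure how often its verdict matches the human-chosen side.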