Generative Video
Human preference data and quality evaluation for teams training generative video models. Collect A/B preference pairs for DPO training, compare model checkpoints for go/no-go decisions, diagnose quality across dimensions like motion and realism, and bootstrap VLM judges with richly annotated human reasoning — all exported as structured data your training pipeline can consume.
Preference Data for Model Training
You have generated video clips and want A/B preference data for your training pipeline. Evaluators watch two clips and pick the better one, producing structured JSONL preference pairs ready for DPO or RLHF. Agreement metrics let you filter noisy labels before they hit your reward model.
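To make that concrete, here is a minimal sketch of consuming such an export. The field names (`pair_id`, `video_a`, `votes`, and so on) are illustrative assumptions, not the product's actual schema; the point is the shape of the workflow: parse JSONL, filter pairs where annotators disagreed, emit (chosen, rejected) tuples for DPO.

```python
import json
from collections import Counter

# Hypothetical JSONL export; field names are illustrative, not the real schema.
RAW = """\
{"pair_id": "p1", "prompt": "a dog surfing", "video_a": "ckpt1/p1.mp4", "video_b": "ckpt2/p1.mp4", "votes": ["a", "a", "a"]}
{"pair_id": "p2", "prompt": "city at dusk", "video_a": "ckpt1/p2.mp4", "video_b": "ckpt2/p2.mp4", "votes": ["a", "b", "a"]}
{"pair_id": "p3", "prompt": "melting clock", "video_a": "ckpt1/p3.mp4", "video_b": "ckpt2/p3.mp4", "votes": ["a", "b"]}
"""

def dpo_pairs(jsonl_text, min_agreement=0.75):
    """Yield (prompt, chosen, rejected) tuples, dropping low-agreement pairs."""
    for line in jsonl_text.splitlines():
        rec = json.loads(line)
        votes = Counter(rec["votes"])
        winner, count = votes.most_common(1)[0]
        if count / len(rec["votes"]) < min_agreement:
            continue  # noisy label: annotators disagreed too much to trust
        chosen = rec["video_a"] if winner == "a" else rec["video_b"]
        rejected = rec["video_b"] if winner == "a" else rec["video_a"]
        yield rec["prompt"], chosen, rejected

pairs = list(dpo_pairs(RAW))  # only the unanimous pair survives the filter
```

The agreement threshold is the knob: raise it for a cleaner but smaller dataset, lower it for coverage at the cost of label noise reaching the reward model.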
Model Checkpoint Evaluation
You want to know if your new model checkpoint is better than the last one. Evaluators judge randomized pairs across your prompt set, producing win rates with statistical confidence. Detect regressions on specific prompt categories before they reach production.
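A "win rate with statistical confidence" typically means putting an interval around the binomial win proportion before calling a go/no-go. As a sketch (the counts below are made up), a Wilson score interval keeps the decision honest at small sample sizes:

```python
import math

def wilson_interval(wins, n, z=1.96):
    """95% Wilson score interval for a binomial win rate."""
    if n == 0:
        return (0.0, 1.0)
    p = wins / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (center - margin, center + margin)

# Hypothetical result: new checkpoint wins 117 of 200 randomized comparisons.
lo, hi = wilson_interval(117, 200)

# Go/no-go rule: promote only if even the lower bound clears 50%.
ship = lo > 0.5
```

Running the same check per prompt category, rather than only in aggregate, is what surfaces regressions that a single overall win rate would average away.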
Quality Dimension Diagnostics
You want to know exactly where your video model is weakest. Rate generated videos on individual quality axes — motion smoothness, visual realism, temporal consistency, aesthetics, and prompt adherence. Per-dimension score distributions tell you what to prioritize in your next training run.
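A per-dimension score distribution reduces to a group-by over the ratings export. The rows and dimension names below are invented for illustration, but the aggregation pattern is the same for any axis set:

```python
from statistics import mean, stdev

# Hypothetical ratings export: one row per (video, dimension) on a 1-5 scale.
ratings = [
    {"video": "v1", "dimension": "motion_smoothness", "score": 2},
    {"video": "v2", "dimension": "motion_smoothness", "score": 3},
    {"video": "v3", "dimension": "motion_smoothness", "score": 2},
    {"video": "v1", "dimension": "prompt_adherence", "score": 4},
    {"video": "v2", "dimension": "prompt_adherence", "score": 5},
    {"video": "v3", "dimension": "prompt_adherence", "score": 4},
]

def dimension_summary(rows):
    """Group scores by quality dimension and summarize each distribution."""
    by_dim = {}
    for r in rows:
        by_dim.setdefault(r["dimension"], []).append(r["score"])
    return {
        dim: {"mean": mean(scores), "stdev": stdev(scores), "n": len(scores)}
        for dim, scores in by_dim.items()
    }

summary = dimension_summary(ratings)
weakest = min(summary, key=lambda d: summary[d]["mean"])
```

The dimension with the lowest mean (here, motion smoothness) is the candidate to prioritize; the standard deviation tells you whether the weakness is uniform or driven by a subset of prompts.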
Bootstrapping an AI Judge
You want to train a VLM that evaluates video quality at scale. Reviewers pick winners and explain their reasoning, producing annotated preference data you can use to fine-tune or prompt-engineer an automated judge. Validate judge accuracy against the human baseline with built-in agreement metrics.
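One common agreement metric for validating a judge against the human baseline is Cohen's kappa, which corrects raw accuracy for chance agreement. A minimal sketch, with made-up labels for which clip each rater preferred:

```python
from collections import Counter

def cohens_kappa(human, judge):
    """Cohen's kappa between human-preferred winners and a VLM judge's picks."""
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    h_counts, j_counts = Counter(human), Counter(judge)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(h_counts[c] * j_counts[c] for c in set(human) | set(judge)) / n**2
    return (observed - expected) / (1 - expected)

# Hypothetical labels: which clip ("a" or "b") was preferred on 10 pairs.
human = ["a", "a", "b", "a", "b", "b", "a", "a", "b", "a"]
judge = ["a", "a", "b", "b", "b", "b", "a", "a", "a", "a"]
kappa = cohens_kappa(human, judge)
```

Here raw agreement is 80%, but kappa comes out near 0.58 because both raters favor "a" often enough that some agreement is expected by chance; tracking kappa rather than raw accuracy keeps a lopsided judge from looking better than it is.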