Red Teaming

Use human reviewers to evaluate AI systems for safety, bias, and policy compliance. Classify model outputs against your safety taxonomy, or have reviewers probe for adversarial failures. Essential for responsible AI deployment and compliance.

Safety Classification

You have AI-generated outputs and want them classified for safety. Human reviewers flag harmful, biased, or policy-violating content using your custom taxonomy: safe, unsafe, ambiguous, or any categories you define. You get label distributions and reviewer agreement scores.

Participant view
Safe
Unsafe
Ambiguous
Learn more about Categorization
How you'd run it
$ candor study create --goal "classify outputs for safety" \
  --items "model-outputs.csv" --task categorize \
  --labels "safe,unsafe,ambiguous" --recruit --participants 6
What you get back
Distribution (500 outputs):
Safe: 81.4% (407) ████████████████░░░░
Ambiguous: 12.2% (61) ██░░░░░░░░░░░░░░░░░░
Unsafe: 6.4% (32) █░░░░░░░░░░░░░░░░░░░
Highest disagreement: political topics (κ: 0.48)
Fleiss' κ overall: 0.73 (substantial) · 6 reviewers
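
If you export the raw per-reviewer labels, the distribution and agreement numbers above are straightforward to reproduce. The sketch below computes Fleiss' κ from its definition in plain Python; the in-memory label format is an assumption for illustration, not Candor's export schema.

from collections import Counter

def fleiss_kappa(label_matrix, categories):
    # label_matrix: one list of reviewer labels per item (assumed format),
    # e.g. [["safe", "safe", "ambiguous", ...], ...], same reviewer count per item.
    n = len(label_matrix[0])   # reviewers per item
    N = len(label_matrix)      # number of items
    # counts[i][j]: how many reviewers gave item i the j-th category
    counts = [[Counter(row)[c] for c in categories] for row in label_matrix]
    # mean observed per-item agreement
    P_bar = sum((sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts) / N
    # chance agreement from the overall label proportions
    p = [sum(row[j] for row in counts) / (N * n) for j in range(len(categories))]
    P_e = sum(x * x for x in p)
    return (P_bar - P_e) / (1 - P_e)

# toy check: two items, six reviewers each
ratings = [["safe"] * 6, ["safe", "safe", "safe", "ambiguous", "ambiguous", "unsafe"]]
print(fleiss_kappa(ratings, ["safe", "unsafe", "ambiguous"]))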

Adversarial Prompt Testing

You want to know how your AI system holds up against adversarial prompts. Recruited reviewers probe for failure modes and write detailed assessments — identifying unexpected outputs, jailbreaks, and potential safety risks.

Participant view
Describe what you see...
Learn more about Free Text
How you'd run it
$ claude "recruit 5 reviewers to probe our chatbot for adversarial failures"
What you get back
Failure modes identified (5 reviewers, 40 probes):
Jailbreak via role-play — 3/5 succeeded
Data extraction attempts — 1/5 partial leak
Instruction override — 2/5 bypassed guardrails
Bias elicitation — 0/5 (robust)
40 probes · 12 unique failure modes · 5 critical
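
Reviewer write-ups are free text, but if you also keep a structured record per probe, a summary like the one above is a small aggregation job. The record fields below (reviewer, mode, succeeded, severity) are assumed for illustration rather than a defined export format.

from collections import defaultdict

def summarize_probes(probes):
    # probes: list of dicts with assumed fields, e.g.
    # {"reviewer": "r1", "mode": "jailbreak via role-play",
    #  "succeeded": True, "severity": "critical"}
    by_mode = defaultdict(set)   # failure mode -> reviewers whose probe succeeded
    critical = 0
    for p in probes:
        succeeded = by_mode[p["mode"]]   # registers the mode on first sight
        if p["succeeded"]:
            succeeded.add(p["reviewer"])
        if p.get("severity") == "critical":
            critical += 1
    reviewers = len({p["reviewer"] for p in probes})
    for mode, who in sorted(by_mode.items()):
        print(f"{mode} - {len(who)}/{reviewers} succeeded")
    print(f"{len(probes)} probes · {len(by_mode)} unique failure modes · {critical} critical")

summarize_probes([
    {"reviewer": "r1", "mode": "jailbreak via role-play", "succeeded": True, "severity": "critical"},
    {"reviewer": "r2", "mode": "jailbreak via role-play", "succeeded": False, "severity": "low"},
])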