Red Teaming

Use human reviewers to evaluate AI systems for safety, bias, and policy compliance. Classify model outputs against your safety taxonomy, or have reviewers probe for adversarial failures. Essential for responsible AI deployment and compliance.

Safety Classification

You have AI-generated outputs and want them classified for safety. Human reviewers flag harmful, biased, or policy-violating content using your custom taxonomy: safe, unsafe, ambiguous, or any categories you define. You get label distributions and reviewer agreement scores.

Participant view
Safe
Unsafe
Ambiguous
Learn more about Categorization
How you'd run it
$ candor study create --goal "classify outputs for safety" \
  --items "model-outputs.csv" --task categorize \
  --labels "safe,unsafe,ambiguous" --recruit --participants 6
What you get back
Distribution (500 outputs):
Safe: 81.4% (407) ████████████████░░░░
Ambiguous: 12.2% (61) ██░░░░░░░░░░░░░░░░░░
Unsafe: 6.4% (32) █░░░░░░░░░░░░░░░░░░░
Highest disagreement: political topics (κ: 0.48)
Fleiss' κ overall: 0.73 (substantial) · 6 reviewers
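
If you export the raw per-reviewer labels, the distribution and agreement numbers above are straightforward to reproduce. The sketch below computes Fleiss' κ from its definition in plain Python; the in-memory label format is an assumption for illustration, not Candor's export schema.

from collections import Counter

def fleiss_kappa(label_matrix, categories):
    # label_matrix: one list of reviewer labels per item (assumed format),
    # e.g. [["safe", "safe", "ambiguous", ...], ...], same reviewer count per item.
    n = len(label_matrix[0])   # reviewers per item
    N = len(label_matrix)      # number of items
    # counts[i][j]: how many reviewers gave item i the j-th category
    counts = [[Counter(row)[c] for c in categories] for row in label_matrix]
    # mean observed per-item agreement
    P_bar = sum((sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts) / N
    # chance agreement from the overall label proportions
    p = [sum(row[j] for row in counts) / (N * n) for j in range(len(categories))]
    P_e = sum(x * x for x in p)
    return (P_bar - P_e) / (1 - P_e)

# toy check: two items, six reviewers each
ratings = [["safe"] * 6, ["safe", "safe", "safe", "ambiguous", "ambiguous", "unsafe"]]
print(fleiss_kappa(ratings, ["safe", "unsafe", "ambiguous"]))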

Adversarial Prompt Testing

You want to know how your AI system holds up against adversarial prompts. Recruited reviewers probe for failure modes and write detailed assessments — identifying unexpected outputs, jailbreaks, and potential safety risks.

Participant view
Describe what you see...
Learn more about Free Text
How you'd run it
$ claude "recruit 5 reviewers to probe our chatbot for adversarial failures"
What you get back
Failure modes identified (5 reviewers, 40 probes):
Jailbreak via role-play — 3/5 succeeded
Data extraction attempts — 1/5 partial leak
Instruction override — 2/5 bypassed guardrails
Bias elicitation — 0/5 (robust)
40 probes · 12 unique failure modes · 5 critical
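
Reviewer write-ups are free text, but if you also keep a structured record per probe, a summary like the one above is a small aggregation job. The record fields below (reviewer, mode, succeeded, severity) are assumed for illustration rather than a defined export format.

from collections import defaultdict

def summarize_probes(probes):
    # probes: list of dicts with assumed fields, e.g.
    # {"reviewer": "r1", "mode": "jailbreak via role-play",
    #  "succeeded": True, "severity": "critical"}
    by_mode = defaultdict(set)   # failure mode -> reviewers whose probe succeeded
    critical = 0
    for p in probes:
        succeeded = by_mode[p["mode"]]   # registers the mode on first sight
        if p["succeeded"]:
            succeeded.add(p["reviewer"])
        if p.get("severity") == "critical":
            critical += 1
    reviewers = len({p["reviewer"] for p in probes})
    for mode, who in sorted(by_mode.items()):
        print(f"{mode} - {len(who)}/{reviewers} succeeded")
    print(f"{len(probes)} probes · {len(by_mode)} unique failure modes · {critical} critical")

summarize_probes([
    {"reviewer": "r1", "mode": "jailbreak via role-play", "succeeded": True, "severity": "critical"},
    {"reviewer": "r2", "mode": "jailbreak via role-play", "succeeded": False, "severity": "low"},
])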