Annotation & Labeling

Build training datasets, audit model predictions, or label content for compliance. Define your own taxonomy and get human-verified classifications with label distributions and agreement metrics. Supports text, images, audio, and video.

Content Moderation

You have a batch of user-generated content and need it labeled for policy compliance. Reviewers categorize each item into your custom label set — safe, borderline, violation, or whatever you define. Get label distributions and agreement metrics per item, and scale to thousands when you need production-grade moderation data.

How you'd run it
$ candor study create --goal "label content for policy compliance" \
--items "posts.csv" --task categorize \
--labels "safe,borderline,violation" --recruit --participants 5
What you get back
Item       Safe   Borderline   Violation   Agreement
post_1041  100%   0%           0%          1.00
post_1042  20%    60%          20%         0.40
post_1043  0%     20%          80%         0.73
post_1044  80%    20%          0%          0.73
240 items labeled · Fleiss' κ: 0.68 (substantial)
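The κ score summarizes agreement across the whole batch, corrected for chance. Candor computes it for you, but for reference, a minimal sketch of the standard Fleiss' κ calculation from raw label counts (the per-item agreement figures candor reports may use a different normalization than the Fleiss per-item proportion used here):

```python
from collections import Counter

def fleiss_kappa(items):
    """items: one Counter per labeled item, mapping label -> number of
    reviewers who chose it. Assumes the same reviewer count per item."""
    n = sum(items[0].values())                 # reviewers per item
    N = len(items)                             # number of items
    labels = {l for item in items for l in item}
    # observed agreement: mean per-item proportion of agreeing reviewer pairs
    P_bar = sum((sum(c * c for c in item.values()) - n) / (n * (n - 1))
                for item in items) / N
    # chance agreement from the marginal label distribution
    p = {l: sum(item[l] for item in items) / (N * n) for l in labels}
    P_e = sum(v * v for v in p.values())
    return (P_bar - P_e) / (1 - P_e)

# two items, five reviewers each: one unanimous, one split 1/3/1
print(fleiss_kappa([Counter(safe=5),
                    Counter(safe=1, borderline=3, violation=1)]))  # ≈ 0.35
```

Low κ on a batch usually means the label definitions are ambiguous, not that the reviewers are careless; tightening the taxonomy before scaling up is cheaper than relabeling.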

Image Classification

You have a set of images and want them sorted into categories. Define your label taxonomy and get human-verified classifications with confidence scores and inter-annotator agreement, useful for ML training data, QA, or content tagging at scale.

How you'd run it
$ claude "classify these product images: electronics, clothing, home, food"
What you get back
Item          Label        Confidence   Agreement
IMG_2041.jpg  electronics  100%         5/5
IMG_2042.jpg  clothing     80%          4/5
IMG_2043.jpg  home         60%          3/5
IMG_2044.jpg  food         100%         5/5
500 images · Fleiss' κ: 0.84 (almost perfect)
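The Label, Confidence, and Agreement columns are a majority vote over the reviewers' raw labels. A hypothetical sketch of that aggregation (the `aggregate` helper and vote format are illustrative, not candor's API):

```python
from collections import Counter

def aggregate(votes):
    """votes: {item: [one label per reviewer]} -> (majority label,
    confidence as % of votes, agreement as 'votes/reviewers')."""
    rows = {}
    for item, labels in votes.items():
        (top, n), = Counter(labels).most_common(1)   # top label and its vote count
        rows[item] = (top, round(100 * n / len(labels)), f"{n}/{len(labels)}")
    return rows

print(aggregate({"IMG_2042.jpg": ["clothing"] * 4 + ["home"]}))
# → {'IMG_2042.jpg': ('clothing', 80, '4/5')}
```

Items with low agreement are the ones worth routing to extra reviewers or adjudication.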

Sentiment Analysis

You have a CSV of customer reviews or support tickets and want them labeled by sentiment. Reviewers classify each one as positive, neutral, or negative — or whatever labels you define. Great for building training data or auditing ML model predictions at any scale.

How you'd run it
$ candor study create --goal "label sentiment" \
--items "reviews.csv" --task categorize \
--labels "positive,neutral,negative" --recruit --participants 5
What you get back
Corpus Summary (1,200 reviews)
Positive: 58.3% (700)
Neutral: 25.0% (300)
Negative: 16.7% (200)
Fleiss' κ: 0.81 (almost perfect) · 5 reviewers
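The corpus summary is just the distribution of final (majority) labels. A sketch of the arithmetic, using a 12-review stand-in for the 1,200-review corpus:

```python
from collections import Counter

# final majority label per review (illustrative stand-in data)
final_labels = ["positive"] * 7 + ["neutral"] * 3 + ["negative"] * 2

for label, n in Counter(final_labels).most_common():
    print(f"{label.capitalize()}: {100 * n / len(final_labels):.1f}% ({n})")
# Positive: 58.3% (7)
# Neutral: 25.0% (3)
# Negative: 16.7% (2)
```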

Audio Transcription

You have audio clips and want accurate text for each one. Reviewers listen and write out what they hear, giving you human transcriptions you can trust. Supports mp3, wav, ogg, flac, m4a, and other common formats.

How you'd run it
$ claude "transcribe these 12 audio clips with 3 reviewers each"
What you get back
clip_007.mp3 (3 transcriptions):
R1: "The quarterly results exceeded expectations by twelve percent"
R2: "The quarterly results exceeded expectations by 12 percent"
R3: "The quarterly results exceeded expectations by 12%"
12 clips · 36 transcriptions · Avg WER across reviewers: 0.03
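WER (word error rate) is the word-level edit distance between a transcription and a reference, divided by the reference length. A minimal sketch of the standard metric; note that real scoring pipelines normally normalize case, punctuation, and numerals first, since otherwise "twelve percent" vs. "12%" counts as an error:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference word count."""
    r, h = reference.split(), hypothesis.split()
    prev = list(range(len(h) + 1))                  # edit-distance DP, one row at a time
    for i, rw in enumerate(r, 1):
        cur = [i]
        for j, hw in enumerate(h, 1):
            cur.append(min(prev[j] + 1,                 # delete rw
                           cur[j - 1] + 1,              # insert hw
                           prev[j - 1] + (rw != hw)))   # substitute rw -> hw
        prev = cur
    return prev[-1] / len(r)

print(wer("exceeded expectations by twelve percent",
          "exceeded expectations by 12 percent"))   # → 0.2
```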

Image Description

You have images and want human-written descriptions of what's in them. Useful for alt-text generation, accessibility audits, or training vision models. Each reviewer describes what they see, and you get multiple perspectives per image.

How you'd run it
$ candor study create --goal "describe what you see" \
--items "screenshots/*.png" --task freetext \
--recruit --participants 5
What you get back
dashboard-home.png:
"A dark-themed analytics dashboard showing a line chart of
monthly revenue trending upward, with a sidebar navigation
listing Reports, Settings, and Team sections."
20 images · 100 descriptions · Avg length: 42 words