Evaluation & Ranking
Run pairwise comparisons, numeric ratings, and A/B tests on any type of content. Candor produces ranked leaderboards with win rates, agreement scores, and statistical confidence — whether you're comparing audio samples, images, videos, or text.
Pairwise Audio Comparison
You have a few audio samples and want to know which one sounds best. Candor pairs them up, randomizes the order, and gives you a ranked leaderboard with win rates. It scales to hundreds of items, with inter-rater agreement reporting when you need it.
$ claude "compare these 4 TTS samples and rank them by naturalness"
What you get back
Rank  Sample              Win Rate  Comparisons
#1    eleven-v2.mp3       78.3%     23/30
#2    playht-v3.mp3       63.3%     19/30
#3    openai-tts.mp3      41.7%     12/30
#4    google-wavenet.mp3  16.7%     5/30
Agreement: 0.82 (substantial) · 10 reviewers
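The pairing step described above can be sketched as follows. This is a minimal illustration, not Candor's implementation: it assumes a full round-robin in which every reviewer sees every pair, and `build_schedule` is a hypothetical helper name. Under that assumption, 4 samples and 10 reviewers give each sample 30 comparisons, which lines up with the "/30" column in the leaderboard.

```python
import itertools
import random


def build_schedule(items, reviewers, seed=0):
    """Return (reviewer, left, right) tasks for a full round-robin.

    Hypothetical helper: assumes every reviewer judges every pair.
    """
    rng = random.Random(seed)
    tasks = []
    for reviewer in range(reviewers):
        pairs = list(itertools.combinations(items, 2))
        rng.shuffle(pairs)  # randomize the order in which pairs are presented
        for a, b in pairs:
            # randomize which sample is shown first within the pair
            left, right = (a, b) if rng.random() < 0.5 else (b, a)
            tasks.append((reviewer, left, right))
    return tasks


samples = ["eleven-v2.mp3", "playht-v3.mp3", "openai-tts.mp3", "google-wavenet.mp3"]
schedule = build_schedule(samples, reviewers=10)

# Count how many comparisons each sample takes part in.
per_item = {s: sum(s in (left, right) for _, left, right in schedule) for s in samples}
print(len(schedule))               # 6 pairs x 10 reviewers = 60 tasks
print(per_item["eleven-v2.mp3"])   # 3 pairs per reviewer x 10 reviewers = 30
```

With hundreds of items a full round-robin grows quadratically, so a real scheduler would presumably sample a subset of pairs; that part is outside this sketch.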
Image Quality Ranking
You have a handful of images — mockups, product photos, AI-generated art — and want to know which looks best. Reviewers pick winners in randomized pairs, producing a ranked leaderboard. Scales to large asset libraries with statistical confidence for design reviews and visual quality benchmarking.
$ claude "rank these UI mockups by visual quality with 10 reviewers"
What you get back
Rank  Mockup             Win Rate  Comparisons
#1    dashboard-v3.png   82.5%     33/40
#2    dashboard-v2.png   57.5%     23/40
#3    dashboard-v1.png   35.0%     14/40
#4    dashboard-old.png  25.0%     10/40
Agreement: 0.76 (substantial) · 10 reviewers
A/B Copy Testing
You have a few headline options and want to know which one resonates most. Run pairwise comparisons across reviewers — no files needed, just text labels. Get ranked results with agreement metrics so you can pick with confidence.
$ candor study create --goal "pick the best headline" \
    --items "Get started free,Start your trial,Try it now" \
    --task compare --recruit --participants 12
What you get back
Rank  Headline            Win Rate  Comparisons
#1    "Get started free"  70.8%     17/24
#2    "Start your trial"  54.2%     13/24
#3    "Try it now"        25.0%     6/24
Agreement: 0.71 (substantial) · 12 reviewers
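The win rates in this table are consistent with a simple wins-over-comparisons ratio, which the sketch below reproduces from the tallies shown. The `leaderboard` function is a hypothetical helper written for illustration; how Candor actually computes or breaks ties in its ranking is not stated on this page.

```python
def leaderboard(wins, comparisons):
    """Rank items by win rate, computed as wins / comparisons (in percent)."""
    rows = []
    for item, won in wins.items():
        total = comparisons[item]
        rate = round(100.0 * won / total, 1)
        rows.append((item, rate, f"{won}/{total}"))
    # Highest win rate first.
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


# Tallies copied from the headline table above.
wins = {"Get started free": 17, "Start your trial": 13, "Try it now": 6}
comparisons = {headline: 24 for headline in wins}

ranked = leaderboard(wins, comparisons)
for rank, (item, rate, fraction) in enumerate(ranked, start=1):
    print(f'#{rank} "{item}" {rate}% {fraction}')
```

The 24 comparisons per headline also follow from the study shape: with 3 headlines, each one appears in 2 of the 3 possible pairs, and 12 reviewers judging every pair gives 2 x 12 = 24.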
Rating on a Scale
You want a numeric score for a few designs or creative assets. Have reviewers rate each one on a scale, and get per-item means, standard deviations, and distributions. Useful when you need absolute scores rather than relative rankings.
$ candor study create --goal "rate design mockups 1-5" \
    --items "mockups/*.png" --task rate --recruit --participants 8
What you get back
Item               Mean  Std Dev  Distribution
hero-redesign.png  4.3   0.46     ▁▁▂▅█
hero-minimal.png   3.8   0.71     ▁▂▅█▃
hero-gradient.png  2.9   0.83     ▂▅█▃▁
hero-original.png  2.1   0.64     ▅█▃▁▁
8 reviewers · ICC: 0.79 (good)
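The per-item summary columns (mean, standard deviation, distribution) can be derived from raw ratings roughly as in the sketch below. The ratings are hypothetical and do not correspond to any row above, `summarize` is an illustrative helper, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the page does not say which variant Candor reports.

```python
from collections import Counter
from statistics import mean, stdev

BARS = "▁▂▃▄▅▆▇█"  # eight bar heights for the sparkline


def summarize(ratings, scale=(1, 5)):
    """Return (mean, sample std dev, sparkline) for one item's ratings."""
    low, high = scale
    counts = Counter(ratings)
    # One histogram bucket per point on the rating scale.
    hist = [counts.get(value, 0) for value in range(low, high + 1)]
    peak = max(hist)
    # Scale each bucket to a bar height relative to the tallest bucket.
    spark = "".join(BARS[(count * (len(BARS) - 1)) // peak] for count in hist)
    return round(mean(ratings), 1), round(stdev(ratings), 2), spark


ratings = [4, 4, 4, 4, 4, 5, 5, 5]  # hypothetical scores from 8 reviewers
summary = summarize(ratings)
print(summary)
```

Swapping `stdev` for `statistics.pstdev` gives the population standard deviation instead, which is slightly smaller for small reviewer counts.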
Video Quality Scoring
You have a few video clips and want to know how they stack up on quality. Reviewers watch and rate each one, producing aggregate scores with inter-rater reliability. Supports mp4, mov, avi, mkv, and wmv formats.
$ claude "score these product videos on quality, 5 reviewers"
What you get back
Item             Mean  Std Dev  Distribution
demo-a.mp4       4.6   0.55     ▁▁▁▃█
explainer.mp4    4.0   0.71     ▁▁▃█▅
testimonial.mp4  3.4   0.89     ▁▃█▅▂
demo-b.mp4       2.8   0.84     ▂▅█▃▁
5 reviewers · ICC: 0.74 (good)
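Since only mp4, mov, avi, mkv, and wmv are listed as supported, it can help to check a folder of clips before creating a study. The sketch below is a local pre-check, not part of Candor: `split_by_support` is a hypothetical helper, the file names other than `demo-a.mp4` are made up, and matching extensions case-insensitively is an assumption.

```python
from pathlib import Path

# Video extensions listed as supported in the section above.
SUPPORTED_VIDEO = {".mp4", ".mov", ".avi", ".mkv", ".wmv"}


def split_by_support(paths):
    """Split file paths into (supported, rejected) lists of file names."""
    supported, rejected = [], []
    for path in map(Path, paths):
        # Assumption: extensions are compared case-insensitively.
        if path.suffix.lower() in SUPPORTED_VIDEO:
            supported.append(path.name)
        else:
            rejected.append(path.name)
    return supported, rejected


ok, rejected = split_by_support(["demo-a.mp4", "explainer.MOV", "teaser.webm"])
print("supported:", ok)
print("needs conversion:", rejected)
```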