# 🏟️ Judge Arena – A/B Testing Evaluators
> **What you'll learn:** How to pit multiple LLM judges against each other, and against human labels, to find the most reliable and cost-effective evaluator for your use case.
The Judge Arena is FlowyML's evaluator benchmarking system. It runs every judge on the same data, computes Elo-style rankings, and tells you which judge best correlates with human labels.
## Why Judge Arena? 🤔
| Problem | Arena Solution |
|---------|----------------|
| "Which LLM judge should I use?" | Elo rankings by accuracy |
| "Is my custom judge reliable?" | Compare against human baselines |
| "GPT-4o is expensive for evals" | Cost analysis per evaluation |
| "Do different judges agree?" | Inter-judge correlation matrix |
## Quick Start 🚀
```python
from flowyml.evals import JudgeArena, EvalDataset, Relevance, Faithfulness, make_judge

# 1. Prepare evaluation data with human scores
data = EvalDataset.create_genai("golden_set", examples=[
    {"inputs": {"query": "What is ML?"}, "expected": "Machine learning is...",
     "context": ["ML is a subset of AI..."]},
    {"inputs": {"query": "Explain RAG"}, "expected": "RAG combines retrieval...",
     "context": ["Retrieval-Augmented Generation..."]},
])

human_labels = [0.9, 0.8]  # Human quality scores for each example

# 2. Create arena with judges to compare
arena = JudgeArena(
    judges=[
        Relevance(model="openai:/gpt-4o-mini"),
        Faithfulness(model="openai:/gpt-4o-mini"),
        make_judge("custom_quality", "Rate the quality and helpfulness of the response."),
    ],
)

# 3. Run the arena
result = arena.evaluate(data=data, human_labels=human_labels)
```
## Arena Results API 📊
The `JudgeArenaResult` object provides rich analysis:
```python
# Best performing judge
best = result.best_judge()
print(f"Winner: {best}")

# Full rankings (Elo-style)
for rank in result.rankings:
    print(f"  #{rank['rank']} {rank['judge']}: Elo={rank['elo']:.0f}")

# How well judges agree with humans
for judge, score in result.agreement_scores().items():
    print(f"  {judge}: {score:.2%} agreement")

# How well judges agree with each other
matrix = result.correlation_matrix()
print(matrix)

# Cost comparison
for judge, cost in result.cost_analysis().items():
    print(f"  {judge}: ${cost:.4f}/eval")
```
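If you want to sanity-check an agreement score by hand, the sketch below correlates one judge's per-example scores with the human labels. The Pearson correlation and the example numbers are assumptions for illustration; the exact statistic FlowyML uses isn't specified here.

```python
# Sanity-check sketch: correlate one judge's scores with human labels.
# Assumption: Pearson correlation as the agreement metric; the arena may
# use a different statistic internally.
from statistics import correlation  # Python 3.10+

human_labels = [0.9, 0.8, 0.3, 0.7]
judge_scores = [0.85, 0.75, 0.40, 0.65]  # hypothetical per-example judge scores

print(f"Pearson r = {correlation(human_labels, judge_scores):.2f}")
```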
### Available Methods
| Method | Returns | Description |
|--------|---------|-------------|
| `best_judge()` | `str` | Name of the top-performing judge |
| `rankings` | `list[dict]` | Elo-style rankings with scores |
| `correlation_matrix()` | `dict` | Inter-judge agreement matrix |
| `agreement_scores()` | `dict[str, float]` | Human-agreement score per judge |
| `cost_analysis()` | `dict[str, float]` | USD cost per evaluation per judge |
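For intuition about the Elo-style rankings, here is a toy sketch of how such ratings could be derived: on each example, every pair of judges plays a "match", and the judge whose score lands closer to the human label wins. This is an illustration under assumed per-example scores and a standard K-factor, not FlowyML's actual algorithm.

```python
# Toy Elo-style ranking sketch (illustrative only, not FlowyML's implementation).
# Assumption: each judge produced one score per example, and the judge whose
# score is closer to the human label "wins" that pairwise matchup.
from itertools import combinations

human = [0.9, 0.8]
judge_scores = {  # hypothetical per-example scores
    "relevance": [0.85, 0.75],
    "faithfulness": [0.60, 0.90],
    "custom_quality": [0.90, 0.80],
}

elo = {name: 1000.0 for name in judge_scores}
K = 32  # standard Elo update step

for i, h in enumerate(human):
    for a, b in combinations(judge_scores, 2):
        err_a = abs(judge_scores[a][i] - h)
        err_b = abs(judge_scores[b][i] - h)
        score_a = 1.0 if err_a < err_b else (0.0 if err_b < err_a else 0.5)
        expected_a = 1 / (1 + 10 ** ((elo[b] - elo[a]) / 400))
        elo[a] += K * (score_a - expected_a)
        elo[b] += K * ((1 - score_a) - (1 - expected_a))

for rank, (name, rating) in enumerate(sorted(elo.items(), key=lambda kv: -kv[1]), start=1):
    print(f"#{rank} {name}: Elo={rating:.0f}")
```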
## Real-World Examples 🌍
### Choosing Between LLM Providers
```python
arena = JudgeArena(
    judges=[
        Relevance(model="openai:/gpt-4o"),
        Relevance(model="openai:/gpt-4o-mini"),
        Relevance(model="anthropic:/claude-3-haiku"),
    ],
)
result = arena.evaluate(data=golden_set, human_labels=human_scores)

# List judges with >90% human agreement and compare their costs
for judge, agreement in result.agreement_scores().items():
    cost = result.cost_analysis()[judge]
    if agreement > 0.9:
        print(f"✅ {judge}: {agreement:.1%} agreement, ${cost:.4f}/eval")
```
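To go one step further and actually select the cheapest judge that clears the agreement bar, you can filter on the same result methods and take the minimum by cost; a small sketch:

```python
# Pick the cheapest judge that clears the 90% agreement bar
qualifying = {
    judge: result.cost_analysis()[judge]
    for judge, agreement in result.agreement_scores().items()
    if agreement > 0.9
}
if qualifying:
    cheapest = min(qualifying, key=qualifying.get)
    print(f"Cheapest reliable judge: {cheapest} (${qualifying[cheapest]:.4f}/eval)")
```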
### Validating a Custom Judge
```python
custom = make_judge(
    "domain_expert",
    "You are a medical AI evaluator. Score the clinical accuracy of the response.",
    model="openai:/gpt-4o",
)

arena = JudgeArena(
    judges=[custom, Relevance(), Faithfulness()],
)
result = arena.evaluate(data=medical_data, human_labels=doctor_scores)

# Check if custom judge outperforms generic ones
print(result.rankings)
```
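Because `rankings` is a list of dicts with `rank`, `judge`, and `elo` keys (see the Arena Results API above), you can look up the custom judge directly. The sketch below assumes the judge appears under the name passed to `make_judge`:

```python
# Locate the custom judge in the rankings (keys shown in the Arena Results API)
entry = next((r for r in result.rankings if r["judge"] == "domain_expert"), None)
if entry and entry["rank"] == 1:
    print(f"domain_expert wins with Elo {entry['elo']:.0f}")
elif entry:
    print(f"domain_expert ranked #{entry['rank']}; consider refining its instructions")
```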
## Best Practices 💡
**Always include human labels.** Without human labels, the arena can only measure inter-judge agreement, not actual accuracy. Even 50 human-labeled examples make a big difference.
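A minimal sketch of wiring labels in, assuming you keep them in a CSV with a `score` column aligned to the dataset order and reuse the `arena` and `data` from the Quick Start:

```python
# Load human labels from a CSV aligned with the dataset order
# (file name and "score" column are assumptions for this example)
import csv

with open("golden_set_labels.csv") as f:
    human_labels = [float(row["score"]) for row in csv.DictReader(f)]

result = arena.evaluate(data=data, human_labels=human_labels)
```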
**Mix general and specialized judges.** Include at least one general-purpose judge (such as `Relevance`) and one domain-specific judge to see if specialization helps.
**Cost awareness.** Running N judges on M examples costs N × M LLM calls. Start with a small golden set (50–100 examples).
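A quick back-of-the-envelope estimate before a run; the per-call price here is an assumed placeholder, not a FlowyML figure:

```python
# Rough cost estimate for an arena run (per-call price is an assumption)
n_judges = 3
n_examples = 100
est_cost_per_call = 0.002  # hypothetical average USD per judge call

total_calls = n_judges * n_examples
print(f"{total_calls} LLM calls, ~${total_calls * est_cost_per_call:.2f} estimated")
```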