# Model Leaderboard
## What you'll learn

How to compare models across experiments and automatically pick the winner. Don't track model performance in spreadsheets; let the framework do it.

FlowyML automatically tracks and ranks your models, so you always know which one performs best across experiments, teams, and time.
## Why Leaderboards Matter
| Without a Leaderboard | With Model Leaderboard |
| --- | --- |
| "Was run_42 better than run_38?" | Automated rankings by any metric |
| "What hyperparameters did we use?" | Full lineage: click any score to see the run |
| Picking models on gut feeling | Metric-driven, reproducible decisions |
| Spreadsheets and notebooks | Centralized, always up-to-date |
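Conceptually, a leaderboard is just a collection of (model, run, score) entries kept sorted by the target metric. A minimal framework-free sketch of that idea (the class and method names here are illustrative, not FlowyML internals):

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    model_name: str
    run_id: str
    score: float


@dataclass
class MiniLeaderboard:
    higher_is_better: bool = True
    entries: list = field(default_factory=list)

    def add_score(self, model_name, run_id, score):
        self.entries.append(Entry(model_name, run_id, score))

    def rankings(self):
        # Sort descending when higher is better, ascending otherwise
        return sorted(self.entries, key=lambda e: e.score,
                      reverse=self.higher_is_better)

    def best(self):
        return self.rankings()[0]


board = MiniLeaderboard()
board.add_score("resnet50", "run_123", 0.95)
board.add_score("efficientnet", "run_124", 0.97)
print(board.best().model_name)  # -> efficientnet
```

The real leaderboard adds persistence, metadata, and lineage on top, but the ranking logic reduces to this single sort.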
## Basic Usage
```python
from flowyml.tracking import ModelLeaderboard

# Initialize leaderboard for a specific metric
leaderboard = ModelLeaderboard(metric="accuracy", higher_is_better=True)

# Add scores from training runs
leaderboard.add_score(
    model_name="resnet50",
    run_id="run_123",
    score=0.95,
    metadata={"epochs": 10, "lr": 0.001},
)
leaderboard.add_score(
    model_name="efficientnet",
    run_id="run_124",
    score=0.97,
    metadata={"epochs": 20, "lr": 0.0005},
)
leaderboard.add_score(
    model_name="vit_base",
    run_id="run_125",
    score=0.93,
    metadata={"epochs": 15, "lr": 0.001},
)
```
## Querying the Leaderboard
```python
# Get the current champion
best = leaderboard.get_best_run()
print(f"Champion: {best.model_name} (Acc: {best.score:.4f}, Run: {best.run_id})")

# Get top N models
top_3 = leaderboard.get_top(n=3)
for rank, entry in enumerate(top_3, 1):
    print(f"  #{rank} {entry.model_name}: {entry.score:.4f}")

# Get full rankings
rankings = leaderboard.rankings
```
Automatically promote the best model to production:
```python
from flowyml.tracking import ModelLeaderboard

leaderboard = ModelLeaderboard(metric="accuracy")

# After training a new model
new_score = evaluate_model(new_model)

# Fetch the current champion *before* recording the new score;
# otherwise a winning run would compare against itself
best = leaderboard.get_best_run()
leaderboard.add_score("new_model", run_id="run_200", score=new_score)

if new_score > best.score:
    print(f"New champion! {new_score:.4f} > {best.score:.4f}")
    deploy_to_production(new_model)
else:
    print(f"Failed to beat baseline ({best.score:.4f})")
```
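The champion-vs-challenger check above generalizes to a small helper. A hedged sketch with an optional improvement margin (the helper name and the margin value are illustrative, not part of FlowyML):

```python
def should_promote(challenger_score, champion_score,
                   higher_is_better=True, min_delta=0.0):
    """Return True if the challenger beats the champion by at least min_delta."""
    if higher_is_better:
        return challenger_score >= champion_score + min_delta
    # For metrics like loss or latency, lower is better
    return challenger_score <= champion_score - min_delta


# A small margin avoids redeploying over noise-level improvements
print(should_promote(0.971, 0.970, min_delta=0.005))  # -> False
print(should_promote(0.980, 0.970, min_delta=0.005))  # -> True
```

Requiring a minimum delta is a deliberate design choice: evaluation scores fluctuate run to run, so promoting on any improvement, however tiny, causes deployment churn.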
## Run Comparison
Compare specific runs side-by-side:
```python
from flowyml.tracking import compare_runs

diff = compare_runs(["run_123", "run_124"])
print(diff)
# Shows metric differences, config changes, and delta analysis
```
### What `compare_runs` Shows

| Field | Description |
| --- | --- |
| Metric deltas | Difference in each tracked metric |
| Config diff | Hyperparameter changes between runs |
| Timing | Training duration comparison |
| Data version | Dataset versions used |
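The config diff in particular is easy to reason about: it is a key-by-key comparison of two hyperparameter dicts. A minimal sketch of that idea (not FlowyML's implementation):

```python
def config_diff(a: dict, b: dict) -> dict:
    """Return {key: (old, new)} for every key that differs between two configs."""
    keys = set(a) | set(b)
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


run_123 = {"epochs": 10, "lr": 0.001, "optimizer": "adam"}
run_124 = {"epochs": 20, "lr": 0.0005, "optimizer": "adam"}
print(config_diff(run_123, run_124))
# e.g. {'epochs': (10, 20), 'lr': (0.001, 0.0005)}  (key order may vary)
```

Keys present in only one run show up with `None` on the other side, which catches hyperparameters that were added or dropped between experiments.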
## Dashboard View
The FlowyML Dashboard provides an interactive leaderboard where you can:
- Sort by any metric: accuracy, F1, loss, latency, cost
- Filter by tags: model type, team, date range
- Click to inspect: see full run config, training curves, and artifacts
- Compare side-by-side: select runs and view metric deltas
## Real-World Example: Multi-Metric Leaderboard
```python
from flowyml.tracking import ModelLeaderboard

# Track multiple metrics
accuracy_board = ModelLeaderboard(metric="accuracy", higher_is_better=True)
latency_board = ModelLeaderboard(metric="latency_ms", higher_is_better=False)

# After each training run
accuracy_board.add_score("model_v3", run_id="run_300", score=0.96)
latency_board.add_score("model_v3", run_id="run_300", score=45.2)

# Find best accuracy
best_accuracy = accuracy_board.get_best_run()

# Find fastest model
fastest = latency_board.get_best_run()

# Find best trade-off (manual Pareto analysis)
print(f"Best accuracy: {best_accuracy.model_name} ({best_accuracy.score:.4f})")
print(f"Fastest: {fastest.model_name} ({fastest.score:.1f}ms)")
```
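The "manual Pareto analysis" above can be made concrete: a model is Pareto-optimal if no other model is at least as good on every metric and strictly better on at least one. A small framework-free sketch (the model data is illustrative):

```python
def pareto_front(models):
    """models: list of (name, accuracy, latency_ms).

    Higher accuracy and lower latency both count as better.
    """
    front = []
    for name, acc, lat in models:
        # Dominated: some other model is >= on accuracy, <= on latency,
        # and strictly better on at least one of the two
        dominated = any(
            (a >= acc and l <= lat) and (a > acc or l < lat)
            for _, a, l in models
        )
        if not dominated:
            front.append(name)
    return front


models = [
    ("model_v1", 0.93, 30.0),  # fastest
    ("model_v2", 0.95, 40.0),  # balanced
    ("model_v3", 0.96, 45.2),  # most accurate
    ("model_v4", 0.94, 50.0),  # dominated by model_v2
]
print(pareto_front(models))  # -> ['model_v1', 'model_v2', 'model_v3']
```

Everything on the front is a defensible deployment choice; which one you pick depends on your latency budget.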
## Best Practices
**One leaderboard per metric.** Create separate leaderboards for different metrics; what's "best" depends on the metric you're optimizing.

**Always record metadata.** Pass `metadata={"epochs": 10, "lr": 0.001}` when adding scores. This makes it easy to reproduce the winning configuration.

**Beware overfitting to the leaderboard.** If you keep comparing against the same test set, you risk overfitting to it. Rotate your golden set periodically.