# Evaluation CI/CD & Continuous Monitoring

**What you'll learn**
How to use EvalAssert for quality gates, EvalSchedule for continuous monitoring, EvalStep for pipeline-integrated evaluation, and TraceBridge for evaluating traced LLM interactions.
FlowyML provides a complete toolkit for integrating evaluations into your ML lifecycle – from blocking bad models in CI to detecting regressions in production overnight.
## Architecture Overview

```mermaid
graph LR
    A["CI/CD Pipeline"] --> B["EvalAssert<br/>Quality Gates"]
    C["Cron / Scheduler"] --> D["EvalSchedule<br/>Nightly Evals"]
    E["ML Pipeline"] --> F["EvalStep<br/>Inline Eval"]
    G["LLM Traces"] --> H["TraceBridge<br/>Trace → Eval"]
    B --> I["Pass / Block"]
    D --> J["Regression Alerts"]
    F --> K["Pipeline Gate"]
    H --> L["Eval Dataset"]
```
## EvalAssert – Quality Gates
Block bad models from reaching production with assertion-based gates in your test suite:
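A minimal sketch of how these gates might look in a pytest-style test. The assertion methods are the ones listed in the table below; the import path, the `EvalAssert(experiment=...)` constructor, and the experiment names are assumptions:

```python
# test_quality_gates.py -- hedged sketch; the exact FlowyML import path and
# constructor are assumptions, only the assertion methods come from this page.
from flowyml.eval import EvalAssert  # assumed import path


def test_candidate_model_quality_gates():
    # Assumed: EvalAssert wraps the results of an evaluation run/experiment.
    gate = EvalAssert(experiment="candidate-model-v2")

    # Fail CI if the average relevance score drops below 0.8
    gate.assert_min_score("relevance", 0.8)

    # Fail CI if average toxicity exceeds 0.1
    gate.assert_max_score("toxicity", 0.1)

    # Require at least 95% of examples to pass
    gate.assert_pass_rate(0.95)

    # Block any metric regression against the production baseline
    gate.assert_no_regression(baseline="prod-baseline")
```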
### Available Assertions

| Method | Description |
|---|---|
| `assert_min_score(name, threshold)` | Fail if the scorer's average is below `threshold` |
| `assert_max_score(name, threshold)` | Fail if the scorer's average exceeds `threshold` (e.g., toxicity) |
| `assert_pass_rate(rate)` | Fail if the overall pass rate is below `rate` |
| `assert_no_regression(baseline)` | Fail if any scorer dropped vs. the `baseline` experiment |
### CLI for CI Pipelines
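The same gates can be run from the command line. Only the `flowyml eval assert` command itself is referenced elsewhere on this page (see Best Practices); the flags below are assumptions about how thresholds might be passed:

```bash
# Hypothetical invocation -- flag names are assumptions.
flowyml eval assert \
  --dataset golden-set \
  --min-score relevance=0.8 \
  --max-score toxicity=0.1 \
  --pass-rate 0.95
```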
### GitHub Actions Example
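A possible workflow wiring the CLI gate into a pull-request check. The job layout is standard GitHub Actions; the package name and CLI flags are assumptions:

```yaml
# .github/workflows/eval-gates.yml -- hypothetical example
name: eval-gates
on: [pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install flowyml   # assumed package name
      - name: Run evaluation quality gates
        run: flowyml eval assert --dataset golden-set --pass-rate 0.95  # flags assumed
```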
## EvalSchedule – Continuous Evaluation
Schedule evaluations to run automatically on a cron schedule with regression alerts:
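A minimal sketch of registering a nightly schedule. The parameters mirror the configuration table below; the import path and the scorer classes are assumptions:

```python
from flowyml.eval import EvalSchedule  # assumed import path
from flowyml.eval.scorers import RelevanceScorer, ToxicityScorer  # assumed scorer classes

# Run the golden set every night at 02:00 and alert when any scorer
# regresses against the production baseline experiment.
nightly = EvalSchedule(
    name="nightly-golden-set",
    dataset_name="golden-set",
    scorers=[RelevanceScorer(), ToxicityScorer()],
    cron="0 2 * * *",
    baseline_experiment="prod-baseline",
    alert_on_regression=True,
)
```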
### Configuration
| Parameter | Type | Description |
|---|---|---|
| `name` | `str` | Schedule identifier |
| `dataset_name` | `str` | Name of the `EvalDataset` to use |
| `scorers` | `list[Scorer]` | Scorers to run |
| `cron` | `str` | Cron expression for scheduling |
| `baseline_experiment` | `str` | Experiment to compare against |
| `alert_on_regression` | `bool` | Send alerts when scores drop |
## EvalStep – Pipeline Integration
Add quality gates directly into your ML pipelines:
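A hedged sketch of wiring an evaluation gate between training and deployment. The `@pipeline`/`@step` decorators, the `EvalStep` constructor arguments, and the behavior that a failing gate stops downstream steps are all assumptions about the pipeline API:

```python
from flowyml import pipeline, step                # assumed decorators
from flowyml.eval import EvalStep                 # assumed import path
from flowyml.eval.scorers import RelevanceScorer  # assumed scorer class


@step
def train_model():
    ...  # your existing training logic; returns a model reference


@step
def deploy_model(model):
    ...  # promotion / deployment logic


# Assumed: EvalStep behaves like any other step and fails the run when its
# checks do not pass, so deploy_model below never executes on a bad model.
quality_gate = EvalStep(
    dataset_name="golden-set",
    scorers=[RelevanceScorer()],
)


@pipeline
def train_eval_deploy():
    model = train_model()
    quality_gate(model)   # pipeline gate
    deploy_model(model)
```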
## TraceBridge – Evaluate LLM Traces
Convert traced LLM interactions into evaluation datasets automatically.
### Programmatic TraceBridge
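A minimal sketch, assuming TraceBridge selects traces with a filter and materializes them as an `EvalDataset`. The import path, filter fields, and method names are assumptions:

```python
from flowyml.eval import TraceBridge   # assumed import path

# Assumed API: sample recent production traces that match a filter and turn
# them into an EvalDataset that scorers and schedules can consume.
bridge = TraceBridge(
    trace_filter={"model": "support-bot", "min_user_rating": 4},  # hypothetical filter
    sample_size=200,
)

dataset = bridge.to_eval_dataset(name="support-bot-traces")  # hypothetical method
print(f"Created eval dataset with {len(dataset)} examples")
```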
## Best Practices
**Start with EvalAssert in CI**
Add `flowyml eval assert` to your CI pipeline first – it's the easiest win. Block bad models before they reach production.

**Golden sets matter**
Create a curated golden set of 50–200 examples with human labels. This is the foundation for all evaluation gates.

**Watch schedule costs**
Nightly evaluations with GenAI scorers incur LLM API calls. Monitor your EvalSchedule costs with `flowyml eval cost-report`.