πŸ›‘οΈ Evaluation CI/CD & Continuous Monitoring

What you'll learn

How to use EvalAssert for quality gates, EvalSchedule for continuous monitoring, EvalStep for pipeline-integrated evaluation, and TraceBridge for evaluating traced LLM interactions.

FlowyML provides a complete toolkit for integrating evaluations into your ML lifecycle — from blocking bad models in CI to detecting regressions in production overnight.


Architecture Overview

graph LR
    A["CI/CD Pipeline"] --> B["EvalAssert<br/>Quality Gates"]
    C["Cron / Scheduler"] --> D["EvalSchedule<br/>Nightly Evals"]
    E["ML Pipeline"] --> F["EvalStep<br/>Inline Eval"]
    G["LLM Traces"] --> H["TraceBridge<br/>Trace β†’ Eval"]
    B --> I["βœ… Pass / ❌ Block"]
    D --> J["πŸ“Š Regression Alerts"]
    F --> K["🚦 Pipeline Gate"]
    H --> L["πŸ“‹ Eval Dataset"]

EvalAssert — Quality Gates

Block bad models from reaching production with assertion-based gates in your test suite:

from flowyml.evals import EvalAssert, evaluate, Accuracy, F1Score

def test_model_quality():
    """Run as a pytest test — blocks deployment if assertions fail."""
    # golden_set is your curated, human-labeled evaluation dataset
    result = evaluate(data=golden_set, scorers=[Accuracy(), F1Score()])

    assertion = EvalAssert(result=result)
    assertion.assert_min_score("accuracy", 0.9)    # Average must be at least 0.9
    assertion.assert_min_score("f1_score", 0.85)   # Average must be at least 0.85
    assertion.assert_pass_rate(0.95)               # 95% of samples must pass
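
Because this is an ordinary pytest test, the CI wiring is just a test run — any failed assertion fails the job (the file path here is illustrative):

pytest tests/test_model_quality.py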

Available Assertions

Method                            | Description
assert_min_score(name, threshold) | Fail if the scorer's average is below threshold
assert_max_score(name, threshold) | Fail if the scorer's average exceeds threshold (e.g., toxicity)
assert_pass_rate(rate)            | Fail if the overall pass rate is below rate
assert_no_regression(baseline)    | Fail if any scorer dropped vs. the baseline
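
assert_no_regression needs a reference point. A minimal sketch, assuming baseline accepts a previously computed EvalResult (whether it also accepts an experiment name, as EvalSchedule's baseline_experiment does, is not shown here — check the API):

from flowyml.evals import EvalAssert, evaluate, Accuracy

# golden_set_v1 / golden_set_v2 are hypothetical: the same golden set with
# the production model's and the candidate model's outputs attached.
baseline_result = evaluate(data=golden_set_v1, scorers=[Accuracy()])
candidate_result = evaluate(data=golden_set_v2, scorers=[Accuracy()])

EvalAssert(result=candidate_result).assert_no_regression(baseline_result)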

CLI for CI Pipelines

# Run evaluation and print results
flowyml eval run --data golden_set.csv --scorers accuracy,f1_score

# Assert thresholds (exits with code 1 on failure)
flowyml eval assert -d golden_set.csv -s accuracy --min-score accuracy 0.9

# Compare two experiment runs
flowyml eval compare --baseline v1 --current v2

GitHub Actions Example

- name: Evaluate Model Quality
  run: |
    flowyml eval assert \
      -d tests/golden_set.csv \
      -s accuracy,f1_score \
      --min-score accuracy 0.9 \
      --min-score f1_score 0.85

EvalSchedule — Continuous Evaluation

Run evaluations automatically on a cron schedule, with alerts when scores regress:

from flowyml.evals import EvalSchedule, Relevance, Faithfulness

schedule = EvalSchedule(
    name="nightly_rag_eval",
    dataset_name="production_golden_set",
    scorers=[Relevance(), Faithfulness()],
    cron="0 2 * * *",              # Daily at 2am
    baseline_experiment="rag_v2",   # Compare against this baseline
    alert_on_regression=True,       # Send alerts if scores drop
)
schedule.start()

Configuration

Parameter           | Type         | Description
name                | str          | Schedule identifier
dataset_name        | str          | Name of the EvalDataset to use
scorers             | list[Scorer] | Scorers to run
cron                | str          | Cron expression for scheduling
baseline_experiment | str          | Experiment to compare against
alert_on_regression | bool         | Send alerts when scores drop
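
The same constructor covers lower-frequency checks. For example, a weekly faithfulness run (3am every Monday), using only the parameters documented above:

from flowyml.evals import EvalSchedule, Faithfulness

weekly = EvalSchedule(
    name="weekly_faithfulness_eval",
    dataset_name="production_golden_set",
    scorers=[Faithfulness()],
    cron="0 3 * * 1",               # Mondays at 3am
    baseline_experiment="rag_v2",
    alert_on_regression=True,
)
weekly.start()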

EvalStep — Pipeline Integration

Add quality gates directly into your ML pipelines:

from flowyml import Pipeline
from flowyml.evals import EvalStep, Accuracy, F1Score

pipeline = Pipeline("training_with_eval")
pipeline.add_step(train_step)  # your existing training step

# Evaluation gate — fails the pipeline if quality drops
eval_step = EvalStep(
    name="quality_gate",
    scorers=[Accuracy(threshold=0.9), F1Score(threshold=0.85)],
    fail_on_regression=True,
    baseline_experiment="model_v1",
)
pipeline.add_step(eval_step)

pipeline.add_step(deploy_step)  # Only runs if eval_step passes

TraceBridge — Evaluate LLM Traces

Convert traced LLM interactions into evaluation datasets automatically:

from flowyml.evals import evaluate_traces, Relevance, Toxicity

# Evaluate specific traces
results = evaluate_traces(
    trace_ids=["trace-001", "trace-002", "trace-003"],
    scorers=[Relevance(), Toxicity()],
    experiment="trace_quality_audit",
)

# Or evaluate all traces from a project
results = evaluate_traces(
    project="chatbot",
    scorers=[Relevance(), Toxicity()],
    limit=100,
)
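
These results can feed the same gates as above — a sketch assuming evaluate_traces returns the same result object EvalAssert consumes:

from flowyml.evals import EvalAssert

# Gate a trace audit: fail if average toxicity exceeds 0.1
EvalAssert(result=results).assert_max_score("toxicity", 0.1)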

Programmatic TraceBridge

from flowyml.evals import TraceBridge

bridge = TraceBridge()

# Convert traces to EvalDataset
eval_data = bridge.traces_to_dataset(
    trace_ids=["trace-001", "trace-002"],
    dataset_name="chatbot_golden_set",
)

# Now evaluate like any other dataset
result = evaluate(data=eval_data, scorers=[Relevance()])
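
Assuming traces_to_dataset registers the converted dataset under dataset_name (as the parameter suggests), it can also feed the scheduler above:

from flowyml.evals import EvalSchedule

schedule = EvalSchedule(
    name="nightly_chatbot_eval",
    dataset_name="chatbot_golden_set",  # created by traces_to_dataset above
    scorers=[Relevance()],
    cron="0 2 * * *",
)
schedule.start()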

Best Practices

Start with EvalAssert in CI

Add flowyml eval assert to your CI pipeline first — it's the easiest win, blocking bad models before they reach production.

Golden sets matter

Create a curated golden set of 50-200 examples with human labels. This is the foundation for all evaluation gates.
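
A minimal loading sketch — the column names and the list-of-dicts shape passed to evaluate are assumptions (the CLI's --data golden_set.csv suggests CSV is accepted; verify the expected schema for your FlowyML version):

import csv
from flowyml.evals import evaluate, Accuracy

# Hypothetical layout: one human-labeled example per row,
# with "input" and "expected" columns.
with open("tests/golden_set.csv", newline="") as f:
    golden_set = list(csv.DictReader(f))

result = evaluate(data=golden_set, scorers=[Accuracy()])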

Schedule costs

Nightly evaluations with GenAI scorers incur LLM API costs. Monitor your EvalSchedule spend with flowyml eval cost-report.