🚨 Monitoring & Alerts
What you'll learn
How to monitor system health and get instant alerts for pipeline failures, resource issues, and drift. Silent failures are the worst kind: alerts turn invisible problems into actionable notifications.
Know when pipelines fail before your users do with real-time system and pipeline monitoring.
Why Monitoring Matters
| Without Monitoring | With Monitoring |
|---|---|
| Silent failures go unnoticed for hours | Instant Slack notification on failure |
| Resource waste (e.g., CPU pinned at 100% with nobody aware) | Real-time CPU/memory visibility |
| "When did this start failing?" | Historical success rates and patterns |
🖥️ System Monitor
The SystemMonitor tracks CPU, memory, and disk usage to detect resource issues.
What It Checks
| Metric | Threshold | Action |
|---|---|---|
| CPU usage | > 90% sustained | Warning alert |
| Memory usage | > 85% | Warning alert |
| Disk space | < 10% free | Critical alert |
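The thresholds in the table can be sketched as a plain function. This is not FlowyML's SystemMonitor implementation: `check_resources`, its parameters, and the three-sample window used to interpret "sustained" are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

# Thresholds copied from the table above.
CPU_WARN_PERCENT = 90.0
MEMORY_WARN_PERCENT = 85.0
DISK_CRITICAL_FREE_PERCENT = 10.0


def check_resources(
    cpu_samples: Sequence[float],
    memory_percent: float,
    disk_free_percent: float,
    sustained_samples: int = 3,  # assumption: "sustained" means the last 3 samples
) -> List[Tuple[str, str]]:
    """Return (level, message) pairs for every threshold that is breached."""
    alerts: List[Tuple[str, str]] = []

    # CPU only counts as breached when every recent sample is above the limit.
    recent = list(cpu_samples)[-sustained_samples:]
    if len(recent) == sustained_samples and all(s > CPU_WARN_PERCENT for s in recent):
        alerts.append(
            ("WARNING", f"CPU above {CPU_WARN_PERCENT:.0f}% for {sustained_samples} samples")
        )

    if memory_percent > MEMORY_WARN_PERCENT:
        alerts.append(("WARNING", f"Memory usage at {memory_percent:.0f}%"))

    if disk_free_percent < DISK_CRITICAL_FREE_PERCENT:
        alerts.append(("CRITICAL", f"Only {disk_free_percent:.0f}% disk space free"))

    return alerts
```

The function only applies the thresholds. In practice the readings might come from a library such as psutil or from `shutil.disk_usage` in the standard library.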
⚡ Pipeline Monitor
The PipelineMonitor tracks pipeline health: consecutive failures, success rates, and execution trends.
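As a rough illustration of what tracking consecutive failures and success rates involves, here is a minimal sketch. `RunHistory`, its window size, and its failure limit are hypothetical names and defaults, not the PipelineMonitor API.

```python
from collections import deque
from typing import Deque


class RunHistory:
    """Hypothetical tracker; not FlowyML's PipelineMonitor API."""

    def __init__(self, window: int = 20, max_consecutive_failures: int = 3) -> None:
        # Keep only the most recent `window` outcomes (True = success).
        self._runs: Deque[bool] = deque(maxlen=window)
        self.max_consecutive_failures = max_consecutive_failures
        self.consecutive_failures = 0

    def record(self, success: bool) -> None:
        """Store one run outcome and update the failure streak."""
        self._runs.append(success)
        self.consecutive_failures = 0 if success else self.consecutive_failures + 1

    @property
    def success_rate(self) -> float:
        """Fraction of successful runs in the window (1.0 when empty)."""
        return sum(self._runs) / len(self._runs) if self._runs else 1.0

    def is_unhealthy(self) -> bool:
        """True once the failure streak reaches the configured limit."""
        return self.consecutive_failures >= self.max_consecutive_failures


history = RunHistory(window=4, max_consecutive_failures=2)
for outcome in [True, True, False, False]:
    history.record(outcome)
print(history.success_rate, history.is_unhealthy())
```

A bounded window keeps the success rate focused on recent behavior, so an old run of failures does not mask a pipeline that has since recovered.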
🔔 AlertManager
FlowyML uses an AlertManager to dispatch alerts to configured handlers. By default, alerts go to the console, but you can add Slack, email, or custom handlers.
Sending Alerts
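The dispatch pattern described above (console by default, extra handlers added on top) might look roughly like the following. Every name here (`SimpleAlertManager`, `Alert`, `AlertLevel`, `add_handler`, `send`) is a stand-in chosen for this sketch. Check the FlowyML API reference for the real class and method signatures.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Callable, List


class AlertLevel(IntEnum):
    INFO = 10
    WARNING = 20
    ERROR = 30
    CRITICAL = 40


@dataclass(frozen=True)
class Alert:
    level: AlertLevel
    title: str
    message: str


Handler = Callable[[Alert], None]


def console_handler(alert: Alert) -> None:
    print(f"[{alert.level.name}] {alert.title}: {alert.message}")


class SimpleAlertManager:
    """Hypothetical dispatcher; the real AlertManager API may differ."""

    def __init__(self) -> None:
        # Alerts go to the console unless more handlers are added.
        self.handlers: List[Handler] = [console_handler]

    def add_handler(self, handler: Handler) -> None:
        self.handlers.append(handler)

    def send(self, alert: Alert) -> None:
        for handler in self.handlers:
            try:
                handler(alert)
            except Exception as exc:
                # One broken handler must not stop the others from running.
                print(f"handler {handler!r} failed: {exc}")


manager = SimpleAlertManager()
received: List[Alert] = []
manager.add_handler(received.append)  # stand-in for a Slack or email handler
manager.send(Alert(AlertLevel.ERROR, "Pipeline failed", "step 'train' raised ValueError"))
```

Catching exceptions per handler is a deliberate choice in this sketch: a Slack outage should not prevent the same alert from reaching the console or email.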
Alert Levels
| Level | Use Case | Default Behavior |
|---|---|---|
| INFO | Informational updates | Console only |
| WARNING | Non-critical issues | Console + configured handlers |
| ERROR | Step or pipeline failure | All handlers |
| CRITICAL | Production outage | All handlers + escalation |
Real-World Example: Production Alert Setup
Send critical alerts to Slack and warnings to email.
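One way to express this routing without relying on any FlowyML-specific API is sketched below. The `ROUTES` table, `channels_for`, `dispatch`, and both builder functions are hypothetical helpers. The only external facts assumed are that Slack incoming webhooks accept a JSON body with a `text` field and that the standard library's `EmailMessage` builds the mail.

```python
import json
import urllib.request
from email.message import EmailMessage
from typing import Callable, Dict, List

# Assumption: exact-level routing, mirroring "critical to Slack, warnings to email".
ROUTES: Dict[str, List[str]] = {"CRITICAL": ["slack"], "WARNING": ["email"]}


def channels_for(level: str) -> List[str]:
    """Console always receives the alert; extra channels depend on the level."""
    return ["console"] + ROUTES.get(level.upper(), [])


def build_slack_request(webhook_url: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for a Slack incoming webhook."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_email(sender: str, recipient: str, subject: str, text: str) -> EmailMessage:
    """Build (but do not send) a plain-text alert email."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(text)
    return msg


def dispatch(level: str, text: str, senders: Dict[str, Callable[[str], None]]) -> List[str]:
    """Call the sender for each routed channel and return the channels used."""
    used: List[str] = []
    for channel in channels_for(level):
        sender = senders.get(channel)
        if sender is not None:
            sender(text)
            used.append(channel)
    return used
```

Real senders would wrap `urllib.request.urlopen(build_slack_request(...))` and `smtplib.SMTP(host).send_message(build_email(...))`. Keeping them injectable means the routing can be tested without any network access.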
💻 CLI Monitoring
Check status and health from the command line.
Beta Feature
CLI monitoring commands are currently in beta and may change in future releases.
Best Practices
Use alert levels wisely
Reserve CRITICAL for production outages only. Over-alerting causes alert fatigue, and your team will start ignoring notifications.
Combine monitors
Use SystemMonitor for infrastructure health AND PipelineMonitor for business logic health. Both are needed for full coverage.
Don't alert on expected failures
Pipelines often fail legitimately during development. Exclude dev environments from alerting to reduce noise.
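A simple way to apply this is to gate alert dispatch on the deployment environment. The sketch below assumes the environment name is exposed through an `APP_ENV` variable. That variable name and the set of alerting environments are illustrative choices, not FlowyML settings.

```python
import os
from typing import Mapping, Optional

# Assumption: only these environments should notify people.
ALERTING_ENVIRONMENTS = {"staging", "production"}


def alerting_enabled(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only for environments that should send alerts."""
    source = os.environ if environ is None else environ
    # Default to "development" so an unset variable never triggers alerts.
    env = source.get("APP_ENV", "development")
    return env.strip().lower() in ALERTING_ENVIRONMENTS
```

Defaulting to the quiet environment means a missing variable on a laptop never triggers a notification, while production has to opt in explicitly.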