Checkpointing & Experiment Tracking 💾
flowyml ensures you never lose progress. Save pipeline state automatically and track every experiment detail.
What you'll learn
How to resume failed pipelines and track model performance over time. Long-running pipelines will fail eventually, and checkpointing turns a catastrophe into a minor annoyance.
Why Checkpointing Matters 💡
Without checkpointing:
- Lost time: A crash at hour 9 of a 10-hour job means restarting from hour 0
- Wasted compute: Re-computing expensive intermediate steps
- Frustration: "It worked on my machine, why did it fail now?"
With flowyml checkpointing:
- Resume instantly: Restart exactly where it failed
- Inspect state: Load the checkpoint to debug what went wrong
- Skip redundant work: Re-use successful steps
💾 Checkpointing
Checkpointing allows you to save the intermediate results of your pipeline steps. This is crucial for long-running pipelines, as it enables you to resume execution from the point of failure or to skip expensive steps that have already been computed.
Automatic Checkpointing (Default)
Checkpointing is enabled by default! FlowyML automatically saves pipeline state after each step, allowing you to resume from failures without any additional setup.
Tip
To disable checkpointing for a specific pipeline, use Pipeline("name", enable_checkpointing=False).
Tip
Keep checkpointing enabled for pipelines that take longer than 10 minutes. The storage cost is usually negligible compared to the compute time saved.
Manual Checkpointing
For finer control, you can use the PipelineCheckpoint object within your steps.
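As a library-independent sketch of the idea, a long training loop might persist its state after every epoch and pick up from the last saved one. The `FileCheckpoint` class and its `save`/`load` methods are hypothetical stand-ins written for this illustration; they are not flowyml's PipelineCheckpoint API, whose methods may differ.

```python
import json
import tempfile
from pathlib import Path


class FileCheckpoint:
    """Hypothetical stand-in for a checkpoint object: one JSON file per key."""

    def __init__(self, directory):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)

    def save(self, key, state):
        # Write to a temporary file first, then rename, so a crash mid-write
        # never leaves a half-written checkpoint behind.
        target = self.directory / f"{key}.json"
        tmp = target.with_suffix(".tmp")
        tmp.write_text(json.dumps(state))
        tmp.replace(target)

    def load(self, key, default=None):
        target = self.directory / f"{key}.json"
        if not target.exists():
            return default
        return json.loads(target.read_text())


def train(checkpoint, total_epochs, crash_at=None):
    """Toy training step that resumes from the last saved epoch."""
    state = checkpoint.load("train", default={"epoch": 0, "loss": 1.0})
    for epoch in range(state["epoch"], total_epochs):
        if crash_at is not None and epoch == crash_at:
            raise RuntimeError(f"simulated crash at epoch {epoch}")
        state = {"epoch": epoch + 1, "loss": state["loss"] * 0.5}
        checkpoint.save("train", state)
    return state


ckpt = FileCheckpoint(tempfile.mkdtemp())
try:
    train(ckpt, total_epochs=5, crash_at=3)
except RuntimeError:
    pass  # epochs 0 to 2 were saved before the simulated crash

resumed = train(ckpt, total_epochs=5)  # continues at epoch 3
```

The second call to `train` does not repeat the first three epochs, because it starts from the epoch recorded in the saved state.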
🧪 Experiment Tracking
flowyml automatically tracks every pipeline run when you use Metrics objects, capturing parameters, metrics, and artifacts. This allows you to compare experiments and reproduce results without any additional setup.
Automatic Experiment Tracking
Experiment tracking is enabled by default! Simply use Metrics objects in your pipeline, and flowyml will automatically:
- Extract all metrics from Metrics objects
- Capture context parameters (learning_rate, epochs, etc.)
- Log everything to the experiment tracking system
- Create an experiment named after your pipeline
Example:
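The behaviour in the list above can be illustrated with a minimal, library-independent sketch. `SketchMetrics`, `track_run`, and the in-memory `EXPERIMENTS` store are hypothetical names invented for this illustration; flowyml's own Metrics class and tracking backend may look different.

```python
from dataclasses import dataclass, field


@dataclass
class SketchMetrics:
    """Hypothetical stand-in for a metrics container returned by a step."""
    values: dict = field(default_factory=dict)


# Hypothetical in-memory tracking store: experiment name -> list of runs.
EXPERIMENTS = {}


def track_run(pipeline_name, context, step_outputs):
    """Collect metrics and context parameters into one logged run."""
    metrics = {}
    for output in step_outputs:
        # Only metrics containers are inspected; other outputs are ignored.
        if isinstance(output, SketchMetrics):
            metrics.update(output.values)
    run = {"params": dict(context), "metrics": metrics}
    # The experiment is named after the pipeline.
    EXPERIMENTS.setdefault(pipeline_name, []).append(run)
    return run


context = {"learning_rate": 0.01, "epochs": 10}
outputs = ["model.bin", SketchMetrics({"accuracy": 0.93, "loss": 0.21})]
run = track_run("training_pipeline", context, outputs)
```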
Manual Experiment Tracking
If you want more control, you can manually create and manage experiments:
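A rough sketch of what managing an experiment by hand involves, assuming runs are kept as JSON lines in a local file. `SketchExperiment` and its `log_run`/`runs` methods are hypothetical and only illustrate the pattern; they are not flowyml's experiment API.

```python
import json
import tempfile
import time
from pathlib import Path


class SketchExperiment:
    """Hypothetical experiment log: one JSON line per run in a local file."""

    def __init__(self, name, root):
        self.name = name
        self.path = Path(root) / f"{name}.jsonl"

    def log_run(self, run_id, params, metrics):
        record = {
            "run_id": run_id,
            "logged_at": time.time(),
            "params": params,
            "metrics": metrics,
        }
        # Append-only, so earlier runs are never overwritten.
        with self.path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(record) + "\n")
        return record

    def runs(self):
        if not self.path.exists():
            return []
        with self.path.open(encoding="utf-8") as fh:
            return [json.loads(line) for line in fh if line.strip()]


exp = SketchExperiment("lr_sweep", root=tempfile.mkdtemp())
for lr in (0.1, 0.01):
    # Stand-in for a real training run; the metric value is made up.
    exp.log_run(f"lr-{lr}", {"learning_rate": lr}, {"loss": round(lr * 2, 3)})

logged = exp.runs()
```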
Comparing Experiments
You can compare runs using the CLI or the Python API.
CLI:
Python:
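As a sketch of what such a comparison boils down to, the snippet below ranks runs by a metric and lists the parameters that differ between them. It uses plain Python over hypothetical run records (the `run_id`/`params`/`metrics` shape and the helper names are assumptions), not flowyml's comparison API.

```python
# Hypothetical run records, shaped as {"run_id", "params", "metrics"}.
runs = [
    {"run_id": "run-a", "params": {"learning_rate": 0.1, "epochs": 10},
     "metrics": {"accuracy": 0.88}},
    {"run_id": "run-b", "params": {"learning_rate": 0.01, "epochs": 10},
     "metrics": {"accuracy": 0.93}},
    {"run_id": "run-c", "params": {"learning_rate": 0.001, "epochs": 10},
     "metrics": {"accuracy": 0.90}},
]


def best_run(runs, metric, higher_is_better=True):
    """Return the run with the best value for the given metric."""
    pick = max if higher_is_better else min
    return pick(runs, key=lambda r: r["metrics"][metric])


def differing_params(runs):
    """Return only the parameters whose values differ between runs."""
    keys = set().union(*(r["params"] for r in runs))
    return sorted(
        k for k in keys
        if len({repr(r["params"].get(k)) for r in runs}) > 1
    )


best = best_run(runs, "accuracy")
varied = differing_params(runs)

# Print a small leaderboard, best accuracy first.
for r in sorted(runs, key=lambda r: r["metrics"]["accuracy"], reverse=True):
    print(r["run_id"], r["params"]["learning_rate"], r["metrics"]["accuracy"])
```

Listing only the parameters that vary keeps the comparison focused on what might explain the metric difference.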
Disabling Automatic Tracking ⏸️
If you want to disable automatic experiment tracking:
Visualizing Experiments
The flowyml UI provides a dedicated Experiments view where you can:
- View a table of all runs
- Filter by parameters or metrics
- Plot metric trends over time
- Compare side-by-side details of selected runs
Access it at http://localhost:8080/experiments when the UI is running.
Selective Re-Execution ⚡ NEW
Resume pipelines from a specific point, re-using cached results from previous steps.
Automatic Resume
When checkpointing is enabled (default), FlowyML automatically detects completed steps and skips them on retry:
The pipeline.rerun() API
Explicitly resume a pipeline from a previous checkpoint:
If no checkpoint exists for the given run_id, a clear ValueError is raised.
How It Works
- Each completed step's output is saved to the checkpoint store
- On resume, the orchestrator checks the checkpoint before each step
- Completed steps are skipped and their cached outputs are injected into context
- Execution continues from the first non-completed step
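The steps above can be sketched in a few lines of plain Python. This is a toy orchestrator under stated assumptions, not flowyml's implementation: `CheckpointStore`, `execute`, and the three step functions are hypothetical, and the store lives in memory rather than on disk. Asking to resume an unknown run_id raises ValueError, mirroring the behaviour described for pipeline.rerun().

```python
class CheckpointStore:
    """Hypothetical in-memory store: run_id -> {step name: output}."""

    def __init__(self):
        self._runs = {}

    def save(self, run_id, step_name, output):
        self._runs.setdefault(run_id, {})[step_name] = output

    def completed(self, run_id):
        if run_id not in self._runs:
            raise ValueError(f"no checkpoint found for run_id={run_id!r}")
        return self._runs[run_id]


def execute(steps, store, run_id, resume=False):
    """Run steps in order, skipping any that the checkpoint already covers."""
    cached = store.completed(run_id) if resume else {}
    context, executed = {}, []
    for name, fn in steps:
        if name in cached:
            # Completed step: inject its cached output instead of re-running.
            context[name] = cached[name]
            continue
        context[name] = fn(context)
        store.save(run_id, name, context[name])
        executed.append(name)
    return context, executed


attempts = {"train": 0}


def load(ctx):
    return [1, 2, 3, 4]


def train(ctx):
    attempts["train"] += 1
    if attempts["train"] == 1:
        raise RuntimeError("simulated failure on the first attempt")
    return sum(ctx["load"]) / len(ctx["load"])


def report(ctx):
    return f"mean={ctx['train']}"


steps = [("load", load), ("train", train), ("report", report)]
store = CheckpointStore()

try:
    execute(steps, store, run_id="run-1")
except RuntimeError:
    pass  # "load" was checkpointed before "train" failed

context, executed = execute(steps, store, run_id="run-1", resume=True)
```

On the resumed run only "train" and "report" execute; the output of "load" comes from the store.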
Pipeline Snapshots for Reproducibility
Each run automatically captures an immutable snapshot of the pipeline definition:
The snapshot hash is also stored in PipelineResult.snapshot_hash for each run.
Tip
Use PipelineSnapshot.verify() to confirm a pipeline hasn't been modified since a previous run, which is essential for auditing and compliance.
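One common way to build this kind of check, shown here as an assumption rather than as flowyml's actual snapshot format, is to hash a canonical serialization of the pipeline definition and compare it with the hash recorded for an earlier run. The `snapshot_hash` and `verify` helpers and the shape of `definition` are hypothetical.

```python
import hashlib
import json


def snapshot_hash(definition):
    """Hash a pipeline definition deterministically (SHA-256 of canonical JSON)."""
    canonical = json.dumps(definition, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify(definition, expected_hash):
    """Return True if the definition still matches the recorded hash."""
    return snapshot_hash(definition) == expected_hash


# Hypothetical pipeline definition: step names plus their parameters.
definition = {
    "name": "training_pipeline",
    "steps": [
        {"name": "load", "params": {"path": "data.csv"}},
        {"name": "train", "params": {"learning_rate": 0.01}},
    ],
}

recorded = snapshot_hash(definition)

# Key order does not matter because the JSON is serialized with sorted keys.
reordered = {"steps": definition["steps"], "name": definition["name"]}

# Any change to a step parameter produces a different hash.
modified = json.loads(json.dumps(definition))
modified["steps"][1]["params"]["learning_rate"] = 0.1
```

Sorting keys before hashing means two definitions that differ only in dictionary ordering are treated as the same pipeline, while any change to a step or parameter is detected.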