Parallel Execution π
flowyml allows you to execute independent steps concurrently, slashing pipeline runtime by running tasks in parallel.
What you'll learn
How to run multiple steps at once to speed up execution. Most ML pipelines have independent branches β running them sequentially is a waste of time.
Why Parallelism Matters β‘
Without parallelism: - Slow execution: Steps run one after another (A β B β C) - Idle resources: CPU cores sit idle while one core works - Long feedback loops: Waiting hours for independent tasks
With flowyml parallelism: - Faster results: Run A, B, and C at the same time - Resource efficiency: Utilize all CPU cores - Scalability: Process 10x data in the same amount of time
Enabling Parallelism π§
You can enable parallel execution by using the ParallelExecutor. It automatically detects independent steps in your DAG and runs them concurrently.
Basic Usage
In-Step Parallelism
You can also parallelize work within a single step using standard Python libraries or flowyml's utilities.
Decision Guide: Execution Backends βοΈ
| Backend | Best For | Why |
|---|---|---|
process (Default) |
CPU-bound tasks | Bypasses Python's GIL. Good for data processing, training. |
thread |
I/O-bound tasks | Lightweight. Good for API calls, downloading files, DB queries. |
When to use process (CPU)
- Data transformation (pandas, numpy)
- Image processing
- Model training (sklearn)
When to use thread (I/O)
- Fetching data from APIs
- Uploading files to S3
- Querying databases
Real-World Pattern: Batch Processing π
Process a large dataset by splitting it into chunks and processing them in parallel.
Performance Tip
Set max_workers to cpu_count() - 1 to keep one core free for the system.