Assets & Artifacts π
In flowyml, data lineage and artifact tracking are first-class features. Every piece of data flowing through your pipeline is tracked, versioned, and queryable.
What you'll learn
How to work with typed assets (Datasets, Models, Metrics) and track complete data lineage. Reproducibility requires lineage β flowyml tracks not just what models you trained, but what data created them.
Why Assets Matter π‘οΈ
Without structured assets, teams face: - "Which data trained this model?" β Unknown, guesswork - "Can we reproduce this result?" β Maybe, if you kept notes - "Where did this artifact come from?" β Lost in the pipeline - "What changed between runs?" β Manual diffing, error-prone
With flowyml assets, you get: - Automatic lineage tracking: Every asset knows its parents - Version control for data: Not just code, but datasets and models - Audit trails: Full provenance from raw data to predictions - Reproducibility: Re-create any result on demand
For regulated industries
In finance, healthcare, and legal β asset lineage isn't optional. flowyml provides audit-ready traceability out of the box.
The Asset Hierarchy ποΈ
flowyml provides specialized classes for different ML artifact types:
- Asset: The base class for all versioned objects.
- Dataset: Represents data (DataFrames, file paths, tensors).
- Model: Represents trained ML models.
- UI Guide: Learn how to visualize assets in the FlowyML dashboard.
- FeatureSet: Represents engineered features.
- Report: Represents generated reports and documentation.
- Artifact: Generic artifact for configs, checkpoints, files, etc.
- Prompt: Represents LLM prompt templates with versioning and rendering.
- Checkpoint: Represents training checkpoints with resumability metadata.
Type-Based Routing & Infrastructure π
Assets are the primary mechanism for Type-Based Routing. When a step returns an Asset (like a Model), FlowyML inspects its type and routes it according to your Stack Configuration.
| Asset Type | Primary Storage | Default Registry |
|---|---|---|
Dataset |
Artifact Store (GCS/S3) | Feature Store (optional) |
Model |
Artifact Store (GCS/S3) | Model Registry (Vertex/SageMaker) |
Metrics |
Metadata Store (SQL) | External Trackers (MLflow/W&B) |
FeatureSet |
Artifact Store | Feature Store (optional) |
Prompt |
Artifact Store | Prompt Registry (version-tracked) |
Checkpoint |
Artifact Store | Checkpoint Directory |
Report |
Artifact Store | β |
Artifact |
Artifact Store | β |
The Benefit
You can switch from local JSON storage to a production-grade Vertex AI Model Registry without changing a single line of your model training code.
Creating Assets π¨
You can create assets explicitly using the .create() factory method. This automatically handles versioning, metadata generation, and lineage tracking.
Assets interface data field
The data field is not about passing the data to the asset, but rather about the asset's interface on which data to register. This can be a model (Keras) or dataset (Pandas) etc.
Datasets π
FlowyML automatically extracts statistics and metadata from various data formats!
Convenience methods for common formats:
Supported data types:
- Pandas DataFrames
- NumPy arrays
- Python dictionaries
- TensorFlow tf.data.Dataset
- Lists of dictionaries
Models π€
FlowyML automatically extracts model metadata from all major ML frameworks!
Convenience methods for specific frameworks:
Supported frameworks: - Keras/TensorFlow (full extraction: layers, optimizer, loss, metrics) - PyTorch (parameters, layers, device, dtype) - Scikit-learn (hyperparameters, feature importance, is_fitted) - XGBoost/LightGBM/CatBoost - Hugging Face Transformers
Metrics π
Prompts π€
First-class prompt asset for LLM and GenAI workflows. Supports text templates with {variable} substitution and chat-style message lists.
Checkpoints πΎ
Training checkpoint asset with epoch/step tracking and framework-agnostic persistence.
Reports π
Report assets for generated documentation, analysis summaries, or pipeline outputs.
Generic Artifacts π¦
For anything that doesn't fit other categories β configs, files, intermediate outputs.
Lineage Tracking π
flowyml automatically tracks the lineage of every asset.
- Parents: The assets that were used to create this asset.
- Children: The assets that were created using this asset.
- Producer: The pipeline step that generated this asset.
When you pass an asset from one step to another, flowyml records this relationship.
Visualize It
You can visualize this lineage graph in the flowyml UI.
Storage πΎ
Assets are stored in the Artifact Store. By default, this is the .flowyml/artifacts directory in your project.
flowyml supports pluggable storage backends (S3, GCS, Azure) via fsspec. Configuration is handled in flowyml.yaml.
Automatic Materialization π¦
When running a pipeline with a Stack that has an Artifact Store configured, flowyml automatically materializes step outputs.
The artifacts are stored in a structured path:
{project_name}/{date}/{run_id}/data/{step_name}/{artifact_name}
This ensures that every run is reproducible and all intermediate data is persisted. flowyml uses Materializers to handle serialization for different data types (Pandas, NumPy, Keras, PyTorch, etc.).