🗃️ Artifact Catalog
What you'll learn
How to discover, version, tag, and trace lineage across all your ML artifacts — models, datasets, features, and more — using FlowyML's centralized catalog.
The Artifact Catalog provides centralized artifact management with search, tagging, content-hash deduplication, and full lineage tracking. It works identically in local development (SQLite) and production (remote API).
Why Artifact Catalog? 🤔
| Without Catalog | With Artifact Catalog |
|---|---|
| Artifacts scattered across filesystems | Centralized discovery and search |
| No version history | Full tagging and versioning |
| "Where did this model come from?" | Complete lineage tracking |
| Duplicate artifacts wasting storage | Content-hash deduplication |
Quick Start 🚀
Backend Selection ⚙️
The catalog auto-selects its backend based on your environment:
| Condition | Backend | Storage |
|---|---|---|
| Default (no config) | LocalCatalogBackend | SQLite (.flowyml/catalog.db) |
| Stack has catalog_endpoint | RemoteCatalogBackend | FlowyML API server |
| FLOWYML_CATALOG_ENDPOINT env var | RemoteCatalogBackend | FlowyML API server |
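The selection rules in the table can be sketched as a small standalone function. This is an illustration only, not FlowyML's actual implementation: `choose_backend` is a hypothetical helper, and checking the stack setting before the environment variable is an assumption, since the table does not state which of the two wins when both are set.

```python
import os
from typing import Mapping, Optional, Tuple

ENV_VAR = "FLOWYML_CATALOG_ENDPOINT"
LOCAL_DB_PATH = ".flowyml/catalog.db"


def choose_backend(
    stack_config: Optional[Mapping[str, str]] = None,
    environ: Optional[Mapping[str, str]] = None,
) -> Tuple[str, str]:
    """Return (backend_name, storage_location) following the table above.

    Hypothetical helper: the real catalog may resolve its backend differently.
    """
    stack_config = stack_config or {}
    environ = os.environ if environ is None else environ

    # Assumption: the stack's catalog_endpoint is checked before the env var.
    endpoint = stack_config.get("catalog_endpoint") or environ.get(ENV_VAR)
    if endpoint:
        return ("RemoteCatalogBackend", endpoint)

    # No endpoint configured anywhere: fall back to the local SQLite file.
    return ("LocalCatalogBackend", LOCAL_DB_PATH)


print(choose_backend(environ={}))
print(choose_backend(environ={ENV_VAR: "https://catalog.example.com"}))
```

Passing `environ` explicitly keeps the function easy to test without touching the real process environment.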
Local Development — Zero Config
Remote / Production
Explicit Backend
Content-Hash Deduplication 🔐
Pass data to register() — the catalog computes a SHA-256 hash and warns if a duplicate exists:
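To make the mechanism concrete, here is a minimal standalone sketch of hash-based duplicate detection using Python's `hashlib`. It is not the catalog's own code: `DedupIndex` and `content_hash` are hypothetical names, and the sketch accepts only `bytes` because this page does not say how the catalog serializes other data types before hashing.

```python
import hashlib
import warnings
from typing import Dict, Optional


def content_hash(data: bytes) -> str:
    """Return the SHA-256 hex digest of the raw bytes."""
    return hashlib.sha256(data).hexdigest()


class DedupIndex:
    """Hypothetical in-memory index mapping content hashes to artifact IDs."""

    def __init__(self) -> None:
        self._by_hash: Dict[str, str] = {}

    def check(self, artifact_id: str, data: bytes) -> Optional[str]:
        """Record the artifact; return the ID of an earlier duplicate, if any."""
        digest = content_hash(data)
        existing = self._by_hash.get(digest)
        if existing is not None:
            # Warn rather than reject, mirroring the "warns" behavior described above.
            warnings.warn(
                f"{artifact_id} has the same content as {existing} "
                f"(sha256 {digest[:12]}...)"
            )
            return existing
        self._by_hash[digest] = artifact_id
        return None


index = DedupIndex()
payload = b"user_id,amount\n1,9.99\n"
print(index.check("artifact-1", payload))  # first time this content is seen
print(index.check("artifact-2", payload))  # same bytes, so a warning is emitted
```

Because the hash covers the bytes themselves, two artifacts with different names but identical content are flagged, while a one-byte change produces a different digest.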
Lineage Tracking 🔗
Register parent → child relationships to build a full artifact graph:
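As an illustration of what recording `parent_ids` enables, the following sketch keeps parent and child edges in memory and answers lineage queries. `LineageGraph` is a hypothetical stand-in, not the catalog's implementation, and the `{"parents": ..., "children": ...}` shape of the returned dict is an assumption. The artifact names are the ones used in this page's lineage diagram.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


class LineageGraph:
    """Hypothetical in-memory store of parent/child edges between artifacts."""

    def __init__(self) -> None:
        self._parents: Dict[str, List[str]] = defaultdict(list)
        self._children: Dict[str, List[str]] = defaultdict(list)

    def add(self, artifact_id: str, parent_ids: Iterable[str] = ()) -> None:
        """Record one edge per parent, in both directions."""
        for parent_id in parent_ids:
            self._parents[artifact_id].append(parent_id)
            self._children[parent_id].append(artifact_id)

    def get_lineage(self, artifact_id: str) -> Dict[str, List[str]]:
        """Return the direct parents and children (assumed dict shape)."""
        return {
            "parents": list(self._parents.get(artifact_id, [])),
            "children": list(self._children.get(artifact_id, [])),
        }

    def ancestors(self, artifact_id: str) -> List[str]:
        """Walk parent edges upward to collect every upstream artifact."""
        seen: List[str] = []
        stack = list(self._parents.get(artifact_id, []))
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.append(current)
                stack.extend(self._parents.get(current, []))
        return seen


graph = LineageGraph()
graph.add("raw_transactions")
graph.add("fraud_features", parent_ids=["raw_transactions"])
graph.add("fraud_detector", parent_ids=["fraud_features"])

print(graph.get_lineage("fraud_features"))
print(graph.ancestors("fraud_detector"))
```

Storing edges in both directions makes "what was this built from?" and "what depends on this?" equally cheap to answer.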
Lineage Visualization
graph TD
A["📊 raw_transactions<br/>(Dataset)"] --> B["🧮 fraud_features<br/>(FeatureSet)"]
B --> C["🤖 fraud_detector<br/>(Model)"]
Catalog API Reference 📚
ArtifactCatalog
| Method | Returns | Description |
|---|---|---|
| register(name, artifact_type, ...) | str | Register artifact, returns ID |
| search(query) | list | Full-text search across artifacts |
| list(artifact_type, limit) | list | List artifacts with optional filters |
| get(artifact_id) | dict | Get artifact metadata by ID |
| tag(artifact_id, **tags) | None | Add/update tags on artifact |
| get_lineage(artifact_id) | dict | Get parent and child relationships |
| delete(artifact_id) | None | Remove artifact from catalog |
register() Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| name | str | ✅ | Artifact name |
| artifact_type | str | ✅ | Type: Model, Dataset, FeatureSet, etc. |
| data | Any | ❌ | Raw data for content-hash deduplication |
| source_step | str | ❌ | Step that produced this artifact |
| source_run_id | str | ❌ | Pipeline run ID |
| source_pipeline | str | ❌ | Pipeline name |
| tags | dict | ❌ | Key-value tags for filtering |
| parent_ids | list[str] | ❌ | Parent artifact IDs for lineage |
Real-World Example: Model Promotion 🌍
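One way a promotion flow could look is sketched below. It runs against a hypothetical `InMemoryCatalog` that borrows the `register`, `tag`, and `get` method names from the reference table above, so it executes without FlowyML installed. It is not the real `ArtifactCatalog`, and the `staging` and `archived` stage values as well as the `promote` helper are assumptions made for illustration.

```python
import uuid
from typing import Any, Dict, List, Optional


class InMemoryCatalog:
    """Hypothetical stand-in mirroring a few method names from the reference table."""

    def __init__(self) -> None:
        self._artifacts: Dict[str, Dict[str, Any]] = {}

    def register(
        self,
        name: str,
        artifact_type: str,
        tags: Optional[Dict[str, str]] = None,
        parent_ids: Optional[List[str]] = None,
    ) -> str:
        """Store the metadata and return a newly generated artifact ID."""
        artifact_id = uuid.uuid4().hex
        self._artifacts[artifact_id] = {
            "id": artifact_id,
            "name": name,
            "artifact_type": artifact_type,
            "tags": dict(tags or {}),
            "parent_ids": list(parent_ids or []),
        }
        return artifact_id

    def tag(self, artifact_id: str, **tags: str) -> None:
        """Add or overwrite tags on an existing artifact."""
        self._artifacts[artifact_id]["tags"].update(tags)

    def get(self, artifact_id: str) -> Dict[str, Any]:
        return self._artifacts[artifact_id]


def promote(catalog: InMemoryCatalog, candidate_id: str,
            current_id: Optional[str] = None) -> None:
    """Retire the current production model (if any), then promote the candidate."""
    if current_id is not None:
        catalog.tag(current_id, stage="archived")  # assumed stage name
    catalog.tag(candidate_id, stage="production")


catalog = InMemoryCatalog()
features_id = catalog.register("fraud_features", "FeatureSet")
old_id = catalog.register("fraud_detector", "Model",
                          tags={"stage": "production"}, parent_ids=[features_id])
new_id = catalog.register("fraud_detector", "Model",
                          tags={"stage": "staging"}, parent_ids=[features_id])

promote(catalog, new_id, current_id=old_id)
print(catalog.get(new_id)["tags"])  # {'stage': 'production'}
print(catalog.get(old_id)["tags"])  # {'stage': 'archived'}
```

Because promotion is just a tag update, the artifact ID and its lineage stay unchanged, so the production model can still be traced back to the feature set it was trained on.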
Best Practices 💡
Tag everything
Use tags like stage, team, experiment to make artifacts discoverable: catalog.search("stage:production team:fraud").
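The query string above suggests a `key:value` syntax. The sketch below shows one plausible reading of it, in which every pair must match (logical AND). The catalog's real query grammar is not documented on this page, so `parse_tag_query` and `filter_by_tags` are hypothetical helpers operating on plain dicts.

```python
from typing import Dict, Iterable, List


def parse_tag_query(query: str) -> Dict[str, str]:
    """Split 'stage:production team:fraud' into {'stage': 'production', 'team': 'fraud'}."""
    pairs: Dict[str, str] = {}
    for token in query.split():
        key, sep, value = token.partition(":")
        # Ignore tokens that are not well-formed key:value pairs.
        if sep and key and value:
            pairs[key] = value
    return pairs


def filter_by_tags(artifacts: Iterable[dict], query: str) -> List[dict]:
    """Keep artifacts whose tags match every pair in the query (assumed AND semantics)."""
    wanted = parse_tag_query(query)
    return [
        artifact
        for artifact in artifacts
        if all(artifact.get("tags", {}).get(k) == v for k, v in wanted.items())
    ]


artifacts = [
    {"name": "fraud_detector", "tags": {"stage": "production", "team": "fraud"}},
    {"name": "fraud_detector_v2", "tags": {"stage": "staging", "team": "fraud"}},
    {"name": "churn_model", "tags": {"stage": "production", "team": "growth"}},
]

matches = filter_by_tags(artifacts, "stage:production team:fraud")
print([a["name"] for a in matches])  # ['fraud_detector']
```

Consistent tag keys across teams matter more than the exact values: a query can only narrow results on keys that were actually set at registration time.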
Always record lineage
Pass parent_ids when registering to build a full artifact graph. This makes "where did this model come from?" trivially answerable.
Content hashing
Content-hash deduplication only works when you pass data to register(). Without it, the catalog has no content to hash, so duplicate artifacts are registered without any warning.