🗃️ Artifact Catalog

What you'll learn

How to discover, version, tag, and trace lineage across all your ML artifacts — models, datasets, features, and more — using FlowyML's centralized catalog.

The Artifact Catalog provides centralized artifact management with search, tagging, content-hash deduplication, and full lineage tracking. It works identically in local development (SQLite) and production (remote API).

Why Artifact Catalog? 🤔

| Without Catalog | With Artifact Catalog |
|---|---|
| Artifacts scattered across filesystems | Centralized discovery and search |
| No version history | Full tagging and versioning |
| "Where did this model come from?" | Complete lineage tracking |
| Duplicate artifacts wasting storage | Content-hash deduplication |

Quick Start 🚀

from flowyml import ArtifactCatalog

catalog = ArtifactCatalog()  # Auto-selects local or remote backend

# Register an artifact
artifact_id = catalog.register(
    name="fraud_detector_v2",
    artifact_type="Model",
    source_step="train_model",
    source_run_id="run-abc-123",
    source_pipeline="fraud_detection",
    tags={"stage": "staging", "team": "fraud"},
)

# Search and discover
models = catalog.search("fraud")
recent = catalog.list(artifact_type="Model", limit=10)

# Tag for promotion
catalog.tag(artifact_id, stage="production")

# Trace lineage
lineage = catalog.get_lineage(artifact_id)
print(lineage["parents"])   # What inputs produced this
print(lineage["children"])  # What downstream consumed this

Backend Selection ⚙️

The catalog auto-selects its backend based on your environment:

| Condition | Backend | Storage |
|---|---|---|
| Default (no config) | `LocalCatalogBackend` | SQLite (`.flowyml/catalog.db`) |
| Stack has `catalog_endpoint` | `RemoteCatalogBackend` | FlowyML API server |
| `FLOWYML_CATALOG_ENDPOINT` env var | `RemoteCatalogBackend` | FlowyML API server |
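
The selection rules in the table can be pictured with a small stand-alone sketch. This is not FlowyML's actual selection code: `choose_backend` is a hypothetical helper, the stack is modelled as a plain dict, and the precedence between a stack-level `catalog_endpoint` and the environment variable is an assumption, since the table does not state which wins.

```python
import os


def choose_backend(stack_config=None, env=None):
    """Return ("remote", endpoint) or ("local", db_path), following the table above.

    Hypothetical helper for illustration only, not part of the FlowyML API.
    """
    stack_config = stack_config or {}
    env = os.environ if env is None else env

    # Assumption: the stack setting is consulted before the environment variable.
    endpoint = stack_config.get("catalog_endpoint") or env.get("FLOWYML_CATALOG_ENDPOINT")
    if endpoint:
        return ("remote", endpoint)

    # No endpoint configured anywhere: fall back to the local SQLite file.
    return ("local", ".flowyml/catalog.db")


# No configuration at all selects the local backend.
print(choose_backend(env={}))

# An endpoint in the environment selects the remote backend.
print(choose_backend(env={"FLOWYML_CATALOG_ENDPOINT": "https://flowyml.example.com/api/v1/catalog"}))
```

Passing `env` explicitly keeps the sketch testable without touching the real process environment.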

Local Development — Zero Config

catalog = ArtifactCatalog()  # Uses local SQLite automatically

Remote / Production

export FLOWYML_CATALOG_ENDPOINT=https://flowyml.example.com/api/v1/catalog
export FLOWYML_CATALOG_API_KEY=your-api-key

Explicit Backend

from flowyml import ArtifactCatalog
from flowyml.storage.catalog import LocalCatalogBackend, RemoteCatalogBackend

# Force local
catalog = ArtifactCatalog(backend=LocalCatalogBackend("/path/to/db"))

# Force remote
catalog = ArtifactCatalog(backend=RemoteCatalogBackend("https://api.example.com"))

Content-Hash Deduplication 🔐

Pass `data` to `register()` and the catalog computes a SHA-256 hash of the content, warning you if an artifact with the same hash is already registered:

catalog.register(
    name="training_features",
    artifact_type="Dataset",
    data=my_dataframe,  # Content is hashed for dedup
)
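
To make the idea concrete, here is a minimal sketch of content-hash deduplication using only the standard library. How FlowyML serializes a DataFrame or model before hashing is not described here, so the serialization below (canonical JSON for non-bytes input) is an assumption, and `content_hash` and `register_with_dedup` are hypothetical names rather than catalog methods.

```python
import hashlib
import json


def content_hash(data):
    """Return the SHA-256 hex digest of bytes, str, or JSON-serializable data."""
    if isinstance(data, bytes):
        payload = data
    elif isinstance(data, str):
        payload = data.encode("utf-8")
    else:
        # Assumption: canonical JSON (sorted keys, fixed separators) so that
        # logically equal content always yields the same bytes.
        payload = json.dumps(data, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def register_with_dedup(index, name, data):
    """Record name under its content hash; return (digest, name of earlier duplicate or None)."""
    digest = content_hash(data)
    duplicate_of = index.get(digest)
    index.setdefault(digest, name)  # keep the first registration as the canonical one
    return digest, duplicate_of


index = {}  # content hash -> first artifact name registered with that content
digest, dup = register_with_dedup(index, "training_features", [{"user_id": 1, "amount": 9.5}])
digest2, dup2 = register_with_dedup(index, "training_features_copy", [{"amount": 9.5, "user_id": 1}])

if dup2 is not None:
    print(f"warning: same content as {dup2!r}")
```

The second registration has the same content with a different key order, so it hashes to the same digest and triggers the warning.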

Lineage Tracking 🔗

Register parent → child relationships to build a full artifact graph:

# 1. Register raw data
raw_id = catalog.register(name="raw_transactions", artifact_type="Dataset")

# 2. Register features (child of raw data)
feat_id = catalog.register(
    name="fraud_features",
    artifact_type="FeatureSet",
    parent_ids=[raw_id],
)

# 3. Register model (child of features)
model_id = catalog.register(
    name="fraud_detector",
    artifact_type="Model",
    parent_ids=[feat_id],
)

# 4. Query lineage
lineage = catalog.get_lineage(model_id)
print(lineage["parents"])   # → [feat_id]
print(lineage["children"])  # → []
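
The example above suggests that `get_lineage()` reports direct parents and children only. Under that assumption, and assuming the entries are plain artifact IDs, the full ancestry of an artifact can be collected with a breadth-first walk. The sketch below uses an in-memory dict as a stand-in for the catalog so it runs on its own; `all_ancestors` is a hypothetical helper, not a catalog method.

```python
from collections import deque


def all_ancestors(get_lineage, artifact_id):
    """Walk "parents" links breadth-first and return ancestor IDs, nearest first."""
    seen = []
    queue = deque(get_lineage(artifact_id)["parents"])
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # an artifact can be reachable through several paths
        seen.append(current)
        queue.extend(get_lineage(current)["parents"])
    return seen


# In-memory stand-in mirroring the raw -> features -> model chain above.
fake_lineage = {
    "raw": {"parents": [], "children": ["feat"]},
    "feat": {"parents": ["raw"], "children": ["model"]},
    "model": {"parents": ["feat"], "children": []},
}

print(all_ancestors(fake_lineage.__getitem__, "model"))  # ['feat', 'raw']
```

Against a real catalog the call would look like `all_ancestors(catalog.get_lineage, model_id)`, provided the returned structure matches the assumption above.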

Lineage Visualization

graph TD
    A["📊 raw_transactions<br/>(Dataset)"] --> B["🧮 fraud_features<br/>(FeatureSet)"]
    B --> C["🤖 fraud_detector<br/>(Model)"]
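
A diagram like this one can be generated from lineage data instead of being written by hand. The following is a rough sketch that renders nodes and parent-to-child edges as Mermaid `graph TD` text. The `to_mermaid` function and its input shapes are illustrative assumptions and are not provided by FlowyML.

```python
def to_mermaid(nodes, edges):
    """Render artifacts and parent -> child edges as Mermaid flowchart text.

    nodes: dict mapping a short node ID to a (name, artifact_type) tuple
    edges: iterable of (parent_node_id, child_node_id) pairs
    """
    lines = ["graph TD"]
    for node_id, (name, artifact_type) in nodes.items():
        lines.append(f'    {node_id}["{name}<br/>({artifact_type})"]')
    for parent, child in edges:
        lines.append(f"    {parent} --> {child}")
    return "\n".join(lines)


nodes = {
    "A": ("raw_transactions", "Dataset"),
    "B": ("fraud_features", "FeatureSet"),
    "C": ("fraud_detector", "Model"),
}
edges = [("A", "B"), ("B", "C")]

print(to_mermaid(nodes, edges))
```

Declaring nodes first and edges afterwards keeps the labels in one place, which makes the output easier to diff as the graph grows.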

Catalog API Reference 📚