Introduction

LLM data lineage is the practice of proving which exact dataset snapshot (and which transformations) produced a specific model artifact, together with the run metadata that makes the training reproducible. The W3C PROV model provides the standard conceptual vocabulary for provenance: entities, activities, and agents.

Why it matters: When incidents happen, you need evidence—not guesses—about what data and code produced the deployed model.

Core building blocks

Dataset manifest (the “snapshot contract”)

A manifest should lock:

  • Snapshot identifier + integrity (hashes / object versions)
  • Schema fingerprint
  • Preprocessing version (code commit + container digest)
  • Filters (PII/redaction, opt-out list version)
  • Sampling/splits (seed, strategy)
  • License and usage constraints (this is the field where dataset documentation practices align with the manifest)

Why it matters: Without a manifest, “same dataset” becomes an untestable statement.
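
Below is a minimal sketch of such a manifest as a Python dict; the field names, file paths, and hashing helper are illustrative assumptions rather than a standard schema.

```python
import hashlib
import json

def sha256_file(path: str) -> str:
    """Hash a file in chunks so large shards never need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

data_files = ["shards/part-0000.jsonl"]  # hypothetical shard paths

# Illustrative manifest structure; adapt the field names to your own pipeline.
manifest = {
    "snapshot_id": "corpus-2024-06-01",
    "files": [{"path": p, "sha256": sha256_file(p)} for p in data_files],
    "schema_fingerprint": "sha256:...",  # hash of the column/field spec (placeholder)
    "preprocessing": {"git_commit": "abc1234", "container_digest": "sha256:..."},
    "filters": {"pii_redaction": "v3", "opt_out_list": "2024-05-28"},
    "sampling": {"seed": 1234, "strategy": "stratified", "splits": {"train": 0.98, "val": 0.02}},
    "license": "CC-BY-4.0",
}

# The manifest itself gets a stable hash that every training run will reference.
manifest_hash = hashlib.sha256(
    json.dumps(manifest, sort_keys=True).encode("utf-8")
).hexdigest()
```

Hashing the canonicalized JSON (sorted keys) gives you a single identifier you can stamp onto runs, model cards, and lineage events.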

Run metadata (artifact–execution–event graph)

ML Metadata (MLMD) models lineage as Artifacts, Executions, and Events, which enables recursive upstream traversal from any artifact back to the inputs that produced it. Vertex ML Metadata describes the same graph view (artifacts and executions as nodes, events as edges).

Why it matters: Data versioning alone does not guarantee reproducibility—run context is required.
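
A condensed sketch of recording that graph with the ml-metadata Python package; the type names, URIs, and properties are assumptions for illustration, not a prescribed layout.

```python
from ml_metadata.metadata_store import metadata_store
from ml_metadata.proto import metadata_store_pb2

# Local SQLite-backed store; production setups typically point at a shared database.
config = metadata_store_pb2.ConnectionConfig()
config.sqlite.filename_uri = "mlmd.sqlite"
config.sqlite.connection_mode = 3  # READWRITE_OPENCREATE
store = metadata_store.MetadataStore(config)

# Register artifact and execution types (names are illustrative).
manifest_type_id = store.put_artifact_type(metadata_store_pb2.ArtifactType(name="DatasetManifest"))
model_type_id = store.put_artifact_type(metadata_store_pb2.ArtifactType(name="ModelCheckpoint"))
train_type_id = store.put_execution_type(metadata_store_pb2.ExecutionType(name="TrainingRun"))

# Artifacts and the execution that connects them.
manifest = metadata_store_pb2.Artifact(type_id=manifest_type_id, uri="s3://bucket/manifest.json")
model = metadata_store_pb2.Artifact(type_id=model_type_id, uri="s3://bucket/model/ckpt-100")
[manifest_id, model_id] = store.put_artifacts([manifest, model])
[run_id] = store.put_executions([metadata_store_pb2.Execution(type_id=train_type_id)])

# Events are the edges: manifest -> run -> model.
store.put_events([
    metadata_store_pb2.Event(artifact_id=manifest_id, execution_id=run_id,
                             type=metadata_store_pb2.Event.DECLARED_INPUT),
    metadata_store_pb2.Event(artifact_id=model_id, execution_id=run_id,
                             type=metadata_store_pb2.Event.DECLARED_OUTPUT),
])

# Upstream walk: from the model, through its producing execution, to its inputs.
producing = [e.execution_id for e in store.get_events_by_artifact_ids([model_id])
             if e.type == metadata_store_pb2.Event.DECLARED_OUTPUT]
inputs = [e.artifact_id for e in store.get_events_by_execution_ids(producing)
          if e.type == metadata_store_pb2.Event.DECLARED_INPUT]
```

The last two queries are the whole point: starting from any deployed checkpoint, you can walk back to the exact manifest artifact it was trained on.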

Execution event standard (optional): OpenLineage

OpenLineage is an open framework with an interoperable specification (JSON Schema/OpenAPI) for emitting lineage events from many systems to a common collector/UI.

Why it matters: Standards reduce lock-in and survive orchestration changes.
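
Because the spec is plain JSON Schema, an event can be emitted with nothing more than an HTTP POST; the collector URL, namespace, job name, and dataset names below are made-up examples (the official Python client wraps the same structure).

```python
import datetime
import uuid

import requests  # third-party; pip install requests

# Skeleton of an OpenLineage RunEvent; names and endpoint are illustrative.
event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    "producer": "https://example.com/my-training-pipeline",
    "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/definitions/RunEvent",  # version may differ
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "llm-training", "name": "pretrain.corpus_v3"},
    "inputs": [{"namespace": "s3://my-bucket", "name": "manifests/corpus-2024-06-01.json"}],
    "outputs": [{"namespace": "s3://my-bucket", "name": "models/ckpt-100"}],
}

# Any OpenLineage-compatible collector (e.g., Marquez) accepts this payload.
requests.post("http://localhost:5000/api/v1/lineage", json=event)
```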

Determinism and realistic reproducibility levels

  • PyTorch provides deterministic controls and warns that not all operations have deterministic alternatives.
  • TensorFlow offers op determinism and notes possible performance trade-offs.

Why it matters: Decide whether you target “re-runnable,” “metric-close,” or “bitwise deterministic”—and document it.
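
A sketch of the PyTorch side of that setup (the TensorFlow counterpart is tf.config.experimental.enable_op_determinism()); the seed value is arbitrary.

```python
import os
import random

import numpy as np
import torch

def make_deterministic(seed: int = 1234) -> None:
    """Best-effort deterministic setup for a PyTorch training run."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Required by cuBLAS for deterministic matmuls on recent CUDA versions.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    # Raise instead of silently falling back when an op has no deterministic kernel.
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.benchmark = False
```

Even with all of this, bitwise reproducibility across different hardware, driver, or framework versions is not guaranteed, which is exactly why the target level should be written down.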

Verification

  • Lineage integrity: every model artifact must link to the dataset manifest that produced it
  • Run completeness: commit, image digest, parameters, and seeds must all be present
  • Rebuild tests: periodically reconstruct a “repro bundle” (manifest + run config + outputs) and confirm it still re-runs (a minimal check for the first two items is sketched below)
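
A minimal sketch of the first two checks, assuming run metadata has already been exported to a plain dict; the field names are illustrative.

```python
REQUIRED_RUN_FIELDS = {"git_commit", "image_digest", "params", "seed", "manifest_hash"}

def verify_run_record(run: dict, expected_manifest_hash: str) -> list[str]:
    """Return a list of lineage violations; an empty list means the run passes."""
    problems = [f"missing field: {f}" for f in REQUIRED_RUN_FIELDS if not run.get(f)]
    # Lineage integrity: the run must point at the exact manifest snapshot.
    if run.get("manifest_hash") != expected_manifest_hash:
        problems.append("run does not reference the expected dataset manifest")
    return problems
```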

MLflow Tracking logs parameters, code versions, metrics, and artifacts—use it to anchor run metadata to the manifest via a shared run_id.
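
A sketch of that anchoring with MLflow Tracking; the tag names and the "manifest.json" artifact are conventions assumed here, not MLflow requirements.

```python
import mlflow

manifest_hash = "3f6c1..."  # computed from the canonicalized manifest, as sketched above

with mlflow.start_run() as run:
    # Anchor the run to the snapshot contract.
    mlflow.set_tag("dataset.manifest_hash", manifest_hash)
    mlflow.log_artifact("manifest.json")  # keep the full manifest next to the run
    # Run completeness: commit, image digest, parameters, seed.
    mlflow.set_tag("git_commit", "abc1234")
    mlflow.set_tag("image_digest", "sha256:...")
    mlflow.log_params({"seed": 1234, "lr": 3e-4, "epochs": 1})
    print("run_id:", run.info.run_id)  # share this id with the lineage store and the manifest
```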

Why it matters: Automation turns lineage from “nice diagrams” into operational proof.

Conclusion

  • Build LLM data lineage around an unbreakable chain: model artifact → run → dataset manifest.
  • Combine: DVC-style reproducible pipelines + MLflow run tracking + (MLMD/OpenLineage) lineage storage/standardization.

Summary

  • A dataset manifest is the reproducibility anchor.
  • MLMD/Vertex capture lineage as an artifact–execution–event graph.
  • OpenLineage standardizes event emission across the ecosystem.
  • Determinism is a tiered decision, not a default.

#llm #mlops #datalineage #reproducibility #openlineage #mlmd #mlflow #dvc #provenance #datagovernance
