Introduction

LLM data lineage is the practice of proving which exact dataset snapshot (and which transformations) produced a specific model artifact, together with the run metadata that makes the training reproducible. The W3C PROV model provides the standard conceptual vocabulary for provenance: entities, activities, and agents.

Why it matters: When incidents happen, you need evidence—not guesses—about what data and code produced the deployed model.

Core building blocks

Dataset manifest (the “snapshot contract”)

A manifest should lock:

  • Snapshot identifier + integrity (hashes / object versions)
  • Schema fingerprint
  • Preprocessing version (code commit + container digest)
  • Filters (PII/redaction, opt-out list version)
  • Sampling/splits (seed, strategy)
  • License and usage constraints (this is the field where dataset documentation practices align with the manifest)

Why it matters: Without a manifest, “same dataset” becomes an untestable statement.
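
Below is a minimal sketch of such a manifest as a Python dict; the field names, file paths, and hashing helper are illustrative assumptions rather than a standard schema.

```python
import hashlib
import json

def sha256_file(path: str) -> str:
    """Hash a file in chunks so large shards never need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

data_files = ["shards/part-0000.jsonl"]  # hypothetical shard paths

# Illustrative manifest structure; adapt the field names to your own pipeline.
manifest = {
    "snapshot_id": "corpus-2024-06-01",
    "files": [{"path": p, "sha256": sha256_file(p)} for p in data_files],
    "schema_fingerprint": "sha256:...",  # hash of the column/field spec (placeholder)
    "preprocessing": {"git_commit": "abc1234", "container_digest": "sha256:..."},
    "filters": {"pii_redaction": "v3", "opt_out_list": "2024-05-28"},
    "sampling": {"seed": 1234, "strategy": "stratified", "splits": {"train": 0.98, "val": 0.02}},
    "license": "CC-BY-4.0",
}

# The manifest itself gets a stable hash that every training run will reference.
manifest_hash = hashlib.sha256(
    json.dumps(manifest, sort_keys=True).encode("utf-8")
).hexdigest()
```

Hashing the canonicalized JSON (sorted keys) gives you a single identifier you can stamp onto runs, model cards, and lineage events.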

Run metadata (artifact–execution–event graph)

ML Metadata (MLMD) models lineage as Artifacts, Executions, and Events, which enables recursive upstream traversal from any artifact back to the inputs that produced it. Vertex ML Metadata describes the same graph view (artifacts and executions as nodes, events as edges).

Why it matters: Data versioning alone does not guarantee reproducibility—run context is required.
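
A condensed sketch of recording that graph with the ml-metadata Python package; the type names, URIs, and properties are assumptions for illustration, not a prescribed layout.

```python
from ml_metadata.metadata_store import metadata_store
from ml_metadata.proto import metadata_store_pb2

# Local SQLite-backed store; production setups typically point at a shared database.
config = metadata_store_pb2.ConnectionConfig()
config.sqlite.filename_uri = "mlmd.sqlite"
config.sqlite.connection_mode = 3  # READWRITE_OPENCREATE
store = metadata_store.MetadataStore(config)

# Register artifact and execution types (names are illustrative).
manifest_type_id = store.put_artifact_type(metadata_store_pb2.ArtifactType(name="DatasetManifest"))
model_type_id = store.put_artifact_type(metadata_store_pb2.ArtifactType(name="ModelCheckpoint"))
train_type_id = store.put_execution_type(metadata_store_pb2.ExecutionType(name="TrainingRun"))

# Artifacts and the execution that connects them.
manifest = metadata_store_pb2.Artifact(type_id=manifest_type_id, uri="s3://bucket/manifest.json")
model = metadata_store_pb2.Artifact(type_id=model_type_id, uri="s3://bucket/model/ckpt-100")
[manifest_id, model_id] = store.put_artifacts([manifest, model])
[run_id] = store.put_executions([metadata_store_pb2.Execution(type_id=train_type_id)])

# Events are the edges: manifest -> run -> model.
store.put_events([
    metadata_store_pb2.Event(artifact_id=manifest_id, execution_id=run_id,
                             type=metadata_store_pb2.Event.DECLARED_INPUT),
    metadata_store_pb2.Event(artifact_id=model_id, execution_id=run_id,
                             type=metadata_store_pb2.Event.DECLARED_OUTPUT),
])

# Upstream walk: from the model, through its producing execution, to its inputs.
producing = [e.execution_id for e in store.get_events_by_artifact_ids([model_id])
             if e.type == metadata_store_pb2.Event.DECLARED_OUTPUT]
inputs = [e.artifact_id for e in store.get_events_by_execution_ids(producing)
          if e.type == metadata_store_pb2.Event.DECLARED_INPUT]
```

The last two queries are the whole point: starting from any deployed checkpoint, you can walk back to the exact manifest artifact it was trained on.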

Execution event standard (optional): OpenLineage

OpenLineage is an open framework with an interoperable specification (JSON Schema/OpenAPI) for emitting lineage events from many systems to a common collector/UI.

Why it matters: Standards reduce lock-in and survive orchestration changes.
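
Because the spec is plain JSON Schema, an event can be emitted with nothing more than an HTTP POST; the collector URL, namespace, job name, and dataset names below are made-up examples (the official Python client wraps the same structure).

```python
import datetime
import uuid

import requests  # third-party; pip install requests

# Skeleton of an OpenLineage RunEvent; names and endpoint are illustrative.
event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    "producer": "https://example.com/my-training-pipeline",
    "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/definitions/RunEvent",  # version may differ
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "llm-training", "name": "pretrain.corpus_v3"},
    "inputs": [{"namespace": "s3://my-bucket", "name": "manifests/corpus-2024-06-01.json"}],
    "outputs": [{"namespace": "s3://my-bucket", "name": "models/ckpt-100"}],
}

# Any OpenLineage-compatible collector (e.g., Marquez) accepts this payload.
requests.post("http://localhost:5000/api/v1/lineage", json=event)
```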

Determinism and realistic reproducibility levels

  • PyTorch provides deterministic controls and warns that not all operations have deterministic alternatives.
  • TensorFlow offers op determinism and notes possible performance trade-offs.

Why it matters: Decide whether you target “re-runnable,” “metric-close,” or “bitwise deterministic”—and document it.
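
A sketch of the PyTorch side of that setup (the TensorFlow counterpart is tf.config.experimental.enable_op_determinism()); the seed value is arbitrary.

```python
import os
import random

import numpy as np
import torch

def make_deterministic(seed: int = 1234) -> None:
    """Best-effort deterministic setup for a PyTorch training run."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Required by cuBLAS for deterministic matmuls on recent CUDA versions.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    # Raise instead of silently falling back when an op has no deterministic kernel.
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.benchmark = False
```

Even with all of this, bitwise reproducibility across different hardware, driver, or framework versions is not guaranteed, which is exactly why the target level should be written down.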

Verification

  • Lineage integrity: every model artifact must link to the dataset manifest that produced it
  • Run completeness: commit, image digest, parameters, and seeds must all be present
  • Rebuild tests: periodically reconstruct a “repro bundle” (manifest + run config + outputs) and confirm it still re-runs (a minimal check for the first two items is sketched below)
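
A minimal sketch of the first two checks, assuming run metadata has already been exported to a plain dict; the field names are illustrative.

```python
REQUIRED_RUN_FIELDS = {"git_commit", "image_digest", "params", "seed", "manifest_hash"}

def verify_run_record(run: dict, expected_manifest_hash: str) -> list[str]:
    """Return a list of lineage violations; an empty list means the run passes."""
    problems = [f"missing field: {f}" for f in REQUIRED_RUN_FIELDS if not run.get(f)]
    # Lineage integrity: the run must point at the exact manifest snapshot.
    if run.get("manifest_hash") != expected_manifest_hash:
        problems.append("run does not reference the expected dataset manifest")
    return problems
```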

MLflow Tracking logs parameters, code versions, metrics, and artifacts—use it to anchor run metadata to the manifest via a shared run_id.
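
A sketch of that anchoring with MLflow Tracking; the tag names and the "manifest.json" artifact are conventions assumed here, not MLflow requirements.

```python
import mlflow

manifest_hash = "3f6c1..."  # computed from the canonicalized manifest, as sketched above

with mlflow.start_run() as run:
    # Anchor the run to the snapshot contract.
    mlflow.set_tag("dataset.manifest_hash", manifest_hash)
    mlflow.log_artifact("manifest.json")  # keep the full manifest next to the run
    # Run completeness: commit, image digest, parameters, seed.
    mlflow.set_tag("git_commit", "abc1234")
    mlflow.set_tag("image_digest", "sha256:...")
    mlflow.log_params({"seed": 1234, "lr": 3e-4, "epochs": 1})
    print("run_id:", run.info.run_id)  # share this id with the lineage store and the manifest
```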

Why it matters: Automation turns lineage from “nice diagrams” into operational proof.

Conclusion

  • Build LLM data lineage around an unbreakable chain: model artifact → run → dataset manifest.
  • Combine: DVC-style reproducible pipelines + MLflow run tracking + (MLMD/OpenLineage) lineage storage/standardization.

Summary

  • A dataset manifest is the reproducibility anchor.
  • MLMD/Vertex capture lineage as an artifact–execution–event graph.
  • OpenLineage standardizes event emission across the ecosystem.
  • Determinism is a tiered decision, not a default.

#llm #mlops #datalineage #reproducibility #openlineage #mlmd #mlflow #dvc #provenance #datagovernance
