Introduction

  • TL;DR: On 2025-12-22, investigative journalist and author John Carreyrou and five other authors filed a copyright lawsuit in the Northern District of California against OpenAI, Google, Meta, xAI, Anthropic, and Perplexity.
  • The complaint alleges the companies used pirated copies of copyrighted books—sourced from “shadow libraries”—to train and optimize large language models.
  • This case sharpens legal scrutiny not only on “fair use in training,” but also on upstream data acquisition, storage, and multi-stage copying across LLM pipelines.

Context: The issues here—copyright, LLM training data, and data governance—are converging fast. Even when courts debate fair use, poor provenance and unlawful acquisition can create separate liability surfaces.

Case Snapshot: Who sued whom, when, and where

Filing and docket basics

  • Filed: 2025-12-22
  • Venue: U.S. District Court, Northern District of California
  • Case: Carreyrou et al. v. Anthropic PBC et al., 3:2025cv10897
  • Defendants: Anthropic, Google, OpenAI, Meta, xAI, Perplexity

Why it matters: A confirmed docket trail makes this more than rumor-cycle news—it’s an active dispute where discovery, preservation, and provenance documentation can become decisive.

What the complaint is really attacking: acquisition + copying, not just “training”

Shadow libraries and alleged pirated downloads

Bloomberg Law reports the complaint points to “illegal shadow libraries” such as LibGen, Z-Library, and OceanofPDF as sources for pirated copies.

Multi-stage copying during training and optimization

The same report describes a two-step infringement theory: initial illegal downloading, followed by additional copies created while training or optimizing products.

Why plaintiffs avoided a class action

Reuters notes the plaintiffs explicitly criticized class actions as favoring defendants, since they let many claims be resolved in a single, cheaper settlement.

Why it matters: In practice, “fair use” fights are slow and fact-intensive. Clear evidence of unlawful acquisition or storage can become a more direct—and sometimes more dangerous—risk vector for AI teams.

Anthropic and the split between training vs. piracy/storage

The Guardian reports a key 2025 ruling framing training as fair use while treating large-scale pirated storage as infringement—separating the legal analysis across stages.

Meta’s fair use win (with caveats)

Another Guardian report describes Meta's win, where plaintiffs failed to show sufficient market harm, again highlighting how outcomes turn on case-specific fact patterns.

The $1.5B Anthropic settlement signals the stakes

AP reports Anthropic agreed to a $1.5B settlement in an authors’ class action tied to pirated books used for training.

Why it matters: Legal outcomes increasingly hinge on “what exactly happened in the data pipeline” rather than generic arguments about innovation or transformation. Data governance becomes litigation strategy.

Technical reality: where “copying” happens in an LLM pipeline

Pipeline stages that create legally relevant copies

Ingestion → ETL → tokenization → training caches/checkpoints → tuning datasets and logs. Each stage typically persists its own copy or derivative of the underlying text, so a single book can surface as many legally relevant artifacts.
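
A minimal sketch of how a team might keep those copies traceable: each stage appends a hashed record of every artifact it writes to an append-only ledger. The function and field names here are illustrative assumptions, not an established API.

import hashlib
import json
import time
from pathlib import Path

def record_stage_artifact(ledger: Path, stage: str, artifact: Path, parents: list[str]) -> str:
    # Hash the artifact this stage just wrote (shard, token cache, checkpoint).
    digest = hashlib.sha256(artifact.read_bytes()).hexdigest()
    entry = {
        "stage": stage,                # e.g. "ingestion", "tokenization"
        "artifact": str(artifact),
        "sha256": digest,
        "parents": parents,            # hashes of the inputs this copy derives from
        "recorded_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }
    # Append-only JSONL ledger: every derived copy stays linked to its upstream source.
    with ledger.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return digest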

Memorization research is shaping evidentiary debates

Recent arXiv work shows that passages from copyrighted books can be extracted from some open-weight LLMs (with variation by model and book), complicating blanket claims from both sides. Related work studies membership-inference signals on copyrighted-book datasets, arguing for greater transparency around pre-training sources.
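
This line of research motivates output-side spot checks. A minimal sketch using a character-window heuristic; the 200-character threshold is illustrative, a rough stand-in for the token-count thresholds used in extraction papers, not a legal standard.

def has_verbatim_run(generated: str, reference: str, min_chars: int = 200) -> bool:
    # Slide a min_chars window over the model output; any exact hit in the
    # reference text counts as a long-form verbatim reproduction.
    # Quadratic in the worst case, but fine for offline spot checks.
    if len(generated) < min_chars:
        return False
    return any(
        generated[i : i + min_chars] in reference
        for i in range(len(generated) - min_chars + 1)
    )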

Why it matters: Even if your model rarely outputs verbatim text, weak provenance and uncontrolled artifact retention (raw dumps, shards, caches) can still create liability.

Practical playbook: build a “Dataset SBOM” and prove provenance

Minimum controls

  • Source + license metadata per dataset
  • Allowlist/denylist for acquisition (see the gate sketch after this list)
  • Retention + deletion proofs for raw and derived artifacts (see the tombstone sketch after the manifest example)
  • Output-side safeguards against long-form verbatim reproduction
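
For the allowlist/denylist control, a minimal gate might look like the following. The registry names and blocked-domain markers are hypothetical placeholders for an internal approval process.

# Hypothetical registry: approved acquisition channels and known-bad markers.
APPROVED_SOURCES = {"INTERNAL_VENDOR_ABC_2025Q4", "LICENSED_CORPUS_XYZ"}
BLOCKED_MARKERS = {"libgen", "z-lib", "oceanofpdf"}  # shadow-library markers

def acquisition_allowed(source_id: str, origin_url: str = "") -> bool:
    # Deny first: any URL touching a blocked marker fails regardless of tag.
    if any(marker in origin_url.lower() for marker in BLOCKED_MARKERS):
        return False
    # Then require an explicit allowlist hit; unknown sources are rejected.
    return source_id in APPROVED_SOURCES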

Lightweight manifest example (Python)

import hashlib
import json
from dataclasses import dataclass, asdict
from datetime import date
from pathlib import Path
from typing import List, Optional

@dataclass
class DatasetAsset:
    """One file in the dataset, with the provenance fields a dispute would probe."""
    path: str
    sha256: str
    source: str
    license_id: str
    acquired_on: str
    notes: Optional[str] = None

def sha256_file(p: Path) -> str:
    # Hash in 1 MiB chunks so large shards don't have to fit in memory.
    h = hashlib.sha256()
    with p.open("rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(files: List[Path], source: str, license_id: str, notes: str = "") -> dict:
    # Refuse to emit a manifest without provenance: a hash with no source
    # or license attached proves little if the dataset is ever challenged.
    if not source or not license_id:
        raise ValueError("Missing provenance: source and license_id are required.")

    today = date.today().isoformat()
    assets = [
        DatasetAsset(
            path=str(f),
            sha256=sha256_file(f),
            source=source,
            license_id=license_id,
            acquired_on=today,
            notes=notes or None,
        )
        for f in files
    ]

    return {
        "manifest_version": "1.0",
        "generated_on": today,
        "asset_count": len(assets),
        "assets": [asdict(a) for a in assets],
    }

if __name__ == "__main__":
    # Walk an incoming dataset drop and write the manifest next to it.
    data_dir = Path("./incoming_dataset")
    files = [p for p in data_dir.rglob("*") if p.is_file()]
    manifest = build_manifest(files, "INTERNAL_VENDOR_ABC_2025Q4", "Commercial-License", "contract_id=CTR-2025-1042")
    Path("dataset_manifest.json").write_text(json.dumps(manifest, indent=2), encoding="utf-8")

Why it matters: In court, “we didn’t do it” is weaker than “here is the trace.” Provenance artifacts can reduce legal uncertainty and accelerate internal incident response.

Conclusion

  • The Carreyrou-led lawsuit (filed 2025-12-22) targets six major AI firms and centers on alleged pirated book acquisition and downstream copying in LLM pipelines.
  • 2025 decisions show courts may separate “training” from “piracy/storage,” making acquisition and retention controls a first-class risk area.
  • Treat dataset provenance like security: manifests, deny/allow lists, retention proofs, and deletion readiness are no longer optional.

Summary

  • Provenance beats narratives.
  • Pirated acquisition and artifact retention can be separate liability surfaces.
  • Build a Dataset SBOM to prove licensing and control derivative artifacts.

#copyright #llm #trainingdata #datagovernance #OpenAI #Google #Meta #xAI #Anthropic #Perplexity
