Introduction

  • TL;DR: On 2025-12-22, investigative journalist and author John Carreyrou and five other authors filed a copyright lawsuit in the Northern District of California against OpenAI, Google, Meta, xAI, Anthropic, and Perplexity.
  • The complaint alleges the companies used pirated copies of copyrighted books—sourced from “shadow libraries”—to train and optimize large language models.
  • This case sharpens legal scrutiny not only on “fair use in training,” but also on upstream data acquisition, storage, and multi-stage copying across LLM pipelines.

Context: The issues here—copyright, LLM training data, and data governance—are converging fast. Even when courts debate fair use, poor provenance and unlawful acquisition can create separate liability surfaces.

Case Snapshot: Who sued whom, when, and where

Filing and docket basics

  • Filed: 2025-12-22
  • Venue: U.S. District Court, Northern District of California
  • Case: Carreyrou et al. v. Anthropic PBC et al., 3:2025cv10897
  • Defendants: Anthropic, Google, OpenAI, Meta, xAI, Perplexity

Why it matters: A confirmed docket trail makes this more than rumor-cycle news—it’s an active dispute where discovery, preservation, and provenance documentation can become decisive.

What the complaint is really attacking: acquisition + copying, not just “training”

Shadow libraries and alleged pirated downloads

Bloomberg Law reports the complaint points to “illegal shadow libraries” such as LibGen, Z-Library, and OceanofPDF as sources for pirated copies.

Multi-stage copying during training and optimization

The same report describes a two-step infringement theory: initial illegal downloading, followed by additional copies created while training or optimizing products.

Why plaintiffs avoided a class action

Reuters notes the plaintiffs explicitly criticized class actions as favoring defendants, since they let many claims be resolved in a single, cheaper settlement.

Why it matters: In practice, “fair use” fights are slow and fact-intensive. Clear evidence of unlawful acquisition or storage can become a more direct—and sometimes more dangerous—risk vector for AI teams.

Anthropic and the split between training vs. piracy/storage

The Guardian reports a key 2025 ruling framing training as fair use while treating large-scale pirated storage as infringement—separating the legal analysis across stages.

Meta’s fair use win (with caveats)

Another Guardian report describes Meta's win, where plaintiffs failed to show sufficient market harm, again highlighting how outcomes turn on case-specific fact patterns.

The $1.5B Anthropic settlement signals the stakes

AP reports Anthropic agreed to a $1.5B settlement in an authors’ class action tied to pirated books used for training.

Why it matters: Legal outcomes increasingly hinge on “what exactly happened in the data pipeline” rather than generic arguments about innovation or transformation. Data governance becomes litigation strategy.

Technical reality: where “copying” happens in an LLM pipeline

Pipeline stages that create legally relevant copies

Ingestion → ETL → tokenization → training caches/checkpoints → tuning datasets and logs. Each stage typically persists its own copy or derivative of the underlying text, so a single book can surface as many legally relevant artifacts.
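
A minimal sketch of how a team might keep those copies traceable: each stage appends a hashed record of every artifact it writes to an append-only ledger. The function and field names here are illustrative assumptions, not an established API.

import hashlib
import json
import time
from pathlib import Path

def record_stage_artifact(ledger: Path, stage: str, artifact: Path, parents: list[str]) -> str:
    # Hash the artifact this stage just wrote (shard, token cache, checkpoint).
    digest = hashlib.sha256(artifact.read_bytes()).hexdigest()
    entry = {
        "stage": stage,                # e.g. "ingestion", "tokenization"
        "artifact": str(artifact),
        "sha256": digest,
        "parents": parents,            # hashes of the inputs this copy derives from
        "recorded_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }
    # Append-only JSONL ledger: every derived copy stays linked to its upstream source.
    with ledger.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return digest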

Memorization research is shaping evidentiary debates

Recent arXiv work shows that passages from copyrighted books can be extracted from some open-weight LLMs (with variation by model and book), complicating blanket claims from both sides. Related work studies membership-inference signals on copyrighted-book datasets, arguing for greater transparency around pre-training sources.
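
This line of research motivates output-side spot checks. A minimal sketch using a character-window heuristic; the 200-character threshold is illustrative, a rough stand-in for the token-count thresholds used in extraction papers, not a legal standard.

def has_verbatim_run(generated: str, reference: str, min_chars: int = 200) -> bool:
    # Slide a min_chars window over the model output; any exact hit in the
    # reference text counts as a long-form verbatim reproduction.
    # Quadratic in the worst case, but fine for offline spot checks.
    if len(generated) < min_chars:
        return False
    return any(
        generated[i : i + min_chars] in reference
        for i in range(len(generated) - min_chars + 1)
    )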

Why it matters: Even if your model rarely outputs verbatim text, weak provenance and uncontrolled artifact retention (raw dumps, shards, caches) can still create liability.

Practical playbook: build a “Dataset SBOM” and prove provenance

Minimum controls

  • Source + license metadata per dataset
  • Allowlist/denylist for acquisition (see the gate sketch after this list)
  • Retention + deletion proofs for raw and derived artifacts (see the tombstone sketch after the manifest example)
  • Output-side safeguards against long-form verbatim reproduction
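
For the allowlist/denylist control, a minimal gate might look like the following. The registry names and blocked-domain markers are hypothetical placeholders for an internal approval process.

# Hypothetical registry: approved acquisition channels and known-bad markers.
APPROVED_SOURCES = {"INTERNAL_VENDOR_ABC_2025Q4", "LICENSED_CORPUS_XYZ"}
BLOCKED_MARKERS = {"libgen", "z-lib", "oceanofpdf"}  # shadow-library markers

def acquisition_allowed(source_id: str, origin_url: str = "") -> bool:
    # Deny first: any URL touching a blocked marker fails regardless of tag.
    if any(marker in origin_url.lower() for marker in BLOCKED_MARKERS):
        return False
    # Then require an explicit allowlist hit; unknown sources are rejected.
    return source_id in APPROVED_SOURCES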

Lightweight manifest example (Python)

import hashlib
import json
from dataclasses import dataclass, asdict
from datetime import date
from pathlib import Path
from typing import List, Optional

@dataclass
class DatasetAsset:
    """One file in the dataset, with the provenance fields a dispute would probe."""
    path: str
    sha256: str
    source: str
    license_id: str
    acquired_on: str
    notes: Optional[str] = None

def sha256_file(p: Path) -> str:
    # Hash in 1 MiB chunks so large shards don't have to fit in memory.
    h = hashlib.sha256()
    with p.open("rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(files: List[Path], source: str, license_id: str, notes: str = "") -> dict:
    # Refuse to emit a manifest without provenance: a hash with no source
    # or license attached proves little if the dataset is ever challenged.
    if not source or not license_id:
        raise ValueError("Missing provenance: source and license_id are required.")

    today = date.today().isoformat()
    assets = [
        DatasetAsset(
            path=str(f),
            sha256=sha256_file(f),
            source=source,
            license_id=license_id,
            acquired_on=today,
            notes=notes or None,
        )
        for f in files
    ]

    return {
        "manifest_version": "1.0",
        "generated_on": today,
        "asset_count": len(assets),
        "assets": [asdict(a) for a in assets],
    }

if __name__ == "__main__":
    # Walk an incoming dataset drop and write the manifest next to it.
    data_dir = Path("./incoming_dataset")
    files = [p for p in data_dir.rglob("*") if p.is_file()]
    manifest = build_manifest(files, "INTERNAL_VENDOR_ABC_2025Q4", "Commercial-License", "contract_id=CTR-2025-1042")
    Path("dataset_manifest.json").write_text(json.dumps(manifest, indent=2), encoding="utf-8")

Why it matters: In court, “we didn’t do it” is weaker than “here is the trace.” Provenance artifacts can reduce legal uncertainty and accelerate internal incident response.

Conclusion

  • The Carreyrou-led lawsuit (filed 2025-12-22) targets six major AI firms and centers on alleged pirated book acquisition and downstream copying in LLM pipelines.
  • 2025 decisions show courts may separate “training” from “piracy/storage,” making acquisition and retention controls a first-class risk area.
  • Treat dataset provenance like security: manifests, deny/allow lists, retention proofs, and deletion readiness are no longer optional.

Summary

  • Provenance beats narratives.
  • Pirated acquisition and artifact retention can be separate liability surfaces.
  • Build a Dataset SBOM to prove licensing and control derivative artifacts.

#copyright #llm #trainingdata #datagovernance #OpenAI #Google #Meta #xAI #Anthropic #Perplexity
