  • Post category:AI World
  • Post last modified:December 5, 2025

Licenses or lawsuits? The new battleground for AI training data

What Changed and Why It Matters

The AI data war just escalated. The New York Times filed a new lawsuit against Perplexity. It alleges large‑scale copying and use of Times content without a license. This follows the Times’ 2023 suit against OpenAI and Microsoft.

At the same time, publishers are signing more licenses with AI firms. Columbia Journalism Review launched a tracker that maps who is suing and who is dealing. The line is clear. If you don’t license, expect litigation and discovery.

Here’s the pattern most people miss. Data costs are moving from zero to non‑zero. That reshapes model training, product margins, and who controls distribution.

The Actual Move

  • The New York Times sued Perplexity, alleging the startup used millions of Times articles to train and serve content without permission.
  • News Corp took a similar path in 2024, accusing Perplexity of “massive illegal copying” through impermissible scraping of copyrighted content, and demanded a jury trial.
  • In the authors’ cases, a court compelled OpenAI to produce communications explaining why it deleted datasets of pirated books. Courts are forcing transparency into training pipelines.
  • OpenAI pushed back against a request to disclose 20 million user chats in the Times case. It cited privacy and security risks. Discovery now intersects with user data protection.
  • CJR’s “Lawsuit or License?” tool catalogs the growing split: lawsuits with some firms, licenses with others. 2025 saw a surge in both.
  • Debevoise & Plimpton’s year‑end review notes early fair‑use rulings and rising pressure on dataset provenance. The center of gravity is shifting from model bravado to data hygiene.
  • Public Knowledge outlines a potential middle ground. It points to structured licensing, transparency, and practical standards that smaller publishers can adopt.
  • Timeline trackers and legal advisories underscore the same message. Publicly visible does not mean free to train on. Courts are narrowing that gap.

The era of “train on the open web” is closing. The era of “license or litigate” has begun.

The Why Behind the Move

• Model

Models need vast, clean data. Courts now demand provenance. Undocumented datasets become liabilities.

• Traction

Search‑like products index news by default. That creates instant utility and immediate IP exposure.

• Valuation / Funding

Investors will discount models with murky data rights. Licenses de‑risk future cash flows.

• Distribution

Newsrooms control high‑intent, fresh information. Licensing buys both data and distribution legitimacy.

• Partnerships & Ecosystem Fit

Deals with publishers unlock ongoing updates, corrections, and trust. They also throttle competitors.

• Timing

Regulatory and court signals hardened in 2025. Waiting now invites punitive discovery and injunction risk.

• Competitive Dynamics

Winners will standardize rights, audit trails, and take‑down flows. Compliance becomes a moat.

• Strategic Risks

  • Discovery may force disclosure of sensitive training decisions.
  • User privacy can collide with evidence requests.
  • Non‑licensed outputs risk injunctions at peak usage.
  • Crawlers that ignore robots.txt trigger brand and legal damage.
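That last risk is the easiest to engineer away. A minimal sketch of a pre-fetch compliance check, using Python's standard-library robots.txt parser; the bot name, rules, and URLs below are illustrative, not any real publisher's policy:

```python
# Sketch: gate every fetch behind a robots.txt check.
# Bot name and URLs are hypothetical examples.
from urllib.robotparser import RobotFileParser

def can_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Example policy: a site blocks a hypothetical "ExampleBot" from its articles.
robots = """User-agent: ExampleBot
Disallow: /articles/
"""

print(can_fetch(robots, "ExampleBot", "https://example.com/articles/x"))  # False
print(can_fetch(robots, "ExampleBot", "https://example.com/about"))       # True
```

In production you would fetch each site's live robots.txt (e.g. via `RobotFileParser.set_url` plus `read()`) and log every allow/deny decision, so the crawl policy itself becomes part of the audit trail.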

Trust, provenance, and rev‑sharing are becoming the unit economics of LLMs.

What Builders Should Notice

  • Treat data like cloud spend: metered, contracted, and monitored. Budget for it.
  • Provenance is product. Build audit logs and removal pipelines early.
  • License strategically. Pay for fresh, high‑leverage domains first.
  • Discovery readiness is a feature. Keep docs, approvals, and crawler configs tight.
  • Distribution beats model novelty. Partner where your users already live.
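A provenance log does not need to be elaborate to be useful in discovery. A minimal sketch of one record per ingested document; the field names are illustrative assumptions, not a standard schema:

```python
# Sketch: one provenance record per ingested document.
# Field names are hypothetical, not an industry standard.
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class ProvenanceRecord:
    source_url: str        # where the document was fetched from
    license_id: str        # contract/license identifier; "none" if unlicensed
    fetched_at: str        # ISO-8601 UTC timestamp of ingestion
    removed: bool = False  # flipped to True when a takedown is honored

def record(source_url: str, license_id: str) -> ProvenanceRecord:
    """Create a timestamped provenance record for one ingested document."""
    return ProvenanceRecord(
        source_url=source_url,
        license_id=license_id,
        fetched_at=datetime.now(timezone.utc).isoformat(),
    )

entry = record("https://example.com/articles/x", "pub-2025-001")
print(asdict(entry))
```

Appending these records to durable storage at ingestion time, rather than reconstructing them later, is what makes removal pipelines and discovery responses tractable.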

Buildloop reflection

The moat isn’t the model. It’s the rights to keep training tomorrow.
