
Models & Junk Food
We are in the thick of topics related to licensing content to big tech and others seeking to feed the machines. With their insatiable appetite for content that differentiates, synthetic data is also a topic of interest.
Personally, the term makes me shudder. It is the highly processed food, the polyester, of our industry, and it can’t be good. It is junk food for models. Derived data I get. But there is still a lot of great quality data, content, and information available for the taking. It just requires licenses. As with anything of quality, it doesn’t come free. Are you ready?
Last month we were asked about our current stance on the data/content continuum (from open and freely available data to highly proprietary, specialized, rights-controlled content) and how we see this evolving in light of recent licensing activity, copyright considerations, and fair use trends. Here’s what Outsell’s Hugh Logue shared:
We view the continuum as:
- Public domain/open data: government datasets, Wikipedia, CC0 content. Freely usable, but limited differentiation or value.
- Free-to-use copyrighted material: open web pages, forums, user-generated content. Technically copyrighted but long treated as “de facto open” under fair use or text-and-data-mining exceptions. That era is ending fast as platforms restrict scraping and move to paid APIs.
- Licensed commercial content: news archives, journals, legal databases, image libraries. Copyright-protected and increasingly sold under contract. This is now the growth segment, and where most of the AI licensing activity is concentrated.
- Highly proprietary data: premium research, standards, clinical and financial datasets. Rights-managed, tightly controlled, and commanding top-tier prices for AI fine-tuning and domain models.
Value creation is migrating to the bottom of this list, from open, low-value data to proprietary, specialized, rights-controlled content. AI firms are realizing that clean, rights-cleared data both reduces legal risk (EU AI Act etc.) and improves model quality.
We’ve tracked around 80 AI content licensing agreements, with examples from News Corp, FT, TIME, Washington Post, Wiley, Wolters Kluwer, and LexisNexis. Three themes stand out:
- The scraping era is over — developers now seek indemnified, licensed datasets to avoid Anthropic-style exposure.
- Retrieval and inference rights are overtaking bulk training rights, as publishers push for attribution and controlled use.
- Regulation is the catalyst — the EU AI Act, the UK’s 2026 Bill-in-waiting, and U.S. litigation are all steering the market toward licensed data sourcing.
Some firms still lean on fair use arguments for “transformative training,” but we believe that window is closing.
- In the U.S., the Bartz v. Anthropic case drew a bright line: training can be transformative, but sourcing from pirated data is never fair use.
- In the UK, Getty’s partial retreat against Stability AI highlights the legal grey zone, but the coming 2026 AI Bill is likely to tighten rules.
- In the EU, the AI Act’s transparency obligations and voluntary Code of Practice are already making unlicensed training commercially untenable.
It’s also worth noting that the Generative AI Copyright Disclosure Act in the U.S. is proposed legislation that we believe is unlikely to become law in the current climate.
Similarly, in Canada, the Artificial Intelligence and Data Act (AIDA) was tabled and has not yet passed, and progress has largely stalled. We’ve covered this, along with other jurisdictions, in our recent AI Regulation and Litigation Tracker analysis (on the Outsell platform here).
In short, the “fair use” defense is narrowing, and the compliance cost of unlicensed data is rising. The market is converging on a simple logic:
- Data without rights carries legal and reputational risk.
- Data with rights carries commercial value.
Outsell views many players using the same free data as being like everyone bathing in the same water. AI content licensing is a differentiator, and so licensing is emerging as a strategic revenue stream. The firms that are rights-clear, discoverable, and AI-ready are commanding premiums and setting the norms others will follow.
Big tech comes looking when the engineers say they need a data set. Demand is reactive and driven by need, so you have to be ready if you wish to monetize. It’s another important revenue stream, and in this era of challenging growth, why not be ready when the doorbell rings?
So much better than fake data, no?
Call us when you need advisory on this topic, and don’t miss RevvedUp 2026, co-produced by H2K Labs and Outsell. This will be one of dozens of topics explored on driving growth and business model innovation. By invitation, for CEOs and revenue-facing C-suite execs. Register today.