
The Fallacy of Peak Data
In his spare time, when he’s not dismantling the federal government with a one-size-fits-all jackhammer, Elon Musk is pontificating about the dearth of content to train AI and how we have reached “peak data,” a term meaning there is no more human-created data left to train AI. A fallacy if there ever was one.
Seeing this, I immediately shot a note to Outsell’s Hugh Logue, our lead for advising IP owners on licensing strategies for GenAI use cases.
“Hugh, this contradicts our licensing tracker and your piece of last week. Elon has a bigger voice but let’s get the word out.”
So here we are. Once again, the data, information services, and content industry is more or less invisible while the bro-sphere goes around telling us all the world’s info is out on the open web. I have heard this fallacy for the better part of 30 years. They forget there is a whole ecosystem of quality content, unique in its ability to train, if they could only wake up and figure out that it’s out there and that they have to obtain it legitimately to feed the machine. Our licensing tracker says this is absolutely the case: it holds over 70 publicly announced deals, with many more unannounced ones behind them.
To quote Hugh’s piece:
“The [licensing] deals reveal that the claim that AI companies are ‘running out of training data’ is wrong. While concerns over data availability surfaced in early 2024, major tech firms have continued to ramp up licensing efforts, particularly in the book publishing sector. The reality is that AI firms have barely scratched the surface in licensing proprietary, high-quality datasets, and a vast reservoir of structured content remains untapped.”
This dovetails with our blog post last year about lessons from the FT’s licensing efforts. While big tech prefers not to talk about its deals, and many of the larger content providers don’t (or can’t) either, they are happening in spades. Trust me here: we are involved in many of them. And there is no such thing as ‘peak data’ because, as Hugh so aptly says:
….the real issue is that companies need to license high-quality proprietary content. Comparing data to oil is flawed. Oil is a finite resource, and instead content for LLMs must be seen as renewable like wind, with publishers and information providers continuously generating valuable content that AI must incorporate to remain authoritative.
I asked Hugh to elaborate, and share his perspective…
The notion that AI is running out of training data (framed as “peak data”) is fundamentally flawed. Comparing data to oil is a false analogy. Oil is a finite, non-renewable resource, while content is generated continuously. Far from facing a scarcity crisis, we have a vast and renewable source of content: publishers. For centuries, publishers have been experts at identifying valuable content, commissioning new material, and ensuring high-quality knowledge production. The publishing and information industry is structured around the continuous generation of authoritative and insightful content. There is no shortage of content; the only shortage is AI companies willing to pay for it.
Much of the world’s most valuable data isn’t freely available on the open internet. Information providers such as RELX, Clarivate, Wolters Kluwer, Bloomberg, and countless others have extensive proprietary datasets that AI developers can license. These datasets contain structured, verified, and high-value content that can enhance AI capabilities far beyond what’s available from scraping the web. Many AI companies already recognize this, and Outsell has advised on several deals in which AI companies are quietly securing licenses to access specialist knowledge. These vast troves of information remain available, just not for free.
Generative AI is far from perfect, often producing hallucinations and inaccuracies. Instead of relying on increasingly synthetic loops of AI-generated content, AI companies must direct resources toward acquiring high-quality proprietary content that improves model efficiency and efficacy.
The extraordinary advances in AI over the past few years were driven by aggressive data consumption, but such exponential progress is not indefinitely sustainable. The reality is that AI will improve at a more measured pace as the industry shifts from mass data accumulation to precision learning. The era of unstructured mass web scraping must come to an end. The next stage requires AI companies to invest in licensing, refining, and optimizing their use of content rather than assuming infinite expansion.
AI developers must accept that quality training data comes at a cost. Many already do — securing licensing agreements for specialist content and leveraging partnerships with major information providers. The future of AI isn’t about panicking over data shortages or doubling down on synthetic feedback loops. It’s about strategic investment in high-quality, domain-specific knowledge that enables AI to perform better with less.
The real challenge isn’t a data drought — it’s an industry-wide reckoning that AI must evolve beyond the free-for-all model of the past and embrace the value of expertly curated, renewable content sources.
Well said, Hugh.
And besides, as Outsell’s David Worlock added:
Much of the ‘free’/scraped data used is often pirated, and unless it comes from a licensed source, one cannot tell data provenance, demonstrate its completeness, or be sure that it has not been subsequently edited or altered.
Mr. Musk, we have long said that if the machine relies only on sucking the internet dry, everyone will be bobbing around in the same bathwater. We know that bathwater gets dirty, and the same happens when everyone uses only content from the open web.
Where will the differentiation be? It certainly isn’t going to come from everyone training everything on the same public information. That is ridiculous.
The lesson here is that for LLMs to work, they need different data sources, and fresh ones, because knowledge never rests and information is always being created. Yesterday’s news trains the machines on yesterday’s news. Yesterday’s scholarly research does too. You have to feed the beast with fresh new sources, and those sources are a) not free and b) not static. Trust us, they are plentiful.
Just ask our database of over 15,000 companies and organizations that produce proprietary content, an active database growing by the day. Ergo our point: even data about data never rests, and it is certainly not finite.
Need help with your licensing strategy for GenAI? Give us a call. Meanwhile, spread the word: Mr. Musk has this one wrong.