Forcing shadow libraries out of the darkness

25 July 2023

Data from so-called shadow libraries is used to train large language models (LLMs), to the consternation of many authors. Should the people behind free access to books online face recriminations, or does the responsibility fall on the technology companies profiting from shadow libraries?

The LLMs that power systems like ChatGPT are developed using large libraries of text. Books, being long and (supposedly) well written, are ideal training material. But authors are beginning to push back against their work, made freely available on a not-for-profit basis, being digested in this way to train the LLMs behind paid-for services.

This week, more than 9,000 authors, including James Patterson and David Baldacci, have called on tech executives to stop training their tools on writers’ work without compensation.

In objecting to the free use of authors’ work, the campaign has put the spotlight back on shadow libraries like Z Library, Bibliotik, and Library Genesis. Each is a repository holding millions of titles in an obscure corner of the internet.

Privacy, piracy, AI(racy)

Earlier this year, LLMs came under fire for privacy violations and ChatGPT was banned in Italy. The concern was that the chats individuals had with the models were being used for training.

After OpenAI let users opt out of having their data used for training and made the links to its privacy policy clearer, ChatGPT was, at the time of writing, back up and running in Italy.

The issue of piracy and shadow libraries has been hitting headlines recently after Z Library’s founders were arrested for offences around copyright and ownership of intellectual property. What hasn’t been so widely discussed is the fact that the free-access libraries are often used as AI training data.

The fact that AI training relies on shadow libraries has been acknowledged in research papers by the companies developing the technology. OpenAI’s GPT-1 was trained on BookCorpus, which has over 7,000 unpublished titles scraped from self-publishing platform Smashwords.

Once training began for GPT-3, OpenAI said that roughly 16% of the data it used was from two “internet-based books corpora” that it dubbed “Books1” and “Books2.” A lawsuit by the comedian Sarah Silverman and two other authors against OpenAI claims that Books2 is a “flagrantly illegal” shadow library.

The Authors Guild has organized an open letter to tech executives citing studies [pdf] from 2016 and 2017 that suggested text piracy reduced legitimate book sales by as much as 14%.

Shadow libraries aren’t at fault

Tech companies are increasingly closed about what data they use to train their systems. Meta’s paper on Llama 2 [pdf], published by researchers this week, said the LLM was trained using only a “new mix of data from publicly available sources.”

In its March research paper on GPT-4 [pdf], OpenAI claimed that secrecy about what its LLM was trained on was necessary due to “the competitive landscape” and “safety considerations.”

Whether tech companies are hiding their sources from each other, or protecting free sources for their own gain, efforts to shut down these sites have had little effect. Even after the FBI charged two Russian nationals accused of running Z Library with copyright infringement, fraud and money laundering, the site came forward with plans to go physical.

Shadow libraries have also moved onto the dark web and torrent sites, so they’re harder to trace. Because many of them are run from outside of the US, anonymously, punishing the operators is difficult.

However, while the average user of a shadow library like Z Library shouldn’t face repercussions for accessing texts there, perhaps the tech companies profiting from the databases should.

Given the volume of data needed to train an LLM, it’s unsurprising that amassing enough explicitly licensed sources would be time-consuming and tricky, so many AI researchers have opted to ask for forgiveness after the fact rather than permission.

They also argue that their use of online data comes under fair use in copyright law. But as authors rally against shadow libraries, the focus may be falling on the wrong people.