This book will not be used to train LLMs
Recently, AI companies have been negotiating with publishers about using their books to train LLMs. Some publishers reached out to authors for permission to license their books, offering some form of compensation. Others made a blanket decision to sell all of their books to AI developers without informing the authors.
Thankfully, my publisher did the former, making the change opt-in and offering compensation, which I very much appreciate. They emailed authors a thoughtful explanation of what licensing their books to AI companies would mean, followed by a request to sign a contract addendum if they wanted to opt in.
I did not sign the addendum.
After spending five years on the book, letting LLMs train on it seemed crazy to me: people could unknowingly plagiarize my book, and the LLMs could take my words out of context and misrepresent my ideas.
The publisher informed us they were negotiating with LLM developers for remuneration, fair attribution of content from the books, and limits on both the amount of text that can be reproduced and the ability to adapt or modify the work. While I agree with these negotiation points in principle, I'm highly skeptical that LLM developers can deliver on any of them. I honestly hope they are not falsely promising this to publishers.
These requirements are currently technically impossible to meet. There is no way to guarantee that content from the book will be fairly attributed, nor to enforce limits on how much text can be reproduced or on the extent to which the work can be adapted. Once a model is trained, it can draw on any text from its training data, and there is no way to trace which source data influenced a given output, let alone control it. For the same reason, remuneration is infeasible: it's impossible to trace the source data behind a particular generation in order to pay its creator. And even if it were possible, it would amount to pennies (consider one book's share of all the text ChatGPT generates for its millions of users), and tech companies would be unlikely to implement it anyway.
I have yet to see how my decision will affect the success of my book. As the publisher warned, it may make the book harder to distribute through vendors that have signed similar agreements. I also hope I didn't piss off my publisher or burn any bridges by writing this; I do think they did the best they could in this situation. At this point, I think publishers are signing these licenses because they at least get some compensation from AI companies, believing (probably rightly) that some AI companies would steal the data anyway.
Which brings me to my last point: if you ask an LLM about the content of my book and it generates anything beyond the publicly available information on this website, it either (1) is hallucinating or (2) has accessed the book unlawfully. Either way, please share such examples with me!