It’s impossible to read about technological innovation right now without hearing the term “generative AI.” We are in a moment of seemingly nonstop excitement (and seemingly nonstop lawsuits) about the future of AI-assisted content creation, and the questions such creation raises about data ownership, privacy, the future of work, and how technology shapes individual and collective rights. And, of course, when considering questions about rights, it is important to think not only about novel technical developments, but also novel issues such development presents for the law.

Challenging questions at the intersection of technology and law are not new. Nevertheless, recent generative AI capabilities have been so unexpected and transformative that many are asking whether (and how) the law may need to transform in order to contend with generative AI’s broader societal impact. We are concurrently seeing two highly specialized fields of knowledge undergo immense changes, with numerous opportunities for both to inform each other.

One area of particular interest is the relationship between generative AI and copyright law, especially in the context of large language models (LLMs) and diffusion-based image generation models. Systems like ChatGPT and Stable Diffusion exhibit impressive capabilities; however, they have also been shown to regurgitate training data examples in their outputs, raising concerns about infringement of intellectual property rights. There have been numerous lawsuits in recent years, e.g., Clarkson, et al. v. OpenAI, et al., Paul Tremblay and Mona Awad v. OpenAI, et al., Doe 1, et al. v. GitHub, Inc., et al., and Andersen, et al. v. Stability AI Ltd., et al., to name a few.

In such an ever-shifting landscape, the only certainty is that the future is uncertain. Even so, it’s clear that developing expertise in either area requires being attentive to the other. Doing work in generative AI without at least a passing familiarity with copyright is increasingly untenable, and vice versa.

While comprehensive expertise in both areas is an elusive goal, it’s still important to be familiar with concepts in generative AI and copyright. Familiarity with key ideas across both disciplines is essential for asking more precise questions at their intersection — questions that can meaningfully shape the futures of technical research, and law and policy.

What our explainer series aims to do

Before we can discuss precise questions at the intersection of generative AI and copyright law, we first need to develop a common understanding of some of the building blocks in each discipline. Our explainer series will provide salient details from both areas at (what we hope is) the right level of abstraction. After reading this series, ML researchers and practitioners should have a better understanding of how copyright concerns may impact their technical work, and legal experts should have a better understanding of how specific technical aspects of generative AI are important to consider when analyzing concrete implications for copyright.

What our explainer series doesn’t do

This explainer series is not a machine learning paper. We don’t present novel technical results on generative AI or new model evaluation metrics, nor do we aim to write a comprehensive lit review of generative AI (the pace of the field makes that impossible). We describe core concepts, such as training data, copyright, and prompting. While details may change over time, we focus on concepts that are likely to remain primary players for the foreseeable future.

This explainer series is not a law review paper. We don’t provide an in-depth analysis of the implications of generative AI for copyright law. We present the contours of important concepts in copyright law, give an intuition for why they’re relevant to current discussions of generative AI, and suggest connections between these concepts and important questions about evaluation of generative AI systems. Rather than doing a deep dive on a specific copyright concept (e.g., fair use), we hope that our series will give others the necessary background to be able to explore specific concepts with greater precision.

We’ve divided the explainer series into 4 parts

Training data: We describe what training data is and how it is collected, putting collection processes for generative AI in historical context with prior image and text generation systems. Training datasets are created objects; we emphasize the associated choices that data collectors make, which impact trained model behavior.
Copyright: There are currently a lot of concerns about the interplay between model behavior and copyright law. For example, there is an active debate over whether training on copyrighted data constitutes infringement or whether producing an output generation that looks almost identical to a training data example constitutes infringement. To understand these potential issues better, it’s necessary to have some background information on what copyright law is and what ownership rights it’s intended to protect (and what it doesn’t). We provide a brief sketch of key concepts helpful for understanding why copyright is such a prevalent concern in news and lawsuits regarding generative AI.
Training models and generation: While we primarily situate our discussion of copyright in relation to training data, other aspects of a generative AI system may implicate legal issues. We describe key terms and concepts in the process of training models and generating outputs, which rely on our prior discussions of training data and copyright.
Looking ahead: The three posts above provide high-level background on generative AI and copyright concepts that (if we’ve done our jobs) should enable more precise discussion of emerging issues at their intersection. We describe some current trends in research at this intersection, and possible future directions.

Bonus: “Talkin’ ’Bout AI Generation: Copyright and the Generative-AI Supply Chain”. July 2023. (to appear, Journal of the Copyright Society): We wrote a long article on the generative-AI supply chain and how that informs the copyright analysis. You can also read the teaser blog post here!

Dedication

Chapter 1 is dedicated to the late Chris Cieri, director of LDC, with whom we had discussed the early versions of this paper in 2021.

Acknowledgements

This explainer is fueled by years of discussions with wonderful people, including, but not limited to: James Bradbury, Nicholas Carlini, Chris Cieri, Lillian Lee, Shayne Longpre, David Mimno, Ludwig Schubert, Florian Tramèr, and the Artificial Intelligence, Policy, and Practice initiative at Cornell University.

Next: The Devil is in the Training Data →