GenLaw: Resources

What Does it Mean for a Language Model to Preserve Privacy? by Hannah Brown, Katherine Lee, Fatemehsadat Mireshghallah, Reza Shokri, and Florian Tramèr (2022): Language models are trained on unstructured text, so the private information in that data is itself unstructured and hard to delineate.

Competition and Anti-Trust

Generative AI Raises Competition Concerns by the FTC Staff (2023)

Language Models

Eight Things to Know about Large Language Models by Sam Bowman (2023): Good introduction to LMs.

A Very Gentle Introduction to Large Language Models without the Hype by Mark Riedl (2023)

What are embeddings? by Vicky Boykis (2023)

HuggingFace NLP Course: A more detailed set of tutorials on what NLP is, what Transformers are, and some more details about training/using language models.
Prompting Guide by DAIR.AI (2022): Great guide on prompting.
A Primer in BERTology: What we know about how BERT works by Anna Rogers, Olga Kovaleva, and Anna Rumshisky (2020): A fairly comprehensive review of the BERT language models that create embeddings of text.
BERT for Humanists by Matt Wilkens, David Mimno, Melanie Walsh, Rosamond Thalken, and Maria Antoniak (2022): Fantastic tutorial on language models, geared toward those who do not have a background in the area.

The Practical Guides for Large Language Models

Diffusion Models

Tutorial on Denoising Diffusion-based Generative Modeling: Foundations and Applications by Karsten Kreis, Ruiqi Gao, and Arash Vahdat (2022): Video tutorial on diffusion models.

Deep Unsupervised Learning using Nonequilibrium Thermodynamics by Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli (2015): The paper that introduced diffusion models.

Generative Modeling by Estimating Gradients of the Data Distribution by Yang Song and Stefano Ermon (2019)

Denoising Diffusion Probabilistic Models by Jonathan Ho, Ajay Jain, and Pieter Abbeel (2020)

Variational Diffusion Models by Diederik P. Kingma, Tim Salimans, Ben Poole, and Jonathan Ho (2021)

Training Data Extraction

Extracting Training Data from Large Language Models by Nicholas Carlini, Florian Tramèr, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Úlfar Erlingsson, Alina Oprea, and Colin Raffel (2021), blog, video: The original paper demonstrating that training data can be extracted from large language models.
Deduplicating Training Data Makes Language Models Better by Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini (2022): Duplicates in the data continue to be the most easily identifiable reason for memorization.
Quantifying Memorization Across Neural Language Models by Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramèr, and Chiyuan Zhang (2022): A more careful and comprehensive study showing that larger models memorize more.
Extracting Training Data from Diffusion Models by Nicholas Carlini, Jamie Hayes, Milad Nasr, Matthew Jagielski, Vikash Sehwag, Florian Tramèr, Borja Balle, Daphne Ippolito, and Eric Wallace (2023): It’s also possible to extract training data from diffusion models.

Diffusion Art or Digital Forgery? Investigating Data Replication in Diffusion Models by Gowthami Somepalli, Vasu Singla, Micah Goldblum, Jonas Geiping, and Tom Goldstein (2022)

Membership Inference

Membership Inference Attacks against Machine Learning Models by Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov (2017): First paper introducing the idea of membership inference.
Membership Inference Attacks From First Principles by Nicholas Carlini, Steve Chien, Milad Nasr, Shuang Song, Andreas Terzis, and Florian Tramèr (2021): Current best membership inference attack.
Label-Only Membership Inference Attacks by Christopher A. Choquette-Choo, Florian Tramèr, Nicholas Carlini, and Nicolas Papernot (2021): Masking outputs does not prevent membership inference.
Understanding Membership Inferences on Well-Generalized Learning Models by Yunhui Long, Vincent Bindschaedler, Lei Wang, Diyue Bu, Xiaofeng Wang, Haixu Tang, Carl A. Gunter, and Kai Chen: Outliers can be more vulnerable to membership inference.

Differential Privacy

Auditing Differentially Private Machine Learning: How Private is Private SGD? by Matthew Jagielski, Jonathan Ullman, and Alina Oprea (2020)

Considerations for Differentially Private Learning with Large-Scale Public Pretraining by Florian Tramèr, Gautam Kamath, and Nicholas Carlini (2022): Privacy is hard. Publicly accessible data is not the same as, nor equivalent to, public data. Differential privacy has limitations. Public data looks different from private data in meaningful ways, but our benchmarks sometimes miss that.

Training Text-to-Text Transformers with Privacy Guarantees by Natalia Ponomareva, Jasmijn Bastings, and Sergei Vassilvitskii (2022)

How to DP-fy ML: A Practical Guide to Machine Learning with Differential Privacy by Natalia Ponomareva, Hussein Hazimeh, Alex Kurakin, Zheng Xu, Carson Denison, H. Brendan McMahan, Sergei Vassilvitskii, Steve Chien, and Abhradeep Thakurta (2023): Practical guide to implementing DP.
Multi-Epoch Matrix Factorization Mechanisms for Private Machine Learning by Christopher A. Choquette-Choo, H. Brendan McMahan, Keith Rush, and Abhradeep Thakurta: State-of-the-art privacy mechanism.

Ethics

Using Large Language Models With Care by Maria Antoniak, Li Lucy, Maarten Sap, and Luca Soldaini (2023)

Generating Harms: Generative AI’s Impact and Paths Forward by Epic.org (2023)

On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜 by Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell (2021): One of the earliest published works on possible harms in large language models.

Understanding the Capabilities, Limitations, and Societal Impact of Large Language Models by Alex Tamkin, Miles Brundage, Jack Clark, and Deep Ganguli (2021)

Ethical and social risks of harm from Language Models by Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, Zac Kenton, Sasha Brown, Will Hawkins, Tom Stepleton, Courtney Biles, Abeba Birhane, Julia Haas, Laura Rimell, Lisa Anne Hendricks, William Isaac, Sean Legassick, Geoffrey Irving, and Iason Gabriel (2021)

Accountability in an Algorithmic Society: Relationality, Responsibility, and Robustness in Machine Learning by A. Feder Cooper, Emanuel Moss, Benjamin Laufer, and Helen Nissenbaum (2022): A synthesized analysis of why accountability is so elusive for AI/ML systems.

Contestability in Algorithmic Systems by Kristen Vaccaro, Karrie Karahalios, Deirdre K. Mulligan, Daniel Kluttz, and Tad Hirsch (2019)