28 Comments

I am reminded of the conflict I see every time crypto people butt up against securities laws. Sure, maybe there's an argument to be made that securities laws are outdated and should reflect modern technologies like decentralized networks, etc., but that's a *political* question, not a legal one. In the meantime, you still need to comply with the law, for better or worse.


"If you spit out the full text of Harry Potter without permission to do so, you are going to have a bad time." is great. I was trying to come up with a similar analogy but couldn't do it. I'm so going to steal it.

I'm not sure what the law says about whether original content generated by an AI is copyrightable. Let's say a piece of AI-generated art got popular. Is it in the public domain? Or is it the property of the person who first 'saw' its creation? I suspect it has to be the latter. But what if I keep generating variations of Mickey Mouse in anticipation that one version will be used in the future? Geez, that's complicated. Either the variation isn't different enough to deserve a distinct copyright, or it's different enough that I own the copyright.

Damn. Intellectual property is hard when superhuman intelligence is involved.


Whoah, wait a second! Why can't I gift an article to a machine? I get 10 of those a month. If a group of us banded together, we could gift hundreds if not thousands of articles per month, or in essence, anything worth publishing from that rag directly to Sam. That would give the machine permission to read the data, right? Then it could just number jumble it into something better. That's my idea. ha!


I was wondering how much the NYT can enforce its copyright on articles that it knew were already copied all over the internet. Seems unfair for them to let the copies proliferate and then pounce. Also, how are they harmed if the copies are all over the internet for free?

I also note that you seem to slip back and forth between "the rules" and "in practice" or what is reasonable. For fair use, because it's a multi-factor analysis, the latter will probably be more informative.


I don't care much about the copyright as such; anything that can slow down or sabotage AI, which is a route to our death, should slow it down. So I am supporting the NYT all the way here.

A machine is not a human and should not be treated as one.


Regarding "seeking consent of rights holders", this is mostly impossible in practice. I imagine much of the LLM training text consists of posts and comments from random Internet users who can't be reached at all, and the authors of much of the rest of the set won't bother responding.

Even for those that do respond, they'd likely never agree to "we can make derivative works from your IP for free forever", or they'd demand a living-wage equivalent for the rest of their lives, since they'd effectively be made obsolete.

The total cost for licenses that could be negotiated would likely sink any AI firm, and even then, >80% of rights holders would tell them to pound sand. It’s time for people who say “seek consent” to start saying “no training, no how”, as it isn’t possible to train an LLM otherwise.


So, just putting this out there: I'm a litigator in private firm practice, primarily working on IP issues, with an interest in AI (both safety and legal issues). If Zvi or others have questions about copyright law and relevant cases or filings, especially questions that fall on the "is" side of the "is/ought" divide (being in private practice unfortunately constrains what I'm allowed to say that has a normative component, particularly in any non-pseudonymous context), I'm around as an interlocutor who may be able to provide some clarity on the content and scope of current law, and on document analysis.

Please note that nothing I say is or is intended to create an attorney-client relationship or to constitute legal advice to Zvi or anyone else. I urge anyone to hire and consult with an attorney before taking action that has legal implications (IP-related or otherwise).


I think you could end up seeing a quite nuanced argument involving an interplay between all these issues, it is smart of NYT to raise all the claims at once. I think it would be helpful to highlight more strongly the complexity of the legal doctrine of Fair Use, as that's what the defense will rest on; for example https://fairuse.stanford.edu/overview/fair-use/four-factors/:

> The four factors judges consider are:

> * the purpose and character of your use

> * the nature of the copyrighted work

> * the amount and substantiality of the portion taken, and

> * the effect of the use upon the potential market.

Rather than gesturing at first-principles arguments about what's fair like "We all learn, he points out, for free. Why should a computer system have to pay?", I think it's more useful to dig into case law and try to figure out how the law is actually interpreted.

A key point I didn't see mentioned explicitly WRT 1) (the training set) is that if we didn't see verbatim copying, it would (IMO) quite clearly be a transformative use. But the fact that you do see some verbatim copying of articles muddies the water on that argument. So the defense will be forced to retreat to a more nuanced point: it's impossible for the model to compress every article verbatim (from an entropic perspective this is obvious, but could be tough to explain to a judge/jury), so most of the training usages are transformative, not copying. In some limited cases the model did what might be found to be non-transformative memorizing of the works; can the defense successfully sever the training process from these specific cases of duplication? A precedent of training being found not to be transformative use would be extremely damaging, as OpenAI seems to be refusing to license all the NYT content for training, and so you'd expect harsh damages if that is infringement.
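The entropy point above can be made concrete with a back-of-envelope sketch. The parameter count, precision, and corpus size below are illustrative assumptions for the sake of arithmetic, not figures from the lawsuit or from OpenAI:

```python
# Back-of-envelope: can a model store its whole training corpus verbatim?
# All numbers here are illustrative assumptions, not facts about GPT-4.

params = 175e9          # assumed parameter count (GPT-3-scale)
bits_per_param = 16     # assumed fp16 storage per parameter
corpus_tokens = 500e9   # assumed training-set size in tokens
bits_per_token = 16     # roughly 2 bytes of UTF-8 text per token

model_bits = params * bits_per_param
corpus_bits = corpus_tokens * bits_per_token

# If the corpus is even modestly incompressible, a model smaller than the
# corpus cannot hold it all verbatim: most training text must be
# generalized over, not copied.
print(f"model capacity / corpus size = {model_bits / corpus_bits:.2f}")
```

Under these assumed numbers the model has well under half the raw bit capacity of its corpus, so word-for-word storage of everything is ruled out by counting alone, even before considering that the parameters also have to encode everything else the model does.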

I expect they will fall back on "amount and substantiality of the portion taken" to argue that where the model is storing verbatim copyrighted text, they are merely quoting some popular articles (as one could do in a book review) and not copying substantial portions of the corpus. I suspect this will be the crux. If we don't view the copying at the corpus level, and instead argue that an individual article was copied substantially or in full, do we need to argue one count of infringement for each article where they can demonstrate duplication? I'd be interested in more analysis on how this factor will be applied, and whether the current case takes a corpus-level approach vs. per-article, or something in-between.

Where the model is using retrieval to directly fetch pages, I suspect that's going to be closely equivalent to your browser doing the same thing, and it's not clear whether there are any actual damages from real users using this trick (there are easier ways of getting around the paywall). This one is much easier to fix, and I believe you'd expect not to see punitive damages here, since it looks like OpenAI has tried to fix this issue when it's been reported. I think this one will fizzle.

And finally on all that, what parts do we expect to go to the SCOTUS? I'd expect the question of training being transformative use to be taken up, especially if it's found to be infringement. I'm not sure I see them taking up the mundane questions of specific cases of duplication being infringement, if training is found to be fair use, but specific cases of memorization is not.

Anyway, IANAL, would appreciate lawyers fleshing this out.


The legal fiction of intellectual property is thought policing and censorship, nothing more. If you make some idea public, I should be free to think about it and to express my thoughts to others. If you don't want that to happen, keep it to yourself.

The constitutional basis for intellectual property law is "To promote the Progress of Science and useful Arts". It's not about creative control, author's rights, or any other such special pleading. (Corollary: IP law that prevents such progress is de facto unconstitutional.)

None of the above relates to the practical matters at hand, because IP law in this country left behind any pretense of sanity many decades ago. But you seem to think everyone is on board with the general principle of copyright, and you are quite wrong about that.

(I do agree that scale matters; but I'm not very confident that a court will understand that the large-scale entity is OpenAI the corporation rather than "doing things with computers". Historically, they seem to have a particular bias related to the latter.)


Maybe I'm an idiot, but I think the solution is easy and should ("should", not necessarily "legally should") be answered by: "what if instead of an LLM, it was a person with perfect memory?"

This then leads us to:

0. that person (or their guardian) must purchase (1) NYT subscription in order to read it

1. if that person is asked to recite a NYT article word for word by someone else, and they do it - that is technically illegal, and they should stop doing it, and if they have made any profit from doing so, they have to pay that in damages.

2. if that person is asked to talk about the NYT article in general, or to excerpt part of it (with citation) so as to discuss it, that's fine and legal

3. the fact that they have the capacity to replicate the article in their mind is not legally relevant


Why assume training an LLM is shenanigans? Surely we want to look at it at an incentive level and it seems pretty similar to the way we decide to let other media sources use the facts extracted from journalism.

Sure, scale and unique aspects of LLMs could make this different, but it really worries me that many people seem to see the default as paying the creator for any use. It should be the opposite: extending copyright requires a positive argument that the application is likely to incentivize more creation than it inhibits.

Indeed, I tend to think the internet itself has already increased the harm side of copyright a lot relative to the benefits (there are a lot more derivative works people would like to create, and lower content-creation costs and other mechanisms reduce the degree of exclusivity needed to incentivize creation).

Unfortunately, people seem to emotionally see it as kind of like stealing, which creates all kinds of issues.


By typing a relatively short string of characters (call it a "prompt") into a browser's URL bar, one can cause the browser to display the full text of Harry Potter and the Sorcerer’s Stone. That is a true fact about the world. No one seriously thinks that this is the browser-makers' fault. So why is ChatGPT different? Why had it "better not work"?

Similarly, would "If users are actually using [Chrome] to get the text of New York Times articles on purpose for actual use, in practice, that seems clearly not okay" make sense and be a reason to be upset with Google (as the makers of Chrome)?

I'll also note that we live in a world in which all books and all movies are as a matter of practice available for free on the internet. Do with that as you will—but it's the empirical truth about the world we live in. You can't pretend that somehow copyright is what causes authors to get paid for their works, not when it's trivial to e.g. download books off of z-library. And yet somehow authors make money anyway. Why? Obviously not because of copyright! Not when the law is so unenforced (and it seems in practice unenforceable.) As a matter of legal realism, whatever it is that causes authors to get paid, it's not actually copyright law as law. Perhaps it's copyright "law" as a norm, i.e. that people would feel guilty for pirating books, or even merely feel like it's "just not done" to pirate books. Or maybe people just sensibly feel that it is in their own enlightened self-interest that authors get paid for books (at least, that's what I tell myself whenever I pay for a book.)


This was a good and balanced long read, packed with insights.

Yann LeCun's tweet about "all books should be freely available for download" tells me how little respect and understanding he, and I suspect many influential AI developers, have for art, creative work, and copyright law. Like seriously, how should authors make a living if they cannot be paid for the work they do? Who would write books? Only people who are rich from capital investments and can afford to spend a lot of time to write without any prospect of a financial upside.

As I see it, the main question in the NYT case is not whether people actually use GPT-4 to read NYT articles; as you say, practically no one does. The mere fact that GPT-4 is capable of quoting long passages of text from NYT articles is concerning. And the worst part for OpenAI is that these articles, as you also comment, are not publicly available but protected by paywalls. Even if judges accept that AI training on copyrighted material is fine, I think it's a hard case to make that it's OK to bypass paywalls, access proprietary data unauthorized, and use this data to train an AI model.
