I am reminded of the conflict I see every time crypto people butt up against securities laws. Sure, maybe there’s an argument to be made that securities laws are outdated, and that they should reflect modern technologies like decentralized networks, etc., but that’s a *political* question, not a legal one. In the meantime, you still need to comply with the law, for better or worse.
"If you spit out the full text of Harry Potter without permission to do so, you are going to have a bad time." is great. I was trying to come up with a similar analogy but couldn't do it. I'm so going to steal it.
I'm not sure what the law says about whether original content generated by an AI is copyrightable. Let's say a piece of AI-generated art got popular. Is it in the public domain? Or is it the property of the person who first 'saw' its creation? I suspect it has to be the latter. But what if I keep generating variations of Mickey Mouse in anticipation that one version will be used in the future? Geez, that's complicated. Either the variation isn't different enough to deserve a distinct copyright, or it is different enough that I own the copyright.
Damn. Intellectual property is hard when superhuman intelligence is involved.
Nope, currently it is public domain.
Regarding this issue -- "I'm not sure what the law says about whether original content generated by an AI is copyrightable."
AI-generated original art is generally not copyrightable, as it lacks the required element of human authorship, in particular due to the lack of predictability of its output. Thus it is generally public domain in and of itself. Note, however, that the selection and arrangement of AI outputs (rather than the outputs themselves) may still be eligible for copyright.
https://fingfx.thomsonreuters.com/gfx/legaldocs/klpygnkyrpg/AI%20COPYRIGHT%20decision.pdf
Hypothetically, does an AI-generated image become uncopyrightable? It seems like a way to prevent Disney from coming up with future variations of Mickey Mouse. Not that anyone would actually do that...
A clever dodge, but this would be unlikely to work in the general case. Copyrights include the right to prepare derivative works of the original, so works made by third parties (including fan works) employing copyrighted materials or elements are often at least nominally infringing and don't generally vitiate the rights of the original copyright holder.
Someone went and created a Stable Diffusion model of the public-domain Mickey Mouse. It'll be interesting to see what AI comes up with...
https://huggingface.co/Pclanglais/Mickey-1928
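For anyone who wants to poke at it, here's a minimal sketch of sampling from that checkpoint with the Hugging Face diffusers library. The pipeline class and the prompt are my assumptions based on it being a standard Stable Diffusion fine-tune, not documented usage for this specific model:

```python
# Minimal sketch: sampling from the Mickey-1928 checkpoint with diffusers.
# Assumes the repo is a standard Stable Diffusion fine-tune; the prompt
# and pipeline class are guesses, not documented usage for this model.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "Pclanglais/Mickey-1928",  # the checkpoint linked above
    torch_dtype=torch.float16,
).to("cuda")

image = pipe("Mickey, 1928 cartoon style, piloting a steamboat").images[0]
image.save("mickey.png")
```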
Whoah, wait a second! Why can't I gift an article to a machine? I get 10 of those a month. If a group of us banded together, we could gift hundreds if not thousands of articles per month, or in essence, anything worth publishing from that rag directly to Sam. That would give the machine permission to read the data, right? Then it could just number jumble it into something better. That's my idea. ha!
I was wondering how much the NYT can enforce its copyright on articles that it knew were already copied all over the internet. Seems unfair for them to let the copies proliferate and then pounce. Also, how are they harmed if the copies are all over the internet for free?
I also note that you seem to slip back and forth between “the rules” and “in practice” or what is reasonable. For fair use, because it’s a multi factor analysis, the latter will probably be more informative.
IANAL, but as far as I know there's no legal obligation to sue on every copyright infringement (this is, for example, why the Internet Archive is allowed to exist). There _is_ such an obligation for trademarks, because the point of a trademark is to indicate that the holder, specifically, is responsible for the use of the trademark.
I don't care about the copyright issue itself so much as this: AI is a route to our death, and anything that can slow it down or sabotage it should do so. So I am supporting NYT all the way here.
A machine is not a human and should not be treated as one.
Regarding “seeking consent of rights holders”, this is mostly impossible in practice. I imagine much of the LLM training text consists of posts and comments from random Internet users who can’t be reached at all, and much of the rest of the set won’t bother responding.
Even for those that do, they’d likely never agree to “we can make derivative works from your IP for free forever”, or would demand a living-wage equivalent for the rest of their lives, since they’d effectively be made obsolete.
The total cost for licenses that could be negotiated would likely sink any AI firm, and even then, >80% of rights holders would tell them to pound sand. It’s time for people who say “seek consent” to start saying “no training, no how”, as it isn’t possible to train an LLM otherwise.
And poison all data sets in the meanwhile on top of that with Glaze, Nightshade, etc. Anything we can do to keep us from dying is a good thing.
Don't worry, the EULAs of most major sites have the rights to the comments owned by the site, not the commenter :/
So, just putting this out there: I'm a litigator in private firm practice, primarily working on IP issues, with an interest in AI (both safety and legal issues). If Zvi or others have questions about copyright law and relevant cases/filings--especially questions that fall under the "is" side of the "is/ought" divide (being in private practice unfortunately constrains what I'm allowed to say that has a normative component, particularly in any non-pseudonymous context)--I'm around as an interlocutor who may be able to provide some clarity on the content and scope of current law and/or document analysis.
Please note that nothing I say is or is intended to create an attorney-client relationship or to constitute legal advice to Zvi or anyone else. I urge anyone to hire and consult with an attorney before taking action that has legal implications (IP-related or otherwise).
Hi--I’m very much interested in this. Have any interest in being interviewed for my Substack? Fwiw, I see AI world committing the same error that crypto world has with securities laws. Maybe that’s what you mean by is-ought: technologists think about what the law *ought* to be given what they want to do with technology, while courts concern themselves with what *is*.
I'll have to get back to you on the interview issue for your Substack (it relates to internal conflicts clearance; I'll look into it), but while you definitely have the *ought* side right, with respect to the *is* side it's a bit more about not taking sides in an active controversy in a way that implicates the interests of clients (either present or potential), so sticking to what courts (and litigants) have said and interpreting that accordingly -- sometimes there may also be a reasonably clear extrapolative implication of what courts and parties have said, but at most private practice attorneys (like yours truly) are going to be best advised to say "this approach favors the interests of class X, this approach favors the interests of class Y, and here are some relevant cases that touch on the present dispute and what they say."
Courts are more of a mixed bag: they generally display a strong preference for not deciding prospective issues that aren't before them if they don't feel a need to do so (Zvi's A→B→D example is really good here: if courts don't think that the edge cases leading to D are actually going to come up, and/or that they'll be able to avoid absurd results, they're not necessarily going to shy away from commonsense rulings just because those rulings can't be extended infinitely; you'll sometimes see arguments of this form derided as a "parade of horribles" in judicial opinions). Likewise, if an issue clearly goes one way or the other under controlling precedent, it makes courts' jobs easy (and limits the risk of reversal on appeal), and courts will strive to hew to that line if it makes sense.
However, courts are not precluded from accepting policy / "should" arguments on matters of first impression or where the decision is ambiguous, and can often be receptive to them if it's appropriate, although lawyers will generally prefer to argue that controlling law dominates everything and/or that existing precedent can be extended to new areas. This creates conflicts in technology spaces, especially because as Zvi aptly observes, "scale matters," and technology is great at scaling practically-irrelevant curiosities into must-be-decided-now controversies.
A seminal example of this is Justice Sotomayor's concurrence in judgment in U.S. v. Jones, which in practice is all about how the relative cheapness and ease of surveillance technologies breaks implicit presumptions of limited government capacity that undergirded a large amount of Fourth Amendment search/seizure jurisprudence -- in other words, scale matters.
What is the precedent on transitory copies (e.g., your computer loads the content into memory to display it on the screen), and how is this likely to affect rulings on training LLMs?
I mean, the actual weights in the network seem obviously highly transformative and may not even meaningfully contain a copy of the copyrightable elements (as opposed to the facts) so I'm wondering if the arguments are going to turn on the transitory copies made in the course of training.
Also, how do courts individuate copyrighted works and copies to compute statutory damages? For instance, if the court decides an LLM contains 10 snippets in its memory from a given NYT article, each of which is sufficient to qualify as a copyright violation, is that 10 violations? One? Per website? Per cloud server the LLM is deployed on?
Regarding transitory copies: this is a great question that I'll get back to you on ASAP.
I will note that transformativeness can be difficult to argue about per se without an extensive body of precedent, for the simple reason that transformativeness is a factor in fair use and fair use is an affirmative defense. The best information I'm aware of at present on this topic is the dismissal of the Andersen complaint at pp. 8-14 (https://cdn.arstechnica.net/wp-content/uploads/2023/10/Andersen-v-Stability-AI-Order-on-Motions-to-Dismiss-and-Strike-10-30-2023.pdf), which suggests that you might run into substantial-similarity problems trying to establish infringement based on model weights. But speaking from a *purely personal perspective* I would be hesitant to generalize too much from that case at the present time, both because it's a single trial court case dismissing a claim without prejudice, and because if you have LLMs that encode near-perfect recall of articles this might change the calculus somewhat: I keep coming back to the idea that a ZIP file of a registered work isn't human readable in and of itself, but I would be unlikely to advise a client that reversibly ZIPping something carries no infringement risk. So are you analogizing LLMs to ZIP files, or to a different ontological category here? We'll see how case law develops.
Regarding the latter question of "ten snippets," I will say that *in general* infringement is "per work" (assuming the work to be registered, which is a precondition of both suit and entitlement to statutory damages), which would suggest that ten snippets of a single (registered) article comprise a single violation for each reproduction of the medium in which the LLM is fixed (e.g. your cloud server example), assuming that the weights themselves (i.e., the model) are per se infringing.
*However*, in practice this *may* be largely irrelevant to the issue of damages: in copyright the classes of damages are chiefly governed by 17 U.S.C. § 504 (https://www.law.cornell.edu/uscode/text/17/504), and the classes of award are typically "actual damages and profits" (504(a)-(b)) or "statutory damages" (504(c)). Typically plaintiffs strongly prefer statutory damages for a variety of reasons; these are awarded "...for all infringements involved in the action, with respect to any one work..." (504(c)). So the actual number of specific infringements, in the case that a plaintiff elected statutory damages, would be somewhat moot, because all of them would be encompassed by the award.
The number of works infringed, however, would be a big deal since the award is "...with respect to any one work...."
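To make the "per work" point concrete, here is a minimal sketch of how statutory awards scale with the number of registered works rather than the number of snippets or server copies, using the ranges in 17 U.S.C. § 504(c) ($750–$30,000 per work, up to $150,000 per work for willful infringement). The specific work counts are hypothetical, purely for illustration:

```python
# Illustration: statutory damages under 17 U.S.C. § 504(c) scale
# per registered work infringed, not per copy or per snippet.
STATUTORY_MIN = 750        # § 504(c)(1) floor, per work
STATUTORY_MAX = 30_000     # § 504(c)(1) ceiling, per work
WILLFUL_MAX = 150_000      # § 504(c)(2) ceiling for willful infringement

def statutory_damage_range(works_infringed: int, willful: bool = False):
    """Return the (min, max) statutory award for a given number of
    registered works, regardless of how many snippets or server
    copies each work appears in."""
    ceiling = WILLFUL_MAX if willful else STATUTORY_MAX
    return works_infringed * STATUTORY_MIN, works_infringed * ceiling

# Ten snippets of one article on five servers is still one work:
print(statutory_damage_range(1))                     # (750, 30000)
# But 10,000 registered articles is 10,000 works (hypothetical count):
print(statutory_damage_range(10_000, willful=True))  # (7500000, 1500000000)
```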
> I urge anyone to hire and consult with an attorney before taking action that has legal implications (IP-related or otherwise)
Is there any action that doesn't have legal implications? ;-)
Few in principle, tons in practice. This is a pretty standard boilerplate warning that is intended to further the idea that "you are not entitled to rely on these statements as legal advice, only as commentary."
I think you could end up seeing a quite nuanced argument involving an interplay between all these issues; it is smart of NYT to raise all the claims at once. I think it would be helpful to highlight more strongly the complexity of the legal doctrine of Fair Use, as that's what the defense will rest on; see for example https://fairuse.stanford.edu/overview/fair-use/four-factors/:
> The four factors judges consider are:
> * the purpose and character of your use
> * the nature of the copyrighted work
> * the amount and substantiality of the portion taken, and
> * the effect of the use upon the potential market.
Rather than gesturing at first-principles arguments about what's fair, like "We all learn, he points out, for free. Why should a computer system have to pay?", I think it's more useful to dig into case law and try to figure out how the law is actually interpreted.
A key point I didn't see mentioned explicitly WRT 1) (the training set) is that if we didn't see verbatim copying, it would (IMO) quite clearly be a transformative use. But the fact that you do see some verbatim copying of articles muddies the waters on that argument. So the defense will be forced to retreat to a more nuanced point: it's impossible for the model to compress every article verbatim (from an entropic perspective this is obvious, as the sketch below illustrates, but it could be tough to explain to a judge/jury), so most of the training usages are transformative-not-copying. In some limited cases they did what might be found to be non-transformative memorizing of the works; can the defense successfully sever the training process from these specific cases of duplication? A precedent of training being found not to be transformative use would be extremely damaging, as OpenAI seems to be refusing to license all the NYT content for training, and so you'd expect harsh damages if that is infringement.
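Here's the back-of-the-envelope version of the entropy point. Every number below is an assumption chosen only for illustration (the training-set size, parameter count, and bytes-per-token are in the rough range of publicly reported estimates for GPT-class models, not OpenAI's actual figures):

```python
# Back-of-the-envelope: can a model's weights store its training set verbatim?
# All numbers below are illustrative assumptions, not OpenAI's actual figures.
train_tokens = 10e12     # assumed training set size: 10 trillion tokens
bytes_per_token = 4      # rough average for encoded English text
params = 1e12            # assumed parameter count: 1 trillion
bytes_per_param = 2      # e.g. 16-bit weights

corpus_bytes = train_tokens * bytes_per_token
weight_bytes = params * bytes_per_param

print(f"corpus:  {corpus_bytes / 1e12:.0f} TB")   # corpus:  40 TB
print(f"weights: {weight_bytes / 1e12:.0f} TB")   # weights: 2 TB
print(f"ratio:   {corpus_bytes / weight_bytes:.0f}x")  # ratio: 20x

# Even granting lossless text compression of several-fold, the weights
# are an order of magnitude too small to hold the corpus verbatim; only
# a small fraction of frequently repeated passages can be memorized.
```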
I expect they will fall back on "amount and substantiality of the portion taken" to argue that where the model is storing verbatim copyrighted text, they are merely quoting some popular articles (as one could do in a book review) and not copying substantial portions of the corpus. I suspect this will be the crux. If we don't view the copying at the corpus level, and instead argue that an individual article was copied substantially or in full, do we need to argue one count of infringement for each article where they can demonstrate duplication? I'd be interested in more analysis on how this factor will be applied, and whether the current case takes a corpus-level approach vs. per-article, or something in-between.
Where the model is using retrieval to directly fetch pages, I suspect that's going to be closely equivalent to your browser doing the same thing, and it's not clear whether there are any actual damages from real users using this trick (there are easier ways of getting around the paywall). This one is much easier to fix, and I believe you'd expect to not see punitive damages here since it looks like OpenAI has tried to fix this issue when it's been reported. I think this one will fizzle.
And finally on all that, what parts do we expect to go to the SCOTUS? I'd expect the question of training being transformative use to be taken up, especially if it's found to be infringement. I'm not sure I see them taking up the mundane questions of specific cases of duplication being infringement, if training is found to be fair use, but specific cases of memorization is not.
Anyway, IANAL, would appreciate lawyers fleshing this out.
The legal fiction of intellectual property is thought policing and censorship, nothing more. If you make some idea public, I should be free to think about it and to express my thoughts to others. If you don't want that to happen, keep it to yourself.
The constitutional basis for intellectual property law is "To promote the Progress of Science and useful Arts". It's not about creative control, author's rights, or any other such special pleading. (Corollary: IP law that prevents such progress is de facto unconstitutional.)
None of the above relates to the practical matters at hand, because IP law in this country left behind any pretense of sanity many decades ago. But you seem to think everyone is on board with the general principle of copyright, and you are quite wrong about that.
(I do agree that scale matters; but I'm not very confident that a court will understand that the large-scale entity is OpenAI the corporation rather than "doing things with computers". Historically, they seem to have a particular bias related to the latter.)
Maybe I'm an idiot, but I think the solution is easy and should ("should", not necessarily "legally should") be answered by: "what if instead of an LLM, it was a person with perfect memory?"
This then leads us to:
0. that person (or their guardian) must purchase (1) NYT subscription in order to read it
1. if that person is asked to recite a NYT article word for word by someone else, and they do it - that is technically illegal, and they should stop doing it, and if they have made any profit from doing so, they have to pay that in damages.
2. if that person is asked to talk about the NYT article in general, or to excerpt part of it (with citation) so as to discuss it, that's fine and legal
3. the fact that they have the capacity to replicate the article in their mind is not legally relevant
Why assume training an LLM is shenanigans? Surely we want to look at it at an incentive level and it seems pretty similar to the way we decide to let other media sources use the facts extracted from journalism.
Sure, scale and unique aspects of LLMs could make this different, but it really worries me that many people seem to see the default as paying the creator for any use. It should be the opposite - extending copyright requires a positive argument that the application is likely to incentivize more creation than it inhibits.
Indeed, I tend to think the internet itself has already increased the harm side of copyright a lot relative to the benefits (there are a lot more derivative works people would like to create, and lower content-creation costs and other mechanisms lower the extent of the exclusivity needed to incentivize creation).
Unfortunately, people seem to emotionally see it as kinda like stealing, which creates all kinds of issues.
By typing a relatively short string of characters (call it a "prompt") into a browser's URL bar, one can cause the browser to display the full text of Harry Potter and the Sorcerer’s Stone. That is a true fact about the world. No one seriously thinks that this is the browser-makers' fault. So why is ChatGPT different? Why had it "better not work"?
Similarly, would "If users are actually using [Chrome] to get the text of New York Times articles on purpose for actual use, in practice, that seems clearly not okay" make sense and be a reason to be upset with Google (as the makers of Chrome)?
I'll also note that we live in a world in which all books and all movies are, as a matter of practice, available for free on the internet. Do with that as you will—but it's the empirical truth about the world we live in. You can't pretend that somehow copyright is what causes authors to get paid for their works, not when it's trivial to e.g. download books off of z-library. And yet somehow authors make money anyway. Why? Obviously not because of copyright! Not when the law is so unenforced (and it seems in practice unenforceable). As a matter of legal realism, whatever it is that causes authors to get paid, it's not actually copyright law as law. Perhaps it's copyright "law" as a norm, i.e. that people would feel guilty for pirating books, or even merely feel like it's "just not done" to pirate books. Or maybe people just sensibly feel that it is in their own enlightened self-interest that authors get paid for books (at least, that's what I tell myself whenever I pay for a book).
This was a good and balanced long read, packed with insights.
Yann LeCun's tweet about "all books should be freely available for download" tells me how little respect and understanding he, and I suspect many influential AI developers, have for art, creative work, and copyright law. Like seriously, how should authors make a living if they cannot be paid for the work they do? Who would write books? Only people who are rich from capital investments and can afford to spend a lot of time to write without any prospect of a financial upside.
As I see it, the main question in the NYT case is not whether people actually use GPT-4 to read NYT articles; as you say, practically no one does. Just the mere fact that GPT-4 is capable of quoting long passages of text from NYT articles is concerning. And the worst part for OpenAI is that these articles, as you also comment, are not publicly available but protected by paywalls. Even if judges accept that AI training on copyrighted material is fine, I think it's a hard case to make that it's OK to bypass paywalls, access proprietary data without authorization, and use this data to train an AI model.
I note that there are quite a lot of authors who put their work out for free online and rely on Patreon - or Substack! - for revenue; conversely, it's quite common for authors with conventional publishing to be rather poor, since the author normally gets barely more than 10% of the revenue.