Jul 11 2025
12 min read
1. Copyrights and AI training data
- While the intellectual property (IP) issues associated with AI training are far from settled, summary judgments in recent court cases – viewed as qualified wins for Anthropic and Meta – are beginning to clear the fog. The judges generally agreed that AI training transforms copyrighted content into something new – one of the 4 factors in determining “fair use” (see below). The judge in Anthropic’s case, however, drew a clear distinction between the use of pirated vs. legally obtained training data, saying that Anthropic had “no entitlement” to use pirated copies. The judge in Meta’s case, in turn, pointed towards market harm as a potentially productive avenue that future plaintiffs could go down.
- In the US, up until recently, generative AI has largely been predicated on model training being considered fair use under copyright law. Fair use weighs 4 factors: (1) Purpose and character of the use (e.g. how transformational is the use, commercial vs. nonprofit); (2) nature of the copyrighted work (e.g. creative vs. factual, unpublished vs. published); (3) amount and substantiality of the portion used (e.g. all vs. a small piece, the “heart” of the work vs. a peripheral part); and (4) impact on the market or value of the copyrighted work. Note that copyright law only protects finished works – not a creator’s style, technique or ideas.
- Critically, the 4 factors are not all equally weighted. Specifically, the more transformative a particular use is, the less significant the other factors are in determining fair use. In general, the purpose and character of the use (#1) is one of the most important factors, whereas the nature of the work (#2) is less important. The amount of the work used (#3) is context-dependent and can be neutralized if the use is highly transformative. The impact on the market (#4) is considered highly important alongside #1, and could become the crux of future cases. However, the courts generally hold that the 4 factors should be weighed together holistically.
- Anthropic’s case was heard in San Francisco in the US District Court for the Northern District of California – a venue known for cases involving complex technology/IP issues – by the influential Judge William Alsup. The case was brought by several authors of books that were part of a dataset assembled by Anthropic for AI training. Notably, the dataset Anthropic initially assembled included two widely used digitized libraries of books – Books3 and LibGen – as well as books from the Pirate Library Mirror (PiLiMi).
- In mid-2024, in an effort to address the legal issues associated with training on pirated content, Anthropic began to acquire millions of used and new print books – at a cost of millions of dollars – which were scanned by vendors to create its own research library. In the process, it “destroyed” each print copy by stripping the binding and eventually discarding it, thereby “replacing” the print copy with an owned digital copy intended for internal use. In addition to finding that AI training is transformative, Alsup also determined that this one-for-one replacement of print for digital – “without adding new copies, creating new works, or redistributing existing copies” – was fair use.
- On the other hand, Alsup asserted that “piracy of otherwise available copies is inherently, irredeemably infringing even if the pirated copies are immediately used for the transformative use and immediately discarded.” In Anthropic’s case, the use of pirated copies to build a central library of works was specifically determined to be not fair use. Furthermore, “no damages from pirating copies could be undone by later paying for copies of the same works,” although it could affect the extent of the statutory damages. The damages could be significant for Anthropic given that 7M+ copies may have been pirated. Statutory damages range from $750 to $30K per work (median is $3K), and up to $150K (median is $10K) for willful infringement, suggesting a possible liability in the billions. That part of Anthropic’s case will proceed to trial.
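As a rough back-of-envelope sketch of why “a possible liability in the billions” follows from those figures (assuming the ~7M pirated copies and the per-work statutory ranges cited above; the scenario labels are ours, not the court’s):

```python
# Illustrative statutory-damages arithmetic only -- not a legal estimate.
WORKS = 7_000_000  # approximate number of potentially pirated copies

# Per-work damages scenarios drawn from the statutory ranges cited above.
scenarios = {
    "statutory minimum ($750/work)": 750,
    "statutory median ($3K/work)": 3_000,
    "willful median ($10K/work)": 10_000,
    "willful maximum ($150K/work)": 150_000,
}

for label, per_work in scenarios.items():
    total = WORKS * per_work
    print(f"{label}: ${total / 1e9:,.2f}B")
```

Even the statutory minimum lands above $5B, which is why this piece of the case carries so much weight despite the fair-use wins elsewhere.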
- Importantly, Alsup addresses the 4th factor directly: “The copies used to train specific LLMs did not and will not displace demand for copies of [the plaintiffs’] works, or not in the way that counts under the Copyright Act.” He went on, “[the plaintiffs’] complaint is no different than it would be if they complained that training schoolchildren to write well would result in an explosion of competing works. This is not the kind of competitive or creative displacement that concerns the Copyright Act. The Act seeks to advance original works of authorship, not to protect authors against competition.”
- Just two days after Alsup’s summary judgment, in a similar case against Meta, Judge Vince Chhabria (also of the Northern District of California, although less senior than Alsup) agreed that AI model training was transformative. However, he framed a narrower determination that puts the most weight on market harm (#4). According to Chhabria, “[T]he fair use inquiry is highly fact dependent, and there are few bright-line rules. There is certainly no rule that when your use of a protected work is ‘transformative,’ this automatically inoculates you from a claim of copyright infringement. And here, copying the protected works, however transformative, involves the creation of a product with the ability to severely harm the market for the works being copied, and thus severely undermine the incentive for human beings to create.”
- Chhabria references Alsup’s judgment directly, explicitly disagreeing with Alsup’s dismissal of the market-harm concerns. “[W]hen it comes to market effects, using books to teach children to write is not remotely like using books to create a product that a single individual could employ to generate countless competing works with a [minuscule] fraction of the time and creativity it would otherwise take.” He introduces the relatively novel concept of “market dilution or indirect substitution” as kinds of harm (#4) that could bar fair use. Even Chhabria agrees, however, that “plaintiffs are not entitled to the market for licensing their works as AI training data.”
- These cases are being viewed as qualified victories because they leave several issues unaddressed – and in some cases, even point out more productive avenues for plaintiffs to bring cases. Chhabria reluctantly rules in favor of Meta but points out that the case is not a class action, which means the door is open to other plaintiffs who can bring a stronger case. He clarifies that “[t]his ruling does not stand for the proposition that Meta’s use of copyrighted materials to train its language models is lawful,” especially when there is market harm.
- These cases were also largely focused on how the inputs to AI models were collected and handled. What was not fully addressed were the outputs. Plaintiffs in the Anthropic case, for instance, did not claim that its AI models were producing exact copies or infringing knockoffs. In part, this is because the largest AI players have gotten wise and now take steps to avoid extensive word-for-word regurgitation. (This is not consistent across models though.) The New York Times’ case against OpenAI, which does take aim at “near-verbatim excerpts” of articles, is still ongoing.
- The two cases furthermore do not address the use of copyrighted figures in generated work. Last month, Disney and NBCUniversal filed a case against Midjourney based on its display of copyrighted characters in its video-generation tools, as well as its use of the characters in AI model training.
- More narrowly, the cases do not cover “any copies made from central library copies but not used for training.” Alsup specifically excludes these from his summary judgment. This suggests that AI players are going to need better hygiene on how they handle datasets that include copyrighted material – e.g. controlling access, limiting copies, and tracking data lineage.
- There’s clearly a lot of legal landscape still to be settled, but we know more now than we did before. These two summary judgments represent perhaps the first clear answers by authoritative figures on the major question of whether AI training is transformative. Alsup goes even further to assert: “The copies used to train specific LLMs were justified as a fair use. Every factor but the nature of the copyrighted work [the least important factor] favors this result. The technology at issue was among the most transformative many of us will see in our lifetimes.”
- In general, these cases point towards IP law continuing to hold even amid AI’s broad-sweeping disruption. For AI players – particularly those using publicly available datasets (i.e. most players) – this means evaluating their exposure to claims of piracy, regurgitation, and market harm.
- The measures to address this exposure could include acquiring and digitizing print books (like Anthropic has done), bolstering guardrails to avoid reproducing longer excerpts of 50+ words, instituting data-access controls (incl. limiting copying and tracking data lineage), and diversifying their sources (so only a de minimis amount is used from any one source). Harvard, for instance, recently released an Institutional Books 1.0 dataset, which includes 983K public-domain books that were digitized as part of its library’s participation in the Google Books project.
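One simple form such a regurgitation guardrail could take – a sketch under our own assumptions, not any vendor’s actual implementation – is a sliding-window check that flags model output sharing a verbatim run of 50+ words with a protected source text. The function name and the 50-word threshold here are illustrative:

```python
def has_long_verbatim_overlap(output: str, source: str, window: int = 50) -> bool:
    """Return True if `output` shares any `window`-word verbatim run with `source`.

    Illustrative sketch only: production systems would normalize text and
    match against hashed n-gram indexes over large corpora, not a substring scan.
    """
    out_words = output.split()
    src_text = " ".join(source.split())  # normalize whitespace in the source
    # Slide a window of `window` words across the output and look for it verbatim.
    for i in range(len(out_words) - window + 1):
        run = " ".join(out_words[i : i + window])
        if run in src_text:
            return True
    return False

# Example (small window used purely for demonstration):
print(has_long_verbatim_overlap(
    "the quick brown fox jumps over",
    "a quick brown fox jumps over the lazy dog",
    window=5,
))
```

In practice a system like this would run at generation time, either blocking the flagged output or rewriting the offending span before it reaches the user.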
- The ongoing lawsuits will continue to add costs for existing players, and potentially turn some egregious infringers into “money piñata[s].” In the long run, however, they are likely to raise the barriers for new entrants, entrench the incumbents, and advantage foreign players less beholden to US laws. In the US, only the biggest players will be able to carry the costs associated with training AI models using legitimately collected data.
- The incentives for IP owners and AI players to negotiate licensing deals are growing on both sides. The former are facing a harder road to win their cases, while the latter will see higher costs and significant business-model implications if they do lose.
Related Content:
- Apr 28 2023 (3 Shifts): Community platforms want to get paid for their AI-training datasets
- Jun 3 2020 (Brief #34): Can an AI be an inventor or author? The current state of IP protection
Disclosure: Contributors have financial interests in Meta, Alphabet, Oracle, Uber, OpenAI, and Rocket Lab. Amazon, Google, and OpenAI are vendors of 6Pages.
Have a comment about this brief or a topic you'd like to see us cover? Send us a note at tips@6pages.com.