In this post the authors explore how today's contractual restrictions on AI mirror the concerns libraries raised 20 years ago during the US Copyright Office's Digital Millennium Copyright Act (DMCA) Section 104 study. They then examine the differences between copyright law, which enables access through fair use and other rights, and contracts, which can carry both legal weight and intimidation tactics, such as copyright warnings.
Private conversations are something entirely different from publicly available data, and not really what we're discussing here. Compensation for essentially making observations will inevitably lead to abuse of the system and deliver AI into the hands of the stupidly rich, something the world doesn't need.
But that's kind of the question here… Is data processing fair use in every case? If yes, we've just brought private conversations and everything else in as well. If not: what are the requirements? Then we need to talk about which use cases we deem legitimate and how each gets handled… I think that's exactly what we're discussing here. IMHO that's the point of the debate… It's either everything… or nothing… or we need to discuss the details.
I'm not sure about that. I mean, I gave some examples with licensing music at events and with libraries (in general). Does that also get abused by the rich? I don't think so. At least not that much, which makes me think it might be a feasible approach. Of course it gets more complicated than that. Licensing music, for example, brings in collecting societies, and those agencies have proven problematic in various ways; the licensing industry isn't exactly fair either, and it also mainly shoves money into the hands of the rich… So a proper solution would be a bit more complicated than that.
I mean, I'd like to agree with you here and have a straightforward parallel for how to deal with AI training datasets. But I don't think it's as easy as that. We can't just say processing data is fair use, because there are a lot of details involved, as I said with privacy. We can't process private data and just do whatever with it. We can't do everything with copyrighted material, even if it's publicly available. Whether a use is legitimate already depends on the details. And I think the same applies to AI. It needs a more nuanced perspective than just allowing or prohibiting everything.
I'm not discussing the use of private data, nor was I ever. You're presenting a false dichotomy and trying to drag me into a completely unrelated discussion.
As for your other point: the difference between this and licensing for music samples is that the threshold for abuse is much, much lower. We're not talking about hindering just expressive entertainment works. Research, reviews, reverse engineering, and even indexing information would be up in the air. This article by Tori Noble, a staff attorney at the Electronic Frontier Foundation, should explain it better than I can.
You're right, the private data is a bit of a contrived example. I just wanted to make the argument that this isn't just about copyright law. Something could be fair use under copyright law but still illegal to use for other reasons. Which is problematic when doing unsupervised web scraping, for example. It's definitely an issue, but out of scope if we limit the discussion to copyright only.
I've just skimmed the last article; I'm going to read it tomorrow. But I don't think I'd like to argue for extending copyright. I think that would be bad. But I think it's debatable whether AI training falls into that category. I'm not sure how it is in different jurisdictions… Maybe it's clear in the US? I always struggle to read American legislation. I can just say it's not clear where I live. And that comes with consequences: companies do AI in countries like the USA or China rather than in the EU, which is an issue for our economy and scientific progress. And wherever the law isn't clear enough, that's a disadvantage for smaller companies, institutions, and individuals, since it's the big companies that can easily afford lawyers. And it has consequences for me personally. For example, Meta's use policy for the newer Llama models excludes Europeans; I'm not allowed to use them. That might not be about copyright either, but it's definitely due to unclear regulations.
So I don't advocate for extending copyright. My stance is that we don't have clear regulation in the first place. I'd leave all exemptions and specifics in place; we can leave libraries, music, research, and reverse engineering as is. But the current warfare is super unhealthy. We have some companies scraping everything, while other people come up with tarpits and countermeasures, like Cloudflare with their AI Labyrinth last week… One newer business model is introducing walled gardens so companies can make sure they're the only ones selling their user data… I think that's all very unhealthy. And it favors large companies doing that "research". Meanwhile the internet gets flooded with slop, half the internet's services are barely usable, and we might end up with dystopian Skynet corporations dominating information flow anyway. And I think that's a bigger issue than copyright.

If AI proves to be disruptive, it needs to be used somewhat ethically, and I think the only way to get there is regulation. We need to even the field so research and non-profits get a chance. We currently have "smaller" startups participating, and several companies release open-weight models. But we can't rely on their altruism. My prediction is they'll all stop once this starts to interfere with their business or those models get really useful. And then it's going to be OpenAI, Anthropic & Co. who get to decide what kind of information the world has access to. Which would be very bad.

They also offer little transparency. More and more people rely on these services, and AI is very much a black box. The large companies stopped disclosing what goes in a few years ago, when the copyright lawsuits started. The first Llama model still came with a scientific paper detailing all the datasets that went in, but as far as I understand, that stopped soon after. The rest are trade secrets. So if someone uses ChatGPT (which lots of people do), they're completely at the mercy of OpenAI. OpenAI gets to decide in which ways the model is biased (or not), what it can and cannot answer, and what gets fed to the users. I think that's the main issue with it. (Along with slop.) Copyright of training data is some sort of sideshow.

But I still think we have a lot of unaddressed issues with AI, and leaving them open is just going to help the big players. I think we need clearer regulation, so that a small company that can't afford a lot of lawyers can also be 100% sure whether something is fair use or not. And personally, I think we need to hold them all accountable and force them to be more transparent with everything, like a rough estimation of the datasets. I'd also force generative AI services to implement watermarking, to at least try to tackle slop and people doing their homework with ChatGPT (there's a toy sketch of the idea at the end of this comment). Sure, this can all be circumvented, but we can at least try to do something about it. And I'd also like it if big companies bought at least one copy of each book they use to train their AI; Meta or OpenAI can afford to pay a few millions. Otherwise they just leech off people's content. I think it's unfair that some people take quite some time to write books, Reddit comments, and Wikipedia articles, and then someone else gets to make a big profit from that. It's not very straightforward to solve, but I don't think it's healthy for humanity to just hand everything over to greedy companies. And I also don't think it's healthy to embark on the kind of warfare that seems to be happening right now.
That way we’re likely to all lose access to free information.
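
(The watermarking sketch I promised: this is a minimal, word-level toy version of the "green list" scheme from the Kirchenbauer et al. paper on watermarking LLM output. Real implementations bias the model's logits at generation time rather than working on whole words, and all names and parameters below are made up for illustration, not anyone's actual scheme.)

```python
import hashlib
import random

def green_list(prev_token: str, vocab: list[str], fraction: float = 0.5) -> set[str]:
    # Pseudorandomly split the vocabulary, seeded by the previous token.
    # The generator softly prefers "green" words; a detector using the same
    # seeding rule can recompute the split without any access to the model.
    seed = int(hashlib.sha256(prev_token.encode("utf-8")).hexdigest(), 16)
    rng = random.Random(seed)
    shuffled = sorted(vocab)  # deterministic base order before shuffling
    rng.shuffle(shuffled)
    return set(shuffled[: int(len(shuffled) * fraction)])

def green_fraction(tokens: list[str], vocab: list[str]) -> float:
    # Fraction of tokens that fall in their predecessor's green list.
    # Unwatermarked text hovers near `fraction` (0.5 here); text generated
    # with the green-list bias scores noticeably higher.
    pairs = list(zip(tokens, tokens[1:]))
    hits = sum(tok in green_list(prev, vocab) for prev, tok in pairs)
    return hits / max(len(pairs), 1)
```

That's obviously trivial to weaken by paraphrasing, which is why I said it can be circumvented. But it shows detection can be a cheap statistical test, instead of the unreliable "AI detector" classifiers we have now.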