My understanding of copyright law (I hate the cliché, but I am not a lawyer) is that works a computer makes aren't covered by copyright. Whether courts will decide that an AI agent operating from user prompts counts as something the computer generated "itself" versus something the human made by using the computer as a tool, who really knows. But the idea of sending an AI agent to remake some proprietary code and have the result land in the public domain is interesting. Sadly, it would cut both ways: corporations could make public domain versions of copyleft code.
No, absolutely not. It is safe to assume that most or all open source code (and plenty else) has been part of the training data. You need look no further than the fact that some models can recite Harry Potter from memory. There is no such thing as "clean room" for AI.
Ironically, though, this makes the reverse a bit more defensible (i.e. using an LLM to reverse engineer a proprietary app), because that proprietary app's source code is less likely to be in the publicly available training data.
But I imagine the corpos aren’t going to look fondly on that for obvious reasons.
This really isn't true, though, even if it currently holds in many cases. Case in point: if I wrote something and published it right now, it wouldn't be part of any AI model yet. A party with a lot of money (like, say, a tech corporation) could easily create a bespoke coding model trained on everything except the desired libraries, thus achieving a "clean room".
In theory: Yes, future works are not yet part of the training data.
In practice: It takes months or years for an open source project (or any new technology) to take off and be considered valuable.
The other argument relies on said tech organization doing the right thing and spending the resources to train its own model (years and $100+ million) instead of simply including the cost of a lawsuit and any eventual fine in its cost/benefit analysis. I'm not aware that any such tech organization (with the means) exists.
Again, while this may be largely true today, it doesn't account for how the technology will evolve. Models are only going to get cheaper to produce. Even if training one is prohibitively expensive right now (and I'm not convinced that's universally the case), it won't be in the future, as hardware advances are essentially guaranteed to make model training dramatically cheaper in the coming years. Burying our heads in the sand now isn't going to help anything.
Well, last I heard you can't copyright the output of an LLM, so the entire concept of a licence for open slopware is moot.
Unfortunately the “with significant human input” case hasn’t been tried yet. As with most of these things the team that spends the most on lawyers wins the vast majority of the time, so corpos will get the case law.
I’m hoping that the “with significant human input” case turns out to be a massive own goal and basically breaks software copyright a few years down the line when anyone can re-implement any software.
Of course that’s when lobbying buys a law to override the case law. Sigh.
Yeah, but going from community-centric GPL to no copyright is sort of the goal of the recent slop rewrites.
If there is no copyright on the slop output code based on GPL code that’s a win for the corps.
So you are agreeing using the LLM worked? Because that’s what the author wanted: generate a freely usable version that is no longer bound by copyright or the original license.
Whether you own the copyright to your derivative work is not the same question as whether you are infringing someone else’s copyright.
Yes, but what does that have to do with LLM output being not copyrightable?
Because the title of the post is
Can coding agents relicense open source …
My response was no, because the output will always be in the public domain, which is the opposite of licensed.
However your reply asked a different question:
So you are agreeing using the LLM worked?
This is a different question, because it’s asking not about the general case of “can a coding agent produce a clean-room reimplementation” but rather “did the chardet rewrite achieve the goals of the maintainer?”
It's clear from the information uncovered about the chardet rewrite that it cannot be considered a clean-room reimplementation, therefore there is an argument to be made for copyright infringement, regardless of whether anyone can own the copyright for it.
But the title of the article is asking whether the general case is possible. In that case, an agent reimplementing a project that does not appear in its own training data, from prompts that contain no copyrighted source code, could in theory produce a clean-room reimplementation from functional descriptions alone, one that would not violate the copyright of the original project's author.
However in that case, the rewrite would still not be licensable since nobody would own the copyright to it.
I hope that clears up the point I was making and why it’s relevant to the post.
That all makes sense to me; all I meant is that you are answering the relicense question literally, which I don't think actually matters. The situation we are pondering is that someone wants to free a project from its original license.
They are claiming they did a magic trick with an LLM and now the project is MIT licensed. And you are saying that it's not, it's public domain. But the distinction is immaterial to the person's goal. Whether the author is right or you are right, the project is no longer under its original license, and whether that is something that can happen is the actual question here, regardless of whether the resulting output can be licensed.
They are claiming they did a magic trick with an LLM and now the project is MIT licensed. And you are saying that it’s not, it’s public domain.
That’s absolutely not what I’m saying. I’m saying that the rewrite of chardet infringes on the copyright of the original work. That is neither MIT licensed nor public domain. It’s illegally reproduced and distributed copyrighted work.
Then what did you mean when you said:
the output will always be in the public domain
It seems to me like a pretty clear statement.
I’m saying that the rewrite of chardet infringes on the copyright of the original work. That is neither MIT licensed nor public domain. It’s illegally reproduced and distributed copyrighted work.
That I never disputed. I'm not interested in chardet or whatever happened here; I'm interested in your claim that LLM output is always public domain, and if so, whether it could be used to reimplement a library so that it serves the same purpose but isn't bound by the original license, provided you do it without infringing on the copyright of the original work.
I saw the project that did this. It was satirical, and I think the point was to show how absurd it would be to maintain everything yourself, even with AI.
I think the fact that the maintainer is intimately knowledgeable about the original codebase is enough for it to not be a clean-room reimplementation, no? Never having seen the original is what makes it "clean".