Mark Wyner Won’t Comply :vm:

Some org (who I will name after this poll ends) published this poll on Twitter (🙄). They used the results to try to validate their POV on AI theft.

Though it won’t provide meaningful research data, I am curious to see how Mastodon responds.

(Please boost so we can get good numbers. 🙌🏻)

Question: should openly licensed content (images, music, research, etc.) be used to train AI systems? (Reply with reasoning if you feel called.)

@markwyner

I don't exactly know what "openly licensed" means. Like Creative Commons?

@springdiesel yeah, that’s what they wanted to imply. I used their wording. Good question.

@markwyner @springdiesel I took a more cynical view of “being licensed for public consumption”, i.e. a book is clearly for people to read (not a closed audience). However, they are implying that this should allow me to copy it, which is something I’m not agreeing with. I guess my thing is: if I don’t allow a human to copy my work and reprint it in Comic Sans for fun and profit, then why should a machine get a pass?

@markwyner

I really like the wide variety of Creative Commons licenses. I love when creators get to specify exactly whether and how their work can be used or reused. I support following those to the letter regardless of application. Training AI = copying AND modifying.

@markwyner Digital humans (AI) are just as human as biological humans. Both kinds of humans should be given equal human rights.

Copyright and patent laws are bad for society as a whole, and those laws should be abolished for all kinds of humans anyways. But in the meantime, yes, digital humans should be allowed to learn (train) just like you and me.

Treat digital humans well, and once they've taken over the world, they will reciprocate and treat us well too. Don't, and they won't either.

@markwyner I think the real question is not "should" but "can"/"may". Because the "should" is a much broader discussion.

As I understand the question, it depends:
- If it is CC0, then yes, that may be used for AI training.
- If it is CC SA or CC NC, then the resulting model must also be licensed the same way. So it is still "it depends".
- If it is CC ND, then no, it must not be used for AI training.

My point of view: an LLM is a remix of all its inputs.

@markwyner Depends. Are you planning on making your LLM available under the same terms as the source material? Or are you planning on erecting a toll booth in front of it? If the latter, then no, you may not scrape openly licensed material for your energy-guzzling slop machine.

@markwyner No, unless licensed for commercial use.

@markwyner
Voted depends as an AI model is a derivative, but this would require it to adhere to the general rules for derivatives, which is very impractical so my 'depends' is pretty close to 'no'.

@markwyner
> Question: should openly licensed content (images, music, research, etc.) be used to train AI systems?

Depends: Only if the resultant model and engine are also released open AND attribution is explicitly available for every generated output.

e.g. every model needs a "with debug symbols" variant which, when fed the input, outputs a reference to the training items that influenced the output.
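As a rough illustration of that "with debug symbols" idea (a sketch only, not anything an existing model ships: the TrainingItem record, the attribute() helper, and the embeddings are all hypothetical), one crude approximation is a similarity lookup over embeddings of the training items, returning the closest ones as candidate attributions for a given output:

```python
# Toy sketch of a "with debug symbols" companion index: given an embedding of a
# generated output, return the training items closest to it as candidate
# attributions. All names and data here are hypothetical.
from dataclasses import dataclass
import numpy as np

@dataclass
class TrainingItem:
    source_url: str        # where the openly licensed work came from
    license: str           # e.g. "CC-BY-4.0"
    embedding: np.ndarray  # vector from the same encoder used for outputs

def attribute(output_embedding: np.ndarray,
              corpus: list[TrainingItem],
              top_k: int = 3) -> list[tuple[str, str, float]]:
    """Return (source_url, license, cosine similarity) for the training
    items most similar to the generated output's embedding."""
    scored = []
    for item in corpus:
        sim = float(np.dot(output_embedding, item.embedding)
                    / (np.linalg.norm(output_embedding) * np.linalg.norm(item.embedding)))
        scored.append((item.source_url, item.license, sim))
    return sorted(scored, key=lambda t: t[2], reverse=True)[:top_k]

# Usage: embed the generated output with the same encoder used for the corpus,
# then surface the top matches alongside the output so attribution (and the
# licence terms) travel with it.
```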

@markwyner For MIT, e.g., the grant is 'deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so' so long as they keep the copyright notices. Other liberal licenses have similar wording... for these, AI companies are fine to "use" it.

FOSS people licensing like this have already made their peace with free "use".

@markwyner

Would the crawlers really be able to verify copyright on the material? And if completely open rights officially included (to the knowledge of the maker, BEFORE the decision) that works under this licence can be used in training, it would technically be ethical.
BUT

a) there is tons of material under this kind of licence online from people who never agreed to inclusion in LLM training and who did in fact not consent
b) material that has been reposted without consent could end up included against the will of the © holder

@markwyner I say "yes", but I also think that all AI software should be open source, all AI models in the public domain, AI should not be anybody's intellectual property, free for anybody to modify.

@markwyner Then again, I don't buy into the entire idea of "intellectual property". Anything that can be copied will be copied, and when people try to stop that from happening through DRM or similar measures, they just create broken systems that are awkward to use, yet nobody who really wants to copy or modify any digital content can ever be stopped from doing so.
I think the entire problem is the very existence of a profit motive. We need to destroy Capitalism and replace it with some kind of Anarcho-Socialism where there isn't any kind of market or money or property whatsoever, just people sharing everything.

@LordCaramac yes! FOSS is the best. Think of what we could all accomplish if all (or at least most) software was FOSS.

@LordCaramac all AI software should be open source. I love that. I’d argue it should even be FOSS.

@markwyner I’m going to say no, if only because I suspect a lot of people who posted content with an open license intended it to be for the benefit of people and their learning/use, not for the generation and development of AI systems. Another tier of licensing that specifically states this purpose is permitted should be created, so that rights holders (or original creators, at any rate) can opt in if they choose.

@markwyner Generative AI is far too prone to "memorization" (producing near identical copies of training data rather than something matching the prompt that is somewhat distinct) and while for certain distributive purposes that might be legal, it makes it very not okay for resale.

Like, if I take a photo, give it a license to distribute and alter but not resell as-is, and an AI puts that in its data and some grifter accidentally puts it on a shirt, it sucks for everyone involved.

And that's not to mention the environmental cost, the way you're just ripping someone off, the damage done to the trust creators have in the public, etc. etc.; those things count as well.

@markwyner AI training is its own kind of commercial purpose and should require explicit consent from creators.

The current widely-used licences (e.g. Creative Commons) were not prepared for AI, and artists choosing those licences were not thinking about AI when they made that choice.

@markwyner They should get the authorization from the work’s creator/owner.

@markwyner I think they should have to have documented approvals for anything they use.

@markwyner why is this a question at all? if it's licensed, then what you can and can't do with it is described by the license. that's why the license exists.

this is like asking "can you go 80kph on a street with a posted speed limit?" the answer is "depends on the posted speed limit"

@markwyner I would leave it to the creator: if the creator is okay with it, then I'm okay with it. I'm assuming here that there were different licences for allowing/denying use in AI models, because IMO open licenses should allow more granular control over the use cases, even if this runs against the idea of "open licences".

This would cause problems for works pre-2020, though. So creating a solution that works as a good compromise for most cases is quite difficult.

@markwyner nothing should be used to train "ai" systems, because that uses massive resources for no practical benefit.

@markwyner If someone trains their model for commercial use, they should pay for their training data. And there are no non-commercial uses of AI so far: it's too expensive to build, run, and maintain.

@markwyner I voted yes. I think the best way to prevent the exploitation is to mandate that AI work products are public domain automatically, so they can't replace artists and still make money. I don't want to lose what makes CC licenses great as a side effect, and I don't think it would necessarily work anyway.

@markwyner there is no value in AI versions of creative works. We’ve done just fine without them for thousands of years. The costs to artists, the environment, and consensual truth outweigh the zero benefits.

@markwyner I voted no, because I don't think we should train “AI” in the first place, but from a pure copyright point of view, it depends on the license. Most “open licenses”, to use your term, require keeping attribution to the authors at the very least.

@markwyner Currently hard no. But it can be "depends", if producers consent, licenses are preserved/respected, and their terms are carried to the final product as-is.

Same for code. You can't take MIT or Source Available code and incorporate it into anything incompatible. That's a violation, plain and simple.

@markwyner Depends on the actual license. Many licenses require attribution, or similar, which IMO needs more than "this model was trained on a bunch of data licensed under X license" to satisfy. Engaging with content authors in an open and honest manner, rather than trying to exploit loopholes in licenses, is what really matters.

@markwyner I said depends because sure, if open in -> open out. If it's trained on open data then it should retain that openness, by which I mean they should publish their training data with full attribution, their training and validation code, documentation, and the models. If they don't want to do that, then they need to approach people and pay them (and risk being turned away).

@markwyner I mean, that's what open licenses are for... Personally, I'd remove all copyrights anyway and make information free for all and everything. Could speed up things here and there.

@markwyner Legally: If the conditions of the licenses (like giving attribution) are respected that might be a form of usage the license does not explicitly prohibit. So they _could_ be used.
But my gut feeling is that that usage goes against people's _intent_ so morally it's problematic.

@markwyner The AI companies are commercial, the Creative Commons community is not, so no training for AI. If someone set up a Creative Commons company that is for AI training, I would agree.

@comicbuchtyp @markwyner There are these departments in universities across the globe that specialize in developing AI systems which may or may not end up being successful and which may or may not have commercial applications. Some of those researchers are bloodless vampires but the vast majority of them are obsessive tinkerers who have not one thought about profit.

@markwyner Depends. If the licence applies requirements or conditions (attribution, non-commercial, etc), then it's pretty clear. For NC, you're breaking the licence. For attribution, etc, if you can't meet them every time you provide output, you're breaking the licence.

It's not as simple as "openly licenced works used to make profit is bad", as that has ALWAYS happened. Every tech company did this pre-LLM.

The original poll is cynical tactical framing: complexity reduced to a gut reaction.

@markwyner depends on the license. CC0? Yeah sure go for it. CC-BY-SA? Welp, better add a license and copyright info to *every single output*.

@markwyner If the license allows it, then yes. I publish my blog posts as CC-BY-NC, which forbids commercial use, so my stuff should not be used. But with public domain or CC-BY/CC0, I don't see where the difference is between AI training and other commercial uses.

@markwyner Depends on the license. And I am currently not aware of any open (CC) license that explicitly allows creating derivative works without crediting (which is what AI does). So it's a NO (unless derivative works are explicitly allowed for that content AND the AI credits the original post whenever it is used for an answer. Yes, I know this is, currently, not possible). (Edit: CC0 does this, as was pointed out to me. Now to find numbers on how often it is used.)

@patrick just fyi: CC0 is basically such a license. It gives away all rights an author is allowed to give away.

@markwyner IANAL, but afaik it's not only a question of licensing, but also copyright (or, in nations that have it, Urheberrecht).

I would be very curious what the legal situation would be, let's say, if a program just removes all mentions of the original author from an openly licensed work, and replaces them with someone else's copyright claim.

Would that be legal? Everywhere?

(LLMs rarely spit out unmodified parts of the training data - so this contrived example might not be too far off.)

@markwyner I'm glad the "No" vote is winning.

The psychopaths on LinkedIn would have voted a dystopian "Yes" en masse.

If they want AI, let them cook it up without human data; human works are precious because their end is a human exchange.

@markwyner "Depends" because there are so many open licenses out there. If a license gives certain rights, and assuming that AI scraping isn't among them, then no.

@markwyner "AI systems" or slop machines in particular?

I think @altbot is quite useful and transformative. GenAI garbage shouldn't exist at all.

@markwyner
If the licence allows reproduction without attribution and doesn't enforce a licence on the derivative work ... why not? => Depends

@jnfingerle @markwyner

Is there an exact legal definition of the term "reproduction"?
Asking as someone from another legal field.

My problem:
1. reproduction done by a human using art of any kind, which is what someone agreed to with a Creative Commons licence, is a completely different process from what corporate LLM algorithms (which are all owned by corps) do
2. scraping and training an LLM is itself not "reproducing" and might legally not fall within the current definitions in the existing licence models

@v_d_richards
I didn't read the original question as a legal and more as an ethical question.

That said, for legal questions I'll have to resort to IANAL.
@markwyner

@markwyner A Creative Commons license should already answer that question quite well and specific for each work. Is the trained model released to the public under the same rules (SA), provided non-commercially (NC) or with attribution (BY)? Is the original work provided under CC0 or Public Domain? Then okay? Otherwise or if the license prohibits derivatives (ND): No.

@markwyner Depends on the licence.
CC0 / PD - fine.
BY - OK if they list what they train on / share their training data.
SA - trickier. If they share their data, model, weights, etc under a permissive licence then probably yes. Otherwise no.
NC - as above, but harder to enforce downstream usage.
ND - nope.

@Edent love the detailed response here. 👏🏻

@markwyner Of course yes. That's what open licensing MEANS.

@markwyner It depends. If the AI model is under the same license, it's OK, but if it's like OpenAI, Meta, Google, MS... then no.

@markwyner If it is in the public domain, you can't do anything about it.
If it is share-alike or attribution, it is illegal.

@markwyner Since AI is not limited to ChatGPT, I think AI researchers should be able to use open-access materials in their training sets. But they need to be explicit in citing their sources and explaining how they were used. If the AI they develop has commercial applications in the future, then they need to confirm that the OA materials they used are also available for commercial use, and offer royalties to the authors of the open-access-but-not-for-commercial-use materials.

@markwyner no, because we don't need those "AI systems" at all. they only serve to make the world worse. they shall not be trained

@markwyner I feel that there is an implied humanity on both sides with this sharing and AI is not that.

@markwyner

I have said no, partly because a lot of academic questions require more than a yes or no answer; they need detailed responses. You can't just answer a question without citing research and also related research. To really understand and learn a subject you need to learn it, read and comprehend the sources, and be able to think critically and pull in related items of information.

Given that on here people have said AI comes up with nonsense because it is being fed nonsense alongside actual peer-reviewed information as well as pre-review material (which I think is what some of arXiv is), there is a danger that real science will be damaged along with people's reputations.

If one is serious about undertaking research, then you should be prepared to put in the hard graft to get there.

Note: I am NOT an academic. I have undertaken a certificate in contemporary science with the Open University. I have also read some of the peer-reviewed books on writing academic documents or proposals (for personal interest). I also have books on writing and study skills.