AI’s become so invasively popular, and I’ve seen more evidence of its ineffectiveness than otherwise, but what I dislike most about it is that many AI services run on datasets of stolen data for the sake of profitability, à la OpenAI and DeepSeek.
https://mashable.com/article/openai-chatgpt-class-action-lawsuit https://petapixel.com/2025/01/30/openai-claims-deepseek-took-all-of-its-data-without-consent/
Are there any AI services that run on ethically obtained datasets, like stuff people explicitly consented to submitting (not as some side clause of a T&C), data bought by properly compensating the data’s original owners, or datasets contributed by the service providers themselves?
Some machine learning models are trained on what’s called synthetic data, which is generated specifically for that purpose and mimics real-world data. What I don’t know is how much of the data used is synthetic vs. stolen.
Then again, the synthetic data is generated with previous-generation models that were trained on scraped data.
Turtles all the way down
Stolen, but this time passed through an additional bullshit layer for even less reliable results! Buy now!
If the real-world data it is based on was stolen, then using the synthetic version still counts as using stolen data.