What are the platforms on the Fediverse doing to prevent data scraping and prevent bots?

thesharky@piefed.blahaj.zone · 5 hours ago

What are the platforms on the Fediverse doing to prevent data scraping and prevent bots?

TropicalDingdong@lemmy.world · edited 5 hours ago

You misunderstood the assignment

thesharky@piefed.blahaj.zone · edited 2 hours ago

Did I? I can’t see how.

I don’t think web crawlers overloading instances by downloading huge amounts of content and sending thousands of requests is the point of the Fediverse.

But I might be genuinely confused here. Correct me if I’m wrong.

FaceDeer@fedia.io · 2 hours ago

That’s a very narrow view of data scraping; there are lots of ways to get data.

The Fediverse is built on ActivityPub, which is an open protocol that’s designed to broadcast data with no limitations or restrictions. If you don’t want your data to end up in the hands of anyone who wants it - including those nefarious AI trainers - then that’s an inherently incompatible goal with ActivityPub.

If you’re just worried about specific instances being overloaded with requests, then sure, all the usual rate-limiting and DDoS-prevention tricks, Cloudflare included, will work. But the data itself isn’t “protected.” Someone who wants it could simply run an instance of their own specifically to collect it.

thesharky@piefed.blahaj.zone · 2 hours ago

I don’t understand the part where you say that my view is narrow. I am talking about a specific kind of data scraping. I’m not sure what I’ve said that has led you and a few other people to believe I’m necessarily worried about people getting hold of “my data”.

Am I just expressing myself badly here?

As for the rate limiting, that’s closer to what I wanted to know. Thanks.

FaceDeer@fedia.io · 2 hours ago

Your original post didn’t specify a particular kind of data scraping. TropicalDingdong had no way to know you were interested in only that one kind, so his comment is appropriate: you can’t stop data scraping in general, and attempting to do so goes directly against the goal of ActivityPub.

thesharky@piefed.blahaj.zone · 2 hours ago

I guess. But that was an assumption on you guys’ part as well. Not that there’s anything wrong with that.

I’m curious about the “in general” part, though. Maybe that’s a part of the philosophy I don’t quite understand yet, but how’s the kind of scraping that I mentioned any good? Or is that not the right question to ask?

FaceDeer@fedia.io · 2 hours ago

I didn’t say anything about the “prevent instances from being overloaded” part being good or bad. I didn’t even give an opinion on ActivityPub, just pointed out the practical limitations and incompatible design goals.

Personally, I’ve got no problem with websites implementing rate caps and whatnot to ensure that their traffic remains within the limits they can handle, or throttling specific IPs. I am very concerned with how Cloudflare in particular has become the single centralized “gatekeeper” for vast swaths of the Internet, though. If they decide that some particular client isn’t allowed to see stuff then poof, a big chunk of the Internet is cut off. That’s worrisome IMO.
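
For anyone wondering what a per-IP rate cap like that might look like when an instance does it itself, here’s a minimal sketch of a token bucket limiter. None of this comes from Lemmy, PieFed, or any other Fediverse codebase; the class name, the parameters, and the default numbers are all made up for illustration, and a real deployment would more likely do this in the reverse proxy.

```python
import time
from dataclasses import dataclass


@dataclass
class _Bucket:
    tokens: float       # requests this client may still make right now
    updated_at: float   # clock reading at the last refill


class PerIPRateLimiter:
    """Hypothetical token-bucket limiter keyed by client IP address."""

    def __init__(self, capacity=10, refill_rate=1.0, clock=time.monotonic):
        self.capacity = float(capacity)        # maximum burst size per IP
        self.refill_rate = float(refill_rate)  # tokens regained per second
        self._clock = clock                    # injectable so it can be tested
        self._buckets = {}

    def allow(self, ip):
        """Return True if the request may proceed, False if it should be throttled."""
        now = self._clock()
        bucket = self._buckets.get(ip)
        if bucket is None:
            # First request from this IP: start with a full bucket.
            bucket = _Bucket(tokens=self.capacity, updated_at=now)
            self._buckets[ip] = bucket

        # Refill in proportion to elapsed time, never above capacity.
        elapsed = now - bucket.updated_at
        bucket.tokens = min(self.capacity, bucket.tokens + elapsed * self.refill_rate)
        bucket.updated_at = now

        if bucket.tokens >= 1.0:
            bucket.tokens -= 1.0
            return True
        return False
```

A server would call `allow(client_ip)` on each incoming request and answer with HTTP 429 (Too Many Requests) when it returns False. This sketch never evicts idle entries, so it would need a cleanup step before being used on anything public. It also only limits load: as said earlier in the thread, it does nothing to stop someone from collecting the data through federation.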