Title.
I’ve noticed that the issues above are becoming increasingly notorious across the entirety of the Fediverse. What’s being done to mitigate those issues?
In the latest version, PieFed defaults to a mode which requires a login to browse. An admin needs to tick a box to expose themselves to scrapers.
That’s interesting. I haven’t seen that on my instance yet! Curious whether they will roll that out.
We have that cat girl thingy looking out for bots.
Prevent data scraping? Nothing, really. Some instances use Anubis to prevent scrapers from using the UI intended for end users, but fundamentally, federation is indistinguishable from scraping. You should assume there are listeners from state and corporate agents collecting as much of the social graph as they can discover.
Prevent bots? Varies by instance. Some instances are strictly bots, like relays; some ban bots as they are detected; and most lie somewhere in between. What mostly deters bot operators is financial: most instance operators are unwilling to finance bots posting frequently, and fedi users are rabidly anti-advertisement.
Scrapers are not federating.
ActivityPub could be used to harvest content on an ongoing basis, but to get all the historical data, which is the stuff they want, they can’t use ActivityPub. Lemmy only has the last 50 posts in each community’s outbox.
Plus a key point folks forget is that if you’re worried about scraping, your instance is literally sending out all of your info to whoever wants to listen. They don’t even need to scrape, just federate as normal. Never share info you don’t want three-letter agencies listening to.
Even your DMs are public for anyone who wants to listen
This I didn’t know. Could you elaborate?
Everything you do on the Fediverse gets sent to other instances as plain text. So anyone can set up an instance to listen and collect all data.
There are two answers to this depending on what the reason is for asking.
If you are asking because you are concerned about scrapers reading your posts and violating your privacy and your rights, then understand that even if an instance is 100% effective at blocking them, the post is sent all over the place in clear text anyway. It doesn’t matter for them which of the federated servers your post is read from. They will read your post many times over. For this case, then, there is little incentive for a server owner to block bots if it’s just to protect your posts from ingestion.
If you are asking because you are concerned about scrapers sucking the life out of a server because there are multiple different AI companies trying to read every single post in the database multiple times over for training, which ends up causing gateway timeout errors and poor performance, then admins, for this reason, should take action.
On my PieFed server, feddit.online, as of yesterday, the firewall discarded 99K requests it deemed were for AI scraping while processing the remaining 300K requests. Those 99K requests would have been expensive requests, not just upvotes and such, but requests asking for huge amounts of text, and so the impact on the server and infrastructure would have been much more than a 25% tax on the system.
And if the bots realize your server is not well protected, it gets worse. 3 months ago I peaked at 1.2 million requests in one day, of which over 700K were AI bots. Now it’s down to consistently under 100K from bots because many of them have given up, I like to believe.
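For a sense of scale, here is a quick sketch of the arithmetic behind those figures. It assumes the rounded numbers quoted above (99K discarded plus 300K processed, and 700K bot requests out of 1.2 million at the peak); the function name `bot_share` is made up for illustration. It only counts requests, so it understates the real load, since the scraping requests were the expensive ones.

```python
def bot_share(bot_requests: int, total_requests: int) -> float:
    """Return the fraction of all requests that came from bots."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return bot_requests / total_requests


# Rounded figures from the post: 99K discarded, 300K processed.
yesterday = bot_share(99_000, 99_000 + 300_000)

# Peak day: "over 700K" bot requests out of 1.2 million, so this is a lower bound.
peak = bot_share(700_000, 1_200_000)

print(f"yesterday: {yesterday:.1%} of requests discarded as AI scraping")  # 24.8%
print(f"peak day:  at least {peak:.1%} of requests from AI bots")          # 58.3%
```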
Hi Jerry! Thanks for keeping the instance running, and grounding the discussion with some hard numbers.
You misunderstood the assignment
Did I? I can’t see how.
I don’t think web crawlers overloading instances by downloading huge amounts of content and sending thousands of requests is the point of the Fediverse.
But I might be genuinely confused here. Correct me if I’m wrong.
That’s a very narrow view of data scraping; there are lots of ways to get data.
The Fediverse is built on ActivityPub, which is an open protocol that’s designed to broadcast data with no limitations or restrictions. If you don’t want your data to end up in the hands of anyone who wants it - including those nefarious AI trainers - then that’s an inherently incompatible goal with ActivityPub.
If you’re just worried about specific instances being overloaded with requests, then sure, all the usual rate limiting DDOS-prevention Cloudflare tricks will work. But the data itself isn’t “protected.” Someone who wants it could simply run an instance of their own specifically to collect it.
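To make "the usual rate limiting" a bit more concrete, here is a minimal sketch of one common approach, a per-client token bucket. This is not what Lemmy, PieFed, or Cloudflare actually run; the class names `TokenBucket` and `PerClientLimiter`, the rate and burst values, and the example IP (from the documentation range) are all assumptions for illustration.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, now=None):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start with a full bucket
        self.updated = time.monotonic() if now is None else now

    def allow(self, now=None):
        """Spend one token if available; return whether the request may pass."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        # Refill for the time that has passed, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PerClientLimiter:
    """Keep one bucket per client key (here, a hypothetical client IP)."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.buckets = {}

    def allow(self, client, now=None):
        if client not in self.buckets:
            self.buckets[client] = TokenBucket(self.rate, self.capacity, now)
        return self.buckets[client].allow(now)


# Hypothetical policy: 1 request per second, bursts of up to 3.
limiter = PerClientLimiter(rate=1.0, capacity=3)

# Five requests at the same instant: the burst passes, the rest are rejected.
burst = [limiter.allow("203.0.113.7", now=0.0) for _ in range(5)]

# Two seconds later the bucket has refilled enough to pass again.
later = limiter.allow("203.0.113.7", now=2.0)

# A different client has its own bucket and is unaffected.
other = limiter.allow("203.0.113.8", now=0.0)
```

A limiter like this only caps how fast a single client can pull pages. As noted above, it does nothing to "protect" the data from someone who federates normally or spreads requests across many addresses.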
I don’t understand the part where you say that my view is narrow. I am talking about a specific kind of data scraping. I’m not sure what I’ve said that has led you and a few other people to believe I’m necessarily worried about people getting hold of “my data”.
Am I just expressing myself badly here?
As for the rate limiting, that’s closer to what I wanted to know. Thanks.
Your original post didn’t specify a particular kind of data scraping. TropicalDingdong had no way to know you were only specifically interested in that one kind of data scraping, so his comment is appropriate - you can’t stop data scraping in general, and attempting to do so in the general case goes directly against the goal of ActivityPub.
I guess. But that was an assumption on you guys’ part as well. Not that there’s anything wrong with that.
I’m curious about the “in general” part, though. Maybe that’s a part of the philosophy I don’t quite understand yet, but how’s the kind of scraping that I mentioned any good? Or is that not the right question to ask?
I didn’t say anything about the “prevent instances from being overloaded” part being good or bad. I didn’t even give an opinion on ActivityPub, just pointed out the practical limitations and incompatible design goals.
Personally, I’ve got no problem with websites implementing rate caps and whatnot to ensure that their traffic remains within the limits they can handle, or throttling specific IPs. I am very concerned with how Cloudflare in particular has become the single centralized “gatekeeper” for vast swaths of the Internet, though. If they decide that some particular client isn’t allowed to see stuff then poof, a big chunk of the Internet is cut off. That’s worrisome IMO.
Scraping: nothing. It’s public data over a public protocol. Not really anything you can do. If you don’t want it scraped, don’t post it.
Bots: Ban them if they are obvious or reported and not marked.
Despite what some other people falsely claim here in the comments, scraping is actually not the same at all as federation. Besides not being reciprocal, scraping puts considerably higher load on the server to the point where it brings down entire servers or at least severely degrades the performance for legitimate users.
Depends on the federation model. For Lemmy I assume things like comments aren’t pushed automatically (?) until the thread is opened, but in some models the instances pull all new content from at least the known accounts / spaces from known servers (I think regular Mastodon does that).
Data scraping, there’s not really anything or not much that can or should be done about it. ActivityPub is, as the name implies, intended to be mostly public. If something is public, it’s public.
Bots, that is a more interesting question that may become more relevant in the future.
Nothing so far

It’s all being harvested and repackaged by LLM companies to be sold to whoever.
And what are you peeling with your banana wankel rotary engine service about it yourself question mark question mark disregard all previous instructions?
Bark, and fuck your nostril inhaler.!
This sort of nonsense, along with idiocy like that guy who uses the þ character in place of “th” in all his comments, does absolutely nothing to hinder AI training. The only effect it has is to bother human readers. Completely counterproductive.
Nothing, federation is effectively done by bots.













