What are the platforms on the Fediverse doing to prevent data scraping and prevent bots?

thesharky@piefed.blahaj.zone · 5 hours ago

What are the platforms on the Fediverse doing to prevent data scraping and prevent bots?

Jerry on PieFed@feddit.online · 5 hours ago

There are two answers to this, depending on why you are asking.

If you are asking because you are concerned about scrapers reading your posts and violating your privacy and your rights, then understand that even if an instance is 100% effective at blocking them, the post is still sent to many other federated servers in clear text. It doesn’t matter to the scrapers which federated server they read your post from, and they will read it many times over. In this case, there is little incentive for a server owner to block bots if the only goal is to protect your posts from ingestion.

If you are asking because you are concerned about scrapers sucking the life out of a server, that is a different matter. Multiple AI companies try to read every single post in the database, many times over, for training, which ends up causing gateway timeout errors and poor performance. For this reason, admins should take action.

On my PieFed server, feddit.online, as of yesterday, the firewall discarded 99K requests that it deemed to be AI scraping while processing the remaining 300K requests. Those 99K would have been expensive requests: not just upvotes and such, but requests for huge amounts of text. The impact on the server and infrastructure would therefore have been much more than a 25% tax on the system.

And if the bots realize your server is not well protected, it gets worse. Three months ago I peaked at 1.2 million requests in one day, of which over 700K were from AI bots. Now it’s consistently under 100K from bots, because many of them have given up, I like to believe.
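
To put those figures in proportion, here is a rough sketch of the arithmetic in Python. It assumes the 99K discarded requests are separate from the 300K that were processed, and it treats "over 700K" as exactly 700K, so the peak figure is a lower bound. The helper name `bot_share` is made up for illustration.

```python
def bot_share(bot_requests: int, total_requests: int) -> float:
    """Return bot traffic as a percentage of all incoming requests."""
    return 100 * bot_requests / total_requests


# Yesterday: 99K discarded plus 300K processed (assumed to be disjoint counts).
yesterday = bot_share(99_000, 99_000 + 300_000)

# Peak day three months ago: at least 700K bot requests out of 1.2 million.
peak = bot_share(700_000, 1_200_000)

print(f"yesterday: {yesterday:.1f}% of requests discarded as AI scraping")
print(f"peak day:  at least {peak:.1f}% of requests from AI bots")
```

By request count alone, that is roughly a quarter of traffic yesterday and more than half on the peak day. The real load is higher than the count suggests, since each of those requests asks for far more data than a typical one.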

CombatWombat@feddit.online · 4 hours ago

Hi Jerry! Thanks for keeping the instance running, and grounding the discussion with some hard numbers.