• tal@lemmy.today
    6 hours ago

    What makes this worse is that git servers are the most pathologically vulnerable to the onslaught of doom from modern internet scrapers because remember, they click on every link on every page.

    The especially disappointing thing is that, for the specific case Xe was running into, a better-written scraper could recognize that this is a public git repository, git clone it once, and get all the useful code without the overhead. It’s not even “this scraper is scraping data that I don’t want it to have”, but “this scraper is too dumb to scrape the thing efficiently, so it blows both its own resources and the server’s resources downloading innumerable redundant copies of the same data”.
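As a rough sketch of what "recognize the repo and clone it" could look like: git's smart HTTP protocol answers a request to `info/refs?service=git-upload-pack` with a dedicated content type, so a crawler can probe for that once and then hand the URL to `git clone` instead of walking every page. The function names and the example URL below are made up for illustration, and this only detects smart-HTTP servers (a "dumb" HTTP remote would not match), so treat it as a starting point rather than how any real scraper works.

```python
import subprocess
import urllib.request
from urllib.parse import urlsplit, urlunsplit

# Content type a smart-HTTP git server uses for its ref advertisement.
GIT_ADVERT_TYPE = "application/x-git-upload-pack-advertisement"


def info_refs_url(repo_url):
    """Build the ref-discovery URL for a candidate repository URL."""
    parts = urlsplit(repo_url)
    path = parts.path.rstrip("/") + "/info/refs"
    return urlunsplit(
        (parts.scheme, parts.netloc, path, "service=git-upload-pack", "")
    )


def is_git_advertisement(content_type):
    """True if a Content-Type header value is git's ref advertisement type."""
    return content_type.split(";")[0].strip().lower() == GIT_ADVERT_TYPE


def clone_command(repo_url, dest):
    """Command line for a single full mirror of the repository."""
    return ["git", "clone", "--mirror", repo_url, dest]


def looks_like_git_repo(repo_url, timeout=10):
    """Probe the URL once; any network or HTTP error counts as 'not a repo'."""
    try:
        with urllib.request.urlopen(info_refs_url(repo_url), timeout=timeout) as resp:
            return is_git_advertisement(resp.headers.get("Content-Type", ""))
    except OSError:
        return False


def mirror_if_git(repo_url, dest):
    """Clone once instead of crawling every commit, blame, and diff page."""
    if not looks_like_git_repo(repo_url):
        return False
    subprocess.run(clone_command(repo_url, dest), check=True)
    return True
```

Something like `mirror_if_git("https://git.example.org/project", "project.git")` would then cost the server one probe and one clone, versus one request per link on every page.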

    It’s probably just as well, since the protection is relevant for other websites, and he probably wouldn’t have built it if his git repo hadn’t been getting hammered, but…

    EDIT: Plus, I bet that the scraper was requesting a ton of files at once from the server, since he said that the server was unusable. You have a zillion servers to parallelize requests over. You could write a scraper that requests one file at a time per server, which is common courtesy, and you’d still be bandwidth-constrained if you’re schlorping up the whole Internet. Xe probably wouldn’t have even noticed.
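A minimal sketch of that "one request at a time per server" idea, assuming an asyncio-based crawler: requests to different hosts run in parallel, while requests to the same host queue up behind a per-host lock. `polite_crawl`, the `fetch` callable, and `per_host_delay` are hypothetical names for this example, not anything from an actual scraper.

```python
import asyncio
from collections import defaultdict
from urllib.parse import urlsplit


async def polite_crawl(urls, fetch, per_host_delay=0.0):
    """Fetch every URL, never running two requests to the same host at once.

    `fetch` is any async callable taking a URL and returning its result.
    `per_host_delay` is an optional pause (seconds) after each request
    before the next one to that host may start.
    """
    locks = defaultdict(asyncio.Lock)  # one lock per hostname
    results = {}

    async def worker(url):
        host = urlsplit(url).hostname
        async with locks[host]:  # serializes requests to this host only
            results[url] = await fetch(url)
            if per_host_delay:
                await asyncio.sleep(per_host_delay)

    # All workers start together; only same-host workers wait on each other.
    await asyncio.gather(*(worker(u) for u in urls))
    return results
```

With enough distinct hosts in the queue, total throughput is still limited by the crawler's own bandwidth, while each individual server only ever sees a single connection's worth of load.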

    • mic_check_one_two@lemmy.dbzer0.com
      6 hours ago

      Sorta like how people complain about bots scraping Lemmy, even though federation already exists as a standardized protocol for distributing data. Anyone who wanted to scrape Lemmy efficiently could just spin up their own instance and let federation do the scraping for them. It would even have the added benefit that they could set their server to ignore delete requests, so deleted posts/comments wouldn’t get automatically removed from their copy. And then they could scrape their own instance as much as they wanted without impacting anyone else.

      But they don’t want to do that, because it would require the smallest modicum of forethought. They don’t care that scrapers are trashing the Internet and causing massive bandwidth issues for hosters. They just want the data, and they want it now. All of those “bots are flooding my server and eating all my bandwidth, so legitimate users can’t actually access the site” complaints are for other people.