• rtxn@lemmy.world · 6 days ago

      Anubis is a simple anti-scraper defense that weighs a web client’s soul by giving it a tiny proof-of-work workload (a cryptographic calculation with no efficient shortcut, such as repeatedly hashing a value until the result meets a difficulty target) before letting it through to the actual website. The workload is insignificant for a human user, but very taxing for high-volume scrapers. The calculation is done on the client’s side using JavaScript code.

      (edit) For clarification: this works because the computation workload takes a relatively long time, not because it bogs down the CPU. Halting each request at the gate for only a few seconds adds up very quickly.
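
      For a rough idea of what that challenge looks like, here is a minimal hash-based proof-of-work sketch (illustrative only, not Anubis’ actual code; the challenge string, difficulty value, and function names are made up):

      ```javascript
      // Minimal proof-of-work sketch (illustration only, not Anubis' real implementation).
      // The server hands the client a random challenge and a difficulty; the client must find
      // a nonce such that SHA-256(challenge + nonce) starts with `difficulty` zero hex digits,
      // then sends the nonce back. Verifying the answer costs the server a single hash.
      async function sha256Hex(text) {
        const bytes = new TextEncoder().encode(text);
        const digest = await crypto.subtle.digest("SHA-256", bytes);
        return [...new Uint8Array(digest)].map(b => b.toString(16).padStart(2, "0")).join("");
      }

      async function solveChallenge(challenge, difficulty) {
        const target = "0".repeat(difficulty);
        for (let nonce = 0; ; nonce++) {
          if ((await sha256Hex(challenge + nonce)).startsWith(target)) {
            return nonce; // finding this took many hashes; checking it takes one
          }
        }
      }

      // A human solves this once in a while; a high-volume scraper has to solve it over and over.
      solveChallenge("random-challenge-from-server", 4).then(nonce => console.log("solved:", nonce));
      ```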

      Recently, the FSF published an article that likened Anubis to malware because it’s basically arbitrary code that the user has no choice but to execute:

      […] The problem is that Anubis makes the website send out a free JavaScript program that acts like malware. A website using Anubis will respond to a request for a webpage with a free JavaScript program and not the page that was requested. If you run the JavaScript program sent through Anubis, it will do some useless computations on random numbers and keep one CPU entirely busy. It could take less than a second or over a minute. When it is done, it sends the computation results back to the website. The website will verify that the useless computation was done by looking at the results and only then give access to the originally requested page.

      Here’s the article, and here’s aussie linux man talking about it.

      • The Quuuuuill@slrpnk.net · 7 days ago

        fwiw Anubis is working on a more respectful update; this was their first-pass solution for what was basically a break-glass emergency. i understand FSF’s concern, but Anubis is the only thing that’s making a free and open internet remotely possible right now, and it’s far better than nightmare fuel like Cloudflare

        • daniskarma@lemmy.dbzer0.com · 7 days ago

          How does it factor into the “free” and “open” part?

          It seems to be more about IP protection than anything else.

          • rtxn@lemmy.world · 7 days ago

            • A web server that can’t discriminate between a request made by a human and one made by a machine has to handle all requests. It may not be an issue for large companies like Amazon or Microsoft, but small websites will suffer timeouts and outages.
            • Without a locally hosted solution like Anubis, small websites would have to move behind a large centralized service like Cloudflare.
            • Otherwise, they might not be able to continue operating, and only large corporate-backed services like Twitter and Reddit would survive.

            The alternative is having to choose between Reddit and Cloudflare. Does that look “free” and “open” to you?

            • daniskarma@lemmy.dbzer0.com · 7 days ago

              That whole argument rests on two wrong suppositions.

              It assumes that websites are under constant DDoS and that they cannot exist without DDoS protection.

              This is false.

              It assumes that Anubis is effective against DDoS attacks, which it is not. It’s a mitigation, but any DDoS attack worth its name would have no trouble bringing down a site running Anubis, as the server still has to handle the requests even if they are smaller requests.

              Anubis’ only use case is making AI scrappers consume more energy while scrapping, while also making many legitimate users use more energy. It’s just being promoted on the anti-AI wave, but I don’t really see much usefulness in it.

              • ℍ𝕂-𝟞𝟝@sopuli.xyz · 7 days ago

                Websites were under a constant noise of malicious requests even before AI, but now AI scraping of Lemmy instances usually triples traffic. While some sites can cope with that, it still means a three-fold increase in hosting costs, essentially in order to fuel investment portfolios.

                AI scrapers will already use as much energy as available, so making them use more per site means fewer sites being scraped, not more total energy used.

                And this is not DDoS: the objective of scrapers is to get the data, not to bring the site down, so while the server must reply to all requests, the clients can’t get the data out without doing more work than the server.

                • daniskarma@lemmy.dbzer0.com · 7 days ago

                  AI does not triple traffic. It’s a completely irrational statement to make.

                  There’s a very limited number of companies training big LLM models, and these companies only train a model a few times per year. I would bet that the number of requests per year for a given resource by an AI scrapper is in the dozens at most.

                  Using as much energy as available per scrapping doesn’t even make physical sense. What does that sentence even mean?

                  • grysbok@lemmy.sdf.org · 6 days ago

                    You’re right. AI didn’t just triple the traffic to my tiny archive’s site. It way more than tripled it. After implementing Anubis, we went from 3000 ‘unique’ visitors down to 20 in a half-day. Twenty is a much more expected number for a small college archive in the summer. That was before I did any fine-tuning of Anubis, just the default settings.

                    I was getting constant outage reports. Now I’m not.

                    For us, it’s not about protecting our IP. We want folks to be able to find the information. That’s why we write finding aids, scan material, and accession it. But allowing bots to siphon it all up inefficiently was denying everyone access to it.

                    And if you think bots aren’t inefficient, explain why Facebook requests my robots.txt 10 times a second.

                  • ℍ𝕂-𝟞𝟝@sopuli.xyz · 7 days ago

                    AI does not triple traffic. It’s a completely irrational statement to make.

                    Multiple testimonials from people who host sites say it does. Multiple Lemmy instances have also supported this claim.

                    I would bet that the number of requests per year for a given resource by an AI scrapper is in the dozens at most.

                    You obviously don’t know much about hosting a public server. Try dozens per second.

                    There is a booming startup industry all over the world training AI, and scraping data to sell to companies training AI. It’s not just Microsoft, Facebook and Twitter doing it, but also Chinese companies trying to compete. Also companies not developing public models, but models for internal use. They all use public cloud IPs, so the traffic is coming from all over incessantly.

                    Using as much energy as available per scrapping doesn’t even make physical sense. What does that sentence even mean?

                    It means that when Microsoft buys a server for scraping, they are going to run it 24/7 with the CPU and network maxed out, at maximum power use, to get as much data as they can. If the server can scrape 100 sites per minute, it will scrape 100 sites; if it can scrape 1000, it will scrape 1000; and if it can only do 10, it will do 10.

                    It will never stop scraping, as stopping would be the equivalent of shutting down a production line. Everyone always uses their scrapers as much as they can. Ironically, increasing the cost of scraping would result in less energy consumed in total, since it would force companies to work more “smart” and less “hard” at scraping and training AI.

                    Oh, and it’s S-C-R-A-P-I-N-G, not scrapping. It comes from the word “scrape”, meaning to remove the surface from an object using a sharp instrument, not “scrap”, which means to take something apart for its components.

              • rtxn@lemmy.world · 7 days ago

                It assumes that websites are under constant DDoS

                It is literally happening. https://www.youtube.com/watch?v=cQk2mPcAAWo https://thelibre.news/foss-infrastructure-is-under-attack-by-ai-companies/

                It assumes that Anubis is effective against DDoS attacks

                It’s being used by some little-known entities like the LKML, FreeBSD, SourceHut, UNESCO, and the fucking UN, so I’m assuming it probably works well enough. https://policytoolbox.iiep.unesco.org/ https://xeiaso.net/notes/2025/anubis-works/

                anti-AI wave

                Oh, you’re one of those people. Enough said. (edit) By the way, Anubis’ author seems to be a big fan of machine learning and AI.

                (edit 2 just because I’m extra cross that you don’t seem to understand this part)

                Do you know what a web crawler does when a process finishes grabbing the response from the web server? Do you think it takes a little break to conserve energy and let all the other remaining processes do their thing? No, it spawns another bloody process to scrape the next hyperlink.
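
                A naive crawler loop, sketched in JavaScript purely for illustration (this is not any specific scraper’s code), makes the point:

                ```javascript
                // Illustrative sketch only, not any particular company's scraper.
                // Every link found on a fetched page is queued immediately; the loop never pauses,
                // and real scrapers run many such loops in parallel to keep the hardware saturated.
                const queue = ["https://example.org/"]; // hypothetical seed URL
                const seen = new Set(queue);

                async function crawl() {
                  while (queue.length > 0) {
                    const url = queue.shift();
                    const html = await fetch(url).then(r => r.text());
                    for (const [, link] of html.matchAll(/href="(https?:[^"]+)"/g)) {
                      if (!seen.has(link)) {
                        seen.add(link);
                        queue.push(link); // no break, no backoff: straight on to the next hyperlink
                      }
                    }
                  }
                }

                crawl();
                ```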

                • daniskarma@lemmy.dbzer0.com · 7 days ago

                  Some websites being under DDoS attack =/= all sites being under constant DDoS attack, nor does it mean they cannot exist without DDoS protection.

                  Then there’s a logical fallacy in there: being used does not mean being useful. Many companies use AI for some task; does that make AI useful? No.

                  The logic still stands: all Anubis can do against DDoS is raise the barrier a little before the site goes down. That’s called mitigation, not protection. If you are targeted by a DDoS, that mitigation is not going to do much, and your site is going down regardless.

                  • CanadaPlus@lemmy.sdf.org · 6 days ago

                    If a request is taking a full minute of user CPU time, it’s one hell of a mitigation, and anybody who’s not a major corporation or government isn’t going to shrug it off.
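
                    For a rough sense of scale (assuming, purely for illustration, one minute of CPU time per challenge): sustaining even a modest 100 requests per second against such a gate would keep 100 × 60 = 6,000 CPU cores busy around the clock.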

          • kautau@lemmy.world · 7 days ago

            Free software

            users have the freedom to run, copy, distribute, study, change and improve the software

            https://www.gnu.org/philosophy/free-sw.en.html

            Open source

            https://en.wikipedia.org/wiki/The_Open_Source_Definition

            1. No discrimination against fields of endeavor, like commercial use

            You are dropping the terms “software” and “source”. The code is freely available, and to be open source it should be usable for whatever purpose.

            As an aside, it’s frequently used by smaller sites to prevent the kind of overwhelming scraping that could take a site down, which has become far more rampant recently due to AI bots.

            • daniskarma@lemmy.dbzer0.com · 7 days ago

              I’m not saying it’s not open source or free. I’m saying that it does not contribute to making the web free and open. It really only contributes to making everyone waste more energy surfing the web.

              The web is already too heavy; we do NOT need PoW added on top of that.

              I don’t think even a Raspberry Pi 2 would go down over a web scrape. And Anubis cannot protect from a proper DDoS, so…

              • kautau@lemmy.world · 7 days ago

                I don’t think even a Raspberry Pi 2 would go down over a web scrape

                That absolutely depends on what software the server is running and whether there’s proper caching involved. If solving some PoW is required to scrape one page, it shouldn’t be too much of an issue, as opposed to bots just blindly following and ingesting every link.

                Additionally, you can allow “good bots” like the Internet Archive, and they’re currently working on a list of “good bots”:

                https://github.com/TecharoHQ/anubis/blob/main/docs/docs/admin/policies.mdx
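
                The gist of such an allowlist, sketched generically in JavaScript (this is not Anubis’ actual policy format or code; see the linked policies.mdx for the real configuration):

                ```javascript
                // Generic allowlist sketch, not Anubis' actual policy code or config format.
                // Known good bots skip the proof-of-work challenge; everything else gets challenged.
                const goodBots = [/archive\.org_bot/i]; // e.g. the Internet Archive's crawler UA fragment

                function decide(userAgent) {
                  return goodBots.some(re => re.test(userAgent)) ? "ALLOW" : "CHALLENGE";
                }

                console.log(decide("Mozilla/5.0 (compatible; archive.org_bot ...)")); // ALLOW
                console.log(decide("Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0")); // CHALLENGE
                ```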

                AI companies ingesting data nonstop to train their models doesn’t make for an open and free internet, and will likely lead to the opposite, where users no longer even browse the web but trust in AI responses that may be hallucinated.

                • daniskarma@lemmy.dbzer0.com · 7 days ago

                  There’s a small number of AI companies training full LLM models, and they usually do a few training runs per year. What most people see as “AI bots” are not actually that.

                  The influence of AI over the net is another topic. But Anubis isn’t doing anything about that either: it just makes the AI bots waste more energy getting the data, or at most keeps data under “Anubis protection” out of the training dataset. The AI will still be there.

                  Am I on the list of “good bots”? Sometimes I scrape websites for price tracking or change tracking. If I see a website running malware on my end, I would most likely just block that site; one legitimate user less.