lads

daniskarma@lemmy.dbzer0.com · edited 8 days ago

I mean, the number of pirates correlates with global temperature. That doesn’t imply causation.

The rest of the indicators would also match any archiving bot, or any bot in search of big data. We must remember that big data is used for much more than AI. At the end of the day, scraping is cheap, but very few companies in the world have the processing power to train on that amount of data. That’s why it seems so illogical to me.

How many LLMs that come out of a full training run are we seeing per year? Ten? Twenty? Even if they are updated and retrained often, that is not compatible with the volume of requests people attribute to AI scraping, a volume that would put services at risk of DoS. Especially since I would expect any AI company to avoid scraping the same data twice.

I have also experienced an increase in bot requests on my host. But I think it is just a result of the internet getting bigger: more people using it with more diverse intentions, some ill, some not. I’ve also experienced a big increase in probing and attack attempts in general, and I don’t think it’s OpenAI trying some outdated Apache vulnerability on my server. The internet is just a bigger sea with more fish in it.

grysbok@lemmy.sdf.org · 8 days ago

I just looked at my log for this morning. 23% of my total requests were from the user agent GoogleOther. Other visitors include GPTBot, SemanticScholarBot, and Turnitin. Those are the crawlers that are still trying after I’ve had Anubis on the site for over a month. It was much, much worse before, when they could crawl the site instead of being blocked.

That doesn’t include the bots that lie about being bots. Looking back at an older screenshot of a monitor (I don’t have the logs themselves anymore), I seriously doubt I had 43,000 unique visitors using Windows per day in March.
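
For anyone who wants to run the same kind of tally on their own logs, here is a minimal sketch. It assumes the access log is in the common "combined" format, where the user agent is the last quoted field on each line. The function name, the regular expression, and the token list are illustrative assumptions, not taken from the server discussed above. As noted, it only counts bots that identify themselves; anything pretending to be a browser falls under "other".

```python
import re
from collections import Counter

# Assumption: combined log format, which ends with: "referer" "user-agent"
UA_PATTERN = re.compile(r'"[^"]*" "([^"]*)"\s*$')

# Illustrative substrings to look for; adjust to what appears in your own logs.
BOT_TOKENS = ["GoogleOther", "GPTBot", "SemanticScholarBot", "Turnitin"]


def user_agent_shares(lines):
    """Return the percentage of parsed requests per self-declared bot token."""
    counts = Counter()
    total = 0
    for line in lines:
        match = UA_PATTERN.search(line)
        if not match:
            continue  # skip lines that do not look like combined-format entries
        total += 1
        ua = match.group(1).lower()
        # First matching token wins; everything else is lumped into "other".
        label = next((t for t in BOT_TOKENS if t.lower() in ua), "other")
        counts[label] += 1
    if total == 0:
        return {}
    return {label: 100.0 * n / total for label, n in counts.items()}
```

Usage would be something like `user_agent_shares(open("access.log"))`, where the file path is a placeholder for wherever your server writes its log.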

daniskarma@lemmy.dbzer0.com · edited 8 days ago

Why would they request the same data so many times a day if the objective were AI model training? It makes zero sense.

Also, Google’s bots obey robots.txt, so they are easy to manage.
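
As a rough illustration of what managing a crawler through robots.txt looks like, the sketch below uses Python's standard urllib.robotparser to check what a given rule set permits. The robots.txt content, the helper name is_allowed, and the example.org URL are made up for the example. It only shows what the rules say; whether a crawler actually honors them is up to the crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block the GoogleOther crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: GoogleOther
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GoogleOther", "SomeOtherBot"):
        print(agent, is_allowed(ROBOTS_TXT, agent, "https://example.org/page"))
```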

There may be tons of reasons Google is crawling your website, from ad research to any other kind of research. The only AI-related use I can think of is RAG. But that would take some user requests away, because if users get the info through Google’s AI response, they will not visit the website. I suppose that would suck for the website owner, but it wouldn’t drastically increase the number of requests.

But for training I don’t see it. There’s no need at all to keep constantly scraping the same website for model training.