irotsoma@lemmy.blahaj.zone 29 points 4 days ago

TL;DR: You should have both, because AI companies have explicitly broken the robots.txt contract.

AI scrapers generally don't obey robots.txt. That file just tells scrapers what they shouldn't scrape and relies entirely on their good faith. Many AI companies have explicitly chosen not to comply with robots.txt, breaking that contract, so this is a system that traps non-compliant scrapers in a black hole of junk and wastes their time. It's a countermeasure, not a solution. It's just far less resource-intensive than the alternatives that outright block these connections, which tend to get you pounded with retries. This way the scraper bot gets stuck for a while, and you don't burn as many of your own resources blocking it over and over again.
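
For context, robots.txt is nothing more than a plain-text request. A minimal example (illustrative only; GPTBot and CCBot are just two well-known AI crawler user agents, and the /tarpit/ path is made up for this sketch):

```
# Politely asks these crawlers to stay out - nothing enforces it
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# Well-behaved crawlers are told to avoid the junk maze entirely
User-agent: *
Disallow: /tarpit/
```

The "black hole" approach then boils down to serving endless, slowly delivered junk on exactly the paths robots.txt disallows, so only crawlers that ignore the file ever fall into it. A rough sketch of the idea in Python (hypothetical, standard library only, not the actual project being discussed):

```python
# Tarpit sketch: slow, endless junk pages that link to more junk pages,
# so non-compliant scrapers waste time instead of hammering real content.
import random
import string
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def junk_words(n):
    """Return n pseudo-words of random lowercase letters."""
    return " ".join(
        "".join(random.choices(string.ascii_lowercase, k=random.randint(3, 10)))
        for _ in range(n)
    )


class TarpitHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(b"<html><body>")
        # Drip-feed junk paragraphs; the slow delivery is the point.
        for _ in range(50):
            self.wfile.write(f"<p>{junk_words(40)}</p>".encode())
            self.wfile.flush()
            time.sleep(2)
        # Link to more randomly named pages so the crawler never runs out of URLs.
        for _ in range(10):
            link = "".join(random.choices(string.ascii_lowercase, k=12))
            self.wfile.write(f'<a href="/tarpit/{link}">{junk_words(3)}</a> '.encode())
        self.wfile.write(b"</body></html>")

    def log_message(self, *args):
        pass  # keep the tarpit quiet


if __name__ == "__main__":
    # In a real deployment this would be mounted only under the path
    # that robots.txt disallows, so compliant crawlers never see it.
    ThreadingHTTPServer(("0.0.0.0", 8080), TarpitHandler).serve_forever()
```

Crawlers that honor robots.txt never request /tarpit/, so only the ones that break the contract pay the cost.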