I was reading the Reddit thread about Claude AI crawlers effectively DDoSing the Linux Mint forums https://libreddit.lunar.icu/r/linux/comments/1ceco4f/claude_ai_name_and_shame/
and I wanted to block all AI crawlers from my self-hosted stuff.
I don't trust crawlers to respect robots.txt, but you can get one here: https://darkvisitors.com/
Since I use Caddy as my server, I wrote a snippet that blocks them based on their User-Agent header. The bot names in the regex basically come from darkvisitors.
Side note: there is also a Caddy module for blocking crawlers, but it seemed like overkill to me https://github.com/Xumeiquer/nobots
For anybody who is interested, here is the block_ai_crawlers.conf I wrote.
(blockAiCrawlers) {
	@blockAiCrawlers {
		header_regexp User-Agent "(?i)(Bytespider|CCBot|Diffbot|FacebookBot|Google-Extended|GPTBot|omgili|anthropic-ai|Claude-Web|ClaudeBot|cohere-ai)"
	}
	handle @blockAiCrawlers {
		abort
	}
}
# Usage:
# 1. Place this file next to your Caddyfile
# 2. Edit your Caddyfile as in the example below
#
# ```
# import block_ai_crawlers.conf
#
# www.mywebsite.com {
#     import blockAiCrawlers
#     reverse_proxy * localhost:3000
# }
# ```
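If you want to sanity-check the pattern before reloading Caddy, a quick way is to run it against a few User-Agent strings. Below is a minimal sketch in Python, not part of the config itself. Caddy compiles the regex with Go's regexp package, but a plain case-insensitive alternation like this one should behave the same in Python's `re`. The helper name `is_blocked` and the sample User-Agent strings are made up for illustration.

```python
import re

# Same pattern as in block_ai_crawlers.conf above.
# Assumption: Caddy matches it unanchored against the header value,
# so re.search() is the closest Python equivalent.
AI_CRAWLER_PATTERN = re.compile(
    r"(?i)(Bytespider|CCBot|Diffbot|FacebookBot|Google-Extended|GPTBot"
    r"|omgili|anthropic-ai|Claude-Web|ClaudeBot|cohere-ai)"
)


def is_blocked(user_agent: str) -> bool:
    """Return True if the User-Agent would match the @blockAiCrawlers matcher."""
    return AI_CRAWLER_PATTERN.search(user_agent) is not None


# Hypothetical sample User-Agent strings, for illustration only.
samples = [
    "Mozilla/5.0 (compatible; ClaudeBot/1.0)",
    "mozilla/5.0 (compatible; gptbot/1.0)",
    "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0",
]

for ua in samples:
    print("blocked" if is_blocked(ua) else "allowed", "-", ua)
```

Keep in mind this only tests the regex. A crawler that sends a browser-like User-Agent will still get through, both here and in Caddy.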
I bought a font with a really shitty license agreement and I have a couple of questions.
How can I best share the font with the community? (I am afraid of metadata in the font files, which may be tied to my payment account etc. - I had to register and log in to download the ttf files)
How can I remove the DSIG and other metadata from the ttf file while keeping it usable?
Are they able to detect it if I use the font in a commercial product online by crawling my website and if yes, how could I prevent an automatic detection attempt?
To my (and possibly your) surprise, I didn't find any free downloads of the font online. Their license is tied to a personal account that you have to log into once a year to keep the license. As far as I understand, they could theoretically use the DSIG to let the ttf files "expire", at least when used in software that verifies the signature. But I may be wrong, so please let me know.
Thanks in advance and cheers, I mean ARR
Sorry for not doing much research beforehand and asking a newbie question. I am looking for some entry-point info on this question:
How would one go about datahoarding lemmy?
It seems to be a step above what I've been doing so far (downloading video/audio from streaming platforms and backing up web articles and blog posts as PDFs) because of the distributed nature and the ActivityPub protocol.
Relevant stuff that I've found so far but haven't studied extensively:
@Deckweiss
@lemmy.world