Opening discussion about network crawlability and SPAM countermeasures — heartwood

I noticed that when I run rad node routing, I get a huge list of known rad: IDs, each of which can simply be copy-pasted into app.radicle.xyz to see what is hosted there: random repos from complete strangers, even on the opposite side of the network.

Today, while Radicle is still fresh, I believe most of the content is legitimate, but to me this behaviour indicates a fundamental weakness of the system. What if one day people start flooding the network with illegal material such as child pornography? Anyone could easily discover it. One could write a script to crawl the entire network and then do whatever they like with the data: search for keywords, train LLMs, scan the pictures... As a result, Radicle could lose its reputation or, in the worst case, be blocked en masse by ISPs and governments.

Of course we wouldn't want to censor the network, that is, to restrict people's ability to store and access data, share it with their colleagues and collaborate with each other. The goal is rather to limit discoverability, so that the network does not become a global wall where anyone can post the worst stuff the world has to offer, because such a wall could not be moderated the way hierarchical or federated spaces can. Here I'd like to open a discussion about how to preserve that freedom while keeping the network from turning into such a wall.

My initial idea is to split the repo ID into two parts: public and secret. The public part is enough to identify which nodes seed the repo; the secret part is shared among the seeders, so any node requesting the repo must present it. From the user's PoV it's still just a repo ID found somewhere on the Internet or received in a message.
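
To make the idea more concrete, here is a minimal Python sketch of how a seeder might check the secret part. Everything in it is hypothetical: the names SplitRepoId, Seeder and authorize_fetch, the "public.secret" share format and the choice of SHA-256 are my assumptions for illustration, not anything Radicle implements today.

```python
import hashlib
import hmac
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class SplitRepoId:
    """Hypothetical two-part repo ID (not an existing Radicle type)."""
    public: str  # announced to the network, enough to locate seeders
    secret: str  # handed over only together with the shared ID

    def share_string(self) -> str:
        # What a user would paste into a message: both parts in one string.
        return f"{self.public}.{self.secret}"


def new_split_id() -> SplitRepoId:
    # Random hex strings stand in for whatever real ID derivation would be used.
    return SplitRepoId(public=secrets.token_hex(16), secret=secrets.token_hex(16))


def parse_share_string(text: str) -> SplitRepoId:
    public, secret = text.split(".", 1)
    return SplitRepoId(public, secret)


class Seeder:
    """A node that seeds repos and only serves callers who know the secret."""

    def __init__(self):
        # public part -> SHA-256 digest of the secret part
        self._digests = {}

    def seed(self, rid: SplitRepoId) -> None:
        self._digests[rid.public] = hashlib.sha256(rid.secret.encode()).digest()

    def routing_entries(self) -> list:
        # Only public parts are ever announced, so a crawler learns nothing fetchable.
        return sorted(self._digests)

    def authorize_fetch(self, public: str, presented_secret: str) -> bool:
        expected = self._digests.get(public)
        if expected is None:
            return False
        presented = hashlib.sha256(presented_secret.encode()).digest()
        # Constant-time comparison, so response timing does not leak the digest.
        return hmac.compare_digest(expected, presented)
```

With something like this, copying an entry out of the routing table would no longer be enough to open the repo: seeder.authorize_fetch(public, "guess") fails, while a full ID received from a friend and run through parse_share_string succeeds.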

A possible downside is that crawling would still be possible, just a little harder, as long as we can list all the repos linked to a user. Obviously, this feature could be disabled, so that we couldn't list all of a user's repos but could still, for instance, verify that object A links to the same user as object B. However, this feels uncomfortable, as we are used to such a listing from other git forges. Instead, one could make a "hub" repo linking to all the others, or let the system do that automatically, perhaps even letting the user decide what is listed and what is unlisted. The key point is that navigation is one-way: if I have access to the hub page, I can list the repos, but if I only know one of the repos, I can't navigate to the hub page.
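
The one-way navigation could look roughly like the following Python sketch. Repo, Hub, link, listed_repos and same_owner are all made-up names, and the sketch assumes the network offers no index from an owner key back to a hub; it only shows the shape of the data, not a real design.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Repo:
    rid: str        # repo ID
    owner_key: str  # owner's public key; deliberately no field pointing to a hub


@dataclass
class Hub:
    """Hypothetical 'hub' repo that links to the owner's other repos."""
    owner_key: str
    entries: dict = field(default_factory=dict)  # rid -> True if listed

    def link(self, repo: Repo, listed: bool = True) -> None:
        if repo.owner_key != self.owner_key:
            raise ValueError("a hub may only link repos of its own owner")
        self.entries[repo.rid] = listed

    def listed_repos(self) -> list:
        # Unlisted repos stay linked but are not shown to visitors of the hub.
        return sorted(rid for rid, listed in self.entries.items() if listed)


def same_owner(a: Repo, b: Repo) -> bool:
    # Ownership of two known objects can be compared without enumerating anything.
    return a.owner_key == b.owner_key
```

Knowing the hub gives you hub.listed_repos(); knowing a single Repo gives you nothing that leads back to the hub, only the ability to compare it with another repo you already know.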

Another potential flaw is public PRs and issues, which, again, could not be moderated and so could be abused for SPAM, especially on popular repos. Avoiding SPAM in p2p networks is hard, so we should think about it very carefully while the system is still young, so that it doesn't become what e-mail is today. One thing that comes to mind is a configurable repo policy for content attached by strangers. Think of it as Linux file permissions:

r – repo is readable by anyone (who has its ID)
w – anyone can attach an issue or PR
x – the attached content is visible to the public, not just to the delegates
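
As a rough illustration, the three bits above could be modelled like this in Python. RepoPolicy, parse_policy and the can_* helpers are hypothetical names, and the rule that delegates are always allowed is my assumption, not existing Radicle behaviour.

```python
from enum import Flag, auto


class RepoPolicy(Flag):
    """Hypothetical per-repo policy bits for strangers (non-delegates)."""
    NONE = 0
    READ = auto()     # r: readable by anyone who has the ID
    WRITE = auto()    # w: anyone can attach an issue or PR
    PUBLISH = auto()  # x: attached content is visible to the public


_LETTERS = (("r", RepoPolicy.READ), ("w", RepoPolicy.WRITE), ("x", RepoPolicy.PUBLISH))


def parse_policy(text: str) -> RepoPolicy:
    """Parse a string such as 'rwx', 'rw-' or 'r--' into policy bits."""
    if len(text) != 3:
        raise ValueError("policy must be exactly three characters")
    policy = RepoPolicy.NONE
    for ch, (letter, flag) in zip(text, _LETTERS):
        if ch == letter:
            policy |= flag
        elif ch != "-":
            raise ValueError(f"unexpected character {ch!r} in policy")
    return policy


def can_read(policy: RepoPolicy, is_delegate: bool) -> bool:
    return is_delegate or RepoPolicy.READ in policy


def can_submit(policy: RepoPolicy, is_delegate: bool) -> bool:
    return is_delegate or RepoPolicy.WRITE in policy


def can_view_submissions(policy: RepoPolicy, is_delegate: bool) -> bool:
    # Without the x bit, strangers' issues and PRs reach only the delegates.
    return is_delegate or RepoPolicy.PUBLISH in policy
```

For example, a popular repo could run with parse_policy("rw-"): strangers can read it and file issues, but what they file is only visible to the delegates until someone reviews it.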

This picture is probably imperfect, but... you get the point. We can already delegate others to maintain our repo, so does that mean a repo can have more than one owner? I didn't check how it works. Why don't we delegate the checking of attached issues and PRs in the same way, or, I don't know, attach a bot with its own identity to deal with it, and let our inbox rest?

As we can see, this is definitely not a small or easy change. I'm wondering how we, as a community, are going to deal with the problem described above. So I started this discussion and encourage anyone to participate.