Yep, you read the title correctly. I broke a chunk of the OStatus fediverse - the federated network that Mastodon and GNU Social (itself descended from StatusNet) participate in. It wasn't on purpose, just to be clear! My DON project, which I wrote about previously, was apparently making some odd requests, gumming up the message queues of Mastodon and GNU Social instances. It was reminiscent of the slowloris class of DoS attacks from years past, where a single client could tie up an entire server's resources. The problem is fixed on my end, but the bug is still present in several popular OStatus implementations, so I won't go into details about it today.
The problem was twofold. Firstly, I was making a lot of requests to a specific function on the affected OStatus nodes, owing to faulty retry logic on my side. Secondly, the requests I was making were just a tiny bit weird, which meant they took far longer to process than normal. These two things combined meant that I was chewing up a ton of resources for no reason - and because failures triggered retries, the more errors I caused, the more requests I sent. Heck of a situation.
The bug that I triggered caused messages to back up in GNU Social and Mastodon instances’ event queues, delaying propagation across the network and interrupting communication on several nodes completely. I should not have been able to do this, especially by accident.
Unfortunately, most popular OStatus software, in its default configuration, is vulnerable to denial-of-service conditions through resource exhaustion. Luckily, it doesn't have to be this way. You can use one weird trick to defeat most of these attacks. Skip down to the "How do I protect my instance?" section if you just want to help protect your own users from these problems.
Why was this possible?
In any network application, especially one exposed to the internet at large, it's useful to treat every connection as suspicious. Assume you're under attack, all the time, and that every connection could be malicious. Sanity-check everything. Rate limit aggressively. Even though you're not being attacked today, you might be tomorrow! A single client should never be able to overload your entire service.
Yeah, I know DDoS exists. Defending against distributed attacks requires a whole range of techniques that are a topic for another day!
Now, I realise this might sound pessimistic, and maybe it is a bit. But this attitude is born of experience - the experience of scrambling to recover services after misbehaving clients started consuming too many resources.
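To make the per-client principle concrete, here's a minimal token-bucket sketch in Python. The class name and parameters are mine for illustration - this isn't code from any OStatus implementation. Each client gets a sustained request rate plus a small burst allowance; once the bucket is empty, further requests are refused.

```python
import time

class TokenBucket:
    """Per-client rate limiter sketch: allows `rate` requests per second
    on average, with bursts of up to `capacity` requests."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum stored tokens (burst size)
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# One bucket per client (e.g. keyed by IP address in a real service).
bucket = TokenBucket(rate=5, capacity=10)
if not bucket.allow():
    pass  # reject the request, e.g. with HTTP 429
```

In practice you'd keep a bucket per client IP (with expiry, so the map doesn't grow forever) and reply with HTTP 429 when `allow()` returns `False`.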
I believe that the uniformity of the existing OStatus federated network has created a sort of vicious cycle: underused functionality receives less attention and so becomes more fragile. This happens in other open systems too, where a handful of "major" implementations end up developing in lockstep. The introduction of new implementations can be disruptive, as they may exhibit behaviour the existing systems have never seen. That behaviour may trigger undesired responses from existing software, especially if the new implementations are themselves operating incorrectly.
I believe a more diverse OStatus ecosystem will strengthen both new and existing implementations as time goes forward. It might be a bumpy ride, but it’ll be an interesting one!
How do I protect my instance?
Short version? Caching and rate limiting. Caching helps when many clients overload your service by requesting the same resource (as often happens during a traffic spike), and rate limiting helps stop a single client from tying up your service by requesting many different resources.
If you have specific instances that you're strongly federated with, you may want to whitelist them in your config. I'd recommend against blacklisting anything, because it gets the problem backwards: if you want to protect against unexpected problems or attacks, you by definition won't know in advance where those requests will originate - especially if there are proxies involved, as is the case with most attacks!
NGINX has the http_limit_req module, Apache has mod_ratelimit, IIS has its limits configuration, and Cloudflare has their own rate limiting functionality. That just about covers the whole public internet, so you really have no excuse not to aggressively rate limit sensitive functions.
The functions I would apply this to are those that write to the database as a side effect, perform external requests, or interact with a message queue. In the context of OStatus, that means the following URLs in particular:

- PubSubHubbub endpoints
  - GNU Social:
- API endpoints
  - GNU Social:
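As one concrete sketch, here's what rate limiting such an endpoint might look like in NGINX (mentioned above). The zone name, rate, backend address, and the `/push/hub` path are all placeholders - substitute the real endpoints for your software.

```nginx
# Shared zone keyed on client IP: ~5 requests/second sustained,
# 10 MB of state. (Values here are illustrative, not recommendations.)
limit_req_zone $binary_remote_addr zone=ostatus:10m rate=5r/s;

server {
    # Apply the limit only to the expensive, side-effecting endpoints.
    # "/push/hub" is a placeholder for your instance's real
    # PubSubHubbub / API paths.
    location /push/hub {
        limit_req zone=ostatus burst=10 nodelay;
        proxy_pass http://127.0.0.1:3000;  # placeholder backend
    }
}
```

The `burst` parameter lets a well-behaved client make a short flurry of requests, while `nodelay` rejects anything beyond that immediately instead of queueing it.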
I would also cache all unauthenticated API endpoints for a minute or so. This includes the RSS/Atom feeds for individual users, as these are often fetched during a follow operation. Caching these results will help protect your instance during traffic spikes. Choosing a caching mechanism and cache lifetime is a rather in-depth topic in its own right, so I suggest you refer to the documentation for your specific package. Mastodon has a production guide, while GNU Social has only a short section in its install guide. None of this is specific to OStatus, though, so you should also be able to follow a generic guide, like this one from Google.
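For completeness, a short-lived cache for unauthenticated responses might look like this in NGINX. Again, the paths, zone name, and backend address are placeholders; the important parts are the one-minute validity and the bypass for authenticated requests.

```nginx
# Cache unauthenticated responses briefly to absorb traffic spikes.
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=ostatus_api:10m
                 max_size=100m inactive=10m;

server {
    location /api/ {  # placeholder: your unauthenticated API/feed paths
        proxy_cache ostatus_api;
        proxy_cache_valid 200 1m;  # cache successful responses for one minute
        proxy_cache_key "$scheme$host$request_uri";
        # Never cache, or serve from cache, authenticated requests.
        proxy_cache_bypass $http_authorization;
        proxy_no_cache     $http_authorization;
        proxy_pass http://127.0.0.1:3000;  # placeholder backend
    }
}
```

If your instance uses cookie-based sessions rather than an `Authorization` header, add the relevant `$cookie_...` variable to the bypass directives as well.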