Wednesday, April 5, 2017

Sleeps Don't Scale

We've all done it - something races something else, we don't care if we lose the race, so we just throw a sleep in there and call it done for the day.

Unfortunately, as a system grows, all those buried sleeps start adding up against you. A large complex system simply cannot sustain sleeps, and there are three main reasons why not.

First: a sleep usually is a workaround for a race condition. As your system grows more complicated, the timing of that race condition will change - usually the window grows larger. However, in most cases, the duration of the sleep itself remains the same. A 10ms sleep that worked fine when you only had 5 threads is suddenly right on the edge when you have 20. It may break every time when you have a hundred (eek! But you know it happens.) Good luck finding the sleep that is suddenly causing your issues, you forgot about it months ago, cause everything was working fine.

Secondly, and this is strongly related to the first point: the sleeps will destabilize your code base. Little races that you didn't even know about will come up, because some other function that you had forgotten about and assumed was quick suddenly sits around for a while, giving your new code a chance to run and then suddenly pre-empting the middle of it. You know, when you sort of assumed your main loop was idle.

Finally, they kill performance. As your system scales up, all those sleeps start to add up. Take the example of a simple web service, for instance. It receives a message on a socket, passes it to a waiting worker, and gets the response back. Now imagine you had some dumb little race condition and threw in a 10ms sleep. When you're receiving 1 response a second, who cares about a 10ms sleep? Nobody! What happens when your service gets popular and needs to process 1000 packets a second? It's impossible - you need at least 10 seconds of sleep time to process that 1 seconds worth of traffic.

So what do you do? First off, you don't let them in there in the first place.

Oh sure, there are cases where you can let it slide. A standalone task running on its own thread with no interdependencies with the rest of your mainline code, sure, let it sleep.

Sleep might also be helpful for releasing your timeslice when you're done working, but a timer or signal would be better, since you will get more predictable wakeup times. For instance, if you want to process once every 10 milliseconds, you could do your task, then sleep for 10ms, then wake again. That will work. But if your work takes 1ms, then your cycle time that loop actually becomes 11ms. If you use a timer, you could include that work time in your 10ms cycle with minimal extra effort.

But in your mainline code? Anything that is processing on behalf of another system? Really bad idea. If you are sleeping just because you need to wait for something else to be done, you are far better off finding out what that other thing is and synchronizing with it than blindly sleeping. In long sequential tasks, a state machine approach might be better. Each cycle you can check if its time to work - if yes, do the work. If no, don't. This allows you to interleave the operation of multiple clients through that state machine, rather than blocking on the completion of each individual task.

And I'm bored. No punchline today. ;)