Thursday, November 12, 2020

On Broken Systems...

I've been doing a lot of scripting for the last couple of weeks on Second Life (in violation of my SurveyWalrus, don't tell him!). One of the things that surprises me about it is just how many of the APIs are unreliable. That is, the statement executes, no error occurs, but it doesn't work every time.

Actually, this is crazy common, especially when dealing with hardware in the real world. Nothing works quite as documented, and often not as you expected it to either. Surprisingly for this blog, I'm not criticizing it; I find it rather endearing. (When it's HARDWARE. Hardware is hard, SL has no excuse ;) )

So what do you do when you execute the command, and nothing happens? Well, there are a few things that tend to help out in that case.

First and foremost, you need to ensure that the command actually happened, and that it happened the way you expected it to. There's little point going any further if you can't prove this. This means instrumenting your code, which I've covered before. It probably also means probing the hardware to make sure what was supposed to trigger the effect actually happened. Oscilloscopes have come down a lot in price over the decades, and even a cheap pocket one from Alibaba is a better diagnostic tool than poking the circuit with a wet finger.
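
As a rough illustration of what "instrumenting" can look like, here is a minimal Python sketch that logs every call to a command, along with its arguments, result, and timing. The `set_light` function is a made-up stand-in for whatever API or hardware call you are actually chasing; nothing here is specific to Second Life or any real device.

```python
import functools
import logging
import time

logging.basicConfig(level=logging.DEBUG, format="%(asctime)s %(message)s")
log = logging.getLogger("commands")


def instrumented(func):
    """Log each call's arguments, its result or exception, and elapsed time."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        try:
            result = func(*args, **kwargs)
        except Exception:
            # Record the failure with a traceback, then let it propagate.
            log.exception("%s args=%r kwargs=%r raised", func.__name__, args, kwargs)
            raise
        elapsed_ms = (time.monotonic() - start) * 1000.0
        log.debug("%s args=%r kwargs=%r -> %r (%.2f ms)",
                  func.__name__, args, kwargs, result, elapsed_ms)
        return result
    return wrapper


@instrumented
def set_light(channel, on):
    # Hypothetical stand-in for the real API or hardware call.
    return True


set_light(3, True)
```

The log only proves that your side issued the command. Whether the other side acted on it is a separate question, which is where the oscilloscope comes in.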

(Disclaimer: don't poke circuits with wet fingers. It's not good for the circuit, it's not good for your finger, it likely won't tell you anything and it looks silly.)

After you have verified that the command is happening the way you expected, just take a moment and double-check the documentation matches your expectation. This step can save you hours of effort. Of course, about one in five times the documentation is also wrong. Optimist.

Okay! So your code is working! The command matches the documentation! And it doesn't work. Now what? Debugging starts. That's right, you don't only have to debug the code you write, you will probably (always) have to debug code you didn't write and hardware you didn't create. And this part you usually can't fix! Man, computers are great!

More common than outright failure is inconsistent operation. That is, sometimes it works and sometimes it doesn't. This was the case with the SL APIs. It's very rare that something which is not actually defective is truly random -- and if it is, you can't work with it anyway. You need to start in on the scientific method and work out which conditions it works in, and which conditions it doesn't. At the over-simplified level, that's:

1) Create a theory
2) Devise a way to test your theory
3) Execute your test and record the results
4) Revise the theory based on the new information
5) Repeat from step 2

You'll notice there's no exit condition. Usually, you stop when you get it reliable enough, and that depends on your needs. Or you stop when the hardware guy finally gets tired of your questions and adds some resistors to the circuit, suddenly stabilizing it. ;)
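
Steps 2 and 3 are easier to stick with if the test is automated. Here is a small Python sketch, under the assumption that you can wrap one attempt in a function that returns True or False. The names `run_trials` and `fake_trial` are invented for illustration, and the fake trial simply pretends the device needs a 50 ms gap.

```python
from collections import Counter


def run_trials(trial, conditions, repeats=20):
    """Run trial(condition) `repeats` times per condition; return success rates."""
    conditions = list(conditions)
    successes = Counter()
    for condition in conditions:
        for _ in range(repeats):
            if trial(condition):
                successes[condition] += 1
    return {c: successes[c] / repeats for c in conditions}


def fake_trial(gap_s):
    # Hypothetical device model: commands only land with a gap of 50 ms or more.
    return gap_s >= 0.05


rates = run_trials(fake_trial, [0.01, 0.05, 0.1], repeats=5)
for gap, rate in rates.items():
    print(f"gap {gap:.2f}s: {rate:.0%} success")
```

A real trial would send the command and then check the result some other way. The point is only to have numbers in front of you when you revise the theory.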

But to get you started - most of the time I've run into inconsistent operation, it has been timing. From the Atari Jaguar to cheap LCD panels to, yes, Second Life, it's really common for both APIs and hardware to drop commands if they come packed too close together. So, spacing them out is a good first test.
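
One way to try the spacing theory is to put a small wrapper in front of the command that enforces a minimum gap. This is a Python sketch only: `SpacedSender` is a made-up name, the `send` callable stands in for the real command, and the right gap is something you would have to find by experiment.

```python
import time


class SpacedSender:
    """Enforce a minimum gap between consecutive commands."""

    def __init__(self, send, min_gap_s, clock=time.monotonic, sleep=time.sleep):
        self._send = send            # the real command (assumed callable)
        self._min_gap_s = min_gap_s  # minimum seconds between commands
        self._clock = clock
        self._sleep = sleep
        self._last_sent = None

    def send(self, command):
        # Wait out whatever remains of the gap since the previous command.
        if self._last_sent is not None:
            wait = self._min_gap_s - (self._clock() - self._last_sent)
            if wait > 0:
                self._sleep(wait)
        result = self._send(command)
        self._last_sent = self._clock()
        return result


sender = SpacedSender(print, min_gap_s=0.05)
for command in ("move", "rotate", "say hello"):
    sender.send(command)
```

The clock and sleep functions are parameters so the wrapper can be tested without real delays. If the failures disappear at some gap and come back below it, you have your answer.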

What if it is truly random (or you just can't narrow it down any further), and you have no choice but to use it? Well, first, push back really hard, because you really don't want to have to support your code on this broken system for the next 10 years, do you? It's not over when you push it to GitLab!

But if you really can't get out of it, well, you need to improve the odds of success. Can you safely execute the command twice? Safely means that it's okay if the command works once and fails once, and still okay if it works both times. Apparently the NES needs to do this workaround on the controller port if making heavy use of the sample channel, for instance. Alternatively, is there a way to VERIFY the command, and repeat it if it failed? Is there another way to accomplish the same thing, and bypass the broken command? Even if it's slower, that is probably better than unreliable.
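
The verify-and-repeat idea might look something like this in Python. It is a sketch under assumptions: `send` and `verify` are callables you supply, the command is safe to repeat, and `send_verified` is a name I made up. The flaky device below is simulated so the example runs on its own.

```python
import time


def send_verified(send, verify, attempts=3, delay_s=0.1, sleep=time.sleep):
    """Send a command, check that it took effect, and repeat if it did not.

    Returns the number of the attempt that succeeded. Only use this when
    repeating the command is harmless.
    """
    for attempt in range(1, attempts + 1):
        send()
        if verify():
            return attempt
        if attempt < attempts:
            sleep(delay_s)  # give the device a moment before trying again
    raise RuntimeError(f"command not confirmed after {attempts} attempts")


# Simulated flaky device: it ignores the first two commands.
state = {"calls": 0, "on": False}


def flaky_send():
    state["calls"] += 1
    if state["calls"] >= 3:
        state["on"] = True


attempt = send_verified(flaky_send, lambda: state["on"], attempts=5, delay_s=0.01)
print(f"confirmed on attempt {attempt}")
```

Raising after the last attempt matters. Silently giving up just moves the unreliability into your own code.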

In the case of outright failure, you really have two possible causes: either the device/API is actually broken, or you are commanding it wrong. In the former case, you probably can't do very much about it -- and if you are still reading here, you probably can't prove it either. So you need to figure out what you are doing wrong.

Unfortunately, this one is much harder to advise - it all comes down to experience. Think about similar APIs you have implemented, and compare to the information you have. Does it make sense to try a different byte order? What about bit order? Is there an off-by-one error? (This is common in software and hardware both!) Make sure, if you are working with hardware, that it is safe to send bad data on purpose. Set up an isolated test bed, and try different things. Use the scientific method again, and you might just figure it out!
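
For the byte order and bit order guesses, it can help to generate every variant up front and try them one at a time on the test bed. A Python sketch, assuming a 16-bit value for the sake of example; `encoding_variants` and `reverse_bits` are illustrative names, not part of any real device's API.

```python
import struct


def reverse_bits(byte):
    """Reverse the bit order within one byte (0b00000001 -> 0b10000000)."""
    return int(f"{byte:08b}"[::-1], 2)


def encoding_variants(value):
    """Encode a 16-bit value under each byte-order and bit-order guess."""
    big = struct.pack(">H", value)
    little = struct.pack("<H", value)
    return {
        "big-endian": big,
        "little-endian": little,
        "big-endian, bits reversed": bytes(reverse_bits(b) for b in big),
        "little-endian, bits reversed": bytes(reverse_bits(b) for b in little),
    }


for name, payload in encoding_variants(0x1234).items():
    print(f"{name:30s} {payload.hex()}")
```

Write down which variant you sent and what happened each time. It's easy to lose track after the fourth try.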

Then you can enjoy a coffee and go tease the hardware guy that the software guy found their bug. That's always fun. ;)