Sunday, August 26, 2018

You're Doing it Wrong: Debug

One of the last things that gets any attention in any large software project is the debug system - which is amazing because the vast majority of time spent developing a large software project is debugging. You'd think people would want to make it easier. You'd think that EVERYONE would want to make it easier. From the bean counters who are paying ridiculous sums of money for every minute of time, down to the individual contributor who knows that if they actually reported the true number of minutes that went into the product, the company would go bankrupt.

So what does a debug system do, and why does it get so little attention?

A debug system provides the tools and information necessary to identify the causes of undesired behavior. That's it, really, and what that means may vary some from project to project, but in most cases it boils down to some kind of log file that records the state of the software as it runs. Most commonly this is a human readable file (but it doesn't have to be), and is used after the fact to determine how some undesired event occurred.

Why is it so neglected? Well, people just seem to assume that they are perfect, therefore everyone else is perfect too. Perfect people never make mistakes - they never write incorrect code, and they certainly never input bad values or click things in the wrong order. Naturally, the systems people work with are also perfect, and never fail or get viruses or suffer bad RAM or otherwise pass garbage into a program. So, since nothing can ever go wrong, there's no need to check for errors or report events that occur. Obviously, people who believe this should not lead software teams.

In general, debug systems are also not on people's radar when they start a new project. They are focused on the awesome new function they want to provide, and are not thinking much about how they will troubleshoot that function when it ships.

In the case where the undesired event is an outright crash, there's usually some form of post-mortem analysis. In Linux, for instance, applications generally write 'core' dumps. Windows applications can do the same thing (in the form of 'minidumps' -- and developers who need to support Windows software would be well advised to enable these.) Both of these are binary files that capture the state of the software at the moment of failure, allowing you to load them into a debugger and examine just about any part of the system that might interest you - a moment frozen in time.

But although core dumps are a great first line of defense, they are often not enough. For instance, you can see that your pointer is NULL, and you can see that you crashed trying to deference it, but you can't figure out WHY it became NULL. This is because you only have a snapshot of the exact instant that the NULL was dereferenced. Some people would just add a NULL check right there and claim success - those people should not lead software teams.

A text log is most application's next line of defense. There's a pretty common evolution of text logging in large applications. The first evolution is just a simple dump of lines of text into a log. To avoid stressing the system or filling the disk, log is kept to a minimum or even programmatically disabled except in special cases. As log lines are added, eventually someone gets a line of text in a spinloop, which fills the disk, so filtering for duplicate lines is added. Then the increasingly complex system needs more and more debug, some of which becomes very frequent and is needed but is not spinning, so we add throttling. Then the amount of debug available starts to become unwieldy, and so it is reorganized into classifications (such as severity levels, and possibly subsystems). Then the fact that debug needs to be manually turned on and recompiled starts to be a problem, so debug is made programmatically controllable. Now people are so used to having this great information that when a problem occurs at 2am, they want the log information. But it was off, as the system can't cope with logging everything all the time. So the writer is optimized and logs are on. Finally, although this step sometimes comes much earlier, someone gets tired of the disk getting filled up, and log rotation is added. At this point, you pretty much have a basic text logging system.

So to recap, you might as well start with:
-classifications. Classify your message severity (DEBUG, INFORMATION, WARNING, ERROR, for instance) and possibly by subsystem, if appropriate.
-dynamic. Make it possible to turn logging on and off by classification. Recommend that users leave it on if possible.
-consistent. All systems should emit debug - but see below for what.
-throttling. Should be able to detect if the same debug is being emitted more frequently than is actually useful to a human reader. If you wouldn't read every single line, the computer probably shouldn't write every single line. "x lines per second" from a single statement usually works well enough, but your needs may vary. Make sure the log includes mention of throttled statements so the reader knows why it's not showing, and make sure that mention itself doesn't spam the log!
-rotation. The system should be able to track and control the amount of disk space consumed by logs, while making a reasonable amount of data available for troubleshooting. How much depends on the system and the needs.
-performant. Debug systems need to execute quickly enough to not impact the primary purpose of the system. By designing everything up front, you often get this one out of simple necessity. The simplest first pass is usually a ring buffer that debug statements are copied into, with a separate, lower priority thread writing the ring buffer out to disk. For bonus points, have your crash handler dump the ring buffer too. ;)
-information. I didn't mention this above, but your log functions should almost always be a macro so you can automatically include the filename and line number in them. ie:
#define log(LEVEL,args) writeToLog(__FILE__,__LINE__,LEVEL,args)

So what goes into a text log? Basically, anything you can't determine by any other means. If you were troubleshooting a video game, for instance, and you wondered 'what state was the boss in when he suddenly flew up into the sky?', you probably need to log changes to the boss' state.

Which leads me to the point - you want to keep your information as compact as possible. Logging changes is a pretty good tactic, since the reader can answer any state questions by reading backwards to the last state change. Likewise, if you have an if...then...else statement with two sides, in many cases you need only log one side (the LESS common one), since the reader can determine if the log doesn't appear, the other path was taken. (However, if there's related information you always need, emitting it before or after the IF statement with the data needed to determine which path was taken is also good.)

Good debug lets you read a function beside the logfile, and know exactly what that function did and why. Good debug also does not emit information that you can infer from other information, unless it's so much effort to make that inference that having the computer tell you actually saves time.

As a for instance, consider this:

log(DBG, "Pointer value is %p", ptr);
if (ptr == NULL) {
  log(ERR, "Pointer was NULL, not responding.");
} else {
  log(DBG, "Pointer is valid, doing something useful.");
  log(DBG, "Got %d from doSomethingUseful()");
log(DBG,"Final result is %d", res);

You can certainly tell EXACTLY what this block of code did.

Pointer value is 0x00000000
Pointer was NULL, not responding.
Final result in 0


Pointer value is 0x12345678
Pointer is valid, doing something useful.
Got 42 from doSomethingUseful()
Final result is 42.

It's explicit, but your log will fill pretty fast if you're that verbose. Consider this case, which gives you all the same information:

if (ptr == NULL) {
} else {
log(DBG,"Pointer was %p, Final result is %d", res);

Pointer was 0x0000000, Final result is 0


Pointer was 0x12345678, Final result is 42

Mind you, this does lose the ERR level message in the NULL pointer case, which if it was important, it would be just fine to keep. You can generally consider messages at different levels to be in different classes.

Which brings me to the concept of what goes into a message. In general, you should assign a certain level (such as INFO) for which that level and above is intended for your customers. Customer facing debug messages should attempt to communicate using human language (not numbers). Furthermore, high level customer messages, such as a WARNING or ERROR, must tell the customer what to do about it. If there is nothing they can do, it doesn't justify being a WARNING or an ERROR.

For instance, this is useless:

if (ptr == NULL) {
  log(ERR, "NULL ptr... about to crash.");

You've told the end user, and there is literally nothing they can do about it. Just make it a DBG level for the poor sap who needs to reverse engineer the crash, and maybe spew out some other relevant information if you have it. But don't tell the user. However, this might be helpful.

if (ptr == NULL) {
  log(ERR, "A failure has been detected. Please safely shut down and restart the system. Contact support to collect troubleshooting information.")

Human communication. Directions on what to do about it. Maybe the software should initiate its own shutdown, if that makes sense. Either way, far more useful to the person who suddenly gets the big red message.

That's it for now.

Thursday, August 16, 2018

You're doing it wrong: Coding Styles

There has been lots and lots written about coding styles, and I'm writing still more. My basic premise is the same as most of the other articles - everyone else is wrong, so listen to me.

I see the same basic issues pop up over and over again in coding guidelines. The most egregious, or at least the ones that come to mind as I write this, follow:

  • Detailed Guidelines that cover every conceivable syntax case: Here's the first rule. If your coding guidelines take up more than a single page of normally written text - including properly spaced examples - you've got too many rules. Remember that the purpose of software development is to create code, and the purpose of having coding guidelines is to reduce bugs. If your guideline can't reasonably be defended to reduce common bugs, it isn't needed. This is subjective, but this very basic guideline can trim the size of many lists. If you have too many, people won't even remember them, let alone use them.
  • Getting picky over the little things: It's very common to enforce things like whitespace. This tends to be pedantic after a point. In truth, I approve of little things like aligning lists of initializers or cases. (Some people are ridiculously against this, their defense is they don't want to read code in a code review. Use 'ignore whitespace' then. Aligning the code leads to a faster code review because you can just scan the list, instead of searching each line for the important parts. I'll complain about code reviews again someday.) Ultimately, the purpose of software development is to create code, and the purpose of having coding guidelines is to reduce bugs. It's rare that tabs versus spaces break the code.
  • Created by a secret cabal: it's pretty common that coding guidelines are developed by the senior developers and everyone gets to follow them. But unless you're a mega-corp with a company-wide policy and you can't change the rules or your job, the guidelines should be developed by the entire team - even that kid who just finished college and it's his first day and he hasn't even done orientation or dealt with IT yet. There are two very important reasons - it provides a feeling of contribution, and you all have to deal with them - so you might as well agree on them. Be diplomatic but firm about unnecessary rules - but if someone is very persistent and nobody else really cares - maybe consider adopting the rule, even if it is tabs instead of spaces. In truth, the purpose of software development is to create code, and the purpose of having coding guidelines is to reduce bugs. If people don't like the rules, they won't follow them or enforce them, and you've just wasted everyone's time.
  • Not well communicated: I think just about every rant in every field can include this bullet point. You can't just post a note on a semi-private IM chat channel and say you've communicated the guidelines to the group. Communication means making sure EVERYONE has had their say. There are two important parts there - "EVERYONE", and "making sure". Push notifications might cover everyone (might), but if you didn't make sure, you didn't communicate. You broadcast. There's a difference. In the end, the purpose of software development is to create code, and the purpose of having coding guidelines is to reduce bugs. If people don't even know about the guidelines, it's just wasted time (not to mention that means they didn't get their input as in the previous point).
  • Based on the latest trends: also known as "hype-driven development", trends should never be enforced. Most of the time they don't last, and the rest of the time only one or two of your developers know what they are and what they mean anyway. This means that coding the latest cool trick or using some amazing style that you found on a web page will only serve to confuse everyone else. Do note, the purpose of software development is to create code, and the purpose of having coding guidelines is to reduce bugs. If people don't even understand the intent of the whizz-bang feature, they can't effectively use it.
So what is important, then? Coding guidelines should aim to make the code readable. Coding guidelines should include comments - hell yes they should. Comments describe what the code is supposed to do and why it's doing it the way that it is (comments shouldn't describe what the code is doing unless it's really unclear). Coding guidelines should discourage 'tricks' that make the code hard to read (even if it works and is neato-keen) and encourage styles that let the compiler help find bugs. But most importantly, coding guidelines should exist to help the programmer get their job done. 

A secondary goal is to help the code reviewer get the code reviewed, because of course you are using code reviewers who are enforcing the guidelines, right? The code reviewer has their own code to get written, and needs to get through the review as quickly as possible. Basic syntax rules based on common errors make it easy to check if a code should be rejected without having to analyze whether it's actually faulty, which is a huge time saver. At the same time, reviewers should be generous when considering non-bug-causing issues, like whitespace alignment. A programmer who is running through his code fixing whitespace and unaligning lists and alphabetizing includes is not writing new code or testing existing code, which one would expect, is what we'd prefer they were doing. At least if I was paying them, that's what I'd prefer. In addition, after doing all those meaningless cleanups, they then need to repeat all the testing they previously did. Whitespace cleanup costs real money, and software developers are generally not cheap. Ask yourself if changing that tab to a space is really worth $500 or more (time to checkout, fix, build, re-test, commit, restart the code review, reviewers to come check it again...), or if you can just get it next time.