Sunday, August 26, 2018

You're Doing it Wrong: Debug

One of the last things that gets any attention in any large software project is the debug system - which is amazing because the vast majority of time spent developing a large software project is debugging. You'd think people would want to make it easier. You'd think that EVERYONE would want to make it easier. From the bean counters who are paying ridiculous sums of money for every minute of time, down to the individual contributor who knows that if they actually reported the true number of minutes that went into the product, the company would go bankrupt.

So what does a debug system do, and why does it get so little attention?

A debug system provides the tools and information necessary to identify the causes of undesired behavior. That's it, really, and what that means may vary some from project to project, but in most cases it boils down to some kind of log file that records the state of the software as it runs. Most commonly this is a human readable file (but it doesn't have to be), and is used after the fact to determine how some undesired event occurred.

Why is it so neglected? Well, people just seem to assume that they are perfect, therefore everyone else is perfect too. Perfect people never make mistakes - they never write incorrect code, and they certainly never input bad values or click things in the wrong order. Naturally, the systems people work with are also perfect, and never fail or get viruses or suffer bad RAM or otherwise pass garbage into a program. So, since nothing can ever go wrong, there's no need to check for errors or report events that occur. Obviously, people who believe this should not lead software teams.

In general, debug systems are also not on people's radar when they start a new project. They are focused on the awesome new function they want to provide, and are not thinking much about how they will troubleshoot that function when it ships.

In the case where the undesired event is an outright crash, there's usually some form of post-mortem analysis. In Linux, for instance, applications generally write 'core' dumps. Windows applications can do the same thing (in the form of 'minidumps' -- and developers who need to support Windows software would be well advised to enable these.) Both of these are binary files that capture the state of the software at the moment of failure, allowing you to load them into a debugger and examine just about any part of the system that might interest you - a moment frozen in time.
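
If you've never wired one up, the basic shape on Windows looks something like this. This is a minimal sketch, not production code - the dump filename, the MiniDumpNormal flag, and the complete lack of error handling are all simplifications, and you'd link against dbghelp.lib:

#include <windows.h>
#include <dbghelp.h>

// Unhandled-exception filter that writes a minidump to the working directory.
static LONG WINAPI writeMiniDump(EXCEPTION_POINTERS* exceptionInfo)
{
  HANDLE file = CreateFileA("crash.dmp", GENERIC_WRITE, 0, NULL,
                            CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
  if (file != INVALID_HANDLE_VALUE) {
    MINIDUMP_EXCEPTION_INFORMATION dumpInfo;
    dumpInfo.ThreadId = GetCurrentThreadId();
    dumpInfo.ExceptionPointers = exceptionInfo;
    dumpInfo.ClientPointers = FALSE;
    MiniDumpWriteDump(GetCurrentProcess(), GetCurrentProcessId(), file,
                      MiniDumpNormal, &dumpInfo, NULL, NULL);
    CloseHandle(file);
  }
  return EXCEPTION_EXECUTE_HANDLER;  // let the process terminate afterwards
}

// Early in startup:
//   SetUnhandledExceptionFilter(writeMiniDump);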

But although core dumps are a great first line of defense, they are often not enough. For instance, you can see that your pointer is NULL, and you can see that you crashed trying to dereference it, but you can't figure out WHY it became NULL. This is because you only have a snapshot of the exact instant that the NULL was dereferenced. Some people would just add a NULL check right there and claim success - those people should not lead software teams.

A text log is most applications' next line of defense. There's a pretty common evolution of text logging in large applications. The first evolution is just a simple dump of lines of text into a log. To avoid stressing the system or filling the disk, logging is kept to a minimum or even programmatically disabled except in special cases. As log lines are added, eventually someone puts a line of text inside a spinloop, which fills the disk, so filtering for duplicate lines is added. Then the increasingly complex system needs more and more debug, some of which fires very frequently and is genuinely needed but isn't spinning, so throttling is added. Then the amount of debug available starts to become unwieldy, and so it is reorganized into classifications (such as severity levels, and possibly subsystems). Then the fact that debug needs to be manually turned on and recompiled starts to be a problem, so debug is made programmatically controllable. Now people are so used to having this great information that when a problem occurs at 2am, they want the log information. But it was off, because the system can't cope with logging everything all the time. So the writer is optimized and logs stay on. Finally, although this step sometimes comes much earlier, someone gets tired of the disk getting filled up, and log rotation is added. At this point, you pretty much have a basic text logging system.

So to recap, you might as well start with:
-classifications. Classify your messages by severity (DEBUG, INFORMATION, WARNING, ERROR, for instance) and possibly by subsystem, if appropriate.
-dynamic. Make it possible to turn logging on and off by classification. Recommend that users leave it on if possible.
-consistent. All systems should emit debug - but see below for what.
-throttling. Should be able to detect if the same debug statement is being emitted more frequently than is actually useful to a human reader. If you wouldn't read every single line, the computer probably shouldn't write every single line. "x lines per second" from a single statement usually works well enough, but your needs may vary (there's a sketch of this after the list). Make sure the log includes mention of throttled statements so the reader knows why they're not showing, and make sure that mention itself doesn't spam the log!
-rotation. The system should be able to track and control the amount of disk space consumed by logs, while making a reasonable amount of data available for troubleshooting. How much depends on the system and the needs.
-performant. Debug systems need to execute quickly enough to not impact the primary purpose of the system. By designing everything up front, you often get this one out of simple necessity. The simplest first pass is usually a ring buffer that debug statements are copied into, with a separate, lower-priority thread writing the ring buffer out to disk (sketched after the list). For bonus points, have your crash handler dump the ring buffer too. ;)
-information. I didn't mention this above, but your log functions should almost always be a macro so you can automatically include the filename and line number in them, i.e.:
#define log(LEVEL, ...) writeToLog(__FILE__, __LINE__, LEVEL, __VA_ARGS__)
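
To make the throttling point concrete, here's one way the detection might look. This is a sketch, not a prescription: writeToLog, the per-call-site key, and the fixed one-second window are my own assumptions, and it isn't thread-safe as written.

// Sketch: allow at most maxPerSecond lines per second from each call site,
// counting what was dropped so the log can say "N similar lines throttled"
// once instead of spamming.
#include <ctime>
#include <map>
#include <string>
#include <utility>

struct ThrottleState { time_t window; int emitted; int suppressed; };
static std::map<std::pair<std::string, int>, ThrottleState> throttleTable;

bool shouldEmit(const char* file, int line, int maxPerSecond, int* droppedLastWindow)
{
  ThrottleState& st = throttleTable[std::make_pair(std::string(file), line)];
  time_t now = time(NULL);
  *droppedLastWindow = 0;
  if (now != st.window) {            // a new one-second window has started
    *droppedLastWindow = st.suppressed;
    st.window = now;
    st.emitted = 0;
    st.suppressed = 0;
  }
  if (st.emitted < maxPerSecond) {
    st.emitted++;
    return true;                     // caller writes the line as usual
  }
  st.suppressed++;                   // counted now, reported when the window rolls
  return false;
}

writeToLog could call this first and, whenever droppedLastWindow comes back nonzero, emit a single "N lines throttled from this statement" line before the new one, so the reader knows something was suppressed without the suppression notice itself flooding the log.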

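And for the performance point, the ring-buffer-plus-writer-thread idea might be shaped roughly like this. Again a sketch under assumptions of my own: the buffer size, filename, and overwrite-oldest policy are arbitrary, and actually lowering the writer thread's priority is left out.

// Sketch: log producers copy formatted lines into a ring buffer; a separate,
// lower-priority thread drains it to disk so the hot path never blocks on I/O.
#include <array>
#include <condition_variable>
#include <fstream>
#include <mutex>
#include <string>

class RingLog {
public:
  void push(const std::string& line) {
    std::lock_guard<std::mutex> lock(mtx);
    buffer[head % buffer.size()] = line;
    head++;
    if (head - tail > buffer.size())   // full: overwrite the oldest entry
      tail = head - buffer.size();
    ready.notify_one();
  }

  void drainLoop() {                   // run this on a separate, low-priority thread
    std::ofstream out("app.log", std::ios::app);
    for (;;) {
      std::unique_lock<std::mutex> lock(mtx);
      ready.wait(lock, [this] { return tail != head; });
      std::string line = buffer[tail % buffer.size()];
      tail++;
      lock.unlock();
      out << line << '\n';             // disk I/O happens off the hot path
    }
  }

private:
  std::array<std::string, 4096> buffer;
  size_t head = 0, tail = 0;
  std::mutex mtx;
  std::condition_variable ready;
};

The "bonus points" part is that a crash handler can walk the same buffer from tail to head and write it into the dump, so the last few thousand log lines survive even if the writer thread never got to them.
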
So what goes into a text log? Basically, anything you can't determine by any other means. If you were troubleshooting a video game, for instance, and you wondered 'what state was the boss in when he suddenly flew up into the sky?', you probably need to log changes to the boss' state.

Which leads me to the point - you want to keep your information as compact as possible. Logging changes is a pretty good tactic, since the reader can answer any state question by reading backwards to the last state change. Likewise, if you have an if...then...else statement, in many cases you need only log one side (the LESS common one), since the reader can determine that if the line doesn't appear, the other path was taken. (However, if there's related information you always need, emitting it before or after the IF statement with the data needed to determine which path was taken is also good.)

Good debug lets you read a function beside the logfile, and know exactly what that function did and why. Good debug also does not emit information that you can infer from other information, unless it's so much effort to make that inference that having the computer tell you actually saves time.

As a for instance, consider this:

log(DBG, "Pointer value is %p", ptr);
if (ptr == NULL) {
  log(ERR, "Pointer was NULL, not responding.");
  res=0;
} else {
  log(DBG, "Pointer is valid, doing something useful.");
  res=ptr->doSomethingUseful();
  log(DBG, "Got %d from doSomethingUseful()");
}
log(DBG,"Final result is %d", res);

You can certainly tell EXACTLY what this block of code did.

Pointer value is 0x00000000
Pointer was NULL, not responding.
Final result is 0

...or...

Pointer value is 0x12345678
Pointer is valid, doing something useful.
Got 42 from doSomethingUseful()
Final result is 42

It's explicit, but your log will fill pretty fast if you're that verbose. Consider this case, which gives you all the same information:

if (ptr == NULL) {
  res=0;
} else {
  res=ptr->doSomethingUseful();
}
log(DBG,"Pointer was %p, Final result is %d", res);

Pointer was 0x00000000, Final result is 0

...or...

Pointer was 0x12345678, Final result is 42

Mind you, this does lose the ERR level message in the NULL pointer case; if that message is important, it would be just fine to keep it. You can generally consider messages at different levels to be in different classes.

Which brings me to the concept of what goes into a message. In general, you should pick a certain level (such as INFO) at and above which messages are intended for your customers. Customer-facing debug messages should attempt to communicate using human language (not numbers). Furthermore, high-level customer messages, such as a WARNING or ERROR, must tell the customer what to do about it. If there is nothing they can do, it doesn't justify being a WARNING or an ERROR.

For instance, this is useless:

if (ptr == NULL) {
  log(ERR, "NULL ptr... about to crash.");
}

You've told the end user, and there is literally nothing they can do about it. Just make it a DBG level for the poor sap who needs to reverse engineer the crash, and maybe spew out some other relevant information if you have it. But don't tell the user. However, this might be helpful:

if (ptr == NULL) {
  log(ERR, "A failure has been detected. Please safely shut down and restart the system. Contact support to collect troubleshooting information.")
}

Human communication. Directions on what to do about it. Maybe the software should initiate its own shutdown, if that makes sense. Either way, far more useful to the person who suddenly gets the big red message.

That's it for now.

