Sunday, August 26, 2018

You're Doing it Wrong: Debug

One of the last things that gets any attention in any large software project is the debug system - which is amazing because the vast majority of time spent developing a large software project is debugging. You'd think people would want to make it easier. You'd think that EVERYONE would want to make it easier. From the bean counters who are paying ridiculous sums of money for every minute of time, down to the individual contributor who knows that if they actually reported the true number of minutes that went into the product, the company would go bankrupt.

So what does a debug system do, and why does it get so little attention?

A debug system provides the tools and information necessary to identify the causes of undesired behavior. That's it, really, and what that means may vary some from project to project, but in most cases it boils down to some kind of log file that records the state of the software as it runs. Most commonly this is a human readable file (but it doesn't have to be), and is used after the fact to determine how some undesired event occurred.

Why is it so neglected? Well, people just seem to assume that they are perfect, therefore everyone else is perfect too. Perfect people never make mistakes - they never write incorrect code, and they certainly never input bad values or click things in the wrong order. Naturally, the systems people work with are also perfect, and never fail or get viruses or suffer bad RAM or otherwise pass garbage into a program. So, since nothing can every go wrong, there's no need to check for errors or report events that occur. Obviously, people who believe this should not lead software teams.

In general, debug systems are also not on people's radar when they start a new project. They are focused on the awesome new function they want to provide, and are not thinking much about how they will troubleshoot that function when it ships.

In the case where the undesired event is an outright crash, there's usually some form of post-mortem analysis. In Linux, for instance, applications generally write 'core' dumps. Windows applications can do the same thing (in the form of 'minidumps' -- and developers who need to support Windows software would be well advised to enable these.) Both of these are binary files that capture the state of the software at the moment of failure, allowing you to load them into a debugger and examine just about any part of the system that might interest you - a moment frozen in time.

But although core dumps are a great first line of defense, they are often not enough. For instance, you can see that your pointer is NULL, and you can see that you crashed trying to deference it, but you can't figure out WHY it became NULL. This is because you only have a snapshot of the exact instant that the NULL was dereferenced. Some people would just add a NULL check right there and claim success - those people should not lead software teams.

A text log is most application's next line of defense. There's a pretty common evolution of text logging in large applications. The first evolution is just a simple dump of lines of text into a log. To avoid stressing the system or filling the disk, log is kept to a minimum or even programmatically disabled except in special cases. As log lines are added, eventually someone gets a line of text in a spinloop, which fills the disk, so filtering for duplicate lines is added. Then the increasingly complex system needs more and more debug, some of which becomes very frequent and is needed but is not spinning, so we add throttling. Then the amount of debug available starts to become unwieldy, and so it is reorganized into classifications (such as severity levels, and possibly subsystems). Then the fact that debug needs to be manually turned on and recompiled starts to be a problem, so debug is made programmatically controllable. Now people are so used to having this great information that when a problem occurs at 2am, they want the log information. But it was off, as the system can't cope with logging everything all the time. So the writer is optimized and logs are on. Finally, although this step sometimes comes much earlier, someone gets tired of the disk getting filled up, and log rotation is added. At this point, you pretty much have a basic text logging system.

So to recap, you might as well start with:
-classifications. Classify your message severity (DEBUG, INFORMATION, WARNING, ERROR, for instance) and possibly by subsystem, if appropriate.
-dynamic. Make it possible to turn logging on and off by classification. Recommend that users leave it on if possible.
-consistent. All systems should emit debug - but see below for what.
-throttling. Should be able to detect if the same debug is being emitted more frequently than is actually useful to a human reader. If you wouldn't read every single line, the computer probably shouldn't write every single line. "x lines per second" from a single statement usually works well enough, but your needs may vary. Make sure the log includes mention of throttled statements so the reader knows why it's not showing, and make sure that mention itself doesn't spam the log!
-rotation. The system should be able to track and control the amount of disk space consumed by logs, while making a reasonable amount of data available for troubleshooting. How much depends on the system and the needs.
-performant. Debug systems need to execute quickly enough to not impact the primary purpose of the system. By designing everything up front, you often get this one out of simple necessity. The simplest first pass is usually a ring buffer that debug statements are copied into, with a separate, lower priority thread writing the ring buffer out to disk. For bonus points, have your crash handler dump the ring buffer too. ;)
-information. I didn't mention this above, but your log functions should almost always be a macro so you can automatically include the filename and line number in them. ie:
#define log(LEVEL,args) writeToLog(__FILE__,__LINE__,LEVEL,args)

So what goes into a text log? Basically, anything you can't determine by any other means. If you were troubleshooting a video game, for instance, and you wondered 'what state was the boss in when he suddenly flew up into the sky?', you probably need to log changes to the boss' state.

Which leads me to the point - you want to keep your information as compact as possible. Logging changes is a pretty good tactic, since the reader can answer any state questions by reading backwards to the last state change. Likewise, if you have an if...then...else statement with two sides, in many cases you need only log one side (the LESS common one), since the reader can determine if the log doesn't appear, the other path was taken. (However, if there's related information you always need, emitting it before or after the IF statement with the data needed to determine which path was taken is also good.)

Good debug lets you read a function beside the logfile, and know exactly what that function did and why. Good debug also does not emit information that you can infer from other information, unless it's so much effort to make that inference that having the computer tell you actually saves time.

As a for instance, consider this:

log(DBG, "Pointer value is %p", ptr);
if (ptr == NULL) {
  log(ERR, "Pointer was NULL, not responding.");
} else {
  log(DBG, "Pointer is valid, doing something useful.");
  log(DBG, "Got %d from doSomethingUseful()");
log(DBG,"Final result is %d", res);

You can certainly tell EXACTLY what this block of code did.

Pointer value is 0x00000000
Pointer was NULL, not responding.
Final result in 0


Pointer value is 0x12345678
Pointer is valid, doing something useful.
Got 42 from doSomethingUseful()
Final result is 42.

It's explicit, but your log will fill pretty fast if you're that verbose. Consider this case, which gives you all the same information:

if (ptr == NULL) {
} else {
log(DBG,"Pointer was %p, Final result is %d", res);

Pointer was 0x0000000, Final result is 0


Pointer was 0x12345678, Final result is 42

Mind you, this does lose the ERR level message in the NULL pointer case, which if it was important, it would be just fine to keep. You can generally consider messages at different levels to be in different classes.

Which brings me to the concept of what goes into a message. In general, you should assign a certain level (such as INFO) for which that level and above is intended for your customers. Customer facing debug messages should attempt to communicate using human language (not numbers). Furthermore, high level customer messages, such as a WARNING or ERROR, must tell the customer what to do about it. If there is nothing they can do, it doesn't justify being a WARNING or an ERROR.

For instance, this is useless:

if (ptr == NULL) {
  log(ERR, "NULL ptr... about to crash.");

You've told the end user, and there is literally nothing they can do about it. Just make it a DBG level for the poor sap who needs to reverse engineer the crash, and maybe spew out some other relevant information if you have it. But don't tell the user. However, this might be helpful.

if (ptr == NULL) {
  log(ERR, "A failure has been detected. Please safely shut down and restart the system. Contact support to collect troubleshooting information.")

Human communication. Directions on what to do about it. Maybe the software should initiate its own shutdown, if that makes sense. Either way, far more useful to the person who suddenly gets the big red message.

That's it for now.

Thursday, August 16, 2018

You're doing it wrong: Coding Styles

There has been lots and lots written about coding styles, and I'm writing still more. My basic premise is the same as most of the other articles - everyone else is wrong, so listen to me.

I see the same basic issues pop up over and over again in coding guidelines. The most egregious, or at least the ones that come to mind as I write this, follow:

  • Detailed Guidelines that cover every conceivable syntax case: Here's the first rule. If your coding guidelines take up more than a single page of normally written text - including properly spaced examples - you've got too many rules. Remember that the purpose of software development is to create code, and the purpose of having coding guidelines is to reduce bugs. If your guideline can't reasonably be defended to reduce common bugs, it isn't needed. This is subjective, but this very basic guideline can trim the size of many lists. If you have too many, people won't even remember them, let alone use them.
  • Getting picky over the little things: It's very common to enforce things like whitespace. This tends to be pedantic after a point. In truth, I approve of little things like aligning lists of initializers or cases. (Some people are ridiculously against this, their defense is they don't want to read code in a code review. Use 'ignore whitespace' then. Aligning the code leads to a faster code review because you can just scan the list, instead of searching each line for the important parts. I'll complain about code reviews again someday.) Ultimately, the purpose of software development is to create code, and the purpose of having coding guidelines is to reduce bugs. It's rare that tabs versus spaces break the code.
  • Created by a secret cabal: it's pretty common that coding guidelines are developed by the senior developers and everyone gets to follow them. But unless you're a mega-corp with a company-wide policy and you can't change the rules or your job, the guidelines should be developed by the entire team - even that kid who just finished college and it's his first day and he hasn't even done orientation or dealt with IT yet. There are two very important reasons - it provides a feeling of contribution, and you all have to deal with them - so you might as well agree on them. Be diplomatic but firm about unnecessary rules - but if someone is very persistent and nobody else really cares - maybe consider adopting the rule, even if it is tabs instead of spaces. In truth, the purpose of software development is to create code, and the purpose of having coding guidelines is to reduce bugs. If people don't like the rules, they won't follow them or enforce them, and you've just wasted everyone's time.
  • Not well communicated: I think just about every rant in every field can include this bullet point. You can't just post a note on a semi-private IM chat channel and say you've communicated the guidelines to the group. Communication means making sure EVERYONE has had their say. There are two important parts there - "EVERYONE", and "making sure". Push notifications might cover everyone (might), but if you didn't make sure, you didn't communicate. You broadcast. There's a difference. In the end, the purpose of software development is to create code, and the purpose of having coding guidelines is to reduce bugs. If people don't even know about the guidelines, it's just wasted time (not to mention that means they didn't get their input as in the previous point).
  • Based on the latest trends: also known as "hype-driven development", trends should never be enforced. Most of the time they don't last, and the rest of the time only one or two of your developers know what they are and what they mean anyway. This means that coding the latest cool trick or using some amazing style that you found on a web page will only serve to confuse everyone else. Do note, the purpose of software development is to create code, and the purpose of having coding guidelines is to reduce bugs. If people don't even understand the intent of the whizz-bang feature, they can't effectively use it.
So what is important, then? Coding guidelines should aim to make the code readable. Coding guidelines should include comments - hell yes they should. Comments describe what the code is supposed to do and why it's doing it the way that it is (comments shouldn't describe what the code is doing unless it's really unclear). Coding guidelines should discourage 'tricks' that make the code hard to read (even if it works and is neato-keen) and encourage styles that let the compiler help find bugs. But most importantly, coding guidelines should exist to help the programmer get their job done. 

A secondary goal is to help the code reviewer get the code reviewed, because of course you are using code reviewers who are enforcing the guidelines, right? The code reviewer has their own code to get written, and needs to get through the review as quickly as possible. Basic syntax rules based on common errors make it easy to check if a code should be rejected without having to analyze whether it's actually faulty, which is a huge time saver. At the same time, reviewers should be generous when considering non-bug-causing issues, like whitespace alignment. A programmer who is running through his code fixing whitespace and unaligning lists and alphabetizing includes is not writing new code or testing existing code, which one would expect, is what we'd prefer they were doing. At least if I was paying them, that's what I'd prefer. In addition, after doing all those meaningless cleanups, they then need to repeat all the testing they previously did. Whitespace cleanup costs real money, and software developers are generally not cheap. Ask yourself if changing that tab to a space is really worth $500 or more (time to checkout, fix, build, re-test, commit, restart the code review, reviewers to come check it again...), or if you can just get it next time.

Monday, May 14, 2018

First run of Dragon's Lair Prototype Boards

In order to advance on the hardware side of this, I learned KiCAD and laid out two PCBs. One is the intended final board, and one is a prototyping board that accepts the same ZIF socket for the flash chips as my EPROM programmer does.

I got the PCBs back from OSH Park, and there are a few screwups...

I haven't checked them electrically yet, but there are two main issues - one is OSH's and one is mine.

The OSH one is those connector tabs you see all around the board. I wouldn't generally care, but they put one on the end of the cartridge connector. This has two side effects. First, I need to file it off to make it fit into the port (okay, whatever).

Secondly, and more importantly, whatever method they use to break apart the boards physically damaged the connector:

It's a little hard to see in this photo, but the board is coming apart at the top layer and has actually torn a millimeter or so between the traces. You can most easily see it in the distortion on the one trace.

There's no similar damage on the bottom, so I assume that's the direction they flexed the board to break it apart from the next one.

I'll have to drop them an email to ask how to avoid that on production. It's likely I can use at least one of these boards for prototyping and testing.

The other problem is I apparently didn't measure out the adapter board correctly, because it's completely wrong. It has two parts - the large DIP footprint and the smaller 10-pin header. I managed to make the holes too small for the DIP part and use the wrong pitch and misaligned the 10-pin header. GO ME!

I have three of the adapters, just in case, so I think I can fake it by bending the very long pins on the 10-pin header, and I'll just have to force it or grind the pins to fit the DIP (or use my I-do-this-every-time trick of soldering smaller leads onto the pins... sigh. Someday I'll remember that sockets use thicker pins...)

Anyway, in the middle of moving, but I need to try and test this out as quick as possible, because the next challenge I'm facing is that the 1gigabit chips are obsolete and getting hard to find. OH BOY.

Edit: As Captain C used to tell me: 'Never underestimate the power of force!' I literally used a hammer. ;)

Wednesday, March 7, 2018

It's not Lego!

While pondering why it is that the state of software development has reached the point that it's at (which very patient long term readers will remember is my usual gripe. Blogger says there are 3 of you! ;) )... where was I?

Right, so anyway, projects seem to be planned very strangely. There are cycles and cycles of proposals, and pre-pre-design documents (though rarely an ACTUAL design document). We have processes designed to break tasks down into a bullet point's worth of work, and we consider that planning. We have code management systems that make actually looking at the code the hardest thing to actually do (I'm looking at YOU, Gitlab...)

And while going through all this in my head and trying to reconcile with the task I'm trying to complete myself, I grumbled to myself "code is hard". And that was the revelation.

Oh, I've always known that it's hard to write good code. Anyone who tells you otherwise is either a liar or a genius, and the odds are heavily stacked in one direction. But I've never had that thought in this context. And that's when it all fell together for me.

Coding is increasingly treated as one of the least important parts of the software development process, maybe second only to testing (which will be a rant one day). And yet coding (and testing) are the only parts of the process that actually matter. They are the only parts that actually produce product at the end of the day.

And so as I was thinking that coding is just brushed off as the least important part of the planning process. It seemed like the assumption was that it was going to be easy to get it right, and testing would be covered by the process. So don't stress that part. And then I realized, there IS a style of development for which this is so, and it's increasingly proposed as the right way to do things.

And that style is: don't do it.

Think about it. You've been on the project that said "can't we just buy a package that does that?" How about "is there a library we can use?"

And so I'm of two minds here. The first one is that, well, yes, re-use is a good thing. Yes, many many many tasks have already been developed in one form or another. And no, I don't desperately want to write a USB stack from scratch. So we can just go out, and take all these pre-written libraries, and write a few lines of glue code, and not only is it likely to work, but it's been mostly pre-tested.(Or, so we blindly assume...) Hell, at this point we're not even the performance bottleneck, so let's just write it in script!

The other mind is the reality that I live in anyway where the exact things we want to do don't exist, and so we're building them. But we still use that building-block mentality. We don't consider that maybe nobody actually knows how to do that, and so it's going to take time to get it right. There is still a world where the hard code isn't written yet.

We've reached a point where many people assume that if something doesn't exist, it's because it's too hard to be possible. Don't you believe it. It only means that someone hasn't done it yet, and nothing more. Get out there. Write the hard code. Get it wrong! Then figure out WHY it was wrong and get it right. Odds are, by that point you'll have created something better than anyone before you did.

Oh, and if you're a project planner, realize that hard tasks take time. ;)

Tuesday, November 7, 2017

Investigation continues...

I went through and verified the pinout of the cart, and burned a new AVR and dropped it on the board to test, but it's still working just as flakey. A couple of things were interesting.. one is that when it partially works, it seems to get most of the bytes right (because the program selection is mostly correct). That seems to suggest either I am lucky or the address code is working. (I suspect the latter - if I was lucky I wouldn't be having an issue).

Putting the logic analyzer on it was all over the place, to the point where I couldn't make sense of what I was seeing. But I also determined that my console's power switch is wearing out... so I need to make a repair there to continue anyway.

One thing that has crossed my mind is that with the new compiler, the code may be running too fast. I check the MDIR and MODE pins immediately after detecting GSEL -- maybe that's too fast? When I get the logic analyzer properly on the bus I will recheck that theory.

This shouldn't even be an issue, but I am leaning back towards the ROM-based cart.... I just want this part to work. ;) But in the meantime, I relaid-out the cart to use the ATTINY861 instead - it's a 20 pin part instead of 14. So it doesn't look as much like a GROM, but it has enough pins to bring out the ISP header. I also positioned it so, in theory, the ZIF socket will fit on the board when in the console. I haven't ordered this board or the new parts yet though...

I probably should test programming one of the 128MB flash chips as well -- if my new programmer can pull that off, maybe I should stop wasting my time with the GROM emulation. (But darn it, this part was supposed to "just work". ;) )