Thursday, October 11, 2018

I'm Doing it Wrong: Learning New Skills

We all need to learn at some point in our lives. If we're computer people, we pretty much need to learn all our lives.

Some people read, some take classes; increasingly over the years, I hit my head against a problem over and over again until the problem or my head breaks. I call this "being stupid". But it's made my head pretty hard.

Most recently, I was finally doing my first VHDL, for the Dragon's Lair cartridge that I've previously mentioned. I was pretty pleased that my understanding of the basics got the basic bank switching latch working pretty easily. And then someone on Twitter asked whether it would support GROM too, so that the cartridge would work on a 2.2 console.

I didn't see why; after all, nobody still willingly uses a 2.2 console, which was notorious in the TI-99 community as TI's effort to prevent AtariSoft cartridges from working with their machines. Literally, they changed the machine to lock out AtariSoft. Since AtariSoft made some of the best games, and this move made them drop the TI-99 entirely, people were not impressed.

But this person noted that they grew up on a 2.2 console, and were always annoyed that none of the good games worked on their machine. Thus, seeing this new licensed title work on a v2.2 would just feel good. Given that nostalgia is the only sensible reason to still be messing with these computers 40 years after they were released, I had to agree.

I chose a really small CPLD - in fact the smallest I could get away with, so it was a tight fit. Eventually I did get the GROM bus reads working, with an internal 8-bit address counter which would provide 256 bytes of GROM -- more than enough to bootstrap the ROM part of the cartridge. However, every time I put the address increment into the design, it would work in simulation but lock up the TI - sometimes immediately, sometimes at random, but it didn't work.

As I noted - I am brand new to VHDL. I actually HAVE read a book cover to cover (a couple of times - and the smart version of me would give you the title as it was really nice, but this is the dumb version of me. Ask in a comment and I'll look it up later). But practice is different from reality. Further, searching Google for VHDL questions was even worse than searching Google for C questions. There were fewer hits and the answers were typically even less useful (though still usually of the 'why do you want to do that?' vein...)

But I was stuck. Now that I was this close, I knew it was possible and I couldn't release the idea. I WOULD have GROM support. I was prepared to go as low as 16 bytes if I needed to (I was pretty sure less than that wouldn't be enough for the header and startup code).

When you get stuck, it's good to look at things from a different point of view. I thought to myself, "I have GROM working if I don't increment. And I've always told anyone foolish enough to listen that the GPL interpreter doesn't rely on auto-increment anyway, it sets the address for every byte. So maybe I don't NEED to increment for a simple boot..."

So I looked into it. The code that builds the selection menu ("PRESS 1 FOR TI BASIC, 2 FOR DRAGONS LAIR") used the GPL 'MOVE' command, which absolutely does set the address for every single byte. But the code that scans for programs to display actually was coded in the assembly language ROM side of the system.

The cartridge header has two pointers that matter in this case. The first points to the list of programs on the cartridge, the second points to a particular program's boot address. When the ROM code builds the list of programs, it did use auto-increment on those 2-byte pointers.

After thinking about it for a while... I realized that even in 256 bytes, I had tons of room to spare. So I created pointers with repeated bytes:

CARTRIDGE BASE: 0x8000 (not a pointer, just where we are starting)
PROGRAM LIST: 0x8181 (since both bytes had to be the same, this was the earliest I could start)
BOOT ADDRESS: 0x9191 (due to only 256 bytes being decoded, this is 16 bytes later, leaving room for the program name)

So due to the address register size, the real values accessed would be 0x8081 and 0x8091, but the system didn't need to know that!
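To make the wrap-around concrete, here's the address math as a sketch in plain C (my own illustration - the names and exact latch behaviour are assumptions, not the CPLD code):

```c
#include <stdint.h>

/* Sketch: with only an 8-bit address register, the CPLD keeps just the
   low byte of the 16-bit GROM address, so every address the console sets
   folds into the 256-byte window at the cartridge base (0x8000 here). */
#define CART_BASE 0x8000u

static uint16_t effective_address(uint16_t console_addr) {
    uint8_t latched = (uint8_t)(console_addr & 0x00FFu); /* the 8-bit register */
    return (uint16_t)(CART_BASE | latched);
}
```

Feeding it the header pointers shows the fold: 0x8181 lands on 0x8081 and 0x9191 on 0x8091 - and because both bytes of each pointer are identical, it doesn't matter which half of the address write the logic actually keeps.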

Then I wrote a quick little GPL program that cleared the screen, loaded the real cartridge ROM vector, and jumped to it.

I was thrilled when I rebooted the TI and my menu option appeared perfectly! Then I selected it, and the screen filled with a strange character and the system locked up.

It didn't take much digging to see what was wrong. The first command in GPL for clearing the screen was "ALL 32", which means to fill the screen with character 32, which is of course, the space. The opcode for ALL is 7. Running in the emulator showed that the screen was filled with character 7. A dig through the source code to the GPL interpreter (one nice thing about 40 years later is how much has been documented!) taught me that my assertion about the interpreter not using increment was ALMOST right: increment WAS used for fetching the arguments to instructions.

Well bugger... I spent an evening trying to work around this. I looked for exploits in the interpreter, I looked for clever opcode abuse that might let me get assembly control, I looked for single byte opcodes that might be useful and I even looked for pairs of bytes that might be interpreted as useful opcodes. Ultimately, I had to conclude that this approach was a dead end, and the only way was to make the CPLD work correctly.

At this point I began to hammer on the code. Change, upload, test, observe. Change, upload, test, observe - for hours at a time. I attached my logic analyzer to various combinations, I ran simulations (which always worked, of course), all to squeeze out just a single bit of new information for my head.

I envision this stage much like grinding in a dungeon crawler game. I keep going up against the boss, and he keeps sending me back to the village inn to recover, but each time I get a tiny bit of XP. I come back a little smarter, and keep trying.

At one point - I got it! Almost. The increment was working but the address was off by one. Confident that I knew what I was doing, I decided to try a different architecture to solve that. The new architecture failed, but in my over-confidence, I had failed to back up the almost-working code. Sure enough, I had forgotten what I did. Stupid.

Another week of hammering on it led to no results, and so finally I asked for help. Another stupid - I was worried about showing off my HDL, but I should have asked earlier. Anyway, my friend didn't spot anything immediately wrong, but we ran through some examples on the whiteboard. I gained double-digit XP for that session.

Then I looked at a similar project (but on a larger device). This one did things slightly differently from me, and I had shied away from the approach because it took more logic than what I had been trying (and didn't fit). But I decided to declare a loss, and just port his project.

So I did, stripping out all the pieces that didn't apply to me. This compiled and even fit, but when I tried it - it locked up the console just the same. (I still need to test whether that person's full project works on my troubled console; if not, I have some XP I can share with him now...) Anyway, because it didn't "just work", I went back to mine.

By this point I had started to learn how to optimize a little bit, and I reworked my code to use an increment concept similar to the one this other project used. This was costing me in gates, so I disabled the ROM side to give myself space to increment (removing 14 latched bits saves some amount of logic!) And after a few iterations, GROM was working! Fully! I was so thrilled, but the boss was not defeated yet!

I added the ROM side back in, and sure enough, I was over. But only by ONE macrocell. As we all know, every program can be optimized by one instruction (and has one bug, meaning every program can be optimized to a single instruction which doesn't work). Surely every circuit can be optimized by one component! (No, but bear with me). I noticed the toolchain complaining about the complexity of my delay mechanism...

I'm proud of this, because this one part earned me the last XP I needed to beat the boss and get a fully working CPLD.

I had a simple mechanism to control my gating of the GROM to the TI bus, because it didn't seem to like it when I was too quick getting ON or OFF the bus. I had two signals controlled by the GROM clock... when the TI asked for GROM, first one went high, then on the next cycle the other one did (and gated the memory). When the TI released the bus, they released in the reverse order. This seemed to keep GROM stable. (My previous GROM work used an 8MHz AVR. For all its speed, it's still slow compared to hardware.)

Anyway, I did this using a sequence of IF statements, one for up and one for down. I realized that the tool might be creating a fairly complex circuit to support that exact behaviour, and wondered if I could simplify it. I drew a truth table and realized that sure enough, there was a combinational relationship. I replaced the two if...then...else blocks with two simple AND/OR statements (still in the clock process). This freed up the one macrocell that I needed.
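For the curious, the staggered behaviour can be modelled like this in C (a sketch under my own naming - I'm reconstructing the relationship, not quoting the VHDL): each clock edge computes the two signals combinationally, one OR and one AND, instead of walking through nested IFs.

```c
#include <stdbool.h>

/* Sketch of the two-signal bus-gating sequence, one clock edge per call.
   'req' is the console requesting GROM; sel1 rises first and falls last,
   sel2 (which actually gates the memory) rises last and falls first. */
struct gating { bool sel1, sel2; };

static void clock_step(struct gating *g, bool req) {
    bool next1 = req || g->sel2;  /* stays up until sel2 has dropped   */
    bool next2 = req && g->sel1;  /* only rises once sel1 is already up */
    g->sel1 = next1;
    g->sel2 = next2;
}
```

Stepping it shows the sequence: assert req and sel1 goes high one cycle before sel2; drop req and sel2 falls a cycle before sel1 - the reverse order, from nothing more than one AND and one OR.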

So that's it... (Hopefully - my other stupid is declaring success before testing it on other consoles!) I can finally move on to the software. Ask for help early; don't be dumb like me. Even if the help itself can't be provided, you gain knowledge by both asking smart questions and listening to the answers. Eventually that's how you get to the end.

I posted a video of it running here: https://www.youtube.com/watch?v=W4yCPdeB4nc

Sunday, August 26, 2018

You're Doing it Wrong: Debug

One of the last things that gets any attention in any large software project is the debug system - which is amazing because the vast majority of time spent developing a large software project is debugging. You'd think people would want to make it easier. You'd think that EVERYONE would want to make it easier. From the bean counters who are paying ridiculous sums of money for every minute of time, down to the individual contributor who knows that if they actually reported the true number of minutes that went into the product, the company would go bankrupt.

So what does a debug system do, and why does it get so little attention?

A debug system provides the tools and information necessary to identify the causes of undesired behavior. That's it, really, and what that means may vary some from project to project, but in most cases it boils down to some kind of log file that records the state of the software as it runs. Most commonly this is a human readable file (but it doesn't have to be), and is used after the fact to determine how some undesired event occurred.

Why is it so neglected? Well, people just seem to assume that they are perfect, therefore everyone else is perfect too. Perfect people never make mistakes - they never write incorrect code, and they certainly never input bad values or click things in the wrong order. Naturally, the systems people work with are also perfect, and never fail or get viruses or suffer bad RAM or otherwise pass garbage into a program. So, since nothing can ever go wrong, there's no need to check for errors or report events that occur. Obviously, people who believe this should not lead software teams.

In general, debug systems are also not on people's radar when they start a new project. They are focused on the awesome new function they want to provide, and are not thinking much about how they will troubleshoot that function when it ships.

In the case where the undesired event is an outright crash, there's usually some form of post-mortem analysis. In Linux, for instance, applications generally write 'core' dumps. Windows applications can do the same thing (in the form of 'minidumps' -- and developers who need to support Windows software would be well advised to enable these.) Both of these are binary files that capture the state of the software at the moment of failure, allowing you to load them into a debugger and examine just about any part of the system that might interest you - a moment frozen in time.

But although core dumps are a great first line of defense, they are often not enough. For instance, you can see that your pointer is NULL, and you can see that you crashed trying to dereference it, but you can't figure out WHY it became NULL. This is because you only have a snapshot of the exact instant that the NULL was dereferenced. Some people would just add a NULL check right there and claim success - those people should not lead software teams.

A text log is most applications' next line of defense. There's a pretty common evolution of text logging in large applications. The first evolution is just a simple dump of lines of text into a log. To avoid stressing the system or filling the disk, logging is kept to a minimum or even programmatically disabled except in special cases. As log lines are added, eventually someone ends up with a log line in a spinloop, which fills the disk, so filtering for duplicate lines is added. Then the increasingly complex system needs more and more debug, some of which becomes very frequent and is needed but is not spinning, so we add throttling. Then the amount of debug available starts to become unwieldy, and so it is reorganized into classifications (such as severity levels, and possibly subsystems). Then the fact that debug needs to be manually turned on and recompiled starts to be a problem, so debug is made programmatically controllable. Now people are so used to having this great information that when a problem occurs at 2am, they want the log information. But it was off, as the system can't cope with logging everything all the time. So the writer is optimized and logs are left on. Finally, although this step sometimes comes much earlier, someone gets tired of the disk getting filled up, and log rotation is added. At this point, you pretty much have a basic text logging system.

So to recap, you might as well start with:
-classifications. Classify your message severity (DEBUG, INFORMATION, WARNING, ERROR, for instance) and possibly by subsystem, if appropriate.
-dynamic. Make it possible to turn logging on and off by classification. Recommend that users leave it on if possible.
-consistent. All systems should emit debug - but see below for what.
-throttling. Should be able to detect if the same debug is being emitted more frequently than is actually useful to a human reader. If you wouldn't read every single line, the computer probably shouldn't write every single line. "x lines per second" from a single statement usually works well enough, but your needs may vary. Make sure the log includes mention of throttled statements so the reader knows why it's not showing, and make sure that mention itself doesn't spam the log!
-rotation. The system should be able to track and control the amount of disk space consumed by logs, while making a reasonable amount of data available for troubleshooting. How much depends on the system and the needs.
-performant. Debug systems need to execute quickly enough to not impact the primary purpose of the system. By designing everything up front, you often get this one out of simple necessity. The simplest first pass is usually a ring buffer that debug statements are copied into, with a separate, lower priority thread writing the ring buffer out to disk. For bonus points, have your crash handler dump the ring buffer too. ;)
-information. I didn't mention this above, but your log functions should almost always be a macro so you can automatically include the filename and line number in them, e.g.:
#define log(LEVEL, ...) writeToLog(__FILE__, __LINE__, LEVEL, __VA_ARGS__)
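As a sketch of the ring buffer idea from the 'performant' point above (all names and sizes here are mine, and a real implementation needs locking or atomics between the writer and the flusher thread):

```c
#include <stdio.h>
#include <string.h>

/* Minimal log ring buffer sketch: fast, fixed-size writes up front, with
   the slow disk write deferred to a separate flush call (which would run
   on a lower-priority thread - or from the crash handler). */
#define RING_SLOTS 8u
#define SLOT_LEN   128

static char ring[RING_SLOTS][SLOT_LEN];
static unsigned head = 0, tail = 0;  /* head: next write, tail: next flush */

static void ring_log(const char *msg) {
    snprintf(ring[head % RING_SLOTS], SLOT_LEN, "%s", msg);
    head++;
    if (head - tail > RING_SLOTS)    /* buffer full: drop the oldest entry */
        tail = head - RING_SLOTS;
}

static void ring_flush(FILE *out) {
    for (; tail != head; tail++)
        fprintf(out, "%s\n", ring[tail % RING_SLOTS]);
}
```

Note the overwrite policy: when the writer laps the flusher, the oldest entries are silently lost, which is usually the right trade for a debug log - the most recent history is what you want at crash time.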

So what goes into a text log? Basically, anything you can't determine by any other means. If you were troubleshooting a video game, for instance, and you wondered 'what state was the boss in when he suddenly flew up into the sky?', you probably need to log changes to the boss' state.

Which leads me to the point - you want to keep your information as compact as possible. Logging changes is a pretty good tactic, since the reader can answer any state questions by reading backwards to the last state change. Likewise, if you have an if...then...else statement with two sides, in many cases you need only log one side (the LESS common one), since the reader can determine that, if the log doesn't appear, the other path was taken. (However, if there's related information you always need, emitting it before or after the IF statement with the data needed to determine which path was taken is also good.)

Good debug lets you read a function beside the logfile, and know exactly what that function did and why. Good debug also does not emit information that you can infer from other information, unless it's so much effort to make that inference that having the computer tell you actually saves time.

As a for instance, consider this:

log(DBG, "Pointer value is %p", ptr);
if (ptr == NULL) {
  log(ERR, "Pointer was NULL, not responding.");
  res=0;
} else {
  log(DBG, "Pointer is valid, doing something useful.");
  res=ptr->doSomethingUseful();
  log(DBG, "Got %d from doSomethingUseful()", res);
}
log(DBG,"Final result is %d", res);

You can certainly tell EXACTLY what this block of code did.

Pointer value is 0x00000000
Pointer was NULL, not responding.
Final result is 0

...or...

Pointer value is 0x12345678
Pointer is valid, doing something useful.
Got 42 from doSomethingUseful()
Final result is 42

It's explicit, but your log will fill pretty fast if you're that verbose. Consider this case, which gives you all the same information:

if (ptr == NULL) {
  res=0;
} else {
  res=ptr->doSomethingUseful();
}
log(DBG,"Pointer was %p, Final result is %d", ptr, res);

Pointer was 0x00000000, Final result is 0

...or...

Pointer was 0x12345678, Final result is 42

Mind you, this does lose the ERR level message in the NULL pointer case; if that message was important, it would be just fine to keep it. You can generally consider messages at different levels to be in different classes.

Which brings me to the concept of what goes into a message. In general, you should designate a certain level (such as INFO) at which that level and above is intended for your customers. Customer-facing debug messages should attempt to communicate using human language (not numbers). Furthermore, high-level customer messages, such as a WARNING or ERROR, must tell the customer what to do about it. If there is nothing they can do, it doesn't justify being a WARNING or an ERROR.
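A sketch of that split (the names are mine, not from any particular logging library): choose a floor level, and anything below it stays engineering-only.

```c
/* Illustrative severity gate: INFO and above are customer-facing;
   DBG stays in the engineering log only. */
enum log_level { DBG = 0, INFO, WARNING, ERR };

static const enum log_level customer_floor = INFO;

static int customer_visible(enum log_level lvl) {
    return lvl >= customer_floor;
}
```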

For instance, this is useless:

if (ptr == NULL) {
  log(ERR, "NULL ptr... about to crash.");
}

You've told the end user, and there is literally nothing they can do about it. Just make it a DBG level for the poor sap who needs to reverse engineer the crash, and maybe spew out some other relevant information if you have it. But don't tell the user. However, this might be helpful.

if (ptr == NULL) {
  log(ERR, "A failure has been detected. Please safely shut down and restart the system. Contact support to collect troubleshooting information.");
}

Human communication. Directions on what to do about it. Maybe the software should initiate its own shutdown, if that makes sense. Either way, far more useful to the person who suddenly gets the big red message.

That's it for now.


Thursday, August 16, 2018

You're doing it wrong: Coding Styles

There has been lots and lots written about coding styles, and I'm writing still more. My basic premise is the same as most of the other articles - everyone else is wrong, so listen to me.

I see the same basic issues pop up over and over again in coding guidelines. The most egregious, or at least the ones that come to mind as I write this, follow:

  • Detailed Guidelines that cover every conceivable syntax case: Here's the first rule. If your coding guidelines take up more than a single page of normally written text - including properly spaced examples - you've got too many rules. Remember that the purpose of software development is to create code, and the purpose of having coding guidelines is to reduce bugs. If your guideline can't reasonably be defended to reduce common bugs, it isn't needed. This is subjective, but this very basic guideline can trim the size of many lists. If you have too many, people won't even remember them, let alone use them.
  • Getting picky over the little things: It's very common to enforce things like whitespace. This tends to be pedantic after a point. In truth, I approve of little things like aligning lists of initializers or cases. (Some people are ridiculously against this; their defense is that they don't want to read code in a code review. Use 'ignore whitespace' then. Aligning the code leads to a faster code review, because you can just scan the list instead of searching each line for the important parts. I'll complain about code reviews again someday.) Ultimately, the purpose of software development is to create code, and the purpose of having coding guidelines is to reduce bugs. It's rare that tabs versus spaces break the code.
  • Created by a secret cabal: it's pretty common that coding guidelines are developed by the senior developers and everyone gets to follow them. But unless you're a mega-corp with a company-wide policy and you can't change the rules or your job, the guidelines should be developed by the entire team - even that kid who just finished college and it's his first day and he hasn't even done orientation or dealt with IT yet. There are two very important reasons - it provides a feeling of contribution, and you all have to deal with them - so you might as well agree on them. Be diplomatic but firm about unnecessary rules - but if someone is very persistent and nobody else really cares - maybe consider adopting the rule, even if it is tabs instead of spaces. In truth, the purpose of software development is to create code, and the purpose of having coding guidelines is to reduce bugs. If people don't like the rules, they won't follow them or enforce them, and you've just wasted everyone's time.
  • Not well communicated: I think just about every rant in every field can include this bullet point. You can't just post a note on a semi-private IM chat channel and say you've communicated the guidelines to the group. Communication means making sure EVERYONE has had their say. There are two important parts there - "EVERYONE", and "making sure". Push notifications might cover everyone (might), but if you didn't make sure, you didn't communicate. You broadcast. There's a difference. In the end, the purpose of software development is to create code, and the purpose of having coding guidelines is to reduce bugs. If people don't even know about the guidelines, it's just wasted time (not to mention that means they didn't get their input as in the previous point).
  • Based on the latest trends: also known as "hype-driven development", trends should never be enforced. Most of the time they don't last, and the rest of the time only one or two of your developers know what they are and what they mean anyway. This means that coding the latest cool trick or using some amazing style that you found on a web page will only serve to confuse everyone else. Do note, the purpose of software development is to create code, and the purpose of having coding guidelines is to reduce bugs. If people don't even understand the intent of the whizz-bang feature, they can't effectively use it.
So what is important, then? Coding guidelines should aim to make the code readable. Coding guidelines should include comments - hell yes they should. Comments describe what the code is supposed to do and why it's doing it the way that it is (comments shouldn't describe what the code is doing unless it's really unclear). Coding guidelines should discourage 'tricks' that make the code hard to read (even if it works and is neato-keen) and encourage styles that let the compiler help find bugs. But most importantly, coding guidelines should exist to help the programmer get their job done. 

A secondary goal is to help the code reviewer get the code reviewed, because of course you are using code reviewers who are enforcing the guidelines, right? The code reviewer has their own code to get written, and needs to get through the review as quickly as possible. Basic syntax rules based on common errors make it easy to check whether code should be rejected without having to analyze whether it's actually faulty, which is a huge time saver. At the same time, reviewers should be generous when considering non-bug-causing issues, like whitespace alignment. A programmer who is running through his code fixing whitespace and unaligning lists and alphabetizing includes is not writing new code or testing existing code, which, one would expect, is what we'd prefer they were doing. At least if I was paying them, that's what I'd prefer. In addition, after doing all those meaningless cleanups, they then need to repeat all the testing they previously did. Whitespace cleanup costs real money, and software developers are generally not cheap. Ask yourself if changing that tab to a space is really worth $500 or more (time to checkout, fix, build, re-test, commit, restart the code review, reviewers to come check it again...), or if you can just get it next time.



Monday, May 14, 2018

First run of Dragon's Lair Prototype Boards

In order to advance on the hardware side of this, I learned KiCAD and laid out two PCBs. One is the intended final board, and one is a prototyping board that accepts the same ZIF socket for the flash chips as my EPROM programmer does.

I got the PCBs back from OSH Park, and there are a few screwups...



I haven't checked them electrically yet, but there are two main issues - one is OSH's and one is mine.

The OSH one is those connector tabs you see all around the board. I wouldn't generally care, but they put one on the end of the cartridge connector. This has two side effects. First, I need to file it off to make it fit into the port (okay, whatever).

Secondly, and more importantly, whatever method they use to break apart the boards physically damaged the connector:



It's a little hard to see in this photo, but the board is coming apart at the top layer and has actually torn a millimeter or so between the traces. You can most easily see it in the distortion on the one trace.

There's no similar damage on the bottom, so I assume that's the direction they flexed the board to break it apart from the next one.

I'll have to drop them an email to ask how to avoid that on production. It's likely I can use at least one of these boards for prototyping and testing.

The other problem is I apparently didn't measure out the adapter board correctly, because it's completely wrong. It has two parts - the large DIP footprint and the smaller 10-pin header. I managed to make the holes too small for the DIP part and use the wrong pitch and misaligned the 10-pin header. GO ME!

I have three of the adapters, just in case, so I think I can fake it by bending the very long pins on the 10-pin header, and I'll just have to force it or grind the pins to fit the DIP (or use my I-do-this-every-time trick of soldering smaller leads onto the pins... sigh. Someday I'll remember that sockets use thicker pins...)

Anyway, I'm in the middle of moving, but I need to try and test this out as quickly as possible, because the next challenge I'm facing is that the 1-gigabit chips are obsolete and getting hard to find. OH BOY.

Edit: As Captain C used to tell me: 'Never underestimate the power of force!' I literally used a hammer. ;)


Wednesday, March 7, 2018

It's not Lego!

While pondering why it is that the state of software development has reached the point that it's at (which very patient long term readers will remember is my usual gripe. Blogger says there are 3 of you! ;) )... where was I?

Right, so anyway, projects seem to be planned very strangely. There are cycles and cycles of proposals, and pre-pre-design documents (though rarely an ACTUAL design document). We have processes designed to break tasks down into a bullet point's worth of work, and we consider that planning. We have code management systems that make actually looking at the code the hardest thing to do (I'm looking at YOU, Gitlab...)

And while going through all this in my head and trying to reconcile with the task I'm trying to complete myself, I grumbled to myself "code is hard". And that was the revelation.

Oh, I've always known that it's hard to write good code. Anyone who tells you otherwise is either a liar or a genius, and the odds are heavily stacked in one direction. But I've never had that thought in this context. And that's when it all fell together for me.

Coding is increasingly treated as one of the least important parts of the software development process, maybe second only to testing (which will be a rant one day). And yet coding (and testing) are the only parts of the process that actually matter. They are the only parts that actually produce product at the end of the day.

And so I was thinking that coding is just brushed off as the least important part of the planning process. It seemed like the assumption was that it was going to be easy to get it right, and testing would be covered by the process. So don't stress that part. And then I realized, there IS a style of development for which this is so, and it's increasingly proposed as the right way to do things.

And that style is: don't do it.

Think about it. You've been on the project that said "can't we just buy a package that does that?" How about "is there a library we can use?"

And so I'm of two minds here. The first one is that, well, yes, re-use is a good thing. Yes, many many many tasks have already been developed in one form or another. And no, I don't desperately want to write a USB stack from scratch. So we can just go out, and take all these pre-written libraries, and write a few lines of glue code, and not only is it likely to work, but it's been mostly pre-tested. (Or so we blindly assume...) Hell, at this point we're not even the performance bottleneck, so let's just write it in script!

The other mind is the reality that I live in anyway where the exact things we want to do don't exist, and so we're building them. But we still use that building-block mentality. We don't consider that maybe nobody actually knows how to do that, and so it's going to take time to get it right. There is still a world where the hard code isn't written yet.

We've reached a point where many people assume that if something doesn't exist, it's because it's too hard to be possible. Don't you believe it. It only means that someone hasn't done it yet, and nothing more. Get out there. Write the hard code. Get it wrong! Then figure out WHY it was wrong and get it right. Odds are, by that point you'll have created something better than anyone before you did.

Oh, and if you're a project planner, realize that hard tasks take time. ;)