Wednesday, May 26, 2021

You're Doing It Wrong - AI

 AI will definitely destroy society. But not the way you think.

Disclaimer: I'm not a machine learning expert, and I know some will disagree strongly. Get your own blog, ya smarmy bastard. ;)

AI, or more correctly in most cases Machine Learning, is increasingly being applied to difficult, abstract problems to give us a yes/no answer to questions even an expert would have trouble with.

There are also, as a result, countless stories of how, AFTER training, it was discovered that these machines were skewed. They examined irrelevant details to come up with the answer, or they revealed biases in the dataset (which, honestly, people should have seen long before even starting).

So machine learning, to give a very simple and mostly wrong description, is the act of wiring up a set of inputs (say, pixels of an image) to a set of outputs (say, "cat", "dog", "martian", "amoeba") through a chain of configurable evaluators. These evaluators, which are not called that by anyone with training in the field, are analogous to neurons in your brain.

The idea is, you show the inputs a picture of a cat, and tell it "cat". The machine tries a few combinations of settings and decides which one gave it "cat" most consistently. You show it a "dog" and it does the same thing, while trying to remember the settings for "cat". Repeat for "martian" and "amoeba". Then repeat the whole process a couple of million times with different pictures randomly selected from the internet. The neurons slowly home in on a collection of settings that generally produce the right output for all of those millions of inputs.
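If you want to see the "nudge the settings until the answers come out right" loop in its smallest possible form, here's a toy sketch: a single artificial neuron learning to separate two made-up classes from two made-up features. It is nothing like a real image classifier, but the training loop has the same shape - guess, compare against the label, adjust.

    /* Toy single-neuron trainer - a sketch of the "nudge the settings" loop,
     * not a real image classifier. The features and labels are made up. */
    #include <stdio.h>

    #define SAMPLES 4
    #define EPOCHS  100

    int main(void)
    {
        /* two made-up features per sample (say, "roundness" and "leg count"),
         * label 1 = "cat", 0 = "not cat" */
        double feature[SAMPLES][2] = { {0.9, 4.0}, {0.8, 4.0}, {0.2, 3.0}, {0.1, 1.0} };
        int    label[SAMPLES]      = { 1, 1, 0, 0 };

        double w[2] = { 0.0, 0.0 };   /* the "settings" (weights) */
        double bias = 0.0;
        double rate = 0.1;            /* how hard to nudge after a mistake */

        for (int epoch = 0; epoch < EPOCHS; epoch++) {
            for (int i = 0; i < SAMPLES; i++) {
                double sum = w[0]*feature[i][0] + w[1]*feature[i][1] + bias;
                int guess = (sum > 0.0) ? 1 : 0;
                int error = label[i] - guess;     /* -1, 0 or +1 */
                /* nudge the settings toward the right answer */
                w[0] += rate * error * feature[i][0];
                w[1] += rate * error * feature[i][1];
                bias += rate * error;
            }
        }

        printf("learned settings: w0=%.2f w1=%.2f bias=%.2f\n", w[0], w[1], bias);
        return 0;
    }

Multiply that by millions of weights and millions of images and you have the gist; the rest is bookkeeping and hardware.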

So you're done! You fed your electronic brain five million images, and it classified them with 99% accuracy! Hooray!

Now you give it a picture of a Martian it has never seen before. "Cat", it tells you confidently.

Well... um... okay, cats have four legs and Martians only three, but we're only 99% perfect. How about this lovely photo of an amoeba devouring a spore?

"Cat. 99.9% certainty."

"That's not a cat," you reply. 

You offer up a beautiful painting made in memorial of a lost canine friend. "Amoeba."

Frustrated, you offer up a cheezburger meme. "Cat," the AI correctly responds.

Relieved, you sit back and accidentally send it a set of twelve stop signs and one bicycle. "Martian."

So what the heck is going on?

Well, first off, you got your 5 million photos from the internet, so it was 80% cats. Thus the AI ended up with a configuration set that favors cats. It decided that abstract blobs and unrealistic strokes, much like the brush strokes in your canine painting, looked a lot like the background of slides on which amoeba were found - it didn't learn anything about amoeba themselves. And tall, thin objects were clearly Martians, since you didn't teach it about anything else that was tall and thin.

Now, machine learning, even in the primitive form we have today, has some value. In very narrow fields it's possible to give a machine enough information that the outputs start to make sense. But the problem is that these narrow field successes have led to trying over and over to apply it to broader questions - questions which are often difficult even for human experts with far more reasoning power.

There are two big problems with machine learning. The first is that in real life, you would never actually know why it made those mistakes. The neuron training sequence is relatively opaque and there are few opportunities to debug incorrect answers. It's a big black box even to the people who built it.

The second is data curation. When you create such large datasets, it's very hard - nearly impossible - to ensure it's a good dataset. There must be NO details that you don't want the AI to look at. If you are differentiating species, then no backgrounds, no artistic details - even different lighting can be locked in as a differentiator. The AI has NO IDEA what the real world looks like, so it doesn't unconsciously filter out details like we do. To the machine EVERY detail is critical. If I give it a cat on a red background and a dog on a blue background, it is very likely to determine that all animals on a red background are cats, because that is easier to determine than the subtle shape difference.
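Here's that trap in a contrived little sketch: give a "learner" two features and let it keep whichever single feature classifies the training set best. Background color separates the training photos perfectly, so that's what it latches onto - right up until somebody photographs a dog on a red background.

    /* Contrived sketch of shortcut learning: the "learner" keeps whichever
     * single feature best separates the training set. Numbers are made up. */
    #include <stdio.h>

    #define TRAIN 6

    /* feature 0: background (1 = red, 0 = blue)
     * feature 1: crude shape score (1 = cat-ish, 0 = dog-ish), deliberately noisy */
    static const int features[TRAIN][2] = {
        {1,1}, {1,1}, {1,0},   /* cats, all shot on red backgrounds  */
        {0,0}, {0,0}, {0,1}    /* dogs, all shot on blue backgrounds */
    };
    static const int labels[TRAIN] = { 1,1,1, 0,0,0 };   /* 1 = cat, 0 = dog */

    static int accuracy(int feat)
    {
        int right = 0;
        for (int i = 0; i < TRAIN; i++)
            if (features[i][feat] == labels[i]) right++;
        return right;
    }

    int main(void)
    {
        /* "training": keep the feature that scores best on the training set */
        int best = (accuracy(0) >= accuracy(1)) ? 0 : 1;
        printf("background scores %d/6, shape scores %d/6 -> using feature %d\n",
               accuracy(0), accuracy(1), best);

        /* now show it something new: a dog photographed on a red background */
        int dog_on_red[2] = { 1, 0 };
        printf("dog on a red background -> classified as %s\n",
               dog_on_red[best] ? "cat" : "dog");
        return 0;
    }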

The dataset must also be all-encompassing. If you leave anything out, then that anything does not exist to the AI, and so providing it that anything automatically means it must be one of the other things. The brain cannot choose "never seen before"... at least with most traditional training methods. At best you might get a low confidence score.

Finally, the dataset must be appropriately balanced. There may be cases where a skew is the right answer... for instance, a walking bird in the Antarctic is more likely a penguin than an emu, but if you are classifying people then you need to make sure the dataset contains a good representation of everyone, in roughly equal proportions. Sounds pretty hard, doesn't it? Yeah, that's the whole point. It's hard.
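Counting the labels, at least, is the easy part - something like this, assuming you even have labels to count:

    /* Minimal label-count check for a dataset - assumes one label per sample;
     * the class names and counts here are just placeholders. */
    #include <stdio.h>

    int main(void)
    {
        const char *names[] = { "cat", "dog", "martian", "amoeba" };
        /* stand-in for "read the labels from your dataset" */
        int labels[] = { 0,0,0,0,0,0,0,0, 1,1, 2, 3 };
        int total = sizeof(labels) / sizeof(labels[0]);
        int count[4] = { 0 };

        for (int i = 0; i < total; i++) count[labels[i]]++;

        for (int c = 0; c < 4; c++)
            printf("%-8s %3d  (%.0f%%)\n", names[c], count[c],
                   100.0 * count[c] / total);
        return 0;
    }

The counting is trivial; deciding what the categories should be and actually fixing the proportions is where the hard part lives.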

And that's a point I've made over and over again. Computing good hard like grammar. People are always looking for shortcuts, and they never work as well as expected. Not only is machine learning being seen as a huge shortcut to hard problems, but people are taking shortcuts creating the machine, and getting poor results. This shouldn't be a surprise. If you know the dataset is incomplete, why are you surprised that the machine doesn't work right? You're supposed to be smart. ;)

The real problem of all this is that people still think if a computer says it, it must be true. Despite daily experience with their cell phones, smart TVs, game systems and PCs all being buggy, malfunctioning pieces of crap, people somehow believe that the big mainframes at the mega-corporations (which generally don't exist anymore, and the ones you are thinking of have less power than your smart watch) get it right.

So as machine learning continues to be used to classify people for risk, recognize people on the street, chase people for debt, and so on, people are going to be negatively impacted by the poor training the machines received.

Computers are stupid. They are stupider than the stupidest person you've ever had to work with. They are stupider than your neighbor's yappy dog down the street who barks at the snow. They are stupider than those dumb ants who walk right into the ant trap over and over again. Computers do not understand the world and have no filter for what is relevant and what is not. Don't trust them to tell you what's true.


Friday, January 29, 2021

Complexity - or - You're DEFINITELY Doing it Wrong

Hey, I'm employed again! You know what that means - MORE RANTS!

Any new position always means learning about new systems that you didn't see before - or in this case, systems that I deliberately steered away from before. And the base takeaway of the last couple of weeks is "people love complexity".

From layering systems on top of Git to creating an ecosystem with an entire glossary of new terminology, some people just feel a system isn't worth doing if it isn't layered with system on top of system on top of system. Unfortunately Linux as a platform heavily endorses this approach with a huge library of easily obtainable layers.

I once made the joke that building a project under Linux is like playing an old graphical Sierra adventure game. You need to get a magic potion to save the Princess, but the witch demands you bring her an apple. The orchard can't give you an apple unless you bring them some fertilizer for the trees. The farmer has fertilizer, and he'll trade you for a new lamp for his barn. The lamp maker would love to help, but he's all out of kerosene... and so on for the duration of the quest. Much the same under Linux, just replace the quest items with the next package you need, which depends on the next package, which depends on the next package... sometimes I wonder if anyone wrote any actual code, or if they all just call each other in an infinite loop until someone accidentally reaches Linus' original 386 kernel, which does all the actual work...

So anyway, yeah, if you need to invent a glossary of terms to describe all the new concepts you are introducing to the world of computing, then you are probably not a revolutionary - you are probably over-complicating something we've all been doing for better than half a century. Do you really need 1GB of support tools to generate an HTML page?

It's one thing I loved about embedded: it hadn't reached the point of being powerful enough to support all these layers yet. But those days are rapidly ending. The Raspberry Pi Pico is a $4 embedded board powerful enough to generate digital video streams by bitbanging IO. Memory and performance aren't much of a concern anymore.

But let me end on a positive note - unusual for me, I know. Some of these packages produce amazing results and even I'm glad to see them out there. But for Pete's sake, consider whether you really need to add another layer on top of those packages - what are you actually adding? Seriously, poor Pete.

... if I could add a bit from Hitchhiker's Guide to the Galaxy...

"Address the chair!"
"It's just a rock!"
"Well, call it a chair!"
"Why not call it a rock?"


Saturday, January 9, 2021

Let's talk about unit testing...

Since I've wandered back to the employment market, I've had to go through a lot of interview processes. From the very (ridiculously) large to the small, I'm basically being deluged with a slew of new acronyms that I wasn't hit with a decade ago when I was last interviewing. And what that basically reinforces for me is that software development continues to be a hype-driven field, with everyone tightly embracing the latest buzzword, because obviously software used to be hard because we weren't doing it this way...

Personally, it would be nice if, instead of thinking of a cute new buzzword for something we've all known for 40 years already, people would just devote energy to writing better code. Education, practice, peer collaboration -- these create better code. Not pinning notecards around the office and telling everyone you're an Aglete now.

And why do we want to create better code? I sort of feel like this message is often lost - and without it you do have to wonder exactly WHY you are building a house of cards out of floppy disks once a week (although people don't wonder when they have the cool buzzword of Habitatience to direct them). But the reason to create better code is so that we spend less time making the code work. It's about making software reliable, and to at least some minimal degree, predictable - and these are things that bugs are not.

Anyway, unit testing is still pretty big, though of course the only right way to unit test is to use someone else's unit test framework, and write standalone blocks of code that run and pass the tests automatically. If you aren't using FTest, you clearly aren't testing at all.

Let me be clear up front - these little test functions are valuable, just rarely in the way that the proponents think. So let's just over-simplify first what I'm talking about.

Basically, the idea is that the developer writes test functions that can be executed one-by-one via a framework. These test functions are intended to exercise the code that has been written, and verify the results are correct. When you're done, you usually get a nice pretty report that can be framed on your wall or turned in to your teacher for extra marks. They show you did the due diligence and prove that your code works!

Or do they? Did you catch the clue? Encyclopedia Brown did.
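Before we get to the clue, here's the whole machine stripped down to its bones so we're talking about the same thing - a hand-rolled harness rather than any particular framework, with a made-up add() standing in for your real code:

    /* Minimal hand-rolled unit test harness - just the shape of the idea.
     * add() is a stand-in for whatever code you actually wrote. */
    #include <stdio.h>

    static int tests_run = 0, tests_failed = 0;

    #define CHECK(cond) do {                                           \
            tests_run++;                                               \
            if (!(cond)) {                                             \
                tests_failed++;                                        \
                printf("FAIL %s:%d: %s\n", __FILE__, __LINE__, #cond); \
            }                                                          \
        } while (0)

    static int add(int a, int b) { return a + b; }   /* the "code under test" */

    static void test_add(void)
    {
        CHECK(add(2, 2) == 4);
        CHECK(add(-1, 1) == 0);
        CHECK(add(0, 0) == 0);
    }

    int main(void)
    {
        test_add();
        printf("%d tests, %d failed\n", tests_run, tests_failed);
        return tests_failed ? 1 : 0;   /* nonzero exit so the build server notices */
    }

Real frameworks add fixtures, mocks and prettier reports, but that's the essence: call the code, check the result, count the failures.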

The creator of the unit tests for a piece of code is usually the developer of that piece of code. Indeed, for some low level functions nobody else could. (Although outside of the scope of this rant, it would be very reasonable for the designer or even the test group to create high level unit tests to verify /function/... but this never happens.) Anyway, the problems with this are several:

First, the developer is testing their own understanding of what the function does. They are not necessarily testing what the function is supposed to do. Indeed, they usually write code that tests that the code they wrote does what they wrote it to do -- in essence they are testing the compiler, not the program. Modern compilers are not infallible, but they are generally good enough that we don't need to test their code generation as a general rule.

Secondly, this is a huge opportunity for a rookie trap. Novice programmers usually only test that a function does what the function is supposed to do. That is, they don't think to test if the function correctly handles bad situations, like invalid inputs. This is a huge hole and often means that half the function is unexercised -- or that the function has no error handling at all. But it will still pass the unit test.
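Here's the trap in miniature, with a hypothetical parse_percent() that is supposed to reject anything outside 0 to 100. The rookie test only feeds it friendly numbers, so the function could be missing its range checks entirely and still pass with flying colors:

    /* Sketch of the happy-path trap. parse_percent() is hypothetical; it is
     * supposed to return -1 for anything outside 0..100 or non-numeric input. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <ctype.h>

    static int parse_percent(const char *s)
    {
        if (s == NULL || *s == '\0') return -1;
        for (const char *p = s; *p; p++)
            if (!isdigit((unsigned char)*p)) return -1;
        long v = strtol(s, NULL, 10);
        return (v >= 0 && v <= 100) ? (int)v : -1;
    }

    static void rookie_test(void)
    {
        /* only exercises what the function is "supposed" to do */
        printf("50  -> %d\n", parse_percent("50"));
        printf("100 -> %d\n", parse_percent("100"));
    }

    static void grown_up_test(void)
    {
        /* the inputs the real world will eventually send */
        printf("\"\"   -> %d\n", parse_percent(""));
        printf("NULL -> %d\n", parse_percent(NULL));
        printf("101  -> %d\n", parse_percent("101"));
        printf("1e2  -> %d\n", parse_percent("1e2"));
    }

    int main(void) { rookie_test(); grown_up_test(); return 0; }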

Thirdly, this becomes a sort of a black box test. Similar to the comment above, there's no way to verify that every line of code in the function has been exercised. In fact, it's not even certain that the function behaved the way the developer intended -- only that the output, whatever it is, matched whatever criteria the unit test developer asked for. (And this can range from detailed to very, very basic, but it's still restricted only to the final output.) Correct result for a single input doesn't guarantee correct operation. There is such a thing as dumb luck!

But there is value to these tests. Because they can be (and usually are) run by automatic build scripts, they are fantastic high level validations that a code change didn't fundamentally break anything. Of course, for this to be true, unit tests need to be peer reviewed and they need to include as many cases as are necessary to test ALL paths within the function being tested.

But what about the third point? While meeting the second point more or less addresses it, there is a variable not taken into account: time. What do I mean by that? I mean that in any project large enough that the developers are using automatic build tools with unit tests, the code is not static. It is being changed, often rapidly. That's why the automatic tools are trying to help.

However, once person A has created the test, person B, coming along later to modify that function, rarely goes looking to update the unit test -- particularly if they did not change the function's purpose. But the unit test was created so that the inputs it passed exercised all the code paths. Now there are new code paths. You no longer know that the unit test is testing everything.

"Well, we'll just tell people to update the unit tests," you exclaim. "Case dismissed, nice try, but that's it."

Hah, I reply. Hah. Good luck.

Look, nobody sets out to be a sloppy or lazy developer, not even many of the cases I've implied in my rants. But people forget things, they are usually on a tight schedule, and, most heinous of all, their manager usually tells them to "worry about that later". After all, the unit test exists, so that box is checked, and there's no point spending more money on updating it after it already exists. What are we supposed to do, fill in the box? It's already checked!

So look, just assume that your automated tests are going to fall out of date until you hire a new gung-ho intern who finds them, or the original dev adds a new feature and goes to update the unit test they wrote. They are still useful as regression tests - awesomely so, in fact. Having unit tests on complex code I've written has saved me a few times. But what do you do between gung-ho interns?

Even if you don't have an automated build tool or haven't got around to implementing your unit test framework yet, the developers can still perform manual unit testing. Stop grinding your teeth - it's not as bad as you think. You have Visual Studio, Eclipse, or GDB, right? Quit your whining. In my day we did unit tests by changing the screen color and we liked it.

It's actually really simple. The developer simply steps through the new code. Modern debuggers allow you to set the program counter and both observe and change variables in real time -- meaning that a developer can walk through all the possible paths of their new function in a matter of minutes, without even needing to set up the real world conditions that would trigger every case. This is especially helpful when some of the cases are technically "impossible" (a programmer should never write "impossible" without quotation marks when hardware is involved). Inputs can be changed, the code can be walked through, and then the program counter can be set right back to the last branch and tried again.
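As a sketch of what I mean, assume a function with a branch that "can't" happen because the hardware "always" returns a valid status. GDB will happily drive it through that branch anyway using break, set var and jump (Visual Studio's Set Next Statement does the same job); the commands and line numbers in the comments are illustrative only.

    /* Sketch: walking the "impossible" branch with a debugger instead of a
     * unit test. handle_status() is hypothetical. In gdb, roughly:
     *
     *   (gdb) break handle_status
     *   (gdb) run
     *   (gdb) set var status = 0xFF    <- force the "can't happen" value
     *   (gdb) next                     <- watch it take the error branch
     *   (gdb) jump 20                  <- back up and try another value
     */
    #include <stdio.h>

    static int handle_status(int status)
    {
        if (status == 0xFF) {
            /* "impossible" - the hardware never returns 0xFF... until it does */
            printf("bad status, resetting device\n");
            return -1;
        }
        printf("status %d handled\n", status);
        return 0;
    }

    int main(void)
    {
        return handle_status(1);   /* a normal run only ever sees the happy path */
    }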

It's true that this can take a while if a lot of code is written, and naturally you still need to run the real world tests (to see if it actually works, as opposed to theoretically works), but this is guaranteed to be faster than writing and testing the unit tests. Oh yeah, you missed that part, didn't you? You also have to test your unit tests actually work.

The worst unit test case I ever saw tested a full library of conversion functions by passing 0 to the base one and verifying that 0 came back out. As one might expect, 0 was a special case in this function. The other conversions actually contained off-by-one errors in about half the cases (and confused bits for bytes in several others - this was hardware based). But the unit test checkbox was marked, verifying that the software was correct, and more importantly, the unit test passed. It wasn't till we tried to use it that things went wrong.
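For flavor, the shape of it was something like this - reconstructed from memory, heavily simplified, names invented, definitely not the actual code - a bits-to-bytes conversion with an off-by-one that zero happens to sail right past:

    /* Something in the spirit of that library - NOT the actual code. The
     * conversion has an off-by-one, and zero is the one input that hides it. */
    #include <stdio.h>

    /* intended: number of bytes needed to hold 'bits' bits */
    static int bits_to_bytes(int bits)
    {
        return bits / 8;            /* bug: should be (bits + 7) / 8 */
    }

    static void unit_test(void)
    {
        /* the entire original "unit test" */
        if (bits_to_bytes(0) == 0)
            printf("PASS - ship it!\n");
    }

    int main(void)
    {
        unit_test();
        /* ...and what happened when we actually tried to use it */
        printf("12 bits -> %d bytes (should be 2)\n", bits_to_bytes(12));
        return 0;
    }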

So, I recommend both. Have the developer step through their code. Let's call it the Stepalicious Step. Then after it works, write unit tests as regression tests so that your build server feels like it's contributing. But make sure unit tests are considered first tier code, and go through your usual peer review phase, to avoid only checking the easy case.

"Oh yes, we do Agile, Regression testing, and Stepalicious." Oh, it's no dumber sounding than trusting your source code to a Git... 


Thursday, November 12, 2020

On Broken Systems...

 I've been doing a lot of scripting for the last couple of weeks on Second Life (in violation of my SurveyWalrus, don't tell him!). One of the things that surprises me about it is just how many of the APIs are unreliable. That is, the statement executes, no error occurs, but it doesn't work every time.

Actually, this is crazy common, especially when dealing with hardware in the real world. Nothing works quite as documented, and often not as you expected it to either. Surprisingly for this blog, I'm not criticizing it, I find it rather endearing. (When it's HARDWARE. Hardware is hard, SL has no excuse ;) )

So what do you do when you execute the command, and nothing happens? Well, there are a few things that tend to help out in that case.

First and foremost, you need to ensure that the command actually happened, and that it happened the way you expected it to. There's little point going any further if you can't prove this. This means instrumenting your code, which I've covered before. It probably also means probing the hardware to make sure what was supposed to trigger the effect actually happened. Oscilloscopes have come down a lot in price over the decades, and even a cheap pocket one from Alibaba is a better diagnostic tool than poking the circuit with a wet finger.

(Disclaimer: don't poke circuits with wet fingers. It's not good for the circuit, it's not good for your finger, it likely won't tell you anything and it looks silly.)
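As for instrumenting, it doesn't have to be fancy - a timestamped line before and after the call, with the exact parameters you sent, settles most "did it even happen?" arguments. A sketch, with send_command() standing in for whatever API call or register write you're fighting with:

    /* Minimal instrumentation sketch - send_command() is a placeholder for
     * the flaky API call or register write you are trying to prove happened. */
    #include <stdio.h>
    #include <time.h>

    static int send_command(int cmd, int value)
    {
        (void)cmd; (void)value;
        return 0;   /* pretend this poked the hardware and reported success */
    }

    static int send_command_logged(int cmd, int value)
    {
        fprintf(stderr, "[%ld] send_command(0x%02X, %d)...\n",
                (long)time(NULL), cmd, value);
        int result = send_command(cmd, value);
        fprintf(stderr, "[%ld] ...returned %d\n", (long)time(NULL), result);
        return result;
    }

    int main(void)
    {
        send_command_logged(0x42, 128);
        return 0;
    }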

After you have verified that the command is happening the way you expected, just take a moment and double-check the documentation matches your expectation. This step can save you hours of effort. Of course, about one in five times the documentation is also wrong. Optimist.

Okay! So your code is working! The command matches the documentation! And it doesn't work. Now what? Debugging starts. That's right, you don't only have to debug the code you write, you will probably (always) have to debug code you didn't write and hardware you didn't create. And this part you usually can't fix! Man, computers are great!

More common than outright failure is inconsistent operation. That is, sometimes it works and sometimes it doesn't. This was the case with the SL APIs. It's very rare that something which is not actually defective is truly random -- and if it is, you can't work with it anyway. You need to start in on the scientific method and work out which conditions it works in, and which conditions it doesn't. At the over-simplified level, that's:

1) Create a theory
2) Devise a way to test your theory
3) Execute your test and record the results
4) Revise the theory based on the new information
5) Repeat at 2

You'll notice there's no exit condition. Usually, you stop when you get it reliable enough, and that depends on your needs. Or you stop when the hardware guy finally gets tired of your questions and adds some resistors to the circuit, suddenly stabilizing it. ;)

But to get you started - most of the time I've run into inconsistent operation, it has been timing. From the Atari Jaguar to cheap LCD panels to, yes, Second Life, it's really common for both APIs and hardware to drop commands if they come packed too close together. So, spacing them out is a good first test.
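The experiment itself can be as dumb as forcing a minimum gap between commands - if the failures vanish when you space things out, you've found your lever. A sketch, with the 20ms figure pulled out of thin air; tune it until the failures stop, then back off to find the real threshold:

    /* Sketch of the "space the commands out" experiment. */
    #define _POSIX_C_SOURCE 199309L
    #include <stdio.h>
    #include <time.h>

    static void send_command(int cmd) { printf("sent %d\n", cmd); }

    static void pause_ms(long ms)
    {
        struct timespec ts = { ms / 1000, (ms % 1000) * 1000000L };
        nanosleep(&ts, NULL);
    }

    int main(void)
    {
        for (int cmd = 0; cmd < 5; cmd++) {
            send_command(cmd);
            pause_ms(20);          /* the experiment: does spacing fix it? */
        }
        return 0;
    }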

What if it is truly random (or you just can't narrow it down any further), and you have no choice but to use it? Well, first, push back really hard, because you really don't want to have to support your code on this broken system for the next 10 years, do you? It's not over when you push it to Gitlab!

But if you really can't, well, you need to improve the odds of success. Can you safely execute the command twice? Safely means that it's okay if the command works once and fails once, and still okay if it executes twice. Apparently the NES needs to do this workaround on the controller port if making heavy use of the sample channel, for instance. Alternately, is there a way to VERIFY the command, and repeat it if it failed? Is there another way to accomplish the same thing, and bypass the broken command? Even if it's slower, that is probably better than unreliable.
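The verify-and-repeat version looks roughly like this - send, read back, try again a bounded number of times, and fail loudly instead of hanging forever. set_mode() and get_mode() are hypothetical stand-ins for the flaky command and its readback:

    /* Sketch of verify-and-retry for an unreliable command. set_mode() and
     * get_mode() are made-up stand-ins for the flaky API or hardware. */
    #include <stdio.h>

    #define MAX_RETRIES 3

    static int current = -1;
    static void set_mode(int mode) { current = mode; }   /* pretend: sometimes drops */
    static int  get_mode(void)     { return current; }   /* pretend: reads it back  */

    static int set_mode_reliably(int mode)
    {
        for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
            set_mode(mode);
            if (get_mode() == mode)
                return 0;                   /* verified - done */
            fprintf(stderr, "set_mode(%d) didn't stick (attempt %d)\n", mode, attempt);
        }
        return -1;                          /* give up loudly, don't hang forever */
    }

    int main(void)
    {
        if (set_mode_reliably(3) != 0)
            fprintf(stderr, "device is having a bad day\n");
        return 0;
    }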

In the case of outright failure, you really have two possible causes: either the device/API is actually broken, or you are commanding it wrong. In the former case, you probably can't do very much about it -- and if you are still reading here, you probably can't prove it either. So you need to figure out what you are doing wrong.

Unfortunately, this one is much harder to advise - it all comes down to experience. Think about similar APIs you have implemented, and compare to the information you have. Does it make sense to try a different byte order? What about bit order? Is there an off-by-one error? (This is common in software and hardware both!) Make sure, if you are working with hardware, that it is safe to send bad data on purpose. Set up an isolated test bed, and try different things. Use the scientific method again, and you might just figure it out!
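For the byte order and bit order experiments, a couple of tiny helpers cover most of it - swap the bytes, reverse the bits, and see whether the device suddenly starts making sense:

    /* The usual suspects when a device ignores you: byte order and bit order. */
    #include <stdio.h>
    #include <stdint.h>

    static uint16_t swap_bytes16(uint16_t v)
    {
        return (uint16_t)((v >> 8) | (v << 8));
    }

    static uint8_t reverse_bits8(uint8_t v)
    {
        uint8_t out = 0;
        for (int i = 0; i < 8; i++)
            out = (uint8_t)((out << 1) | ((v >> i) & 1));
        return out;
    }

    int main(void)
    {
        printf("0x1234 byte-swapped: 0x%04X\n", swap_bytes16(0x1234)); /* 0x3412 */
        printf("0xA1 bit-reversed:   0x%02X\n", reverse_bits8(0xA1));  /* 0x85   */
        return 0;
    }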

Then you can enjoy a coffee and go tease the hardware guy that the software guy found their bug. That's always fun. ;)




Monday, October 12, 2020

VGMComp2 - Looking Back

 Many years ago, I undertook a project to come up with a simple compression format for music files on the TI-99/4A. My goals were simple, and somewhat selfish. There was a music format called VGM that supported the chip, and music from platforms that used it, like the Sega Master System, was easily obtainable. However, the files recorded every write to the sound chip along with timing information, and tended to be very large.

I built a system that stripped out the channel-specific data and moved all timing to separate streams - thus this four channel sound chip now had 12 streams of data: tone, volume, and timing. With all the streams looking the same, I implemented a combination of RLE and string compression and got them down to a reasonable size. There were a number of hacks for special cases I noticed, but ultimately it was working well enough to release. It was, in fact, used in a number of games and even a demo for the TI, so it was a success.
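The RLE half of that idea is the easy part, and a simplified sketch of it looks like this - the concept only, not the actual file format, which packs runs and string references much more tightly than byte pairs:

    /* Simplified run-length encoding sketch - the idea, not the real format. */
    #include <stdio.h>
    #include <stddef.h>

    /* emit (count, value) pairs; returns the number of bytes written to out */
    static size_t rle_encode(const unsigned char *in, size_t len, unsigned char *out)
    {
        size_t o = 0;
        for (size_t i = 0; i < len; ) {
            unsigned char value = in[i];
            size_t run = 1;
            while (i + run < len && in[i + run] == value && run < 255) run++;
            out[o++] = (unsigned char)run;
            out[o++] = value;
            i += run;
        }
        return o;
    }

    int main(void)
    {
        /* a volume stream spends most of its life not changing */
        unsigned char stream[] = { 15,15,15,15,15, 12,12,12, 0,0,0,0,0,0,0,0 };
        unsigned char packed[32];
        size_t n = rle_encode(stream, sizeof(stream), packed);

        printf("%u bytes -> %u bytes:", (unsigned)sizeof(stream), (unsigned)n);
        for (size_t i = 0; i < n; i += 2)
            printf(" (%dx%d)", packed[i], packed[i + 1]);
        printf("\n");
        return 0;
    }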

But it always bothered me. Why did I need the hacks? Why did it use so much CPU time? Could I do better? I spent a fair bit of time, on and off, coming up with ways to improve it. And finally, I convinced myself that I could. The new scheme was similar, but reduced the four time streams to just one, and changed out one of the lesser-used compression flags for a different idea. My thinking was that even if all else was equal, going from 12 streams down to 9 would buy me 25% CPU back.

But it didn't. In fact, the new playback code barely performed as well as the old. Even after recoding it in assembly, and heavy optimization, it was still reporting only about 10% better CPU usage than the old one. It took a lot of debugging to understand why, and what I finally realized was that the old format was simply better at determining when NO work was needed - it only had to check the four time streams. The new format needed to check the timestream and the four volume channels. That meant the best case (no work at all) was slightly faster on the old player than the new one. But the new one was markedly better in the worst case (all channels need work), just because the actual work per channel was simplified some.

Compression itself didn't really give me the wins I hoped for either. After creating specific test cases and walking through each decompress case (and so debugging them), compression was better, but not amazingly so. The best cases, true, were about 25% smaller than the old compressor, but the worst cases were pretty much on par, and that only with the most rigorous searches.

What I finally had to admit to myself, in both cases, was that the years of hacks and tricks and outright robberies in the original compressor had created something that was pretty hard to beat. But, it was also impossible to maintain, rather locked into the features it could support, and most importantly, I did beat it. Maybe not by much, but 10% on a slow computer is not a bad win.

And that, really, was something else I had to admit to myself. The TI is a slow computer. Even back in the day it was not terribly speedy. I tend to forget sometimes, working on my 3GHz computer, that the 3MHz clock of the TI is a thousand times slower than my modern PC. And that's ignoring all the speedups that modern computers enjoy. (It's kind of a shame how much of that power modern OS's steal, but I guess that's a different rant.) Anyway, the point is that even writing all 8 registers on the sound chip every frame takes almost 1% of the system's CPU. And that's just writing the same value to all of them. That I can decompress and playback complex music in an average of 10-20% CPU is maybe not as awful as I felt when I first realized it.

There's of course another advantage to this new version. It was a goal to also support the second sound chip used in the ColecoVision Phoenix - the AY-3-8910. Borrowed from the MSX to make porting games from it simpler, this became a standard of sorts in the Coleco SGM add-on from OpCode, and so supporting it, at least in a casual manner, seemed worthwhile. This goal expanded when a member of the TI community announced that he'd be resurrecting the SID Blaster - a SID add-on card for the TI-99/4A. So, I made the toolchain support both of these chips -- although I cheated. A lot.

In the case of the AY, it wasn't so bad. I just ignored the envelope generator and treated it like another SN with a limited noise channel but better frequency range. The SID was trickier. I still did the same abuse - I ignored the envelope generator and treated it like another SN, but with only three channels. Unfortunately, the SID required some trickery because the envelope generator was necessary to set the volume. Fortunately for me, the trickery appeared to work. ;)

I have to admit that I'm not convinced that using both chips together will be acceptable, performance wise. 20% doesn't sound bad -- but that's on average. If both chips experience a full load on the same frame, it could be more than double that. On the other hand, if you can get away with running the tunes at 30hz and alternate the sound chips, that would be fine. That would likely be what I'd do.

Anyway, there was yet one more goal, and that was a robust set of tools to surround the new players. In the end, I created nearly 50 separate tools. And being very silly, many of them look Windows specific (but they are all just console apps and will port trivially, someday). But we have player libraries for the ColecoVision and the TI, a reference player for the PC, a dozen sample applications, 10 audio conversion tools (including from complex sources such as MOD and YM2612), and over 20 simple tools for shaping and manipulating the intermediate sound data. I have no doubt it's very intimidating, but short of tracking the data yourself (which, frankly, is a better route than converting), I believe there's no better toolset for getting a tune playing on this hardware.

Of course, if you can track it yourself, you can still use this toolset to get from tracker to hardware. ;)

I do intend to use this going forward, of course. The first user will probably be Super Space Acer, as that's near the top of my list (Classic99 is ahead of it). Though that game is nearly done, it will benefit from the improvements, and I need to finish it and port it around. With luck, once people have a chance to figure out the new process, they'll use it as well. I'll have to do some videos.

Anyway, the toolset is up at Github, and eventually on my website too, once I get that updated. 

https://github.com/tursilion/vgmcomp2

(BTW: I very, very, very rarely log into the Github website. Using the ticket system and sending me notes there is all well and good, but generally I just push my project and move on. That's why I use Git in the first place, because SIMPLE. My point is - expect turnaround times to be really slow if that's how you reach out to me. I'm not ignoring you. I just haven't seen it yet. I say this because logging in to get the URL there, I noticed some stuff waiting for me. ;) )