Saturday, September 17, 2016

Atari Jaguar Programming Causes Brain Damage - Confirmation

Well, not necessarily. But after a few hours last night I had a crazy headache that got me out of bed to take Advil. ;)
Anyway, I took a break from work and from my pending TI project, both of which are breaking my self-esteem at alarming rates, to wrap up a support project for my Atari Jaguar cartridge boards. When I first tested them, I attempted to burn a slightly modified version of Tempest (just some text string edits) -- only to find it didn't boot.
I realized after thinking it through that Jaguar carts are tested against an MD5 sum (to avoid modification and bad contacts). So the MD5 hash would need to be updated too. That's not unusual; many game systems had a checksum or similar to confirm the game would work. The problem on the Jaguar is that the hash is buried in the proprietarily encrypted portion of the boot header - so I'd need to re-encrypt it too.
That's not the end of the world... the tools were discovered years ago and are out there. To date I'd used the Atari ST encryption tool to create a "fast boot" header for the Skunkboard... and learned a bit there. One interesting thing about that project was the discovery that the Jaguar CD subverted the boot process - our first pass didn't even work plugged into a Jag CD.
Atari Jaguar cartridges start with an encrypted boot header, broken into 65-byte blocks. Each block is encrypted with a full 520-bit key (I kind of wonder if that didn't violate export restrictions back in the mid-90s? I could Google but I won't...). It takes about half a second to decrypt one block, and the normal cartridge has ten of them. The code is decrypted into the GPU, where it is then executed. The code runs an MD5 hash on the cartridge, compares it to the one that was embedded, and if all looks good, it writes the magic value 0x03D0DEAD to the first GPU RAM address and exits. On exit, the BIOS checks for the magic value, and boots the cart if it sees it, or red-screens if it doesn't.
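Those numbers are easy to sanity-check. This is a minimal sketch using only the figures quoted above (65-byte blocks, roughly half a second per block, ten blocks on a stock cart); the constant and function names are mine, not anything from the Atari tools.

```python
BLOCK_BYTES = 65          # size of one encrypted boot block (from the post)
SECONDS_PER_BLOCK = 0.5   # approximate decrypt time per block (from the post)

def key_bits(block_bytes=BLOCK_BYTES):
    """Key width in bits if the key is as wide as one block."""
    return block_bytes * 8

def boot_seconds(blocks):
    """Rough decrypt time for a header made of `blocks` blocks."""
    return blocks * SECONDS_PER_BLOCK

print(key_bits())        # key width in bits
print(boot_seconds(10))  # stock ten-block header
print(boot_seconds(1))   # single-block fast boot
```

The same arithmetic is why trimming the header down to one or two blocks makes such a visible difference at power-on.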
There is a small complication in altering this code in that the Encryption tool writes several values to fixed addresses before it encrypts the boot - specifically it stores the MD5 hash, some state information, and the first and last address of the cart. So our Skunkboard boot needed to be tolerant of that (we just left the areas unused).
00F035AC: MOVEI   $00F03566,R00    (9800) ; address to pass ROM check
00F035B2: MOVEI   $03D0DEAD,R04    (9804) ; magic value to unlock 68k
00F035B8: JUMP    (R00)            (D000) ; go do it!
00F035BA: NOP                      (E400) ; delay slot
That was literally it - we just wrote the magic value and jumped back to the startup code to handle the return. I got the code encrypted by hex-editing the Atari ST program and re-running it.
The CD unit works a little differently. Intentionally or just because they could (it's not clear to me why), they leave the GPU busy on a little VLM demo (courtesy of Yak!), and decrypt these blocks into the DSP instead. The DSP is a nearly-identical processor to the GPU, so okay, that's cute. But then the CD BIOS makes several hard-coded fix-ups to absolute addresses in the decrypted code (without checking if it's the code that it expects). Then it jumps past the first part of the decrypted code into a later entry point to do the MD5 hash.
So when we updated the Skunkboard boot for the CD unit, first we had to add a second block (because the jump point is past the first block), meaning we went from half a second to a full second boot. Then we had to document and avoid the manual patch areas. Finally, we had to be able to run on both the GPU and the DSP, despite them having different address bases. But, our code was extremely simple, so it wasn't hard to make it fit. We borrowed some Atari code to handle the device independence (although, since we duplicated the work, it ended up not being necessary), and it was fine.
 MOVEI #$00FFF000,R1  ; AND mask for address
 MOVEI #$00000EEC,R2  ; Offset to chip control register
 MOVEI #$03D0DEAD,R4  ; magic value for proceeding
 MOVE PC,R0           ; get the PC to determine DSP or GPU
 AND R1,R0            ; Mask out the relevant bits
 STORE R4,(R0)        ; write the code
 SUB R2,R0            ; Get control register (G_CTRL or D_CTRL)
 MOVEQ #0,R3          ; Clear R3 for code below
GAMEOVR:
 JR GAMEOVR           ; wait for it to take effect
 STORE R3,(R0)        ; stop the GPU
; Need an offset of $48 - this data is overwritten by the encrypt tool
; with the MD5 sum.
 MOVEI #$0,R0
 MOVEI #$0,R0
 MOVEI #$0,R0
 MOVEI #$0,R0
 MOVEI #$0,R0
 MOVEI #$0,R0
; JagCD entry point (same for now)
 ; There is a relocation at $4A that we can't touch
 MOVEI #$0,R0         ; dummy value
 ; real boot starts here
 MOVEI #$00FFF000,R1  ; AND mask for address
 MOVEI #$0,R0         ; This movei is hacked by the encryption tool
 MOVEI #$0,R0         ; This movei is hacked by the encryption tool
 MOVEI #$00000EEC,R2  ; Offset to chip control register
 MOVEI #$03D0DEAD,R4  ; magic value for proceeding
 MOVE PC,R0           ; get the PC to determine DSP or GPU
 AND R1,R0            ; Mask out the relevant bits
 STORE R4,(R0)        ; write the code
 SUB R2,R0            ; Get control register (G_CTRL or D_CTRL)
 MOVEQ #0,R3          ; Clear R3 for code below
GAMEOVR2:
 JR GAMEOVR2          ; wait for it to take effect
 STORE R3,(R0)        ; stop the DSP
Despite the extra code for device independence (which, as I noted, was unnecessary in the end), it's still pretty much the same. I actually released a 'makefastboot' tool which would prepend this header to any cart to make it boot in 1 second instead of 5.
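The device-independence trick in that listing is just address arithmetic: mask the PC down to the base of whichever local RAM the code is running from, then step back by a fixed offset to reach that processor's control register. Here is a small Python sketch of the same math. The mask and offset come straight from the listing; the DSP PC is a hypothetical address inside DSP RAM, and the expected results assume the standard Jaguar memory map (GPU RAM at $F03000, DSP RAM at $F1B000).

```python
ADDR_MASK = 0x00FFF000    # same AND mask as the boot code
CTRL_OFFSET = 0x00000EEC  # same offset as the boot code

def ram_base_and_ctrl(pc):
    """Return (local RAM base, control register address) for a given PC."""
    base = pc & ADDR_MASK
    ctrl = base - CTRL_OFFSET
    return base, ctrl

gpu_base, gpu_ctrl = ram_base_and_ctrl(0x00F035AC)  # console entry point
dsp_base, dsp_ctrl = ram_base_and_ctrl(0x00F1B6AC)  # hypothetical DSP PC

print(hex(gpu_base), hex(gpu_ctrl))  # GPU RAM base and G_CTRL
print(hex(dsp_base), hex(dsp_ctrl))  # DSP RAM base and D_CTRL
```

The magic value goes to the base address, and the zero written to the control register is what stops the processor.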
So, we come back to today. I wanted to update the above tool to not only add the fast boot to my new carts, but to add a simple checksum that could be externally updated, disabled, etc. I figured "how much code can a checksum take? Should fit easily." I updated the patching tool to calculate and write the checksum, as well as let me set the cartridge width and speed parameters all in one.
Well... it turns out that when you have only 20-30 bytes available, and a true RISC instruction set (not one of those fully populated instruction sets that people call RISC today because it didn't have 3D acceleration 30 years ago), a checksum function gets a little tight!
But, after some tight maneuvering and some lessons from the local contortionist, I got the code to fit. The winning move was realizing that I did not need to store the 0x03D0DEAD value manually in memory, OR test if the checksum worked. I just added that magic value to the checksum in the header itself. That way, all the loop had to do was subtract the cart data from the checksum, and if everything was good, it would be left with 0x03D0DEAD. I just wrote the result and exited - the Jaguar BIOS would check if it passed or not! The code itself jumped around a bit, over the hash blocks and into the second area, but it looked good. The CD entry point wasted a few bytes jumping into the GPU entry point, but I was satisfied.
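To show why no compare is needed, here is a minimal Python model of the idea, assuming 32-bit words with wraparound (which is what the final listing further down uses). The function names are made up for illustration.

```python
MAGIC = 0x03D0DEAD   # value the BIOS looks for
MASK32 = 0xFFFFFFFF  # emulate 32-bit register wraparound

def header_value(words):
    """What the patching tool stores: plain sum plus the magic value."""
    return (sum(words) + MAGIC) & MASK32

def boot_loop(words, stored):
    """What the boot code does: subtract every word from the stored value."""
    acc = stored
    for w in words:
        acc = (acc - w) & MASK32
    return acc

cart = [0x12345678, 0xFFFFFFFF, 0x0BADF00D]
stored = header_value(cart)
print(hex(boot_loop(cart, stored)))        # good cart: the magic value

tampered = [cart[0] ^ 1] + cart[1:]
print(hex(boot_loop(tampered, stored)))    # modified cart: something else
```

A good cart leaves exactly the magic value behind, and anything else leaves garbage that the BIOS rejects on its own.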
Unfortunately, it crashed. And then I ran into the second problem with Atari Jaguar programming... there are no decent debug tools. Especially to debug GPU code that needs to be decrypted before it can even be examined and that the BIOS wipes out of paranoia after success or failure.
I dug out an old emulator that I'd used for debugging back in the day, and after a little time poking around I'd found some of the hooks I'd put in to aid with debugging. Fortunately, it included a disassembler, so I hacked in a GPU run-trace that executed when my encrypted code started to execute.
When my code started, I found to my surprise that the relative jumps were going all over the place, but certainly not where I intended. After some cursing, head scratching, and pacing, I finally decided to RTFM, which informed me that relative jumps have a range of -16/+15 words. My jumps were far larger: the MD5 block itself was 20 words long. And a non-relative jump requires an address in a register, meaning that somehow I'd need 6 bytes for a long plus the 2 for the jump. Some counting confirmed that was true for all but one of my jumps. Because of all the crap I had to squeeze around, that I didn't even use, I was out of space. Time for a change of approach.
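A quick way to see the space problem is to price each jump. This sketch assumes the JR offset is a 5-bit two's-complement word count (that field width is my assumption, so check it against the manual), and uses the byte costs from the paragraph above. Delay slots are ignored for simplicity.

```python
JR_BYTES = 2          # a relative jump is one 16-bit instruction
LONG_JUMP_BYTES = 8   # 6-byte MOVEI to load the address + 2-byte JUMP

def jr_fits(offset_words, bits=5):
    """True if the word offset fits in a signed field of `bits` bits (assumed width)."""
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    return lo <= offset_words <= hi

def jump_cost(offset_words):
    """Bytes needed to reach a target `offset_words` away."""
    return JR_BYTES if jr_fits(offset_words) else LONG_JUMP_BYTES

print(jump_cost(10))  # short hop, plain JR
print(jump_cost(20))  # hopping a 20-word block needs the long form
```

With only 20-30 bytes to play with, a few 8-byte jumps eat the whole budget.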
I went hunting through my archives and, on the AtariHQ CD, I found source for a PC version of the encryption tool. It took a little more digging to find the private and public keys (and two versions of the public key... I had to just test to see which one was right). But after a little fiddling I had a working version of the tool.
With the ability to control the encryption now under my command, I was able to add a mode that did NOT patch the code before encryption. No MD5 hash, no patching the start and stop addresses. Now all I had to worry about were the little fixups that the CD BIOS did. I rewrote the code to not skip over the MD5 hash, meaning the first block was entirely free to use (65 whole bytes! yes!), and had lots of room for the CD entry to load up a register and jump to the same entry point.
(Sorry I'm not showing the code bits here... I didn't save the intermediaries. We'll talk about proper use of source control another day...)
So I fired it up, and made the emulator disassemble all of GPU RAM when it started, so I could verify that all looked good. I had a bit of trouble with the decryption... although only 65 bytes of each block are /used/, the actual code works on multiples of 32 bits, meaning I needed to preserve 132 bytes of the encrypted data for it to actually work. Anyway, with the dump I was able to compare to the source code and prove that the decryption was working.
And it failed. I tried tracing the entire run, but it was a 2 megabyte cartridge, and the debugger output was slowing things down immensely. Finally I tweaked it up to just output the last 10 checksum steps. From that I could see that the result was way off.
I did a little inverse math and calculated what the Jaguar thought the checksum should be, and injected that into the header. It booted! Excellent. So now I knew that the checksum code worked.
I went back to my patching tool, and poked around trying to figure out why it was failing to generate the same checksum. I thought at first it was off by one (reading one too many or one too few words). Turned out to be off by 8192! How? Well.. there's an 8k overall header on the cart, and so my checksum code just skips over that (it's the decrypted area or unused, so technically already proven). However, when I /calculated/ the checksum, I forgot to skip it. ;) Fixing that fixed the checksum.
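The bug is easy to reproduce in miniature. This is a sketch rather than the real patching tool: it assumes big-endian 32-bit words and an 8 KB (0x2000 byte) header, and the function names are invented for the example.

```python
import struct

HEADER_BYTES = 0x2000   # the 8k header the boot code skips
MAGIC = 0x03D0DEAD
MASK32 = 0xFFFFFFFF

def sum_words(rom, start):
    """Sum every big-endian 32-bit word from byte offset `start` to the end."""
    count = (len(rom) - start) // 4
    words = struct.unpack_from(">%dI" % count, rom, start)
    return sum(words) & MASK32

def checksum_for_header(rom, skip_header=True):
    """Value to store in the header; skip_header=False reproduces the bug."""
    start = HEADER_BYTES if skip_header else 0
    return (sum_words(rom, start) + MAGIC) & MASK32

# Tiny fake cart: a non-zero header followed by four words of "game".
rom = b"\x01" * HEADER_BYTES + b"\x00\x00\x00\x02" * 4

fixed = checksum_for_header(rom)
buggy = checksum_for_header(rom, skip_header=False)
print(hex(fixed), hex(buggy))
```

The two values only agree if the header happens to sum to zero, which is why the mismatch looked like a broken checksum routine instead of a broken tool.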
And now the cart booted! Great! I started packaging it up.. then reluctantly decided I better test the CD entry point. "What could go wrong?" I asked. "It's just a jump."
I had previously hacked some really limited support for the CD BIOS into my emulator for this very reason - for proving the Skunkboard. But the DSP code in the emulator was a lot different than the GPU: the author had attempted pipeline emulation, and it had less debug help. It took a long time just to get into a state where I could prove it was even RUNNING, let alone what it was doing.
Ultimately, though, after back-and-forth testing with older carts and proven systems, I was able to prove it, and, surprise bloody surprise, it didn't work.
This caught me off guard. The Skunkboard boot was still working! So again, disassembled all memory when it started up and had a look-see. What did I see but a huge block of 0xff bytes right in the middle of my code, suspiciously aligned with the MD5 sum that I wasn't using anymore.
Yep, that's right, the CD BIOS scribbled over that memory to obfuscate the MD5 sum before jumping to the later entry point. My new entry point which just jumps back to the main code. Never had a problem on the Skunkboard code because both entry points just ran in their own space. %$@$#@.
I still had a lot of room in the CD entry area, but after counting bytes, there was not enough to duplicate the whole function. The GPU version of the code used more than 65 bytes, so spilled into the CD area. So what to do...?
Ultimately, I noticed that the checksum routine itself was intact; it was just the post-checksum code that got overwritten. So, I got clever and implemented the checksum code as a subroutine. The DSP version could call it and then handle its own post-checksum code, and the GPU version could do the same on its side. It didn't matter if the GPU-specific code was overwritten when the DSP was running, since it'd never be used. Subroutines are a bit messy on the Jaguar RISC: instead of anything traditional, you just store the PC in a register. But I was able to use that later for the device independent code (which, again, probably ended up unnecessary, but at this point the headache was starting... ;) )
While I was tracing the DSP version to understand the failures, I also realized that I needed to take into account the different address of the DSP RAM versus GPU RAM (they don't use local memory addresses, but system global addresses), and THEN I discovered that they don't even load to the same relative offset within RAM... meaning the code needed to be fully position-independent (except for the one jump.) (You can see the offset in the "SHARED+" line at the CD entry point.)
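Two bits of arithmetic fall out of this: the return address for the hand-rolled subroutine call, and where things land when the same code runs from DSP RAM. The sketch below assumes MOVE PC,Rn captures the address of the MOVE instruction itself, uses the instruction sizes mentioned earlier (6 bytes for MOVEI, 2 for everything else), and takes the $180B8 delta and the $48 CD entry offset from the final listing.

```python
SIZES = {"move": 2, "addq": 2, "movei": 6, "jump": 2, "nop": 2}

# The call sequence: MOVE PC,R6 / ADDQ #n,R6 / MOVEI #target,R0 / JUMP (R0) / NOP
CALL_SEQUENCE = ["move", "addq", "movei", "jump", "nop"]

def return_offset(seq=CALL_SEQUENCE):
    """Bytes from the MOVE PC instruction to the first instruction after the call."""
    return sum(SIZES[op] for op in seq)

GPU_LOAD = 0x00F035AC     # where the block lands in GPU RAM
DSP_DELTA = 0x180B8       # offset applied in the SHARED+ line
CD_ENTRY_OFFSET = 0x48    # CD entry point within the block

def dsp_address(gpu_address):
    """Where a GPU-relative address ends up when the CD BIOS loads the DSP."""
    return gpu_address + DSP_DELTA

print(return_offset())                                  # the ADDQ immediate
print(hex(dsp_address(GPU_LOAD + CD_ENTRY_OFFSET)))     # CD entry as the DSP sees it
```

The first result is where the #14 in the ADDQ comes from, and the second shows that the delta is not just the difference between the two RAM bases.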
I present to you the final version of the code, which works on both GPU and DSP boots:
; Jaguar cart encrypted boot.
; We need to deal with two entry points, one for Console and one for CD.
; This uses my custom version of the encryption tool that doesn't overwrite
; huge blocks, so the only patches we need to watch out for are the CD's.
; this frees up a lot of space for code.
; However, the CD overwrites a lot of data, particularly the MD5 hash,
; meaning you can't use that space after all if you want to work with the CD.
; This code works, but I don't necessarily have ALL the black-out areas marked.
; (Note, too, the CD unit runs this code on the DSP, not the GPU.)
 .org $00F035AC
; entry point for console
; before we unlock the system, do a quick checksum of the cart
; unlike the official code, we take the data for the checksum
; from unencrypted memory after the boot vector:
; 0x400 - boot cart width (original)
; 0x404 - boot cart address (original)
; 0x408 - flags (original)
; 0x410 - start address to sum
; 0x414 - number of 32-bit words to sum
; 0x418 - 32-bit checksum + 0x03D0DEAD (this is important! that's the wakeup value!)
; By adding the key to the checksum, and subtracting as we go through
; the cart, we can just write the result to GPU RAM and not spend any code
; on comparing it - the console will do that for us. That really helped this fit.
; Be careful.. SMAC does NOT warn when a JR is out of range. Only got 15 words!
; from here we have 32 bytes until the DSP program blocks out
 JR gpumode           ; skip over
 nop                  ; delay slot
SHARED:
 movei #$800410,r14   ; location of data
 load (r14),r15       ; get start
 load (r14+1),r2      ; get count - offset is in longs
 load (r14+2),r8      ; get desired checksum + key
chklp:
 load (r15),r4        ; get long
 subq #1,r2           ; count down
 addqt #4,r15         ; next address
 jr NZ,chklp          ; loop if not done on r2
 sub r4,r8            ; (delay) subtract value from checksum, result should be 0x3D0DEAD at end
 jump (r6)            ; back to caller
 nop                  ; delay slot
; 30/32
; have to break it up, cause the CD unit wipes the hash memory... GPU is okay though
gpumode:
 move pc,r6           ; set up for return
 addq #14,r6
 movei #SHARED,r0
 jump (r0)
 nop                  ; delay slot

 MOVEI #$00FFF000,R1  ; AND mask for address
 AND R1,R6            ; mask out the relevant bits of PC

 MOVEQ #0,R3          ; Clear R3 for code below

 MOVEI #$00000EEC,R2  ; Offset to chip control register
 STORE R8,(R6)        ; write the code (checksum result, hopefully 3d0dead)
 SUB R2,R6            ; Get control register (G_CTRL or D_CTRL)
GAMEOVR:
 JR GAMEOVR           ; wait for it to take effect
 STORE R3,(R6)        ; stop the GPU/DSP
; 68
; .org $f035f4 <-- doesn't work, we'll just pad then
 dc.w $5475,$7273
; JagCD entry point
; should be at f035f4 (72/$48)
; we should have 50 bytes here
 ; There is a CD relocation at $4A that we can't touch, the MOVEI covers it
 MOVEI #$12345678,R9  ; this movei is hacked by the CD boot

 MOVE PC,R6           ; prepare for subroutine
 ADDQ #14,R6
 MOVEI #SHARED+$180B8,R0  ; prepare for long jump (DSP offset included)
 JUMP (R0)            ; go do it
 NOP                  ; delay slot

 MOVEI #$00FFF000,R1  ; AND mask for address
 AND R1,R6            ; mask out the relevant bits of PC

 MOVEQ #0,R3          ; Clear R3 for code below

 MOVEI #$00000EEC,R2  ; Offset to chip control register
 STORE R8,(R6)        ; write the code (checksum result, hopefully 3d0dead)
 SUB R2,R6            ; Get control register (G_CTRL or D_CTRL)
DGAMEOVR:
 JR DGAMEOVR          ; wait for it to take effect
 STORE R3,(R6)        ; stop the GPU/DSP
; 44/50
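For anyone who would rather read the shared checksum routine in a high-level language, here is a rough Python model of it. It follows the header layout in the comments above (start address at 0x410, word count at 0x414, checksum plus key at 0x418) and the $800000 cart base implied by the $800410 load. The helper that builds a test image is purely hypothetical scaffolding, not part of any real tool.

```python
import struct

CART_BASE = 0x800000     # cart address space, as implied by the $800410 load
FIELDS_OFFSET = 0x410    # start / count / checksum+key live here
MAGIC = 0x03D0DEAD
MASK32 = 0xFFFFFFFF

def shared_routine(rom):
    """Model of the checksum loop: returns what gets written to local RAM."""
    start, count, acc = struct.unpack_from(">III", rom, FIELDS_OFFSET)
    offset = start - CART_BASE
    # Note: the assembly loop always runs at least once; this model
    # assumes count is 1 or more.
    for _ in range(count):
        (word,) = struct.unpack_from(">I", rom, offset)
        acc = (acc - word) & MASK32
        offset += 4
    return acc

def make_test_image(body_words):
    """Hypothetical helper: 8k zeroed header with the three fields, then data."""
    header = bytearray(0x2000)
    key = (sum(body_words) + MAGIC) & MASK32
    struct.pack_into(">III", header, FIELDS_OFFSET,
                     CART_BASE + 0x2000, len(body_words), key)
    return bytes(header) + struct.pack(">%dI" % len(body_words), *body_words)

image = make_test_image([1, 2, 0xFFFFFFFF])
print(hex(shared_routine(image)))  # magic value for an unmodified image
```

Because the three fields sit inside the skipped header, a tool can rewrite them at any time without changing the sum they describe.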

Anyway... it's nice to have it finally done. I'll post a comment below when I get it up on Github, I'm late for a meeting right now. ;)


  1. Now up:

  2. Hey can you provide a excutable version of makefastboot ? thanks :D

    1. Sorry about that, I don't check for comments often enough. ;) Windows binary is now up on the github page in the root.