Saturday, September 17, 2016

Atari Jaguar Programming Causes Brain Damage - Confirmation

Well, not necessarily. But after a few hours last night I had a crazy headache, bad enough that I ended up getting out of bed to take Advil for it. ;)
 
Anyway, I took a break from work and from my pending TI project, both of which are breaking my self-esteem at alarming rates, to wrap up a support project for my Atari Jaguar cartridge boards. When I first tested them, I attempted to burn a slightly modified version of Tempest (just some text string edits) -- only to find it didn't boot.
 
I realized after thinking it through that Jaguar carts are tested against an MD5 sum (to catch modification and bad contacts). So the MD5 hash would need to be updated too. That's not unusual; many game systems had a checksum or similar to confirm the game would work. The problem on the Jaguar is that the hash is buried in the proprietarily encrypted portion of the boot header - so I'd need to re-encrypt it too.
 
That's not the end of the world... the tools were discovered years ago and are out there. To date I'd used the Atari ST encryption tool to create a "fast boot" header for the Skunkboard... and learned a bit there. One interesting thing about that project was the discovery that the Jaguar CD subverted the boot process - our first pass didn't even work plugged into a Jag CD.
 
Atari Jaguar cartridges start with an encrypted boot header, broken into 65-byte blocks. Each block is encrypted with a full 520-bit key (I kind of wonder if that didn't violate export restrictions back in the mid-90s? I could Google but I won't...). It takes about half a second to decrypt one block, and the normal cartridge has ten of them. The code is decrypted into the GPU's RAM, where it is then executed. The code runs an MD5 hash on the cartridge, compares it to the one that was embedded, and if all looks good, it writes the magic value 0x03D0DEAD to the first GPU RAM address and exits. On exit, the BIOS checks for the magic value, and boots the cart if it sees it, or red-screens if it doesn't.
 
There is a small complication in altering this code in that the Encryption tool writes several values to fixed addresses before it encrypts the boot - specifically it stores the MD5 hash, some state information, and the first and last address of the cart. So our Skunkboard boot needed to be tolerant of that (we just left the areas unused).
 
00F035AC: MOVEI   $00F03566,R00    (9800) ; address to pass ROM check
00F035B2: MOVEI   $03D0DEAD,R04    (9804) ; magic value to unlock 68k
00F035B8: JUMP    (R00)            (D000) ; go do it!
00F035BA: NOP                      (E400) ; delay slot
 
That was literally it - we just wrote the magic value and jumped back to the startup code to handle the return. I got the code encrypted by hex-editing the Atari ST program and re-running it.
 
The CD unit works a little differently. Intentionally or just because they could (it's not clear to me why), they leave the GPU busy on a little VLM demo (courtesy of Yak!), and decrypt these blocks into the DSP instead. The DSP is a nearly-identical processor to the GPU, so okay, that's cute. But then the CD BIOS makes several hard-coded fix-ups to absolute addresses in the decrypted code (without checking if it's the code that it expects). Then it jumps past the first part of the decrypted code into a later entry point to do the MD5 hash.
 
So when we updated the Skunkboard boot for the CD unit, first we had to add a second block (because the jump point is past the first block), meaning we went from half a second to a full second boot. Then we had to document and avoid the manual patch areas. Finally, we had to be able to run on both the GPU and the DSP, despite them having different address bases. But, our code was extremely simple, so it wasn't hard to make it fit. We borrowed some Atari code to handle the device independence (although, since we duplicated the work, it ended up not being necessary), and it was fine.
 
 MOVEI #$00FFF000,R1  ; AND mask for address
 MOVEI #$00000EEC,R2  ; Offset to chip control register
 MOVEI #$03D0DEAD,R4  ; magic value for proceeding
 MOVE PC,R0           ; get the PC to determine DSP or GPU
 AND R1,R0            ; Mask out the relevant bits
 STORE R4,(R0)        ; write the code
 SUB R2,R0            ; Get control register (G_CTRL or D_CTRL)
 MOVEQ #0,R3          ; Clear R3 for code below
GAMEOVR:
 JR GAMEOVR           ; wait for it to take effect
 STORE R3,(R0)        ; stop the GPU
 
; Need an offset of $48 - this data is overwritten by the encrypt tool
; with the MD5 sum.
 NOP
 NOP
 MOVEI #$0,R0
 MOVEI #$0,R0
 MOVEI #$0,R0
 MOVEI #$0,R0
 MOVEI #$0,R0
 MOVEI #$0,R0
 
; JagCD entry point (same for now)
Main:
 ; There is a relocation at $4A that we can't touch
 MOVEI #$0,R0         ; dummy value
 ; real boot starts here
 MOVEI #$00FFF000,R1  ; AND mask for address
 MOVEI #$0,R0         ; This movei is hacked by the encryption tool
 MOVEI #$0,R0         ; This movei is hacked by the encryption tool
 MOVEI #$00000EEC,R2  ; Offset to chip control register
 MOVEI #$03D0DEAD,R4  ; magic value for proceeding
 MOVE PC,R0           ; get the PC to determine DSP or GPU
 AND R1,R0            ; Mask out the relevant bits
 STORE R4,(R0)        ; write the code
 SUB R2,R0            ; Get control register (G_CTRL or D_CTRL)
 MOVEQ #0,R3          ; Clear R3 for code below
GAMEOVR2:
 JR GAMEOVR2          ; wait for it to take effect
 STORE R3,(R0)        ; stop the DSP
 
Despite the extra code for device independence (which, as I noted, was unnecessary in the end), it's still pretty much the same. I actually released a 'makefastboot' tool which would prepend this header to any cart to make it boot in 1 second instead of 5.
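
The device-independence trick in that listing is plain address arithmetic, so it is easy to sanity-check on a PC. Here is a minimal Python sketch of it. The mask and offset come straight from the listing; the GPU address is the $00F035AC from the first disassembly, and the DSP address is an assumption on my part (any PC inside the DSP's local RAM would do for the demonstration).

```python
ADDR_MASK = 0x00FFF000    # AND mask from the listing
CTRL_OFFSET = 0x00000EEC  # "offset to chip control register" from the listing


def ram_base(pc):
    """Mask the PC down to the start of the local RAM the code is running in."""
    return pc & ADDR_MASK


def ctrl_register(pc):
    """Step back from the RAM base to the control register (G_CTRL or D_CTRL)."""
    return ram_base(pc) - CTRL_OFFSET


GPU_PC = 0x00F035AC  # from the disassembly earlier in the post
DSP_PC = 0x00F1B664  # assumed: some address inside DSP local RAM

for name, pc in (("GPU", GPU_PC), ("DSP", DSP_PC)):
    print(f"{name}: magic value goes to ${ram_base(pc):06X}, "
          f"control register at ${ctrl_register(pc):06X}")
```

The same two constants work for both processors because each one's control register sits the same distance below its own RAM, which is what lets one copy of the code serve both.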
 
So, we come back to today. I wanted to update the above tool to not only add the fast boot to my new carts, but to add a simple checksum that could be externally updated, disabled, etc. I figured "how much code can a checksum take? Should fit easily." I updated the patching tool to calculate and write the checksum, as well as let me set the cartridge width and speed parameters all in one.
 
Well... it turns out that when you have only 20-30 bytes available, and a true RISC instruction set (not one of those fully populated instruction sets that people call RISC today because it didn't have 3D acceleration 30 years ago), a checksum function gets a little tight!
 
But, after some tight maneuvering and some lessons from the local contortionist, I got the code to fit. The winning realization was that I did not need to store the 0x03D0DEAD value manually in the memory, OR test if the checksum worked. I just added that magic value to the checksum in the header itself. That way, all the loop had to do was subtract each 32-bit word of the cart from the checksum, and if everything was good, it would be left with 0x03D0DEAD. I just wrote the result and exited - the Jaguar BIOS would check if it passed or not! The code itself jumped around a bit, over the hash blocks and into the second area, but it looked good. The CD entry point wasted a few bytes jumping into the GPU entry point, but I was satisfied.
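
For anyone who wants to see the trick without reading RISC assembly, here is a rough Python model of it. This is only a sketch: the function names and the four-word "cart" are made up for illustration, and the real loop runs on the GPU, not in Python.

```python
MAGIC = 0x03D0DEAD   # the value the BIOS looks for
MASK32 = 0xFFFFFFFF  # the GPU registers are 32 bits wide, so everything wraps


def header_value(words):
    """What the patching tool stores: the sum of the cart words plus the magic value."""
    return (sum(words) + MAGIC) & MASK32


def boot_loop(words, stored):
    """What the boot code does: subtract every word, never compare anything."""
    acc = stored
    for w in words:
        acc = (acc - w) & MASK32
    return acc  # this is simply written to RAM; the BIOS does the comparing


cart = [0x12345678, 0xFFFFFFFF, 0x00000001, 0xDEADBEEF]  # hypothetical cart data
stored = header_value(cart)
print(hex(boot_loop(cart, stored)))   # 0x3d0dead

damaged = list(cart)
damaged[2] ^= 0x10                    # flip one bit
print(boot_loop(damaged, stored) == MAGIC)  # False
```

A good cart leaves exactly the magic value behind, and any difference in the data leaves something else, so the "compare and branch" code simply never has to exist.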
 
Unfortunately, it crashed. And then I ran into the second problem with Atari Jaguar programming... there are no decent debug tools. Especially to debug GPU code that needs to be decrypted before it can even be examined and that the BIOS wipes out of paranoia after success or failure.
 
I dug out an old emulator that I'd used for debugging back in the day, and after a little time poking around I'd found some of the hooks I'd put in to aid with debugging. Fortunately, it included a disassembler, so I hacked in a GPU run-trace that executed when my encrypted code started to execute.
 
When my code started, I found to my surprise that the relative jumps were going all over the place, but certainly not where I intended. After some cursing, head scratching, and pacing, I finally decided to RTFM. Which informed me that relative jumps have a range of -15/+16 words. My jumps were far larger. The MD5 block itself was 20 words long. And a non-relative jump requires an address in a register, meaning that somehow I'd need 6 bytes to load a long plus 2 for the jump. Some counting confirmed that was true for all but one of my jumps. Because of all the crap I had to squeeze around, that I didn't even use, I was out of space. Time for a change of approach.
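
Since the assembler didn't warn me, a few lines of Python would have. This is a minimal sketch of the range check, assuming the -15/+16 word range is measured from the address of the JR itself (that reference point is my reading of the manual, so treat it as an assumption). The addresses in the example are hypothetical.

```python
WORD = 2           # Jaguar RISC instructions are 16-bit words
JR_MIN_WORDS = -15  # range as quoted above, relative to the JR itself (assumed)
JR_MAX_WORDS = 16
LONG_JUMP_BYTES = 6 + 2  # MOVEI (6 bytes) to load the address + JUMP (2 bytes)


def jr_reaches(jr_addr, target_addr):
    """Return True if a relative JR at jr_addr can reach target_addr."""
    delta = target_addr - jr_addr
    if delta % WORD:
        return False  # targets must be word aligned
    return JR_MIN_WORDS <= delta // WORD <= JR_MAX_WORDS


# Hypothetical layout: a JR, its delay slot, then a 20-word block to skip over.
jr_addr = 0x00F035AC
target = jr_addr + 2 * WORD + 20 * WORD  # 22 words away

print(jr_reaches(jr_addr, target))           # False
print(jr_reaches(jr_addr, jr_addr + 8 * WORD))  # True
print(LONG_JUMP_BYTES)                       # 8
```

Every jump that fails the check costs 8 bytes instead of 2, which is how a 20-30 byte budget evaporates.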
 
I went hunting through my archives and, on the AtariHQ CD, I found source for a PC version of the encryption tool. It took a little more digging to find the private and public keys (and two versions of the public key... I had to just test to see which one was right). But after a little fiddling I had a working version of the tool.
 
With the ability to control the encryption now under my command, I was able to add a mode that did NOT patch the code before encryption. No MD5 hash, no patching the start and stop addresses. Now all I had to worry about were the little fixups that the CD BIOS did. I rewrote the code to not skip over the MD5 hash, meaning the first block was entirely free to use (65 whole bytes! yes!), and had lots of room for the CD entry to load up a register and jump to the same entry point.
 
(Sorry I'm not showing the code bits here... I didn't save the intermediaries. We'll talk about proper use of source control another day...)
 
So I fired it up, and made the emulator disassemble all of GPU RAM when it started, so I could verify that all looked good. I had a bit of trouble with the decryption... although only 65 bytes of each block are /used/, the actual code works on multiples of 32 bits, meaning I needed to preserve 132 bytes of the encrypted data for it to actually work. Anyway, with the dump I was able to compare to the source code and prove that the decryption was working.
 
And it failed. I tried tracing the entire run, but it was a 2 megabyte cartridge, and the debugger output was slowing things down immensely. Finally I tweaked it up to just output the last 10 checksum steps. From that I could see that the result was way off.
 
I did a little inverse math and calculated what the Jaguar thought the checksum should be, and injected that into the header. It booted! Excellent. So now I knew that the checksum code worked.
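
The "inverse math" is short enough to write down. A minimal Python sketch, with made-up numbers: if the header holds some wrong value and the loop ends on some result, then the value the header should have held is the old value, minus the result, plus the magic value (all modulo 2^32).

```python
MAGIC = 0x03D0DEAD
MASK32 = 0xFFFFFFFF


def corrected_header(stored, result):
    """Work backwards from a failed run to the header value that would have passed.

    The loop computes result = stored - sum(words), so the true sum is
    stored - result, and the header wants that sum plus MAGIC.
    """
    return (stored - result + MAGIC) & MASK32


# Hypothetical failed run: the tool stored the wrong value in the header.
true_sum = 0x89ABCDEF                      # what the cart words really add up to
bad_stored = 0x11111111                    # what the buggy tool wrote
result = (bad_stored - true_sum) & MASK32  # what the trace shows at the end

fixed = corrected_header(bad_stored, result)
print(fixed == (true_sum + MAGIC) & MASK32)  # True
```

This only needs the final value from the trace, which is why dumping just the last few checksum steps was enough.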
 
I went back to my patching tool, and poked around trying to figure out why it was failing to generate the same checksum. I thought at first it was off by one (reading one too many or one too few words). Turned out to be off by 8192! How? Well.. there's an 8k overall header on the cart, and so my checksum code just skips over that (it's the decrypted area or unused, so technically already proven). However, when I /calculated/ the checksum, I forgot to skip it. ;) Fixing that fixed the checksum.
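
Here is roughly what the fixed calculation looks like, as a Python sketch rather than my actual tool. The function names are hypothetical. The header offsets ($410 start address, $414 word count, $418 checksum plus key) are the ones documented in the final listing further down, the cart is assumed to be mapped at $800000, and words are read big-endian because the Jaguar is a 68000 machine.

```python
import struct

CART_BASE = 0x800000   # where the cart appears in the Jaguar memory map
HEADER_SIZE = 0x2000   # the 8k header the boot code skips over
FIELDS_OFFSET = 0x410  # start address, word count, checksum + key
MAGIC = 0x03D0DEAD
MASK32 = 0xFFFFFFFF


def patch_checksum(rom):
    """Sum everything AFTER the 8k header and write the three header fields."""
    body = rom[HEADER_SIZE:]
    if len(body) % 4:
        raise ValueError("cart body must be a whole number of 32-bit words")
    count = len(body) // 4
    words = struct.unpack(f">{count}I", body)
    value = (sum(words) + MAGIC) & MASK32
    struct.pack_into(">III", rom, FIELDS_OFFSET,
                     CART_BASE + HEADER_SIZE, count, value)
    return value


def simulate_boot(rom):
    """Model the boot loop: read the fields, subtract each word, return the result."""
    start, count, acc = struct.unpack_from(">III", rom, FIELDS_OFFSET)
    offset = start - CART_BASE
    for i in range(count):
        (word,) = struct.unpack_from(">I", rom, offset + 4 * i)
        acc = (acc - word) & MASK32
    return acc


rom = bytearray(HEADER_SIZE + 64)
rom[0:4] = b"\xDE\xAD\xBE\xEF"      # junk in the header, must not affect the sum
rom[HEADER_SIZE:] = bytes(range(64))  # hypothetical game data
patch_checksum(rom)
print(simulate_boot(rom) == MAGIC)    # True
```

Because the three fields live inside the skipped 8k, writing them does not disturb the sum, and the tool and the boot code stay in agreement as long as both start at the same place.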
 
And now the cart booted! Great! I started packaging it up.. then reluctantly decided I better test the CD entry point. "What could go wrong?" I asked. "It's just a jump."
 
I had previously hacked some really limited support for the CD BIOS into my emulator for this very reason - for proving the Skunkboard. But the DSP code in the emulator was a lot different than the GPU: the author had attempted pipeline emulation, and it had less debug help. It took a long time just to get it into a state where I could prove it was even RUNNING, let alone what it was doing.
 
Ultimately, though, after a lot of back-and-forth testing with older carts and proven systems, I was able to prove it out, and, surprise bloody surprise, it didn't work.
 
This caught me off guard. The Skunkboard boot was still working! So again, disassembled all memory when it started up and had a look-see. What did I see but a huge block of 0xff bytes right in the middle of my code, suspiciously aligned with the MD5 sum that I wasn't using anymore.
 
Yep, that's right, the CD BIOS scribbled over that memory to obfuscate the MD5 sum before jumping to the later entry point. My new entry point which just jumps back to the main code. Never had a problem on the Skunkboard code because both entry points just ran in their own space. %$@$#@.
 
I still had a lot of room in the CD entry area, but after counting bytes, there was not enough to duplicate the whole function. The GPU version of the code used more than 65 bytes, so spilled into the CD area. So what to do...?
 
Ultimately, I noticed that the checksum routine itself was intact; it was just the post-checksum code that got overwritten. So, I got clever and implemented the checksum code as a subroutine. The DSP version could call it and then handle its own post-checksum code, and the GPU version could do the same on its side. It didn't matter if the GPU-specific code was overwritten when the DSP was running, since it'd never be used. Subroutines are a bit messy on the Jaguar RISC: instead of anything traditional, you just store the PC in a register. But I was able to use that later for the device independent code (which, again, probably ended up unnecessary, but at this point the headache was starting... ;) )
 
While I was tracing the DSP version to understand the failures, I also realized that I needed to take into account the different address of the DSP RAM versus GPU RAM (they don't use local memory addresses, but system global addresses), and THEN I discovered that they don't even load to the same relative offset within RAM... meaning the code needed to be fully position-independent (except for the one jump.) (You can see the offset in the "SHARED+" line at the CD entry point.)
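
That magic offset breaks down into two parts, which a few lines of Python can show. This is a sketch under assumptions: the $F035AC load address and the $180B8 offset are from my code, while the RAM base addresses ($F03000 for the GPU, $F1B000 for the DSP) are what I understand the memory map to be.

```python
GPU_RAM = 0xF03000    # assumed base of GPU local RAM
DSP_RAM = 0xF1B000    # assumed base of DSP local RAM
GPU_LOAD = 0xF035AC   # where the boot code lands on the console (.org in the listing)
DSP_FIXUP = 0x180B8   # the constant added to SHARED at the CD entry point

dsp_load = GPU_LOAD + DSP_FIXUP

# Part 1: the two RAMs sit at different system addresses.
base_delta = DSP_RAM - GPU_RAM

# Part 2: the code is not loaded at the same offset inside each RAM.
load_delta = (dsp_load - DSP_RAM) - (GPU_LOAD - GPU_RAM)

print(f"DSP load address: ${dsp_load:06X}")
print(f"RAM base difference: ${base_delta:X}")
print(f"extra offset inside RAM: ${load_delta:X}")
```

Under those assumptions the $180B8 splits into $18000 for the different RAM bases and $B8 for the different load offset, which is why a simple base swap was not enough.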
 
I present to you the final version of the code, which works on both GPU and DSP boots:
 
; Jaguar cart encrypted boot.
; We need to deal with two entry points, one for Console and one for CD.
; This uses my custom version of the encryption tool that doesn't overwrite
; huge blocks, so the only patches we need to watch out for are the CD's.
; this frees up a lot of space for code.
; However, the CD overwrites a lot of data, particularly the MD5 hash,
; meaning you can't use that space after all if you want to work with the CD.
; This code works, but I don't necessarily have ALL the black-out areas marked.
; (Note, too, the CD unit runs this code on the DSP, not the GPU.)
.gpu
 .org $00F035AC
; entry point for console
; before we unlock the system, do a quick checksum of the cart
; unlike the official code, we take the data for the checksum
; from unencrypted memory after the boot vector:
;
; 0x400 - boot cart width (original)
; 0x404 - boot cart address (original)
; 0x408 - flags (original)
; 0x410 - start address to sum
; 0x414 - number of 32-bit words to sum
; 0x418 - 32-bit checksum + 0x03D0DEAD (this is important! that's the wakeup value!)
;
; By adding the key to the checksum, and subtracting as we go through
; the cart, we can just write the result to GPU RAM and not spend any code
; on comparing it - the console will do that for us. That really helped this fit.
; Be careful.. SMAC does NOT warn when a JR is out of range. Only got 15 words!
; from here we have 32 bytes until the DSP program blocks out
ENTRY:
 JR gpumode           ; skip over
 nop

SHARED:
 movei #$800410,r14   ; location of data
 load (r14),r15       ; get start
 load (r14+1),r2      ; get count - offset is in longs
 load (r14+2),r8      ; get desired checksum + key
chklp:
 load (r15),r4        ; get long
 subq #1,r2           ; count down
 addqt #4,r15         ; next address
 jr NZ,chklp          ; loop if not done on r2
 sub r4,r8            ; (delay) subtract value from checksum, result should be 0x3D0DEAD at end
 jump (r6)            ; back to caller
 nop
; 30/32
 
; have to break it up, cause the CD unit wipes the hash memory... GPU is okay though
gpumode:
 move pc,r6           ; set up for return
 addq #14,r6
 movei #SHARED,r0
 jump (r0)
 nop

 MOVEI #$00FFF000,R1  ; AND mask for address
 AND R1,R6            ; mask out the relevant bits of PC

 MOVEQ #0,R3          ; Clear R3 for code below

 MOVEI #$00000EEC,R2  ; Offset to chip control register
 STORE R8,(R6)        ; write the code (checksum result, hopefully 3d0dead)
 SUB R2,R6            ; Get control register (G_CTRL or D_CTRL)
GAMEOVR:
 JR GAMEOVR           ; wait for it to take effect
 STORE R3,(R6)        ; stop the GPU/DSP
; 68
 
; .org $f035f4 <-- doesn't work, we'll just pad then
 dc.w $5475,$7273
 
; JagCD entry point
; should be at f035f4 (72/$48)
; we should have 50 bytes here
dspmode:
 ; There is a CD relocation at $4A that we can't touch, the MOVEI covers it
 MOVEI #$12345678,R9  ; this movei is hacked by the CD boot

 MOVE PC,R6           ; prepare for subroutine
 ADDQ #14,R6
 MOVEI #SHARED+$180B8,R0  ; prepare for long jump (DSP offset included)
 JUMP (R0)            ; go do it
 NOP                  ; delay slot

 MOVEI #$00FFF000,R1  ; AND mask for address
 AND R1,R6            ; mask out the relevant bits of PC

 MOVEQ #0,R3          ; Clear R3 for code below

 MOVEI #$00000EEC,R2  ; Offset to chip control register
 STORE R8,(R6)        ; write the code (checksum result, hopefully 3d0dead)
 SUB R2,R6            ; Get control register (G_CTRL or D_CTRL)
DGAMEOVR:
 JR DGAMEOVR          ; wait for it to take effect
 STORE R3,(R6)        ; stop the GPU/DSP
; 44/50

 END
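
The byte counts in those comments ("30/32", "68", "44/50") and the ADDQ #14 return offsets can be double-checked with a quick tally. A minimal Python sketch, assuming MOVEI takes 6 bytes and every other instruction takes 2, and assuming MOVE PC gives the address of the MOVE itself:

```python
def size(mnemonics):
    """Bytes used by a run of instructions: MOVEI is 6 bytes, the rest are 2."""
    return sum(6 if m == "movei" else 2 for m in mnemonics)


# ENTRY through the end of the shared checksum subroutine
shared = ["jr", "nop", "movei", "load", "load", "load", "load",
          "subq", "addqt", "jr", "sub", "jump", "nop"]

# The call sequence; the return address must land just past its NOP
call = ["move", "addq", "movei", "jump", "nop"]

# Post-checksum code: write the result, then stop the processor
finish = ["movei", "and", "moveq", "movei", "store", "sub", "jr", "store"]

pad = 4  # the two dc.w words

gpu_total = size(shared) + size(call) + size(finish)
dsp_total = size(["movei"]) + size(call) + size(finish)

print(size(shared))                 # 30
print(gpu_total)                    # 68
print(hex(gpu_total + pad))         # 0x48
print(hex(0xF035AC + gpu_total + pad))  # 0xf035f4
print(size(call))                   # 14
print(dsp_total)                    # 44
```

The 14 is the important one: it is the distance from the MOVE PC to the instruction after the delay slot, so it has to change if anything in the call sequence does.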
 
Anyway... it's nice to have it finally done. I'll post a comment below when I get it up on GitHub; I'm late for a meeting right now. ;)
 

Saturday, September 3, 2016

Carts Resolved

As I posted on my Twitter... the issue was that the Chinese handheld clone I use (a 'PocketGame') was not compatible with Ecco Jr or Tides of Time, leading me to believe I'd made an error.
 
I measured all the voltages for all three games, and they were correct.
 
Then I desoldered the banking chips and manually jumpered the games. Again, the same result.
 
I fixed my hack of Gens to support the new scheme - the code all worked fine (on the plus side I fixed some sprite corruption I'd seen - the VDP DMA didn't take banking into account.)
 
I assumed a bad burn and burnt new EPROMs. This time I tried the first one in my test cart, and when Ecco Jr again failed, I got very suspicious. I went and got the original PCB for Ecco Jr, which I had thought defective (and which was the catalyst for me to do all this...) It didn't work in the handheld either.
 
So, I went and got my real Genesis. Then I needed to find a screen that took S-Video, since my Genesis is modded (I /like/ S-Video, damn it!) Turns out I only own one, but that was enough. And Ecco Jr worked fine there. Tried my test cart, worked.
 
Since I had new EPROMs burned, I repaired the first cart AND finished a second. I then put Ecco Jr in a new shell (I had destroyed the original label by covering it with my 3-way label, so that's disappointing. Hopefully I can download a nice copy of the label and reprint it). Both carts are working fine... and I am done for the moment.
 
If anyone wants a Thunder Force or an Ecco, I have three TF and 1 Ecco spare... they're $30 plus shipping to cover my costs. I can also make Jaguar repros since I have some spare PCBs there, but those will cost a bit more (because Jaguar) and you need to provide your own shells.
 
 
 

Friday, September 2, 2016

Carts, carts, carts...

Might have gone a little overboard here, but maybe that's good. I tried some new things and they worked out...
 
First off I went back and made a label for the Thunder Force III Multicart I made a while back. This was my labor of love to the series - full of patches to unlock everything I wanted from the series. ;) It was my first cartridge PCB from scratch, and it worked okay. I still have a few to get rid of though...
 
 
After that, a guy online followed my guide and created a Shinobi cart. He did a great job on the menu, and gave me a copy to help finish it. He wanted to use 4MB EPROMs instead of 2MB, so I modified the board for him. It still takes 2, but the whole multicart fits in just one now, so it's a bit cheaper to build, too.
 
 
 
But, after building that test cart, I saw a few things that still were not quite perfect, like the top left IC still bumping the posts inside the cart... so I tweaked that up. And while tweaking it up, I got to thinking that the one other series I wanted to do was Ecco the Dolphin. The only issue with Ecco... two of the games are 1MB, like my cart supports, but one of them is 2MB.
 
After some thought, I think I came up with a design that puts the second game on the second chip, so it can have the larger range, while keeping the first two on the first chip. Because I'm dumb, I decided to use 2MB chips. ;) (But, I used my last one, so one is a 4MB chip doubled up anyway). I went ahead and modified the PCB, but, I didn't order any at the moment. I just used jumpers for the build...
 
 
 
That's still burning, which is why I'm not in bed yet! Almost. ;) But the new layout lets me jumper for 2MB or 4MB chips (although eBay suggests there's no reason to get 2MB anymore)... it also lets me jumper the second chip for 1MB or 2MB banks, so my Ecco cart works more easily. ;)
 
Ecco was a software challenge too, though. Since I needed to fit two 1MB games in the first 2MB chip, I had to figure out the menu software. Turns out Ecco 1 has about 34k of unused space at the end of the cart. A little compression wasn't quite enough to get my 44k menu down, but stripping out a lot of the unused BASIC runtime was. ;)
 
Finally... earlier this week I put together a Jaguar cartridge PCB. It was the first time I did a shape, and the first task for my new calipers. ;) I made a few small mistakes, but nothing too critical, and it seems to work fine. Games normally run 32-bits wide on the Jaguar, but changing the header lets most run at 16 bits (as we learned on the Skunkboard). For my test cartridge I forgot to increase the configured bus speed to make up for the narrower width, but that didn't seem to impact it anyway. I have some sockets coming to make testing easier.
 
Anyway, I really disliked the original layout, and since I needed to fix it anyway, I redrew the lines a little more curvy, cause curvy PCBs are rare and fun. I also added the missing save EEPROM, since it turns out those are not too hard to get after all. This cart also jumpers for 2MB or 4MB, but since that's the size of a single Jaguar game I never intended multicart on this one. Just cheap single carts. ;)
 
 
Parts are funny... people sell empty Jag shells for $3 to $4 each. But for the Genesis, just grab entire boxes of sports titles for 25 cents a pop (as a bonus, a lot of sports games have 32k RAM chips I can put into my TIs. ;) ).
 
Well... picture layout isn't very good here, but at least it's easy. Looks like the burn is done, let me go finish that cart and see what happens.