Super FX is a Super NES enhancement chip developed by Argonaut Games and Nintendo. It is also known as the "Graphics Support Unit" (GSU for short) for its greater graphical capabilities compared to the S-CPU, whereas its first revision, used for Star Fox, goes by the name "Mathematical, Argonaut, Rotation, & Input/Output", or MARIO chip for short. It is also known for its use in Super Mario World 2: Yoshi's Island.
Throughout this article, GSU refers to the Super FX, whereas CPU refers to the Super NES CPU.
The embedded co-processor has a base clock speed of 10.74 MHz, which is four times as fast as the S-CPU with its base clock of 2.68 MHz. Its features include, but are not limited to:
- A RISC-like processor where most opcodes have an instruction size of one byte and are executed in a single cycle when in cache.
- 512 bytes of cache RAM for faster processing of instructions.
- A large memory capacity: up to 8 MiB of ROM, of which 2 MiB are shared by CPU and GSU, and up to 256 KiB of RAM, of which 128 KiB are shared by CPU and GSU.
- Separate buses for ROM and RAM to handle memory in parallel.
- Parallel processing with the CPU.
- Fast bitmap-to-planar conversion.
- Pipeline processing to fetch opcodes twice as fast, effectively increasing the processing speed to 21.48 MHz.
Memory and Bus
One advantage of Super FX is that it natively supports ROMs with a size of up to 8 MiB. However, the GSU itself can only use the first 2 MiB of a ROM. Similarly, even though a single cartridge may have up to 256 KiB of SRAM, only half of it can be used by the GSU. As a result, there is a difference between the CPU and GSU mapping.
This is how the ROM is mapped from the perspective of the CPU:
|$3000-$3FFF||Super FX registers|
|$40-$5F||$0000-$FFFF||Mirror of ROM in banks $00-$3F (HiROM)|
|$3000-$3FFF||Super FX registers|
The GSU memory map looks similar to the CPU mapping, but the GSU can only access ROM and SRAM, and only 2 MiB of ROM and 128 KiB of SRAM at that. As a result, it looks more like this:
|$40-$5F||$0000-$FFFF||Mirror of ROM in banks $00-$3F (HiROM)|
|$80-$FF||$0000-$FFFF||Mirror of $00-$7F|
Finally, banks $40-$5F are HiROM mirrors of the LoROM banks $00-$3F: each 64 KiB HiROM bank holds two consecutive 32 KiB LoROM banks, so addresses such as $008000 and $400000 refer to the same ROM byte.
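The relationship between the two views can be checked with a little address arithmetic. The following is a minimal sketch in Python, assuming the usual LoROM layout (32 KiB of ROM per bank at $8000-$FFFF) and HiROM layout (64 KiB per bank); the function names are made up for illustration and are not part of any tool.

```python
def lorom_to_offset(bank: int, addr: int) -> int:
    """ROM offset of a LoROM address in banks $00-$3F ($8000-$FFFF)."""
    if not (0x00 <= bank <= 0x3F and 0x8000 <= addr <= 0xFFFF):
        raise ValueError("not a LoROM address in banks $00-$3F")
    return bank * 0x8000 + (addr - 0x8000)


def hirom_to_offset(bank: int, addr: int) -> int:
    """ROM offset of a HiROM mirror address in banks $40-$5F."""
    if not (0x40 <= bank <= 0x5F and 0x0000 <= addr <= 0xFFFF):
        raise ValueError("not a HiROM address in banks $40-$5F")
    return (bank - 0x40) * 0x10000 + addr


def lorom_to_hirom(bank: int, addr: int) -> tuple[int, int]:
    """Translate a LoROM address into its HiROM mirror (bank, address)."""
    offset = lorom_to_offset(bank, addr)
    return 0x40 + (offset >> 16), offset & 0xFFFF


bank, addr = lorom_to_hirom(0x01, 0x8000)
print(f"${bank:02X}:{addr:04X}")  # second LoROM bank, upper half of HiROM bank $40
```

Under these assumptions, LoROM bank $01 lands in the upper half of HiROM bank $40, which is why the two views are described as interlaced.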
ROM on Super FX can be up to 8 MiB in size. However, even though that size is theoretically possible, the largest published Super FX games had a ROM size of 2 MiB. As a result, few emulators (not even bsnes as of 2021) emulate this feature properly, and such ROM sizes only exist in homebrew and modding.
Super FX also has limitations with ROM access. The first is the ROM size. Even though a Super FX ROM can be as large as 8 MiB (in theory), the GSU's data bus is only connected to 2 MiB of it. The portion of ROM which the GSU can access is called GamePak ROM, while the additional 6 MiB, called the Super NES ROM, can only be accessed by the CPU. The second limitation is the clock speed. The GSU is clocked at 10.74 MHz (which can be doubled with pipelining), but ROM is clocked at 3.58 MHz (unlike on SA-1, where ROM is clocked at the same frequency as the coprocessor). As a result, every opcode executed from ROM takes at least three cycles to process, and reads from ROM are very slow.
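Both limitations boil down to simple arithmetic. The sketch below, written in Python purely for illustration, uses the clock speeds and sizes quoted above; the constant and function names are assumptions of this example.

```python
MIB = 1024 * 1024

GSU_CLOCK_MHZ = 10.74        # base GSU clock as quoted in this article
ROM_CLOCK_MHZ = 3.58         # ROM clock as quoted in this article
GAMEPAK_ROM_SIZE = 2 * MIB   # portion of ROM wired to the GSU
MAX_ROM_SIZE = 8 * MIB       # theoretical maximum ROM size


def gsu_cycles_per_rom_access() -> int:
    """GSU cycles that pass during a single ROM access."""
    return round(GSU_CLOCK_MHZ / ROM_CLOCK_MHZ)


def rom_region(offset: int) -> str:
    """Name the ROM region an offset belongs to."""
    if not 0 <= offset < MAX_ROM_SIZE:
        raise ValueError("offset outside of the 8 MiB ROM space")
    return "GamePak ROM" if offset < GAMEPAK_ROM_SIZE else "Super NES ROM"


print(gsu_cycles_per_rom_access())
print(rom_region(0x200000))
```

The ratio of the two clocks is where the "at least three cycles" figure for code executed from ROM comes from.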
To counteract this limitation, the GSU uses a buffering system. In order to load a value from ROM, the bank must first be set in ROMBR (set by `ROMB` in the GSU code) while the address is set in register R14. Any write to R14 initiates the buffering process. The value can then be retrieved with `GETB` and similar opcodes (referred to as just `GETB` from now on).
The GSU doesn't wait while the fetch is in progress. As a result, it is possible to call `GETB` some time after writing to R14. In fact, this is even recommended, because calling `GETB` while the fetch is still in progress causes the GSU to halt for up to five cycles before executing `GETB`. Similarly, executing `ROMB` halts the code until the data has been fetched, whereas enabling the cache during the fetch corrupts the fetched ROM value instead.
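The timing behaviour of the ROM buffer can be modelled in a few lines. The following Python sketch is a simplification and not an emulator: it assumes a fixed fetch latency of five GSU cycles (taken from the "up to five cycles" figure above), only handles the HiROM banks $40-$5F, and uses a hypothetical `RomBuffer` class with a caller-supplied cycle counter.

```python
class RomBuffer:
    """Toy model of the GSU ROM buffer (assumed fixed fetch latency)."""

    FETCH_CYCLES = 5  # assumption: worst case quoted in the article

    def __init__(self, rom: bytes):
        self.rom = rom
        self.bank = 0x40     # ROMBR, HiROM banks $40-$5F only in this model
        self.value = 0       # buffered byte
        self.ready_at = 0    # cycle at which the buffered byte is valid

    def romb(self, bank: int, now: int) -> int:
        """Set ROMBR; returns the cycles stalled by a fetch in progress."""
        stall = max(0, self.ready_at - now)
        self.bank = bank
        return stall

    def write_r14(self, addr: int, now: int) -> None:
        """Writing R14 starts a fetch; the GSU itself keeps running."""
        offset = ((self.bank - 0x40) << 16) | (addr & 0xFFFF)
        self.value = self.rom[offset]
        self.ready_at = now + self.FETCH_CYCLES

    def getb(self, now: int) -> tuple[int, int]:
        """Return (byte, stall cycles); stalls if the fetch is unfinished."""
        return self.value, max(0, self.ready_at - now)


rom = bytes(range(256)) * 256          # 64 KiB of dummy ROM data
buf = RomBuffer(rom)
buf.write_r14(0x0010, now=0)
print(buf.getb(now=1))   # read too early: the GSU has to wait
print(buf.getb(now=7))   # other work was done in between: no wait
```

In this model, the stall shrinks the longer `GETB` is delayed, which is the reason for placing unrelated opcodes between the write to R14 and the `GETB`.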
In order to improve the speed of the processor, the GSU includes 512 bytes of memory which run at the same frequency as the GSU and serve as a cache. This allows Super FX to execute an opcode in a single cycle (i.e. three times as fast as from ROM). The cache is divided into 32 16-byte blocks, each with a flag which denotes that the block is in use, and is indexed by the cache base register (CBR), which keeps track of where the cache has been invoked. The cache flags and CBR are reset whenever a zero is written to the GO flag, but simply halting or invoking the GSU preserves the content of the cache, which in turn can be read by the CPU.
On the SNES side, the cache is mapped to addresses $xx:3100-$xx:32FF ($xx = $00-$3F, $80-$BF), though the start of the cached code (see below) also depends on the CBR (start = $3100 + (CBR & 0x1FF)).
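As a quick illustration of the formula, the following Python sketch computes the CPU-side address at which the cached code starts for a given CBR value. The helper names are hypothetical; only the constants come from the text above.

```python
CACHE_BASE = 0x3100   # first CPU-side cache address
CACHE_MASK = 0x1FF    # the cache is 512 bytes large


def cached_code_start(cbr: int) -> int:
    """CPU-side address where the code cached under this CBR begins."""
    return CACHE_BASE + (cbr & CACHE_MASK)


def is_cache_bank(bank: int) -> bool:
    """Banks in which the CPU sees the cache at $3100-$32FF."""
    return 0x00 <= bank <= 0x3F or 0x80 <= bank <= 0xBF


print(hex(cached_code_start(0x8230)))
```

For a CBR of $8230, only the low nine bits ($030) matter, so the code starts at $3130 from the CPU's point of view.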
There are two methods to fill the cache:
- Manual caching
- Automatic caching
Manual caching involves using the CPU to transfer the code to the cache. Keep in mind that Super FX will only execute cache blocks which have been marked as used. That means the code has to be transferred in full 16-byte blocks, though strictly speaking, only the last byte of a block (address $XXXF) needs to be written for the block to count as used. In order to execute GSU code in cache, R15 has to be set to $0000-$01FF.
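The rule about the last byte of each block can be pictured with a small model. This is a sketch in Python under simplifying assumptions: CBR is taken to be zero, the `CacheModel` class and its methods are invented for this example, and partial blocks are padded with `NOP` ($01).

```python
CACHE_BASE = 0x3100
CACHE_SIZE = 512
BLOCK_SIZE = 16
NOP = 0x01  # GSU NOP opcode, used here as padding


class CacheModel:
    """Toy model of CPU writes to the GSU cache (assumes CBR = 0)."""

    def __init__(self):
        self.ram = bytearray(CACHE_SIZE)
        self.used = [False] * (CACHE_SIZE // BLOCK_SIZE)

    def cpu_write(self, cpu_addr: int, value: int) -> None:
        offset = cpu_addr - CACHE_BASE
        if not 0 <= offset < CACHE_SIZE:
            raise ValueError("address outside of $3100-$32FF")
        self.ram[offset] = value
        # Only a write to the last byte of a block marks it as used.
        if offset % BLOCK_SIZE == BLOCK_SIZE - 1:
            self.used[offset // BLOCK_SIZE] = True

    def upload(self, cpu_addr: int, code: bytes) -> None:
        """Copy code in whole 16-byte blocks, padding the tail with NOP."""
        if (cpu_addr - CACHE_BASE) % BLOCK_SIZE:
            raise ValueError("upload must start on a block boundary")
        padding = -len(code) % BLOCK_SIZE
        for i, byte in enumerate(bytes(code) + bytes([NOP]) * padding):
            self.cpu_write(cpu_addr + i, byte)


cache = CacheModel()
cache.upload(0x3100, bytes([0x00]))   # a lone STOP, padded to one block
print(cache.used[0], cache.used[1])
```

Padding to a block boundary is simply the easiest way to guarantee that the byte at $XXXF gets written.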
Automatic caching is instead handled by the GSU itself, triggered either by the dedicated `CACHE` opcode or by an `LJMP` opcode, the latter because the cache holds no bank information. Super FX stores the current PC with the lowest nibble masked out into the CBR (CBR = PC & 0xFFF0), writes each executed opcode into the cache starting from CBR & 0x1FF and fills the following blocks from there. As a result, this use of the cache is recommended for loops, and the first iteration always runs slower than the remaining ones. Should all the blocks be used, any remaining code is left uncached and executed from its original memory.
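To get a feel for why the first iteration of a loop is the slow one, here is a rough cycle estimate in Python. It is only a sketch: it assumes that the loop consists of one-byte opcodes, that it starts right at the `CACHE` opcode, and that an opcode costs three cycles from ROM and one cycle from cache; the function names are hypothetical.

```python
CACHE_SIZE = 512


def cbr_from_pc(pc: int) -> int:
    """CBR value stored by CACHE: the PC with its lowest nibble cleared."""
    return pc & 0xFFF0


def loop_cycles(loop_bytes: int, iterations: int,
                rom_cycles: int = 3, cache_cycles: int = 1) -> int:
    """Estimate the cycles spent in a loop that is cached automatically."""
    if loop_bytes < 0 or iterations < 1:
        raise ValueError("need a non-negative size and at least one pass")
    cached = min(loop_bytes, CACHE_SIZE)      # part that fits into cache
    uncached = loop_bytes - cached            # rest keeps running from ROM
    first_pass = loop_bytes * rom_cycles      # cache is filled on the fly
    later_pass = cached * cache_cycles + uncached * rom_cycles
    return first_pass + (iterations - 1) * later_pass


print(hex(cbr_from_pc(0x8237)))
print(loop_cycles(16, 10))
```

With these assumptions, a 16-byte loop run ten times costs 48 cycles for the first pass and 16 cycles for each of the remaining nine.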
Another side effect is that the cache is neither ROM nor GamePak RAM, so any Super FX code in cache can run in parallel to the SNES even when access to ROM and RAM is given to the CPU instead of the GSU.
In order to boost the processing speed of the GSU, it uses a technique called "pipeline processing". Pipelining exists to a lesser extent on the 65c816, where opcodes are prepared while the remaining bytes of the opcodes are fetched (this is notable on single-byte opcodes such as `INC`, which take two cycles to process, the same amount of cycles as `LDA #$xx`, a two-byte opcode), but the GSU takes it further by fetching the next opcode during the current opcode's execution.
This runs the code twice as fast, virtually increasing the clock speed to 21.48 MHz, at the cost of requiring more care from the coder whenever R15 (the program counter) is changed, either through branching or through direct writes to R15. In addition, internal processes such as multiplication and memory access can't take advantage of the virtual increase in clock speed.
Pipelining is controlled with bit 0 of CLSR, where clear means pipelining is disabled and set means pipelining is enabled. It is accessible on the CPU side at address $3039.
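The effect of the bit can be summed up in a short Python sketch. The helper names are invented for this example, and the speed calculation simply follows the description above (bit 0 set doubles the effective speed).

```python
BASE_CLOCK_MHZ = 10.74
CLSR_PIPELINE_BIT = 0x01  # bit 0 of CLSR


def set_pipelining(clsr: int, enabled: bool) -> int:
    """Return the CLSR value with bit 0 set or cleared."""
    if enabled:
        return (clsr | CLSR_PIPELINE_BIT) & 0xFF
    return clsr & ~CLSR_PIPELINE_BIT & 0xFF


def effective_clock_mhz(clsr: int) -> float:
    """Effective GSU speed for a given CLSR value."""
    return BASE_CLOCK_MHZ * (2 if clsr & CLSR_PIPELINE_BIT else 1)


clsr = set_pipelining(0x00, True)
print(clsr, effective_clock_mhz(clsr))
```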
For the most part, having pipelining enabled doesn't change the code. The only exceptions are when R15 is modified outside of fetching opcodes, i.e. by branches and writes to R15 (e.g. `IWT R15,#$8000`), as well as when the GSU is halted, such as with `STOP`. In these cases, the opcode following that instruction has already been fetched and is still executed before the jump takes effect. A common solution for the unwanted opcode execution is to put a dummy `NOP` after the R15-modifying opcode, such as in this example:
        BCC SkipCode
        NOP             ; Dummy NOP
        ...
    SkipCode:
This proves to be inefficient, though. Most Super FX opcodes, including `NOP`, are one byte large and take 1 or 3 cycles to execute (depending on whether the code is executed in cache or not). As a result, opcodes such as `SUB R0` can be used as the dummy opcode in place of `NOP`. The following example demonstrates it:
        BCC SkipCode
        ;NOP            ; Redundant
        SUB R0          ; Set R0 to zero
        STW (R1)
    SkipCode:
        IWT R0,#$1234
In the above code, R0 gets overwritten by `SUB` no matter whether the branch is taken. However, it later gets overwritten by `IWT`. As a result, it doesn't matter for the calculation that the dummy opcode after the branch is a `SUB`. In fact, using a `NOP` increases the code size by one byte, which can matter when the code is executed in cache, and increases the cycle count by 1 or 3 cycles each time the branch is executed, causing a minor speed penalty in total.
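The behaviour of the example can be traced with a tiny simulation. The following Python sketch only models what this example needs: a taken `BCC` takes effect one opcode late, `SUB R0` clears R0 and `IWT` loads it. The tuple-based program format and the `run` function are assumptions made for this illustration and not a real assembler or emulator interface.

```python
def run(program, carry_clear, r0=0x5555):
    """Run a tiny subset of GSU code with pipelining enabled.

    program is a list of (mnemonic, argument) tuples; ("LABEL", name)
    marks a branch target. Returns (final R0, list of executed mnemonics).
    """
    labels = {arg: i for i, (op, arg) in enumerate(program) if op == "LABEL"}
    trace = []
    pc = 0
    pending = None  # target of a taken branch, applied after the next opcode
    while pc < len(program):
        op, arg = program[pc]
        pc += 1
        if op == "LABEL":
            continue
        trace.append(op)
        jump_now, pending = pending, None
        if op == "BCC" and carry_clear:
            pending = labels[arg]       # the jump happens one opcode later
        elif op == "SUB":
            r0 = 0                      # SUB R0 computes R0 - R0
        elif op == "IWT":
            r0 = arg                    # IWT R0,#value
        if jump_now is not None:
            pc = jump_now               # the opcode in the pipeline has run
    return r0, trace


program = [
    ("BCC", "SkipCode"),
    ("SUB", "R0"),
    ("STW", "(R1)"),
    ("LABEL", "SkipCode"),
    ("IWT", 0x1234),
]
print(run(program, carry_clear=True))
print(run(program, carry_clear=False))
```

In both runs the `SUB` is executed and R0 ends up as $1234; the only difference is that the `STW` is skipped when the branch is taken.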
This doesn't work for every opcode, though. Opcodes such as `ADD #x` (which is internally prefixed with `ALT2`), `WITH` (register prefixes which change which register is the source or destination), `BRA $10` (branches which always use two bytes) and `IWT` (immediate value transfers which use two or three bytes) can only be used as a dummy opcode in very specific circumstances. The latter two can even misalign the program counter, not unlike executing `LDA #$12` when A is in 16-bit mode on the 65816. The following example demonstrates it:
        BCC SkipCode
        BRA Error       ; Caution!
        ...
    SkipCode:
        ...
    Error:
        INC R1          ; Will instead be read with the BRA as "BRA $D1".
In addition, the dummy opcode after a `STOP` must be a `NOP` because putting the GSU in WAIT does not clear the opcode in the pipeline. Extra care should be taken when pipelined code is executed and a `STOP` is located at $XXXF in cache, as the pipelined opcode may be located in an unused cache block and a value in RAM will be loaded instead.