Super FX
Revision as of 17:59, 28 October 2021
Super FX is a Super NES enhancement chip developed by Argonaut Games and Nintendo. It is also known as the "Graphical Support Unit" (abbreviated "GSU") for its greater graphical capabilities compared to the S-CPU, whereas its first revision, used in Star Fox, is named "Mathematical, Argonaut, Rotation, & Input/Output", or MARIO chip for short. It is also known for its use in Super Mario World 2: Yoshi's Island.
Throughout this article, GSU refers to Super FX whereas CPU refers to the Super NES CPU.
Features
The embedded co-processor has a base clock speed of 10.74 MHz, four times as fast as the S-CPU, which uses a base clock of 2.68 MHz. Its features include, but are not limited to:
- A RISC-like processor where most opcodes have an instruction size of one byte and are executed in a single cycle when in cache.
- 512 bytes of cache RAM for faster processing of instructions.
- A large memory capacity: up to 8 MiB of ROM, of which 2 MiB are shared by CPU and GSU, and 256 KiB of RAM, of which 128 KiB are shared by CPU and GSU.
- A separate bus for ROM and RAM to handle memory in parallel.
- Parallel processing with the CPU.
- Fast Bitmap to Planar conversion.
- Pipeline processing to fetch opcodes twice as fast, effectively increasing the processing speed to 21.48 MHz.
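The clock figures quoted in this list can be checked with a few lines of arithmetic. This is only a sketch using the rounded values from the text; the constant names are made up for illustration.

```python
# Rounded clock speeds as quoted in the feature list (in MHz).
GSU_BASE_MHZ = 10.74
CPU_BASE_MHZ = 2.68

# The GSU base clock is roughly four times the S-CPU base clock.
ratio = GSU_BASE_MHZ / CPU_BASE_MHZ

# Pipelining fetches opcodes twice as fast, which the text describes
# as an effective speed of twice the base clock.
effective_mhz = GSU_BASE_MHZ * 2

print(f"GSU / CPU ratio: {ratio:.2f}")
print(f"Effective speed with pipelining: {effective_mhz:.2f} MHz")
```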
Technical Information
Hardware Registers
Memory and Bus
Memory Map
One advantage of Super FX is that it natively supports ROMs of up to 8 MiB. However, the GSU itself can only use the first 2 MiB of a ROM. Similarly, even though a single cartridge may have up to 256 KiB of SRAM, only half of it can be used by the GSU. As a result, the CPU and GSU mappings differ.
This is how memory is mapped from the perspective of the CPU:
Banks | Address | Description |
---|---|---|
$00-$3F | $0000-$1FFF | WRAM mirror |
$00-$3F | $2100-$21FF | PPU registers |
$00-$3F | $3000-$3FFF | Super FX registers |
$00-$3F | $4200-$43FF | CPU registers |
$00-$3F | $6000-$7FFF | SRAM mirror |
$00-$3F | $8000-$FFFF | ROM (LoROM) |
$40-$5F | $0000-$FFFF | Mirror of ROM in banks $00-$3F (HiROM) |
$60-$6F | $0000-$FFFF | Unmapped |
$70-$71 | $0000-$FFFF | SRAM |
$7C-$7D | $0000-$FFFF | Backup RAM |
$7E-$7F | $0000-$FFFF | WRAM |
$80-$BF | $0000-$1FFF | WRAM mirror |
$80-$BF | $2100-$21FF | PPU registers |
$80-$BF | $3000-$3FFF | Super FX registers |
$80-$BF | $4200-$43FF | CPU registers |
$80-$BF | $6000-$7FFF | SRAM mirror |
$80-$BF | $8000-$FFFF | ROM (LoROM) |
$C0-$FF | $8000-$FFFF | ROM (HiROM) |
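As an illustration of how the CPU-side table can be read, the following sketch looks up a 24-bit address and returns the matching description. The ranges are copied from the table above; the names `CPU_MAP` and `describe_cpu_address` are made up for this example, and addresses the table doesn't list are simply reported as "Not listed".

```python
# Regions that appear in both banks $00-$3F and banks $80-$BF.
SYSTEM_AREA = [
    (0x0000, 0x1FFF, "WRAM mirror"),
    (0x2100, 0x21FF, "PPU registers"),
    (0x3000, 0x3FFF, "Super FX registers"),
    (0x4200, 0x43FF, "CPU registers"),
    (0x6000, 0x7FFF, "SRAM mirror"),
    (0x8000, 0xFFFF, "ROM (LoROM)"),
]

# Each row: (first bank, last bank, first address, last address, description).
CPU_MAP = (
    [(0x00, 0x3F, lo, hi, name) for lo, hi, name in SYSTEM_AREA]
    + [
        (0x40, 0x5F, 0x0000, 0xFFFF, "Mirror of ROM in banks $00-$3F (HiROM)"),
        (0x60, 0x6F, 0x0000, 0xFFFF, "Unmapped"),
        (0x70, 0x71, 0x0000, 0xFFFF, "SRAM"),
        (0x7C, 0x7D, 0x0000, 0xFFFF, "Backup RAM"),
        (0x7E, 0x7F, 0x0000, 0xFFFF, "WRAM"),
    ]
    + [(0x80, 0xBF, lo, hi, name) for lo, hi, name in SYSTEM_AREA]
    + [(0xC0, 0xFF, 0x8000, 0xFFFF, "ROM (HiROM)")]
)


def describe_cpu_address(address):
    """Return the table description for a 24-bit CPU address."""
    bank = (address >> 16) & 0xFF
    offset = address & 0xFFFF
    for first_bank, last_bank, first, last, description in CPU_MAP:
        if first_bank <= bank <= last_bank and first <= offset <= last:
            return description
    return "Not listed"


print(describe_cpu_address(0x003030))  # an address in the Super FX register range
print(describe_cpu_address(0x7E0000))  # start of WRAM
```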
The GSU memory map looks similar to the CPU mapping, but the GSU can only access ROM and SRAM, and of those only 2 MiB of ROM and 128 KiB of SRAM. As a result, it looks more like this:
Banks | Address | Description |
---|---|---|
$00-$3F | $0000-$7FFF | Unmapped |
$00-$3F | $8000-$FFFF | ROM (LoROM) |
$40-$5F | $0000-$FFFF | Mirror of ROM in banks $00-$3F (HiROM) |
$60-$6F | $0000-$FFFF | Unmapped |
$70-$71 | $0000-$FFFF | SRAM |
$72-$7F | $0000-$FFFF | Unmapped |
$80-$FF | $0000-$FFFF | Mirror of $00-$7F |
Finally, note that banks $40-$5F are HiROM mirrors of the LoROM banks $00-$3F: the 32 KiB LoROM halves are joined into contiguous 64 KiB banks, so addresses such as $008000 and $400000 refer to the same ROM byte.
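The mirroring can be made concrete by converting both kinds of address to a ROM file offset. This is a minimal sketch of the GSU-side view, assuming the usual linear LoROM layout (32 KiB per bank) and HiROM layout (64 KiB per bank); the function name `rom_offset` is made up for this example.

```python
def rom_offset(address):
    """Return the ROM offset for a 24-bit GSU address, or None if it is not ROM."""
    bank = (address >> 16) & 0x7F  # banks $80-$FF mirror $00-$7F on the GSU side
    offset = address & 0xFFFF
    if bank <= 0x3F and offset >= 0x8000:
        # LoROM: each bank contributes its upper 32 KiB.
        return bank * 0x8000 + (offset - 0x8000)
    if 0x40 <= bank <= 0x5F:
        # HiROM mirror: each bank contributes a full 64 KiB.
        return (bank - 0x40) * 0x10000 + offset
    return None


# Both views reach the same bytes of the first 2 MiB of ROM.
print(hex(rom_offset(0x008000)), hex(rom_offset(0x400000)))
print(hex(rom_offset(0x018000)), hex(rom_offset(0x408000)))
```

Under these assumptions, every LoROM address in banks $00-$3F has exactly one HiROM counterpart in banks $40-$5F, and both end at offset $1FFFFF, which is the 2 MiB limit.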
ROM
RAM
Cache
Pipeline Processing
In order to boost its processing speed, Super FX uses a technique called "pipeline processing". Pipelining exists to a lesser extent on the 65816, where opcodes are prepared while their remaining bytes are fetched (this is noticeable on single-byte opcodes such as INC, which take two cycles to process, the same amount of cycles as LDA #$xx, a two-byte opcode), but the GSU takes it further by fetching the next opcode during the current opcode's execution.
This runs the code twice as fast, virtually increasing the clock speed to 21.48 MHz, at the cost of requiring more care from the coder whenever R15, the program counter, is changed, either through branching or through direct writes to R15. In addition, internal processes such as multiplication and memory access can't take advantage of the virtual increase in clock speed.
Pipelining is controlled with bit 0 of CLSR, where clear means pipelining disabled and set means pipelining enabled. It is accessible on the CPU side at address $3039.
For the most part, having pipelining enabled doesn't change the code. The only exceptions are when R15 is modified other than by regular opcode fetching, i.e. by branches and writes to R15 (e.g. IWT R15,#$8000), as well as when the GSU is halted, such as with STOP. In these cases, the opcode following the instruction has already been fetched and is still executed. A common solution is to put a dummy NOP after the R15-modifying opcode, such as in this example:
    BCC SkipCode
    NOP             ; Dummy NOP
    ...
    SkipCode:
This proves to be inefficient, though. Most GSU opcodes, including NOP, are one byte large and take 1 cycle when executed from cache and 3 cycles otherwise. As a result, a useful opcode such as SUB R0 can take the place of the dummy NOP. The following example demonstrates it:
    BCC SkipCode
    ;NOP            ; Redundant
    SUB R0          ; Set R0 to zero
    STW (R1)
    SkipCode:
    IWT R0,#$1234
In the above code, R0 is set to zero by the SUB whether or not the branch is taken. However, R0 is later overwritten by the IWT on either path. As a result, it doesn't matter for the calculation whether the dummy opcode is a NOP or the SUB that follows the branch. In fact, using a NOP increases the code size by one byte, which can matter when the code is executed from cache, and increases the cycle count by 1 or 3 cycles each time the branch is executed, causing a minor speed penalty overall.
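The behaviour described above can be sketched with a toy interpreter. This is not a GSU emulator: it only models the rule that the opcode directly after a taken branch is still executed. The instruction names, the `run` function and the initial register values are made up for illustration.

```python
# Toy encoding of the example: (mnemonic, argument).
PROGRAM = [
    ("BCC", "SkipCode"),   # branch if carry clear
    ("SUB_R0", None),      # R0 = R0 - R0 = 0, sits in the slot after the branch
    ("STW_R1", None),      # store R0 to the RAM address held in R1
    ("IWT_R0", 0x1234),    # SkipCode: load R0 with an immediate word
]
LABELS = {"SkipCode": 3}


def run(program, labels, carry):
    """Execute the toy program; the opcode after a taken branch still runs."""
    regs = {"R0": 0xBEEF, "R1": 0x0010}  # arbitrary starting values
    ram = {}
    executed = []
    pc = 0
    pending = None  # branch target that takes effect one opcode late
    while pc < len(program):
        op, arg = program[pc]
        executed.append(op)
        next_pc = pc + 1
        if pending is not None:
            # This opcode was already fetched when the branch was taken.
            next_pc, pending = pending, None
        if op == "BCC" and not carry:
            pending = labels[arg]
        elif op == "SUB_R0":
            regs["R0"] = 0
        elif op == "STW_R1":
            ram[regs["R1"]] = regs["R0"]
        elif op == "IWT_R0":
            regs["R0"] = arg
        pc = next_pc
    return regs, ram, executed


for carry in (False, True):
    regs, ram, executed = run(PROGRAM, LABELS, carry)
    print(carry, executed, hex(regs["R0"]), ram)
```

With the carry clear the branch is taken: the SUB still runs, the STW is skipped and R0 ends up as $1234. With the carry set all four opcodes run and R0 also ends up as $1234, so the SUB does no harm in the slot.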
This doesn't work for every opcode, though. Opcodes such as ADD #x (which is internally prefixed with ALT2), WITH (a register prefix which changes which register is the source or destination), BRA $10 (branches always use two bytes) and IWT (immediate value transfers, which use two or three bytes) can only be used as a dummy opcode in very specific circumstances. The latter two can even misalign the program counter, not unlike executing LDA #$12 on the 65816 when A is in 16-bit mode. The following example demonstrates it:
    BCC SkipCode
    BRA Error       ; Caution!
    ...
    SkipCode:
    ...
    Error:
    INC R1          ; Will instead be read with the BRA as "BRA $D1".
In addition, the dummy opcode after a STOP must be a NOP because putting the GSU in WAIT will not clear the opcode in the pipeline.