SA-1 Hardware Behavior

From SnesLab

This article focuses on the undefined or unknown behavior of the SA-1 chip. For the main article, see SA-1.

Registers

  • $2200-$22FF are write-only registers.
  • $2300-$23FF are read-only registers.
  • The SNES CPU can only read $2300. The rest is open bus.

Arithmetic Control

ROM used for testing: https://raw.githubusercontent.com/VitorVilela7/SnesSpeedTest/multest/sa1-multest.smc

The heading shows the values that will be stored for each test: $2251-$2254 and $2250.

The first five bytes are reads from $2250-$2254. They should return SA-1 "open bus".

The next columns show the results of tests 1-9, which are detailed under Validation tests.

A/X control the value of $2250.

The D-pad controls $2251-$2252.

Y/B and Start/Select control $2253-$2254.


$2250: ---- --cd

d: division flag.

c: cumulative sum.


If the cumulative sum flag is set, multiplication is always used regardless of the division flag. The other bits do absolutely nothing.
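
The flag priority described above can be sketched as a small decoder. This is only an illustrative model of the observed behavior; the function name `arithmetic_mode` and the returned strings are made up for this example.

```python
def arithmetic_mode(reg_2250: int) -> str:
    """Return the operation selected by a value written to $2250.

    Only bits 0 (d) and 1 (c) are looked at; the upper six bits are ignored.
    """
    cumulative = bool(reg_2250 & 0x02)  # bit 1: cumulative sum flag (c)
    division = bool(reg_2250 & 0x01)    # bit 0: division flag (d)
    if cumulative:
        # The cumulative sum flag wins over the division flag.
        return "cumulative sum"
    if division:
        return "division"
    return "multiplication"


for value in (0x00, 0x01, 0x02, 0x03, 0xFC):
    print(f"${value:02X} -> {arithmetic_mode(value)}")
```

Checking bit 1 first is what encodes "multiplication is always used regardless of the division flag".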

Multiplication

  • Bits 0 and 1 of $2250 are clear.
  • Total wait time is five 10.74 MHz cycles.
  • That means only a single NOP opcode is needed between STA $2254 and LDA $2306, since LDA $2306 itself takes three 10.74 MHz cycles already.
  • The result appears to be assigned starting from the most significant bit, with each bit shifted in until the last bit.

Division

  • Bit 0 of $2250 is set and bit 1 is clear.
  • Input: $2251-$2252 (dividend) is signed while $2253-$2254 (divisor) is unsigned.
  • Output: $2306-$2307 (quotient) is signed while $2308-$2309 (remainder) is unsigned.
  • Division by zero leads to $2308-$2309 (remainder) being a copy of the absolute value of $2251-$2252 (i.e. if value > $7FFF, value = (value ^ $FFFF) + 1), while $2306-$2307 (quotient) is set to $FFFF for positive values and $0001 for negative values.
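
The rules above can be sketched as a behavioral model. It assumes the unit divides the absolute value of the dividend as an unsigned number and then negates the quotient for negative dividends; that is only a guess that happens to fit the division-by-zero rule and several rows listed under Validation tests, not a confirmed description of the circuit. The function name `sa1_divide` is hypothetical.

```python
def sa1_divide(dividend: int, divisor: int) -> tuple:
    """Return (quotient, remainder) as raw 16-bit register values.

    dividend: raw value of $2251-$2252, interpreted as signed
    divisor:  raw value of $2253-$2254, interpreted as unsigned
    """
    dividend &= 0xFFFF
    divisor &= 0xFFFF
    negative = dividend > 0x7FFF
    # Absolute value, exactly as in the text: (value ^ $FFFF) + 1
    magnitude = ((dividend ^ 0xFFFF) + 1) & 0xFFFF if negative else dividend
    if divisor == 0:
        # Assumed: an unsigned divider yields all ones and leaves the dividend
        quotient, remainder = 0xFFFF, magnitude
    else:
        quotient, remainder = divmod(magnitude, divisor)
    if negative:
        quotient = -quotient & 0xFFFF  # negate; $FFFF becomes $0001
    return quotient, remainder


for a, b in ((0x53FE, 0x0037), (0x8000, 0x0101), (0x87F8, 0x0000)):
    q, r = sa1_divide(a, b)
    print(f"${a:04X} / ${b:04X} -> quotient ${q:04X}, remainder ${r:04X}")
```

Under this model the $FFFF/$0001 quotients on division by zero fall out of the same negate step as ordinary signed results, which is why no separate special case is needed for the sign.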

Validation tests

  • Applies to division so far.

The table below lists the expected values after waiting five 10.74 MHz clock cycles. The "Clock cycles" column documents the values measured with other wait times.


Assume a 16-bit accumulator and DP = $2300. The following sequences were tested:

  1. 8 cycles after writing to $2254 and reading $2306 first: sta $2253; nop; xba; lda $2306
  2. 7 cycles after writing to $2254 and reading $2306 first: sta $2253; nop; nop; lda $2306
  3. 6 cycles after writing to $2254 and reading $2306 first: sta $2253; xba; lda $2306
  4. 5 cycles after writing to $2254 and reading $2306 first: sta $2253; nop; lda $2306
  5. 4 cycles after writing to $2254 and reading $2306 first: sta $2253; lda $2306 // program counter after this op is odd.
  6. 3 cycles after writing to $2254 and reading $2306 first: sta $2253; lda $2306 // program counter after this op is even.
  7. 2 cycles after writing to $2254 and reading $2306 first: sta $2253; lda $06
  8. 2 cycles after writing to $2254 and reading $2308 first: sta $2253; lda $08
  9. 3 cycles after writing to $2254 and reading $2308 first: sta $2253; lda $2308 // pc is even.
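
The cycle counts in the list above can be reproduced with a small tally. It assumes NOP takes 2 cycles, XBA takes 3, and that the load spends one cycle per instruction byte (3 for absolute, 2 for direct page) before its first data read. The one-cycle difference between tests 5 and 6 is modelled as a `penalty` parameter, which is an interpretation and not something the tests prove. All names here are made up for the sketch.

```python
# 10.74 MHz cycles spent by filler opcodes between the store and the load
FILLER_CYCLES = {"nop": 2, "xba": 3}

# Cycles the load spends fetching its own bytes before the first data read
LOAD_FETCH_CYCLES = {"absolute": 3, "direct": 2}


def cycles_before_read(fillers, load_mode, penalty=0):
    """Cycles between the write to $2254 and the first result read."""
    filler_total = sum(FILLER_CYCLES[op] for op in fillers)
    return filler_total + LOAD_FETCH_CYCLES[load_mode] + penalty


tests = [
    (["nop", "xba"], "absolute", 0),  # test 1
    (["nop", "nop"], "absolute", 0),  # test 2
    (["xba"], "absolute", 0),         # test 3
    (["nop"], "absolute", 0),         # test 4
    ([], "absolute", 1),              # test 5 (odd program counter)
    ([], "absolute", 0),              # test 6 (even program counter)
    ([], "direct", 0),                # tests 7 and 8
]
for number, (fillers, mode, penalty) in enumerate(tests, start=1):
    print(number, cycles_before_read(fillers, mode, penalty))
```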


$2251 $2253 $2306 $2308 Clock cycles
#$87F8 #$FFFB #$0000 #$7808 Any?
#$87F8 #$0000 #$0001 #$7808 Any?
#$80F8 #$0000 #$0001 #$7F08 Any?
#$7FF8 #$0000 #$FFFF #$7F08 Any?
#$8000 #$0000 #$0001 #$8000 Any?
#$8000 #$8000 #$FFFF #$0000 $2308 is #$8000 for 8th test.
#$8000 #$492A #$FFFF #$36D6 $2308 is #$8000 for 8th test. $2308 toggles between #$3600 and #$36D6 for 9th test.
#$8000 #$0101 #$FF81 #$0081 $2308 is #$0FC0 for 8th test. $2308 toggles between #$0094 and #$0081 for 9th test.
#$53FE #$0037 #$0186 #$0034 $2308 is #$01FE for 8th test. $2308 toggles between #$00FE and #$0034 for 9th test.

Overall, division seems to be extremely stable and fast. Only $2308 is affected by reading too early. Since the tests run sequentially, the previous division was exactly the same operation, so it is worth asking whether that could make the results come out correct even when read early. That is a question for the next batch of tests.

Cumulative Sum

  • Bit 1 of $2250 is set. Bit 0 is ignored.
  • It accumulates consecutive multiplications. Basically a multiply-with-add circuit.
  • It ignores the division mode flag completely.
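
A multiply-with-add circuit of this kind can be modelled as follows. Two details are assumptions taken from commonly circulated SA-1 documentation and not from the tests in this article: the operands are treated as signed 16-bit values and the accumulator wraps at 40 bits. The class and method names are hypothetical.

```python
def to_signed16(value: int) -> int:
    """Interpret a raw 16-bit value as a signed number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


class CumulativeSum:
    WIDTH = 40  # assumed accumulator width in bits

    def __init__(self):
        self.total = 0

    def multiply_add(self, a: int, b: int) -> int:
        """Add a * b to the running total and return the wrapped result."""
        product = to_signed16(a) * to_signed16(b)
        self.total = (self.total + product) & ((1 << self.WIDTH) - 1)
        return self.total


acc = CumulativeSum()
acc.multiply_add(2, 3)            # running total: 6
acc.multiply_add(4, 5)            # running total: 26
print(acc.multiply_add(0xFFFF, 1))  # adds -1 under the signed assumption
```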

Memory Map

ROM

  • Speed: 5.37 MHz on the SA-1 and 2.68 MHz on the SNES (3.58 MHz if FastROM is used and the access goes through banks $80-$FF).
  • However, because the ROM has a 16-bit data bus, the effective speed is 10.74 MHz on the SA-1 CPU.
    • The chip always performs word reads when the SNES CPU accesses the ROM. This effectively makes the SA-1 chip access the ROM at 2.68 MHz (but with 5.37 MHz effective speed).
  • Cycle penalty: because of the 16-bit data bus, the SA-1 gets one or more 10.74 MHz cycle penalties when it:
    • Jumps to or reads from an odd address.
    • Reads only a single byte from the ROM.
    • Branches and ends up not using the higher half of the 16-bit word.
  • Overall, cycle penalties happen when it is not possible to take full advantage of the ROM's 16-bit data bus and when the chip has to multiplex its accesses with those of the SNES CPU.

I-RAM

  • Speed: 10.74 MHz on SA-1 and 3.58 MHz on SNES.
  • Cycle penalties are not well understood so far.
  • Maximum size is 2 kB.
  • Mapped on banks $00-$3F and $80-$BF.
  • Accessible on addresses $3000-$37FF on SNES CPU memory map.
  • Accessible on addresses $3000-$37FF and $0000-$07FF on SA-1 CPU memory map.
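
The I-RAM mapping listed above can be written as an address decoder. This is a sketch of the bullets only; the function name `iram_offset` and the `cpu` strings are made up for the example.

```python
from typing import Optional


def iram_offset(address: int, cpu: str) -> Optional[int]:
    """Return the I-RAM offset ($000-$7FF) for a 24-bit address, or None.

    cpu is "snes" or "sa1"; only the SA-1 also sees I-RAM at $0000-$07FF.
    """
    bank = (address >> 16) & 0xFF
    offset = address & 0xFFFF
    if not (bank <= 0x3F or 0x80 <= bank <= 0xBF):
        return None  # I-RAM is only mapped in banks $00-$3F and $80-$BF
    if 0x3000 <= offset <= 0x37FF:
        return offset - 0x3000
    if cpu == "sa1" and offset <= 0x07FF:
        return offset
    return None


print(iram_offset(0x003000, "snes"))
print(iram_offset(0x0007FF, "snes"))
print(iram_offset(0x8007FF, "sa1"))
```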

BW-RAM

  • Speed: 5.37 MHz on SA-1 and 2.68 MHz on SNES.
  • Two-phase access: when the SNES attempts to read or write it, the SA-1 must wait before accessing it. In the worst case, the effective speed for the SA-1 CPU is 2.68 MHz.
  • Mapped on banks $40-$4F on the SNES CPU side (up to 1 MiB).
  • Mapped on banks $40-$5F on the SA-1 CPU side (up to 2 MiB).
  • The SA-1 has enough pins for mapping up to 256 KB of BW-RAM. To use all of it, assign #$08 to $00:FFD8.

Virtual Memory

  • Virtual memory is only present on the SA-1 CPU memory map.
  • It is mapped on banks $60-$7F.
  • It allows reading only two or four bits of the BW-RAM at a time, which is useful for storing packed 4bpp or 2bpp pixels.
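
As an illustration, a packed-pixel read through the virtual memory window could look like the sketch below. It assumes one virtual byte per pixel, with the lowest bits of each BW-RAM byte coming first; that bit order follows common emulator implementations and is not confirmed by this article. The function name `bitmap_read` is hypothetical.

```python
def bitmap_read(bwram: bytes, virtual_index: int, bpp: int) -> int:
    """Read one 2-bit or 4-bit field of BW-RAM.

    virtual_index is the byte offset from the start of the virtual
    memory window; each virtual byte exposes a single bit field.
    """
    if bpp not in (2, 4):
        raise ValueError("only 2bpp and 4bpp are supported")
    fields_per_byte = 8 // bpp
    source = bwram[virtual_index // fields_per_byte]
    shift = (virtual_index % fields_per_byte) * bpp  # assumed: low bits first
    return (source >> shift) & ((1 << bpp) - 1)


bwram = bytes([0xA5])
print([bitmap_read(bwram, i, 4) for i in range(2)])
print([bitmap_read(bwram, i, 2) for i in range(4)])
```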

Variable Length Bit Processing

  • Only works with even addresses. The least significant bit is completely ignored.
    • That means ROM addresses like $00:8001 will be read as $00:8000 instead.
  • The address is expected to follow the actual SA-1 ROM memory map.
    • However, regions that do not map to ROM will just mirror to bank 0 instead.
      • For example, $00:0000-$00:7FFF is the same as $00:8000-$00:FFFF.
    • For banks $40-$7F, *everything* is a bank 0 mirror (LoROM!)
      • $41:8000 will actually mirror to $00:8000 (LoROM) and NOT $01:8000 as one might expect.
  • Banks $C0-$FF work as expected (HiROM).

Overall, divide the memory map into 32 KB blocks:

  • $00-$3F; $80-$BF: LoROM memory (starting at $8000)
  • $C0-$FF: HiROM memory map
  • Everything else: Mirror to bank 0.

Why are odd addresses completely ignored? The explanation is simple: the SA-1 ROM has a 16-bit bus, so the address is halved when it is passed to the ROM circuit, which discards the least significant bit. There is no special treatment such as reading the high byte first.
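
The address handling described in this section can be summarized in a short sketch. It returns the SA-1 address that is effectively read, not a ROM file offset, since the latter depends on the bank mapping registers. The function name `vlbp_effective_address` is made up for the example.

```python
def vlbp_effective_address(address: int) -> int:
    """Return the 24-bit SA-1 address actually used for the ROM read."""
    address &= 0xFFFFFE  # the least significant bit is ignored
    bank = address >> 16
    offset = address & 0xFFFF
    if bank >= 0xC0:
        return address  # HiROM region: used as is
    lorom_bank = bank <= 0x3F or 0x80 <= bank <= 0xBF
    if lorom_bank and offset >= 0x8000:
        return address  # LoROM region: used as is
    # Everything else mirrors the 32 KB block at $00:8000-$00:FFFF
    return 0x008000 | (offset & 0x7FFF)


for addr in (0x008001, 0x000000, 0x418000, 0xC12345):
    print(f"${addr:06X} -> ${vlbp_effective_address(addr):06X}")
```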

Automatic Mode

  • Only works when you write the number of shift bits and the automatic mode flag ($2258) after you write the ROM address to $2259-$225B.
  • However, this means the shifting already happens from the very beginning.

Registers

The $2258 settings are reset to #$00 after you write to $2259-$225B.
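
The ordering requirement can be illustrated with a tiny register model. The class and method names are hypothetical, the three address bytes are collapsed into a single write for brevity, and the control value #$88 is only an example value, since the bit layout of $2258 is not described in this article.

```python
class VarLenBitRegisters:
    """Models only the reset side effect between $2258 and $2259-$225B."""

    def __init__(self):
        self.control = 0x00        # $2258
        self.rom_address = 0x000000  # $2259-$225B

    def write_control(self, value: int) -> None:
        self.control = value & 0xFF

    def write_address(self, address: int) -> None:
        self.rom_address = address & 0xFFFFFF
        self.control = 0x00  # writing the address resets $2258


wrong = VarLenBitRegisters()
wrong.write_control(0x88)       # set up $2258 first...
wrong.write_address(0x008000)   # ...and lose it here
print(f"${wrong.control:02X}")

right = VarLenBitRegisters()
right.write_address(0x008000)   # address first
right.write_control(0x88)       # settings survive
print(f"${right.control:02X}")
```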

FastROM

You can enable FastROM on an SA-1 ROM.

However, when it is enabled, the SA-1 chip is paused whenever the SNES attempts a FastROM read, but it will not crash. Its controller is smart enough to allocate all of its I/O resources and hand them to the S-CPU.

This behavior is only present on actual FastROM accesses. When the ROM is accessed through DMA or banks $00-$7F, the SA-1 CPU speed is not affected and the accesses are treated as SlowROM reads.
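
The conditions above can be condensed into a predicate. This is a sketch of the observed behavior only; it assumes the access already targets ROM, and the function and parameter names are made up for the example.

```python
def sa1_paused_by_access(fastrom_enabled: bool, bank: int,
                         via_dma: bool = False) -> bool:
    """Return True if an S-CPU ROM access would pause the SA-1.

    fastrom_enabled: FastROM has been turned on through $420D
    bank:            bank byte of the S-CPU ROM access
    via_dma:         the access is performed by DMA
    """
    if not fastrom_enabled or via_dma:
        return False  # treated as a SlowROM read
    return 0x80 <= bank <= 0xFF  # only actual FastROM accesses pause the SA-1


print(sa1_paused_by_access(True, 0x80))
print(sa1_paused_by_access(True, 0x00))
print(sa1_paused_by_access(True, 0xC0, via_dma=True))
```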

This is extremely clever, even more so considering that it is only possible if the chip monitors changes to the $420D register.

Usage

Is there any point in using FastROM on an SA-1 ROM if the chip will end up getting paused? Yes!

You can always trigger an IRQ when calling the SA-1 from WRAM. That means that if only the SA-1 or only the S-CPU runs at any given time, you get the best speed of both, 10.74 MHz and 3.58 MHz, with one paused while the other is running.

However, there is no trace of any game using this behavior.