Software - 65816 - Speed considerations

DEC, INC, and shifts

With 16-bit data (m flag = 0), incrementing or decrementing a pointer or index twice is common. For example:

INC LABEL ; 7 cycles for dp, 8 cycles for abs
INC LABEL ; 7 cycles for dp, 8 cycles for abs

takes 14 cycles using direct page addressing or 16 cycles using absolute addressing. It is faster to use:

LDA LABEL ; 4 cycles for dp, 5 cycles for abs
INC       ; 2 cycles
INC       ; 2 cycles
STA LABEL ; 4 cycles for dp, 5 cycles for abs

which takes 12 cycles using direct page addressing or 14 cycles using absolute addressing. (The X or Y register could be used instead; the cycle count is the same.)

The same speed optimization can be made with DEC, ASL, LSR, ROL, and ROR (although shifting just twice is less common than incrementing or decrementing twice).
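As a sketch of the same pattern applied to a shift, the following multiplies a 16-bit memory word by 4. It assumes m = 0 and, for the direct page counts, a D register whose low byte is zero; LABEL stands for any direct page or absolute location, as above.

```asm
LDA LABEL ; 4 cycles for dp, 5 cycles for abs
ASL       ; 2 cycles
ASL       ; 2 cycles
STA LABEL ; 4 cycles for dp, 5 cycles for abs
```

This stands in for two read-modify-write ASL LABEL instructions. The carry, N, and Z flags end up the same, but the accumulator is overwritten, so the substitution only pays off when the accumulator is free.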

Bank 0 memory move up (or fill)

A very fast (3 to 3.5 cycles per byte) bank 0 memory move up (or down if the source and destination blocks do not overlap) is possible using the PEI instruction. The key is a sequence of 128 consecutive PEI instructions:

PEIFE PEI $FE
PEIFC PEI $FC
PEIFA PEI $FA
PEIF8 PEI $F8
; etc.
PEI06 PEI $06
PEI04 PEI $04
PEI02 PEI $02
PEI00 PEI $00
      DEY
      BEQ SKIP
      TDC
      SEC
      SBC #$100
      TCD
      JMP PEIFE
SKIP  DEC LAST
      BEQ DONE
      INY
      LDA LASTD
      TCD
      JMP (LASTPEI)
DONE

Example:

  • Start of source = $10DF
  • End of source = $1234
  • End of destination = $1236

A memory move up (e.g. the MVP instruction) works from the end of the block to the start of the block. The process is:

  1. Because the end of source is an even address, handle this byte specially: move the byte at $1234 to $1236
  2. Clear the x and m flags (16-bit registers)
  3. Save the D register and the stack pointer (it can be saved in X by using TSX)
  4. Set the stack pointer to $1235 (since the byte has already been moved to $1236)
  5. Set the D register to $1200
  6. Store 2 in LAST
  7. Because the start of source is an odd address, store $10E0 in LASTD
  8. Store the address of PEI1E in LASTPEI (there are $1E + 2 bytes from $10E0 to $10FF)
  9. Set Y to 2 and JMP to PEI32 (since the byte has already been moved from $1234)
  10. Restore the stack pointer and the D register
  11. Because the start of source is an odd address, handle this byte specially: move the byte at $10DF to $10E1
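
The steps above could be coded roughly as follows. This is a minimal sketch under several assumptions that the steps do not state: the processor is in native mode, the data bank register is 0, no interrupt can occur while the stack pointer is redirected, the assembler tracks the register widths set by SEP and REP, and LAST, LASTD, and LASTPEI are bank 0 variables outside the moved range that are assembled with absolute (not direct page) addressing. Saving D with PHD before the TSX is just one possible choice.

```asm
; before the loop (steps 1 to 9)
      SEP #$20     ; 8-bit accumulator for the single-byte move
      LDA $1234    ; step 1: end of source is even,
      STA $1236    ;   so move its byte by hand
      REP #$30     ; step 2: m = 0 and x = 0 (16-bit A, X, Y)
      PHD          ; step 3: save D on the original stack,
      TSX          ;   then keep the stack pointer in X
      LDA #$1235
      TCS          ; step 4: stack pointer = end of destination - 1
      LDA #$1200
      TCD          ; step 5: D = page of the last source word
      LDA #2
      STA LAST     ; step 6 (absolute addressing assumed)
      LDA #$10E0
      STA LASTD    ; step 7: start of source rounded up to even
      LDA #PEI1E
      STA LASTPEI  ; step 8: entry point for the final partial page
      LDY #2       ; step 9: two passes before LAST is first decremented
      JMP PEI32
; after the loop, continuing from DONE (steps 10 and 11)
      TXS          ; step 10: restore the stack pointer,
      PLD          ;   then D from the original stack
      SEP #$20     ; 8-bit accumulator again
      LDA $10DF    ; step 11: start of source is odd,
      STA $10E1    ;   so move its byte by hand
```

The loop itself uses only A and Y, which is why X can hold the saved stack pointer for the whole move.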

The first time through the loop, PEI32 to the DEY handles the source range $1200 through $1233. The low byte of the D register is zero, so PEI takes 6 cycles and moves 2 bytes, thus taking 3 cycles per byte moved.

After the DEY, Y is 1, so $100 is subtracted from the D register and it jumps to PEIFE. PEIFE to the DEY handles the source range $1100 through $11FF. The low byte of the D register is still zero, so PEI takes 3 cycles per byte moved.

After the DEY, Y is 0, so LAST is decremented. LAST is then 1, so Y is incremented to 1, the D register is set to $10E0, and it jumps to PEI1E. PEI1E to the DEY handles the source range $10E0 through $10FF. Because the low byte of the address of the start of source is non-zero, the low byte of the D register is non-zero, so PEI takes 7 cycles, or 3.5 cycles per byte moved.

After the DEY, Y is 0, so LAST is decremented. LAST is then zero so it exits.

Because the direct page and the stack are both always located in bank 0, this technique is limited to bank 0.

This technique can be used to fill memory, but there is one gotcha: store the fill byte at two locations:

  1. The end of the fill range
  2. The end of the fill range - 1

then use memory move up where:

  • start of source = start of fill range + 2
  • end of source = end of fill range
  • end of destination = end of fill range - 2

The reason that storing the fill byte only at the end of the range, then using

  • start of source = start of fill range + 1
  • end of source = end of fill range
  • end of destination = end of fill range - 1

doesn't work is that PEI reads two bytes from the direct page, then writes two bytes to the stack, rather than reading one byte, writing one byte, reading one byte, and writing one byte.
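
As an illustration, seeding the fill takes a single 16-bit store. The sketch below assumes a hypothetical fill range of $1000 through $1FFF, a fill byte of $AA, and a data bank register of 0; none of these values come from the text above.

```asm
      REP #$20     ; m = 0: 16-bit accumulator
      LDA #$AAAA   ; fill byte $AA in both halves of the word
      STA $1FFE    ; writes $AA to $1FFE (end - 1) and $1FFF (end)
; memory move up parameters for the PEI loop:
;   source      = $1002 through $1FFF  (start of fill range + 2 through end)
;   destination = $1000 through $1FFD  (ends at end of fill range - 2)
```

With these parameters the first PEI reads the seeded word at $1FFE and pushes it to $1FFC, and every later PEI reads a word that an earlier push has already filled.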
