Software 65816 Parameters On Stack

Background

On the 6502, you're limited to a 256 byte direct page in memory $0000-$00FF, which are best treated as a chunk of virtual CPU registers. Then, you had a stack at $0100-$01FF.

If you wanted to support reentrant programming without lots of manipulation in direct page, you'd find the otherwise underused addressing mode (dp,X) to be utterly invaluable. But then you have to manipulate the X register, which becomes a burden in its own right.

With the 65816, the direct page and stack both can overlap, and allow you to reference stack locations using direct-page addressing modes. The trick is to treat S as the CPU's native stack pointer, but D (the direct page base register) as a frame pointer.

Calling Procedures

Passing Parameters

There are a myriad of ways you can stuff things onto the stack. PHA is pretty obvious; but, there is also PHX, PHY now too. For numeric constants known at program assembly time, you can also use PEA (push effective address), PEI (push effective indirectly fetched address), and PER (push effective PC-relative address).

One caveat to be aware of is that PEA, PEI, and PER always push 16-bit values, regardless of the settings of the E, M, or X flags.

; ...
lda #1
pha
pea 2
jsr addThem
; ...

Left versus Right

If I have this chunk of high-level code:

result = addThem(1, 2);

in what order is the 1 and 2 pushed?

Whether to push arguments from left-to-right or right-to-left depends entirely on the programming language used to write the procedure you are calling. Due to C's epic influence on the computer industry, most people write procedures so that parameters are pushed right-to-left, meaning the right-most parameter is pushed first. This has the advantage of supporting variable-length argument lists (e.g., as with printf) without having to maintain an implicit, hidden parameter count variable. However, the latter is a perfectly valid approach too — I understand that some Pascal compilers do this.

To illustrate the differences:

; ,-- left-to-right  ,-- right-to-left
    pea 1             pea 2
    pea 2             pea 1
    jsr addThem       jsr addThem

Collect Your Garbage

When a procedure returns, you'll need to (eventually) clean up after yourself. You put the stack cruft there, you clean it up! (Or not; Pascal compilers typically do this task for you! But, for this example, we don't, because it turns out to produce faster code.)

If you only have a handful of items to pop off the stack, a bunch of PLA/PLX/PLY instructions can be used. For larger blocks of arguments, however, you can resort to more industrial-grade cleanup code:

; ...
jsr someProcedure
tsc
sec
sbc #someValueHere
tcs
; ...

What is the cut-off? Well, the above code from TSC to TCS is 9 cycles (assuming a 16-bit accumulator). PLA and friends takes 5 cycles (also assuming 16-bit wide registers). Therefore, if you're concerned about speed, you'll want to use the SBC-method if you have 2 or more items to pop. If you're more worried about space, you should use the SBC-method if you have more than 6 items to pop off. By the way, for two 16-bit pulls, note that a pair of INC A instructions at 2 cycles per instruction is faster than a SEC followed by a SBC immediate.

Assuming that the x flag is clear (16-bit index registers) and that the procedure preserves the X register (and returns with the x flag clear), another option is to use TSX after reserving space for the results (but before pushing input parameters), then using TXS after the procedure returns to discard the input parameters. For example:

; assumes 16-bit X register
;
phx               ; reserve space for the result(s)
phx
tsx               ; keep track of where the result(s) start
pea 0             ; push input parameters
pea $1234
jsl someProcedure
txs               ; adjust the stack pointer to point to the results

TSX is 2 cycles and TXS is 2 cycles for a total of 4 cycles, which is faster than a 5 cycle PLA when the m flag is clear (16-bit accumulator) or a 5 cycle PLX or PLY when the x flag is clear (16-bit index registers). If the procedure preserves all 16 bits of the accumulator and the m flag is clear (16-bit accumulator), TSC and TCS (which also take 2 cycles each) can be used instead of TSX and TXS.

Since all that pushing and popping takes time (one reason why Forth is faster than C on the Terbium architectures so far released), it is in your best interest to push and pop only when necessary. If you can re-use a stack cell between calls, do it! You'll thank yourself for doing so.

Returning Values

There are two methods to return values. The first is in CPU registers. This is pretty obvious, and is what I demonstrate below. But sometimes, that's not very convenient. If you want to produce two 32-bit numbers, for example, how do you return them? The preferred approach is to allocate space in the stack frame for a return value to be placed. For example (and you'll see code similar to this all over the place in Apple IIgs code):

; ...
pea 0  ; results
pea 0
pea param1
pha
pea param3
phx
jsr someProcedure
; results are stored at 11,s and 13,s

TIP: place your results to minimize caller overhead

When possible, return results in registers.

Otherwise, allocate space for the result first when calling a procedure:

pea 0  ; result
pea ...
pea ...
jsr aProcedure
pla
pla
pla ; A = result

pea 0
pea ..
pea ..
pea ..
pea ..
jsr anotherProcedure
tsc
sec
sbc #8
tcs
pla ; A = result

What about recycling an input parameter for use as a result? This is certainly doable; however, I doubt you'll find much compiler-produced code that performs this optimization though. It's probably best, for greatest compatibility with compilers, to keep the results separate from the input parameters.

Example

Let's write a tokenizer for a string. We're going to use asm_strtok_r, which we define in the next section, to do the grunt work for us. Notice how we recycle stack cells which are known to be "dead." This amortizes the time required fiddling with stack pointers and the like.

.data

myString:
  .byte "Hello world, how are you?",0

crlf:
  .byte 13,10,0

delims:
  .byte " ",0

.code
.proc printTokens
  pea 0         ; space for savePtr

  tsc           ; address of savePtr
  inc a
  pha
  pea delims    ; address of delimiters
  pea myString  ; address of string to tokenize

nextToken:
  jsr asm_strtok_r
  beq noMoreTokens

  sta 1,s       ; re-use a stack slot
  jsr printf
  lda crlf
  sta 1,s
  jsr printf

  lda #0        ; subsequent calls to strtok require NULL input pointer
  sta 1,s
  bra nextToken

noMoreTokens:
  pla           ; clean up the asm_strtok_r stack frame
  pla
  pla
  pla           ; release the savePtr space.
  rts

Writing Procedures

Receiving Arguments

I strongly advise you to preserve your working registers as the very first thing you do. It will liberate the code from constant PHx/PLx inside the code. Another way of looking at this is that caller-saves semantics applies; all internal CPU registers are best understood to be scratch registers.

Next, you'll need to establish your frame pointer. Since your routine doesn't know where D was pointing to at first, you must preserve it. Then, you reset its value to equal that of S.

.proc someProcedure
  pha  ; save working registers
  phx
  phy
  phd  ; save frame pointer and reset accordingly.
  tsc
  tcd

At this point, you have what is known as a "stack frame" pointed to by D. With the code above, it looks like this:

$01  Previous caller's D register
$03  Previous caller's Y register
$05  Previous caller's X register
$07  Previous caller's A register
$09  Previous caller's return address
$0B+ any parameters passed by the caller, but not in registers, appear here.

You can use direct-page addressing to reference the data in this frame.

;
  ldy #offset
  lda (11),y

You can even push and pop normal values, without affecting the direct page offsets. However, since most of your state appears in direct page now, and that all registers are scratch registers, the need to preserve stuff on the stack is rare. It does happen, though, particularly if you need to do stack-relative addressing. We'll see examples of this below.

;
  pha
  phx
  pla
  sta (11),y
  plx

Remember that the stack pointer is a byte pointer, not a word pointer, and that its behavior is analogous to a text cursor (e.g., it points to where the next byte goes, not where the last pushed byte is). This is why you always have to add 1 to each offset in the stack frame. Your registers are pointing as follows:

+---------+
|   ...   |
+---------+
|   RPC   |
+---------+
|    Yh   | -.
+---------+  |
|    Yl   |  |
+---------+  |
|    Xh   |  |
+---------+  |- Only those registers which your procedure modifies need be saved.
|    Xl   |  |
+---------+  |
|    Ah   |  |
+---------+  |
|    Al   | -'
+---------+
|    Dh   | -.
+---------+  |- This points to the previous stack frame, minus 1.
|    Dl   | -'
+---------+
|   ...   |     Temporary, statically allocated storage can appear here (e.g., local variables).
+---------+
|   ...   | <-- D points here.  S does too, until the next push.
+---------+
|   ...   |     Temporary, dynamically allocated storage can appear here (e.g., local variables of computed size).
+---------+
|   ...   | <-- S points here after all your pushing is done.
+---------+

Finally, regardless of where your stack pointer is set, you can quickly and easily exit from your routine like so:

;
  tdc
  tcs

  ; pop statically-allocated variables here,
  ; since they sit above the caller's D register.

  pld  ; remember to restore the previous caller's D register!
  ply
  plx
  pla
  rts
.endproc

Example Implementing strtok()

Here is an example routine that basically implements the C function strtok(), useful for simple input parsing in C. We invoke it with a stub, like this:

char *strtok(char *input, char *delims) {
  static char *savePtr;
  return asm_strtok_r(input, delims, savePtr);
}

The assembly code for for asm_strtok_r appears as follows:

.proc asm_strtok_r
    ; token = asm_strtok_r(input, delimiters, savePtr);
    ;
    ; Args:
    ; $01 = previous caller's D register
    ; $03 = return address from JSR.
    ;
    ; $05 = input pointer
    ;       On the first call, points to the start of the string
    ;       you want to parse.  On subsequent calls, this must be
    ;       NULL.
    ;
    ; $07 = delimiters pointer
    ;       Pointer to a string containing the delimiters used to
    ;       isolate one token from another.  This may change from
    ;       call to call.
    ;
    ; $09 = save pointer pointer
    ;       To preserve reentrancy, this must point to a character
    ;       pointer so that asm_strtok_r() can pick up from where
    ;       it left off the last time asm_strtok_r() was called.
    ;
    ; Result:
    ; NULL if no further tokens are in the input string.  Otherwise,
    ; a pointer to a NUL-terminated token.
    ;
    ; Preparations:
    ; 16-bit registers.

    ; Set up our stack frame.

    phd
    tsc
    tcd

    ; D points to one past our return address, so we may now refer
    ; to our inputs via direct page references.

    ; To start off with, we'll initialize our savePtr if it hasn't
    ; already been done.

    lda 5
    bne savePtrAlreadyValid
    sta (9)

savePtrAlreadyValid:
    lda (9)
    sta 5

    ; Our next step is to skip over the leading delimiters.  We'll
    ; use a subroutine called isInDelimiters, which we'll show below.

    phx
    ldx #0
    sep #P_M
    .a8

skipLeadingDelimiters:
    lda a:0,x
    beq noMoreTokens
    jsr isInDelimiters
    bne finishedScanningPastLeadingDelimiters
    inx
    bra skipLeadingDelimiters

    ; If we reach the end of the input string while still scanning through
    ; delimiters, then there obviously aren't any more tokens left to parse.

noMoreTokens:
    rep #P_M
    .a16
    plx
    pld
    lda #0
    rts

    ; Having skipped over all leading delimiters, it is now necessary to
    ; find the end of the current token.  A NUL always terminates a token.
    ; Others are provided in the delimiters string.

finishedScanningPastLeadingDelimiters:
    stx 5

lookForEndOfToken:
    lda a:0,x
    beq foundToken
    jsr isInDelimiters
    beq foundToken
    inx
    bra lookForEndOfToken

foundToken:
    stz a:0,x
    inx

    ; Update our savePtr so that we can resume from where we left off.

    rep #P_M
    .a16
    txa
    sta (9)

    lda 5
    plx
    pld
    rts
.endproc

To check if a character is in the delimiter set or not, we scan through the delimiter string. Since this is not a C-callable procedure, we don't bother with the stack frame business, and just leave D as it is. However, we do need a bit of extra space since we run out of usable registers (we cannot use X because, like Y, it must be 16-bits wide). So we actually use a stack-relative addressing mode to temporarily hold our search character.

.proc isInDelims
    ; Checks to see if the character in A is specified
    ; in the list of delimiters.
    ;
    ; Args:
    ; We are not establishing any new frame here, so our
    ; direct page offsets are all the same.
    ;
    ; $01 = previous caller's D register
    ; $03 = Return address from callee
    ; $05 = input (we don't use this)
    ; $07 = delimiters
    ; $09 = savePtr (we don't use this either)
    ;
    ; Result:
    ; EQ if the character IS in the set.  NE otherwise.
    ; 
    ; Preparations:
    ; A is 8-bits wide, X/Y are 16-bits wide.

    pha
    phy

    ldy 7

nextDelimiter:
    lda a:0,y
    beq characterNotInSet
    cmp 3,s   ; because 1,s is the preserved Y value.
    beq characterInSet
    iny
    bra nextDelimiter

characterNotInSet:
    ply
    pla
    rep #P_Z
    rts

characterInSet:
    ply
    pla
    sep #P_Z
    rts

Example Implementing index()

It turns out that our isInDelims procedure is actually pretty much exactly what is needed to implement the C function index(). Let's see how we can turn the above function into something that uses stacked parameters. We also alter the order of preserving registers, to show that it's doable, and in some cases, even preferable:

.proc index
    ; Finds the first occurrence of character c in string s.
    ;
    ; Args:
    ; $01 = previous D register
    ; $03 = preserved Y
    ; $05 = preserved A
    ; $07 = return address
    ; $09 = s
    ; $0B = c
    ;
    ; Returns:
    ; Address of the character c if found, NULL otherwise.
    ;
    ; Preparation:
    ; All registers are 16-bits.

    pha
    phy
    phd
    tsc
    tcd

    ldy 9
    sep #P_M
    .a8

tryAnother:
    lda a:0,y
    beq characterNotFound
    cmp 11
    beq characterFound
    iny
    bra tryAnother

characterNotFound:
    ldy #0

characterFound:
    sty 5   ; Sets the return "A" value.
    rep #P_M
    .a16
    pld
    ply
    pla
    rts
.endproc

I'll leave it as an exercise to the reader to retrofit strtok() to use index() as defined above.

Advanced Concepts

Many folks think that a hardware-preserved return address is a mandatory prerequisite to supporting subroutines. It turns out, this is not the case. It's just really nice to have, since it is the most common use-case.

You do need to preserve the return address, but compilers of higher-level languages are able to hard-code the return addresses at each call-site if required. It turns out this is much, much more powerful a concept, because you can do all sorts of neat things like "structured returns."

This is called continuation passing style, because the "continuation" (return address for our purposes) is handed around explicitly as a first-class parameter. Well, lo-and-behold, if you use D as a stack frame pointer, you have ready access to the implicitly generated continuation.

Popular 8-bit operating systems only rarely exploited this feature. One notable exception is the Commodore 64/128 version of GEOS, which had a number of kernel functions which took their parameters inline to the program. Even the Commodore 128 had a kernel procedure to print an inline string (search for PRIMM on the 6502.org website). The procedure had to make sure to "fix up" the return address so as to not accidentally return to the data you were working with, instead of the code you meant.

Even on modern 32-bit and 64-bit systems, this is rare. Only Forth, Lisp, and derived languages truly provide explicit support for manipulating continuations in this manner. However, an increasing bastion of functional programming languages are able to properly model such code, and will readily produce code that outperforms C in many contexts, due to the greater opportunities for optimization that are possible.

High-level Assembly with Continuations

One of the most fun things you can do in assembly coding is making it higher level without the aid of a higher level language. Inline text printing subroutines, like Commodore's PRIMM, come to mind. A lot of times, though, you'll want different control structures. Range-checking applications come to mind. Wouldn't it be nice if we could write something like:

; compute some value in A here

jsr dispatchOnRange
.word $0000, $00FF, itIsZeroPage
.word $0100, $01FF, itIsStackPage
.word $0200, $0800, itIsKernelSpace
.word $0801, $9FFF, itIsBASICSpace
.word $A000, $BFFF, itIsBASICROM
.word $C000, $CFFF, itIsUnused
.word $D000, $DFFF, itIsIOSpace
.word $E000, $FFFF, itIsKernelROM

jsr finishDoingWhateverHere

How would we implement dispatchOnRange? Here's one idea:

.proc dispatchOnRange
  ; $01 = caller's D
  ; $03 = caller's P
  ; $04 = caller's Y
  ; $06 = caller's X
  ; $08 = caller's A
  ; $0A = caller's return address (1 less than base address of structure)

  pha
  phx
  phy
  php
  phd
  tsc
  tcd

  ldy #1

again:
  lda (10),y  ; A = lower bound of current row
  cmp 8
  bcs isGreaterOrEqual
  tya
  adc #6
  tay
  bra again

isGreaterOrEqual:
  iny
  iny
  lda (10),y
  cmp 8
  bcc dispatch
  tya
  adc #3  ; remember carry is set!
  tay
  bra again

dispatch:
  iny
  iny
  lda (10),y
  dec a
  sta 10

  pld
  plp
  ply
  plx
  pla
  rts

The current code requires that all ranges be accounted for. I'll leave it as an exercise to the reader to derive a version of the routine that doesn't require this limitation.

Returning Values in Inaccessible Registers

Continuation passing also properly models a variety of different multitasking methods as well. And that brings us to the whole point of this intellectual exercise: calling a subroutine is not conceptually different than task switching. We can exploit this to make our coding lives easier, particularly if you intend on returning values in registers you normally don't have access to, or even returning multiple values. The trick is to preserve the caller's state on the stack. We've already seen an example of this above.

.proc dispatchOnRange
  ; $01 = caller's D
  ; $03 = caller's P
  ; $04 = caller's Y
  ; $06 = caller's X
  ; $08 = caller's A
  ; $0A = caller's return address (1 less than base address of structure)

  pha
  phx
  phy
  php
  phd
  tsc
  tcd

To switch to another thread, all you really need to do is change the S register in such a way that you can recover it at a later time. Or, if you wanted to compute some complex stuff and return the results in one or more registers:

; stuff computed here; jumps to a common exit point

return:
  stx 8   ; set A on return
  sty 4   ; pass Y on to caller as-is
  lda 3   ; load P
  ora #FM ; manipulate it somehow
  sta 3

  ; Recover caller's state

  pld
  plp
  ply
  plx
  pla
  rts

The only requirement is that you make sure your offsets are correct; remember that PHP pushes only an 8-bit value on the stack!

PCLSR'ing

When it comes to writing multitasking operating systems, you may find yourself in a situation where a kernel function has to be interrupted for some reason, but you ended up losing context such that you cannot restart. What do you do then?

There are two solutions. The Unix way involves setting the return code to a special sentinel that says, "I gotta go, but call me again and we'll finish up." This way isn't conceptually clean, but it works, and is proven.

Another method is called "PCLSRing," (pronounced PC-losering) which is just as tried and true as the Unix method. The core concept here is that a system call is, from the application programmer's perspective, indistinguishable from any other CPU instruction. When you think about it, that's pretty much true. And as with any other interrupted CPU instruction, it needs to be restarted. But how do you restart it?

Simple:

.proc interruptableKernelProcedure
  php  ; $09
  pha  ; $07
  phx  ; $05
  phy  ; $03
  phd  ; $01
  tsc
  tcd

pclsr:
  lda 10
  sec
  sbc #3  ; point PC back at the JSR that called this function
  sta 10

  ; At this point, make sure A, X, Y, and P are properly updated

leave:
  pld
  ply
  plx
  pla
  plp
  rts
.endproc
Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License