Table of Contents
|
Background
On the 6502, you're limited to a 256 byte direct page in memory $0000-$00FF, which are best treated as a chunk of virtual CPU registers. Then, you had a stack at $0100-$01FF.
If you wanted to support reentrant programming without lots of manipulation in direct page, you'd find the otherwise underused addressing mode (dp,X) to be utterly invaluable. But then you have to manipulate the X register, which becomes a burden in its own right.
With the 65816, the direct page and stack both can overlap, and allow you to reference stack locations using direct-page addressing modes. The trick is to treat S as the CPU's native stack pointer, but D (the direct page base register) as a frame pointer.
Calling Procedures
Passing Parameters
There are a myriad of ways you can stuff things onto the stack. PHA is pretty obvious; but, there is also PHX, PHY now too. For numeric constants known at program assembly time, you can also use PEA (push effective address), PEI (push effective indirectly fetched address), and PER (push effective PC-relative address).
One caveat to be aware of is that PEA, PEI, and PER always push 16-bit values, regardless of the settings of the E, M, or X flags.
; ...
lda #1
pha
pea 2
jsr addThem
; ...
Left versus Right
If I have this chunk of high-level code:
result = addThem(1, 2);
in what order is the 1 and 2 pushed?
Whether to push arguments from left-to-right or right-to-left depends entirely on the programming language used to write the procedure you are calling. Due to C's epic influence on the computer industry, most people write procedures so that parameters are pushed right-to-left, meaning the right-most parameter is pushed first. This has the advantage of supporting variable-length argument lists (e.g., as with printf) without having to maintain an implicit, hidden parameter count variable. However, the latter is a perfectly valid approach too — I understand that some Pascal compilers do this.
To illustrate the differences:
; ,-- left-to-right ,-- right-to-left
pea 1 pea 2
pea 2 pea 1
jsr addThem jsr addThem
Collect Your Garbage
When a procedure returns, you'll need to (eventually) clean up after yourself. You put the stack cruft there, you clean it up! (Or not; Pascal compilers typically do this task for you! But, for this example, we don't, because it turns out to produce faster code.)
If you only have a handful of items to pop off the stack, a bunch of PLA/PLX/PLY instructions can be used. For larger blocks of arguments, however, you can resort to more industrial-grade cleanup code:
; ...
jsr someProcedure
tsc
sec
sbc #someValueHere
tcs
; ...
What is the cut-off? Well, the above code from TSC to TCS is 9 cycles (assuming a 16-bit accumulator). PLA and friends takes 5 cycles (also assuming 16-bit wide registers). Therefore, if you're concerned about speed, you'll want to use the SBC-method if you have 2 or more items to pop. If you're more worried about space, you should use the SBC-method if you have more than 6 items to pop off. By the way, for two 16-bit pulls, note that a pair of INC A instructions at 2 cycles per instruction is faster than a SEC followed by a SBC immediate.
Assuming that the x flag is clear (16-bit index registers) and that the procedure preserves the X register (and returns with the x flag clear), another option is to use TSX after reserving space for the results (but before pushing input parameters), then using TXS after the procedure returns to discard the input parameters. For example:
; assumes 16-bit X register
;
phx ; reserve space for the result(s)
phx
tsx ; keep track of where the result(s) start
pea 0 ; push input parameters
pea $1234
jsl someProcedure
txs ; adjust the stack pointer to point to the results
TSX is 2 cycles and TXS is 2 cycles for a total of 4 cycles, which is faster than a 5 cycle PLA when the m flag is clear (16-bit accumulator) or a 5 cycle PLX or PLY when the x flag is clear (16-bit index registers). If the procedure preserves all 16 bits of the accumulator and the m flag is clear (16-bit accumulator), TSC and TCS (which also take 2 cycles each) can be used instead of TSX and TXS.
Since all that pushing and popping takes time (one reason why Forth is faster than C on the Terbium architectures so far released), it is in your best interest to push and pop only when necessary. If you can re-use a stack cell between calls, do it! You'll thank yourself for doing so.
Returning Values
There are two methods to return values. The first is in CPU registers. This is pretty obvious, and is what I demonstrate below. But sometimes, that's not very convenient. If you want to produce two 32-bit numbers, for example, how do you return them? The preferred approach is to allocate space in the stack frame for a return value to be placed. For example (and you'll see code similar to this all over the place in Apple IIgs code):
; ...
pea 0 ; results
pea 0
pea param1
pha
pea param3
phx
jsr someProcedure
; results are stored at 11,s and 13,s
TIP: place your results to minimize caller overhead
When possible, return results in registers.
Otherwise, allocate space for the result first when calling a procedure:
pea 0 ; result
pea ...
pea ...
jsr aProcedure
pla
pla
pla ; A = result
pea 0
pea ..
pea ..
pea ..
pea ..
jsr anotherProcedure
tsc
sec
sbc #8
tcs
pla ; A = result
What about recycling an input parameter for use as a result? This is certainly doable; however, I doubt you'll find much compiler-produced code that performs this optimization though. It's probably best, for greatest compatibility with compilers, to keep the results separate from the input parameters.
Example
Let's write a tokenizer for a string. We're going to use asm_strtok_r, which we define in the next section, to do the grunt work for us. Notice how we recycle stack cells which are known to be "dead." This amortizes the time required fiddling with stack pointers and the like.
.data
myString:
.byte "Hello world, how are you?",0
crlf:
.byte 13,10,0
delims:
.byte " ",0
.code
.proc printTokens
pea 0 ; space for savePtr
tsc ; address of savePtr
inc a
pha
pea delims ; address of delimiters
pea myString ; address of string to tokenize
nextToken:
jsr asm_strtok_r
beq noMoreTokens
sta 1,s ; re-use a stack slot
jsr printf
lda crlf
sta 1,s
jsr printf
lda #0 ; subsequent calls to strtok require NULL input pointer
sta 1,s
bra nextToken
noMoreTokens:
pla ; clean up the asm_strtok_r stack frame
pla
pla
pla ; release the savePtr space.
rts
Writing Procedures
Receiving Arguments
I strongly advise you to preserve your working registers as the very first thing you do. It will liberate the code from constant PHx/PLx inside the code. Another way of looking at this is that caller-saves semantics applies; all internal CPU registers are best understood to be scratch registers.
Next, you'll need to establish your frame pointer. Since your routine doesn't know where D was pointing to at first, you must preserve it. Then, you reset its value to equal that of S.
.proc someProcedure
pha ; save working registers
phx
phy
phd ; save frame pointer and reset accordingly.
tsc
tcd
At this point, you have what is known as a "stack frame" pointed to by D. With the code above, it looks like this:
$01 Previous caller's D register
$03 Previous caller's Y register
$05 Previous caller's X register
$07 Previous caller's A register
$09 Previous caller's return address
$0B+ any parameters passed by the caller, but not in registers, appear here.
You can use direct-page addressing to reference the data in this frame.
;
ldy #offset
lda (11),y
You can even push and pop normal values, without affecting the direct page offsets. However, since most of your state appears in direct page now, and that all registers are scratch registers, the need to preserve stuff on the stack is rare. It does happen, though, particularly if you need to do stack-relative addressing. We'll see examples of this below.
;
pha
phx
pla
sta (11),y
plx
Remember that the stack pointer is a byte pointer, not a word pointer, and that its behavior is analogous to a text cursor (e.g., it points to where the next byte goes, not where the last pushed byte is). This is why you always have to add 1 to each offset in the stack frame. Your registers are pointing as follows:
+---------+
| ... |
+---------+
| RPC |
+---------+
| Yh | -.
+---------+ |
| Yl | |
+---------+ |
| Xh | |
+---------+ |- Only those registers which your procedure modifies need be saved.
| Xl | |
+---------+ |
| Ah | |
+---------+ |
| Al | -'
+---------+
| Dh | -.
+---------+ |- This points to the previous stack frame, minus 1.
| Dl | -'
+---------+
| ... | Temporary, statically allocated storage can appear here (e.g., local variables).
+---------+
| ... | <-- D points here. S does too, until the next push.
+---------+
| ... | Temporary, dynamically allocated storage can appear here (e.g., local variables of computed size).
+---------+
| ... | <-- S points here after all your pushing is done.
+---------+
Finally, regardless of where your stack pointer is set, you can quickly and easily exit from your routine like so:
;
tdc
tcs
; pop statically-allocated variables here,
; since they sit above the caller's D register.
pld ; remember to restore the previous caller's D register!
ply
plx
pla
rts
.endproc
Example Implementing strtok()
Here is an example routine that basically implements the C function strtok(), useful for simple input parsing in C. We invoke it with a stub, like this:
char *strtok(char *input, char *delims) {
static char *savePtr;
return asm_strtok_r(input, delims, savePtr);
}
The assembly code for for asm_strtok_r appears as follows:
.proc asm_strtok_r
; token = asm_strtok_r(input, delimiters, savePtr);
;
; Args:
; $01 = previous caller's D register
; $03 = return address from JSR.
;
; $05 = input pointer
; On the first call, points to the start of the string
; you want to parse. On subsequent calls, this must be
; NULL.
;
; $07 = delimiters pointer
; Pointer to a string containing the delimiters used to
; isolate one token from another. This may change from
; call to call.
;
; $09 = save pointer pointer
; To preserve reentrancy, this must point to a character
; pointer so that asm_strtok_r() can pick up from where
; it left off the last time asm_strtok_r() was called.
;
; Result:
; NULL if no further tokens are in the input string. Otherwise,
; a pointer to a NUL-terminated token.
;
; Preparations:
; 16-bit registers.
; Set up our stack frame.
phd
tsc
tcd
; D points to one past our return address, so we may now refer
; to our inputs via direct page references.
; To start off with, we'll initialize our savePtr if it hasn't
; already been done.
lda 5
bne savePtrAlreadyValid
sta (9)
savePtrAlreadyValid:
lda (9)
sta 5
; Our next step is to skip over the leading delimiters. We'll
; use a subroutine called isInDelimiters, which we'll show below.
phx
ldx #0
sep #P_M
.a8
skipLeadingDelimiters:
lda a:0,x
beq noMoreTokens
jsr isInDelimiters
bne finishedScanningPastLeadingDelimiters
inx
bra skipLeadingDelimiters
; If we reach the end of the input string while still scanning through
; delimiters, then there obviously aren't any more tokens left to parse.
noMoreTokens:
rep #P_M
.a16
plx
pld
lda #0
rts
; Having skipped over all leading delimiters, it is now necessary to
; find the end of the current token. A NUL always terminates a token.
; Others are provided in the delimiters string.
finishedScanningPastLeadingDelimiters:
stx 5
lookForEndOfToken:
lda a:0,x
beq foundToken
jsr isInDelimiters
beq foundToken
inx
bra lookForEndOfToken
foundToken:
stz a:0,x
inx
; Update our savePtr so that we can resume from where we left off.
rep #P_M
.a16
txa
sta (9)
lda 5
plx
pld
rts
.endproc
To check if a character is in the delimiter set or not, we scan through the delimiter string. Since this is not a C-callable procedure, we don't bother with the stack frame business, and just leave D as it is. However, we do need a bit of extra space since we run out of usable registers (we cannot use X because, like Y, it must be 16-bits wide). So we actually use a stack-relative addressing mode to temporarily hold our search character.
.proc isInDelims
; Checks to see if the character in A is specified
; in the list of delimiters.
;
; Args:
; We are not establishing any new frame here, so our
; direct page offsets are all the same.
;
; $01 = previous caller's D register
; $03 = Return address from callee
; $05 = input (we don't use this)
; $07 = delimiters
; $09 = savePtr (we don't use this either)
;
; Result:
; EQ if the character IS in the set. NE otherwise.
;
; Preparations:
; A is 8-bits wide, X/Y are 16-bits wide.
pha
phy
ldy 7
nextDelimiter:
lda a:0,y
beq characterNotInSet
cmp 3,s ; because 1,s is the preserved Y value.
beq characterInSet
iny
bra nextDelimiter
characterNotInSet:
ply
pla
rep #P_Z
rts
characterInSet:
ply
pla
sep #P_Z
rts
Example Implementing index()
It turns out that our isInDelims procedure is actually pretty much exactly what is needed to implement the C function index(). Let's see how we can turn the above function into something that uses stacked parameters. We also alter the order of preserving registers, to show that it's doable, and in some cases, even preferable:
.proc index
; Finds the first occurrence of character c in string s.
;
; Args:
; $01 = previous D register
; $03 = preserved Y
; $05 = preserved A
; $07 = return address
; $09 = s
; $0B = c
;
; Returns:
; Address of the character c if found, NULL otherwise.
;
; Preparation:
; All registers are 16-bits.
pha
phy
phd
tsc
tcd
ldy 9
sep #P_M
.a8
tryAnother:
lda a:0,y
beq characterNotFound
cmp 11
beq characterFound
iny
bra tryAnother
characterNotFound:
ldy #0
characterFound:
sty 5 ; Sets the return "A" value.
rep #P_M
.a16
pld
ply
pla
rts
.endproc
I'll leave it as an exercise to the reader to retrofit strtok() to use index() as defined above.
Advanced Concepts
Many folks think that a hardware-preserved return address is a mandatory prerequisite to supporting subroutines. It turns out, this is not the case. It's just really nice to have, since it is the most common use-case.
You do need to preserve the return address, but compilers of higher-level languages are able to hard-code the return addresses at each call-site if required. It turns out this is much, much more powerful a concept, because you can do all sorts of neat things like "structured returns."
This is called continuation passing style, because the "continuation" (return address for our purposes) is handed around explicitly as a first-class parameter. Well, lo-and-behold, if you use D as a stack frame pointer, you have ready access to the implicitly generated continuation.
Popular 8-bit operating systems only rarely exploited this feature. One notable exception is the Commodore 64/128 version of GEOS, which had a number of kernel functions which took their parameters inline to the program. Even the Commodore 128 had a kernel procedure to print an inline string (search for PRIMM on the 6502.org website). The procedure had to make sure to "fix up" the return address so as to not accidentally return to the data you were working with, instead of the code you meant.
Even on modern 32-bit and 64-bit systems, this is rare. Only Forth, Lisp, and derived languages truly provide explicit support for manipulating continuations in this manner. However, an increasing bastion of functional programming languages are able to properly model such code, and will readily produce code that outperforms C in many contexts, due to the greater opportunities for optimization that are possible.
High-level Assembly with Continuations
One of the most fun things you can do in assembly coding is making it higher level without the aid of a higher level language. Inline text printing subroutines, like Commodore's PRIMM, come to mind. A lot of times, though, you'll want different control structures. Range-checking applications come to mind. Wouldn't it be nice if we could write something like:
; compute some value in A here
jsr dispatchOnRange
.word $0000, $00FF, itIsZeroPage
.word $0100, $01FF, itIsStackPage
.word $0200, $0800, itIsKernelSpace
.word $0801, $9FFF, itIsBASICSpace
.word $A000, $BFFF, itIsBASICROM
.word $C000, $CFFF, itIsUnused
.word $D000, $DFFF, itIsIOSpace
.word $E000, $FFFF, itIsKernelROM
jsr finishDoingWhateverHere
How would we implement dispatchOnRange? Here's one idea:
.proc dispatchOnRange
; $01 = caller's D
; $03 = caller's P
; $04 = caller's Y
; $06 = caller's X
; $08 = caller's A
; $0A = caller's return address (1 less than base address of structure)
pha
phx
phy
php
phd
tsc
tcd
ldy #1
again:
lda (10),y ; A = lower bound of current row
cmp 8
bcs isGreaterOrEqual
tya
adc #6
tay
bra again
isGreaterOrEqual:
iny
iny
lda (10),y
cmp 8
bcc dispatch
tya
adc #3 ; remember carry is set!
tay
bra again
dispatch:
iny
iny
lda (10),y
dec a
sta 10
pld
plp
ply
plx
pla
rts
The current code requires that all ranges be accounted for. I'll leave it as an exercise to the reader to derive a version of the routine that doesn't require this limitation.
Returning Values in Inaccessible Registers
Continuation passing also properly models a variety of different multitasking methods as well. And that brings us to the whole point of this intellectual exercise: calling a subroutine is not conceptually different than task switching. We can exploit this to make our coding lives easier, particularly if you intend on returning values in registers you normally don't have access to, or even returning multiple values. The trick is to preserve the caller's state on the stack. We've already seen an example of this above.
.proc dispatchOnRange
; $01 = caller's D
; $03 = caller's P
; $04 = caller's Y
; $06 = caller's X
; $08 = caller's A
; $0A = caller's return address (1 less than base address of structure)
pha
phx
phy
php
phd
tsc
tcd
To switch to another thread, all you really need to do is change the S register in such a way that you can recover it at a later time. Or, if you wanted to compute some complex stuff and return the results in one or more registers:
; stuff computed here; jumps to a common exit point
return:
stx 8 ; set A on return
sty 4 ; pass Y on to caller as-is
lda 3 ; load P
ora #FM ; manipulate it somehow
sta 3
; Recover caller's state
pld
plp
ply
plx
pla
rts
The only requirement is that you make sure your offsets are correct; remember that PHP pushes only an 8-bit value on the stack!
PCLSR'ing
When it comes to writing multitasking operating systems, you may find yourself in a situation where a kernel function has to be interrupted for some reason, but you ended up losing context such that you cannot restart. What do you do then?
There are two solutions. The Unix way involves setting the return code to a special sentinel that says, "I gotta go, but call me again and we'll finish up." This way isn't conceptually clean, but it works, and is proven.
Another method is called "PCLSRing," (pronounced PC-losering) which is just as tried and true as the Unix method. The core concept here is that a system call is, from the application programmer's perspective, indistinguishable from any other CPU instruction. When you think about it, that's pretty much true. And as with any other interrupted CPU instruction, it needs to be restarted. But how do you restart it?
Simple:
.proc interruptableKernelProcedure
php ; $09
pha ; $07
phx ; $05
phy ; $03
phd ; $01
tsc
tcd
pclsr:
lda 10
sec
sbc #3 ; point PC back at the JSR that called this function
sta 10
; At this point, make sure A, X, Y, and P are properly updated
leave:
pld
ply
plx
pla
plp
rts
.endproc