ack/doc/6500.doc

. \" $Header$"
.RP
.ND Dec 1984
.TL
.B
A backend table for the 6500 microprocessor
.R
.AU
Jan van Dalen
.AB
The backend table is part of the Amsterdam Compiler Kit (ACK).
It translates the intermediate language family EM to a machine
code for the MCS6500 microprocessor family.
.AE
.bp
.DS C
.B
THE MCS6500 MICROPROCESSOR.
.R
.DE
.NH 0
Introduction
.PP
Why a back end table for the MCS6500 microprocessor family?
Although the MCS6500 microprocessor family has a simple
instruction set and internal structure, it is used in a
variety of microcomputers and home computers
because of its low cost.
As an example, the Apple II, a well-known and widespread
microcomputer, uses the MCS6502 CPU.
The BBC home computer, whose popularity is growing day
by day, also uses the MCS6502 CPU.
The BBC home computer is based on the MCS6502 CPU although
better and more powerful microprocessors are available.
The designers at Acorn Computer Industries have probably
chosen the MCS6502 because of the amount of software
available for this CPU.
Because of its widespread use, a variety of software
is needed for it.
One can think of games, administration programs,
teaching programs, BASIC interpreters and other application
programs.
Even though it is not possible to run the whole compiler kit
on an MCS6500 based computer, it is possible to write application
programs in a high level language, such as Pascal or C, on a
minicomputer.
These application programs can be tested and compiled on that
minicomputer and put in a ROM (Read Only Memory), for example,
so that they can be executed by an MCS6500 CPU.
This strategy of writing test programs on a minicomputer,
compiling them and then executing them on an MCS6500 based
microcomputer was used during the development of the back end.
The minicomputer used is an M68000 based one, manufactured by
Bleasdale Computer Systems Ltd.
The micro- or home computer used is a BBC microcomputer,
manufactured by Acorn Computer Ltd.
.NH
The MOS Technology MCS6500
.PP
The MCS6500 is a family of CPU devices developed by MOS
Technology [1].
The members of the MCS6500 family are the same chip in
different packages.
The MCS6502, the big brother in the family, can address 64K
bytes of memory, while, for example, the MCS6504 can only address
8K bytes of memory.
This difference is due to the fact that the MCS6502 comes in a
40-pin package and the MCS6504 in a 28-pin package, so fewer
address lines are available.
.bp
.NH
The MCS6500 CPU programmable registers
.PP
The members of the MCS6500 series are based on the same chip, so all have the
same programmable registers.
.sp 9
.NH 2
The accumulator A.
.PP
The accumulator A is the only register on which the arithmetic
and logical instructions can be used.
For example, the instruction ADC (add with carry) adds a byte from
memory (or an immediate data byte) and the carry to the contents of
the accumulator A, leaving the result in A.
.NH 2
The index register X.
.PP
As the name suggests, this register is used as an index in some
indexed and indirect addressing modes.
These modes are explained below.
.NH 2
The index register Y.
.PP
This register is, just as the index register X, used for
certain indirect addressing modes.
These addressing modes are different from the modes which
use index register X.
.NH 2
The program counter PC
.PP
This is the only 16-bit register available.
It is used to point to the next instruction to be
carried out.
.NH 2
The stack pointer SP
.PP
The stack pointer is an 8-bit register, so the stack can contain
at most 256 bytes.
The CPU always uses 00000001 as the highbyte of any stack address,
which means that memory locations
.B
0100
.R
through
.B
01FF
.R
are permanently assigned to the stack.
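.PP
The behaviour of the stack pointer can be sketched in C.
This is a minimal model, assuming the usual MCS6500 convention that a
push stores at 0100 + SP and then decrements SP, while a pull increments
SP first; the type Cpu and the functions stack_addr, push and pull are
hypothetical names used only for illustration.

```c
#include <stdint.h>

/* Minimal model of the MCS6500 hardware stack.  The 8-bit stack
   pointer is always combined with high byte 0x01, so the stack lives
   in 0x0100-0x01FF and grows downward. */
typedef struct {
    uint8_t mem[65536];
    uint8_t sp;
} Cpu;

static uint16_t stack_addr(uint8_t sp)
{
    return (uint16_t)(0x0100 | sp);
}

static void push(Cpu *c, uint8_t v)
{
    c->mem[stack_addr(c->sp)] = v;  /* store first ...             */
    c->sp--;                        /* ... then decrement (wraps)   */
}

static uint8_t pull(Cpu *c)
{
    c->sp++;                        /* increment first ...          */
    return c->mem[stack_addr(c->sp)];
}
```

Because SP is only eight bits wide, the decrement simply wraps from 00 to
FF, so a 257th push overwrites the first byte that was pushed.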
.sp 12
.NH 2
The status register
.PP
The status register maintains six status flags and a master
interrupt control bit.
.br
These are the six status flags:
    Carry        (c)
    Zero         (z)
    Overflow     (o)
    Sign         (n)
    Decimal mode (d)
    Break        (b)


The bit (i) is the master interrupt control bit.
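.PP
For reference, the flags can be written as bit masks.
The bit positions below are the standard MCS6500 layout (bit 5 is
unused); the enum names and the helper flags_set are illustrative and
are not taken from the back end.

```c
#include <stdint.h>

/* Status register bit masks (standard MCS6500 layout; bit 5 unused). */
enum {
    FLAG_C = 0x01,   /* carry                        */
    FLAG_Z = 0x02,   /* zero                         */
    FLAG_I = 0x04,   /* master interrupt control bit */
    FLAG_D = 0x08,   /* decimal mode                 */
    FLAG_B = 0x10,   /* break                        */
    FLAG_O = 0x40,   /* overflow (often written V)   */
    FLAG_N = 0x80    /* sign (negative)              */
};

/* Nonzero when every flag in mask is set in the status byte. */
static int flags_set(uint8_t status, uint8_t mask)
{
    return (status & mask) == mask;
}
```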
.NH
The MCS6500 memory layout.
.PP
In the MCS6500 memory space three areas have special meaning.
These areas are:
.IP 1)
Top page.
.IP 2)
Zero page.
.IP 3)
The stack.
.PP
MCS6500 memory is divided up into pages.
These pages consist of 256 bytes.
So in a memory address the highbyte denotes the page number
and the lowbyte the offset within the page.
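.PP
A small sketch of this split, with hypothetical helper names, also
shows how the zero page and the stack page are recognised from the
high byte alone.

```c
#include <stdint.h>

/* High byte = page number, low byte = offset within the page. */
static uint8_t page_of(uint16_t addr)   { return (uint8_t)(addr >> 8); }
static uint8_t offset_of(uint16_t addr) { return (uint8_t)(addr & 0xFF); }

/* Page 0 is zero page; page 1 holds the hardware stack. */
static int in_zero_page(uint16_t addr)  { return page_of(addr) == 0x00; }
static int in_stack_page(uint16_t addr) { return page_of(addr) == 0x01; }
```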
.NH 2
Top page.
.PP
When an MCS6500 is reset, it jumps indirectly via memory address
.B
FFFC.
.R
At
.B
FFFC
.R
(lowbyte) and
.B
FFFD
.R
(highbyte) there must be the address of the bootstrap subroutine.
When a break instruction (BRK) occurs or an interrupt takes place,
the MCS6500 jumps indirectly through memory address
.B
FFFE.
.R
.B
FFFE
.R
and
.B
FFFF
.R
thus, must contain the address of the interrupt routine.
This vector applies only to the maskable interrupt.
There is also a nonmaskable interrupt, which
causes the MCS6500 to jump indirectly through memory address
.B
FFFA.
.R
So the top six bytes of memory are used by the operating system
and are therefore not available to the back end.
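.PP
The vectors are ordinary 16-bit addresses stored low byte first.
A minimal sketch of how a vector is read, using a hypothetical 64K
memory array mem and illustrative constant names:

```c
#include <stdint.h>

/* Vector addresses in the top page (each holds low byte first). */
enum { NMI_VECTOR = 0xFFFA, RESET_VECTOR = 0xFFFC, IRQ_BRK_VECTOR = 0xFFFE };

/* Fetch the little-endian 16-bit address stored at addr, addr+1. */
static uint16_t read_vector(const uint8_t *mem, uint16_t addr)
{
    return (uint16_t)(mem[addr] | (mem[(uint16_t)(addr + 1)] << 8));
}
```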
.NH 2
Zero page.
.PP
This page has a special meaning in the sense that addressing
this page uses special opcodes.
Since a page consists of 256 bytes, only one byte is needed
to address zero page.
So an instruction which uses zero page occupies two bytes,
and it also takes fewer clock cycles to execute.
Zero page is also needed for the pre- and post-indexed indirect
addressing modes: the indirect address must then
reside in zero page (two consecutive bytes).
In this back end, zero page is used, for example,
to hold the local base, the second local base, the stack pointer,
etc.
.NH 2
The stack.
.PP
The stack is described in paragraph 3.5 about the MCS6500
programmable registers.
.NH
The memory addressing modes
.PP
MCS6500 memory reference instructions use direct addressing,
indexed addressing, and indirect addressing.
.NH 2
Direct addressing.
.PP
Three-byte instructions use the second and third bytes of the
object code to provide a direct 16-bit address:
therefore, 65,536 bytes of memory can be addressed directly.
The commonly used memory reference instructions also have a two-byte
object code variation, where the second byte directly addresses
one of the first 256 bytes.
.NH 2
Base page, indexed addressing.
.PP
In this case, the instruction has two bytes of object code.
The contents of either the X or Y index register are added to the
second object code byte in order to compute a memory address.
This may be illustrated as follows:
.sp 15
Base page, indexed addressing, as illustrated above, is
wraparound, which means that there is no carry into the next page.
If the sum of the index register and second object code byte contents
is more than
.B
FF,
.R
the carry bit will be discarded.
This may be illustrated as follows:
.sp 9
.NH 2
Absolute indexed addressing.
.PP
In this case, the contents of either the X or Y register are added
to a 16-bit direct address provided by the second and third bytes
of an instruction's object code.
This may be illustrated as follows:
.sp 10
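.PP
The difference between the two indexed modes is only where the carry
goes.
In the following sketch (the function names are illustrative), base
page indexing keeps the sum inside zero page, while absolute indexing
lets the carry propagate into the high byte.

```c
#include <stdint.h>

/* Base page indexed (e.g. LDA zp,X): the 8-bit sum wraps, so the
   result always stays in zero page; the carry is discarded. */
static uint16_t zp_indexed(uint8_t operand, uint8_t index)
{
    return (uint8_t)(operand + index);
}

/* Absolute indexed (e.g. LDA abs,X): the index is added to a full
   16-bit address, so a carry moves into the next page. */
static uint16_t abs_indexed(uint16_t operand, uint8_t index)
{
    return (uint16_t)(operand + index);
}
```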
.NH 2
Indirect addressing.
.PP
Instructions that use simple indirect addressing have three bytes of
object code; on the MCS6500 only the JMP instruction uses this mode.
The second and third object code bytes provide a 16-bit address;
therefore, the indirect address can be located anywhere in
memory.
This is straightforward indirect addressing.
.NH 3
Pre-indexed indirect addressing.
.PP
In this case, the object code consists of two bytes and the
second object code byte provides an 8-bit address.
Instructions that use pre-indexed indirect addressing add the contents
of the X index register and the second object code byte to access
a memory location in the first 256 bytes of memory, where the
indirect address will be found:
.sp 18
When using pre-indexed indirect addressing, once again wraparound
addition is used, which means that when the X index register contents
are added to the second object code byte, any carry will be discarded.
Note that only the X index register can be used with pre-indexed
addressing.
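.PP
The steps of pre-indexed indirect addressing can be sketched as
follows, using a hypothetical 64K memory array mem.
The sketch assumes the usual MCS6500 behaviour that the high byte of
the pointer is also fetched inside zero page.

```c
#include <stdint.h>

/* Pre-indexed indirect, (zp,X): add X to the operand with wraparound,
   then read the 16-bit pointer (low byte first) from zero page.
   The high byte comes from the next zero page location, which also
   wraps from FF to 00. */
static uint16_t pre_indexed_indirect(const uint8_t *mem, uint8_t operand, uint8_t x)
{
    uint8_t ptr = (uint8_t)(operand + x);   /* carry discarded */
    uint8_t lo  = mem[ptr];
    uint8_t hi  = mem[(uint8_t)(ptr + 1)];
    return (uint16_t)(lo | (hi << 8));
}
```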
.NH 3
Post-indexed indirect addressing.
.PP
In this case, the object code consists of two bytes and the
second object code byte provides an 8-bit address.
Now the second object code byte identifies a location
in the first 256 bytes of memory where an indirect address
will be found.
The contents of the Y index register are added to this indirect
address.
This may be illustrated as follows:
.sp 18
Note that only the Y index register can be used with post-indexed
indirect addressing.
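.PP
Post-indexed indirect addressing differs in that the index is added
after the pointer has been fetched, and Y is treated as an unsigned
byte, so only offsets 0 to 255 above the pointer can be reached.
A sketch, using a hypothetical 64K memory array mem:

```c
#include <stdint.h>

/* Post-indexed indirect, (zp),Y: read a 16-bit pointer from zero page,
   then add Y as an unsigned offset.  Negative offsets are impossible,
   which is the limitation discussed in the next section. */
static uint16_t post_indexed_indirect(const uint8_t *mem, uint8_t operand, uint8_t y)
{
    uint16_t base = (uint16_t)(mem[operand] | (mem[(uint8_t)(operand + 1)] << 8));
    return (uint16_t)(base + y);
}
```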
.bp
.NH
What the CPU has and doesn't have.
.PP
Although the designers of the MCS6500 CPU family state that
there is nothing very significant about the short stack (only
256 bytes), this stack caused problems for the back end.
The designers say that a 256-byte stack usually is sufficient
for any typical microcomputer; this is only true if the stack
is used only for return addresses of the JSR (jump to
subroutine) instruction.
But since the EM machine is a stack machine and
high level languages need parameters and local variables
in their procedures and functions, this short stack
is insufficient.
So a software stack is implemented in this back end, requiring two
additional subroutines for stack handling.
These two stack handling subroutines slow down the processing time
of a program, since the stack is used heavily.
.PP
Since parameters and locals of EM procedures are addressed at an offset
from the local base of that procedure, indirect addressing
is heavily used.
Offsets are positive (for parameters) and negative (for
local variables).
As explained in the section on addressing modes, the MCS6500 has a
post-indexed indirect addressing mode.
This addressing mode can only handle positive offsets, because the
Y index register holds an unsigned 8-bit value.
This raises a problem for accessing the local variables.
I have chosen the following solution.
A second local base is introduced.
This second local base is the real local base minus
a constant BASE.
In the present situation of the back end the value of BASE
is 240.
Since Y = BASE + offset must lie between 0 and 255, offsets from
-240 up to 15 can be reached this way.
This means that there are 240 bytes reserved for local
variables to be addressed indirectly and, after the two bytes
of the saved local base, 14 bytes for the parameters.
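.PP
The arithmetic behind the second local base can be made concrete.
In the sketch below, BASE is the value 240 given above;
word_in_range is a hypothetical stand-in for the IN() test used in
the code rules (its real definition is not shown in this document),
assuming a word occupies offsets off and off+1 as in the lol rule.

```c
#include <stdint.h>

#define BASE 240   /* value of BASE given in the text */

/* Hypothetical stand-in for the IN() test: a word at EM offset off
   (bytes off and off+1 from LB) is reachable with "(LBl),y" when
   Y = BASE + off and Y + 1 both fit in an unsigned byte. */
static int word_in_range(int off)
{
    return BASE + off >= 0 && BASE + off + 1 <= 255;
}

/* Address reached by "lda (LBl),y" with LBl = LB - BASE and
   Y = BASE + off: LBl + Y equals LB + off. */
static uint16_t local_addr(uint16_t lb, int off)
{
    uint16_t lbl = (uint16_t)(lb - BASE);
    uint8_t  y   = (uint8_t)(BASE + off);
    return (uint16_t)(lbl + y);
}
```

For example, with LB = 8000 (hex), the local at offset -2 is found at
7F10 + EE = 7FFE, exactly LB - 2.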
.DS C
.B
THE CODE GENERATOR.
.R
.DE
.NH 0
Description of the machine table.
.PP
The machine description table consists of the following sections:
.IP 1.
The macro definitions.
.IP 2.
Constant definitions.
.IP 3.
Register definitions.
.IP 4.
Token definitions.
.IP 5.
Token expressions.
.IP 6.
Code rules.
.IP 7.
Move definitions.
.IP 8.
Test definitions.
.IP 9.
Stack definitions.
.NH 2
Macro definitions.
.PP
The macro definitions at the top of the table are expanded
by the preprocessor on occurrence in the rest of the table.
.NH 2
Constant definitions.
.PP
There are three constants which must be defined first.
They are:
.IP EM_WSIZE: 11
Number of bytes in a machine word.
This is the number of bytes a simple
.B
loc
.R
instruction will put on the stack.
.IP EM_PSIZE:
Number of bytes in a pointer.
This is the number of bytes a
.B
lal
.R
instruction will put on the stack.
.IP EM_BSIZE:
Number of bytes in the hole between AB and LB.
The calling sequence only saves LB on the stack so this
constant is equal to the pointer size.
.NH 2
Register definitions.
.PP
The only important register definition is the definition of
the registerpair AX.
Since the other registers of the machine (Y, PC, ST) serve
special purposes, the code generator cannot use them.
.NH 2
Token definitions
.PP
There is a fake token.
This token is put in the table, since the code generator generator
complains if it cannot find one.
.NH 2
Token expression definitions.
.PP
The token expression is also a fake one.
This token expression is put in the table, since the code generator
generator complains if it cannot find one.
.NH 2
Code rules.
.PP
The code rule section is the largest section in the table.
Its rules specify EM patterns, stack patterns, code to be generated,
etc.
The syntax is:
.IP code rule:
EM pattern '|' stack pattern '|' code '|'
stack replacement '|' EM replacement '|'
.PP
All patterns are optional; however, at least one
pattern must be present.
If the EM pattern is missing the rule becomes a rewriting
rule or a
.B
coercion
.R
to be used when code generation cannot continue because of an
invalid stack pattern.
The code rules are preceded by the word CODE:.
.NH 3
The EM pattern.
.PP
The EM pattern consists of a list of EM mnemonics followed by
a boolean expression. Examples:
.sp 1
.br
.B
loe
.R
.sp 1
will match a single
.B
loe
.R
instruction,
.sp 1
.br
.B
loc loc cif
.R
$1==2 && $2==8
.sp 1
is a pattern that will match
.sp 1
.br
.B
loc
.R
2
.br
.B
loc
.R
8
.br
.B
cif
.R
.sp 1
and
.sp 1
.br
.B
lol
inc
stl
.R
$1==$3
.sp 1
will match for example
.sp 1
.br
.B
lol
.R
6
.br
.B
inc
.R
.br
.B
stl
.R
6
.sp 1
A missing boolean expression evaluates to TRUE.
.PP
The code generator always matches the longest EM pattern.
If two patterns of the same length match, the first one in the table
is chosen; all patterns of length three or more
are considered to be of the same length.
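.PP
This selection policy can be sketched as follows.
The Pattern type and choose_rule are hypothetical: they hold at most
four mnemonics per pattern, treat every boolean expression as TRUE,
and only illustrate the longest-match and first-in-table rule with
lengths capped at three.

```c
#include <string.h>

/* Simplified rule table entry: only the mnemonics of an EM pattern. */
typedef struct {
    const char *ops[4];
    int n;
} Pattern;

/* Does pattern p match the start of the EM instruction stream in? */
static int pattern_matches(const Pattern *p, const char **in, int nin)
{
    if (p->n > nin)
        return 0;
    for (int i = 0; i < p->n; i++)
        if (strcmp(p->ops[i], in[i]) != 0)
            return 0;
    return 1;
}

/* Longest match wins, lengths >= 3 count as 3, and ties go to the
   rule that comes first in the table.  Returns -1 if none matches. */
static int choose_rule(const Pattern *table, int ntable, const char **in, int nin)
{
    int best = -1, best_len = 0;
    for (int i = 0; i < ntable; i++) {
        if (!pattern_matches(&table[i], in, nin))
            continue;
        int len = table[i].n < 3 ? table[i].n : 3;
        if (len > best_len) {       /* strict '>' keeps the earlier rule */
            best = i;
            best_len = len;
        }
    }
    return best;
}
```

With a table holding lol, then lol lol adi, then lol lol adi stl, the
input lol lol adi stl selects the second rule, since both long rules
count as length three and the second one comes first.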
.NH 3
The stack pattern.
.PP
The only stack pattern that can occur is R16, which means that the
registerpair AX contains the word on top of the stack.
If this is not the case, a coercion occurs.
This coercion generates a "jsr Pop", which means that the top
of the stack is popped and stored in the registerpair AX.
.NH 3
The code part.
.PP
The code part consists of three parts, stack cleanup, register
allocation, and code to be generated.
All of these may be omitted.
.NH 4
Stack cleanup.
.PP
When generating something like a branch instruction it might be
needed to empty the fake stack, that is, remove the AX registerpair.
This is done by the instruction remove(ALL).
.NH 4
Register allocation.
.PP
If the machine code to be generated uses the registerpair AX,
this is signaled to the code generator by the allocate(R16)
instruction.
If the registerpair AX resides on the fake stack, this will result
in a "jsr Push", which means that the registerpair AX is pushed on
the stack and will be free for further use.
If registerpair AX is not on the fake stack nothing happens.
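.PP
The interplay between coercions and allocate(R16) can be modelled
with a single flag that records whether AX holds the top of the stack.
This is a toy sketch with invented names (Gen, take_r16, allocate_r16,
push_ax, count_emitted), not the code generator's actual implementation.

```c
#include <string.h>

/* Toy model of the 6500 fake stack: the only item it can hold is
   the register pair AX. */
typedef struct {
    int ax_on_fake_stack;      /* 1 if AX holds the top stack word */
    const char *code[16];      /* emitted instructions             */
    int ncode;
} Gen;

static void emit(Gen *g, const char *line)
{
    if (g->ncode < 16)
        g->code[g->ncode++] = line;
}

/* Stack pattern R16: coerce with "jsr Pop" if AX is not on the fake
   stack; the rule then consumes the token. */
static void take_r16(Gen *g)
{
    if (!g->ax_on_fake_stack)
        emit(g, "jsr Pop");
    g->ax_on_fake_stack = 0;
}

/* allocate(R16): spill AX with "jsr Push" if it is still in use. */
static void allocate_r16(Gen *g)
{
    if (g->ax_on_fake_stack)
        emit(g, "jsr Push");
    g->ax_on_fake_stack = 0;
}

/* Stack replacement %[a] or %[1]: AX is on the fake stack again. */
static void push_ax(Gen *g)
{
    g->ax_on_fake_stack = 1;
}

/* Count how often a given instruction was emitted. */
static int count_emitted(const Gen *g, const char *line)
{
    int n = 0;
    for (int i = 0; i < g->ncode; i++)
        if (strcmp(g->code[i], line) == 0)
            n++;
    return n;
}
```

For example, two lol instructions in a row leave the first word in AX,
so the allocate(R16) of the second one spills it with "jsr Push".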
.NH 4
Code to be generated.
.PP
Code to be generated is specified as a list of items of the following
kind:
.IP 1)
A string in double quotes("This is a string").
This is copied to the codefile and a newline ('\n') is appended.
Inside the string all normal C string conventions are allowed,
and substitutions can be made of the following sorts.
.RS
.IP a)
$1, $2 etc. These are the operands of the corresponding EM
instructions and are printed according to their type.
To put a real '$' inside the string it must be doubled ('$$').
.IP b)
%[1], %[2.reg], %[b.1] etc. These have their obvious meaning.
If they describe a complete token (%[1]) the printformat for
the token is used.
If they stand for a basic term in an expression they will be
printed according to their type.
To put a real '%' inside the string it must be doubled ('%%').
.IP c)
%( arbitrary expression %). This allows inclusion of arbitrary
expressions inside strings.
This is not needed very often, so the awkward notation
is not too bad.
Note that %(%[1]%) is equivalent to %[1].
.RE
.NH 3
Stack replacement.
.PP
The stack replacement is a possibly empty list of items to be
pushed on the fake stack.
Three things can occur:
.IP 1)
%[1] is used if the registerpair AX was on the fake stack and is
to be pushed back onto it.
.IP 2)
%[a] is used if the registerpair AX is allocated with allocate(R16)
and is to be pushed onto the fake stack.
.IP 3)
It can also be empty.
.NH 3
EM replacement.
.PP
In exceptional cases it might be useful to leave part of an EM
pattern undone.
For example, a
.B
sdl
.R
instruction might be split into two
.B
stl
.R
instructions when there is no 4-byte quantity on the stack.
The EM replacement part allows one to express this.
Example:
.sp 1
.br
.B
stl
.R
$1
.B
stl
.R
$1+2
.sp 1
The instructions are inserted in the stream so they can match
the first part of a pattern in the next step.
Note that since the code generator traverses the EM instructions
in a strict linear fashion, it is impossible to let the EM
replacement match later parts of a pattern.
So if there is a pattern
.sp 1
.br
.B
loc
stl
.R
$1==0
.sp 1
and the input is
.sp 1
.br
.B
loc
.R
0
.B
sdl
.R
4
.sp 1
the
.B
loc
.R
0
will be processed first, then the
.B
sdl
.R
might be split into two
.B
stl
.R
instructions, but the pattern cannot match now.
.NH 2
Move definitions.
.PP
This definition is a fake. This definition is put in the
table, since the code generator generator complains if it
cannot find one.
.NH 2
Test definitions.
.PP
Test definitions aren't used by the table.
.NH 2
Stack definitions.
.PP
When the generator has to push the registerpair AX, it must
know how to do so.
The machine code to be generated is defined here.
.NH 1
Some remarks.
.PP
The above description of the machine table is
a description of the table for the MCS6500.
It uses only a part of the possibilities which the code generator
generator offers.
For a more precise and detailed description see [2].
.DS C
.B
THE BACK END TABLE.
.R
.DE
.NH 0
Introduction.
.PP
The code rules are divided into 15 groups.
These groups are:
.IP 1.
Load instructions.
.IP 2.
Store instructions.
.IP 3.
Integer arithmetic instructions.
.IP 4.
Unsigned arithmetic instructions.
.IP 5.
Floating point arithmetic instructions.
.IP 6.
Pointer arithmetic instructions.
.IP 7.
Increment, decrement and zero instructions.
.IP 8.
Convert instructions.
.IP 9.
Logical instructions.
.IP 10.
Set manipulation instructions.
.IP 11.
Array instructions.
.IP 12.
Compare instructions.
.IP 13.
Branch instructions.
.IP 14.
Procedure call instructions.
.IP 15.
Miscellaneous instructions.
.PP
From each of these groups one or two typical EM patterns are explained
in the next paragraphs.
Comments are placed between /* and */ (/* This is a comment */).
.NH
The instructions.
.NH 2
The load instructions.
.PP
In this group a typical instruction is
.B
lol.
.R
A
.B
lol
.R
instruction pushes the word at local base + offset, where offset
is the instruction's argument, onto the stack.
Since the MCS6500 can only add offsets of 0 to 255 bytes in
post-indexed indirect addressing, as explained in the section on
memory addressing modes, two code rules are needed in the
table:
one which can offset directly and one which must explicitly
calculate the address of the local.
.NH 3
The lol instruction with indirect offsetting.
.PP
In this case an indirect load, offset from the second local base,
is possible.
The table content is:
.sp 1
.br
.B
lol
.R
IN($1) | |
.br
allocate(R16)	/* allocate registerpair AX */
.br
"ldy #BASE+$1"	/* load Y with the offset from the second
.br
					      local base */
.br
"lda (LBl),y"	/* load indirect the lowbyte of the word */
.br
"tax"		/* move register A to register X */
.br
"iny"		/* increment register Y (offset) */
.br
"lda (LBl),y"	/* load indirect the highbyte of the word */
.br
| %[a] | |	/* push the word onto the fake stack */
.NH 3
The lol instruction whose offset is too big.
.PP
In this case, the library subroutine "Lol" is used.
This subroutine expects the offset in registerpair AX, then
calculates the address of the local or parameter, and loads
it into registerpair AX.
The table content is:
.sp 1
.br
.B
lol
.R
| |
.br
allocate(R16)	/* allocate registerpair AX */
.br
"lda #[$1].h"	/* load highbyte of offset into register A */
.br
"ldx #[$1].l"	/* load lowbyte of offset into register X */
.br
"jsr Lol"	/* perform the subroutine */
.br
| %[a] | |	/* push word onto the fake stack */
.NH 2
The store instructions.
.PP
In this group a typical instruction is
.B
stl.
.R
A
.B
stl
.R
instruction pops a word from the stack and stores it in the word
at local base + offset, where offset is the instruction's argument.
Here, too, two code rules are needed in the table because
of the offset limits.
.NH 3
The stl instruction with indirect offsetting.
.PP
In this case an indirect store, offset from the second local
base, is possible.
The table content is:
.sp 1
.br
.B
stl
.R
IN($1) | R16 |	/* expect registerpair AX on top of the
.br
							fake stack */
.br
"ldy #BASE+1+$1"  /* load Y with the offset from the
.br
						second local base */
.br
"sta (LBl),y"	/* store the highbyte of the word from A */
.br
"txa"		/* move register X to register A */
.br
"dey"		/* decrement offset */
.br
"sta (LBl),y"	/* store the lowbyte of the word from A */
.br
| | |
.NH 3
The stl instruction whose offset is too big.
.PP
In this case the library subroutine 'Stl' is used.
This subroutine expects the offset in registerpair AX, then
calculates the address, pops the word and stores it there.
The table content is:
.sp 1
.br
.B
stl
.R
| |
.br
allocate(R16)	/* allocate registerpair AX */
.br
"lda #[$1].h"	/* load highbyte of offset in register A */
.br
"ldx #[$1].l"	/* load lowbyte of offset in register X */
.br
"jsr Stl"	/* perform the subroutine */
.br
| | |
.NH 2
Integer arithmetic instructions.
.PP
In this group typical instructions are
.B
adi
.R
and
.B
mli.
.R
These instructions, in this table, are implemented for 2-byte
and 4-byte integers.
The only arithmetic instructions available on the MCS6500 are
ADC (add with carry) and SBC (subtract with not(carry)).
Not(carry) here means that in a subtraction the complement of the
carry acts as a borrow: SBC computes A - M - (1 - C).
The absence of multiply and divide instructions forces the
use of subroutines to handle these cases.
Because there are no registers left to hold the operands of
multiplication and division, zero page is used here.
The 4-byte integer arithmetic is implemented because C has
the integer type long.
A user is free to use the type long, but will pay for it in performance.
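.PP
The following sketch models the binary (non-decimal) behaviour of
the two instructions, ignoring the overflow and sign flags; adc8 and
sbc8 are hypothetical names.
It shows that SBC with the carry set performs a plain subtraction,
which is why the carry is normally set (SEC) before a subtraction,
just as it is cleared (CLC) before an addition.

```c
#include <stdint.h>

/* Binary-mode ADC: A + M + C; the carry out is bit 8 of the sum. */
static uint8_t adc8(uint8_t a, uint8_t m, int *carry)
{
    unsigned sum = (unsigned)a + m + (*carry ? 1u : 0u);
    *carry = sum > 0xFF;
    return (uint8_t)sum;
}

/* Binary-mode SBC: A - M - not(C).  The carry acts as an inverted
   borrow: afterwards it is set when no borrow occurred. */
static uint8_t sbc8(uint8_t a, uint8_t m, int *carry)
{
    int diff = (int)a - (int)m - (*carry ? 0 : 1);
    *carry = diff >= 0;
    return (uint8_t)diff;
}
```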
.NH 3
The adi instruction.
.PP
In case of the
.B
adi
.R
2 (and
.B
sbi
.R
2) instruction there are many EM
patterns, so that the instruction can be performed in line in
most cases.
For the worst case there exists a subroutine in the library
which deals with the EM instruction.
In case of a
.B
adi
.R
4 (or
.B
sbi
.R
4) there is only a subroutine to deal with it.
A table content is:
.sp 1
.br
.B
lol lol adi
.R
(IN($1) && IN($2) && $3==2) | | /* is it in range */
.br
allocate(R16)	/* allocate registerpair AX */
.br
"ldy #BASE+$1+1" /* load Y with offset for first operand */
.br
"lda (LBl),y"	/* load indirect highbyte first operand */
.br
"pha"		/* save highbyte first operand on hard_stack */
.br
"dey"		/* decrement offset first operand */
.br
"lda (LBl),y"	/* load indirect lowbyte first operand */
.br
"ldy #BASE+$2"	/* load Y with offset for second operand */
.br
"clc"		/* clear carry for addition */
.br
"adc (LBl),y"	/* add the lowbytes of the operands */
.br
"tax"		/* keep lowbyte of result in X */
.br
"iny"		/* increment offset second operand */
.br
"pla"		/* get highbyte first operand */
.br
"adc (LBl),y"	/* add the highbytes of the operands */
.br
| %[a] | |	/* push the result onto the fake stack */
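.PP
The byte-wise addition in this rule can be checked in C.
The sketch below, a hypothetical helper named add16_bytewise, mirrors
the clc/adc/adc sequence; because the resulting bits are the same for
signed and unsigned operands, it also illustrates why adu can simply
be rewritten as adi.

```c
#include <stdint.h>

/* Mirror of the "lol lol adi" rule: add the low bytes with the carry
   cleared (clc), then add the high bytes plus that carry. */
static uint16_t add16_bytewise(uint16_t a, uint16_t b)
{
    unsigned lo = (a & 0xFFu) + (b & 0xFFu);    /* clc ; adc lowbytes */
    unsigned carry = lo >> 8;                   /* carry out of bit 7 */
    unsigned hi = (a >> 8) + (b >> 8) + carry;  /* adc highbytes      */
    return (uint16_t)(((hi & 0xFFu) << 8) | (lo & 0xFFu));
}
```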
.NH 3
The mli instruction.
.PP
The
.B
mli
.R
2 instruction mostly uses the subroutine 'Mlinp'.
This subroutine expects the multiplicand in zero page
at locations ARTH, ARTH+1, while the multiplier is in zero
page locations ARTH+2, ARTH+3.
For a description of the algorithms used for multiplication and
division, see [3].
A table content is:
.sp  1
.br
.B
lol lol mli
.R
(IN($1) && IN($2) && $3==2) | |
.br
allocate(R16)	/* allocate registerpair AX */
.br
"ldy #BASE+$1"	/* load Y with offset of multiplicand */
.br
"lda (LBl),y"	/* load indirect lowbyte of multiplicand */
.br
"sta ARTH"	/* store lowbyte in zero page */
.br
"iny"		/* increment offset of multiplicand */
.br
"lda (LBl),y"	/* load indirect highbyte of multiplicand */
.br
"sta ARTH+1"	/* store highbyte in zero page */
.br
"ldy #BASE+$2"	/* load Y with offset of multiplier */
.br
"lda (LBl),y"	/* load indirect lowbyte of multiplier */
.br
"sta ARTH+2"	/* store lowbyte in zero page */
.br
"iny"		/* increment offset of multiplier */
.br
"lda (LBl),y"	/* load indirect highbyte of multiplier */
.br
"sta ARTH+3"	/* store highbyte in zero page */
.br
"jsr Mlinp"	/* perform the multiply */
.br
| %[a] | |	/* push result onto fake stack */
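.PP
Without a multiply instruction, a 16-bit product is usually formed by
shift-and-add.
The sketch below shows that common method; the text refers to [3] for
the algorithms actually used, so it should not be read as the code of
Mlinp itself.
Keeping only the low 16 bits of the product gives the same result for
signed and unsigned operands.

```c
#include <stdint.h>

/* Shift-and-add multiplication of two 16-bit words, keeping the low
   16 bits of the product. */
static uint16_t mul16(uint16_t multiplicand, uint16_t multiplier)
{
    uint16_t product = 0;
    while (multiplier != 0) {
        if (multiplier & 1)                 /* low bit set: add        */
            product = (uint16_t)(product + multiplicand);
        multiplicand = (uint16_t)(multiplicand << 1);  /* next weight */
        multiplier >>= 1;
    }
    return product;
}
```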
.NH 2
The unsigned arithmetic instructions.
.PP
Since unsigned addition and subtraction are performed in the same way
as signed addition and subtraction, these cases are dealt with by
an EM replacement.
For multiplication and division there are special subroutines.
.NH 3
Unsigned addition.
.PP
This is an example of the EM replacement strategy.
.sp 1
.br
.B
lol lol adu
.R
	| | | |
.B
lol
.R
$1
.B
lol
.R
$2
.B
adi
.R
$3 |
.NH 2
Floating point arithmetic.
.PP
Floating point arithmetic isn't implemented in this table.
.NH 2
Pointer arithmetic instructions.
.PP
A typical pointer arithmetic instruction is
.B
adp
.R
2.
This instruction adds an offset and a pointer.
A table content is:
.sp 1
.br
.B
adp
.R
	| | | |
.B
loc
.R
$1
.B
adi
.R
2 |
.NH 2
Increment, decrement and zero instructions.
.PP
In this group a typical instruction is
.B
inl,
.R
which increments a local or parameter.
The MCS6500 doesn't have an instruction to increment the
accumulator A, so the 'ADC' instruction must be used.
A table content is:
.sp 1
.br
.B
inl
.R
IN($1) | |
.br
allocate(R16)	/* allocate registerpair AX */
.br
"ldy #BASE+$1"	/* load Y with offset of the local */
.br
"clc"		/* clear carry for addition */
.br
"lda (LBl),y"	/* load indirect lowbyte of local */
.br
"adc #1"	/* increment lowbyte */
.br
"sta (LBl),y"	/* restore indirect the incremented lowbyte */
.br
"bcc 1f"	/* if carry is clear then ready */
.br
"iny"		/* increment offset of local */
.br
"lda (LBl),y"	/* load indirect highbyte of local */
.br
"adc #0"	/* add carry to highbyte */
.br
"sta (LBl),y\\n1:"  /* restore indirect the highbyte */
.PP
If the offset of the local or parameter is too big, the
local or parameter is first fetched, then incremented, and then
restored.
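.PP
The in-line inl code shown earlier does only half the work in the
common case, since the high byte is only read and written when the
low byte overflows.
A C sketch of the same idea (inc_word is a hypothetical name):

```c
#include <stdint.h>

/* Mirror of the in-line inl code: add 1 to the low byte and touch
   the high byte only when a carry comes out ("bcc 1f" skips it). */
static void inc_word(uint8_t *lo, uint8_t *hi)
{
    *lo = (uint8_t)(*lo + 1);       /* clc ; adc #1                   */
    if (*lo == 0)                   /* carry set only when FF -> 00   */
        *hi = (uint8_t)(*hi + 1);   /* adc #0 adds that carry         */
}
```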
.NH 2
Convert instructions.
.PP
In this case there are two convert instructions
which really do something.
One of them is in line code, and deals with the extension of
a character (1-byte) to an integer.
The other one is a subroutine which handles the conversion
between 2-byte integers and 4-byte integers.
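.PP
Extending a character to an integer means copying bit 7 of the low
byte into every bit of the high byte.
A minimal C sketch (sign_extend_byte is a hypothetical name) of what
the in-line code in the next subsection computes:

```c
#include <stdint.h>

/* Sign-extend a 1-byte value to a 2-byte word: the high byte becomes
   FF when bit 7 of the low byte is set ("bpl" not taken), 00 otherwise. */
static uint16_t sign_extend_byte(uint8_t low)
{
    uint8_t high = (low & 0x80) ? 0xFF : 0x00;
    return (uint16_t)((high << 8) | low);
}
```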
.NH 3
The in line conversion.
.PP
The table content is:
.sp 1
.br
.B
loc loc cii
.R
$1==1 && $2==2 | R16 |
.br
"txa"		/* see if sign extension is needed */
.br
"bpl 1f"	/* there is no need for sign extension */
.br
"lda #0FFh"	/* sign extension here */
.br
"bne 2f"	/* conversion ready */
.br
"1: lda #0\\n2:"	/* no sign extension here */
.NH 2
Logical instructions.
.PP
A typical instruction in this group is the logical
.B
and
.R
on two 2-byte words.
The logical
.B
and
.R
on groups of more than two bytes (max 254)
is also possible and uses a library subroutine.
.NH 3
The logical and on 2-byte groups.
.PP
The table content is:
.sp 1
.br
.B
and
.R
$1==2 | R16 |	/* one group must be on the fake stack */
.br
"sta ARTH+1"	/* temporary save of first group highbyte */
.br
"stx ARTH"	/* temporary save of first group lowbyte */
.br
"jsr Pop"	/* pop second group from the stack */
.br
"and ARTH+1"	/* logical and on highbytes */
.br
"pha"		/* temporary save the result's highbyte */
.br
"txa"		/* logical and can only be done in A */
.br
"and ARTH"	/* logical and on lowbytes */
.br
"tax"		/* restore results lowbyte */
.br
"pla"		/* restore results highbyte */
.br
| %[1] | |	/* push result onto fake stack */
.NH 2
Set manipulation instructions.
.PP
A typical EM pattern in this group is
.B
loc inn zeq
.R
$1>0 && $1<16 && $2==2.
This EM pattern works on sets of 16 bits.
Sets can be bigger (max 256 bytes = 2048 bits), but then a
library routine is used instead of in line code.
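.PP
The in-line code for 16-bit sets tests bit n (the $1 of the pattern)
by shifting the word right n+1 times, so that bit n ends up in the
carry.
A sketch of that technique in C (set_bit_via_shifts is a hypothetical
name; hi and lo stand for the registers A and X):

```c
#include <stdint.h>

/* Test bit n (0..15) of a 16-bit set held as hi:lo by shifting the
   word right n+1 times, as "lsr a" on the high byte followed by
   "ror ARTH" on the low byte would.  The last bit shifted out into
   the carry is bit n. */
static int set_bit_via_shifts(uint8_t hi, uint8_t lo, int n)
{
    int carry = 0;
    for (int y = n + 1; y != 0; y--) {            /* ldy #n+1 ... dey ; bne */
        int out_of_hi = hi & 1;                   /* lsr a: bit 0 -> carry   */
        hi = (uint8_t)(hi >> 1);
        carry = lo & 1;                           /* ror: bit 0 -> carry     */
        lo = (uint8_t)((lo >> 1) | (out_of_hi << 7)); /* old carry -> bit 7 */
    }
    return carry;
}
```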
The table content of the above EM pattern is:
.sp 1
.br
.B
loc inn zeq
.R
$1>0 && $1<16 && $2==2 | R16 |
.br
"ldy #$1+1"	/* load Y with bit number + 1 */
.br
"stx ARTH"	/* cannot rotate X, so use zero page */
.br
"1: lsr a"	/* right shift A */
.br
"ror ARTH"	/* right rotate zero page location */
.br
"dey"		/* decrement Y */
.br
"bne 1b"	/* shift $1+1 times, leaving bit $1 in carry */
.br
"bcc $3"	/* no carry, so bit is zero: branch to zeq label */
.NH 2
Array instructions.
.PP
In this group a typical EM pattern is
.B
lae lar
.R
defined(rom(1,3)) | | | |
.B
lae
.R
$1
.B
aar
.R
$2
.B
loi
.R
rom(1,3).
This pattern uses the
.B
aar
.R
instruction, which is part of a typical EM pattern:
.sp 1
.br
.B
lae aar
.R
$2==2 && rom(1,3)==2 && rom(1,1)==0 | R16 | /* registerpair AX contains
the index in the array */
.br
"pha"		/* save highbyte of index */
.br
"txa"		/* move lowbyte of index to A */
.br
"asl a"		/* shift left lowbyte == 2 times lowbyte */
.br
"tax"		/* restore lowbyte */
.br
"pla"		/* restore highbyte */
.br
"rol a"		/* rotate left highbyte == 2 times highbyte */
.br
| %[1] | adi 2 | /* push new index, add to lowerbound array */
.NH 2
Compare instructions.
.PP
In this group all EM patterns are performed by calling
a subroutine.
Subroutines are used here because comparison is only
possible byte by byte.
This means a lot of code, and since comparisons are used frequently,
a lot of in line code would be generated, thus reducing
the space left for the software stack.
These subroutines can be found in the library.
.NH 2
Branch instructions.
.PP
A typical branch instruction is
.B
beq.
.R
The table content for it is:
.sp 1
.br
.B
beq
.R
| R16 |
.br
"sta BRANCH+1"	/* save highbyte second operand in zero page */
.br
"stx BRANCH"	/* save lowbyte second operand in zero page */
.br
"jsr Pop"	/* pop the first operand */
.br
"cmp BRANCH+1" 	/* compare the highbytes */
.br
"bne 1f"	/* they are not equal, so go on */
.br
"cpx BRANCH"	/* compare the lowbytes */
.br
"beq $1\\n1:"	/* lowbytes are also equal, so branch */
.PP
Another typical instruction in this group is
.B
zeq.
.R
The table content is:
.sp 1
.br
.B
zeq
.R
| R16 |
.br
"tay"		/* move A to Y for setting testbits */
.br
"bne 1f"	/* highbyte is not zero, so do not branch */
.br
"txa"		/* move X to A for setting testbits */
.br
"beq $1\\n1:"	/* lowbyte also zero, thus branch */
.NH 2
Procedure call instructions.
.PP
In this group, the code generated for one instruction might seem a little
awkward.
It is the EM instruction
.B
cai
.R
which generates a 'jsr Indir'.
This is because the MCS6500 has no indirect jump-to-subroutine
instruction.
The only solution is to store the address in zero page, and then
do a 'jsr' to a known label.
At this label there must be an indirect jump instruction, which
performs a jump to the address stored in zero page.
In this case the label is Indir, and the address is stored in
zero page at the addresses ADDR, ADDR+1.
The table content is:
.sp 1
.br
.B
cai
.R
| R16 |
.br
"stx ADDR"	/* store lowbyte of address in zero page */
.br
"sta ADDR+1"	/* store highbyte of address in zero page */
.br
"jsr Indir"	/* use the indirect jump */
.br
| | |
.NH 2
Miscellaneous instructions.
.PP
In this group, as the name suggests, there is no
typical EM instruction or EM pattern.
Most of the MCS6500 code to be generated uses a library subroutine
or is straightforward.
.DS C
.B
PERFORMANCE.
.R
.DE
.NH 0
Introduction.
.PP
To measure the performance of the back end table, some timing
tests were done.
What to time?
In this case, the execution times of several Pascal statements
were measured.
Statements in C which have a Pascal equivalent were timed as well.
The statements were timed as follows.
A test program was written which executes two
nested for loops from 1 to 1,000.
The statement to be tested is placed within these for loops,
so the statement is executed 1,000,000 times.
Then the same program is executed without the test statement.
The time difference between the two executions is the time
necessary to execute the test statement 1,000,000 times.
The time to execute the test statement once is thus the
time difference divided by 1,000,000.
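.PP
The arithmetic is simple but easy to get wrong by a factor of a
million.
With 1,000,000 executions, a difference of d seconds corresponds to
d microseconds per statement, as the following sketch shows (the
function name and the example run times are made up for illustration):

```c
/* Time per execution of the test statement, in microseconds.
   with_s and without_s are the run times in seconds of the test
   program with and without the statement; iterations is the number
   of times the statement was executed (1,000,000 here). */
static double per_statement_us(double with_s, double without_s, long iterations)
{
    return (with_s - without_s) * 1e6 / (double)iterations;
}
```

For instance, hypothetical runs of 30.0 s and 10.0 s would give
20.0 microseconds per statement.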
.NH
Testing Pascal statements.
.PP
The following statements were tested.
.IP 1)
int1 := 0;
.IP 2)
int1 := int2 - 1;
.IP 3)
int1 := int1 + 1;
.IP 4)
int1 := icon1 - icon2;
.IP 5)
int1 := icon2 div icon1;
.IP 6)
int1 := int2 * int3;
.IP 7)
bool := (int1 < 0);
.IP 8)
bool := (int1 < 3);
.IP 9)
bool := ((int1 > 3) or (int1 < 3));
.IP 10)
case int1 of 1: bool := false; 2: bool := true end;
.IP 11)
if int1 = 0 then int2 := 3;
.IP 12)
while int1 > 0 do int1 := int1 - 1;
.IP 13)
m := a[k];
.IP 14)
let2 := ['a'..'c'];
.IP 15)
P3(x);
.IP 16)
dum := F3(x);
.IP 17)
s.overhead := 5400;
.IP 18)
with s do overhead := 5400;
.PP
These statements were tested in a procedure called test:
.sp 1
.br
procedure test;
.br
var i, j, ... : integer;
.br
    bool : boolean;
.br
    let2 : set of char;
.br
begin
.br
    for i := 1 to 1000 do
.br
	for j := 1 to 1000 do
.br
	    STATEMENT
.br
end;
.sp 1
.PP
STATEMENT is one of the statements shown above, or the
empty statement.
Any variables the statement uses are initialized, where necessary,
before the outer for loop.
For the procedure call, statement 15, a dummy procedure with an
empty body is declared.
For the function call, statement 16, the function simply
returns its argument.
For the timing of the C statements a similar test program was
written:
.sp 1
.br
main()
.br
{
.br
    int i, j, ...;
.br
    for (i = 1; i <= 1000; i++)
.br
	for (j = 1; j <= 1000; j++)
.br
	    STATEMENT
.br
}
.sp 1
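A modern equivalent of this harness can be sketched in portable C with the standard clock() function. This is a sketch under stated assumptions, not the program used for the measurements: the statement under test is passed as a function pointer (test_stmt and count_stmt are hypothetical), which adds the same call overhead to both runs, and the volatile qualifiers keep an optimizing compiler from removing the loops.

```c
#include <time.h>

static volatile long executions;  /* counts calls of count_stmt */
static volatile int int1;         /* variable assigned by the test statement */

/* Hypothetical test statement: int1 = 0; */
static void test_stmt(void) { int1 = 0; }

/* The empty statement, for the reference run. */
static void empty_stmt(void) { }

/* Statement that only counts how often it is executed. */
static void count_stmt(void) { executions++; }

/* Runs stmt() in two nested loops of 1000 iterations each and returns
   the processor time used, in seconds. */
static double time_nested(void (*stmt)(void))
{
    clock_t start = clock();
    for (int i = 1; i <= 1000; i++)
        for (int j = 1; j <= 1000; j++)
            stmt();
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

The time per statement is then time_nested(test_stmt) minus time_nested(empty_stmt); since the statement runs 10^6 times, that difference in seconds is again the time per statement in microseconds.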
.NH
The results.
.PP
The following tables give the results of the time measurements.
Times are in microseconds (10^-6 s).
Some statements appear twice in the tables.
In the second row of such a pair, an array of 200 integers was declared
before the variable to be tested, so that this variable cannot
be accessed by indirect addressing from the second local base.
This results in a longer execution time for the statement
under test.
The column 68000 contains the times measured on a Bleasdale
M68000-based computer, the column pdp the times measured on a
DEC pdp11/44, and the column 6500 the times measured on a BBC
microcomputer.
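Why the extra array matters can be illustrated with the indirect indexed addressing mode of the MCS6500, as in lda (zp),Y, which adds the 8-bit Y register (0 to 255) to a 16-bit pointer held in zero page. Assuming that the second local base is such a pointer and that an integer occupies two bytes, a variable declared after 200 integers lies at least 400 bytes from the base, out of reach of Y. The C check below is a hypothetical illustration of that limit, not code from the back end.

```c
#include <stdbool.h>

/* Largest offset the 8-bit Y register can add in (zp),Y addressing. */
#define MAX_Y_OFFSET 255u

/* True if every byte of a variable of `size` bytes (size >= 1), starting
   `offset` bytes above the base pointer, can be reached with (zp),Y. */
static bool reachable_by_y(unsigned offset, unsigned size)
{
    return offset + size - 1u <= MAX_Y_OFFSET;
}
```

With the 200-integer array in front, the variable starts at offset 400 or more, so the check fails and the generated code presumably has to take a slower route, which matches the larger times in the second rows.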
.bp
.TS
expand;
c s s s
c c c c
lw35 nw7 nw7 nw7.
Pascal timing results
statement	68000	pdp	6500
_
T{
int1 := 0;
T}	4.0	5.8	16.7
 	4.0	4.2	97.8
_
T{
int1 := int2 - 1;
T}	7.2	7.1	27.2
 	6.9	7.1	206.5
_
T{
int1 := int1 + 1;
T}	6.9	6.8	27.2
 	6.4	6.7	106.5
_
T{
int1 := icon1 + icon2;
T}	6.2	6.2	25.6
 	6.2	6.0	106.6
_
T{
int1 := icon2 div icon1;
T}	14.9	14.3	372.6
 	14.9	14.7	453.7
_
T{
int1 := int2 * int3;
T}	11.5	12.0	558.1
 	11.3	11.6	728.6
_
T{
bool := (int1 < 0);
T}	7.2	6.9	122.8
 	7.8	8.1	453.2
_
T{
bool := (int1 < 3);
T}	7.3	7.6	126.0
 	7.2	8.1	232.2
_
T{
bool := ((int1 > 3) or (int1 < 3));
T}	10.1	12.0	307.8
 	10.2	11.9	440.1
_
T{
case int1 of 1: bool := false; 2: bool := true end;
T}	18.3	17.9	165.7
_
T{
if int1 = 0 then int2 := 3;
T}	9.5	8.5	133.8
_
T{
while int1 > 0 do int1 := int1 - 1;
T}	6.9	6.9	126.0
_
T{
m := a[k];
T}	7.2	6.8	134.3
_
T{
let2 := ['a'..'c'];
T}	38.4	38.8	447.4
_
T{
P3(x);
T}	18.9	18.8	180.3
_
T{
dum := F3(x);
T}	26.8	27.1	343.3
_
T{
s.overhead := 5400;
T}	4.6	4.1	16.7
_
T{
with s do overhead := 5400;
T}	4.2	4.3	16.7
.TE
.TS
expand;
c s s s
c c c c
lw35 nw7 nw7 nw7.
C timing results
statement	68000	pdp	6500
_
T{
int1 = 0;
T}	4.1	3.6	17.2
 	4.1	4.1	97.7
_
T{
int1 = int2 - 1;
T}	6.6	6.9	27.2
 	6.1	6.5	206.4
_
T{
int1 = int1 + 1;
T}	6.4	7.3	27.2
 	6.3	6.2	206.4
_
T{
int1 = int2 * int3;
T}	11.4	12.3	522.6
 	9.6	10.1	721.2
_
T{
int1 = (int2 < 0);
T}	7.2	7.6	126.4
 	7.4	7.7	232.5
_
T{
int1 = (int2 < 3);
T}	7.0	7.5	126.0
 	7.8	7.8	232.6
_
T{
int1 = ((int2 > 3) || (int2 < 3));
T}	11.8	12.2	193.4
 	11.5	13.2	245.6
_
T{
switch (int1) { case 1: int1 = 0; break; case 2: int1 = 1; break; }
T}	28.3	29.2	164.1
_
T{
if (int1 == 0) int2 = 3;
T}	4.8	4.8	19.4
_
T{
while (int2 > 0) int2 = int2 - 1;
T}	5.8	6.0	125.9
_
T{
int2 = a[int2];
T}	4.8	5.1	192.8
_
T{
P3(int2);
T}	18.8	18.4	180.3
_
T{
int2 = F3(int2);
T}	27.0	27.2	309.4
_
T{
s.overhead = 5400;
T}	5.0	4.1	16.7
.TE
.NH
Pascal statements which don't have a C equivalent.
.PP
First, the two statements that perform an operation on constants
are left out.
They are left out because the C front end does constant folding,
while the Pascal front end does not.
So in C the statements int1 = icon1 + icon2; and int1 = icon2 / icon1;
both reduce to the assignment of a constant, since the expression is
evaluated by the front end.
The two other statements (let2 := ['a'..'c']; and
.B
with
.R
s
.B
do
.R
overhead := 5400;) are not included in the C statement timing table,
because these constructs do not exist in C.
Although the bit manipulation operators of C could be used to implement
sets, I have not done so here.
The
.B
with
.R
statement does not exist in C, and C has nothing that even
slightly resembles it.
.PP
At first sight, the tables suggest that there is not much difference
between the times for the M68000 and the pdp11/44, compared with the
times needed by the MCS6500.
To verify this impression, I calculated the correlation coefficient
between the times of the M68000 and the pdp11/44.
It turned out to be 0.997 for both the Pascal and the C
time tests.
Since the correlation coefficient is close to one and the differences
between the times are small, the two machines can be regarded as
equivalent when compared with the MCS6500.
I then plotted the times of the M68000 against those of
the MCS6500.
Taking all the times together, no correlation could be seen.
The only correlation one could see, with some effort, was in the
times for the first three Pascal statements.
The first two C statements also show a correlation, but that says
nothing, since any two points lie on a straight line.
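The correlation coefficient used here is presumably Pearson's product-moment coefficient, r = Sxy / sqrt(Sxx * Syy), where Sxy is the sum of the products of the deviations of the two series from their means, and Sxx and Syy are the sums of squared deviations. A minimal C sketch follows; to stay free of the maths library it computes the square root with Newton's method, and the function names are hypothetical.

```c
#include <stddef.h>

/* Square root by Newton's method, so no maths library is needed.
   Starting at or above the root, the iterates decrease until they
   stop changing. */
static double newton_sqrt(double v)
{
    if (v <= 0.0)
        return 0.0;
    double r = v > 1.0 ? v : 1.0;
    for (;;) {
        double next = 0.5 * (r + v / r);
        if (next >= r)
            return r;
        r = next;
    }
}

/* Pearson correlation coefficient of n paired samples (n >= 2). */
static double correlation(const double *x, const double *y, size_t n)
{
    double mx = 0.0, my = 0.0;
    for (size_t i = 0; i < n; i++) {
        mx += x[i];
        my += y[i];
    }
    mx /= (double)n;
    my /= (double)n;

    double sxy = 0.0, sxx = 0.0, syy = 0.0;
    for (size_t i = 0; i < n; i++) {
        double dx = x[i] - mx, dy = y[i] - my;
        sxy += dx * dy;
        sxx += dx * dx;
        syy += dy * dy;
    }
    return sxy / newton_sqrt(sxx * syy);
}
```

For n = 2 the coefficient is always plus or minus one, which is why a correlation between only two statements carries no information.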
.PP
The three Pascal statements
.B
case
.R
,
.B
if
.R
,
and
.B
while
.R
also have a correlation coefficient of 0.999 between their M68000
and MCS6500 times.
This is probably because the
.B
case
.R
statement uses a subroutine on both processors, while the other two
statements,
.B
if
.R
and
.B
while
.R
, generate in-line code.
The last two Pascal statements take the same time, since the front
end generates the same EM code for both.
.PP
The lack of correlation between the remaining test times arises
because, in these cases, the object code for the MCS6500 calls library
subroutines, while the other processors can handle the EM code
with in-line code.
.PP
Clearly the MCS6500 is slower: its execution times are longer and
it needs more library subroutines.
However, there is no constant factor between its execution times and
those of the other processors.
.PP
The slowdown of the MCS6500 caused by the need for a
library subroutine is illustrated by the multiplication
statement.
The MCS6500 needs a library subroutine, while the other
two processors have a machine instruction to perform the
multiply.
This results in a factor of 48.5 (Pascal, 558.1 against 11.5 microseconds
on the M68000) when the MCS6500 can access the operands indirectly.
When the MCS6500 cannot access the operands indirectly, the situation
is even worse: 728.6 against 11.3 microseconds, a factor of about 64.5.
The slight differences between the MCS6500 execution times for
Pascal statements and C statements are probably caused by the
front ends, and are thus beyond the scope of this discussion.
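The absence of a constant factor can be made concrete by dividing MCS6500 times by M68000 times for individual statements. The C sketch below uses three first-row figures from the Pascal table; the structure and function names are hypothetical, and a spread near 1 would indicate a constant factor.

```c
#include <stddef.h>

/* One row of a timing table; times in microseconds. */
struct timing {
    const char *statement;
    double t68000;
    double t6500;
};

/* Slowdown of the MCS6500 relative to the M68000 for one statement. */
static double slowdown(const struct timing *t)
{
    return t->t6500 / t->t68000;
}

/* Ratio of the largest to the smallest slowdown in a table (n >= 1). */
static double factor_spread(const struct timing *rows, size_t n)
{
    double lo = slowdown(&rows[0]), hi = lo;
    for (size_t i = 1; i < n; i++) {
        double f = slowdown(&rows[i]);
        if (f < lo)
            lo = f;
        if (f > hi)
            hi = f;
    }
    return hi / lo;
}

/* First rows of the Pascal table (operands reachable indirectly). */
static const struct timing pascal_rows[] = {
    { "int1 := 0;",            4.0,  16.7 },
    { "int1 := int2 - 1;",     7.2,  27.2 },
    { "int1 := int2 * int3;", 11.5, 558.1 },
};
```

For these three rows the factors are about 4.2, 3.8 and 48.5, so the largest is more than twelve times the smallest.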
.PP
Another timing test was done in C on the statement k = i + j + 1983.
This statement has been timed on many UNIX*
.FS
* UNIX is a Trademark of Bell Laboratories.
.FE
systems.
For a complete list see appendix A.
The slowest of these is the IBM XT, which uses an 8088 microprocessor.
The fastest is the Amdahl computer.
The following short table illustrates the performance of the
MCS6500.
.TS
c c c
c n n.
machine	short	int
IBM XT	53.4	53.4
Amdahl	0.5	0.3
MCS6500	150.2	150.2
.TE
The MCS6500 is about three times slower than the IBM XT, but about
300 times slower than the Amdahl for shorts and about 500 times slower
for ints.
The times on the IBM XT and the MCS6500 are the same for shorts and
ints because most C compilers make the types short and int the same
size on 16-bit machines.
In this project the MCS6500 is regarded as a 16-bit machine.
.NH
Length tests.
.PP
I also compiled several programs written in Pascal and C to
see whether there is a relation between the numbers of bytes of
machine code generated for the different processors.
In the tables:
.IP length: 9
The number of bytes of the source program.
.IP 68000:
The number of bytes of the a.out file for an M68000.
.IP pdp:
The number of bytes of the a.out file for a pdp11/44.
.IP 6500:
The number of bytes of the a.out file for a MCS6500.
.LP
These are the results:
.TS
c s s s
c c c c
n n n n.
Pascal programs
length	68000	pdp	6500
_
19946	14383	16090	26710
19484	20169	20190	35416
10849	10469	11464	18949
273	4221	5106	7944
1854	5807	6610	10301
.TE
.TS
c s s s
c c c c
n n n n.
C programs
length	68000	pdp	6500
_
9444	6927	8234	11559
7655	14353	18240	26251
4775	11309	15934	19910
639	6337	9660	12494
.TE
.PP
In contrast to the execution times of the test statements, the
object code file sizes show a roughly constant factor between the
machines.
After calculating the correlation coefficients, I calculated
the straight line that best fits the sizes.
.FS
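* x is the number of bytes of the a.out file for the first processor of the pair; the line gives the estimated size for the second.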
.FE
.TS
c s s
c c c
l c c.
Pascal programs
processor	corr. coef.	fitted line
_
68000-pdp	0.996
68000-6500	0.999	1.76x + 502*
pdp-6500	0.999	1.80x - 1577
.TE
.TS
c s s
c c c
l c c.
C programs
processor	corr. coef.	fitted line
_
68000-pdp	0.974
68000-6500	0.992	1.80x + 502*
pdp-6500	0.980	1.40x - 1577
.TE
.PP
As seen from the tables above, the correlation coefficients for
Pascal programs are higher than those for C programs.
Thus the lines fit Pascal programs best.
With the formula of the fitted line one can estimate
the size of the object code a program needs on the MCS6500
without having the compiler at hand.
These formulas also show that the object code
generated for the MCS6500 is about 1.4 to 1.8 times larger than for
the other processors.
Since the number of bytes in the source file heavily depends on the
programmer (how many spaces he or she uses, the size of the indentation
in structured programs, etc.), there is no correlation between the
size of the source file and the size of the object file.
The use of comments also influences the source size.
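The fitted lines can be reproduced with an ordinary least-squares fit; the report does not say which fitting method was used, so least squares is an assumption here. The C sketch below fits y = a x + b and applies the Pascal 68000-6500 line from the table, reading x as the M68000 a.out size and the result as the estimated MCS6500 size. The names are hypothetical.

```c
#include <stddef.h>

/* A straight line y = a * x + b. */
struct line {
    double a;  /* slope */
    double b;  /* intercept */
};

/* Ordinary least-squares fit of n points (n >= 2, not all x equal). */
static struct line fit_line(const double *x, const double *y, size_t n)
{
    double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
    for (size_t i = 0; i < n; i++) {
        sx += x[i];
        sy += y[i];
        sxx += x[i] * x[i];
        sxy += x[i] * y[i];
    }
    double dn = (double)n;
    struct line l;
    l.a = (dn * sxy - sx * sy) / (dn * sxx - sx * sx);
    l.b = (sy - l.a * sx) / dn;
    return l;
}

/* Estimated MCS6500 a.out size from an M68000 a.out size, using the
   fitted Pascal line 1.76x + 502 from the table above. */
static double estimate_6500_size(double size68000)
{
    return 1.76 * size68000 + 502.0;
}
```

For example, a Pascal program whose M68000 a.out file is 10000 bytes would be estimated at about 1.76 * 10000 + 502 = 18102 bytes on the MCS6500.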
.bp
.DS C
.B
SUMMARY.
.R
.DE
.NH 0
Summary
.PP
This chapter draws some final conclusions.
.PP
In spite of its simplicity, the MCS6500 is powerful enough to
implement an EM machine.
A serious deficiency of the MCS6500 is the lack of 16-bit
general-purpose registers, and especially the lack of a
16-bit stack pointer.
As pointed out before, one 16-bit register can be simulated
by a pair of 8-bit registers: the accumulator A holds
the high byte and the index register X holds the low byte
of the word.
Because there is no 16-bit stack pointer, zero page must be used to hold
the stack pointer, and two subroutines (Push and Pop) are needed for
manipulating the stack.
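The register pair and the zero-page stack pointer can be modelled in C. The sketch below is hypothetical: the zero-page address SP, the downward growth of the stack and the byte order of pushed words are assumptions for illustration, not a description of the actual Push and Pop library subroutines.

```c
#include <stdint.h>

/* A 16-bit word held in the register pair: A = high byte, X = low byte. */
struct pair {
    uint8_t a;
    uint8_t x;
};

static uint8_t mem[65536];  /* model of the 64 Kbyte address space */
#define SP 0x00F2u          /* assumed zero-page address of the stack pointer */

static uint16_t get_sp(void)
{
    return (uint16_t)(mem[SP] | (mem[SP + 1] << 8));
}

static void set_sp(uint16_t v)
{
    mem[SP] = (uint8_t)(v & 0xFF);
    mem[SP + 1] = (uint8_t)(v >> 8);
}

/* Push: lower the stack pointer by two and store X (low), then A (high). */
static void push(struct pair r)
{
    uint16_t sp = (uint16_t)(get_sp() - 2u);
    mem[sp] = r.x;
    mem[(uint16_t)(sp + 1u)] = r.a;
    set_sp(sp);
}

/* Pop: load X and A from the top of the stack and raise the pointer by two. */
static struct pair pop(void)
{
    uint16_t sp = get_sp();
    struct pair r;
    r.x = mem[sp];
    r.a = mem[(uint16_t)(sp + 1u)];
    set_sp((uint16_t)(sp + 2u));
    return r;
}
```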
.PP
As the timing tests showed, the simple instruction set of the
MCS6500 forces the use of library subroutines.
These library subroutines increase the execution time of
programs.
.PP
The sizes of the object code files show a strong correlation,
in contrast to the execution times.
With this correlation one can estimate the size of a program
that is to be run on the MCS6500.
.bp
.NH 0
.B
REFERENCES.
.R
.IP 1.
Osborne, A., Jacobson, S., and Kane, J. The MOS Technology MCS6500.
.B
An Introduction to Microcomputers,
.R
Volume II: Some Real Products (June 1977), chap. 9.
.RS
.PP
This book gives a hardware description of several real CPUs,
such as the Zilog Z80 and the MCS6500.
.RE
.IP 2.
van Staveren, H.
The table driven code generator from the Amsterdam Compiler Kit.
Vrije Universiteit, Amsterdam, (July 11, 1983).
.RS
.PP
The defining document for writing a back end table.
.RE
.IP 3.
Tanenbaum, A.S. Structured Computer Organization.
Prentice Hall. (1976).
.RS
.PP
In this book computers are described as a hierarchy of levels,
with each one performing some well-defined function.
.RE