Initial revision
This commit is contained in:
parent
4d4c8b45fb
commit
004f017550
30 changed files with 3903 additions and 0 deletions
57
doc/ego/ic/ic1
Normal file
@ -0,0 +1,57 @@
.bp
.NH
The Intermediate Code and the IC phase
.PP
In this chapter the intermediate code of the EM global optimizer
will be defined.
The 'Intermediate Code construction' phase (IC),
which builds the initial intermediate code from
EM Compact Assembly Language,
will also be described.
.NH 2
Introduction
.PP
The EM global optimizer is a multi-pass program,
hence there is a need for an intermediate code.
Usually, programs in the Amsterdam Compiler Kit use the
Compact Assembly Language format
.[~[
keizer architecture
.], section 11.2]
for this purpose.
Although this code has some convenient features,
such as being compact,
it is quite unsuitable in our case,
for a number of reasons.
First, the code lacks global information
about whole procedures or whole basic blocks.
Second, it uses identifiers ('names') to bind
defining and applied occurrences of
procedures, data labels and instruction labels.
Although this is usual in high level programming
languages, it is awkward in an intermediate code
that must be read many times.
Each pass of the optimizer would have
to incorporate an identifier look-up mechanism
to associate a defining occurrence with each
applied occurrence of an identifier.
Finally, EM programs declare blocks of bytes
rather than variables. A 'hol 6' instruction may be used to
declare three 2-byte variables.
Clearly, the optimizer wants to deal with variables, and
not with rows of bytes.
.PP
To overcome these problems, we have developed a new
intermediate code.
This code does not merely consist of the EM instructions,
but also contains global information in the
form of tables and graphs.
Before describing the intermediate code we will
first digress to outline
the problems one generally encounters
when trying to store complex data structures such as
graphs outside the program, i.e. in a file.
We trust this will enhance the
comprehensibility of the
intermediate code definition and the design and implementation
of the IC phase.
146
doc/ego/ic/ic2
Normal file
@ -0,0 +1,146 @@
.NH 2
Representation of complex data structures in a sequential file
.PP
Most programmers are quite used to dealing with
complex data structures, such as
arrays, graphs and trees.
There are some particular problems that occur
when storing such a data structure
in a sequential file.
We call data that is kept in
main memory
.UL internal ,
as opposed to
.UL external
data
that is kept in a file outside the program.
.sp
We assume that a simple data structure of a
scalar type (integer, floating point number)
has some known external representation.
An
.UL array
having elements of a scalar type can easily be represented
externally, by successively
representing its elements.
The external representation may be preceded by a
number giving the length of the array.
Now, consider a linear, singly linked list,
the elements of which look like:
.DS
record
    data: scalar_type;
    next: pointer_type;
end;
.DE
It is important to note that the "next"
fields of the elements only have a meaning within
main memory.
The field contains the address of some location in
main memory.
If a list element is written to a file by
some program,
and read by another program,
the element will be allocated at a different
address in main memory.
Hence this address value is completely
useless outside the program.
.sp
One may represent the list by ignoring these "next" fields
and storing the data items in the order they are linked.
The "next" fields are represented \fIimplicitly\fR.
When the file is read again,
the same list can be reconstructed.
In order to know where the external representation of the
list ends,
it may be useful to put the length of
the list in front of it.
.sp
Note that arrays and linear lists have the
same external representation.
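.PP
For concreteness, here is a small C sketch of this scheme
(an illustration only: the type and function names are ours,
and <stdio.h> and <stdlib.h> are assumed):
the length is written in front, the data items follow in linked order,
and the "next" fields are rebuilt when the file is read back.
.DS
#include <stdio.h>
#include <stdlib.h>

struct elem { int data; struct elem *next; };

void wr_list(FILE *fp, struct elem *head)
{
    struct elem *e;
    int n = 0;

    for (e = head; e != NULL; e = e->next) n++;
    fwrite(&n, sizeof n, 1, fp);              /* length in front */
    for (e = head; e != NULL; e = e->next)
        fwrite(&e->data, sizeof e->data, 1, fp);  /* "next" is implicit */
}

struct elem *rd_list(FILE *fp)
{
    struct elem *head = NULL, **hook = &head;
    int n;

    fread(&n, sizeof n, 1, fp);
    while (n-- > 0) {
        struct elem *e = malloc(sizeof *e);
        fread(&e->data, sizeof e->data, 1, fp);
        e->next = NULL;
        *hook = e;                            /* append: linked order preserved */
        hook = &e->next;
    }
    return head;
}
.DE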
.PP
A doubly linked, linear list,
with elements of the type:
.DS
record
    data: scalar_type;
    next,
    previous: pointer_type;
end
.DE
can be represented in precisely the same way.
Both the "next" and the "previous" fields are represented
implicitly.
.PP
Next, consider a binary tree,
the nodes of which have type:
.DS
record
    data: scalar_type;
    left,
    right: pointer_type;
end
.DE
Such a tree can be represented sequentially,
by storing its nodes in some fixed order, e.g. prefix order.
A special null data item may be used to
denote a missing left or right son.
For example, let the scalar type be integer,
and let the null item be 0.
Then the tree of fig. 3.1(a)
can be represented as in fig. 3.1(b).
.DS
                 4

         9               12

     12      3       4        6

            8  1       5    1

     Fig. 3.1(a) A binary tree


  4 9 12 0 0 3 8 0 0 1 0 0 12 4 0 5 0 0 6 1 0 0 0

     Fig. 3.1(b) Its sequential representation
.DE
We are still able to represent the pointer fields ("left"
and "right") implicitly.
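.PP
A minimal C sketch of this prefix-order scheme is shown below
(again an illustration with names of our own choosing;
the null item 0 marks a missing son, as in fig. 3.1,
so the data items themselves are assumed to be nonzero).
.DS
#include <stdio.h>
#include <stdlib.h>

struct node { int data; struct node *left, *right; };

void wr_tree(FILE *fp, struct node *t)
{
    int null = 0;

    if (t == NULL) {
        fwrite(&null, sizeof null, 1, fp);    /* missing son */
        return;
    }
    fwrite(&t->data, sizeof t->data, 1, fp);  /* prefix order: node first, */
    wr_tree(fp, t->left);                     /* then left subtree,        */
    wr_tree(fp, t->right);                    /* then right subtree        */
}

struct node *rd_tree(FILE *fp)
{
    struct node *t;
    int d;

    fread(&d, sizeof d, 1, fp);
    if (d == 0)
        return NULL;                          /* the null item */
    t = malloc(sizeof *t);
    t->data = d;
    t->left = rd_tree(fp);                    /* same fixed order as wr_tree */
    t->right = rd_tree(fp);
    return t;
}
.DE
Applied to the tree of fig. 3.1(a), wr_tree produces exactly the
sequence of fig. 3.1(b).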
.PP
Finally, consider a general
.UL graph ,
where each node has a "data" field and
pointer fields,
with no restriction on where they may point.
Now we're at the end of our tale.
There is no way to represent the pointers implicitly,
as we did with lists and trees.
In order to represent them explicitly,
we use the following scheme.
Every node gets an extra field,
containing some unique number that identifies the node.
We call this number its
.UL id.
A pointer is represented externally as the id of the node
it points to.
When reading the file we use a table that maps
an id to the address of its node.
In general this table will not be completely filled in
until we have read the entire external representation of
the graph and allocated internal memory locations for
every node.
Hence we cannot reconstruct the graph in one scan.
That is, there may be a pointer from node A to node B,
where B is placed after A in the sequential file.
When we read node A we cannot yet map the id of B
to the address of node B,
as node B has not been allocated yet.
We can overcome this problem if the size
of every node is known in advance.
In this case we can allocate memory for a node
on first reference.
Otherwise, the mapping from id to pointer
cannot be done while reading the nodes.
The mapping can be done either in an extra scan
or at every reference to the node.
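.PP
The C sketch below illustrates the last alternative
(all names are ours and a single pointer field per node is assumed;
the node count and the range of the ids, needed to size the table,
are assumed to be stored in front of the nodes).
Nodes are allocated and entered into the id-to-address table in a
first scan; the pointer fields are filled in by an extra scan,
when every node has an address.
.DS
#include <stdio.h>
#include <stdlib.h>

struct gnode {
    int id;                 /* unique number identifying the node */
    int data;
    int link_id;            /* external form of the pointer: an id (0 = nil) */
    struct gnode *link;     /* internal form: an address */
};

struct gnode *rd_graph(FILE *fp, int nnodes, struct gnode **map)
{
    struct gnode *all = calloc(nnodes, sizeof *all);
    int i;

    for (i = 0; i < nnodes; i++) {          /* scan 1: allocate, fill table */
        fread(&all[i].id, sizeof(int), 1, fp);
        fread(&all[i].data, sizeof(int), 1, fp);
        fread(&all[i].link_id, sizeof(int), 1, fp);
        map[all[i].id] = &all[i];           /* id -> address of its node */
    }
    for (i = 0; i < nnodes; i++)            /* scan 2: map ids to pointers */
        all[i].link = all[i].link_id ? map[all[i].link_id] : NULL;
    return all;
}
.DE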
414
doc/ego/ic/ic3
Normal file
@ -0,0 +1,414 @@
.NH 2
Definition of the intermediate code
.PP
The intermediate code of the optimizer consists
of several components:
.IP -
the object table
.IP -
the procedure table
.IP -
the EM text
.IP -
the control flow graphs
.IP -
the loop tables
.LP
.PP
These components are described in
the next sections.
The syntactic structure of every component
is described by a set of context free syntax rules,
with the following conventions:
.DS
x            a non-terminal symbol
A            a terminal symbol (in capitals)
x: a b c;    a grammar rule
a | b        a or b
(a)+         1 or more occurrences of a
{a}          0 or more occurrences of a
.DE
.NH 3
The object table
.PP
EM programs declare blocks of bytes rather than (global) variables.
A typical program may declare 'HOL 7780'
to allocate space for 8 I/O buffers,
2 large arrays and 10 scalar variables.
The optimizer wants to deal with
.UL objects
like variables, buffers and arrays
and certainly not with huge numbers of bytes.
Therefore the intermediate code contains information
about which global objects are used.
This information can be obtained from an EM program
by just looking at the operands of instructions
such as LOE, LAE, LDE, STE, SDE, INE, DEE and ZRE.
.PP
The object table consists of a list of
.UL datablock
entries.
Each such entry represents a declaration like HOL, BSS,
CON or ROM.
There are five kinds of datablock entries.
The fifth kind,
UNKNOWN, denotes a declaration in a
separately compiled file that is not made
available to the optimizer.
Each datablock entry contains the type of the block,
its size, and a description of the objects that
belong to it.
If it is a ROM,
it also contains the list of values given
as arguments to the ROM pseudo,
provided that this list contains only integer numbers.
An object has an offset (within its datablock)
and a size.
The size need not always be determinable.
Both datablock and object contain a unique
identifying number
(see the previous section for their use).
.DS
.UL syntax
object_table:
    {datablock} ;
datablock:
    D_ID         -- unique identifying number
    PSEUDO       -- one of ROM, CON, BSS, HOL, UNKNOWN
    SIZE         -- # bytes declared
    FLAGS
    {value}      -- contents of rom
    {object} ;   -- objects of the datablock
object:
    O_ID         -- unique identifying number
    OFFSET       -- offset within the datablock
    SIZE ;       -- size of the object in bytes
value:
    argument ;
.DE
A data block has only one flag: "external", indicating
whether the data label is externally visible.
The syntax for "argument" will be given later on
(see em_text).
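.PP
Read into main memory, the object table might take the shape sketched
below in C. This is only an illustration: the field and type names are
ours, not those of the optimizer sources. A linked list is used because,
as described in the section on the IC phase, IC keeps both tables as
singly linked linear lists.
.DS
struct object {
    int     o_id;           /* unique identifying number (O_ID) */
    long    o_offset;       /* offset within the datablock */
    long    o_size;         /* size in bytes; may be unknown */
    struct object *o_next;
};

struct datablock {
    int     d_id;           /* unique identifying number (D_ID) */
    int     d_pseudo;       /* ROM, CON, BSS, HOL or UNKNOWN */
    long    d_size;         /* number of bytes declared */
    int     d_flags;        /* only flag: externally visible */
    long    *d_values;      /* contents of a rom, if all integers */
    struct object    *d_objects;
    struct datablock *d_next;
};
.DE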
.NH 3
The procedure table
.PP
The procedure table contains global information
about all procedures that are made available
to the optimizer
and that are needed by the EM program.
(Library units may not be needed, see section 3.5.)
The table has one entry for
every procedure.
.DS
.UL syntax
procedure_table:
    {procedure} ;
procedure:
    P_ID         -- unique identifying number
    #LABELS      -- number of instruction labels
    #LOCALS      -- number of bytes for locals
    #FORMALS     -- number of bytes for formals
    FLAGS        -- flag bits
    calling      -- procedures called by this one
    change       -- info about global variables changed
    use ;        -- info about global variables used
calling:
    {P_ID} ;     -- procedures called
change:
    ext          -- external variables changed
    FLAGS ;
use:
    FLAGS ;
ext:
    {O_ID} ;     -- a set of objects
.DE
.PP
The number of bytes of formal parameters accessed by
a procedure is determined by the front ends and
passed via a message (parameter message) to the optimizer.
If the front end is not able to determine this number
(e.g. the parameter may be an array of dynamic size or
the procedure may have a variable number of arguments),
the attribute contains the value 'UNKNOWN_SIZE'.
.sp 0
A procedure has the following flags:
.IP -
external: true if the procedure is externally visible
.IP -
bodyseen: true if its code is available as EM text
.IP -
calunknown: true if it calls a procedure whose bodyseen
flag is not set
.IP -
environ: true if it uses or changes a (non-global) variable in
a lexically enclosing procedure
.IP -
lpi: true if it is used as operand of an LPI instruction, so
it may be called indirectly
.LP
The change and use attributes both have one flag: "indirect",
indicating whether the procedure does a 'use indirect'
or a 'store indirect' (indirect means through a pointer).
.NH 3
The EM text
.PP
The EM text contains the EM instructions.
Every EM instruction has an operation code (opcode)
and 0 or 1 operands.
EM pseudo instructions can have more than
one operand.
The opcode is just a small (8 bit) integer.
.sp
There are several kinds of operands, which we will
refer to as
.UL types.
Many EM instructions can have more than one type of operand.
The types and their encodings in Compact Assembly Language
are discussed extensively in.
.[~[
keizer architecture
.], section 11.2]
Of special interest is the way numeric values
are represented.
Of prime importance is the machine independence of
the representation.
Ultimately, one could store every integer
just as a string of the characters '0' to '9'.
As doing arithmetic on strings is awkward,
Compact Assembly Language allows several alternatives.
The main idea is to look at the value of the integer.
Integers that fit in 16, 32 or 64 bits are
represented as a row of 2, 4 or 8 bytes respectively,
preceded by an indication of how many bytes are used.
Longer integers are represented as strings;
this is only allowed within pseudo instructions, however.
This concept works very well for target machines
with reasonable word sizes.
At present, most ACK software cannot be used for word sizes
larger than 32 bits,
although the handles for using larger word sizes are
present in the design of the EM code.
In the intermediate code we essentially use the
same ideas.
We allow three representations of integers:
.IP -
integers that fit in a short are represented as a short
.IP -
integers that fit in a long but not in a short are represented
as longs
.IP -
all remaining integers are represented as strings
(only allowed in pseudos).
.LP
The terms short and long are defined in
.[~[
ritchie reference manual programming language
.], section 4]
and depend only on the source machine
(i.e. the machine on which ACK runs),
not on the target machines.
For historical reasons a long will often be called an
.UL offset.
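.PP
The choice between these three representations can be pictured by the
small C function below (an illustrative sketch with names of our own;
SHRT_MIN, SHRT_MAX and strtol come from the C library of the source
machine, which is what the terms short and long refer to).
.DS
#include <limits.h>
#include <stdlib.h>
#include <errno.h>

enum int_repr { AS_SHORT, AS_LONG, AS_STRING };

enum int_repr repr_of(const char *digits)
{
    long v;

    errno = 0;
    v = strtol(digits, (char **)0, 10);
    if (errno != 0)
        return AS_STRING;   /* does not even fit in a long; pseudos only */
    if (v >= SHRT_MIN && v <= SHRT_MAX)
        return AS_SHORT;
    return AS_LONG;
}
.DE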
.PP
Operands can also be instruction labels,
objects or procedures.
Instruction labels are denoted by a
.UL label
.UL identifier,
which can be distinguished from a normal identifier.
.sp
The operand of a pseudo instruction can be a list of
.UL arguments.
Arguments can have the same type as operands, except
for the type short, which is not used for arguments.
Furthermore, an argument can be a string or
a string representation of a signed integer, unsigned integer
or floating point number.
If the number of arguments is not fully determined by
the pseudo instruction (e.g. a ROM pseudo can have any number
of arguments), then the list is terminated by a special
argument of type CEND.
.DS
.UL syntax
em_text:
    {line} ;
line:
    INSTR          -- opcode
    OPTYPE         -- operand type
    operand ;
operand:
    empty |        -- OPTYPE = NO
    SHORT |        -- OPTYPE = SHORT
    OFFSET |       -- OPTYPE = OFFSET
    LAB_ID |       -- OPTYPE = INSTRLAB
    O_ID |         -- OPTYPE = OBJECT
    P_ID |         -- OPTYPE = PROCEDURE
    {argument} ;   -- OPTYPE = LIST
argument:
    ARGTYPE
    arg ;
arg:
    empty |        -- ARGTYPE = CEND
    OFFSET |
    LAB_ID |
    O_ID |
    P_ID |
    string |       -- ARGTYPE = STRING
    const ;        -- ARGTYPE = ICON, UCON or FCON
string:
    LENGTH         -- number of characters
    {CHARACTER} ;
const:
    SIZE           -- number of bytes
    string ;       -- string representation of (un)signed
                   -- or floating point constant
.DE
.NH 3
The control flow graphs
.PP
Each procedure can be divided
into a number of basic blocks.
A basic block is a piece of code with
no jumps in, except at the beginning,
and no jumps out, except at the end.
.PP
Every basic block has a set of
.UL successors,
which are basic blocks that can follow it immediately in
the dynamic execution sequence.
The
.UL predecessors
are the basic blocks of which this one
is a successor.
The successor and predecessor attributes
of all basic blocks of a single procedure
are said to form the
.UL control
.UL flow
.UL graph
of that procedure.
.PP
Another important attribute is the
.UL immediate
.UL dominator.
A basic block B dominates a block C if
every path in the graph from the procedure entry block
to C goes through B.
The immediate dominator of C is the closest dominator
of C on any path from the entry block.
(Note that the dominator relation is transitive,
so the immediate dominator is well defined.)
.PP
A basic block also has an attribute containing
the identifiers of every
.UL loop
that the block belongs to (see next section for loops).
.DS
.UL syntax
control_flow_graph:
    {basic_block} ;
basic_block:
    B_ID       -- unique identifying number
    #INSTR     -- number of EM instructions
    succ
    pred
    idom       -- immediate dominator
    loops      -- set of loops
    FLAGS ;    -- flag bits
succ:
    {B_ID} ;
pred:
    {B_ID} ;
idom:
    B_ID ;
loops:
    {LP_ID} ;
.DE
The flag bits can have the values 'firm' and 'strong',
which are explained below.
.NH 3
The loop tables
.PP
Every procedure has an associated
.UL loop
.UL table
containing information about all the loops
in the procedure.
Loops can be detected by a close inspection of
the control flow graph.
The main idea is to look for two basic blocks,
B and C, for which the following holds:
.IP -
B is a successor of C
.IP -
B is a dominator of C
.LP
B is called the loop
.UL entry
and C is called the loop
.UL end.
Intuitively, C contains a jump backwards to
the beginning of the loop (B).
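.PP
A sketch of this detection in C is given below. It only illustrates the
idea and is not the actual optimizer source; the data structures and
names are ours, and the dominator test simply follows the chain of
immediate dominators described above.
.DS
/* Find loops: look for a back edge from a block C to a block B,
 * where B dominates C.  B is the loop entry, C the loop end.
 */
struct bblock {
    int b_id;
    struct bblock **b_succ;     /* successors, ended by a null pointer */
    struct bblock *b_idom;      /* immediate dominator (0 for the entry block) */
};

int dominates(struct bblock *b, struct bblock *c)
{
    while (c != 0) {            /* every dominator of c is on its idom chain */
        if (c == b)
            return 1;
        c = c->b_idom;
    }
    return 0;
}

void find_loops(struct bblock **block, int nblocks)
{
    int i, j;

    for (i = 0; i < nblocks; i++) {             /* candidate loop end C */
        for (j = 0; block[i]->b_succ[j] != 0; j++) {
            struct bblock *b = block[i]->b_succ[j];
            if (dominates(b, block[i])) {
                /* enter a loop with entry b and end block[i]
                 * into the loop table
                 */
            }
        }
    }
}
.DE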
.PP
A loop L1 is said to be
.UL nested
within loop L2 if all basic blocks of L1
are also part of L2.
It is important to note that loops could
originally be written as a well-structured for- or
while-loop or as a messy goto loop.
Hence loops may partly overlap without one
being nested inside the other.
The
.UL nesting
.UL level
of a loop is the number of loops in
which it is nested (so it is 0 for
an outermost loop).
The details of loop detection will be discussed later.
.PP
It is often desirable to know whether a
basic block gets executed during every iteration
of a loop.
This leads to the following definitions:
.IP -
A basic block B of a loop L is said to be a \fIfirm\fR block
of L if B is executed on all successive iterations of L,
with the only possible exception of the last iteration.
.IP -
A basic block B of a loop L is said to be a \fIstrong\fR block
of L if B is executed on all successive iterations of L.
.LP
Note that a strong block is also a firm block.
If a block is part of a conditional statement, it is neither
strong nor firm, as it may be skipped during some iterations
(see Fig. 3.2).
.DS
loop
    if cond1 then
        ...              -- this code will not
                         -- result in a firm or strong block
    end if;
    ...                  -- strong (always executed)
    exit when cond2;
    ...                  -- firm (not executed on
                         -- last iteration).
end loop;

    Fig. 3.2 Example of firm and strong block
.DE
.DS
.UL syntax
looptable:
    {loop} ;
loop:
    LP_ID      -- unique identifying number
    LEVEL      -- loop nesting level
    entry      -- loop entry block
    end ;
entry:
    B_ID ;
end:
    B_ID ;
.DE
80
doc/ego/ic/ic4
Normal file
@ -0,0 +1,80 @@
.NH 2
External representation of the intermediate code
.PP
The syntax of the intermediate code was given
in the previous section.
In this section we will make some remarks about
the representation of the code in sequential files.
.sp
We use sequential files in order to avoid
the bookkeeping of complex file indices.
As a consequence of this decision
we cannot store all components
of the intermediate code
in one file.
If a phase wishes to change some attribute
of a procedure,
or wants to add or delete entire procedures
(inline substitution may do the latter),
the procedure table will only be fully updated
after the entire EM text has been scanned.
Yet, the next phase undoubtedly wants
to read the procedure table before it
starts working on the EM text.
Hence there is an ordering problem, which
can easily be solved by putting the
procedure table in a separate file.
Similarly, the data block table is kept
in a file of its own.
.PP
The control flow graphs (CFGs) could be mixed
with the EM text.
Instead, we have chosen to put them
in a separate file too.
The control flow graph file should be regarded as a
file that imposes some structure on the EM-text file,
just as an overhead sheet containing a picture
of a flow chart may be put on top of an overhead sheet
containing statements.
The loop tables are also put in the CFG file.
A loop imposes an extra structure on the
CFGs and hence on the EM text.
So there are four files:
.IP -
the EM-text file
.IP -
the procedure table file
.IP -
the object table file
.IP -
the CFG and loop tables file
.LP
Every table is preceded by its length, in order to
tell where it ends.
The CFG file also contains the number of instructions of
every basic block,
indicating which part of the EM text belongs
to that block.
.DS
.UL syntax
intermediate_code:
    object_table_file
    proctable_file
    em_text_file
    cfg_file ;
object_table_file:
    LENGTH       -- number of objects
    object_table ;
proctable_file:
    LENGTH       -- number of procedures
    procedure_table ;
em_text_file:
    em_text ;
cfg_file:
    {per_proc} ; -- one for every procedure
per_proc:
    BLENGTH      -- number of basic blocks
    LLENGTH      -- number of loops
    control_flow_graph
    looptable ;
.DE
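.PP
As an illustration of how these length fields are used when the files
are read back, the C sketch below reads the part of the CFG file that
belongs to one procedure. The names are ours, and rd_block() and
rd_loop() are assumed helpers that read one basic block and one loop
table entry respectively.
.DS
#include <stdio.h>

struct bblock;
struct loop;
extern struct bblock *rd_block(FILE *fp);   /* assumed */
extern struct loop *rd_loop(FILE *fp);      /* assumed */

void rd_per_proc(FILE *fp)
{
    int blength, llength, i;

    fread(&blength, sizeof blength, 1, fp); /* BLENGTH: number of basic blocks */
    fread(&llength, sizeof llength, 1, fp); /* LLENGTH: number of loops */
    for (i = 0; i < blength; i++)
        (void) rd_block(fp);                /* a real reader would store these */
    for (i = 0; i < llength; i++)
        (void) rd_loop(fp);
}
.DE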
163
doc/ego/ic/ic5
Normal file
@ -0,0 +1,163 @@
.NH 2
The Intermediate Code construction phase
.PP
The first phase of the global optimizer,
called
.UL IC,
constructs a major part of the intermediate code.
To be specific, it produces:
.IP -
the EM text
.IP -
the object table
.IP -
part of the procedure table
.LP
The calling, change and use attributes of a procedure
and all its flags except the external and bodyseen flags
are computed by the next phase (the Control Flow phase).
.PP
As explained before,
the intermediate code does not contain
any names of variables or procedures.
The normal identifiers are replaced by identifying
numbers.
Yet, the output of the global optimizer must
contain normal identifiers, as this
output is in Compact Assembly Language format.
We certainly want all externally visible names
to be the same in the input as in the output,
because the optimized EM module may be a library unit,
used by other modules.
IC dumps the names of all procedures and data labels
on two files:
.IP -
the procedure dump file, containing tuples (P_ID, procedure name)
.IP -
the data dump file, containing tuples (D_ID, data label name)
.LP
The names of instruction labels are not dumped,
as they are not visible outside the procedure
in which they are defined.
.PP
The input to IC consists of one or more files.
Each file is either an EM module in Compact Assembly Language
format, or a Unix archive file (library) containing such modules.
IC only extracts those modules from a library that are
needed somehow, just as a linker does.
It is advisable to present as much code
of the EM program as possible to the optimizer,
although it is not required to present the whole program.
If a procedure is called somewhere in the EM text,
but its body (text) is not included in the input,
its bodyseen flag in the procedure table will still
be off.
Whenever such a procedure is called,
we assume the worst case for everything:
it will change and use all variables it has access to,
it will call every procedure, etc.
.sp
Similarly, if a data label is used
but not defined, the PSEUDO attribute in its data block
will be set to UNKNOWN.
.NH 3
Implementation
.PP
Part of the code for the EM Peephole Optimizer
.[
staveren peephole toplass
.]
has been used for IC.
In particular, the routines that read and unravel
Compact Assembly Language and the identifier
look-up mechanism have been reused.
New code was added to recognize objects,
build the object and procedure tables and
output the intermediate code.
.PP
IC uses singly linked linear lists for both the
procedure and the object table.
Hence there are no limits on the size of such
a table (except for the trivial fact that it must fit
in main memory).
Both tables are output after all EM code has
been processed.
IC reads the EM text of one entire procedure
at a time,
processes it and appends the modified code to
the EM text file.
EM code is represented internally as a doubly linked linear
list of EM instructions.
.PP
Objects are recognized by looking at the operands
of instructions that reference global data.
If we come across the instructions:
.DS
    LDE X+6      -- Load Double External
    LAE X+20     -- Load Address External
.DE
we conclude that the data block
preceded by the data label X contains an object
at offset 6 of size twice the word size,
and an object at offset 20 of unknown size.
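.PP
The sketch below shows how such operands can be turned into object
table entries. It is our own illustration, not the IC source;
enter_object() is an assumed helper that finds the data block of label
X and adds an object with the given offset and size to it, merging it
with an object already recorded at that offset.
.DS
enum opcode { LOE, STE, INE, DEE, ZRE, LDE, SDE, LAE };  /* stand-ins */

#define UNKNOWN_SIZE 0L

extern long wordsize;           /* the EM word size */
extern void enter_object(char *label, long offset, long size);

void note_global_ref(enum opcode op, char *label, long offset)
{
    switch (op) {
    case LOE: case STE: case INE: case DEE: case ZRE:
        enter_object(label, offset, wordsize);      /* one word */
        break;
    case LDE: case SDE:
        enter_object(label, offset, 2 * wordsize);  /* two words */
        break;
    case LAE:
        enter_object(label, offset, UNKNOWN_SIZE);  /* only the address is taken */
        break;
    }
}
.DE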
.sp
A data block entry of the object table is allocated
at the first reference to a data label.
If this reference is a defining occurrence
or an INA pseudo instruction,
the label is not externally visible
.[~[
keizer architecture
.], section 11.1.4.3]
In this case, the external flag of the data block
is turned off.
If the first reference is an applied occurrence
or an EXA pseudo instruction, the flag is set.
We record this information, because the
optimizer may change the order of defining and
applied occurrences.
The INA and EXA pseudos are removed from the EM text.
They may be regenerated by the last phase
of the optimizer.
.sp
Similar rules hold for the procedure table
and the INP and EXP pseudos.
.NH 3
Source files of IC
.PP
The source files of IC consist
of the files ic.c, ic.h and several packages.
.UL ic.h
contains type definitions, macros and
variable declarations that may be used by
ic.c and by every package.
.UL ic.c
contains the definitions of these variables,
the procedure
.UL main
and some high level I/O routines used by main.
.sp
Every package xxx consists of two files.
ic_xxx.h contains type definitions,
macros, variable declarations and
procedure declarations that may be used by
every .c file that includes this .h file.
The file ic_xxx.c provides the
definitions of these variables and
the implementation of the declared procedures.
IC uses the following packages:
.IP lookup: 18
procedures that look up procedure, data label
and instruction label names; procedures to dump
the procedure and data label names.
.IP lib:
one procedure that gets the next useful input module;
while scanning archives, it skips unnecessary modules.
.IP aux:
several auxiliary routines.
.IP io:
low-level I/O routines that unravel the Compact
Assembly Language.
.IP put:
routines that output the intermediate code.
.LP
112
doc/ego/il/il1
Normal file
@ -0,0 +1,112 @@
.bp
.NH 1
Inline substitution
.NH 2
Introduction
.PP
The Inline Substitution technique (IL)
tries to decrease the overhead associated
with procedure calls (invocations).
During a procedure call, several actions
must be undertaken to set up the right
environment for the called procedure.
.[
johnson calling sequence
.]
On return from the procedure, most of these
effects must be undone.
This entire process introduces significant
costs in execution time as well as
in object code size.
.PP
The inline substitution technique replaces
some of the calls by the modified body of
the called procedure, hence eliminating
the overhead.
Furthermore, as the calling and called procedure
are now integrated, they can be optimized
together, using other techniques of the optimizer.
This often leads to extra opportunities for
optimization
.[
ball predicting effects
.]
.[
carter code generation cacm
.]
.[
scheifler inline cacm
.]
.PP
An inline substitution of a call to a procedure P increases
the size of the program, unless P is very small or P is
called only once.
In the latter case, P can be eliminated.
In practice, procedures that are called only once occur
quite frequently, due to the
introduction of structured programming.
(Carter
.[
carter umi ann arbor
.]
states that almost 50% of the Pascal procedures
he analyzed were called just once.)
.PP
Scheifler
.[
scheifler inline cacm
.]
has a more general view of inline substitution.
In his model, the program under consideration is
allowed to grow by a certain amount,
i.e. code size is sacrificed to speed up the program.
The above two cases are just special cases of
his model, obtained by setting the size-change to
(approximately) zero.
He formulates the substitution problem as follows:
.IP
"Given a program, a subset of all invocations,
a maximum program size, and a maximum procedure size,
find a sequence of substitutions that minimizes
the expected execution time."
.LP
Scheifler shows that this problem is NP-complete
.[~[
aho hopcroft ullman analysis algorithms
.], chapter 10]
by a reduction from the Knapsack Problem.
Heuristics will have to be used to find a near-optimal
solution.
.PP
In the following sections we will extend
Scheifler's view and adapt it to the EM Global Optimizer.
We will first describe the transformations that have
to be applied to the EM text when a call is substituted
in line.
Next we will examine in which cases inline substitution
is not possible or desirable.
Heuristics will be developed for
choosing a good sequence of substitutions.
These heuristics make no demands on the user
(such as making profiles
.[
scheifler inline cacm
.]
or giving pragmats
.[~[
ichbiah ada military standard
.], section 6.3.2]),
although the model could easily be extended
to use such information.
Finally, we will discuss the implementation
of the IL phase of the optimizer.
.PP
We will often use the term inline expansion
as a synonym for inline substitution.
.sp 0
The inverse technique of procedure abstraction
(automatic subroutine generation)
.[
shaffer subroutine generation
.]
will not be discussed in this report.
93
doc/ego/il/il2
Normal file
@ -0,0 +1,93 @@
.NH 2
Parameters and local variables
.PP
In the EM calling sequence, the calling procedure
pushes its parameters on the stack
before doing the CAL.
The called routine first saves some
status information on the stack and then
allocates space for its own locals
(also on the stack).
Usually, one special purpose register,
the Local Base (LB) register,
is used to access both the locals and the
parameters.
If memory is highly segmented,
the stack frames of the caller and the callee
may be allocated in different fragments;
an extra Argument Base (AB) register is used
in this case to access the actual parameters.
See section 4.2 of
.[
keizer architecture
.]
for further details.
.PP
If a procedure call is expanded in line,
there are two problems:
.IP 1. 3
No stack frame will be allocated for the called procedure;
we must find another place to put its locals.
.IP 2.
The LB register cannot be used to access the actual
parameters;
as the CAL instruction is deleted, the LB will
still point to the local base of the \fIcalling\fR procedure.
.LP
The local variables of the called procedure will
be put in the stack frame of the calling procedure,
just after its own locals.
The size of the stack frame of the
calling procedure will be increased
during its entire lifetime.
Therefore our model will allow a
limit to be set on the number of bytes
for locals that the called procedure may have
(see next section).
.PP
There are several alternatives for accessing the parameters.
An actual parameter may be an arbitrary expression,
which we will refer to as
the \fIactual parameter expression\fR.
The value of this expression is stored
in a location on the stack (see above),
the \fIparameter location\fR.
.sp 0
The alternatives for accessing parameters are:
.IP -
save the value of the stack pointer at the point of the CAL
in a temporary variable X;
this variable can be used to simulate the AB register, i.e.
parameter locations are accessed via an offset to
the value of X.
.IP -
create a new temporary local variable T for
the parameter (in the stack frame of the caller);
every access to the parameter location must be changed
into an access to T.
.IP -
do not evaluate the actual parameter expression before the call;
instead, substitute this expression for every use of the
parameter location.
.LP
The first method may be expensive if X is not
put in a register.
We will not use this method.
The time required to evaluate and access the
parameters when the second method is used
will not differ much from the normal
calling sequence (i.e. a call that is not expanded in line).
It is not expensive, but there are no
extra savings either.
The third method is essentially the 'by name'
parameter mechanism of Algol60.
If the actual parameter is just a numeric constant,
it is advantageous to use this method.
Yet, there are several circumstances
under which it cannot or should not be used.
We will deal with this in the next section.
.sp 0
In general we will use the third method,
if it is possible and desirable.
Such parameters will be called \fIin line parameters\fR.
In all other cases we will use the second method.
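.PP
To illustrate the last two methods at the source level
(the procedure and variable names below are invented for this example;
the actual transformation is of course performed on the EM text),
consider a Pascal procedure
.DS
    procedure count(n, delta: integer);
    begin
        total := total + n * delta
    end;
.DE
and the call count(10, step + 1).
The constant 10 can be treated as an in line parameter:
every use of its parameter location is simply replaced by 10.
The expression step + 1 is handled by the second method:
it is evaluated once into a new temporary local t of the caller,
and every use of its parameter location becomes a use of t:
.DS
    count(10, step + 1);   --->   t := step + 1;
                                  total := total + 10 * t;
.DE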
164
doc/ego/il/il3
Normal file
@ -0,0 +1,164 @@
.NH 2
Feasibility and desirability analysis
.PP
Feasibility and desirability analysis
of in line substitution differ
somewhat from most other techniques.
Usually, much effort is needed to find
a feasible opportunity for optimization
(e.g. a redundant subexpression).
Desirability analysis then checks
if it is really advantageous to do
the optimization.
For IL, opportunities are easy to find.
Seeing whether an in line expansion is
desirable is not hard either.
Yet, the main problem is to find the most
desirable ones.
We will deal with this problem later and
will first attend to feasibility and
desirability analysis.
.PP
There are several reasons why a procedure invocation
cannot or should not be expanded in line.
.sp
A call to a procedure P cannot be expanded in line
in any of the following cases:
.IP 1. 3
The body of P is not available as EM text.
Clearly, there is no way to do the substitution.
.IP 2.
P, or any procedure called by P (transitively),
follows the chain of statically enclosing
procedures (via an LXL or LXA instruction)
or follows the chain of dynamically enclosing
procedures (via a DCH).
If the call were expanded in line,
one level would be removed from the chains,
leading to total chaos.
This chaos could be solved by patching up
every LXL, LXA or DCH in all procedures
that could be part of the chains,
but this is hard to implement.
.IP 3.
P, or any procedure called by P (transitively),
calls a procedure whose body is not
available as EM text.
The unknown procedure may use an LXL, LXA or DCH.
However, in several languages a separately
compiled procedure has no access to the
static or dynamic chain.
In this case
this point does not apply.
.IP 4.
P, or any procedure called by P (transitively),
uses the LPB instruction, which converts a
local base to an argument base;
as the locals and parameters are stored
in a non-standard way (differing from the
normal EM calling sequence) this instruction
would yield incorrect results.
.IP 5.
The total number of bytes of the parameters
of P is not known.
P may be a procedure with a variable number
of parameters or may have an array of dynamic size
as value parameter.
.LP
It is undesirable to expand a call to a procedure P in line
in any of the following cases:
.IP 1. 3
P is large, i.e. the number of EM instructions
of P exceeds some threshold.
The expanded code would be large too.
Furthermore, several programs in ACK,
including the global optimizer itself,
may run out of memory if they have to run
in a small address space and are presented with
very large procedures.
The threshold may be set to infinity,
in which case this point does not apply.
.IP 2.
P has many local variables.
All these variables would have to be allocated
in the stack frame of the calling procedure.
.PP
If a call may be expanded in line, we have to
decide how to access its parameters.
In the previous section we stated that we would
use in line parameters whenever possible and desirable.
There are several reasons why a parameter
cannot or should not be expanded in line.
.sp
No parameter of a procedure P can be expanded in line
in any of the following cases:
.IP 1. 3
P, or any procedure called by P (transitively),
does a store-indirect or a use-indirect (i.e. through
a pointer).
However, if the front end has generated messages
telling that certain parameters cannot be accessed
indirectly, those parameters may be expanded in line.
.IP 2.
P, or any procedure called by P (transitively),
calls a procedure whose body is not available as EM text.
The unknown procedure may do a store-indirect
or a use-indirect.
However, the same remark about front-end messages
as for case 1 holds here.
.IP 3.
The address of a parameter location is taken (via a LAL).
In the normal calling sequence, all parameters
are stored sequentially. If the address of one
parameter location is taken, the address of any
other parameter location can be computed from it.
Hence we must put every parameter in a temporary location;
furthermore, all these locations must be in
the same order as for the normal calling sequence.
.IP 4.
P has overlapping parameters; for example, it uses
the parameter at offset 10 both as a 2-byte and as a 4-byte
parameter.
Such code may be produced by the front ends if
the formal parameter is of some record type
with variants.
.PP
Sometimes a specific parameter must not be expanded in line.
.sp 0
An actual parameter expression cannot be expanded in line
in any of the following cases:
.IP 1. 3
P stores into the parameter location.
Even if the actual parameter expression is a simple
variable, it is incorrect to change the 'store into
formal' into a 'store into actual', because of
the parameter mechanism used.
In Pascal, the following expansion is incorrect:
.DS
    procedure p (x:integer);
    begin
        x := 20;
    end;
    ...
    a := 10;             a := 10;
    p(a);        --->    a := 20;
    write(a);            write(a);
.DE
.IP 2.
P changes any of the operands of the
actual parameter expression.
If the expression is expanded and evaluated
after the operand has been changed,
the wrong value will be used.
.IP 3.
The actual parameter expression has side effects.
It must be evaluated only once,
at the place of the call.
.LP
It is undesirable to expand an actual parameter in line
in the following case:
.IP 1. 3
The parameter is used more than once
(dynamically) and the actual parameter expression
is not just a simple variable or constant.
.LP
132
doc/ego/il/il4
Normal file
@ -0,0 +1,132 @@
.NH 2
Heuristic rules
.PP
Using the information described
in the previous section,
we can find all calls that can
be expanded in line, and for which
this expansion is desirable.
In general, we cannot expand all these calls,
so we have to choose the 'best' ones.
With every CAL instruction
that may be expanded, we associate
a \fIpay off\fR,
which expresses how desirable it is
to expand this specific CAL.
.sp
Let Tc denote the portion of EM text involved
in a specific call, i.e. the pushing of the actual
parameter expressions, the CAL itself,
the popping of the parameters and the
pushing of the result (if any, via an LFR).
Let Te denote the EM text that would be obtained
by expanding the call in line.
Let Pc be the original program and Pe the program
with Te substituted for Tc.
The pay off of the CAL depends on two factors:
.IP -
T = execution_time(Pe) - execution_time(Pc)
.IP -
S = code_size(Pe) - code_size(Pc)
.LP
The change in execution time (T) depends on:
.IP -
T1 = execution_time(Te) - execution_time(Tc)
.IP -
N = number of times Te or Tc gets executed.
.LP
We assume that T1 will be the same every
time the code gets executed.
This is a reasonable assumption.
(Note that we are talking about one CAL,
not about different calls to the same procedure.)
Hence
.DS
T = N * T1
.DE
T1 can be estimated by a careful analysis
of the transformations that are performed.
Below, we list everything that will be
different when a call is expanded in line:
.IP -
The CAL instruction is not executed.
This saves a subroutine jump.
.IP -
The instructions in the procedure prolog
are not executed.
These instructions, generated from the PRO pseudo,
save some machine registers
(including the old LB), set the new LB and allocate space
for the locals of the called routine.
The savings may be less if there are no
locals to allocate.
.IP -
In line parameters are not evaluated before the call
and are not pushed on the stack.
.IP -
All remaining parameters are stored in local variables,
instead of being pushed on the stack.
.IP -
If the number of parameters is nonzero,
the ASP instruction after the CAL is not executed.
.IP -
Every reference to an in line parameter is
substituted by the parameter expression.
.IP -
RET (return) instructions are replaced by
BRA (branch) instructions.
If the called procedure 'falls through'
(i.e. it has only one RET, at the end of its code),
even the BRA is not needed.
.IP -
The LFR instruction (fetch function result) is not executed.
.PP
Besides these changes, which are caused directly by IL,
other changes may occur as IL influences other optimization
techniques, such as Register Allocation and Constant Propagation.
Our heuristic rules do not take into account the quite
unpredictable effects on Register Allocation.
They do, however, favour calls that have numeric \fIconstants\fR
as parameters; especially the constant "0" as an inline
parameter gets high scores,
as further optimizations may often be possible.
.PP
It cannot be determined statically how often a CAL instruction gets
executed.
We will use \fIloop nesting\fR information here.
The nesting level of the loop in which
the CAL appears (if any) will be used as an
indication of the number of times it gets executed.
.PP
Based on all these facts,
the pay off of a call is computed as follows.
The model below was developed empirically.
Assume procedure P calls procedure Q.
The call takes place in basic block B.
.DS
ZP = # zero parameters
CP = # constant parameters - ZP
LN = Loop Nesting level (0 if outside any loop)
F  = \fIif\fR # formal parameters of Q > 0 \fIthen\fR 1 \fIelse\fR 0
FT = \fIif\fR Q falls through \fIthen\fR 1 \fIelse\fR 0
S  = size(Q) - 1 - # inline_parameters - F
L  = \fIif\fR # local variables of P > 0 \fIthen\fR 0 \fIelse\fR -1
A  = CP + 2 * ZP
N  = \fIif\fR LN=0 and P is never called from a loop \fIthen\fR 0 \fIelse\fR (LN+1)**2
FM = \fIif\fR B is a firm block \fIthen\fR 2 \fIelse\fR 1

pay_off = (100/S + FT + F + L + A) * N * FM
.DE
S stands for the size increase of the program,
which is slightly less than the size of Q.
The size of a procedure is taken to be its number
of (non-pseudo) EM instructions.
The terms "loop nesting level" and "firm" were defined
in the chapter on the Intermediate Code (section "loop tables").
If a call is not inside a loop and the calling procedure
is itself never called from a loop (transitively),
then the call will probably be executed at most once.
Such a call is never expanded in line (its pay off is zero).
If the calling procedure does not have local variables, a penalty (L)
is introduced, as it will most likely get local variables if the
call gets expanded.
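.PP
As an illustration of the formula, consider a hypothetical call
(the numbers are invented for this example,
and we take 100/S to be an integer division;
with real division the result would be about 87):
Q consists of 21 EM instructions, falls through and has two formal
parameters, both of which can be passed as in line parameters;
one actual parameter is the constant 0, the other is a nonzero constant.
P has local variables, the CAL appears at loop nesting level 1
and its basic block is firm.
.DS
ZP = 1,  CP = 2 - 1 = 1,  LN = 1,  F = 1,  FT = 1
S  = 21 - 1 - 2 - 1 = 17
L  = 0,  A = 1 + 2 * 1 = 3
N  = (1 + 1)**2 = 4,  FM = 2

pay_off = (100/17 + 1 + 1 + 0 + 3) * 4 * 2 = (5 + 5) * 8 = 80
.DE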
440
doc/ego/il/il5
Normal file
@ -0,0 +1,440 @@
.NH 2
|
||||||
|
Implementation
|
||||||
|
.PP
|
||||||
|
A major factor in the implementation
|
||||||
|
of Inline Substitution is the requirement
|
||||||
|
not to use an excessive amount of memory.
|
||||||
|
IL essentially analyzes the entire program;
|
||||||
|
it makes decisions based on which procedure calls
|
||||||
|
appear in the whole program.
|
||||||
|
Yet, because of the memory restriction, it is
|
||||||
|
not feasible to read the entire program
|
||||||
|
in main memory.
|
||||||
|
To solve this problem, the IL phase has been
|
||||||
|
split up into three subphases that are executed sequentially:
|
||||||
|
.IP 1.
|
||||||
|
analyze every procedure; see how it accesses its parameters;
|
||||||
|
simultaneously collect all calls
|
||||||
|
appearing in the whole program and put them
|
||||||
|
in a \fIcall-list\fR.
|
||||||
|
.IP 2.
|
||||||
|
use the call-list and decide which calls will be substituted
|
||||||
|
in line.
|
||||||
|
.IP 3.
|
||||||
|
take the decisions of subphase 2 and modify the
|
||||||
|
program accordingly.
|
||||||
|
.LP
|
||||||
|
Subphases 1 and 3 scan the input program; only
|
||||||
|
subphase 3 modifies it.
|
||||||
|
It is essential that the decisions can be made
|
||||||
|
in subphase 2
|
||||||
|
without using the input program,
|
||||||
|
provided that subphase 1 puts enough information
|
||||||
|
in the call-list.
|
||||||
|
Subphase 2 keeps the entire call-list in main memory
|
||||||
|
and repeatedly scans it, to
|
||||||
|
find the next best candidate for expansion.
|
||||||
|
.PP
|
||||||
|
We will specify the
|
||||||
|
data structures used by IL before
|
||||||
|
describing the subphases.
|
||||||
|
.NH 3
|
||||||
|
Data structures
|
||||||
|
.NH 4
|
||||||
|
The procedure table
|
||||||
|
.PP
|
||||||
|
In subphase 1 information is gathered about every procedure
|
||||||
|
and added to the procedure table.
|
||||||
|
This information is used by the heuristic rules.
|
||||||
|
A proctable entry for procedure p has
|
||||||
|
the following extra information:
|
||||||
|
.IP -
|
||||||
|
is it allowed to substitute an invocation of p in line?
|
||||||
|
.IP -
|
||||||
|
is it allowed to put any parameter of such a call in line?
|
||||||
|
.IP -
|
||||||
|
the size of p (number of EM instructions)
|
||||||
|
.IP -
|
||||||
|
does p 'fall through'?
|
||||||
|
.IP -
|
||||||
|
a description of the formal parameters that p accesses; this information
|
||||||
|
is obtained by looking at the code of p. For every parameter f,
|
||||||
|
we record:
|
||||||
|
.RS
|
||||||
|
.IP -
|
||||||
|
the offset of f
|
||||||
|
.IP -
|
||||||
|
the type of f (word, double word, pointer)
|
||||||
|
.IP -
|
||||||
|
may the corresponding actual parameter be put in line?
|
||||||
|
.IP -
|
||||||
|
is f ever accessed indirectly?
|
||||||
|
.IP -
|
||||||
|
is f used never, once, or more than once?
|
||||||
|
.RE
|
||||||
|
.IP -
|
||||||
|
the number of times p is called (see below)
|
||||||
|
.IP -
|
||||||
|
the file address of its call-count information (see below).
|
||||||
|
.LP
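.PP
The information listed above can be pictured as a C structure.
The declarations below are merely an illustration with invented names;
the real declarations in il.h are likely to differ.
.DS
/* Hypothetical sketch of the extra proctable information used by IL. */
struct formal {
        int  f_offset;          /* offset of the formal parameter       */
        int  f_type;            /* word, double word or pointer         */
        int  f_inline;          /* may the actual parameter be in line? */
        int  f_indirect;        /* is it ever accessed indirectly?      */
        int  f_usage;           /* used never, once or more than once   */
        struct formal *f_next;
};

struct il_proc {
        int  p_inline;          /* may a call to p be put in line?      */
        int  p_param_inline;    /* may any parameter be put in line?    */
        int  p_size;            /* number of (non-pseudo) EM instructions */
        int  p_falls_through;
        struct formal *p_formals;
        int  p_ncalls;          /* number of times p is called          */
        long p_ccaddr;          /* file address of call-count info      */
};
.DE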
|
||||||
|
.NH 4
|
||||||
|
Call-count information
|
||||||
|
.PP
|
||||||
|
As a result of Inline Substitution, some procedures may
|
||||||
|
become useless, because all their invocations have been
|
||||||
|
substituted in line.
|
||||||
|
One of the tasks of IL is to keep track of which
|
||||||
|
procedures are no longer called.
|
||||||
|
Note that IL is especially keen on procedures that are
|
||||||
|
called only once
|
||||||
|
(possibly as a result of expanding all other calls to it).
|
||||||
|
So we want to know how many times a procedure
|
||||||
|
is called \fIduring\fR Inline Substitution.
|
||||||
|
It is not good enough to compute this
|
||||||
|
information afterwards.
|
||||||
|
The task is rather complex, because
|
||||||
|
the number of times a procedure is called
|
||||||
|
varies during the entire process:
|
||||||
|
.IP 1.
|
||||||
|
If a call to p is substituted in line,
|
||||||
|
the number of calls to p gets decremented by 1.
|
||||||
|
.IP 2.
|
||||||
|
If a call to p is substituted in line,
|
||||||
|
and p contains n calls to q, then the number of calls to q
|
||||||
|
gets incremented by n.
|
||||||
|
.IP 3.
|
||||||
|
If a procedure p is removed (because it is no
|
||||||
|
longer called) and p contains n calls to q,
|
||||||
|
then the number of calls to q gets decremented by n.
|
||||||
|
.LP
|
||||||
|
(Note that p may be the same as q, if p is recursive).
|
||||||
|
.sp 0
|
||||||
|
So we actually want to have the following information:
|
||||||
|
.DS
|
||||||
|
NRCALL(p,q) = number of calls to q appearing in p,
|
||||||
|
|
||||||
|
for all procedures p and q that may be put in line.
|
||||||
|
.DE
|
||||||
|
This information, called \fIcall-count information\fR is
|
||||||
|
computed by the first subphase.
|
||||||
|
It is stored in a file.
|
||||||
|
It is represented as a number of lists, rather than as
|
||||||
|
a (very sparse) matrix.
|
||||||
|
Every procedure has a list of (proc,count) pairs,
|
||||||
|
telling which procedures it calls, and how many times.
|
||||||
|
The file address of its call-count list is stored
|
||||||
|
in its proctable entry.
|
||||||
|
Whenever this information is needed, it is fetched from
|
||||||
|
the file, using direct access.
|
||||||
|
The proctable entry also contains the number of times
|
||||||
|
a procedure is called, at any moment.
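.PP
A sketch of this list representation, again with made-up names, could be:
.DS
/* Hypothetical sketch: the call-count information of a procedure p is
 * a list of (procedure, count) pairs, read from the call-count file.
 */
struct count_elem {
        int  c_proc;            /* a procedure q called from p */
        int  c_count;           /* NRCALL(p,q)                 */
        struct count_elem *c_next;
};
.DE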
|
||||||
|
.NH 4
|
||||||
|
The call-list
|
||||||
|
.PP
|
||||||
|
The call-list is the major data structure used by IL.
|
||||||
|
Every item of the list describes one procedure call.
|
||||||
|
It contains the following attributes:
|
||||||
|
.IP -
|
||||||
|
the calling procedure (caller)
|
||||||
|
.IP -
|
||||||
|
the called procedure (callee)
|
||||||
|
.IP -
|
||||||
|
identification of the CAL instruction (sequence number)
|
||||||
|
.IP -
|
||||||
|
the loop nesting level; our heuristic rules appreciate
|
||||||
|
calls inside a loop (or even inside a loop nested inside
|
||||||
|
another loop, etc.) more than other calls
|
||||||
|
.IP -
|
||||||
|
the actual parameter expressions involved in the call;
|
||||||
|
for every actual, we record:
|
||||||
|
.RS
|
||||||
|
.IP -
|
||||||
|
the EM code of the expression
|
||||||
|
.IP -
|
||||||
|
the number of bytes of its result (size)
|
||||||
|
.IP -
|
||||||
|
an indication if the actual may be put in line
|
||||||
|
.RE
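.PP
The attributes above may be summarized in a C structure; the
declarations below are an illustrative sketch only, with invented names.
.DS
/* Hypothetical sketch of one call-list item. */
struct actual {
        char *a_code;           /* EM code of the expression      */
        int   a_size;           /* size of its result in bytes    */
        int   a_inline;         /* may it be put in line?         */
        struct actual *a_next;
};

struct call {
        int   cl_caller;        /* calling procedure              */
        int   cl_callee;        /* called procedure               */
        int   cl_id;            /* sequence number of the CAL     */
        int   cl_level;         /* loop nesting level             */
        struct actual *cl_act;  /* actual parameter expressions   */
        struct call *cl_cdr;    /* next call at the same level    */
        struct call *cl_car;    /* nested calls (see below)       */
};
.DE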
|
||||||
|
.LP
|
||||||
|
The structure of the call-list is rather complex.
|
||||||
|
Whenever a call is expanded in line, new calls
|
||||||
|
will suddenly appear in the program,
|
||||||
|
that were not contained in the original body
|
||||||
|
of the calling subroutine.
|
||||||
|
These calls are inherited from the called procedure.
|
||||||
|
We will refer to these invocations as \fInested calls\fR
|
||||||
|
(see Fig. 5.1).
|
||||||
|
.DS
|
||||||
|
procedure p is
begin                              .
    a();                           .
    b();                           .
end;

procedure r is                     procedure r is
begin                              begin
    x();                               x();
    p();    -- in line                 a();    -- nested call
    y();                               b();    -- nested call
end;                                   y();
                                   end;

        Fig. 5.1 Example of nested procedure calls
|
||||||
|
.DE
|
||||||
|
Nested calls may subsequently be put in line too
|
||||||
|
(probably resulting in a yet deeper nesting level, etc.).
|
||||||
|
So the call-list does not always reflect the source program,
|
||||||
|
but changes dynamically, as decisions are made.
|
||||||
|
If a call to p is expanded, all calls appearing in p
|
||||||
|
will be added to the call-list.
|
||||||
|
.sp 0
|
||||||
|
A convenient and elegant way to represent
|
||||||
|
the call-list is to use a LISP-like list.
|
||||||
|
.[
|
||||||
|
poel lisp trac
|
||||||
|
.]
|
||||||
|
Calls that appear at the same level
|
||||||
|
are linked in the CDR direction. If a call C
|
||||||
|
to a procedure p is expanded,
|
||||||
|
all calls appearing in p are put in a sub-list
|
||||||
|
of C, i.e. in its CAR.
|
||||||
|
In the example above, before the decision
|
||||||
|
to expand the call to p is made, the
|
||||||
|
call-list of procedure r looks like:
|
||||||
|
.DS
|
||||||
|
(call-to-x, call-to-p, call-to-y)
|
||||||
|
.DE
|
||||||
|
After the decision, it looks like:
|
||||||
|
.DS
|
||||||
|
(call-to-x, (call-to-p*, call-to-a, call-to-b), call-to-y)
|
||||||
|
.DE
|
||||||
|
The call to p is marked, because it has been
|
||||||
|
substituted.
|
||||||
|
Whenever IL wants to traverse the call-list of some procedure,
|
||||||
|
it uses the well-known LISP technique of
|
||||||
|
recursion in the CAR direction and
|
||||||
|
iteration in the CDR direction
|
||||||
|
(see page 1.19-2 of
|
||||||
|
.[
|
||||||
|
poel lisp trac
|
||||||
|
.]
|
||||||
|
).
|
||||||
|
All list traversals look like:
|
||||||
|
.DS
|
||||||
|
traverse(list)
{
        for (c = first(list); c != 0; c = CDR(c)) {
                if (c is marked) {
                        /* expanded call: traverse its nested calls */
                        traverse(CAR(c));
                } else {
                        /* process call c itself */
                        do something with c
                }
        }
}
|
||||||
|
.DE
|
||||||
|
The entire call-list consists of a number of LISP-like lists,
|
||||||
|
one for every procedure.
|
||||||
|
The proctable entry of a procedure contains a pointer
|
||||||
|
to the beginning of the list.
|
||||||
|
.NH 3
|
||||||
|
The first subphase: procedure analysis
|
||||||
|
.PP
|
||||||
|
The tasks of the first subphase are to determine
|
||||||
|
several attributes of every procedure
|
||||||
|
and to construct the basic call-list,
|
||||||
|
i.e. without nested calls.
|
||||||
|
The size of a procedure is determined
|
||||||
|
by simply counting its EM instructions.
|
||||||
|
Pseudo instructions are skipped.
|
||||||
|
A procedure does not 'fall through' if its CFG
|
||||||
|
contains a basic block
|
||||||
|
that is not the last block of the CFG and
|
||||||
|
that ends on a RET instruction.
|
||||||
|
The formal parameters of a procedure are determined
|
||||||
|
by inspection of
|
||||||
|
its code.
|
||||||
|
.PP
|
||||||
|
The call-list is constructed by looking at all CAL instructions
|
||||||
|
appearing in the program.
|
||||||
|
The call-list should only contain calls to procedures
|
||||||
|
that may be put in line.
|
||||||
|
This fact is only known if the procedure was
|
||||||
|
analyzed earlier.
|
||||||
|
If a call to a procedure p appears in the program
|
||||||
|
before the body of p,
|
||||||
|
the call will always be put in the call-list.
|
||||||
|
If p is later found to be unsuitable,
|
||||||
|
the call will be removed from the list by the
|
||||||
|
second subphase.
|
||||||
|
.PP
|
||||||
|
An important issue is the recognition
|
||||||
|
of the actual parameter expressions of the call.
|
||||||
|
The front ends produce messages telling how many
|
||||||
|
bytes of formal parameters every procedure accesses.
|
||||||
|
(If there is no such message for a procedure, it
|
||||||
|
cannot be put in line).
|
||||||
|
The actual parameters together must account for
|
||||||
|
the same number of bytes.
A recursive descent parser is used
|
||||||
|
to parse side-effect free EM expressions.
|
||||||
|
It uses a table and some
|
||||||
|
auxiliary routines to determine
|
||||||
|
how many bytes every EM instruction pops from the stack
|
||||||
|
and how many bytes it pushes onto the stack.
|
||||||
|
These numbers depend on the EM instruction, its argument,
|
||||||
|
and the wordsize and pointersize of the target machine.
|
||||||
|
Initially, the parser has to recognize the
|
||||||
|
number of bytes specified in the formals-message,
|
||||||
|
say N.
|
||||||
|
Assume the first instruction before the CAL pops S bytes
|
||||||
|
and pushes R bytes.
|
||||||
|
If R > N, too many bytes are recognized
|
||||||
|
and the parser fails.
|
||||||
|
Else, it calls itself recursively to recognize the
|
||||||
|
S bytes used as operand of the instruction.
|
||||||
|
If it succeeds in doing so, it continues with the next instruction,
|
||||||
|
i.e. the first instruction before the code recognized by
|
||||||
|
the recursive call, to recognize N-R more bytes.
|
||||||
|
The result is a number of EM instructions that collectively push N bytes.
|
||||||
|
If an instruction is encountered that has side-effects
|
||||||
|
(e.g. a store or a procedure call) or of which R and S cannot
|
||||||
|
be computed statically (e.g. a LOS), it fails.
|
||||||
|
.sp 0
|
||||||
|
Note that the parser traverses the code backwards.
|
||||||
|
As EM code is essentially postfix code, the parser works top down.
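.PP
A sketch of this parser is given below.
The instruction representation and the helper routines are assumptions
made for the purpose of illustration; the real routines in the IL
sources differ.
.DS
struct instr;                                  /* an EM instruction */
extern struct instr *prev(struct instr *);     /* preceding instruction */
extern int instr_pop(struct instr *);          /* bytes popped, or -1 */
extern int instr_push(struct instr *);         /* bytes pushed, or -1 */
extern int has_side_effects(struct instr *);

/* Recognize, ending at *ip, code that pushes exactly n bytes.
 * Return 1 on success, with *ip left at the first instruction of the
 * recognized code; return 0 on failure.
 */
int
recognize(struct instr **ip, int n)
{
        while (n > 0) {
                struct instr *i = *ip;
                int s, r;

                if (i == 0 || has_side_effects(i))
                        return 0;
                s = instr_pop(i);
                r = instr_push(i);
                if (s < 0 || r < 0 || r > n)
                        return 0;       /* unknown, or too many bytes */
                *ip = prev(i);
                if (!recognize(ip, s))  /* recognize the operands of i */
                        return 0;
                n -= r;
        }
        return 1;
}
.DE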
|
||||||
|
.PP
|
||||||
|
If the parser fails to recognize the parameters, the call will not
|
||||||
|
be substituted in line.
|
||||||
|
If the parameters can be determined, they still have to
|
||||||
|
match the formal parameters of the called procedure.
|
||||||
|
This check is performed by the second subphase; it cannot be
|
||||||
|
done here, because it is possible that the called
|
||||||
|
procedure has not been analyzed yet.
|
||||||
|
.PP
|
||||||
|
The entire call-list is written to a file,
|
||||||
|
to be processed by the second subphase.
|
||||||
|
.NH 3
|
||||||
|
The second subphase: making decisions
|
||||||
|
.PP
|
||||||
|
The task of the second subphase is quite easy
|
||||||
|
to understand.
|
||||||
|
It reads the call-list file,
|
||||||
|
builds an incore call-list and deletes every
|
||||||
|
call that may not be expanded in line (either because the called
|
||||||
|
procedure may not be put in line, or because the actual parameters
|
||||||
|
of the call do not match the formal parameters of the called procedure).
|
||||||
|
It assigns a \fIpay-off\fR to every call,
|
||||||
|
indicating how desirable it is to expand it.
|
||||||
|
.PP
|
||||||
|
The subphase repeatedly scans the call-list and takes
|
||||||
|
the call with the highest ratio.
|
||||||
|
The chosen one gets marked,
|
||||||
|
and the call-list is extended with the nested calls,
|
||||||
|
as described above.
|
||||||
|
These nested calls are also assigned a ratio,
|
||||||
|
and will be considered too during the next scans.
|
||||||
|
.sp 0
|
||||||
|
After every decision the number of times
|
||||||
|
every procedure is called is updated, using
|
||||||
|
the call-count information.
|
||||||
|
Meanwhile, the subphase keeps track of the amount of space left
|
||||||
|
available.
|
||||||
|
If all space is used, or if there are no more calls left to
|
||||||
|
be expanded, it exits this loop.
|
||||||
|
Finally, calls to procedures that are called only
|
||||||
|
once are also chosen.
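.PP
In outline, and with hypothetical helper names, the decision loop looks
as follows:
.DS
/* Illustrative outline of subphase 2; all helpers are assumed. */
while (space_left > 0) {
        struct call *best = best_ratio_call();  /* highest pay-off  */

        if (best == 0)
                break;                  /* no expandable calls left  */
        mark(best);                     /* decide to expand it       */
        add_nested_calls(best);         /* these get a ratio too     */
        update_call_counts(best);       /* using NRCALL information  */
        space_left -= size_increase(best);
}
expand_single_call_procs();             /* procedures called once    */
.DE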
|
||||||
|
.PP
|
||||||
|
The actual parameters of a call are only needed by
|
||||||
|
this subphase to assign a ratio to a call.
|
||||||
|
To save some space, these actuals are not kept in main memory.
|
||||||
|
They are removed after the call has been read and a ratio
|
||||||
|
has been assigned to it.
|
||||||
|
So this subphase works with \fIabstracts\fR of calls.
|
||||||
|
After all work has been done,
|
||||||
|
the actual parameters of the chosen calls are retrieved
|
||||||
|
from a file,
|
||||||
|
as they are needed by the transformation subphase.
|
||||||
|
.NH 3
|
||||||
|
The third subphase: doing transformations
|
||||||
|
.PP
|
||||||
|
The third subphase makes the actual modifications to
|
||||||
|
the EM text.
|
||||||
|
It is directed by the decisions made in the previous subphase,
|
||||||
|
as expressed via the call-list.
|
||||||
|
The call-list read by this subphase contains
|
||||||
|
only calls that were selected for expansion.
|
||||||
|
The list is ordered in the same way as the EM text,
|
||||||
|
i.e. if a call C1 appears before a call C2 in the call-list,
|
||||||
|
C1 also appears before C2 in the EM text.
|
||||||
|
So the EM text is traversed linearly,
|
||||||
|
the calls that have to be substituted are determined
|
||||||
|
and the modifications are made.
|
||||||
|
If a procedure is encountered that is no longer needed,
|
||||||
|
it is simply not written to the output EM file.
|
||||||
|
The substitution of a call takes place in distinct steps:
|
||||||
|
.IP "change the calling sequence" 7
|
||||||
|
.sp 0
|
||||||
|
The actual parameter expressions are changed.
|
||||||
|
Parameters that are put in line are removed.
|
||||||
|
All remaining ones must store their result in a
|
||||||
|
temporary local variable, rather than
|
||||||
|
push it on the stack.
|
||||||
|
The CAL instruction and any ASP (to pop actual parameters)
|
||||||
|
or LFR (to fetch the result of a function)
|
||||||
|
are deleted.
|
||||||
|
.IP "fetch the text of the called procedure"
|
||||||
|
.sp 0
|
||||||
|
Direct disk access is used to read the text of the
|
||||||
|
called procedure.
|
||||||
|
The file offset is obtained from the proctable entry.
|
||||||
|
.IP "allocate bytes for locals and temporaries"
|
||||||
|
.sp 0
|
||||||
|
The local variables of the called procedure will be put in the
|
||||||
|
stack frame of the calling procedure.
|
||||||
|
The same applies to any temporary variables
|
||||||
|
that hold the result of parameters
|
||||||
|
that were not put in line.
|
||||||
|
The proctable entry of the caller is updated.
|
||||||
|
.IP "put a label after the CAL"
|
||||||
|
.sp 0
|
||||||
|
If the called procedure contains a RET (return) instruction
|
||||||
|
somewhere in the middle of its text (i.e. it does
|
||||||
|
not fall through), the RET must be changed into
|
||||||
|
a BRA (branch) to this label, to jump over the
remainder of the text.
|
||||||
|
This label is not needed if the called
|
||||||
|
procedure falls through.
|
||||||
|
.IP "copy the text of the called procedure and modify it"
|
||||||
|
.sp 0
|
||||||
|
References to local variables of the called routine
|
||||||
|
and to parameters that are not put in line
|
||||||
|
are changed to refer to the
|
||||||
|
new locals of the caller.
|
||||||
|
References to in line parameters are replaced
|
||||||
|
by the actual parameter expression.
|
||||||
|
Returns (RETs) are either deleted or
|
||||||
|
replaced by a BRA.
|
||||||
|
Messages containing information about local
|
||||||
|
variables or parameters are changed.
|
||||||
|
Global data declarations and the PRO and END pseudos
|
||||||
|
are removed.
|
||||||
|
Instruction labels and references to them are
|
||||||
|
changed to make sure they do not have the
|
||||||
|
same identifying number as
|
||||||
|
labels in the calling procedure.
|
||||||
|
.IP "insert the modified text"
|
||||||
|
.sp 0
|
||||||
|
The pseudos of the called procedure are put after the pseudos
|
||||||
|
of the calling procedure.
|
||||||
|
The real text of the callee is put at
|
||||||
|
the place where the CAL was.
|
||||||
|
.IP "take care of nested substitutions"
|
||||||
|
.sp 0
|
||||||
|
The expanded procedure may contain calls that
|
||||||
|
have to be expanded too (nested calls).
|
||||||
|
If the descriptor of this call contains actual
|
||||||
|
parameter expressions,
|
||||||
|
the code of the expressions has to be changed
|
||||||
|
the same way as the code of the callee was changed.
|
||||||
|
Next, the entire process of finding CALs and doing
|
||||||
|
the substitutions is repeated recursively.
|
||||||
|
.LP
|
27
doc/ego/il/il6
Normal file
|
@ -0,0 +1,27 @@
|
||||||
|
.NH 2
|
||||||
|
Source files of IL
|
||||||
|
.PP
|
||||||
|
The sources of IL are in the following files
|
||||||
|
and packages (the prefixes 1_, 2_ and 3_ refer to the three subphases):
|
||||||
|
.IP il.h: 14
|
||||||
|
declarations of global variables and
|
||||||
|
data structures
|
||||||
|
.IP il.c:
|
||||||
|
the routine main; the driving routines of the three subphases
|
||||||
|
.IP 1_anal:
|
||||||
|
contains a subroutine that analyzes a procedure
|
||||||
|
.IP 1_cal:
|
||||||
|
contains a subroutine that analyzes a call
|
||||||
|
.IP 1_aux:
|
||||||
|
implements auxiliary procedures used by subphase 1
|
||||||
|
.IP 2_aux:
|
||||||
|
implements auxiliary procedures used by subphase 2
|
||||||
|
.IP 3_subst:
|
||||||
|
the driving routine for doing the substitution
|
||||||
|
.IP 3_change:
|
||||||
|
lower level routines that do certain modifications
|
||||||
|
.IP 3_aux:
|
||||||
|
implements auxiliary procedures used by subphase 3
|
||||||
|
.IP aux:
|
||||||
|
implements auxiliary procedures used by several subphases.
|
||||||
|
.LP
|
7
doc/ego/intro/head
Normal file
|
@ -0,0 +1,7 @@
|
||||||
|
.ND
|
||||||
|
.ll 80m
|
||||||
|
.nr LL 80m
|
||||||
|
.nr tl 78m
|
||||||
|
.tr ~
|
||||||
|
.ds >. .
|
||||||
|
.ds [. " \[
|
79
doc/ego/intro/intro1
Normal file
|
@ -0,0 +1,79 @@
|
||||||
|
.TL
|
||||||
|
The design and implementation of
|
||||||
|
the EM Global Optimizer
|
||||||
|
.AU
|
||||||
|
H.E. Bal
|
||||||
|
.AI
|
||||||
|
Vrije Universiteit
|
||||||
|
Wiskundig Seminarium, Amsterdam
|
||||||
|
.AB
|
||||||
|
The EM Global Optimizer is part of the Amsterdam Compiler Kit,
|
||||||
|
a toolkit for making retargetable compilers.
|
||||||
|
It optimizes the intermediate code common to all compilers of
|
||||||
|
the toolkit (EM),
|
||||||
|
so it can be used for all programming languages and
|
||||||
|
all processors supported by the kit.
|
||||||
|
.PP
|
||||||
|
The optimizer is based on well-understood concepts like
|
||||||
|
control flow analysis and data flow analysis.
|
||||||
|
It performs the following optimizations:
|
||||||
|
Inline Substitution, Strength Reduction, Common Subexpression Elimination,
|
||||||
|
Stack Pollution, Cross Jumping, Branch Optimization, Copy Propagation,
|
||||||
|
Constant Propagation, Dead Code Elimination and Register Allocation.
|
||||||
|
.PP
|
||||||
|
This report describes the design of the optimizer and several
|
||||||
|
of its implementation issues.
|
||||||
|
.AE
|
||||||
|
.bp
|
||||||
|
.NH 1
|
||||||
|
Introduction
|
||||||
|
.PP
|
||||||
|
.FS
|
||||||
|
This work was supported by the
|
||||||
|
Stichting Technische Wetenschappen (STW)
|
||||||
|
under grant VWI00.0001.
|
||||||
|
.FE
|
||||||
|
The EM Global Optimizer is part of a software toolkit
|
||||||
|
for making production-quality retargetable compilers.
|
||||||
|
This toolkit,
|
||||||
|
called the Amsterdam Compiler Kit
|
||||||
|
.[
|
||||||
|
tanenbaum toolkit rapport
|
||||||
|
.]
|
||||||
|
.[
|
||||||
|
tanenbaum toolkit cacm
|
||||||
|
.]
|
||||||
|
runs under the Unix*
|
||||||
|
.FS
|
||||||
|
*Unix is a Trademark of Bell Laboratories
|
||||||
|
.FE
|
||||||
|
operating system.
|
||||||
|
.sp 0
|
||||||
|
The main design philosophy of the toolkit is to use
|
||||||
|
a language- and machine-independent
|
||||||
|
intermediate code, called EM.
|
||||||
|
.[
|
||||||
|
keizer architecture
|
||||||
|
.]
|
||||||
|
The basic compilation process can be split up into
|
||||||
|
two parts.
|
||||||
|
A language-specific front end translates the source program into EM.
|
||||||
|
A machine-specific back end transforms EM to assembly code
|
||||||
|
of the target machine.
|
||||||
|
.PP
|
||||||
|
The global optimizer is an optional phase of the
|
||||||
|
compilation process, and can be used to obtain
|
||||||
|
machine code of a higher quality.
|
||||||
|
The optimizer transforms EM-code to better EM-code,
|
||||||
|
so it comes between the front end and the back end.
|
||||||
|
It can be used with any combination of languages
|
||||||
|
and machines, as far as they are supported by
|
||||||
|
the compiler kit.
|
||||||
|
.PP
|
||||||
|
This report describes the design of the
|
||||||
|
global optimizer and several of its
|
||||||
|
implementation issues.
|
||||||
|
Measurements can be found in.
|
||||||
|
.[
|
||||||
|
bal tanenbaum global
|
||||||
|
.]
|
3
doc/ego/intro/tail
Normal file
|
@ -0,0 +1,3 @@
|
||||||
|
.[
|
||||||
|
$LIST$
|
||||||
|
.]
|
95
doc/ego/lv/lv1
Normal file
|
@ -0,0 +1,95 @@
|
||||||
|
.bp
|
||||||
|
.NH 1
|
||||||
|
Live-Variable analysis
|
||||||
|
.NH 2
|
||||||
|
Introduction
|
||||||
|
.PP
|
||||||
|
The "Live-Variable analysis" optimization technique (LV)
|
||||||
|
performs some code improvements and computes information that may be
|
||||||
|
used by subsequent optimizations.
|
||||||
|
The main task of this phase is the
|
||||||
|
computation of \fIlive-variable information\fR.
|
||||||
|
.[~[
|
||||||
|
aho compiler design
|
||||||
|
.] section 14.4]
|
||||||
|
A variable A is said to be \fIdead\fR at some point p of the
|
||||||
|
program text, if on no path in the control flow graph
|
||||||
|
from p to a RET (return), A can be used before being changed;
|
||||||
|
else A is said to be \fIlive\fR.
|
||||||
|
.PP
|
||||||
|
A statement of the form
|
||||||
|
.DS
|
||||||
|
VARIABLE := EXPRESSION
|
||||||
|
.DE
|
||||||
|
is said to be dead if the left hand side variable is dead just after
|
||||||
|
the statement and the right hand side expression has no
|
||||||
|
side effects (i.e. it doesn't change any variable).
|
||||||
|
Such a statement can be eliminated entirely.
|
||||||
|
Dead code will seldom be present in the original program,
|
||||||
|
but it may be the result of earlier optimizations,
|
||||||
|
such as copy propagation.
|
||||||
|
.PP
|
||||||
|
Live-variable information is passed to other phases via
|
||||||
|
messages in the EM code.
|
||||||
|
Live/dead messages are generated at points in the EM text where
|
||||||
|
variables become dead or live.
|
||||||
|
This information is especially useful for the Register
|
||||||
|
Allocation phase.
|
||||||
|
.NH 2
|
||||||
|
Implementation
|
||||||
|
.PP
|
||||||
|
The implementation uses algorithm 14.6 of.
|
||||||
|
.[
|
||||||
|
aho compiler design
|
||||||
|
.]
|
||||||
|
First two sets DEF and USE are computed for every basic block b:
|
||||||
|
.IP DEF(b) 9
|
||||||
|
the set of all variables that are assigned a value in b before
|
||||||
|
being used
|
||||||
|
.IP USE(b) 9
|
||||||
|
the set of all variables that may be used in b before being changed.
|
||||||
|
.LP
|
||||||
|
(So a variable that may, but need not, be used via a procedure call or
through a pointer is included in USE; one that may, but need not, be
changed that way is excluded from DEF.)
|
||||||
|
The next step is to compute the sets IN and OUT :
|
||||||
|
.IP IN[b] 9
|
||||||
|
the set of all variables that are live at the beginning of b
|
||||||
|
.IP OUT[b] 9
|
||||||
|
the set of all variables that are live at the end of b
|
||||||
|
.LP
|
||||||
|
IN and OUT can be computed for all blocks simultaneously by solving the
|
||||||
|
data flow equations:
|
||||||
|
.DS
|
||||||
|
(1) IN[b]  = OUT[b] - DEF[b] + USE[b]
(2) OUT[b] = IN[s1] + ... + IN[sn] ;
        where SUCC[b] = {s1, ... , sn}
|
||||||
|
.DE
|
||||||
|
The equations are solved by a similar algorithm as for
|
||||||
|
the Use Definition equations (see previous chapter).
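.PP
For illustration, a minimal version of such a solver is sketched below
in C, assuming that the set of local variables fits in one machine word;
the real implementation uses longer bit vectors and different
declarations.
.DS
typedef unsigned long bitset;   /* one bit per local variable */

struct bblock {
        bitset  b_def, b_use, b_in, b_out;
        struct bblock **b_succ; /* successors in the CFG        */
        int     b_nsucc;
        struct bblock  *b_next; /* next block of the procedure  */
};

void
solve_lv(struct bblock *blocks)
{
        struct bblock *b;
        int i, changed;

        do {
                changed = 0;
                for (b = blocks; b != 0; b = b->b_next) {
                        bitset out = 0, in;

                        for (i = 0; i < b->b_nsucc; i++)
                                out |= b->b_succ[i]->b_in;   /* eq. (2) */
                        in = (out & ~b->b_def) | b->b_use;   /* eq. (1) */
                        if (in != b->b_in || out != b->b_out)
                                changed = 1;
                        b->b_in = in;
                        b->b_out = out;
                }
        } while (changed);
}
.DE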
|
||||||
|
.PP
|
||||||
|
Finally, each basic block is visited in turn to remove its dead code
|
||||||
|
and to emit the live/dead messages.
|
||||||
|
Every basic block b is traversed from its last
|
||||||
|
instruction backwards to the beginning of b.
|
||||||
|
Initially, all variables that are dead at the end
|
||||||
|
of b are marked dead. All others are marked live.
|
||||||
|
If we come across an assignment to a variable X that
|
||||||
|
was marked live, a live-message is put after the
|
||||||
|
assignment and X is marked dead;
|
||||||
|
if X was marked dead, the assignment may be removed, provided that
|
||||||
|
the right hand side expression contains no side effects.
|
||||||
|
If we come across a use of a variable X that
|
||||||
|
was marked dead, a dead-message is put after the
|
||||||
|
use and X is marked live.
|
||||||
|
So at any point, the mark of X tells whether X is
|
||||||
|
live or dead immediately before that point.
|
||||||
|
A message is also generated at the start of a basic block
|
||||||
|
for every variable that was live at the end of the (textually)
|
||||||
|
previous block, but dead at the entry of this block, or vice versa.
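.PP
The backward scan over one basic block can be sketched as follows;
the instruction-level helpers are assumptions made for this sketch,
not the actual LV routines.
.DS
/* Illustrative sketch: backward walk over one basic block. */
void
scan_block(struct bblock *b)
{
        struct instr *i;
        int x;

        init_marks_from_out_set(b);  /* dead at end of b => marked dead */
        for (i = last_instr(b); i != 0; i = prev(i)) {
                if (is_assignment(i, &x)) {
                        if (is_live(x)) {
                                put_live_msg_after(i, x);
                                mark_dead(x);
                        } else if (side_effect_free(i)) {
                                remove_assignment(i);    /* dead code */
                        }
                } else if (is_use(i, &x) && !is_live(x)) {
                        put_dead_msg_after(i, x);
                        mark_live(x);
                }
        }
}
.DE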
|
||||||
|
.PP
|
||||||
|
Only local variables are considered.
|
||||||
|
This significantly reduces the memory needed by this phase,
|
||||||
|
eases the implementation and is hardly less efficient than
|
||||||
|
considering all variables.
|
||||||
|
(Note that it is very hard to prove that an assignment to
|
||||||
|
a global variable is dead).
|
371
doc/ego/ov/ov1
Normal file
|
@ -0,0 +1,371 @@
|
||||||
|
.bp
|
||||||
|
.NH 1
|
||||||
|
Overview of the global optimizer
|
||||||
|
.NH 2
|
||||||
|
The ACK compilation process
|
||||||
|
.PP
|
||||||
|
The EM Global Optimizer is one of three optimizers that are
|
||||||
|
part of the Amsterdam Compiler Kit (ACK).
|
||||||
|
The phases of ACK are:
|
||||||
|
.IP 1.
|
||||||
|
A Front End translates a source program to EM
|
||||||
|
.IP 2.
|
||||||
|
The Peephole Optimizer
|
||||||
|
.[
|
||||||
|
tanenbaum staveren peephole toplass
|
||||||
|
.]
|
||||||
|
reads EM code and produces 'better' EM code.
|
||||||
|
It performs a number of optimizations (mostly peephole
|
||||||
|
optimizations)
|
||||||
|
such as constant folding, strength reduction and unreachable code
|
||||||
|
elimination.
|
||||||
|
.IP 3.
|
||||||
|
The Global Optimizer further improves the EM code.
|
||||||
|
.IP 4.
|
||||||
|
The Code Generator transforms EM to assembly code
|
||||||
|
of the target computer.
|
||||||
|
.IP 5.
|
||||||
|
The Target Optimizer improves the assembly code.
|
||||||
|
.IP 6.
|
||||||
|
An Assembler/Loader generates an executable file.
|
||||||
|
.LP
|
||||||
|
For a more extensive overview of the ACK compilation process,
|
||||||
|
we refer to.
|
||||||
|
.[
|
||||||
|
tanenbaum toolkit rapport
|
||||||
|
.]
|
||||||
|
.[
|
||||||
|
tanenbaum toolkit cacm
|
||||||
|
.]
|
||||||
|
.PP
|
||||||
|
The input of the Global Optimizer may consist of files and
|
||||||
|
libraries.
|
||||||
|
Every file or module in the library must contain EM code in
|
||||||
|
Compact Assembly Language format.
|
||||||
|
.[~[
|
||||||
|
tanenbaum machine architecture
|
||||||
|
.], section 11.2]
|
||||||
|
The output consists of one such EM file.
|
||||||
|
The input files and libraries together need not
|
||||||
|
constitute an entire program,
|
||||||
|
although as much of the program as possible should be supplied.
|
||||||
|
The more information about the program the optimizer
|
||||||
|
gets, the better its output code will be.
|
||||||
|
.PP
|
||||||
|
The Global Optimizer is language- and machine-independent,
|
||||||
|
i.e. it can be used for all languages and machines supported by ACK.
|
||||||
|
Yet, it puts some unavoidable restrictions on the EM code
|
||||||
|
produced by the Front End (see below).
|
||||||
|
It must have some knowledge of the target machine.
|
||||||
|
This knowledge is expressed in a machine description table
|
||||||
|
which is passed as argument to the optimizer.
|
||||||
|
This table does not contain very detailed information about the
|
||||||
|
target (such as its instruction set and addressing modes).
|
||||||
|
.NH 2
|
||||||
|
The EM code
|
||||||
|
.PP
|
||||||
|
The definition of EM, the intermediate code of all ACK compilers,
|
||||||
|
is given in a separate document.
|
||||||
|
.[
|
||||||
|
tanenbaum machine architecture
|
||||||
|
.]
|
||||||
|
We will only discuss some features of EM that are most relevant
|
||||||
|
to the Global Optimizer.
|
||||||
|
.PP
|
||||||
|
EM is the assembly code of a virtual \fIstack machine\fR.
|
||||||
|
All operations are performed on the top of the stack.
|
||||||
|
For example, the statement "A := B + 3" may be expressed in EM as:
|
||||||
|
.DS
|
||||||
|
LOL -4          -- push local variable B
LOC 3           -- push constant 3
ADI 2           -- add two 2-byte items on top of
                -- the stack and push the result
STL -2          -- pop the result into A
|
||||||
|
.DE
|
||||||
|
So EM is essentially a \fIpostfix\fR code.
|
||||||
|
.PP
|
||||||
|
EM has a rich instruction set, containing several arithmetic
|
||||||
|
and logical operators.
|
||||||
|
It also contains special-case instructions (such as INCrement).
|
||||||
|
.PP
|
||||||
|
EM has \fIglobal\fR (\fIexternal\fR) variables, accessible
|
||||||
|
by all procedures and \fIlocal\fR variables, accessible by a few
|
||||||
|
(nested) procedures.
|
||||||
|
The local variables of a lexically enclosing procedure may
|
||||||
|
be accessed via a \fIstatic link\fR.
|
||||||
|
EM has instructions to follow the static chain.
|
||||||
|
There are EM instructions to allow a procedure
|
||||||
|
to access its local variables directly (such as LOL and STL above).
|
||||||
|
Local variables are referenced via an offset in the stack frame
|
||||||
|
of the procedure, rather than by their names (e.g. -2 and -4 above).
|
||||||
|
The EM code does not contain the (source language) type
|
||||||
|
of the variables.
|
||||||
|
.PP
|
||||||
|
All structured statements in the source program are expressed in
|
||||||
|
low level jump instructions.
|
||||||
|
Besides conditional and unconditional branch instructions, there are
|
||||||
|
two case instructions (CSA and CSB),
|
||||||
|
to allow efficient translation of case statements.
|
||||||
|
.NH 2
|
||||||
|
Requirements on the EM input
|
||||||
|
.PP
|
||||||
|
As the optimizer should be useful for all languages,
|
||||||
|
it clearly should not put severe restrictions on the EM code
|
||||||
|
of the input.
|
||||||
|
There is, however, one immovable requirement:
|
||||||
|
it must be possible to determine the \fIflow of control\fR of the
|
||||||
|
input program.
|
||||||
|
As virtually all global optimizations are based on control flow information,
|
||||||
|
the optimizer would be totally powerless without it.
|
||||||
|
For this reason we restrict the usage of the case jump instructions (CSA/CSB)
|
||||||
|
of EM.
|
||||||
|
Such an instruction is always called with the address of a case descriptor
|
||||||
|
on top of the stack.
|
||||||
|
.[~[
|
||||||
|
tanenbaum machine architecture
|
||||||
|
.] section 7.4]
|
||||||
|
This descriptor contains the labels of all possible
|
||||||
|
destinations of the jump.
|
||||||
|
We demand that all case descriptors are allocated in a global
|
||||||
|
data fragment of type ROM, i.e. the case descriptors
|
||||||
|
may not be modifiable.
|
||||||
|
Furthermore, any case instruction should be immediately preceded by
|
||||||
|
a LAE (Load Address External) instruction, that loads the
|
||||||
|
address of the descriptor,
|
||||||
|
so the descriptor can be uniquely identified.
|
||||||
|
.PP
|
||||||
|
The optimizer will work improperly if the user deceives the control flow.
|
||||||
|
We will give two methods to do this.
|
||||||
|
.PP
|
||||||
|
In "C" the notorious library routines "setjmp" and "longjmp"
|
||||||
|
.[
|
||||||
|
unix programmer's manual
|
||||||
|
.]
|
||||||
|
may be used to jump out of a procedure,
|
||||||
|
but can also be used for a number of other dubious purposes,
|
||||||
|
for example, to create an extra entry point in a loop.
|
||||||
|
.DS
|
||||||
|
while (condition) {
        ....
        setjmp(buf);
        ...
}
...
longjmp(buf, 1);
|
||||||
|
.DE
|
||||||
|
The invocation to longjmp actually is a jump to the place of
|
||||||
|
the last call to setjmp with the same argument (buf).
|
||||||
|
As the calls to setjmp and longjmp are indistinguishable from
|
||||||
|
normal procedure calls, the optimizer will not see the danger.
|
||||||
|
Needless to say, several loop optimizations will behave
|
||||||
|
unexpectedly when presented with such pathological input.
|
||||||
|
.PP
|
||||||
|
Another way to deceive the flow of control is
|
||||||
|
by using exception handling routines.
|
||||||
|
Ada*
|
||||||
|
.FS
|
||||||
|
* Ada is a registered trademark of the U.S. Government
|
||||||
|
(Ada Joint Program Office).
|
||||||
|
.FE
|
||||||
|
has clearly recognized the dangers of exception handling,
|
||||||
|
but other languages (such as PL/I) have not.
|
||||||
|
.[
|
||||||
|
ada rationale
|
||||||
|
.]
|
||||||
|
.PP
|
||||||
|
The optimizer will be more effective if the EM input contains
|
||||||
|
some extra information about the source program.
|
||||||
|
Especially the \fIregister message\fR is very important.
|
||||||
|
These messages indicate which local variables may never be
|
||||||
|
accessed indirectly.
|
||||||
|
Most optimizations benefit significantly by this information.
|
||||||
|
.PP
|
||||||
|
The Inline Substitution technique needs to know how many bytes
|
||||||
|
of formal parameters every procedure accesses.
|
||||||
|
Only calls to procedures for which the EM code contains this information
|
||||||
|
will be substituted in line.
|
||||||
|
.NH 2
|
||||||
|
Structure of the optimizer
|
||||||
|
.PP
|
||||||
|
The Global Optimizer is organized as a number of \fIphases\fR,
|
||||||
|
each one performing some task.
|
||||||
|
The main structure is as follows:
|
||||||
|
.IP IC 6
|
||||||
|
the Intermediate Code construction phase transforms EM into the
|
||||||
|
intermediate code (ic) of the optimizer
|
||||||
|
.IP CF
|
||||||
|
the Control Flow phase extends the ic with control flow
|
||||||
|
information and interprocedural information
|
||||||
|
.IP OPTs
|
||||||
|
zero or more optimization phases, each one performing one or
|
||||||
|
more related optimizations
|
||||||
|
.IP CA
|
||||||
|
the Compact Assembly phase generates Compact Assembly Language EM code
|
||||||
|
out of ic.
|
||||||
|
.LP
|
||||||
|
.PP
|
||||||
|
An important issue in the design of a global optimizer is the
|
||||||
|
interaction between optimization techniques.
|
||||||
|
It is often advantageous to combine several techniques in
|
||||||
|
one algorithm that takes into account all interactions between them.
|
||||||
|
Ideally, one single algorithm should be developed that does
|
||||||
|
all optimizations simultaneously and deals with all possible interactions.
|
||||||
|
In practice, such an algorithm is still far out of reach.
|
||||||
|
Instead some rather ad hoc (albeit important) combinations are chosen,
|
||||||
|
such as Common Subexpression Elimination and Register Allocation.
|
||||||
|
.[
|
||||||
|
prabhala sethi common subexpressions
|
||||||
|
.]
|
||||||
|
.[
|
||||||
|
sethi ullman optimal code
|
||||||
|
.]
|
||||||
|
.PP
|
||||||
|
In the EM Global Optimizer there is one separate algorithm for
|
||||||
|
every technique.
|
||||||
|
Note that this does not mean that all techniques are independent
|
||||||
|
of each other.
|
||||||
|
.PP
|
||||||
|
In principle, the optimization phases can be run in any order;
|
||||||
|
a phase may even be run more than once.
|
||||||
|
However, the following rules should be obeyed:
|
||||||
|
.IP -
|
||||||
|
the Live Variable analysis phase (LV) must be run prior to
|
||||||
|
Register Allocation (RA), as RA uses information output by LV.
|
||||||
|
.IP -
|
||||||
|
RA should be the last phase; this is a consequence of the way
|
||||||
|
the interface between RA and the Code Generator is defined.
|
||||||
|
.LP
|
||||||
|
The ordering of the phases has significant impact on
|
||||||
|
the quality of the produced code.
|
||||||
|
In
|
||||||
|
.[
|
||||||
|
wulf overview production quality carnegie-mellon
|
||||||
|
.]
|
||||||
|
two kinds of phase ordering problems are distinguished.
|
||||||
|
If two techniques A and B both take away opportunities of each other,
|
||||||
|
there is a "negative" ordering problem.
|
||||||
|
If, on the other hand, both A and B introduce new optimization
|
||||||
|
opportunities for each other, the problem is called "positive".
|
||||||
|
In the Global Optimizer the following interactions must be
|
||||||
|
taken into account:
|
||||||
|
.IP -
|
||||||
|
Inline Substitution (IL) may create new opportunities for most
|
||||||
|
other techniques, so it should be run as early as possible
|
||||||
|
.IP -
|
||||||
|
Use Definition analysis (UD) may introduce opportunities for LV.
|
||||||
|
.IP -
|
||||||
|
Strength Reduction may create opportunities for UD
|
||||||
|
.LP
|
||||||
|
The optimizer has a default phase ordering, which can
|
||||||
|
be changed by the user.
|
||||||
|
.NH 2
|
||||||
|
Structure of this document
|
||||||
|
.PP
|
||||||
|
The remaining chapters of this document each describe one
|
||||||
|
phase of the optimizer.
|
||||||
|
For every phase, we describe its task, its design,
|
||||||
|
its implementation, and its source files.
|
||||||
|
The latter two sections are intended to aid the
|
||||||
|
maintenance of the optimizer and
|
||||||
|
can be skipped by the initial reader.
|
||||||
|
.NH 2
|
||||||
|
References
|
||||||
|
.PP
|
||||||
|
There are very
|
||||||
|
few modern textbooks on optimization.
|
||||||
|
Chapters 12, 13, and 14 of
|
||||||
|
.[
|
||||||
|
aho compiler design
|
||||||
|
.]
|
||||||
|
are a good introduction to the subject.
|
||||||
|
Wulf et al.
|
||||||
|
.[
|
||||||
|
wulf optimizing compiler
|
||||||
|
.]
|
||||||
|
describe one specific optimizing (Bliss) compiler.
|
||||||
|
Anklam et al.
|
||||||
|
.[
|
||||||
|
anklam vax-11
|
||||||
|
.]
|
||||||
|
discuss code generation and optimization in
|
||||||
|
compilers for one specific machine (a Vax-11).
|
||||||
|
Kirchgaesner et al.
|
||||||
|
.[
|
||||||
|
optimizing ada compiler
|
||||||
|
.]
|
||||||
|
present a brief description of many
|
||||||
|
optimizations; the report also contains a lengthy (over 60 pages)
|
||||||
|
bibliography.
|
||||||
|
.PP
|
||||||
|
The number of articles on optimization is quite impressive.
|
||||||
|
The Lowrey and Medlock paper on the Fortran H compiler
|
||||||
|
.[
|
||||||
|
object code optimization
|
||||||
|
.]
|
||||||
|
is a classical one.
|
||||||
|
Other papers on global optimization are.
|
||||||
|
.[
|
||||||
|
faiman optimizing pascal
|
||||||
|
.]
|
||||||
|
.[
|
||||||
|
perkins sites
|
||||||
|
.]
|
||||||
|
.[
|
||||||
|
harrison general purpose optimizing
|
||||||
|
.]
|
||||||
|
.[
|
||||||
|
morel partial redundancies
|
||||||
|
.]
|
||||||
|
.[
|
||||||
|
Mintz global optimizer
|
||||||
|
.]
|
||||||
|
Freudenberger
|
||||||
|
.[
|
||||||
|
freudenberger setl optimizer
|
||||||
|
.]
|
||||||
|
describes an optimizer for a Very High Level Language (SETL).
|
||||||
|
The Production-Quality Compiler-Compiler (PQCC) project uses
|
||||||
|
very sophisticated compiler techniques, as described in.
|
||||||
|
.[
|
||||||
|
wulf overview ieee
|
||||||
|
.]
|
||||||
|
.[
|
||||||
|
wulf overview carnegie-mellon
|
||||||
|
.]
|
||||||
|
.[
|
||||||
|
wulf machine-relative
|
||||||
|
.]
|
||||||
|
.PP
|
||||||
|
Several Ph.D. theses are dedicated to optimization.
|
||||||
|
Davidson
|
||||||
|
.[
|
||||||
|
davidson simplifying
|
||||||
|
.]
|
||||||
|
outlines a machine-independent peephole optimizer that
|
||||||
|
improves assembly code.
|
||||||
|
Katkus
|
||||||
|
.[
|
||||||
|
katkus
|
||||||
|
.]
|
||||||
|
describes how efficient programs can be obtained at little cost by
|
||||||
|
optimizing only a small part of a program.
|
||||||
|
Photopoulos
|
||||||
|
.[
|
||||||
|
photopoulos mixed code
|
||||||
|
.]
|
||||||
|
discusses the idea of generating interpreted intermediate code as well
|
||||||
|
as assembly code, to obtain programs that are both small and fast.
|
||||||
|
Shaffer
|
||||||
|
.[
|
||||||
|
shaffer automatic
|
||||||
|
.]
|
||||||
|
describes the theory of automatic subroutine generation.
|
||||||
|
|
||||||
|
Leverett
|
||||||
|
.[
|
||||||
|
leverett register allocation compilers
|
||||||
|
.]
|
||||||
|
deals with register allocation in the PQCC compilers.
|
||||||
|
.PP
|
||||||
|
References to articles about specific optimization techniques
|
||||||
|
will be given in later chapters.
|
33
doc/ego/ra/ra1
Normal file
|
@ -0,0 +1,33 @@
|
||||||
|
.bp
|
||||||
|
.NH 1
|
||||||
|
Register Allocation
|
||||||
|
.NH 2
|
||||||
|
Introduction
|
||||||
|
.PP
|
||||||
|
The efficient usage of the general purpose registers
|
||||||
|
of the target machine plays a key role in any optimizing compiler.
|
||||||
|
This subject, often referred to as \fIRegister Allocation\fR,
|
||||||
|
has great impact on both the code generator and the
|
||||||
|
optimizing part of such a compiler.
|
||||||
|
The code generator needs registers for at least the evaluation of
|
||||||
|
arithmetic expressions;
|
||||||
|
the optimizer uses the registers to decrease the access costs
|
||||||
|
of frequently used entities (such as variables).
|
||||||
|
The design of an optimizing compiler must pay great
|
||||||
|
attention to the cooperation of optimization, register allocation
|
||||||
|
and code generation.
|
||||||
|
.PP
|
||||||
|
Register allocation has received much attention in literature (see
|
||||||
|
.[
|
||||||
|
leverett register allocation compilers
|
||||||
|
.]
|
||||||
|
.[
|
||||||
|
chaitin register coloring
|
||||||
|
.]
|
||||||
|
.[
|
||||||
|
freiburghouse usage counts
|
||||||
|
.]
|
||||||
|
and
|
||||||
|
.[~[
|
||||||
|
sites register
|
||||||
|
.]]).
|
139
doc/ego/ra/ra2
Normal file
|
@ -0,0 +1,139 @@
|
||||||
|
.NH 2
|
||||||
|
Usage of registers in ACK compilers
|
||||||
|
.PP
|
||||||
|
We will first describe the major design decisions
|
||||||
|
of the Amsterdam Compiler Kit,
|
||||||
|
as far as they concern register allocation.
|
||||||
|
Subsequently we will outline
|
||||||
|
the role of the Global Optimizer in the register
|
||||||
|
allocation process and the interface
|
||||||
|
between the code generator and the optimizer.
|
||||||
|
.NH 3
|
||||||
|
Usage of registers without the intervention of the Global Optimizer
|
||||||
|
.PP
|
||||||
|
Registers are used for two purposes:
|
||||||
|
.IP 1.
|
||||||
|
for the evaluation of arithmetic expressions
|
||||||
|
.IP 2.
|
||||||
|
to hold local variables, for the duration of the procedure they
|
||||||
|
are local to.
|
||||||
|
.LP
|
||||||
|
It is essential to note that no translation part of the compilers,
|
||||||
|
except for the code generator, knows anything at all
|
||||||
|
about the register set of the target computer.
|
||||||
|
Hence all decisions about registers are ultimately made by
|
||||||
|
the code generator.
|
||||||
|
Earlier phases of a compiler can only \fIadvise\fR the code generator.
|
||||||
|
.PP
|
||||||
|
The code generator splits the register set into two:
|
||||||
|
a fixed part for the evaluation of expressions (called \fIscratch\fR
|
||||||
|
registers) and a fixed part to store local variables.
|
||||||
|
This partitioning, which depends only on the target computer, significantly
|
||||||
|
reduces the complexity of register allocation, at the penalty
|
||||||
|
of some loss of code quality.
|
||||||
|
.PP
|
||||||
|
The code generator has some (machine-dependent) knowledge of the access costs
|
||||||
|
of memory locations and registers and of the costs of saving and
|
||||||
|
restoring registers. (Registers are always saved by the \fIcalled\fR
|
||||||
|
procedure).
|
||||||
|
This knowledge is expressed in a set of procedures for each target machine.
|
||||||
|
The code generator also knows how many registers there are and of
|
||||||
|
which type they are.
|
||||||
|
A register can be of type \fIpointer\fR, \fIfloating point\fR
|
||||||
|
or \fIgeneral\fR.
|
||||||
|
.PP
|
||||||
|
The front ends of the compilers determine which local variables may
|
||||||
|
be put in a register;
|
||||||
|
such a variable may never be accessed indirectly (i.e. through a pointer).
|
||||||
|
The front end also determines the types and sizes of these variables.
|
||||||
|
The type can be any of the register types or the type \fIloop variable\fR,
|
||||||
|
which denotes a general-typed variable that is used as loop variable
|
||||||
|
in a for-statement.
|
||||||
|
All this information is collected in a \fIregister message\fR in
|
||||||
|
the EM code.
|
||||||
|
Such a message is a pseudo EM instruction.
|
||||||
|
This message also contains a \fIscore\fR field,
|
||||||
|
indicating how desirable it is to put this variable in a register.
|
||||||
|
A front end may assign a high score to a variable if it
|
||||||
|
was declared as a register variable (which is only possible in
|
||||||
|
some languages, such as "C").
|
||||||
|
Any compiler phase before the code generator may change this score field,
|
||||||
|
if it has reason to do so.
|
||||||
|
The code generator bases its decisions on the information contained
|
||||||
|
in the register message, most notably on the score.
|
||||||
|
.PP
|
||||||
|
If the global optimizer is not used,
|
||||||
|
the score fields are set by the Peephole Optimizer.
|
||||||
|
This optimizer simply counts the number of occurrences
|
||||||
|
of every local (register) variable and adds this count
|
||||||
|
to the score provided by the front end.
|
||||||
|
In this way a simple, yet quite effective
|
||||||
|
register allocation scheme is achieved.
|
||||||
|
.NH 3
|
||||||
|
The role of the Global Optimizer
|
||||||
|
.PP
|
||||||
|
The Global Optimizer essentially tries to improve the scheme
|
||||||
|
outlined above.
|
||||||
|
It uses the following principles for this purpose:
|
||||||
|
.IP -
|
||||||
|
Entities are not always assigned a register for the duration
|
||||||
|
of an entire procedure; smaller regions of the program text
|
||||||
|
may be considered too.
|
||||||
|
.IP -
|
||||||
|
several variables may be put in the same register simultaneously,
|
||||||
|
provided at most one of them is live at any point.
|
||||||
|
.IP -
|
||||||
|
besides local variables, other entities (such as constants and addresses of
|
||||||
|
variables and procedures) may be put in a register.
|
||||||
|
.IP -
|
||||||
|
more accurate cost estimates are used.
|
||||||
|
.LP
|
||||||
|
To perform its task, the optimizer must have some
|
||||||
|
knowledge of the target machine.
|
||||||
|
.NH 3
|
||||||
|
The interface between the register allocator and the code generator
|
||||||
|
.PP
|
||||||
|
The RA phase of the optimizer must somehow be able to express its
|
||||||
|
decisions.
|
||||||
|
Such decisions may look like: 'put constant 1283 in a register from
|
||||||
|
line 12 to line 40'.
|
||||||
|
To be precise, RA must be able to tell the code generator to:
|
||||||
|
.IP -
|
||||||
|
initialize a register with some value
|
||||||
|
.IP -
|
||||||
|
update an entity from a register
|
||||||
|
.IP -
|
||||||
|
replace all occurrences of an entity in a certain region
|
||||||
|
of text by a reference to the register.
|
||||||
|
.LP
|
||||||
|
At least three problems occur here: the code generator is only used to
|
||||||
|
put local variables in registers,
|
||||||
|
it only assigns a register to a variable for the duration of an entire
|
||||||
|
procedure, and it is not designed to let some earlier compiler phase
make all the decisions.
|
||||||
|
.PP
|
||||||
|
All problems are solved by one mechanism, that involves no changes
|
||||||
|
to the code generator.
|
||||||
|
With every (non-scratch) register R that will be used in
|
||||||
|
a procedure P, we associate a new variable T, local to P.
|
||||||
|
The size of T is the same as the size of R.
|
||||||
|
A register message is generated for T with an exceptionally high score.
|
||||||
|
The scores of all original register messages are set to zero.
|
||||||
|
Consequently, the code generator will always assign precisely those new
|
||||||
|
variables to a register.
|
||||||
|
If the optimizer wants to put some entity, say the constant 1283, in
|
||||||
|
a register, it emits the code "T := 1283" and replaces all occurrences
|
||||||
|
of '1283' by T.
|
||||||
|
Similarly, it can put the address of a procedure in T and replace all
|
||||||
|
calls to that procedure by indirect calls.
|
||||||
|
Furthermore, it can put several different entities in T (and thus in R)
|
||||||
|
during the lifetime of P.
|
||||||
|
.PP
|
||||||
|
In principle, the code generated by the optimizer in this way would
|
||||||
|
always be valid EM code, even if the optimizer would be presented
|
||||||
|
a totally wrong description of the target computer register set.
|
||||||
|
In practice, it would be a waste of data as well as text space to
|
||||||
|
allocate memory for these new variables, as they will always be assigned
|
||||||
|
a register (in the correct order of events).
|
||||||
|
Hence, no memory locations are allocated for them.
|
||||||
|
For this reason they are called pseudo local variables.
|
383
doc/ego/ra/ra3
Normal file
|
@ -0,0 +1,383 @@
|
||||||
|
.NH 2
|
||||||
|
The register allocation phase
|
||||||
|
.PP
|
||||||
|
.NH 3
|
||||||
|
Overview
|
||||||
|
.PP
|
||||||
|
The RA phase deals with one procedure at a time.
|
||||||
|
For every procedure, it first determines which entities
|
||||||
|
may be put in a register. Such an entity
|
||||||
|
is called an \fIitem\fR.
|
||||||
|
For every item it decides during which parts of the procedure it
|
||||||
|
might be assigned a register.
|
||||||
|
Such a region is called a \fItimespan\fR.
|
||||||
|
For any item, several (possibly overlapping) timespans may
|
||||||
|
be considered.
|
||||||
|
A pair (item,timespan) is called an \fIallocation\fR.
|
||||||
|
If the items of two allocations are both live at some
|
||||||
|
point of time in the intersections of their timespans,
|
||||||
|
these allocations are said to be \fIrivals\fR of each other,
|
||||||
|
as they cannot be assigned the same register.
|
||||||
|
The rivals-set of every allocation is computed.
|
||||||
|
Next, the gains of assigning a register to an allocation are estimated,
|
||||||
|
for every allocation.
|
||||||
|
With all this information, decisions are made which allocations
|
||||||
|
to store in which registers (\fIpacking\fR).
|
||||||
|
Finally, the EM text is transformed to reflect these decisions.
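.PP
The allocation list built up during these subphases may be pictured as
follows; the declarations are only a sketch with invented names.
.DS
struct item; struct timespan; struct alloc_list;  /* details omitted */

/* Hypothetical sketch of the central RA data structure. */
struct allocation {
        struct item     *al_item;     /* local variable, address or constant */
        struct timespan *al_span;     /* part of the procedure text          */
        struct alloc_list *al_rivals; /* allocations busy at the same time;
                                       * they cannot share a register        */
        int              al_profit;   /* estimated gain of using a register  */
        int              al_reg;      /* register chosen by the packing      */
};
.DE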
|
||||||
|
.NH 3
|
||||||
|
The item recognition subphase
|
||||||
|
.PP
|
||||||
|
RA tries to put the following entities in a register:
|
||||||
|
.IP -
|
||||||
|
a local variable for which a register message was found
|
||||||
|
.IP -
|
||||||
|
the address of a local variable for which no
|
||||||
|
register message was found
|
||||||
|
.IP -
|
||||||
|
the address of a global variable
|
||||||
|
.IP -
|
||||||
|
the address of a procedure
|
||||||
|
.IP -
|
||||||
|
a numeric constant.
|
||||||
|
.LP
|
||||||
|
Only the \fIaddress\fR of a global variable
|
||||||
|
may be put in a register, not the variable itself.
|
||||||
|
This approach avoids the very complex problems that would be
|
||||||
|
caused by procedure calls and indirect pointer references (see
|
||||||
|
.[~[
|
||||||
|
aho design compiler
|
||||||
|
.] sections 14.7 and 14.8]
|
||||||
|
and
|
||||||
|
.[~[
|
||||||
|
spillman side-effects
|
||||||
|
.]]).
|
||||||
|
Still, on most machines accessing a global variable using indirect
|
||||||
|
addressing through a register is much cheaper than
|
||||||
|
accessing it via its address.
|
||||||
|
Similarly, if the address of a procedure is put in a register, the
|
||||||
|
procedure can be called via an indirect call.
|
||||||
|
.PP
|
||||||
|
With every item we associate a register type.
|
||||||
|
This type is
|
||||||
|
.DS
|
||||||
|
for local variables: the type contained in the register message
|
||||||
|
for addresses of variables and procedures: the pointer type
|
||||||
|
for constants: the general type
|
||||||
|
.DE
|
||||||
|
An entity other than a local variable is not taken to be an item
|
||||||
|
if it is used only once within the current procedure.
|
||||||
|
.PP
|
||||||
|
An item is said to be \fIlive\fR at some point of the program text
|
||||||
|
if its value may be used before it is changed.
|
||||||
|
As addresses and constants are never changed, all items but local
|
||||||
|
variables are always live.
|
||||||
|
The region of text during which a local variable is live is
|
||||||
|
determined via the live/dead messages generated by the
|
||||||
|
Live Variable analysis phase of the Global Optimizer.
|
||||||
|
.NH 3
|
||||||
|
The allocation determination subphase
|
||||||
|
.PP
|
||||||
|
If a procedure has more items than registers,
|
||||||
|
it may be advantageous to put an item in a register
|
||||||
|
only during those parts of the procedure where it is most
|
||||||
|
heavily used.
|
||||||
|
Such a part will be called a timespan.
|
||||||
|
With every item we may associate a set of timespans.
|
||||||
|
If two timespans of an item overlap,
|
||||||
|
at most one of them may be granted a register,
|
||||||
|
as there is no use in putting the same item in two
|
||||||
|
registers simultaneously.
|
||||||
|
If two timespans of an item are distinct,
|
||||||
|
both may be chosen;
|
||||||
|
the item will possibly be put in two
|
||||||
|
different registers during different parts of the procedure.
|
||||||
|
The timespan may also consist
|
||||||
|
of the whole procedure.
|
||||||
|
.PP
|
||||||
|
A list of (item,timespan) pairs (allocations)
|
||||||
|
is built, which will be the input to the decision-making
|
||||||
|
subphase of RA (packing subphase).
|
||||||
|
This allocation list is the main data structure of RA.
|
||||||
|
The description of the remainder of RA will be in terms
|
||||||
|
of allocations rather than items.
|
||||||
|
The phrase "to assign a register to an allocation" means "to assign
|
||||||
|
a register to the item of the allocation for the duration of
|
||||||
|
the timespan of the allocation".
|
||||||
|
Subsequent subphases will add more information
|
||||||
|
to this list.
|
||||||
|
.PP
|
||||||
|
Several factors must be taken into account when a
|
||||||
|
timespan for an item is constructed:
|
||||||
|
.IP 1.
|
||||||
|
At any \fIentry point\fR of the timespan where the
|
||||||
|
item is live,
|
||||||
|
the register must be initialized with the item
|
||||||
|
.IP 2.
|
||||||
|
At any exit point of the timespan where the item is live,
|
||||||
|
the item must be updated.
|
||||||
|
.LP
|
||||||
|
In order to decrease these costs, we will only consider timespans with
|
||||||
|
one entry point
|
||||||
|
and no live exit points.
|
||||||
|
.NH 3
|
||||||
|
The rivals computation subphase
|
||||||
|
.PP
|
||||||
|
As stated before, several different items may be put in the
|
||||||
|
same register, provided they are not live simultaneously.
|
||||||
|
For every allocation we determine the intersection
|
||||||
|
of its timespan and the lifetime of its item (i.e. the part of the
|
||||||
|
procedure during which the item is live).
|
||||||
|
The allocation is said to be busy during this intersection.
|
||||||
|
If two allocations are ever busy simultaneously they are
|
||||||
|
said to be rivals of each other.
|
||||||
|
The rivals information is added to the allocation list.
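.PP
A minimal sketch of this subphase is given below.
It assumes that the busy time of every allocation has already been
flattened into a bitvector over program points;
the names and the fixed-size arrays are assumptions of the sketch,
not the interface of the alloclist package.
.DS
#define NPOINTS 1024	/* max. number of program points (assumption) */

typedef unsigned char bitset[NPOINTS / 8];

struct alloc {
	bitset	al_busy;		/* points where the allocation is busy */
	struct alloc *al_rival[32];	/* its rivals (at most 32 assumed) */
	int	al_nrival;
	struct alloc *al_next;
};

static int overlap(bitset x, bitset y)
{
	int i;

	for (i = 0; i < NPOINTS / 8; i++)
		if (x[i] & y[i]) return 1;
	return 0;
}

void compute_rivals(struct alloc *list)
{
	struct alloc *a, *b;

	for (a = list; a != 0; a = a->al_next)
		for (b = a->al_next; b != 0; b = b->al_next)
			if (overlap(a->al_busy, b->al_busy)) {
				a->al_rival[a->al_nrival++] = b;
				b->al_rival[b->al_nrival++] = a;
			}
}
.DE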
|
||||||
|
.NH 3
|
||||||
|
The profits computation subphase
|
||||||
|
.PP
|
||||||
|
To make good decisions, the packing subphase needs to
|
||||||
|
know which allocations can be assigned the same register
|
||||||
|
(rivals information) and how much is gained by
|
||||||
|
granting an allocation a register.
|
||||||
|
.PP
|
||||||
|
Besides the gains of using a register instead of an
|
||||||
|
item,
|
||||||
|
two kinds of overhead costs must be
|
||||||
|
taken into account:
|
||||||
|
.IP -
|
||||||
|
the register must be initialized with the item
|
||||||
|
.IP -
|
||||||
|
the register must be saved at procedure entry
|
||||||
|
and restored at procedure exit.
|
||||||
|
.LP
|
||||||
|
The latter costs should not be due to a single
|
||||||
|
allocation, as several allocations can be assigned the same register.
|
||||||
|
These costs are dealt with after packing has been done.
|
||||||
|
They do not influence the decisions of the packing algorithm,
|
||||||
|
they may only undo them.
|
||||||
|
.PP
|
||||||
|
The actual profits consist of improvements
|
||||||
|
of execution time and code size.
|
||||||
|
As the former is far more difficult to estimate, we will
|
||||||
|
discuss code size improvements first.
|
||||||
|
.PP
|
||||||
|
The gains of putting a certain item in a register
|
||||||
|
depend on how the item is used.
|
||||||
|
Suppose the item is
|
||||||
|
a pointer variable.
|
||||||
|
On machines that do not have a
|
||||||
|
double-indirect addressing mode,
|
||||||
|
two instructions are needed to dereference the variable
|
||||||
|
if it is not in a register, but only one if it is put in a register.
|
||||||
|
If the variable is not dereferenced, but simply copied, one instruction
|
||||||
|
may be sufficient in both cases.
|
||||||
|
So the gains of putting a pointer variable in a register are higher
|
||||||
|
if the variable is dereferenced often.
|
||||||
|
.PP
|
||||||
|
To make accurate estimates, detailed knowledge of
|
||||||
|
the target machine and of the code generator
|
||||||
|
would be needed.
|
||||||
|
Therefore, a simplification has been made that substantially limits
|
||||||
|
the amount of target machine information that is needed.
|
||||||
|
The estimation of the number of bytes saved does
|
||||||
|
not take into account how an item is used.
|
||||||
|
Rather, an average number is used.
|
||||||
|
So these gains are computed as follows:
|
||||||
|
.DS
|
||||||
|
#bytes_saved = #occurrences * gains_per_occurrence
|
||||||
|
.DE
|
||||||
|
The number of occurrences is derived from
|
||||||
|
the EM code.
|
||||||
|
Note that this is not exact either,
|
||||||
|
as there is no one-to-one correspondence between occurrences in
|
||||||
|
the EM code and in the assembler code.
|
||||||
|
.PP
|
||||||
|
The gains of one occurrence depend on:
|
||||||
|
.IP 1.
|
||||||
|
the type of the item
|
||||||
|
.IP 2.
|
||||||
|
the size of the item
|
||||||
|
.IP 3.
|
||||||
|
the type of the register
|
||||||
|
.LP
|
||||||
|
and for local variables and addresses of local variables:
|
||||||
|
.IP 4.
|
||||||
|
the type of the local variable
|
||||||
|
.IP 5.
|
||||||
|
the offset of the variable in the stackframe
|
||||||
|
.LP
|
||||||
|
For every allocation we try two types of registers: the register type
|
||||||
|
of the item and the general register type.
|
||||||
|
Only the type with the highest profits will subsequently be used.
|
||||||
|
This type is added to the allocation information.
|
||||||
|
.PP
|
||||||
|
To compute the gains, RA uses a machine-dependent table
|
||||||
|
that is read from a machine descriptor file.
|
||||||
|
By means of this table the number of bytes saved can be computed
|
||||||
|
as a function of the five properties.
|
||||||
|
.PP
|
||||||
|
The costs of initializing a register with an item
|
||||||
|
are determined in a similar way.
|
||||||
|
The cost of one initialization is also
|
||||||
|
obtained from the descriptor file.
|
||||||
|
Note that there can be at most one initialization for any
|
||||||
|
allocation.
|
||||||
|
.PP
|
||||||
|
To summarize, the number of bytes a certain allocation would
|
||||||
|
save is computed as follows:
|
||||||
|
.DS
|
||||||
|
net_bytes_saved = bytes_saved - init_cost
|
||||||
|
bytes_saved = #occurrences * gains_per_occ
|
||||||
|
init_cost = #initializations * costs_per_init
|
||||||
|
.DE
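.PP
In C this computation amounts to the small function sketched below;
gains_per_occ and costs_per_init stand for the numbers obtained from
the machine descriptor file and are parameters of the sketch only.
.DS
/* Illustrative computation of the code size profits of one allocation. */
int net_bytes_saved(int occurrences, int initializations,
		    int gains_per_occ, int costs_per_init)
{
	int bytes_saved = occurrences * gains_per_occ;
	int init_cost   = initializations * costs_per_init;

	return bytes_saved - init_cost;
}
.DE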
|
||||||
|
.PP
|
||||||
|
It is inherently more difficult to estimate the execution
|
||||||
|
time saved by putting an item in a register,
|
||||||
|
because it is impossible to predict how
|
||||||
|
many times an item will be used dynamically.
|
||||||
|
If an occurrence is part of a loop,
|
||||||
|
it may be executed many times.
|
||||||
|
If it is part of a conditional statement,
|
||||||
|
it may never be executed at all.
|
||||||
|
In the latter case, the speed of the program may even get
|
||||||
|
worse if an initialization is needed.
|
||||||
|
As a clear example, consider the piece of "C" code in Fig. 13.1.
|
||||||
|
.DS
|
||||||
|
switch(expr) {
|
||||||
|
case 1: p(); break;
|
||||||
|
case 2: p(); p(); break;
|
||||||
|
case 3: p(); break;
|
||||||
|
default: break;
|
||||||
|
}
|
||||||
|
|
||||||
|
Fig. 13.1 A "C" switch statement
|
||||||
|
.DE
|
||||||
|
Lots of bytes may be saved by putting the address of procedure p
|
||||||
|
in a register, as p is called four times (statically).
|
||||||
|
Dynamically, p will be called zero, one or two times,
|
||||||
|
depending on the value of the expression.
|
||||||
|
.PP
|
||||||
|
The optimizer uses the following strategy for optimizing
|
||||||
|
execution time:
|
||||||
|
.IP 1.
|
||||||
|
try to put items in registers during \fIloops\fR first
|
||||||
|
.IP 2.
|
||||||
|
always keep the initializing code outside the loop
|
||||||
|
.IP 3.
|
||||||
|
if an item is not used in a loop, do not put it in a register if
|
||||||
|
the initialization costs may be higher than the gains
|
||||||
|
.LP
|
||||||
|
The latter condition can be checked by determining the
|
||||||
|
minimal number of usages (dynamically) of the item during the procedure,
|
||||||
|
via a shortest path algorithm.
|
||||||
|
In the example above, this minimal number is zero, so the address of
|
||||||
|
p is not put in a register.
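.PP
The sketch below illustrates one way to make such a guess.
The weight attached to one loop nesting level is an assumed constant,
and min_usage stands for the minimal dynamic number of usages as
computed by the shortest path algorithm.
.DS
#define LOOP_WEIGHT 8	/* assumed number of iterations of a loop */

long dynamic_occurrences(int nesting_level[], int n, long min_usage)
{
	long total = 0;
	int i, lev;

	for (i = 0; i < n; i++) {
		long w = 1;

		for (lev = 0; lev < nesting_level[i]; lev++)
			w *= LOOP_WEIGHT;	/* weigh occurrences in loops */
		total += w;
	}
	if (total == n)		/* no occurrence is inside a loop */
		return min_usage;
	return total;
}
.DE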
|
||||||
|
.PP
|
||||||
|
The cost of one occurrence is estimated as described above for the
|
||||||
|
code size.
|
||||||
|
The number of dynamic occurrences is guessed by looking at the
|
||||||
|
loop nesting level of every occurrence.
|
||||||
|
If the item is never used in a loop,
|
||||||
|
the minimal number of occurrences is used.
|
||||||
|
From these facts, the execution time improvement is assessed
|
||||||
|
for every allocation.
|
||||||
|
.NH 3
|
||||||
|
The packing subphase
|
||||||
|
.PP
|
||||||
|
The packing subphase takes as input the allocation
|
||||||
|
list and outputs a
|
||||||
|
description of which allocations should be put
|
||||||
|
in which registers.
|
||||||
|
So it is essentially the decision making part of RA.
|
||||||
|
.PP
|
||||||
|
The packing system tries to assign a register to allocations one
|
||||||
|
at a time, in some yet to be defined order.
|
||||||
|
For every allocation A, it first checks if there is a register
|
||||||
|
(of the right type)
|
||||||
|
that is already assigned to one or more allocations,
|
||||||
|
none of which are rivals of A.
|
||||||
|
In this case A is assigned the same register.
|
||||||
|
Else, A is assigned a new register, if one exists.
|
||||||
|
A table containing the number of free registers for every type
|
||||||
|
is maintained.
|
||||||
|
It is initialized with the number of non-scratch registers of
|
||||||
|
the target computer and updated whenever a
|
||||||
|
new register is handed out.
|
||||||
|
The packing algorithm stops when no more allocations can
|
||||||
|
or need be assigned a register.
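.PP
The core of the packing loop could be sketched as follows.
Register types and the removal of packed allocations are left out,
the allocations are assumed to be numbered and ordered already, and
the rivalry test is taken for granted;
this is a sketch, not the actual code of the pack package.
.DS
#define NREG	8	/* number of free non-scratch registers (assumption) */
#define MAXSHARE 64	/* max. allocations sharing one register (assumption) */

extern int is_rival(int a, int b);	/* 1 if allocations a and b are rivals */

void pack(int nalloc, int assigned_reg[])
{
	int user[NREG][MAXSHARE], nuser[NREG];
	int a, r, i, ok;

	for (r = 0; r < NREG; r++) nuser[r] = 0;
	for (a = 0; a < nalloc; a++) {
		assigned_reg[a] = -1;
		for (r = 0; r < NREG && assigned_reg[a] < 0; r++) {
			ok = 1;		/* may a share register r? */
			for (i = 0; i < nuser[r]; i++)
				if (is_rival(a, user[r][i])) {
					ok = 0;
					break;
				}
			if (ok) {	/* r is new or holds no rivals of a */
				user[r][nuser[r]++] = a;
				assigned_reg[a] = r;
			}
		}
	}
}
.DE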
|
||||||
|
.PP
|
||||||
|
After an allocation A has been packed,
|
||||||
|
all allocations with non-disjoint timespans (including
|
||||||
|
A itself) are removed from the allocation list.
|
||||||
|
.PP
|
||||||
|
In case the number of items exceeds the number of registers, it
|
||||||
|
is important to choose the most profitable allocations.
|
||||||
|
Due to the possibility of having several allocations
|
||||||
|
occupying the same register,
|
||||||
|
this problem is quite complex.
|
||||||
|
Our packing algorithm uses simple heuristic rules
|
||||||
|
and avoids any combinatorial search.
|
||||||
|
It has distinct rules for the different cost measures.
|
||||||
|
.PP
|
||||||
|
If object code size is the most important factor,
|
||||||
|
the algorithm is greedy and chooses allocations in
|
||||||
|
decreasing order of their profits attribute.
|
||||||
|
It does not take into account the fact that
|
||||||
|
other allocations may be passed over because of
|
||||||
|
this decision.
|
||||||
|
.PP
|
||||||
|
If execution time is the prime concern, the algorithm
|
||||||
|
first considers allocations whose timespans consist of loops.
|
||||||
|
After all these have been packed, it considers the remaining
|
||||||
|
allocations.
|
||||||
|
Within the two subclasses, it considers allocations
|
||||||
|
with the highest profits first.
|
||||||
|
When assigning a register to an allocation with a loop
|
||||||
|
as timespan, the algorithm checks if the item has
|
||||||
|
already been put in a register during another loop.
|
||||||
|
If so, it tries to use the same register for the
|
||||||
|
new allocation.
|
||||||
|
After all packing has been done,
|
||||||
|
it checks if the item has always been assigned the same
|
||||||
|
register (although not necessarily during all loops).
|
||||||
|
If so, it tries to put the item in that register during
|
||||||
|
the entire procedure. This is possible
|
||||||
|
if the allocation (item,whole_procedure) is not a rival
|
||||||
|
of any allocation with a different item that has been
|
||||||
|
assigned to the same register.
|
||||||
|
Note that this approach is essentially 'bottom up',
|
||||||
|
as registers are first assigned over small regions
|
||||||
|
of text which are later collapsed into larger regions.
|
||||||
|
The advantage of this approach is the fact that
|
||||||
|
the decisions for one loop can be made independently
|
||||||
|
of all other loops.
|
||||||
|
.PP
|
||||||
|
After the entire packing process has been completed,
|
||||||
|
we compute for each register how much is gained in using
|
||||||
|
this register, by simply adding the net profits
|
||||||
|
of all allocations assigned to it.
|
||||||
|
This total yield should outweigh the costs of
|
||||||
|
saving/restoring the register at procedure entry/exit.
|
||||||
|
As most modern processors (e.g. 68000, Vax) have special
|
||||||
|
instructions to save/restore several registers,
|
||||||
|
the differential costs of saving one extra register are by
|
||||||
|
no means constant.
|
||||||
|
The costs are read from the machine descriptor file and
|
||||||
|
compared to the total yields of the registers.
|
||||||
|
As a consequence of this analysis, some allocations
|
||||||
|
may have their registers taken away.
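.PP
A sketch of this final check is given below.
save_cost stands for the cumulative costs of saving and restoring a
given number of registers as read from the descriptor file, and the
registers are assumed to be sorted on decreasing total yield.
.DS
extern long save_cost(int nregs);	/* costs of saving/restoring nregs */

int profitable_registers(long yield[], int nregs)
{
	int n;

	for (n = nregs; n > 0; n--) {
		/* differential costs of the n-th register */
		if (yield[n - 1] >= save_cost(n) - save_cost(n - 1))
			break;		/* this register still pays off */
	}
	return n;	/* allocations in the remaining registers lose them */
}
.DE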
|
||||||
|
.NH 3
|
||||||
|
The transformation subphase
|
||||||
|
.PP
|
||||||
|
The final subphase of RA transforms the EM text according to the
|
||||||
|
decisions made by the packing system.
|
||||||
|
It traverses the text of the currently optimized procedure and
|
||||||
|
changes all occurrences of items at points where
|
||||||
|
they are assigned a register.
|
||||||
|
It also clears the score field of the register messages for
|
||||||
|
normal local variables and emits register messages with a very
|
||||||
|
high score for the pseudo locals.
|
||||||
|
At points where registers have to be initialized with items,
|
||||||
|
it generates EM code to do so.
|
||||||
|
Finally it tries to decrease the size of the stackframe
|
||||||
|
of the procedure by looking at which local variables need not
|
||||||
|
be given memory locations.
|
28
doc/ego/ra/ra4
Normal file
|
@ -0,0 +1,28 @@
|
||||||
|
.NH 2
|
||||||
|
Source files of RA
|
||||||
|
.PP
|
||||||
|
The sources of RA are in the following files and packages:
|
||||||
|
.IP ra.h: 14
|
||||||
|
declarations of global variables and data structures
|
||||||
|
.IP ra.c:
|
||||||
|
the routine main; initialization of target machine-dependent tables
|
||||||
|
.IP items:
|
||||||
|
a routine to build the list of items of one procedure;
|
||||||
|
routines to manipulate items
|
||||||
|
.IP lifetime:
|
||||||
|
contains a subroutine that determines when items are live/dead
|
||||||
|
.IP alloclist:
|
||||||
|
contains subroutines that build the initial allocations list
|
||||||
|
and that compute the rivals sets.
|
||||||
|
.IP profits:
|
||||||
|
contains a subroutine that computes the profits of the allocations
|
||||||
|
and a routine that determines the costs of saving/restoring registers
|
||||||
|
.IP pack:
|
||||||
|
contains the packing subphase
|
||||||
|
.IP xform:
|
||||||
|
contains the transformation subphase
|
||||||
|
.IP interval:
|
||||||
|
contains routines to manipulate intervals of time
|
||||||
|
.IP aux:
|
||||||
|
contains auxiliary routines
|
||||||
|
.LP
|
171
doc/ego/sp/sp1
Normal file
|
@ -0,0 +1,171 @@
|
||||||
|
.bp
|
||||||
|
.NH 1
|
||||||
|
Stack pollution
|
||||||
|
.NH 2
|
||||||
|
Introduction
|
||||||
|
.PP
|
||||||
|
The "Stack Pollution" optimization technique (SP) decreases the costs
|
||||||
|
(time as well as space) of procedure calls.
|
||||||
|
In the EM calling sequence, the actual parameters are popped from
|
||||||
|
the stack by the \fIcalling\fR procedure.
|
||||||
|
The ASP (Adjust Stack Pointer) instruction is used for this purpose.
|
||||||
|
A call in EM is shown in Fig. 8.1
|
||||||
|
.DS
|
||||||
|
Pascal: EM:
|
||||||
|
|
||||||
|
f(a,2) LOC 2
|
||||||
|
LOE A
|
||||||
|
CAL F
|
||||||
|
ASP 4 -- pop 4 bytes
|
||||||
|
|
||||||
|
Fig. 8.1 An example procedure call in Pascal and EM
|
||||||
|
.DE
|
||||||
|
As procedure calls occur often in most programs,
|
||||||
|
the ASP is one of the most frequently used EM instructions.
|
||||||
|
.PP
|
||||||
|
The main intention of removing the actual parameters after a procedure call
|
||||||
|
is to prevent the stack from growing rapidly.
|
||||||
|
Yet, in some cases, it is possible to \fIdelay\fR or even \fIavoid\fR the
|
||||||
|
removal of the parameters without letting the stack grow
|
||||||
|
significantly.
|
||||||
|
In this way, considerable savings in code size and execution time may
|
||||||
|
be achieved, at the cost of a slightly increased stack size.
|
||||||
|
.PP
|
||||||
|
A stack adjustment may be delayed if there is some other stack adjustment
|
||||||
|
later on in the same basic block.
|
||||||
|
The two ASPs can be combined into one.
|
||||||
|
.DS
|
||||||
|
Pascal: EM: optimized EM:
|
||||||
|
|
||||||
|
f(a,2) LOC 2 LOC 2
|
||||||
|
g(3,b,c) LOE A LOE A
|
||||||
|
CAL F CAL F
|
||||||
|
ASP 4 LOE C
|
||||||
|
LOE C LOE B
|
||||||
|
LOE B LOC 3
|
||||||
|
LOC 3 CAL G
|
||||||
|
CAL G ASP 10
|
||||||
|
ASP 6
|
||||||
|
|
||||||
|
Fig. 8.2 An example of local Stack Pollution
|
||||||
|
.DE
|
||||||
|
The stack size will be increased only temporarily.
|
||||||
|
If the basic block contains another ASP, the ASP 10 may subsequently be
|
||||||
|
combined with that next ASP, and so on.
|
||||||
|
.PP
|
||||||
|
For some back ends, a stack adjustment also takes place
|
||||||
|
at the point of a procedure return.
|
||||||
|
There is no need to specify the number of bytes to be popped at a
|
||||||
|
return.
|
||||||
|
This provides an opportunity to remove ASPs more globally.
|
||||||
|
If all ASPs outside any loop are removed, the increase of the
|
||||||
|
stack size will still only be small, as no such ASP is executed more
|
||||||
|
than once without an intervening return from the procedure it is part of.
|
||||||
|
.PP
|
||||||
|
This second approach is not generally applicable to all target machines,
|
||||||
|
as some back ends require the stack to be cleaned up at the point of
|
||||||
|
a procedure return.
|
||||||
|
.NH 2
|
||||||
|
Implementation
|
||||||
|
.PP
|
||||||
|
There is one main problem the implementation has to solve.
|
||||||
|
In EM, the stack is not only used for passing parameters,
|
||||||
|
but also for evaluating expressions.
|
||||||
|
Hence, ASP instructions can only be combined or removed
|
||||||
|
if certain conditions are satisfied.
|
||||||
|
.PP
|
||||||
|
Two consecutive ASPs of one basic block can only be combined
|
||||||
|
(as described above) if:
|
||||||
|
.IP 1.
|
||||||
|
At no point in the text between the two ASPs is an item popped from
|
||||||
|
the stack that was pushed onto it before the first ASP.
|
||||||
|
.IP 2.
|
||||||
|
The number of bytes popped from the stack by the second ASP must equal
|
||||||
|
the number of bytes pushed since the first ASP.
|
||||||
|
.LP
|
||||||
|
Condition 1. is not satisfied in Fig. 8.3.
|
||||||
|
.DS
|
||||||
|
Pascal: EM:
|
||||||
|
|
||||||
|
5 + f(10) + g(30) LOC 5
|
||||||
|
LOC 10
|
||||||
|
CAL F
|
||||||
|
ASP 2 -- cannot be removed
|
||||||
|
LFR 2 -- push function result
|
||||||
|
ADI 2
|
||||||
|
LOC 30
|
||||||
|
CAL G
|
||||||
|
ASP 2
|
||||||
|
LFR 2
|
||||||
|
ADI 2
|
||||||
|
Fig. 8.3 An illegal transformation
|
||||||
|
.DE
|
||||||
|
If the first ASP were removed (delayed), the first ADI would add
|
||||||
|
10 and f(10), instead of 5 and f(10).
|
||||||
|
.sp
|
||||||
|
Condition 2. is not satisfied in Fig. 8.4.
|
||||||
|
.DS
|
||||||
|
Pascal: EM:
|
||||||
|
|
||||||
|
f(10) + 5 * g(30) LOC 10
|
||||||
|
CAL F
|
||||||
|
ASP 2
|
||||||
|
LFR 2
|
||||||
|
LOC 5
|
||||||
|
LOC 30
|
||||||
|
CAL G
|
||||||
|
ASP 2
|
||||||
|
LFR 2
|
||||||
|
MLI 2 -- 5 * g(30)
|
||||||
|
ADI 2
|
||||||
|
|
||||||
|
Fig. 8.4 A second illegal transformation
|
||||||
|
.DE
|
||||||
|
If the two ASPs were combined into one 'ASP 4', the constant 5 would
|
||||||
|
have been popped, rather than the parameter 10 (so '10 + f(10)*g(30)'
|
||||||
|
would have been computed).
|
||||||
|
.PP
|
||||||
|
The second approach to deleting ASPs (i.e. let the procedure return
|
||||||
|
do the stack clean-up)
|
||||||
|
is only applied to the last ASP of every basic block.
|
||||||
|
Any preceding ASPs are dealt with by the first approach.
|
||||||
|
The last ASP of a basic block B will only be removed if:
|
||||||
|
.IP -
|
||||||
|
on no path in the control flow graph from B to any block containing a
|
||||||
|
RET (return) there is a basic block that, at some point of its text, pops
|
||||||
|
items from the stack that it has not itself pushed earlier.
|
||||||
|
.LP
|
||||||
|
Clearly, if this condition is satisfied, no harm can be done; no
|
||||||
|
other basic block will ever access items that were pushed
|
||||||
|
on the stack before the ASP.
|
||||||
|
.PP
|
||||||
|
The number of bytes pushed onto or popped from the stack can be
|
||||||
|
easily encoded in a so-called "pop-push table".
|
||||||
|
The numbers in general depend on the target machine word- and pointer
|
||||||
|
size and on the argument given to the instruction.
|
||||||
|
For example, an ADS instruction is described by:
|
||||||
|
.DS
|
||||||
|
-a-p+p
|
||||||
|
.DE
|
||||||
|
which means: an 'ADS n' first pops an n-byte value (n being the argument),
|
||||||
|
next pops a pointer-size value and finally pushes a pointer-size value.
|
||||||
|
For some infrequently used EM instructions the pop-push numbers
|
||||||
|
cannot be computed statically.
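.PP
Such a description can be evaluated as sketched below.
The meaning of the letters and the fact that only the net stack effect
is computed are simplifying assumptions of this sketch,
not the actual format of the pop-push table.
.DS
/* Net number of bytes pushed by one instruction (negative if popped).
 * 'a' is the argument, 'w' the word size, 'p' the pointer size;
 * a '-' pops the next operand, a '+' pushes it.
 */
int pop_push(const char *pattern, int arg, int wordsize, int ptrsize)
{
	int net = 0, sign = 1;

	while (*pattern) {
		switch (*pattern++) {
		case '+': sign = 1;  break;
		case '-': sign = -1; break;
		case 'a': net += sign * arg;	  break;
		case 'w': net += sign * wordsize; break;
		case 'p': net += sign * ptrsize;  break;
		}
	}
	return net;
}
.DE
For the 'ADS n' description above this yields -a - p + p = -n,
i.e. the net effect of an ADS is to shrink the stack by n bytes.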
|
||||||
|
.PP
|
||||||
|
The stack pollution algorithm first performs a depth first search over
|
||||||
|
the control flow graph and marks all blocks that do not satisfy
|
||||||
|
the global condition.
|
||||||
|
Next it visits all basic blocks in turn.
|
||||||
|
For every pair of adjacent ASPs, it checks conditions 1. and 2. and
|
||||||
|
combines the ASPs if they are satisfied.
|
||||||
|
The new ASP may be used as first ASP in the next pair.
|
||||||
|
If a condition fails, it simply continues with the next ASP.
|
||||||
|
Finally, the last ASP is removed if:
|
||||||
|
.IP -
|
||||||
|
nothing has been popped from the stack after the last ASP that was
|
||||||
|
pushed before it
|
||||||
|
.IP -
|
||||||
|
the block was not marked by the depth first search
|
||||||
|
.IP -
|
||||||
|
the block is not in a loop
|
||||||
|
.LP
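.PP
The per-block part of this algorithm could be sketched as follows;
the helper functions stand for lookups in the pop-push table and for
the actual text manipulation, and are assumptions of the sketch.
.DS
extern int is_asp(int i), asp_count(int i);	/* ASP test and its argument */
extern int push_bytes(int i), pop_bytes(int i);	/* from the pop-push table */
extern void combine(int first, int second);	/* merge two ASPs into one */

void pollute_block(int ninstr)
{
	int i, first = -1;	/* ASP waiting to be delayed */
	int balance = 0;	/* bytes pushed since 'first' */
	int failed = 0;		/* condition 1. violated? */

	for (i = 0; i < ninstr; i++) {
		if (is_asp(i)) {
			if (first >= 0 && !failed && asp_count(i) == balance)
				combine(first, i);	/* condition 2. holds */
			first = i;	/* the (new) ASP starts the next pair */
			balance = 0;
			failed = 0;
			continue;
		}
		balance += push_bytes(i) - pop_bytes(i);
		if (balance < 0)	/* popped below the level of 'first' */
			failed = 1;
	}
}
.DE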
|
44
doc/ego/sr/sr1
Normal file
|
@ -0,0 +1,44 @@
|
||||||
|
.bp
|
||||||
|
.NH 1
|
||||||
|
Strength reduction
|
||||||
|
.NH 2
|
||||||
|
Introduction
|
||||||
|
.PP
|
||||||
|
The Strength Reduction optimization technique (SR)
|
||||||
|
tries to replace expensive operators
|
||||||
|
by cheaper ones,
|
||||||
|
in order to decrease the execution time
|
||||||
|
of the program.
|
||||||
|
A classical example is replacing a 'multiplication by 2'
|
||||||
|
by an addition or a shift instruction.
|
||||||
|
These kinds of local transformations are already
|
||||||
|
done by the EM Peephole Optimizer.
|
||||||
|
Strength reduction can also be applied
|
||||||
|
more generally to operators used in a loop.
|
||||||
|
.DS
|
||||||
|
i := 1; i := 1;
|
||||||
|
while i < 100 loop --> TMP := i * 118;
|
||||||
|
put(i * 118); while i < 100 loop
|
||||||
|
i := i + 1; put(TMP);
|
||||||
|
end loop; i := i + 1;
|
||||||
|
TMP := TMP + 118;
|
||||||
|
end loop;
|
||||||
|
|
||||||
|
Fig. 6.1 An example of Strength Reduction
|
||||||
|
.DE
|
||||||
|
In Fig. 6.1, a multiplication inside a loop is
|
||||||
|
replaced by an addition inside the loop and a multiplication
|
||||||
|
outside the loop.
|
||||||
|
Clearly, this is a global optimization; it cannot
|
||||||
|
be done by a peephole optimizer.
|
||||||
|
.PP
|
||||||
|
In some cases a related technique, \fItest replacement\fR,
|
||||||
|
can be used to eliminate the
|
||||||
|
loop variable i.
|
||||||
|
This technique will not be discussed in this report.
|
||||||
|
.sp 0
|
||||||
|
In the example above, the resulting code
|
||||||
|
can be further optimized by using
|
||||||
|
constant propagation.
|
||||||
|
Obviously, this is not the task of the
|
||||||
|
Strength Reduction phase.
|
217
doc/ego/sr/sr2
Normal file
|
@ -0,0 +1,217 @@
|
||||||
|
.NH 2
|
||||||
|
The model of strength reduction
|
||||||
|
.PP
|
||||||
|
In this section we will describe
|
||||||
|
the transformations performed by
|
||||||
|
Strength Reduction (SR).
|
||||||
|
Before doing so, we will introduce the
|
||||||
|
central notion of an induction variable.
|
||||||
|
.NH 3
|
||||||
|
Induction variables
|
||||||
|
.PP
|
||||||
|
SR looks for variables whose
|
||||||
|
values form an arithmetic progression
|
||||||
|
at the beginning of a loop.
|
||||||
|
These variables are called induction variables.
|
||||||
|
The most frequently occurring example of such
|
||||||
|
a variable is a loop-variable in a high-order
|
||||||
|
programming language.
|
||||||
|
Several quite sophisticated models of strength
|
||||||
|
reduction can be found in the literature.
|
||||||
|
.[
|
||||||
|
cocke reduction strength cacm
|
||||||
|
.]
|
||||||
|
.[
|
||||||
|
allen cocke kennedy reduction strength
|
||||||
|
.]
|
||||||
|
.[
|
||||||
|
lowry medlock cacm
|
||||||
|
.]
|
||||||
|
.[
|
||||||
|
aho compiler design
|
||||||
|
.]
|
||||||
|
In these models the notion of an induction variable
|
||||||
|
is far more general than the intuitive notion
|
||||||
|
of a loop-variable.
|
||||||
|
The definition of an induction variable we present here
|
||||||
|
is more restricted,
|
||||||
|
yielding a simpler model and simpler transformations.
|
||||||
|
We think the principal source for strength reduction lies in
|
||||||
|
expressions using a loop-variable,
|
||||||
|
i.e. a variable that is incremented or decremented
|
||||||
|
by the same amount after every loop iteration,
|
||||||
|
and that cannot be changed in any other way.
|
||||||
|
.PP
|
||||||
|
Of course, the EM code does not contain high level constructs
|
||||||
|
such as for-statements.
|
||||||
|
We will define an induction variable in terms
|
||||||
|
of the Intermediate Code of the optimizer.
|
||||||
|
Note that the notions of a loop in the
|
||||||
|
EM text and of a firm basic block
|
||||||
|
were defined in section 3.3.5.
|
||||||
|
.sp
|
||||||
|
.UL definition
|
||||||
|
.sp 0
|
||||||
|
An induction variable i of a loop L is a local variable
|
||||||
|
that is never accessed indirectly,
|
||||||
|
whose size is the word size of the target machine, and
|
||||||
|
that is assigned exactly once within L,
|
||||||
|
the assignment:
|
||||||
|
.IP -
|
||||||
|
being of the form i := i + c or i := c + i,
|
||||||
|
where c is a constant
|
||||||
|
called the \fIstep value\fR of i.
|
||||||
|
.IP -
|
||||||
|
occurring in a firm block of L.
|
||||||
|
.LP
|
||||||
|
(Note that the first restriction on the assignment
|
||||||
|
is not described in terms of the Intermediate Code;
|
||||||
|
we will give such a description later; the current
|
||||||
|
definition is easier to understand, however.)
|
||||||
|
.NH 3
|
||||||
|
Recognized expressions
|
||||||
|
.PP
|
||||||
|
SR recognizes certain expressions using
|
||||||
|
an induction variable and replaces
|
||||||
|
them by cheaper ones.
|
||||||
|
Two kinds of expensive operations are recognized:
|
||||||
|
multiplication and array address computations.
|
||||||
|
The expressions that are simplified must
|
||||||
|
use an induction variable
|
||||||
|
as an operand of
|
||||||
|
a multiplication or as index in an array expression.
|
||||||
|
.PP
|
||||||
|
Often a linear function of an induction variable is used,
|
||||||
|
rather than the variable itself.
|
||||||
|
In these cases optimization is still possible.
|
||||||
|
We call such expressions \fIiv-expressions\fR.
|
||||||
|
.sp
|
||||||
|
.UL definition:
|
||||||
|
.sp 0
|
||||||
|
An iv-expression of an induction variable i of a loop L is
|
||||||
|
an expression that:
|
||||||
|
.IP -
|
||||||
|
uses only the operators + and - (unary as well as binary)
|
||||||
|
.IP -
|
||||||
|
uses i as operand exactly once
|
||||||
|
.IP -
|
||||||
|
uses (besides i) only constants or variables that are
|
||||||
|
never changed in L as operands.
|
||||||
|
.LP
|
||||||
|
.PP
|
||||||
|
The expressions recognized by SR are of the following forms:
|
||||||
|
.IP (1)
|
||||||
|
iv_expression * constant
|
||||||
|
.IP (2)
|
||||||
|
constant * iv_expression
|
||||||
|
.IP (3)
|
||||||
|
A[iv-expression] := (assign to array element)
|
||||||
|
.IP (4)
|
||||||
|
A[iv-expression] (use array element)
|
||||||
|
.IP (5)
|
||||||
|
& A[iv-expression] (take address of array element)
|
||||||
|
.LP
|
||||||
|
(Note that EM has different instructions to use an array element,
|
||||||
|
store into one, or take the address of one, resp. LAR, SAR, and AAR).
|
||||||
|
.sp 0
|
||||||
|
The size of the elements of A must
|
||||||
|
be known statically.
|
||||||
|
In cases (3) and (4) this size
|
||||||
|
must equal the word size of the
|
||||||
|
target machine.
|
||||||
|
.NH 3
|
||||||
|
Transformations
|
||||||
|
.PP
|
||||||
|
With every recognized expression we associate
|
||||||
|
a new temporary local variable TMP,
|
||||||
|
allocated in the stack frame of the
|
||||||
|
procedure containing the expression.
|
||||||
|
At any program point within the loop, TMP will
|
||||||
|
contain the following value:
|
||||||
|
.IP multiplication: 18
|
||||||
|
the current value of iv-expression * constant
|
||||||
|
.IP arrays:
|
||||||
|
the current value of &A[iv-expression].
|
||||||
|
.LP
|
||||||
|
In the second case, TMP essentially is a pointer variable,
|
||||||
|
pointing to the element of A that is currently in use.
|
||||||
|
.sp 0
|
||||||
|
If the same expression occurs several times in the loop,
|
||||||
|
the same temporary local is used each time.
|
||||||
|
.PP
|
||||||
|
Three transformations are applied to the EM text:
|
||||||
|
.IP (1)
|
||||||
|
TMP is initialized with the right value.
|
||||||
|
This initialization takes place just
|
||||||
|
before the loop.
|
||||||
|
.IP (2)
|
||||||
|
The recognized expression is simplified.
|
||||||
|
.IP (3)
|
||||||
|
TMP is incremented; this takes place just
|
||||||
|
after the induction variable is incremented.
|
||||||
|
.LP
|
||||||
|
For multiplication, the initial value of TMP
|
||||||
|
is the value of the recognized expression at
|
||||||
|
the program point immediately before the loop.
|
||||||
|
For arrays, TMP is initialized with the address
|
||||||
|
of the first array element that is accessed.
|
||||||
|
So the initialization code is:
|
||||||
|
.DS
|
||||||
|
TMP := iv-expression * constant; or
|
||||||
|
TMP := &A[iv-expression]
|
||||||
|
.DE
|
||||||
|
At the point immediately before the loop,
|
||||||
|
the induction variable will already have been
|
||||||
|
initialized,
|
||||||
|
so the value used in the code above will be the
|
||||||
|
value it has during the first iteration.
|
||||||
|
.PP
|
||||||
|
For multiplication, the recognized expression can simply be
|
||||||
|
replaced by TMP.
|
||||||
|
For array optimizations, the replacement
|
||||||
|
depends on the form:
|
||||||
|
.DS
|
||||||
|
\fIform\fR \fIreplacement\fR
|
||||||
|
(3) A[iv-expr] := *TMP := (assign indirect)
|
||||||
|
(4) A[iv-expr] *TMP (use indirect)
|
||||||
|
(5) &A[iv-expr] TMP
|
||||||
|
.DE
|
||||||
|
The '*' denotes the indirect operator. (Note that
|
||||||
|
EM has different instructions to do
|
||||||
|
an assign-indirect and a use-indirect).
|
||||||
|
As the size of the array elements is restricted
|
||||||
|
to be the word size in case (3) and (4),
|
||||||
|
only one EM instruction needs to
|
||||||
|
be generated in all cases.
|
||||||
|
.PP
|
||||||
|
The amount by which TMP is incremented is:
|
||||||
|
.IP multiplication: 18
|
||||||
|
step value * constant
|
||||||
|
.IP arrays:
|
||||||
|
step value * element size
|
||||||
|
.LP
|
||||||
|
Note that the step value (see definition of induction variable above),
|
||||||
|
the constant, and the element size (see previous section) can all
|
||||||
|
be determined statically.
|
||||||
|
If the sign of the induction variable in the
|
||||||
|
iv-expression is negative, the amount
|
||||||
|
must be negated.
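.PP
The increment amount can thus be computed as sketched below;
for the example of Fig. 6.2 below this yields -(-3 * 5) = +15.
.DS
/* Illustrative computation of the increment of TMP. */
int increment_amount(int is_array, int step, int constant, int elemsize,
		     int negative_sign)
{
	int amount = is_array ? step * elemsize : step * constant;

	return negative_sign ? -amount : amount;
}
.DE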
|
||||||
|
.PP
|
||||||
|
The transformations are demonstrated by an example.
|
||||||
|
.DS
|
||||||
|
i := 100; i := 100;
|
||||||
|
while i > 1 loop TMP := (6-i) * 5;
|
||||||
|
X := (6-i) * 5 + 2; while i > 1 loop
|
||||||
|
Y := (6-i) * 5 - 8; --> X := TMP + 2;
|
||||||
|
i := i - 3; Y := TMP - 8;
|
||||||
|
end loop; i := i - 3;
|
||||||
|
TMP := TMP + 15;
|
||||||
|
end loop;
|
||||||
|
|
||||||
|
Fig. 6.2 Example of complex Strength Reduction transformations
|
||||||
|
.DE
|
||||||
|
The expression '(6-i)*5' is recognized twice. The constant
|
||||||
|
is 5.
|
||||||
|
The step value is -3.
|
||||||
|
The sign of i in the recognized expression is '-'.
|
||||||
|
So the increment value of TMP is -(-3*5) = +15.
|
232
doc/ego/sr/sr3
Normal file
|
@ -0,0 +1,232 @@
|
||||||
|
.NH 2
|
||||||
|
Implementation
|
||||||
|
.PP
|
||||||
|
Like most phases, SR deals with one procedure
|
||||||
|
at a time.
|
||||||
|
Within a procedure, SR works on one loop at a time.
|
||||||
|
Loops are processed in textual order.
|
||||||
|
If loops are nested inside each other,
|
||||||
|
SR starts with the outermost loop and proceeds in the
|
||||||
|
inwards direction.
|
||||||
|
This order is chosen,
|
||||||
|
because it enables the optimization
|
||||||
|
of multi-dimensional array address computations,
|
||||||
|
if the elements are accessed in the usual way
|
||||||
|
(i.e. row after row, rather than column after column).
|
||||||
|
For every loop, SR first detects all induction variables
|
||||||
|
and then tries to recognize
|
||||||
|
expressions that can be optimized.
|
||||||
|
.NH 3
|
||||||
|
Finding induction variables
|
||||||
|
.PP
|
||||||
|
The process of finding induction variables
|
||||||
|
can conveniently be split up
|
||||||
|
into two parts.
|
||||||
|
First, the EM text of the loop is scanned to find
|
||||||
|
all \fIcandidate\fR induction variables,
|
||||||
|
which are word-sized local variables
|
||||||
|
that are assigned precisely once
|
||||||
|
in the loop, within a firm block.
|
||||||
|
Second, for every candidate, the single assignment
|
||||||
|
is inspected, to see if it has the form
|
||||||
|
required by the definition of an induction variable.
|
||||||
|
.PP
|
||||||
|
Candidates are found by scanning the EM code of the loop.
|
||||||
|
During this scan, two sets are maintained.
|
||||||
|
The set "cand" contains all variables that were
|
||||||
|
assigned exactly once so far, within a firm block.
|
||||||
|
The set "dismiss" contains all variables that
|
||||||
|
should not be made a candidate.
|
||||||
|
Initially, both sets are empty.
|
||||||
|
If a variable is assigned to, it is put
|
||||||
|
in the cand set, if three conditions are met:
|
||||||
|
.IP 1.
|
||||||
|
the variable was not in cand or dismiss already
|
||||||
|
.IP 2.
|
||||||
|
the assignment takes place in a firm block
|
||||||
|
.IP 3.
|
||||||
|
the assignment is not a ZRL instruction (assignment by zero)
|
||||||
|
or a SDL instruction (store double local).
|
||||||
|
.LP
|
||||||
|
If any condition fails, the variable is dismissed from cand
|
||||||
|
(if it was there already) and put in dismiss
|
||||||
|
(if it was not there already).
|
||||||
|
.sp 0
|
||||||
|
All variables for which no register message was generated (i.e. those
|
||||||
|
variables that may be accessed indirectly) are assumed
|
||||||
|
to be changed in the loop.
|
||||||
|
.sp 0
|
||||||
|
All variables that remain in cand are candidate induction variables.
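.PP
The sketch below illustrates this scan.
The helper functions and the representation of the two sets as arrays
indexed by (small, non-negative) offsets are assumptions of the sketch,
not the interface of the cand package.
.DS
extern int stored_local(int i);	/* offset of the stored local, or -1 */
extern int in_firm_block(int i), is_zrl_or_sdl(int i);

#define MAXVAR 256	/* offsets are assumed to be smaller than this */

void find_candidates(int ninstr, int cand[], int dismiss[])
{
	int i, v;

	for (v = 0; v < MAXVAR; v++) cand[v] = dismiss[v] = 0;
	for (i = 0; i < ninstr; i++) {
		v = stored_local(i);
		if (v < 0) continue;	/* not an assignment to a local */
		if (!cand[v] && !dismiss[v]
		    && in_firm_block(i) && !is_zrl_or_sdl(i)) {
			cand[v] = 1;	/* first, acceptable assignment */
		} else {
			cand[v] = 0;	/* extra or unacceptable assignment */
			dismiss[v] = 1;
		}
	}
	/* the variables with cand[v] still set are the candidates */
}
.DE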
|
||||||
|
.PP
|
||||||
|
From the set of candidates, the induction variables can
|
||||||
|
be determined, by inspecting the single assignment.
|
||||||
|
The assignment must match one of the EM patterns below.
|
||||||
|
('x' is the candidate. 'ws' is the word size of the target machine.
|
||||||
|
'n' is any number.)
|
||||||
|
.DS
|
||||||
|
\fIpattern\fR \fIstep size\fR
|
||||||
|
INL x | +1
|
||||||
|
DEL x | -1
|
||||||
|
LOL x ; (INC | DEC) ; STL x | +1 | -1
|
||||||
|
LOL x ; LOC n ; (ADI ws | SBI ws) ; STL x | +n | -n
|
||||||
|
LOC n ; LOL x ; ADI ws ; STL x | +n
|
||||||
|
.DE
|
||||||
|
From the patterns the step size of the induction variable
|
||||||
|
can also be determined.
|
||||||
|
These step sizes are displayed on the right hand side.
|
||||||
|
.sp
|
||||||
|
For every induction variable we maintain the following information:
|
||||||
|
.IP -
|
||||||
|
the offset of the variable in the stackframe of its procedure
|
||||||
|
.IP -
|
||||||
|
a pointer to the EM text of the assignment statement
|
||||||
|
.IP -
|
||||||
|
the step value
|
||||||
|
.LP
|
||||||
|
.NH 3
|
||||||
|
Optimizing expressions
|
||||||
|
.PP
|
||||||
|
If any induction variables of the loop were found,
|
||||||
|
the EM text of the loop is scanned again,
|
||||||
|
to detect expressions that can be optimized.
|
||||||
|
SR scans for multiplication and array instructions.
|
||||||
|
Whenever it finds such an instruction, it analyses the
|
||||||
|
code in front of it.
|
||||||
|
If an expression is to be optimized, it must
|
||||||
|
be generated by the following syntax rules.
|
||||||
|
.DS
|
||||||
|
optimizable_expr:
|
||||||
|
iv_expr const mult |
|
||||||
|
const iv_expr mult |
|
||||||
|
address iv_expr address array_instr;
|
||||||
|
mult:
|
||||||
|
MLI ws |
|
||||||
|
MLU ws ;
|
||||||
|
array_instr:
|
||||||
|
LAR ws |
|
||||||
|
SAR ws |
|
||||||
|
AAR ws ;
|
||||||
|
const:
|
||||||
|
LOC n ;
|
||||||
|
.DE
|
||||||
|
An 'address' is an EM instruction that loads an
|
||||||
|
address on the stack.
|
||||||
|
An instruction like LOL may be an 'address', if
|
||||||
|
the size of an address (pointer size, =ps) is
|
||||||
|
the same as the word size.
|
||||||
|
If the pointer size is twice the word size,
|
||||||
|
instructions like LDL are an 'address'.
|
||||||
|
(The addresses in the third grammar rule
|
||||||
|
denote resp. the array address and the
|
||||||
|
array descriptor address).
|
||||||
|
.DS
|
||||||
|
address:
|
||||||
|
LAE |
|
||||||
|
LAL |
|
||||||
|
LOL if ps=ws |
|
||||||
|
LOE ,, |
|
||||||
|
LIL ,, |
|
||||||
|
LDL if ps=2*ws |
|
||||||
|
LDE ,, ;
|
||||||
|
.DE
|
||||||
|
The notion of an iv-expression was introduced earlier.
|
||||||
|
.DS
|
||||||
|
iv_expr:
|
||||||
|
iv_expr unary_op |
|
||||||
|
iv_expr iv_expr binary_op |
|
||||||
|
loopconst |
|
||||||
|
iv ;
|
||||||
|
unary_op:
|
||||||
|
NGI ws |
|
||||||
|
INC |
|
||||||
|
DEC ;
|
||||||
|
binary_op:
|
||||||
|
ADI ws |
|
||||||
|
ADU ws |
|
||||||
|
SBI ws |
|
||||||
|
SBU ws ;
|
||||||
|
loopconst:
|
||||||
|
const |
|
||||||
|
LOL x if x is not changed in loop ;
|
||||||
|
iv:
|
||||||
|
LOL x if x is an induction variable ;
|
||||||
|
.DE
|
||||||
|
An iv_expression must satisfy one additional constraint:
|
||||||
|
it must use exactly one operand that is an induction
|
||||||
|
variable.
|
||||||
|
A simple, hand-written, top-down parser is used
|
||||||
|
to recognize an iv-expression.
|
||||||
|
It scans the EM code from right to left
|
||||||
|
(recall that EM is essentially postfix).
|
||||||
|
It uses semantic attributes (inherited as well as
|
||||||
|
derived) to check the additional constraint.
|
||||||
|
.PP
|
||||||
|
All information assembled during the recognition
|
||||||
|
process is put in a 'code_info' structure.
|
||||||
|
This structure contains the following information:
|
||||||
|
.IP -
|
||||||
|
the optimizable code itself
|
||||||
|
.IP -
|
||||||
|
the loop and basic block the code is part of
|
||||||
|
.IP -
|
||||||
|
the induction variable
|
||||||
|
.IP -
|
||||||
|
the iv-expression
|
||||||
|
.IP -
|
||||||
|
the sign of the induction variable in the
|
||||||
|
iv-expression
|
||||||
|
.IP -
|
||||||
|
the offset and size of the temporary local variable
|
||||||
|
.IP -
|
||||||
|
the expensive operator (MLI, LAR etc.)
|
||||||
|
.IP -
|
||||||
|
the instruction that loads the constant
|
||||||
|
(for multiplication) or the array descriptor
|
||||||
|
(for arrays).
|
||||||
|
.LP
|
||||||
|
The entire transformation process is driven
|
||||||
|
by this information.
|
||||||
|
As the EM text is represented internally
|
||||||
|
as a list, this process consists
|
||||||
|
mainly of straightforward list manipulations.
|
||||||
|
.sp 0
|
||||||
|
The initialization code must be put
|
||||||
|
immediately before the loop entry.
|
||||||
|
For this purpose a \fIheader block\fR is
|
||||||
|
created that has the loop entry block as
|
||||||
|
its only successor and that dominates the
|
||||||
|
entry block.
|
||||||
|
The CFG and all relations (SUCC,PRED, IDOM, LOOPS etc.)
|
||||||
|
are updated.
|
||||||
|
.sp 0
|
||||||
|
An EM instruction that will
|
||||||
|
replace the optimizable code
|
||||||
|
is created and put at the place of the old code.
|
||||||
|
The list representing the old optimizable code
|
||||||
|
is used to create a list for the initializing code,
|
||||||
|
as they are similar.
|
||||||
|
Only two modifications are required:
|
||||||
|
.IP -
|
||||||
|
if the expensive operator is a LAR or SAR,
|
||||||
|
it must be replaced by an AAR, as the initial value
|
||||||
|
of TMP is the \fIaddress\fR of the first
|
||||||
|
array element that is accessed.
|
||||||
|
.IP -
|
||||||
|
code must be appended to store the result of the
|
||||||
|
expression in TMP.
|
||||||
|
.LP
|
||||||
|
Finally, code to increment TMP is created and put after
|
||||||
|
the code of the single assignment to the
|
||||||
|
induction variable.
|
||||||
|
The generated code uses either an integer addition
|
||||||
|
(ADI) or an integer-to-pointer addition (ADS)
|
||||||
|
to do the increment.
|
||||||
|
.PP
|
||||||
|
SR maintains a set of all expressions that have already
|
||||||
|
been recognized in the present loop.
|
||||||
|
Such expressions are said to be \fIavailable\fR.
|
||||||
|
If an expression is recognized that is
|
||||||
|
already available,
|
||||||
|
no new temporary local variable is allocated for it,
|
||||||
|
and the code to initialize and increment the local
|
||||||
|
is not generated.
|
28
doc/ego/sr/sr4
Normal file
|
@ -0,0 +1,28 @@
|
||||||
|
.NH 2
|
||||||
|
Source files of SR
|
||||||
|
.PP
|
||||||
|
The sources of SR are in the following files
|
||||||
|
and packages:
|
||||||
|
.IP sr.h: 14
|
||||||
|
declarations of global variables and
|
||||||
|
data structures
|
||||||
|
.IP sr.c:
|
||||||
|
the routine main; a driving routine to process
|
||||||
|
(possibly nested) loops in the right order
|
||||||
|
.IP iv:
|
||||||
|
implements a procedure that finds the induction variables
|
||||||
|
of a loop
|
||||||
|
.IP reduce:
|
||||||
|
implements a procedure that finds optimizable expressions
|
||||||
|
and that does the transformations
|
||||||
|
.IP cand:
|
||||||
|
implements a procedure that finds the candidate induction
|
||||||
|
variables; used to implement iv
|
||||||
|
.IP xform:
|
||||||
|
implements several useful routines that transform
|
||||||
|
lists of EM text or a CFG; used to implement reduce
|
||||||
|
.IP expr:
|
||||||
|
implements a procedure that parses iv-expressions
|
||||||
|
.IP aux:
|
||||||
|
implements several auxiliary procedures.
|
||||||
|
.LP
|
58
doc/ego/ud/ud1
Normal file
|
@ -0,0 +1,58 @@
|
||||||
|
.bp
|
||||||
|
.NH 1
|
||||||
|
Use-Definition analysis
|
||||||
|
.NH 2
|
||||||
|
Introduction
|
||||||
|
.PP
|
||||||
|
The "Use-Definition analysis" phase (UD) consists of two related optimization
|
||||||
|
techniques that both depend on "Use-Definition" information.
|
||||||
|
The techniques are Copy Propagation and Constant Propagation.
|
||||||
|
They are best explained via an example (see Figs. 11.1 and 11.2).
|
||||||
|
.DS
|
||||||
|
(1) A := B A := B
|
||||||
|
... --> ...
|
||||||
|
(2) use(A) use(B)
|
||||||
|
|
||||||
|
Fig. 11.1 An example of Copy Propagation
|
||||||
|
.DE
|
||||||
|
.DS
|
||||||
|
(1) A := 12 A := 12
|
||||||
|
... --> ...
|
||||||
|
(2) use(A) use(12)
|
||||||
|
|
||||||
|
Fig. 11.2 An example of Constant Propagation
|
||||||
|
.DE
|
||||||
|
Both optimizations have to check that the value of A at line (2)
|
||||||
|
can only be obtained at line (1).
|
||||||
|
Copy Propagation also has to assure that the value of B is
|
||||||
|
the same at line (1) as at line (2).
|
||||||
|
.PP
|
||||||
|
One purpose of both transformations is to introduce
|
||||||
|
opportunities for the Dead Code Elimination optimization.
|
||||||
|
If the variable A is used nowhere else, the assignment A := B
|
||||||
|
becomes useless and can be eliminated.
|
||||||
|
.sp 0
|
||||||
|
If B is less expensive to access than A (e.g. this is sometimes the case
|
||||||
|
if A is a local variable and B is a global variable),
|
||||||
|
Copy Propagation directly improves the code itself.
|
||||||
|
If A is cheaper to access the transformation will not be performed.
|
||||||
|
Likewise, a constant as operand may be cheaper than a variable.
|
||||||
|
Having a constant as operand may also facilitate other optimizations.
|
||||||
|
.PP
|
||||||
|
The design of UD is based on the theory described in sections
|
||||||
|
14.1 and 14.3 of.
|
||||||
|
.[
|
||||||
|
aho compiler design
|
||||||
|
.]
|
||||||
|
As a main departure from that theory,
|
||||||
|
we do not demand that the statement A := B become redundant after
|
||||||
|
Copy Propagation.
|
||||||
|
If B is cheaper to access than A, the optimization is always performed;
|
||||||
|
if B is more expensive than A, we never do the transformation.
|
||||||
|
If A and B are equally expensive UD uses the heuristic rule to
|
||||||
|
replace infrequently used variables by frequently used ones.
|
||||||
|
This rule increases the chances of the assignment becoming useless.
|
||||||
|
.PP
|
||||||
|
In the next section we will give a brief outline of the data
|
||||||
|
flow theory used
|
||||||
|
for the implementation of UD.
|
64
doc/ego/ud/ud2
Normal file
|
@ -0,0 +1,64 @@
|
||||||
|
.NH 2
|
||||||
|
Data flow information
|
||||||
|
.PP
|
||||||
|
.NH 3
|
||||||
|
Use-Definition information
|
||||||
|
.PP
|
||||||
|
A \fIdefinition\fR of a variable A is an assignment to A.
|
||||||
|
A definition is said to \fIreach\fR a point p if there is a
|
||||||
|
path in the control flow graph from the definition to p, such that
|
||||||
|
A is not redefined on that path.
|
||||||
|
.PP
|
||||||
|
For every basic block b, we define the following sets:
|
||||||
|
.IP GEN[b] 9
|
||||||
|
the set of definitions in b that reach the end of b.
|
||||||
|
.IP KILL[b]
|
||||||
|
the set of definitions outside b that define a variable that
|
||||||
|
is changed in b.
|
||||||
|
.IP IN[b]
|
||||||
|
the set of all definitions reaching the beginning of b.
|
||||||
|
.IP OUT[b]
|
||||||
|
the set of all definitions reaching the end of b.
|
||||||
|
.LP
|
||||||
|
GEN and KILL can be determined by inspecting the code of the procedure.
|
||||||
|
IN and OUT are computed by solving the following data flow equations:
|
||||||
|
.DS
|
||||||
|
(1) OUT[b] = IN[b] - KILL[b] + GEN[b]
|
||||||
|
(2) IN[b] = OUT[p1] + ... + OUT[pn],
|
||||||
|
where PRED(b) = {p1, ... , pn}
|
||||||
|
.DE
|
||||||
|
.NH 3
|
||||||
|
Copy information
|
||||||
|
.PP
|
||||||
|
A \fIcopy\fR is a definition of the form "A := B".
|
||||||
|
A copy is said to be \fIgenerated\fR in a basic block n if
|
||||||
|
it occurs in n and there is no subsequent assignment to B in n.
|
||||||
|
A copy is said to be \fIkilled\fR in n if:
|
||||||
|
.IP (i)
|
||||||
|
it occurs in n and there is a subsequent assignment to B within n, or
|
||||||
|
.IP (ii)
|
||||||
|
it occurs outside n, the definition A := B reaches the beginning of n
|
||||||
|
and B is changed in n (note that a copy also is a definition).
|
||||||
|
.LP
|
||||||
|
A copy \fIreaches\fR a point p, if there are no assignments to B
|
||||||
|
on any path in the control flow graph from the copy to p.
|
||||||
|
.PP
|
||||||
|
We define the following sets:
|
||||||
|
.IP C_GEN[b] 11
|
||||||
|
the set of all copies generated in b.
|
||||||
|
.IP C_KILL[b]
|
||||||
|
the set of all copies killed in b.
|
||||||
|
.IP C_IN[b]
|
||||||
|
the set of all copies reaching the beginning of b.
|
||||||
|
.IP C_OUT[b]
|
||||||
|
the set of all copies reaching the end of b.
|
||||||
|
.LP
|
||||||
|
C_IN and C_OUT are computed by solving the following equations:
|
||||||
|
(root is the entry node of the current procedure; '*' denotes
|
||||||
|
set intersection)
|
||||||
|
.DS
|
||||||
|
(1) C_OUT[b] = C_IN[b] - C_KILL[b] + C_GEN[b]
|
||||||
|
(2) C_IN[b] = C_OUT[p1] * ... * C_OUT[pn],
|
||||||
|
where PRED(b) = {p1, ... , pn} and b /= root
|
||||||
|
C_IN[root] = {all copies}
|
||||||
|
.DE
|
26
doc/ego/ud/ud3
Normal file
|
@ -0,0 +1,26 @@
|
||||||
|
.NH 2
|
||||||
|
Pointers and subroutine calls
|
||||||
|
.PP
|
||||||
|
The theory outlined above assumes that variables can
|
||||||
|
only be changed by a direct assignment.
|
||||||
|
This condition does not hold for EM.
|
||||||
|
In case of an assignment through a pointer variable,
|
||||||
|
it is in general impossible to see which variable is affected
|
||||||
|
by the assignment.
|
||||||
|
Similar problems occur in the presence of procedure calls.
|
||||||
|
Therefore we distinguish two kinds of definitions:
|
||||||
|
.IP -
|
||||||
|
an \fIexplicit\fR definition is a direct assignment to one
|
||||||
|
specific variable
|
||||||
|
.IP -
|
||||||
|
an \fIimplicit\fR definition is the potential alteration of
|
||||||
|
a variable as a result of a procedure call or an indirect assignment.
|
||||||
|
.LP
|
||||||
|
An indirect assignment causes implicit definitions to
|
||||||
|
all variables that may be accessed indirectly, i.e.
|
||||||
|
all local variables for which no register message was generated
|
||||||
|
and all global variables.
|
||||||
|
If a procedure contains an indirect assignment, a call to it may change the
|
||||||
|
same set of variables; otherwise it may change some global variables directly.
|
||||||
|
The KILL, GEN, IN and OUT sets contain explicit as well
|
||||||
|
as implicit definitions.
|
78
doc/ego/ud/ud4
Normal file
|
@ -0,0 +1,78 @@
|
||||||
|
.NH 2
|
||||||
|
Implementation
|
||||||
|
.PP
|
||||||
|
UD first builds a number of tables:
|
||||||
|
.IP locals: 9
|
||||||
|
contains information about the local variables of the
|
||||||
|
current procedure (offset,size,whether a register message was found
|
||||||
|
for it and, if so, the score field of that message)
|
||||||
|
.IP defs:
|
||||||
|
a table of all explicit definitions appearing in the
|
||||||
|
current procedure.
|
||||||
|
.IP copies:
|
||||||
|
a table of all copies appearing in the
|
||||||
|
current procedure.
|
||||||
|
.LP
|
||||||
|
Every variable (local as well as global), definition and copy
|
||||||
|
is identified by a unique number, which is the index
|
||||||
|
in the table.
|
||||||
|
All tables are constructed by traversing the EM code.
|
||||||
|
A fourth table, "vardefs" is used, indexed by a 'variable number',
|
||||||
|
which contains for every variable the set of explicit definitions of it.
|
||||||
|
Also, for each basic block b, the set CHGVARS containing all variables
|
||||||
|
changed by it is computed.
|
||||||
|
.PP
|
||||||
|
The GEN sets are obtained in one scan over the EM text,
|
||||||
|
by analyzing every EM instruction.
|
||||||
|
The KILL set of a basic block b is computed by looking at the
|
||||||
|
set of variables
|
||||||
|
changed by b (i.e. CHGVARS[b]).
|
||||||
|
For every such variable v, all explicit definitions to v
|
||||||
|
(i.e. vardefs[v]) that are not in GEN[b] are added to KILL[b].
|
||||||
|
Also, the implicit definition of v is added to KILL[b].
|
||||||
|
Next, the data flow equations for use-definition information
|
||||||
|
are solved,
|
||||||
|
using a straightforward, iterative algorithm.
|
||||||
|
All sets are represented as bitvectors, so the operations
|
||||||
|
on sets (union, difference) can be implemented efficiently.
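.PP
A minimal sketch of such a solver is shown below, with every set
represented as an array of machine words.
The sizes and the block representation are assumptions of the sketch,
not the data structures of UD.
.DS
#define NDEFS	512		/* max. number of definitions (assumption) */
#define NWORDS	(NDEFS / 32)

typedef unsigned int bitvec[NWORDS];

struct block {
	bitvec	b_gen, b_kill, b_in, b_out;
	struct block *b_pred[8];	/* predecessors (at most 8 assumed) */
	int	b_npred;
	struct block *b_next;
};

void solve_ud(struct block *blocks)
{
	struct block *b;
	unsigned int out;
	int change = 1, i, p;

	while (change) {
		change = 0;
		for (b = blocks; b != 0; b = b->b_next) {
			for (i = 0; i < NWORDS; i++) {
				b->b_in[i] = 0;	/* IN = union of OUT of preds */
				for (p = 0; p < b->b_npred; p++)
					b->b_in[i] |= b->b_pred[p]->b_out[i];
				out = (b->b_in[i] & ~b->b_kill[i]) | b->b_gen[i];
				if (out != b->b_out[i]) {
					b->b_out[i] = out;
					change = 1;
				}
			}
		}
	}
}
.DE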
|
||||||
|
.PP
|
||||||
|
The C_GEN and C_KILL sets are computed simultaneously in one scan
|
||||||
|
over the EM text.
|
||||||
|
For every copy A := B appearing in basic block b we do
|
||||||
|
the following:
|
||||||
|
.IP 1.
|
||||||
|
for every basic block n /= b that changes B, see if the definition A := B
|
||||||
|
reaches the beginning of n (i.e. check if the index number of A := B in
|
||||||
|
the "defs" table is an element of IN[n]);
|
||||||
|
if so, add the copy to C_KILL[n]
|
||||||
|
.IP 2.
|
||||||
|
if B is redefined later on in b, add the copy to C_KILL[b], else
|
||||||
|
add it to C_GEN[b]
|
||||||
|
.LP
|
||||||
|
C_IN and C_OUT are computed from C_GEN and C_KILL via the second set of
|
||||||
|
data flow equations.
|
||||||
|
.PP
|
||||||
|
Finally, in one last scan all opportunities for optimization are
|
||||||
|
detected.
|
||||||
|
For every use u of a variable A, we check if
|
||||||
|
there is a unique explicit definition d reaching u.
|
||||||
|
.sp
|
||||||
|
If the definition is a copy A := B and B has the same value at d as
|
||||||
|
at u, then the use of A at u may be changed into B.
|
||||||
|
The latter condition can be verified as follows:
|
||||||
|
.IP -
|
||||||
|
if u and d are in the same basic block, see if there is
|
||||||
|
any assignment to B in between d and u
|
||||||
|
.IP -
|
||||||
|
if u and d are in different basic blocks, the condition is
|
||||||
|
satisfied if there is no assignment to B in the block of u prior to u
|
||||||
|
and d is in C_IN[b].
|
||||||
|
.LP
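.PP
These two checks can be sketched as follows;
the helper predicates are assumptions of the sketch.
.DS
extern int same_block(int d, int u);
extern int assigns_between(int B, int d, int u);   /* within one basic block */
extern int assigns_before(int B, int b, int u);    /* in b, before u */
extern int in_c_in(int copy, int b);               /* copy in C_IN[b] ? */

/* May the use of A at u be replaced by B, given that the copy
 * d: "A := B" is the only explicit definition reaching u?
 */
int may_propagate(int d, int u, int B, int b, int copy)
{
	if (same_block(d, u))
		return !assigns_between(B, d, u);
	return !assigns_before(B, b, u) && in_c_in(copy, b);
}
.DE
.PP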
|
||||||
|
Before the transformation is actually done, UD first makes sure the
|
||||||
|
alteration is really desirable, as described before.
|
||||||
|
The information needed for this purpose (access costs of local and
|
||||||
|
global variables) is read from a machine descriptor file.
|
||||||
|
.sp
|
||||||
|
If the only definition reaching u has the form "A := constant", the use
|
||||||
|
of A at u is replaced by the constant.
|
||||||
|
|
19
doc/ego/ud/ud5
Normal file
|
@ -0,0 +1,19 @@
|
||||||
|
|
||||||
|
.NH 2
|
||||||
|
Source files of UD
|
||||||
|
.PP
|
||||||
|
The sources of UD are in the following files and packages:
|
||||||
|
.IP ud.h: 14
|
||||||
|
declarations of global variables and data structures
|
||||||
|
.IP ud.c:
|
||||||
|
the routine main; initialization of target machine dependent tables
|
||||||
|
.IP defs:
|
||||||
|
routines to compute the GEN and KILL sets and routines to analyse
|
||||||
|
EM instructions
|
||||||
|
.IP const:
|
||||||
|
routines involved in constant propagation
|
||||||
|
.IP copy:
|
||||||
|
routines involved in copy propagation
|
||||||
|
.IP aux:
|
||||||
|
contains auxiliary routines
|
||||||
|
.LP
|