Rabbit: A Compiler for Scheme/Notes

From Wikisource
Jump to navigation Jump to search

[Page 103]

93 Notes {Note ASET' Is Imperative) It is true that ASET' is an actual imperative which produces a side effect, and is not expressed applicatively. ASET' is used only for two purposes in practice: to initialize global variables (often relating to MacLISP primitives), and to implement objects with state (cells, in the PLASMA sense [Smith and Hewitt] [Hewitt and Smith]). If we were to redesign SCHEME from scratch, I imagine that we would introduce cells as our primitive side-effect rather than ASET'. The decision to use ASET' was motivated primarily by the desire to interface easily to the MacLISP environment (and, as a corollary, to be able to implement SCHEME in three days instead of three years!).

We note that in approximately one hundred pages of SCHEME code written by three people, the non-quoted ASET has never been used, and ASET' has been used only a dozen times or so, always for one of the two purposes mentioned above. In most situations where one would like to write an assignment of some kind, macros which expand into applicative constructions suffice.

[Page 104]

94 (Note Code Pointers) Conceptually a closure is made up of a pointer to some code (a 'script'

[Smith and Hewitt]) and an environment. In e RABBIT-formatted CBETA, the pointer to the code is encoded into two levels: a pointer to a particular piece of l'Iacl.ISP code, plus a tag within that FROG. This implementation was forced upon us by MacLISP. If we could easily create pointers into the middle of a FROG, we could avoid this two-level encoding.

On the other hand, this is not just an engineering kludge, but can be provided with a reasonable semantic explanation: rather than compiling a lot of little functions, we compile a single big function which is a giant CASE statement. Wherever we wish to make a closure of a little function, we actually close a different little function which calls the big function with an extra argument to dispatch on.

(Note Continuation Variable Hack) Since the dissertation was written, a simple modification to the routine which converts to continuation-passing style has eliminated some of the register shuffling. The effect of the change was to perform substitutions of one continuation variable for another, in situations such as:

((CLAHBDA (CONT-3 ...) ...) CONT-Z ...) where CONT-2 would be substituted for CONT-3 in the body of the CLAHBDA-expression. Once this is done, CONT-3 is unreferenced, and so is not really passed at all by virtue of the phase-Z analysis. The result Is that continuations are not copied back and forth from register to register. In the

[Page 105]

iterative factorial example in the text the actual register assignment would be CONT-5 **CONT** VAR-1 **ONE** FNVAR-2 <none> VAR-3 **Two** VAR-4 **THREE** This optimization is discussed more thoroughly in the Appendix near the routine CONVERT-COMBINATION.

{Note Dijkstra's Opinion) In [Dijkstra] a remark is made to the effect that defining the while gg construct in terms of function calls seems unusually clumsy. In [Steele] we reply that this is due partly to Dijkstra's choice of ALGOL for expressing the definition. Here we would add that, while such a definition is completely workable and is useful for compilation purposes, we need never tell the user that we defined while-gg in this manner! Only the writer of the macros needs to know the complexity involved; the user need not, and should not, care as long as the construction works when he uses it.

[Page 106]

96 (Note Evaluation for Control) It is usual U1 a compiler to distinguish at least three 'evaluation contexts": value, control, and effect. (See [Nulf], for example.) Evaluation for control occurs in the predicate of an IF, where the point is not so much to produce a data object as simply to decide whether it is true or false. The results of AND, OR, and NOT operations in predicates are 'encoded in the program counter". when compiling an AND, OR, or NOT, a flag is passed down indicating whether it is for value or for control; in the latter case, two tags are also passed down, indicating the branch targets for success or failure. (This is called "anchor pointing" in [Allen and Cocke].) In RABBIT this notion falls out automatically without any special handling, thanks to the definition of AND and OR as macros expanding into IF statements. If we were also to define NOT as a macro *(NOT x) => (IF x 'NIL 'T) then nearly all such special "evaluation for control" cases would be handled by virtue of the nested-IF transformation in the optimizer.

One transformation which ought to be in the optimizer is (IF ((LAMBDA (X Y ...) <body>) A B ...) <con> <alt>) => ((LAMBDA (X Y ...) (IF <body> <con> <alt>)) A B ...) which could be important if the (body) is itself as IF. (This transformation would occur at a point (in the optimizer) where no conflicts between X, Y, ... and variables used in <con> and <alt> could occur.)

[Page 107]

97 (Note Evaluation for Effect) This is the point where the notion of evaluation for effect is handled (see (Note Evaluation for Control)). It is detected as the special case of evaluation for value where no one refers to the value! This may be construed as the distinction between "statement" and 'expression' made in Algol-like languages.

(Note Full-Funarg Example) As an example of the difference between lexical and dynamic scoping, consider the classic case of the "funarg problem". We have defined a function MAPCAR which, given a function and a list, produces a new list of the results of the function applied to each element of the given list:

(DEFINE MAPCAR (LAMBDA (FN L) (IF (NULL L) NIL (CONS (FN (CAR L)) (MAPCAR FN (CDR L)))))) Now suppose in another program we have a list X and a number L, and want to add L to every element of X:

(MAPCAR (LAMBDA (Z) (+ Z L)) X) This works correctly in a lexically scoped language such as SCHEME, because the L in the function (LAMBDA (Z) (+ Z L)) refers to the value of L at the point the LAMBDA-expression is evaluated. In a dynamically scoped language, such as standard LISP, the L refers to the most recent run-time binding of L, which is the binding in the definition of MAPCAR (which occurs between the time the LAMBDA-expression is passed to MAPCAR and the time the LAMBDA-expression is

[Page 108]

98 invoked).

(Note Generalized LABELS) Since the dissertation was written, and indeed after '[Revised Report] came out, the format of LABELS in SCHEME was generalized to permit labelled functions to be defined using any of the same three formats permitted by DEFINE in [Revised Report]. RABBIT has been updated to reflect this change, and the code for it appears in the Appendix.

(Note Heap-Allocated Contours) RABBIT maintains heap-allocated environments as a simple chained list of variable values. However, all the variables which ere added on at once as a single set may be regarded as a new 'contour' in the Algol sense. Such contours could be heap-allocated arrays (vectors), and so an environment would be a chained list of such little arrays. The typical Algol implementation technique using a "display" (a margin array whose elements point at successive elements (contours) of the environment chain) is clearly applicable here. One advantage of the list-of-all-values representation actually used in RABBIT is that null contours automatically add no content to the environment structure, which makes it easier to recognize later, in the code generator, that no environment adjustments are necessary in changing between two environments which differ only by null contours (see the code for ADJUST-KNOWN!-'N-CENV in the Appendix).

[Page 109]

99 (Note Loop Unrolling} In the case of a LABELS used to implement a loop, the substitution of a labelled function for the variable which names it would constitute an instance of loop unrolling [Allen and Cocke], particularly if the substitution permitted subsequent optimizations such as eliminating dead code. Here, as elsewhere, a specific optimization technique falls out as a consequence of the more general technique of beta-conversion.

(Note Multiple-Argument Continuations} One could easily define a SCHEME-like language in which continuations could take more than one argument (that is, functions could return several values); see the discussion in [Declarative]. We have elected not to provide for this in SCHEME and RABBIT.

(Note Non-deterministic CPS Conversion) As with optimization, so the conversion to continuation-passing style involves decisions which ideally could be made non-deterministically. The decisions made at this level will affect later decisions involving register allocation, etc., which cannot easily be foreseen at this stage.

[Page 110]

100 (Note Non-deterministic Optimization) To simplify the implementation, RABBIT uses only a deterministic (and very conservative) optimizer. Ideally, an optimizer would be non-deterministic in structure; it could try an optimization, see how the result interacted with other optimizations, and back out if the end result is not as good as desired.

We have experimented briefly with the use of the AMORD language [Doyle] to build a non-deterministic compiler, but have no significant results yet.

We can see more clearly the fundamental unity of macros and other optimizations in the light of this hypothetical non-deterministic implementation.

Rather than trying to guess ahead of time whether a macro expansion or optimization is desirable, it goes ahead and tries, and then measures the utility of the result. The only difference between a macro and other optimizations is that a macro call is an all-or-nothing situation: if it cannot be expanded for some reason, it is of infinite disutility, while if it can its disutility is finite.' This leads to the idea of non-deterministic macro expansions, which we have not pursued.

(Note Non-quoted ASET} The SCHEME interpreter permits one to compute the name of the variable, but for technical and philosophical reasons RABBIT forbids this. We shall treat "ASET'" as a single syntactic object (think "ASETQ").

Hewitt (private communication) and others have objected that the ASET primitive is "dangerous" in that one cannot predict what variable may be clobbered, and in that it makes one dependent on the representation of variables (since one can "compute up" an arbitrary variable to be set). The first is a valid objection on the basis of programming style or programing philosophy.

[Page 111]

101 (Indeed, on this basis alone it was later decided to remove ASET from the SCHEME language, leaving only ASET' in [Revised Report].) The second is only slightly true; the compiler can treat ASET with an non-quoted first argument as a sort of macro. Let Vl, VZ, ..., VN be the names of the bound variables accessible to the occurrence of ASET in question. These names are all distinct, for if two are the same, one variable "shadows" another, and so we may omit the one shadowed (and so inaccessible). Then we may write the transformation:

(ASET a b) => ((LAMBDA (Q1 QZ) (COND ((EQ Q1 'Vl) (ASET' V1 Q2)) ((EQ Q1 'VZ) (ASET' VZ QZ)) ((EQ Q1 'VN) (ASET' VN QZ)) (T (GLOBAL-SET P Q1 Q2)))) a b) This transformation is to be made after the alpha-conversion process, which renames all variables; Q1 and Q2 are two more generated variables guaranteed not to conflict with V1, ..., VN. This expansion makes quite explicit the fact that we are comparing against a list of symbols to decide which variable to modify.

The actual run-time representation of variables is not exploited, the one exception being the GLOBAL-SET operator, which raises questions about the meaning of the global environment and the user interface which we are not prepared to answer.

(See also (Note ASET' Is Imperative).)

[Page 112]

102 (Note Old CPS Algorithm) We reproduce here Appendix A of [Declarative]:

Here we present a set of functions, written in SCHEME. which convert a SCHEME expression from functional style to pure continuation-passing style.

(Note PLASMA CPS) user' Gsrnemmun 0) (DEFINE cznrmw (LAMBDA (x) (Innova (CONS x (EXPLODEN user' aznrznlmun (+ czumevnun 1))))))) GENTEMP creates a new unique symbol consisting of a given prefix and a unique number.

(DEFINE CPS (LAMBDA (SEXPR) (SPRINTER (CPC SEXPR NIL 'fC0||Tl)))) CPS (Continuation-Passing Style) is the main function; its argument is the expression to be converted. It calls CPC (C-P Conversion) to do the real work, and then calls SPRINTER to pretty-print the result, for convenience. The symbol #CONT# is used to represent the implied continuation which is to receive the value of the expression.

[Page 113]

103 (DEFINE CPC (LAMBDA (SEXPR ENV CONT) (COND ( (ATOM SEXPR) (CPC-ATOM SEXPR ENV CONT)) ((EO (CAR SEXPR) 'OUOTE) (IF CONT '(,CONT ,SEXPR) SEXPR)) ((EO (CAR SEXPR) 'LAMDDA) (CPC-LAMBDA SEXPR ENV CONT)) ((EO (CAR SEXPR) 'lF) (CPC-lr sexvn suv CoNr)) ((EO (CAR SEXPR) °CATCN) (CPC-CATCH SEXPR ENV CONT)) ((E0 (CAR SEXPR) 'LABELS) (CPC-LABELS SEXPR ENV CONT)) ((AND (ATOM (CAR SEXPR)) (GET (CAR SEXPR) 'ANACRo)) (CPC (FUNCALL (GET (CAR SEXPR) 'AMACRo) SEXPR) ENV CONT)) (1 (CPC-FORM s£xPR env CONT))))) CPC merely dispatches to one of a number of subsidiary routines based on the form of the expression SEXPR. ENV represents the environment in which SEXPR will be evaluated; it is a list of the variable names. when CPS initially calls CPC, ENV is NIL. CONT is the continuation which will receive the value of SEXPR. The double-quote (") is like a single-quote, except that within the quoted expression any subexpressions preceded by Comma (,) are evaluated and substituted in (also, any subexpressions preceded by atsign (@) are substituted in a list segments).

One special case handled directly by CPC is a quoted expression; CPC also expands any SCHEME macros encountered.

(DEFINE CPC-ATOM (LAMBDA (s£xPR ENV CONT) ((LAMBDA (Ar) (xr Conv '(,CONT ,A1) AT)) (CoND ((NUMBERP SEXPR) sexrn) ((MEMO SEXPR ENV) sexva) ((ser SEXPR 'CPS-NAME)) (T (IMPLODE (CoNs 'x (EXPLODEN SEXPR)))))))) For Convenience, CPC-ATOM will change the name of a global atom. Numbers and atoms in the environment are not changed; otherwise, a specified name on the property list of the given atom is used (properties defined below convert '+'

[Page 114]

104 into "++", etc.); otherwise, the name is prefixed with 'X'. Once the name has been converted, it is converted to a form which invokes the continuation on the atom. (lf a null continuation is supplied, the atom itself is returned.) (DEFINE CFC-LANBDA (LAMBDA (SEXPR ENV CONT) ((LAHBDA (CN) ((LAMBDA (LX) (IF CONT '(,CON'1' ,LX) LX)) "(LAMBDA (!(CADR SEXPR) ,Cl() ,(CPC (CAIJDI SEXPR) (APPEIU (CADR 5£XF|l) (CONS CN £NV)) Cll)))) (sznmw 'c)))) A LAMBDA expression must have an additional parameter, the continuation supplied to its body, added to its parameter list. CN holds the name of this generated parameter. A new LAMBDA expression is created, with CN added. and with_its body converted in an environment containing the new variables. Then the same test for a null CONT is made as in CPC-ATOM.

(DEFINE CPC-IF (LAMBDA (SEXPR EIN CONT) ((LAHlDA (KI) -mmm (Jun _(orc (cm szxva) :uv ((\.AN!DA (ru) '(LAHlDA (,Pl) (xr ,rn .qcvc (canon szxrn) env KI) ,(crc (cannon szxm nv Kll)))) (sznrtm 'r)))) .CONT)) (eenrzne |<)))) First, the continuation for an IF must be given e name KN (rather, the name held in KN; but for convenience, we will continue to use this ambiguity, for the form

[Page 115]

105 of the name is indeed Kn for some number n), for it will be places and we wish to avoid duplicating the code. Then, converted to continuation-passing style, using a continuation the result and call it PN. This continuation will then use an converted consequent to invoke. Each consequent is converted KN.

(DEFINE CPC-CATCH (LAMBDA (SEXPR ENV CONT) ((LAMBDA (EN) '((LAMBDA (,EN) ((LAHBDA (,(CADR SEXPR)) .(CPC (CADDR SEXPR) (CONS (CADR SEXPR) ENV) EN)) (LAMBDA (V C) (,EN V)))) ,CONT)) (GENTEMP 'E)))) referred to in two the predicate is which will receive IF to decide which using continuation This routine handles CATCH as defined in [Sussman 75], and in converting it to continuation-passing style eliminates all occurrences of CATCH. The idea is to give the continuation a name EN, and to bind the CATCH variable to a continuation (LAMBDA (V C) ...) which ignores its continuation and instead exits the catch by calling EN with its argument V. The body of the CATCH is converted using continuation EN.

(DEFINE cwc-LABELS (LAMBUA (ssxvn :uv CONT) (no ((x (CADR sexvn) (con x)) (Y env (CONS (cAAn x) v))) ((NULL x) (no ((v (cnun sexra) (con v)) (z NIL (cons (LIST (CAAR w) (CPC (CADAR v) Y nxL)) l))) ((NULL w) "(LABELS .(Rcv£nse z) ,(CPC (CADDR szxrn) v cour))))))))

[Page 116]

106 Here we have used DO loops as defined in l'iacLISP (DO is implemented as a macro in SCHEME). There are two passes, one performed by each DO. The first pass merely collects in Y the names of all the labelled LAMBDA expressions. The second pass converts all the LAMBDA expressions using a null continuation and an environment augmented by all the collected names in Y, collecting them in Z. At the end, a new LABELS is constructed using the results in Z and a converted LABELS body.

(DEFINE cvc-ronn (LAMBDA (sexvn env CONT) (LABELS ((LooP1 (LAMBDA (x Y z) (IF (NULL x) (oo ((r (REVERSE (cons cont Y)) (xr (NULL (can z)) r (crc (CAR z) env '(LAHBDA (.(cAn Y)) ,r)))) (Y Y (CDR Y)) (Z Z (CDR Z))) ((NULL z) r)) (COND ((on (NULL (CAR x)) (ATOM (CAR x))) (LO0P1 (con x) (cons (crc (CAR x) env NIL) Y) (cons NIL z))) ((£0 (CAIR X) '0U0T£) (Loovl (con x) (cons (can x) Y) (cons NIL z))) ((£o (CAAR x) 'LAHBDA) (LOOP) (con x) (cons (cvc (cAn x) inv NIL) Y) (cons NIL z))) (1 (LOOP) (CDR x) (CONS (GENTE P '1) Y) (CONS (CAR X) Z)))))))) (LooP1 sexrn NIL NIL)))) This, the most complicated routine, converts forms (function calls). This also operates in two passes. The first pass, using LO0P1, uses X to step down the expression, collecting data in Y and Z. At each step, if the next element of X can be evaluated trivially, then it is converted with a null continuation and

[Page 117]

107 added to Y, and NIL is added to Z. Otherwise, a temporary name TN for the result of the subexpression is created and put in Y, and the subexpression itself is put in Z. On the second pass (the DO loop), the final continuation-passing form is constructed in F from the inside out. At each step, if the element of Z is non-null, a new continuation must be created. (There is actually a bug in CPC-FORM, which has to do with variables affected by side-effects. This is easily fixed by changing LO0P1 so that it generates temporaries for variables even though variables evaluate trivially. This would only obscure the examples presented below, however, and so this was omitted.) (LABELS ((BAA (LAMBDA (DUMMY x Y) (IF (NULL x) '1cPs ready to gas) (BAA (Pu1PRoP (CAR x) (CAA Y) 'cps-MAME) (CDR X) (CDR Y)))))) (BAR NIL '(+ - i // " T NIL) *(++ if //// "" 'T 'NlL))) This loop sets up some properties so that "+" will translate into "++' instead of "%+", etc.

Now let us examine some examples of the action of CPS. First, let us try our old friend FACT, the iterative factorial program.

(DEFINE FACT (LAMBDA (M) (LABELS ((FACTl (LAMBDA (M A) (IF (I M 0) A (FACT1 (- H 1) (* H A)))))) (FACTI N l)))) Applying CPS to the LAMBDA expression for FACT yields:

[Page 118]

108 (OCONTQ _ (LAMBDA (N C7) (LABELS ((FACT1 (LAMBDA (M A CIO) ((LAM8DA (Kll) (Xl M 0 (LAMBDA (PIZ) (IF P12 (Kll A) ( M 1 (LAMBDA (T13) (QQMA (LAMBDA (TIA) (FACT) T13 T14 K1l))))))))) C10)))) (FACT) N 1 C7)))) . As an example of CATCH elimination, here is a routine which is a paraphrase of the SQRT routine from [Sussman 75]:

(DEFINE SORT (LAMBDA (X EPS) ((LAMBDA (ANS LOOPTAG) _ (CATCH RETURNTAG (BLOCK (A$ET' LOOPTAG (CATCH H H)) (IF ~ (RETURNTAG ANS) NIL) (AS£T' AIS II) (LUOPTAG LO0PTA6)))) 1.0 Nil-H) Here we have used "-" and "===' as ellipses for complicated (and relatively uninteresting) arithmetic expressions. Applying CPS to the LAMBDA expression for SQRT yields:

[Page 119]

109 (#CONT# (LAMBDA (x EPs C33) ((LAMBDA (ANS LooP1As C34) ((LAMDDA (sas) ((LAHBDA (RETURNTAG) ((LAMBDA (ESZ) ((LAMBDA (M) (£52 M)) (LAMBDA (v c) (E5Z v)))) (LAMBDA (151) (ZASET' LOOPTAG T51 (LAMBDA (137) ((LAMaDA (A s cas) (D C36)) T37 (LAMBDA (cao) ((LAMBDA (xar) ((LAH8DA (Pao) (IF Pso (RETURNTAG ANS K47) (K47 'NIL))) X)) (LAMBDA (142) ((LAMBDA (A B Cdl) (B C4l)) T42 (LAMBDA (C43) (XASET' ANS x==-C40)))) . £a5)))))) (LAMBDA (v c) (sas v)))) ca4)) 1.0 'nxL C33))) Note that the CATCHes have both been eliminated. the reader to verify that the continuation-passing semantics of the original.

(LAMBDA (T45) ((LAMDDA (A D can) (D cAa)) T45 (LAMDDA (c4s) (LOOPTAG LOOPTAG C46)) C43)))) It is left as an exercise for version correctly reflects the

[Page 120]

110 (Note Operations on Functions) It would certainly be possible to define other operations on functions, such as determining the number of arguments required, or the types of the arguments and returned value, etc. (Indeed, after the dissertation was written, it was decided to include such an operator PROCP in [Revised Report].) The point is that functions need not conform to a specific representation such as S-expressions. At a low level, it may be useful to think of invocation as a generic operator which dispatches on the particular representation and invokes the function in an appropriate manner. Similarly, a debugging package might need to be able to distinguish the various representations. At the user level, however, it is perhaps best to hide this issue, and answer a type inquiry with merely "function".

(Note Refinement of RABBIT) Since the original dissertation was written I have continued to refine and improve RABBIT. This effort has included a complete rewriting of the optimizer to make it more efficienct and at the same time more lucid. It also included accommodation of changes to SCHEME as documented in [Revised Report].

This work has spanned perhaps eight months' time, because the availability of computer time restricted me to testing RABBIT only once or twice a night. Thus, the actual time expended for the improvements was much less than ten hours a week.

[Page 121]

111 (Note Side-Effect Classifications) The division of side-effects into classes in RABBIT was not really necessary to the primary goals of RABBIT, but was undertaken as an interesting experiment for our own edification. One could easily imagine a more complex taxonomy. A case of particular interest not handled by RABBIT is dividing the ASET side-effect into ASET of each particular variable; thus an ASET on FOO would not affect a reference to the variable BAR. This could have been done in an ad hoc manner, but we are interested in a more general method dealing only with sets of effects and affectabilities.

(Note Subroutinization} we have not said anything about how to locate candidate expressions for subroutinization. For examples of appropriate strategies, see [Geschke] and [Aho, Johnson, and U11man]. Our point here is that SCHEME, thanks to the property of lexical scoping and the ability to write 'anonymous' functions as LAMBDA-expressions, provides an ideal way to represent the result of such transformations.

[Page 122]

112 (Note Tail-Recursive OR) Since the dissertation was written, the SCHEME language was redefined in [Revised Report] to prescribe a "tail-recursive" interpretation for the last form in an AND or OR. This requirement necessitated a redefinition of OR which is in fact dual to the definition of AND.