The Sherman Rebol-To-Scheme Compiler

hostilefork · December 11, 2023, 9:30pm

Joe Marshall was one of the developers of MIT/GNU Scheme, who worked briefly (less than 6 months) at Rebol Technologies in 1998, and wrote the implementation of Rebol 1.0.

On his blog, "Abstract Heresies", Joe has a remark about how his implementation was received:

With it’s Polish notation, tail recursion, and first-class continuations, REBOL was described as an unholy cross between TCL and Scheme. “The result of Outsterhout and Sussman meeting in a dark alley.”

It seems pretty clear that Joe was not a good fit for looking at the problems posed by Rebol, because he just didn't really like the idea. For starters, a post on comp.lang.lisp suggests that Joe was opposed on principle to Rebol's parentheses-less-ness:

[Not using parentheses] turns out to be a bad idea. Not only does it greatly complicate the interpreter and compiler, it turns out to be rather unreadable in practice. As an example, here is a line from the REBOL BBS tutorial:
update-topic topic-id length? msgs first last msgs
Certainly REBOL knows how many arguments each function takes (and which of those identifiers refer to functions), but if you don't know, you can't parse it.

So after his initial implementation, he walked. He calls Rebol 2.0 a "quirky and bizarre language" in which he "doesn't see any opportunity".

It would seem he lacks interest in absorbing what came later--e.g. to find the potential in something like the PARSE dialect He never mentions it or any other dialect, but focuses on the negative side of handling baseline IF-WHILE-FUNC imperative code, without calling out any ideas he sees as interesting.

A Trivial Program Walked Through in "Sherman"

It's clear that Rebol 1.0 was written in C...though some people believe it was written in Scheme. The air was probably muddied a bit because since Lisp is his lingua-franca, he's shared Scheme pseudocode explaining the premises of the evaluator he'd written. (It's likely that he wrote such prototypes before tackling the C implementation, as well).

Also confusing matters a bit is that in 1999 right after leaving Rebol technologies, he published a Rebol-to-Scheme compiler called Sherman:

GitHub - akavel/sherman: A clone of Joe Marshall's Rebol 1.0 to Scheme compiler.

Sherman requires a very old version of Scheme to run, and that only runs on 32-bit machines. Modernizing it is not a good use of anyone's time, but fortunately Windows still runs old 32-bit EXEs... so GitHub user akavel pieced together the steps.

For a start, let's see how it works on a program called trivial.r:

rebol []

declare [standard-definitions]

foo: func [a] [a + 3]
print "Hello world!"
prin "3 + 3 ="
print foo 3

In Step One, he uses a tool called Flex (Fast Lexical Analyzer) to "translate a language like REBOL into a fully parenthesized notation." The small specification R2S.L would turn the above into this:

`(SHERMAN ,(LIST->BLOCK `(,(LIST->BLOCK `())
declare ,(LIST->BLOCK `(standard-definitions ))
foo: func ,(LIST->BLOCK `(a)) ,(LIST->BLOCK `(a + 3))
print "Hello world!"
prin "3 + 3 ="
print foo 3
))) ;; END of SHERMAN CODE

This transformed the code enough so that Scheme could load it for further processing. Here all it did was get rid of the unloadable square brackets and turn them into parentheses that are tagged as LIST->BLOCK. But other transformations would turn literals like 12-Dec-2012 into labeled strings like ,(SHERMAN-DATE "12-Dec-2012"). It did some amount of validation in the process...but not exhaustive, so you could still get some invalid representations that would have to be caught by the runtime.

The next step of transformation makes something that Scheme can execute...although it isn't written in a way that any Scheme programmer would write code. It's just using the language as a runtime/bytecode, and so it looks like what you get from a disassembler:

(#%define-values (foo) (#%values #f))
(#%begin
  (#%list->block (list))
  (#%set! foo
    (#%lambda (a)
      (#%if (#%procedure? a)
        (#%case (#%arity a)
          ((0) (+ (a) 3))
          (else (#%error "Too few arguments." (#%quote a))))
        (+ a 3))))
  (print "Hello world!")
  (prin "3 + 3 =")
  (#%if (#%procedure? foo)
    (#%case (#%arity foo)
      ((0) (#%begin (print (foo))))
      ((1) (print (foo 3)))
      (else (#%error "Too few arguments." (#%quote foo))))
    (#%begin (print foo))))

Notice that where a WORD! occurs the compiled form has a fixed number of switches on the arity. You have foo: func [a] [a + 3], and when it hits the a in the body it checks to see if it's a function...and if it is, then it enforces that it takes no arguments since it's on the left of the plus sign and invokes it. Note that invocation in Lisp means putting a symbol at the head of non-quoted parentheses, so (a) is a "function call" in (+ (a) 3).

But what would this do if you said foo: func [a] [a 10 20 30]? Does it handle arity-0 or arity-1 or arity-2 or arity-3 or error? The answer is...yes:

  (#%if (#%procedure? a)
    (#%case (#%arity a)
      ((0) (#%begin (a) none))
      ((1) (#%begin (a 10) none))
      ((2) (#%begin (a 10 20)))
      ((3) (a 10 20 30))
      (else (#%error "Too few arguments." (#%quote a))))
    (#%begin a none))))

The approach is at odds with the language design. Let's say you wrote something like:

foo: func [a b] [
   if a = b [alpha a b] else [alpha b a beta a b]
]

You'd start with a hardcoded switch statement to cover the case where if is anywhere from arity-0 to arity-6 to consume everything in the body. Then for each of those branches, you'd have the combinatorial explosion... the arity-6 case has to consider if a is arity-0 to arity-5, but so does the arity-5 case.

The explosion is pathological. Just to show how much so, let's change it to foo: func [a] [a a a a]. The result is so big I'm just going to paste a picture of it broken into columns:

He Knew This Wasn't Very Useful

This isn't something Joe doesn't know about. On MIT's "Lightweight Languages" mailing list, Joe had this remark:

Rebol is extremely difficult to compile in any meaningful sense. Since the parse tree can change dynamically at runtime, there is a combinatoric explosion of code paths. You can constrain things a bit by allowing the user to declare procedure arity, though.

The notes say:

"I decided to see if I couldn't write a compiler for the language. The main challenge is that the language is context-sensitive and parsing depends on runtime values. It isn't possible to build a useful abstract syntax tree at compile time because it is impossible to determine the role each identifier plays until it's value is known. Furthermore, an identifier may indicate a variable at some times or a function call at others."

"The trick of Sherman is to simply generate the control flow for all the parse options and dispatch to the correct one at runtime. This makes a combinatorical explosion, but it is possible to prune the control flow tree if the compiler can make assumptions about certain identifiers."

It would generally seem you'd have to hybridize an evaluative approach vs. compiling if you allowed any symbols with unknown arity...as even super short examples like the [a a a a] above are prohibitive. But if you were already putting in some amount of compilation then maybe cases like [a 1] are worth compiling.

Putting Sherman In Context

So what was the value of publishing this non-viable Rebol compiler?

It's an experiment that I'd doubt took him more than a few days. I'd imagine it was done as a palate-cleanser to return to Scheme and demonstrate how much of a "new" problem can be attacked when recast using infrastructure that people have developed through experience.

And perhaps its his way of saying: "if for some reason you think there's value in Rebol's notational choices, you can have your source look like that... just run a transformation into Scheme."

(Perhaps how I look at being able to bend Ren-C to do things Red does... it's not that I think their idea is necessarily good, I just want to check to make sure I can do it.)

In any case, this addresses a rumor I heard that Rebol 1.0 was written in Scheme. But it seems it was interpreted in a way we would think of today...and more in the spirit of Ren-C...which can indeed handle things like tail-call recursion.

Also, he thought it was a requirement for Rebol to be stackless. He didn't think you could solve it by writing a few native functions like IF and WHILE in C to get them to use less stack: some stack is still consumed, and users will still be writing their own loops and control constructs. Of course I agree (and Ren-C is stackless), but I can imagine that Carl's desire for expedience vs. doing continuations would be a point of conflict.