Whitespace Interpreter Revisited

I came across what was near-to-one of my first Rebol programs (the first one using PARSE).

It was an interpreter for a language called "Whitespace", where all the instructions are made out of tab, space, and newline characters. My idea was to use PARSE to actually evaluate the code:

https://github.com/hostilefork/whitespacers/blob/master/rebol/whitespace.reb

The idea of using parse as the interpreter is an interesting take...but the code is far too repetitive. It was clearly written before I knew much about COMPOSE or really using the generative side of things.

Ideas on Doing it "Right"?

Currently there's a split between the specification and the implementation. That might be desirable in some cases, but I don't think it's desirable here.

The instructions are grouped together, which lets them "inherit" their leading sequence. For example, putting push [space Number] and duplicate-top [lf space] in "Stack-Manipulation" container lets them inherit the leading space common to each instruction. Otherwise they'd have to be spelled out as [space space Number] and [space lf space].

But beyond that, being inside a grouping pulls them out of "global scope", which means the instructions aren't overwriting the definitions of things like add and subtract with the whitespace versions.

Reducing The Repetition

The first thing I think of is merging the functions with the whitespace spec, and doing it more dialect-style:

    ; The 2010 JavaScript-Like Way
    ...
push: [
	command: [space Number]
	description: {Push the number onto the stack}
]
duplicate-top: [
	command: [lf space]
	description: {Duplicate the top item on the stack}
]
    ...
push: func [value [integer!]] [
    insert stack value
    return none
]
duplicate-top: func [] [
    insert stack first stack
    return none
]

Being relatively liberal in syntax here, we can imagine doing this more like:

; A Modern Dialected Way

push: ??? [
    {Push the number onto the stack}
    space (value <number!>)
][
    insert stack value
    return none
]

duplicate-top: ??? [
    {Duplicate the top item on the stack}
    lf space
][
    insert stack first stack
    return none
]

This is a little push in the right direction. But we have some issues to think through.

  • Whatever we're building here as units of the language need to be "registered" in the complete list of operations. If they were global declarations run by the evaluator, then one of the things the ??? operator would have to do would be this registration. If they are not global, then registration could be done by their container that is invoking them.

  • We still have to inherit the leading space that both of these instructions have. An interesting point in the PARSE branching is that we benefit from having the knowledge that these all start with space, so that the space can lead into these tests as alternates...e.g. [space [lf space | space <Number>]] will be more efficient than [space lf space | space space <Number>].

  • If we decide that the container is the boss, then giving a name to ??? is not really necessary. But what happens if you leave it off?

    ... Stack-Manipulation ... [
        push: [
            {Push the number onto the stack}
            space (value <number!>)
        ][
            insert stack value
            return none
         ]
         ...
     ]
    

    This is a counterintuitive mixture of a SET-WORD! with two blocks after it. How do we reconcile the idea of "declaration-like-things" inside of a container like this? The code needs to be in a block, the other things don't...so other notations are possible...

    ... Stack-Manipulation ... [
        PUSH
        {Push the number onto the stack}
        space (value <number!>) [
            insert stack value
            return none
         ]
         ...
     ]
    

I guess the easiest thing to do is to cave, and just come up with something for ???. It has the benefit that you can actually run your implementation of Stack-Manipulation through the evaluator and do normal assignments. So something like push: instruction [...] [...]

Yet there's something a bit dissatisfying about having to type INSTRUCTION redundantly each time...where you're also kind of claiming that word so it can't be used other ways.

Lots Of Other Questions...

I think it would be worthwhile to rework this, but it's worth taking note of how adrift you can get right out of the gate looking at it. There are lots of big questions that don't really have obvious answers.

1 Like

Do you mind if I analyze this and golf it?

I certainly wouldn't mind help on revisiting this!! And if you wanted to blog about it and build some credit for it yourself, you could basically "own" it even if was a collaborative effort. I'd be happy to hand it off...though I do think I could provide good guidance on direction.

As I say, I wrote this a DECADE ago, before I really had any real experience with the language, I was definitely a newbie. So it resulted in something quite bloated for what it is.

But the gimmick is that I'd learned that I could write things like:

parse data [tab space tab | space space tab]

And it would look up the words to get the CHAR! values, and match that. So then when I learned I could write:

rule1: [tab space tab]
rule2: [space space tab]
parse data [rule1 | rule2]

I thought "hey, this could be a pretty literate way of breaking down the spec for whitespace, as code." Had I been more experienced and ready to abstract it, I'd have taken the next step:

all-rules: []
instruction: func [block] [
    append all-rules block
    append all-rules '|
]
rule1: instruction [tab space tab]
rule2: instruction [space space tab]
parse data all-rules

Which is more in the spirit of what you'd want, to avoid having to write out the big aggregate rule manually.

But what inspired the "parse as whitespace interpreter" was around when I'd just found out about the ability to mark and seek parse positions:

>> string: "aaaabbbb"

>> parse string [some "a" pos: some "b"]  ; mark (seek would be :pos)
== #[true]  ; R3-Alpha returned true if the parse reached the end of input

>> pos
== "bbbb"

So I could actually use the parse position as the instruction pointer. Whether this was a good idea or not, I thought it would be interesting to try.

Red has a pretty good overview of PARSE, if you haven't seen my links to it already. Would be good to have a handle on how that works before trying to hack on this.

Making It Suck Less, Even in R3-Alpha, Seems a Good First Goal

I outlined kind of a vision for what the "whitespace definition dialect" might look like. Basically, coming up with a language variant on the fly for making whitespace. The picture I paint is something a bit like this:

duplicate-indexed: instruction [
    {Copy Nth item on the stack (given by the argument) to top of the stack}
    tab space [index: Number]
][
    insert stack pick stack index
    return none
]

So what we have here puts together:

  • The name of the instruction. Using a SET-WORD! for this seems sensible...but if we run it through normal assignment, we'd still like the instruction to know the name it was declared with (for display in debugging). This name would have to be enfixedly quoted if the instruction was run through normal evaluation...or the container would need to make the connection.

  • A description of the instruction. These were originally taken from the whitespace spec, though I think shortening them down to fit in 80 columns is a good idea.

  • The sequence of whitespace characters that represent the instruction. I think that using the plain words tab, space, and newline

  • Inline with the instruction sequence, a definition of whitespace-encoded parameter values and the name of that parameter. There's Number and Label. Here I put that in a block. There's plenty of other ways you could think of writing it, some which might look more like how Rebol does function parameter definitions in its FUNC spec dialect:

      [<tab> <space> index [number!]]
    

    ...but like I say, there's something appealing to me about the idea that the "common case" of the whitespace characters here just be words. The free-wheeling use of words is kind of foundational in Rebol (want to call something IF in a context? Go ahead...) so having the named-parameters be the odd case feels better.

  • Some code to invoke the behavior, where any parameters defined in the spec are bound (in this case index). This code returned the new position or "none" (in R3-Alpha vernacular) to say it wasn't an instruction that changed the position. That doesn't seem like a bad idea, but it would be nice if it composed in a "return none" by default at the end of the code you provide... so that if you didn't make the body end in a RETURN it would not change the position.

Game Plan

I hope maybe you see "the vision" of how it would rapidly-bend to make a "whitespace implementation dialect". This is what I try to get at with the "Minecraft-of-programming". You're building a new language with the seeming ease that you would make a new function in most other systems...

The code needs a first pass just to make it more of a good example--instead of looking like someone's first way-too-long PARSE program.

While converting it to Ren-C might seem the most obvious first step... it might be better to get it working under "Redbol". There need to be more stable test cases for that (ultimately I'd like to be able to get a high percentage of Red's tests to pass!)

So I thought that converting it to Red first might actually be cool. Maybe even do some of the first pass of improvements in Red...which might be an easier entry for you, since they have a GUI console and distribute binaries (albeit 32-bit ones).

However... I was hit pretty immediately by the fact that Red can't use path lookup in parse rules. :roll_eyes:

(I could easily go off on a rant here about the fact that there's a miniscule audience for wanting to run their programs super-super-fast who do not care at all about any kind of formal rigor in the language.
To the extent such people do exist, a Rebol-like language would still never be the right pick. There's far faster and better-vetted languages...so the advantage which should be touted is expressivity. Every time I see this kind of boneheaded response to an issue it reinforces that they just do not understand what the language is actually good for.)

So I guess biting the bullet and making you a Ren-C developer would make sense, if you're up for that. Are you able to compile a version for your local OS of choice?

(My general suggestion for most development--even if one is on Windows--is to use a VirtualBox Linux to build and run. The tooling is just better. You can still keep the files and edit on the host, e.g. in VSCode, if you put the code in a shared folder...I guess I could make a screencast going through the steps of doing this.)

As for me right now, I'll see if I can avoid falling asleep, and get it running under Redbol emulation in Ren-C...

UPDATE

Okay, first step taken! Patched up Redbol to be able to run it. In the web console even. (It's slow, but what's really slow is actually the printing...)

You can see it work with these two commands on the replpad page:

>> redbol

>> do https://github.com/hostilefork/whitespacers/blob/master/rebol/whitespace.reb

That version can stay like that as a museum piece. I've started a new file to use for work in bleeding-edge Ren-C. Not much needed to be changed.

2 Likes

In terms of "neat features that could maybe just fall out"... Rebol's PARSE can work on symbolic data as well as strings. It might be cool to exploit that, in a way that could make debugging easier.

To elaborate: the WORD!s tab, space (sp), and lf (or, preferably when not code golfing, newline) look up to CHAR! by default. So when you use them in PARSE they are looking for characters in a string:

 >> string: "^- ^/^/^-"  ; escape codes for `tab space newline newline tab`

 >> parse string [tab space newline newline tab]
 == #[true]  ; in R3-Alpha, true if parse succeeded. Ren-C returns input.

Now if you start talking about symbolic blocks, you could also have a BLOCK! of CHAR! values. Not as compact as a string, since it costs a value-cell-per-character.

 >> block: [#"^-" #" " #"^/" #"^/" #"^-"]   ; 5-element block

 >> parse block [tab space newline newline tab]
 == #[true]

You could REDUCE to get that block of CHAR! values by looking up the WORD!s:

 >> block: reduce [tab space newline newline tab]
 == [#"^-" #" " #"^/" #"^/" #"^-"]

And you can merge the characters in that block into a string if you want. This is what %whitespace.reb was doing.

But you could also match WORD!s in blocks.

 >> block: [tab space newline newline tab]

 >> parse block ['tab 'space 'newline 'newline 'tab]
 == #[true]

 >> parse block ['tab 'space 2 'newline 'tab]
 == #[true]

So what I'm thinking might be interesting would be if a debug mode of %whitespace.reb were to use blocks instead of strings. To give an idea of what I'm talking about:

 either debug [
     tab: 'tab
     space: 'space
     newline: 'newline
 ][
     tab: #"^-"
     space: #" "
     newline: #"^/"
 ]

So now a rule like [tab space newline newline tab] could take on either meaning depending on the mode.

This isn't a terribly profound thought, but maybe inspires some creative thinking. I've seen some parser combinators that don't have the duality of being able to work on symbols and strings, even in something like Clojure. So just wanted to interject that Rebol parse rules have that duality, and maybe it'd be interesting in this problem.

(I also think it would be ideal to have it use what it knows to auto-generate help, cheaply!)


Making the system have legible debug output nearly "for free" is a good goal. It's already making a step in that direction... just for reference, here's what the R3-Alpha output of the program is. Though I changed the completely invisible PRINT to PRINT MOLD...which still doesn't help you with seeing whitespace at end of line (which is why I'm suggesting doing the transformation to a block when debugging is requested):

WHITESPACE INTERPRETER FOR PROGRAM:
---
{   ^-

   ^-    ^-^-
 
 ^-
 ^-   ^- ^- 
^-
     ^-
^-    
    ^- ^-^-
^-  ^-
^-  ^-   ^- ^-

 
 ^-    ^-^-

   ^-   ^- ^-
 




}
---
LABEL SCAN PHASE
( [mark-location 67] )
( [mark-location 69] )
make map! [
    67 17
    69 96
]
---
( [push 1] )
( [duplicate-top] )
( [output-number-on-stack] )
1
( [push 10] )
( [output-character-on-stack] )


( [push 1] )
( [do-arithmetic 'add] )
( [duplicate-top] )
( [push 11] )
( [do-arithmetic 'subtract] )
( [jump-if-zero] )
( [jump-to-label] )
( [duplicate-top] )
( [output-number-on-stack] )
2
( [push 10] )
( [output-character-on-stack] )


( [push 1] )
( [do-arithmetic 'add] )
( [duplicate-top] )
( [push 11] )
( [do-arithmetic 'subtract] )
( [jump-if-zero] )
( [jump-to-label] )
( [duplicate-top] )
( [output-number-on-stack] )
3
( [push 10] )
( [output-character-on-stack] )


( [push 1] )
( [do-arithmetic 'add] )
( [duplicate-top] )
( [push 11] )
( [do-arithmetic 'subtract] )
( [jump-if-zero] )
( [jump-to-label] )
( [duplicate-top] )
( [output-number-on-stack] )
4
( [push 10] )
( [output-character-on-stack] )


( [push 1] )
( [do-arithmetic 'add] )
( [duplicate-top] )
( [push 11] )
( [do-arithmetic 'subtract] )
( [jump-if-zero] )
( [jump-to-label] )
( [duplicate-top] )
( [output-number-on-stack] )
5
( [push 10] )
( [output-character-on-stack] )


( [push 1] )
( [do-arithmetic 'add] )
( [duplicate-top] )
( [push 11] )
( [do-arithmetic 'subtract] )
( [jump-if-zero] )
( [jump-to-label] )
( [duplicate-top] )
( [output-number-on-stack] )
6
( [push 10] )
( [output-character-on-stack] )


( [push 1] )
( [do-arithmetic 'add] )
( [duplicate-top] )
( [push 11] )
( [do-arithmetic 'subtract] )
( [jump-if-zero] )
( [jump-to-label] )
( [duplicate-top] )
( [output-number-on-stack] )
7
( [push 10] )
( [output-character-on-stack] )


( [push 1] )
( [do-arithmetic 'add] )
( [duplicate-top] )
( [push 11] )
( [do-arithmetic 'subtract] )
( [jump-if-zero] )
( [jump-to-label] )
( [duplicate-top] )
( [output-number-on-stack] )
8
( [push 10] )
( [output-character-on-stack] )


( [push 1] )
( [do-arithmetic 'add] )
( [duplicate-top] )
( [push 11] )
( [do-arithmetic 'subtract] )
( [jump-if-zero] )
( [jump-to-label] )
( [duplicate-top] )
( [output-number-on-stack] )
9
( [push 10] )
( [output-character-on-stack] )


( [push 1] )
( [do-arithmetic 'add] )
( [duplicate-top] )
( [push 11] )
( [do-arithmetic 'subtract] )
( [jump-if-zero] )
( [jump-to-label] )
( [duplicate-top] )
( [output-number-on-stack] )
10
( [push 10] )
( [output-character-on-stack] )


( [push 1] )
( [do-arithmetic 'add] )
( [duplicate-top] )
( [push 11] )
( [do-arithmetic 'subtract] )
( [jump-if-zero] )
( [discard-top] )
( [end-program] )
Program End Encountered
stack: []
callstack: []
heap: make map! [
]
2 Likes

Well... I've hacked it up to get it to look like the proposal. It's raised some questions (including ones that are known and have discussions here, especially pertaining to binding).

But it's rather cool. And now the ren-c variant runs in the ReplPad without redbol:

>> do https://github.com/hostilefork/whitespacers/blob/master/ren-c/whitespace.reb

I did the changes in individual commits in the git log, corresponding to the points I raised. So you can sort of follow along, and I think it would be illuminating to take a look at each phase--even if some behind-the-scenes parts look like gibberish (for now):

There's a lot of ugly hacking to get here, so it's going to need cleanup and more pushing on system features to make this work. But if it were easy to make something that blended source fragments together in this style and have it still run, some other language would have done it.

Let me know if any of the above is confusing or makes sense, @razetime. Note that if you're reading a diff on GitHub, you can click on the diff line and comment directly on it... to ask questions or make remarks. Please don't hesitate to do so if you have anything to ask or suggest...it's not spam, it's desperately needed peer review and commentary.

(I continue to hope that the "Minecraft of Programming" vision emerges from the process and line of thinking. The code for defining OPERATION and CATEGORY themselves is a lot uglier than I'd like, and we have to keep fiddling with that until it can make clear sense. But the fact that it's even possible to write--and make the usages of operation and category look that natural--shows the goalposts.)


Note: I used the term "Instruction" for the aggregate of the whitespace opcode plus its arguments (which I form as a BLOCK! in the runtime). Then "Operation" for the name like push or jump-if-zero, based on precedent from assembly language. The sequence of ASCII chars is still the "Command" for now due to the term being in the whitespace spec, though I guess that could change, possibly to "opcode"? I think terminology is important, so this should all be nailed down and used consistently...