Whitespace Interpreter Revisited

hostilefork · January 30, 2021, 5:09pm

I came across what was very nearly one of my first Rebol programs (the first one using PARSE).

It was an interpreter for a language called "Whitespace", where all the instructions are made out of tab, space, and newline characters. My idea was to use PARSE to actually evaluate the code:

https://github.com/hostilefork/whitespacers/blob/master/rebol/whitespace.reb

The idea of using parse as the interpreter is an interesting take...but the code is far too repetitive. It was clearly written before I knew much about COMPOSE or really using the generative side of things.

Ideas on Doing it "Right"?

Currently there's a split between the specification and the implementation. That might be desirable in some cases, but I don't think it's desirable here.

The instructions are grouped together, which lets them "inherit" their leading sequence. For example, putting push [space Number] and duplicate-top [lf space] in a "Stack-Manipulation" container lets them inherit the leading space common to each instruction. Otherwise they'd have to be spelled out as [space space Number] and [space lf space].

But beyond that, being inside a grouping pulls them out of "global scope", which means the instructions aren't overwriting the definitions of things like add and subtract with the whitespace versions.
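
To make the grouping idea concrete, here is a rough Python sketch (not Rebol, and not the code in the repository). The names CATEGORIES and full_sequences are made up for illustration; only the two opcodes come from the text above. It shows operations living inside their category, so nothing global gets overwritten, and the leading space being inherited instead of spelled out:

```python
# Hypothetical structure for illustration; only the two opcodes are from the post.
SPACE, TAB, LF = " ", "\t", "\n"

CATEGORIES = {
    "Stack-Manipulation": {
        "prefix": [SPACE],  # inherited by every operation in the category
        "operations": {
            "push": [SPACE, "Number"],
            "duplicate-top": [LF, SPACE],
        },
    },
}

def full_sequences(categories):
    """Return {(category, operation): prefix + command}."""
    table = {}
    for cat_name, cat in categories.items():
        for op_name, command in cat["operations"].items():
            # Keyed by category, so "add" or "push" never land in a global scope.
            table[(cat_name, op_name)] = cat["prefix"] + command
    return table

table = full_sequences(CATEGORIES)
print(table[("Stack-Manipulation", "push")])
```

The "Number" entry is just a placeholder string here; decoding the argument is a separate concern.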

Reducing The Repetition

The first thing I think of is merging the functions with the whitespace spec, and doing it more dialect-style:

; The 2010 JavaScript-Like Way
...
push: [
    command: [space Number]
    description: {Push the number onto the stack}
]
duplicate-top: [
    command: [lf space]
    description: {Duplicate the top item on the stack}
]
...
push: func [value [integer!]] [
    insert stack value
    return none
]
duplicate-top: func [] [
    insert stack first stack
    return none
]

Being relatively liberal in syntax here, we can imagine doing this more like:

; A Modern Dialected Way

push: ??? [
    {Push the number onto the stack}
    space (value <number!>)
][
    insert stack value
    return none
]

duplicate-top: ??? [
    {Duplicate the top item on the stack}
    lf space
][
    insert stack first stack
    return none
]

This is a little push in the right direction. But we have some issues to think through.

Whatever we're building here as units of the language need to be "registered" in the complete list of operations. If they were global declarations run by the evaluator, then one of the things the ??? operator would have to do would be this registration. If they are not global, then registration could be done by their container that is invoking them.
We still have to inherit the leading space that both of these instructions have. An interesting point in the PARSE branching is that we benefit from having the knowledge that these all start with space, so that the space can lead into these tests as alternates...e.g. [space [lf space | space <Number>]] will be more efficient than [space lf space | space space <Number>].
If we decide that the container is the boss, then giving a name to ??? is not really necessary. But what happens if you leave it off?
```
... Stack-Manipulation ... [
    push: [
        {Push the number onto the stack}
        space (value <number!>)
    ][
        insert stack value
        return none
    ]
    ...
]
```
This is a counterintuitive mixture of a SET-WORD! with two blocks after it. How do we reconcile the idea of "declaration-like-things" inside of a container like this? The code needs to be in a block, the other things don't...so other notations are possible...
```
... Stack-Manipulation ... [
    PUSH
    {Push the number onto the stack}
    space (value <number!>) [
        insert stack value
        return none
    ]
    ...
]
```

I guess the easiest thing to do is to cave, and just come up with something for ???. It has the benefit that you can actually run your implementation of Stack-Manipulation through the evaluator and do normal assignments. So something like push: instruction [...] [...]

Yet there's something a bit dissatisfying about having to type INSTRUCTION redundantly each time...where you're also kind of claiming that word so it can't be used other ways.

Lots Of Other Questions...

I think it would be worthwhile to rework this, but it's worth taking note of how adrift you can get right out of the gate looking at it. There are lots of big questions that don't really have obvious answers.

razetime · January 31, 2021, 3:18am

Do you mind if I analyze this and golf it?

hostilefork · January 31, 2021, 8:22am

I certainly wouldn't mind help on revisiting this!! And if you wanted to blog about it and build some credit for it yourself, you could basically "own" it even if it was a collaborative effort. I'd be happy to hand it off...though I do think I could provide good guidance on direction.

As I say, I wrote this a DECADE ago, before I had any real experience with the language. I was definitely a newbie, so it resulted in something quite bloated for what it is.

But the gimmick is that I'd learned that I could write things like:

parse data [tab space tab | space space tab]

And it would look up the words to get the CHAR! values, and match that. So then when I learned I could write:

rule1: [tab space tab]
rule2: [space space tab]
parse data [rule1 | rule2]

I thought "hey, this could be a pretty literate way of breaking down the spec for whitespace, as code." Had I been more experienced and ready to abstract it, I'd have taken the next step:

all-rules: []
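instruction: func [block] [
    ; separate alternates with |, but don't leave a trailing one
    if not empty? all-rules [append all-rules '|]
    append/only all-rules block  ; /only keeps the rule as a nested block
    return block  ; so rule1 and rule2 refer to their own rules
]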
rule1: instruction [tab space tab]
rule2: instruction [space space tab]
parse data all-rules

Which is more in the spirit of what you'd want, to avoid having to write out the big aggregate rule manually.
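
For readers who don't speak Rebol, the same registration trick can be sketched in Python. This is only an analogy under my own naming (ALL_RULES, match_any are hypothetical), with tuples of characters standing in for parse rules:

```python
ALL_RULES = []  # hypothetical registry, playing the role of all-rules

def instruction(*sequence):
    """Register a whitespace sequence as one alternative and return it."""
    rule = tuple(sequence)
    ALL_RULES.append(rule)
    return rule

def match_any(data, pos=0):
    """Return the first registered rule that matches data at pos, or None."""
    for rule in ALL_RULES:
        if tuple(data[pos:pos + len(rule)]) == rule:
            return rule
    return None

rule1 = instruction("\t", " ", "\t")
rule2 = instruction(" ", " ", "\t")

print(match_any("  \t") == rule2)
```

Declaring a rule and adding it to the aggregate happen in one step, so the big alternation never has to be written out by hand.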

But what inspired the "parse as whitespace interpreter" idea was that I'd just found out about the ability to mark and seek parse positions:

>> string: "aaaabbbb"

>> parse string [some "a" pos: some "b"]  ; mark (seek would be :pos)
== #[true]  ; R3-Alpha returned true if the parse reached the end of input

>> pos
== "bbbb"

So I could actually use the parse position as the instruction pointer. Whether this was a good idea or not, I thought it would be interesting to try.

Red has a pretty good overview of PARSE, if you haven't seen my links to it already. Would be good to have a handle on how that works before trying to hack on this.

Making It Suck Less, Even in R3-Alpha, Seems a Good First Goal

I outlined kind of a vision for what the "whitespace definition dialect" might look like. Basically, coming up with a language variant on the fly for making whitespace. The picture I paint is something a bit like this:

duplicate-indexed: instruction [
    {Copy Nth item on the stack (given by the argument) to top of the stack}
    tab space [index: Number]
][
    insert stack pick stack index
    return none
]

So what we have here puts together:

The name of the instruction. Using a SET-WORD! for this seems sensible...but if we run it through normal assignment, we'd still like the instruction to know the name it was declared with (for display in debugging). This name would have to be enfixedly quoted if the instruction was run through normal evaluation...or the container would need to make the connection.
A description of the instruction. These were originally taken from the whitespace spec, though I think shortening them down to fit in 80 columns is a good idea.
The sequence of whitespace characters that represent the instruction. I think that using the plain words tab, space, and newline for these is appealing.
Inline with the instruction sequence, a definition of whitespace-encoded parameter values and the name of that parameter. There's Number and Label. Here I put that in a block. There's plenty of other ways you could think of writing it, some which might look more like how Rebol does function parameter definitions in its FUNC spec dialect:
```
  [<tab> <space> index [number!]]
```
...but like I say, there's something appealing to me about the idea that the "common case" of the whitespace characters here just be words. The free-wheeling use of words is kind of foundational in Rebol (want to call something IF in a context? Go ahead...) so having the named-parameters be the odd case feels better.
Some code to invoke the behavior, where any parameters defined in the spec are bound (in this case index). This code returned the new position or "none" (in R3-Alpha vernacular) to say it wasn't an instruction that changed the position. That doesn't seem like a bad idea, but it would be nice if it composed in a "return none" by default at the end of the code you provide... so that if you didn't make the body end in a RETURN it would not change the position.
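
As a rough illustration of the "literal characters plus a named, whitespace-encoded parameter" idea, here is a Python sketch. It relies on the Whitespace number encoding (a sign character, space for positive and tab for negative, then binary digits with space as 0 and tab as 1, ended by a linefeed). The spec format and function names are my own invention, Label parameters are not handled, and the input is assumed to contain only whitespace characters:

```python
def decode_number(source, pos):
    """Decode a Whitespace number starting at pos; return (value, new_pos)."""
    sign = {" ": 1, "\t": -1}[source[pos]]
    pos += 1
    value = 0
    while source[pos] != "\n":
        value = value * 2 + (1 if source[pos] == "\t" else 0)
        pos += 1
    return sign * value, pos + 1  # skip the terminating linefeed

def match_instruction(spec, source, pos):
    """Match literals and decode named Number parameters.

    spec mixes literal characters with (name, "Number") pairs. Returns
    (params, new_pos) on success, or None if a literal does not match.
    """
    params = {}
    for item in spec:
        if isinstance(item, tuple):
            name, _kind = item
            params[name], pos = decode_number(source, pos)
        elif source[pos:pos + 1] == item:
            pos += 1
        else:
            return None
    return params, pos

# tab space [index: Number], without the category's inherited prefix
spec = ["\t", " ", ("index", "Number")]
print(match_instruction(spec, "\t  \t\t\n", 0))
```

The returned dictionary is what would get bound into the behavior code, so the body can simply refer to index.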

Game Plan

I hope maybe you see "the vision" of how it would rapidly-bend to make a "whitespace implementation dialect". This is what I try to get at with the "Minecraft-of-programming". You're building a new language with the seeming ease that you would make a new function in most other systems...

The code needs a first pass just to make it more of a good example--instead of looking like someone's first way-too-long PARSE program.

While converting it to Ren-C might seem the most obvious first step... it might be better to get it working under "Redbol". There need to be more stable test cases for that (ultimately I'd like to be able to get a high percentage of Red's tests to pass!)

So I thought that converting it to Red first might actually be cool. Maybe even do some of the first pass of improvements in Red...which might be an easier entry for you, since they have a GUI console and distribute binaries (albeit 32-bit ones).

However... I was hit pretty immediately by the fact that Red can't use path lookup in parse rules.

(I could easily go off on a rant here about the fact that there's a minuscule audience for wanting to run their programs super-super-fast who do not care at all about any kind of formal rigor in the language.
To the extent such people do exist, a Rebol-like language would still never be the right pick. There are far faster and better-vetted languages...so the advantage which should be touted is expressivity. Every time I see this kind of boneheaded response to an issue it reinforces that they just do not understand what the language is actually good for.)

So I guess biting the bullet and making you a Ren-C developer would make sense, if you're up for that. Are you able to compile a version for your local OS of choice?

(My general suggestion for most development--even if one is on Windows--is to use a VirtualBox Linux to build and run. The tooling is just better. You can still keep the files and edit on the host, e.g. in VSCode, if you put the code in a shared folder...I guess I could make a screencast going through the steps of doing this.)

As for me right now, I'll see if I can avoid falling asleep, and get it running under Redbol emulation in Ren-C...

UPDATE

Okay, first step taken! Patched up Redbol to be able to run it. In the web console even. (It's slow, but what's really slow is actually the printing...)

You can see it work with these two commands on the replpad page:

>> redbol

>> do https://github.com/hostilefork/whitespacers/blob/master/rebol/whitespace.reb

That version can stay like that as a museum piece. I've started a new file to use for work in bleeding-edge Ren-C. Not much needed to be changed.

hostilefork · January 31, 2021, 8:46am

In terms of "neat features that could maybe just fall out"... Rebol's PARSE can work on symbolic data as well as strings. It might be cool to exploit that, in a way that could make debugging easier.

To elaborate: the WORD!s tab, space (sp), and lf (or, preferably when not code golfing, newline) look up to CHAR! by default. So when you use them in PARSE they are looking for characters in a string:

 >> string: "^- ^/^/^-"  ; escape codes for `tab space newline newline tab`

 >> parse string [tab space newline newline tab]
 == #[true]  ; in R3-Alpha, true if parse succeeded. Ren-C returns input.

Now if you start talking about symbolic blocks, you could also have a BLOCK! of CHAR! values. Not as compact as a string, since it costs a value-cell-per-character.

 >> block: [#"^-" #" " #"^/" #"^/" #"^-"]   ; 5-element block

 >> parse block [tab space newline newline tab]
 == #[true]

You could REDUCE to get that block of CHAR! values by looking up the WORD!s:

 >> block: reduce [tab space newline newline tab]
 == [#"^-" #" " #"^/" #"^/" #"^-"]

And you can merge the characters in that block into a string if you want. This is what %whitespace.reb was doing.

But you could also match WORD!s in blocks.

 >> block: [tab space newline newline tab]

 >> parse block ['tab 'space 'newline 'newline 'tab]
 == #[true]

 >> parse block ['tab 'space 2 'newline 'tab]
 == #[true]

So what I'm thinking might be interesting would be if a debug mode of %whitespace.reb were to use blocks instead of strings. To give an idea of what I'm talking about:

 either debug [
     tab: 'tab
     space: 'space
     newline: 'newline
 ][
     tab: #"^-"
     space: #" "
     newline: #"^/"
 ]

So now a rule like [tab space newline newline tab] could take on either meaning depending on the mode.
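
The same switch can be mimicked in Python to show what "either meaning" buys you. This is only a sketch with hypothetical names (make_symbols, matches): one rule written with words, checked against a string in normal mode or against a list of words in debug mode.

```python
def make_symbols(debug):
    """Return what tab/space/newline stand for in the chosen mode."""
    if debug:
        return {"tab": "tab", "space": "space", "newline": "newline"}
    return {"tab": "\t", "space": " ", "newline": "\n"}

def matches(rule, data, symbols):
    """True if data (a str or a list of words) is exactly the rule's sequence."""
    expected = [symbols[word] for word in rule]
    return list(data) == expected

RULE = ["tab", "space", "newline", "newline", "tab"]

# Normal mode: the rule matches raw characters.
print(matches(RULE, "\t \n\n\t", make_symbols(debug=False)))
# Debug mode: the same rule matches a readable list of words.
print(matches(RULE, ["tab", "space", "newline", "newline", "tab"], make_symbols(debug=True)))
```

In debug mode the input stays legible when printed, including any whitespace that would otherwise vanish at the end of a line.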

This isn't a terribly profound thought, but maybe inspires some creative thinking. I've seen some parser combinators that don't have the duality of being able to work on symbols and strings, even in something like Clojure. So just wanted to interject that Rebol parse rules have that duality, and maybe it'd be interesting in this problem.

(I also think it would be ideal to have it use what it knows to auto-generate help, cheaply!)

Making the system have legible debug output nearly "for free" is a good goal. It's already making a step in that direction... just for reference, here's what the R3-Alpha output of the program is. Though I changed the completely invisible PRINT to PRINT MOLD...which still doesn't help you with seeing whitespace at end of line (which is why I'm suggesting doing the transformation to a block when debugging is requested):

WHITESPACE INTERPRETER FOR PROGRAM:
---
{   ^-

   ^-    ^-^-
 
 ^-
 ^-   ^- ^- 
^-
     ^-
^-    
    ^- ^-^-
^-  ^-
^-  ^-   ^- ^-

 
 ^-    ^-^-

   ^-   ^- ^-
 




}
---
LABEL SCAN PHASE
( [mark-location 67] )
( [mark-location 69] )
make map! [
    67 17
    69 96
]
---
( [push 1] )
( [duplicate-top] )
( [output-number-on-stack] )
1
( [push 10] )
( [output-character-on-stack] )


( [push 1] )
( [do-arithmetic 'add] )
( [duplicate-top] )
( [push 11] )
( [do-arithmetic 'subtract] )
( [jump-if-zero] )
( [jump-to-label] )
( [duplicate-top] )
( [output-number-on-stack] )
2
( [push 10] )
( [output-character-on-stack] )


( [push 1] )
( [do-arithmetic 'add] )
( [duplicate-top] )
( [push 11] )
( [do-arithmetic 'subtract] )
( [jump-if-zero] )
( [jump-to-label] )
( [duplicate-top] )
( [output-number-on-stack] )
3
( [push 10] )
( [output-character-on-stack] )


( [push 1] )
( [do-arithmetic 'add] )
( [duplicate-top] )
( [push 11] )
( [do-arithmetic 'subtract] )
( [jump-if-zero] )
( [jump-to-label] )
( [duplicate-top] )
( [output-number-on-stack] )
4
( [push 10] )
( [output-character-on-stack] )


( [push 1] )
( [do-arithmetic 'add] )
( [duplicate-top] )
( [push 11] )
( [do-arithmetic 'subtract] )
( [jump-if-zero] )
( [jump-to-label] )
( [duplicate-top] )
( [output-number-on-stack] )
5
( [push 10] )
( [output-character-on-stack] )


( [push 1] )
( [do-arithmetic 'add] )
( [duplicate-top] )
( [push 11] )
( [do-arithmetic 'subtract] )
( [jump-if-zero] )
( [jump-to-label] )
( [duplicate-top] )
( [output-number-on-stack] )
6
( [push 10] )
( [output-character-on-stack] )


( [push 1] )
( [do-arithmetic 'add] )
( [duplicate-top] )
( [push 11] )
( [do-arithmetic 'subtract] )
( [jump-if-zero] )
( [jump-to-label] )
( [duplicate-top] )
( [output-number-on-stack] )
7
( [push 10] )
( [output-character-on-stack] )


( [push 1] )
( [do-arithmetic 'add] )
( [duplicate-top] )
( [push 11] )
( [do-arithmetic 'subtract] )
( [jump-if-zero] )
( [jump-to-label] )
( [duplicate-top] )
( [output-number-on-stack] )
8
( [push 10] )
( [output-character-on-stack] )


( [push 1] )
( [do-arithmetic 'add] )
( [duplicate-top] )
( [push 11] )
( [do-arithmetic 'subtract] )
( [jump-if-zero] )
( [jump-to-label] )
( [duplicate-top] )
( [output-number-on-stack] )
9
( [push 10] )
( [output-character-on-stack] )


( [push 1] )
( [do-arithmetic 'add] )
( [duplicate-top] )
( [push 11] )
( [do-arithmetic 'subtract] )
( [jump-if-zero] )
( [jump-to-label] )
( [duplicate-top] )
( [output-number-on-stack] )
10
( [push 10] )
( [output-character-on-stack] )


( [push 1] )
( [do-arithmetic 'add] )
( [duplicate-top] )
( [push 11] )
( [do-arithmetic 'subtract] )
( [jump-if-zero] )
( [discard-top] )
( [end-program] )
Program End Encountered
stack: []
callstack: []
heap: make map! [
]

hostilefork · January 31, 2021, 9:03pm

Well... I've hacked it up to get it to look like the proposal. It's raised some questions (including ones that are known and have discussions here, especially pertaining to binding).

But it's rather cool. And now the ren-c variant runs in the ReplPad without redbol:

>> do https://github.com/hostilefork/whitespacers/blob/master/ren-c/whitespace.reb

I did the changes in individual commits in the git log, corresponding to the points I raised. So you can sort of follow along, and I think it would be illuminating to take a look at each phase--even if some behind-the-scenes parts look like gibberish (for now):

First I introduced CATEGORY and OPERATION as simple aliases for MAKE OBJECT!, as a start to putting some running code that could give smarts to the definition (instead of just being inert literal blocks).
Then I got rid of the repetition of command: and description: in each definition. Instead it assumes the block will have a string in the first position (the description) and everything else makes up the command array.
Next was moving the code for the behavior for each operation to live with the operation. This was to start merging it together so that OPERATION would really just be a fancy way of defining a function plus more auxiliary information. (The function parameterization wasn't exactly consistent, so I had to fix that up)
Another essential step was auto-generating the PARSE rules from the operation and category specs. This is simpler than it looks.
And to finish off how much I can do before sleeping, I fused the parameter definition for the behavior function in with the whitespace instruction definition. So the sequence of opcode characters is right in line with the spec of the argument that is expected to be encoded in more characters after it to form the rest of an instruction.
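
Two of the steps above (description in the first position, and generating the rules from the specs) can be sketched in Python. This is not how the Ren-C code does it; split_spec and build_alternatives are hypothetical helpers, and the output is a plain data structure rather than real PARSE rules:

```python
def split_spec(spec):
    """First item is the description; everything after it is the command."""
    description, *command = spec
    if not isinstance(description, str) or not command:
        raise ValueError("spec needs a description followed by a command")
    return description, command

def build_alternatives(prefix, operations):
    """Factor the category prefix out once: prefix [op1 | op2 | ...]."""
    alternatives = []
    for name, spec in operations.items():
        _description, command = split_spec(spec)
        alternatives.append((name, command))
    return {"prefix": prefix, "alternatives": alternatives}

operations = {
    "push": ["Push the number onto the stack", " ", ("value", "Number")],
    "duplicate-top": ["Duplicate the top item on the stack", "\n", " "],
}
rule = build_alternatives([" "], operations)
print(rule["alternatives"])
```

Because the prefix is stored once, a matcher built from this structure would test the leading space a single time before trying the alternatives.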

There's a lot of ugly hacking to get here, so it's going to need cleanup and more pushing on system features to make this work. But if it were easy to make something that blended source fragments together in this style and have it still run, some other language would have done it.

Let me know if any of the above is confusing or makes sense, @razetime. Note that if you're reading a diff on GitHub, you can click on the diff line and comment directly on it... to ask questions or make remarks. Please don't hesitate to do so if you have anything to ask or suggest...it's not spam, it's desperately needed peer review and commentary.

(I continue to hope that the "Minecraft of Programming" vision emerges from the process and line of thinking. The code for defining OPERATION and CATEGORY themselves is a lot uglier than I'd like, and we have to keep fiddling with that until it can make clear sense. But the fact that it's even possible to write--and make the usages of operation and category look that natural--shows the goalposts.)

Note: I used the term "Instruction" for the aggregate of the whitespace opcode plus its arguments (which I form as a BLOCK! in the runtime). Then "Operation" for the name like push or jump-if-zero, based on precedent from assembly language. The sequence of ASCII chars is still the "Command" for now due to the term being in the whitespace spec, though I guess that could change, possibly to "opcode"? I think terminology is important, so this should all be nailed down and used consistently...

hostilefork · July 3, 2021, 12:08am

I sync'd up the whitespace Ren-C variant to run with the current rules (e.g. a world without /ONLY) and switched it to use UPARSE.

The old circa 2010 variant I left as-is, but made sure the Redbol emulation could run it.

And now both are running on all platforms in debug and release builds via the Ren-C Action:

https://github.com/hostilefork/whitespacers/runs/2976554030?check_suite_focus=true

Really it's only running one test, and only testing that it finishes without raising an error. This should be more rigorous...running more samples and checking the outputs. But it's definitely a start.

There are still a lot of questions raised by this regarding the meaning of dialect design, but it's nice to see the 2010 code running under emulation...especially because that version is using UPARSE tweaked to Rebol2 conventions under the hood (!)

aarchi · July 5, 2021, 1:43pm

For my Whitespace Corpus project, I have documented about 175 implementations of the Whitespace programming language in 41 languages, including spec compliance and building instructions for each. Your %whitespace.reb interpreters are interesting in how they define a DSL describing Whitespace and I'd like to suggest some improvements.

Running those proved to be difficult as a Rebol newcomer. I obtained R3-Alpha binaries from rebolsource to run the 2009 version (I didn't yet know about Redbol). I was initially confused about how to build Ren-C and poked around the compiler to understand how it's built, but once I found make.sh building became easy.

I'd like to bootstrap Ren-C back to the first open source release, without any arbitrary binaries, except for the R2 binary needed to build the first source. If a full bootstrap is something that you'd like to have, I can write up my notes so far and I'll have some questions. It seems like this would consist of building a couple of key versions such as 8994d23. Lots of build information is collected in the GitHub Actions (and previously Travis CI) configs that could be reused for the bootstrap. I can't guarantee that I'd have time to finish it though.

Your Whitespace interpreters unfortunately only run a single hardcoded program. This makes them near useless. Would it be simple to accept a command line argument of a program filename and fall back to the existing hardcoded program when no filename is given? Additionally, a verbose mode would be nice so that the program trace and parse results can be otherwise hidden. If you have any questions about Whitespace language nuances, I'd be happy to answer them.

By the way, I submitted a tiny PR to your Whitespacers repo that fixes some typos and hasn't received any attention.

hostilefork · July 5, 2021, 7:05pm

Glad you are doing that! Better you than me. I was mulling over what to do about the fact that people had starred and were following the repository I made. That meant it would be noisy to be conducting "day to day" (every couple of years?) Ren-C development on it.

I've pointed to your project in the README.md and marked hostilefork/whitespacers as a read-only archive. The Ren-C whitespace interpreter is now a separate repository.

Your %whitespace.reb interpreters are interesting in how they define a DSL describing Whitespace

Glad you find them so. When I wrote the first version I'd only known about Rebol for a couple of months. I had just learned what PARSE was. So it's not stellar code, but I was trying to one-up the "table-driven" implementations with a "spec-driven" implementation.

The Ren-C version is an attempt to do something more impressive, with bundling the code and instruction spec together more tightly. I don't know if keeping it based on translating it into PARSE rules is the best idea in the world. But it's certainly the way someone might try to solve such a problem, and I have a new usermode parser combinator approach...so it's testing that.

There are a lot of pain points in trying to live up to the promise of "mere mortals can bend the constructs--fast--to make their own language"; Rebol2 was quite unproven, Red hasn't moved the needle much. Ren-C has tons more tools, but there are still really basic questions surrounding "how does variable binding work" which are (nearly) as half-baked as ever. Just trying to tackle them one issue at a time.

I'd like to bootstrap Ren-(C) back to the first open source release, without any arbitrary binaries, except for the R2 binary needed to build the first source.

As it happens, the R3-Alpha open source release did not bootstrap with Rebol2. It had drifted to where you needed a certain "stable enough" early version of R3-Alpha.

Ren-C is even deeper into this situation. Frustrating as that may sound, the experience of bootstrapping at this level is actually rather educational. Making the language able to shift and bend between version changes of itself is a sort of trial-by-fire for the kind of acrobatics we want people to be able to exercise in their own code.

What I've thought about is the idea that people could use the web build to make a .zip of the prep files for their platform. This would dodge the need to have a desktop executable installed at all. So people would go to a URL, pick some settings from checkboxes, and get the contents of the prep/ directory.

Even more ambitious would be if we made the web build embed the TinyC compiler, and it could give you a full desktop executable...built in your browser, using your CPU! (See my "Amish Programming" demo for that being done on the desktop, and there's not any fundamental reason why it couldn't be done in a browser too; one just has to be sensitive to virtual filesystem and memory quotas).

Certainly needs to take files as input. I was hoping to make options for writing "whitespace assembly" (higher level instructions and labels), "whitespace words" (literally writing out space tab lf), or just the actual whitespace bytes...with the option to skip over non-whitespace characters or error ("strict mode").
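
Two of those input options are easy to picture with a small Python sketch. The function names and the exact word spellings accepted are assumptions for illustration, not the interface the interpreter ended up with:

```python
WORDS = {"space": " ", "tab": "\t", "lf": "\n"}

def from_words(text):
    """Turn 'space tab lf' style source into real whitespace characters."""
    try:
        return "".join(WORDS[word] for word in text.split())
    except KeyError as err:
        raise ValueError(f"unknown word: {err.args[0]}") from None

def from_bytes(text, strict=False):
    """Keep only whitespace opcodes; in strict mode reject anything else."""
    result = []
    for ch in text:
        if ch in " \t\n":
            result.append(ch)
        elif strict:
            raise ValueError(f"non-whitespace character {ch!r}")
    return "".join(result)

print(repr(from_words("space tab lf")))
print(repr(from_bytes("push 1: \t\n")))
```

Skipping non-whitespace by default keeps commented Whitespace programs runnable, while strict mode catches files that were mangled by an editor.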

I can raise the priority level of doing that in my queue, maybe something I can do this week...though I'm actually trying to get my old Rebmu sample programs running on GitHub. If you like esolangs, it might amuse you:

Sorry to have missed that. Merged!

aarchi · July 7, 2021, 2:24pm

Thanks for the mention in your wspacers collection. Hopefully, that'll help bring exposure to it and help it grow.

Yeah, I've been discovering that. I identified a couple of key revisions that had been used to build successive revisions and I intended to figure out how to build them all, but once I got to 8994d23 (2018-12-17), I ran into errors that look related to tooling/OS version mismatches. It wouldn't be pleasant to fix those old versions to work with modern tooling. Your approach with the .zip prep files downloaded from your site sounds much more usable anyways.

Thanks. It's much easier to use now. Now that it can run arbitrary programs, I notice that it has some bugs when executing some of the example programs. If you're interested, I could investigate what's wrong: probably something with reading or more complex control flow.

That's quite a quirky language. I haven't seen that particular method of compression before (i.e. alternating case to indicate word boundaries) and it could be useful for implementing terse code golf languages.

aarchi · July 7, 2021, 2:59pm

I noticed that your new hostilefork/rebol-whitespacers repo doesn't preserve any git history, so I used a combination of git filter-repo and git rebase to concatenate the whitespacers and rebol-whitespacers histories.

I've published this version with preserved history at wspace/rebol-whitespacers for now. Let me know if you'd rather have it on your user profile or in the wspace org. If hostilefork/ and you give me the go ahead, I'll do a push --force with the preserved history. If wspace/, it would be best to transfer the repo to the org so that it will redirect from your profile and the star will be preserved; I'd set you as the owner.

Here's how I did it:

git filter-repo makes it easy to remove all files that don't match the given patterns. This invocation filters whitespacers to only the files used in rebol-whitespacers (and it would be all you'd need if there was no hard fork):

git filter-repo --path rebol --path ren-c --path .github --path README

I then used rebase to replay the new rebol-whitespacers commits on top of whitespacers by

removing the “Prepare for switching to archived repository” whitespacers commit because it deletes files that are used in rebol-whitespacers.
removing the “Initial commit” rebol-whitespacers commit because it adds a dummy README.md
and editing the “Get R3-Alpha/Ren-C interpreters from collection” rebol-whitespacers commit so that the files are moved to the reorganized paths

Unfortunately, rebase overwrites the committer name, email, and time for every modified commit, so I wrote a script that stores the committer information before rebasing, then applies it afterwards. This was dreadfully annoying.
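
The core of that trick is tagging each commit message with its pre-rebase hash and stripping the tag again afterwards. Here is a standalone Python sketch of just that round trip, reusing the regular expression from record_committers.bash further down. The tag and untag helpers are hypothetical and operate on plain bytes rather than on filter-repo commit objects:

```python
import re

# Same pattern as in the generated filter-repo callback.
REF = re.compile(br"\n\n\[ref:([0-9a-f]{40})\]\n")

def tag(message, original_id):
    """Append the pre-rebase hash as a trailer, as the first pass does."""
    return message + b"\n\n[ref:%s]\n" % original_id

def untag(message):
    """Return (clean_message, original_id), or (message, None) if untagged."""
    match = REF.search(message)
    if not match:
        return message, None
    return REF.sub(b"", message), match.group(1)

original = b"a" * 40  # stand-in for a real 40-character commit hash
tagged = tag(b"Fix typo\n", original)
print(untag(tagged))
```

Since the tag travels inside the message, it survives the rebase even though the commit hashes themselves change.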

I've included the scripts I wrote to do this below. The entire process is deterministic (given a deterministic squash commit message), but I wasn't able to automate two interactive parts, so it's probably not worth trying. I've verified that the rewrite preserved all history and metadata, automatically and manually.

concat_histories.bash

#!/bin/bash -e

# Clone both repos.
# whitespacers is the base and rebol-whitespacers is left unchanged.
git clone https://github.com/hostilefork/rebol-whitespacers
git clone https://github.com/hostilefork/whitespacers rebol-whitespacers-concat
cd rebol-whitespacers-concat

# Only include Rebol- and Ren-C-related files.
# If you want to verify that these paths are correct for all commits,
# including deletions and renames, you can run `git filter-repo --analyze`
# to generate a report at .git/filter-repo/analysis/path-all-sizes.txt.
git filter-repo --path rebol --path ren-c --path .github --path README

# Rename master to main to match rebol-whitespacers.
git branch -m master main

# Merge rebol-whitespacers into whitespacers. This adds all files from
# rebol-whitespacers at their new paths, so there are duplicates for
# now. The merge commit is later discarded by the rebase, so don't worry
# about the commit message.
git remote add rebol ../rebol-whitespacers
git fetch rebol
git merge --allow-unrelated-histories --no-edit rebol/main
git remote remove rebol

# Record committer information so that it can be recovered after rebase.
../record_committers.bash ../committers.py

# Rebase to make the following changes:
# - Remove the penultimate whitespacers commit because it deletes files
#   that are used in rebol-whitespacers.
# - Squash the first two commits of rebol-whitespacers to move files to
#   their new paths and get rid of the dummy README.md.
#
# Start the rebase at "fdd7acb GitHub Actions workflow to test Rebol Whitespacers".
# If the commit hash is different, you have rewritten history
# differently and need to replace the hash in the rebase invocation.
#
# When the interactive editor opens, change pick to drop/edit/squash as
# follows:
#
#   drop 233e029 Prepare for switching to archived repository
#   pick 9b3c3fb Fix typo in Rebol/Ren-C titles; update filenames to current
#   edit 02daa2c Initial commit
#   squash aff5261 Get R3-Alpha/Ren-C interpreters from collection
#   pick 0447490 Take filename on the command line
#   pick 102a453 Break Dialect and Runtime into Separate Files
#   pick c429a5d Pretend UPARSE is the official PARSE behavior
#   pick d18e0ee Disable Syntax Highlighting for Rebol Files
#   pick 234d2b3 Update README.md
#
git rebase -i --committer-date-is-author-date fdd7acb

# Edit the rebol-whitespacers initial commit to move files to their new
# paths and get rid of the dummy README.md.
mkdir historical
git rm README README.md
git mv rebol/whitespace.reb historical/whitespace-old.reb
git mv ren-c/{whitespace.reb,README.md} .
git mv .github/workflows/test-{rebol,whitespacers}.yml
rmdir rebol ren-c
git commit --amend --no-edit

# Resolve conflicts in "Get R3-Alpha/Ren-C interpreters from collection",
# automatically choosing the changes from rebol-whitespacers.
git rebase --continue || :
git diff --name-only --diff-filter=U | xargs git checkout --theirs
git add --all

# Squash these two commits. Reword the new commit message to make sense
# in this new context (at least remove the "Initial commit" line).
git rebase --continue

# Restore accurate committer information.
../restore_committers.bash ../committers.py
rm ../committers.py

# Delete excessive replacement objects.
git replace --delete $(git replace --list)

record_committers.bash

#!/bin/bash -e

callback="$1"

# Set your name locally to "dummy" so that we can detect commits for
# which rebase changed the committer.
git config --local --add user.name dummy
git config --local --add user.email dummy@example.com

# Generate callback for filter-repo.
echo 'p = re.compile(br"\n\n\[ref:([0-9a-f]{40})\]\n")
match = p.search(commit.message)
if match:
  ref = match.group(1)
  commit.message = p.sub(b"", commit.message)

if not match or commit.committer_name != b"dummy":
  pass' > "$callback"

git log --date=raw --format='elif ref == b"%H":
  commit.author_name = b"""%an"""
  commit.author_email = b"""%ae"""
  commit.author_date = b"%ad" # %ai
  commit.committer_name = b"""%cn"""
  commit.committer_email = b"""%ce"""
  commit.committer_date = b"%cd" # %ci' >> "$callback"

# Append commit hash to each commit message.
git filter-repo --commit-callback 'commit.message += b"\n\n[ref:%s]\n" % commit.original_id'

restore_committers.bash

#!/bin/bash -e

callback="$1"

git filter-repo --commit-callback "$(cat "$callback")"

git config --local --unset-all --fixed-value user.name dummy
git config --local --unset-all --fixed-value user.email dummy@example.com
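
The marker handling used by the two helper scripts can be looked at in isolation. The following is a minimal sketch in plain Python with no git involved: the regular expression is the one written into the generated callback, while `tag_message` and `untag_message` are hypothetical helper names introduced only for illustration.

```python
import re

# Same pattern that record_committers.bash writes into the callback.
REF_PATTERN = re.compile(br"\n\n\[ref:([0-9a-f]{40})\]\n")


def tag_message(message: bytes, original_id: bytes) -> bytes:
    """Append the original commit hash as a trailer (hypothetical helper)."""
    return message + b"\n\n[ref:%s]\n" % original_id


def untag_message(message: bytes):
    """Return (original_id or None, message with the trailer removed)."""
    match = REF_PATTERN.search(message)
    if match is None:
        return None, message
    return match.group(1), REF_PATTERN.sub(b"", message)


if __name__ == "__main__":
    commit_id = b"0123456789abcdef0123456789abcdef01234567"
    tagged = tag_message(b"Take filename on the command line\n", commit_id)
    print(untag_message(tagged))
```

Because the trailer carries the pre-rebase hash, the restore step can look up the recorded author and committer fields by that hash even though the rebase has given every commit a new ID.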

hostilefork · July 7, 2021, 6:36pm

Absolutely! I believe I've set the flag so it will let you... let me know if it doesn't.

Thanks so much, and glad you're taking an interest in it. I'm sure you've learned a bit about the Ren-C language experiment in the process. (And an experiment it certainly is!)

Fixing any bugs in terms of the spec would of course be great!

I'd like to have it run and validate the output of tests. So there are a lot of improvements to make in giving it better modes and command-line switches.

But also there's plenty to do on the implementation. You can see the thought flow in the thread above about how it got moved from where it was to where it is now. But the "VM" currently hardcodes the categories by name, and it shouldn't know about them...instead it should get a list produced by the CATEGORY. Etc.

I'll wait until you've pushed the history to work on any of that...

aarchi · July 7, 2021, 10:37pm

Alright I've now pushed to hostilefork/rebol-whitespacers and it's safe to work on that repo. I've deleted my wspace/rebol-whitespacers repo because it's no longer relevant.

hostilefork · July 9, 2021, 5:09pm

I wanted to keep the momentum going here now that there is some, so I went ahead and worked on this.

First thing I did was to add command line processing to control a verbosity setting, and a --verbosity of 0 can just give the output of the program itself...suitable for checking.

That revealed that it was writing extra newlines each time it wrote out an integer/character. I got rid of that.

Then I set it up to use Ren-C to launch a subprocess of itself to run the test, so that it can capture the result into a string. It compares that result to a known output. By doing this in Ren-C itself it works on all platforms without worrying about varying diff tools or things like that.

Rather than subject you to debugging old untested code I wrote about 10 years ago, I went ahead and attacked the issues myself. There were lots of problems!

But I've now got it up and running on GitHub Actions with several tests that can feed input and validate output. They are working on Linux/Windows/Mac:

https://github.com/hostilefork/rebol-whitespacers/runs/3031025372?check_suite_focus=true#step:9:5

A bit painful is that the author of the sample programs made the unfortunate choice to use CR LF in their output. Ren-C considers that a foreign file format you need a codec to decode, just as if you were using a different character set. So the tests compare to a BINARY!, not a TEXT!.
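
The testing approach described above (launch the interpreter as a subprocess, capture what it prints, compare against known output as raw bytes) can be sketched outside Ren-C. This is an illustrative Python sketch, not the project's actual test harness: `run_and_capture` and `output_matches` are hypothetical names, and the child command is a stand-in for a real interpreter invocation.

```python
import subprocess
import sys


def run_and_capture(argv):
    """Run a command and return its raw stdout bytes (no newline translation)."""
    result = subprocess.run(argv, stdout=subprocess.PIPE, check=True)
    return result.stdout


def output_matches(argv, expected: bytes) -> bool:
    """Compare captured output to the expected bytes exactly, CR LF included."""
    return run_and_capture(argv) == expected


if __name__ == "__main__":
    # Stand-in child process: writes bytes directly so CR LF survives as-is.
    child = [
        sys.executable, "-c",
        "import sys; sys.stdout.buffer.write(b'Hello, world\\r\\n')",
    ]
    print(output_matches(child, b"Hello, world\r\n"))
    print(output_matches(child, b"Hello, world\n"))
```

Comparing bytes rather than decoded text means that a CR LF in the expected output has to be matched literally, which is the same reason the tests compare against a BINARY! instead of a TEXT!.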

As the author of the original Rebol one all that time ago, I'd ask for the old version to be removed from wspace/corpus. It's very broken and I'm not going to fix it. (!) I will keep it around myself just to run as a Redbol test, since right now it runs on the only hardcoded case it can support.

But I'd definitely appreciate your help in pushing on the Ren-C one, as an example of the kinds of language acrobatics that we are trying. Anything would be good--even editing comments or README.md to explain it better. It's definitely still a "sketch", but I think it has potential. And it's turning out to be useful in testing.

One thing I'm wondering about is how standardized the assembly format and opcode names are...is there consensus? If not, is there a direction you would like the consensus to bend?

In the meantime, I need to get started on native-optimizing UPARSE, as it is s-l-o-w right now!

aarchi · July 10, 2021, 11:42am

Wow that's a lot of fixes. I/O fixes and --verbose especially allow more programs to run.

Great! I've submitted a PR that adds the remaining sample programs from wspace 0.3. In the process, I fixed an off-by-one error with duplicate-indexed (copy). (I'm not sure of your preferred development process—merged branches or rebased linear, open PR or commit directly to main—so I figured a PR would be the least controversial.)

read-character-to-location (readc) is not yet implemented, which is currently causing the name.ws test to fail. The reference interpreter reads UTF-8-encoded text and sets the value at the given address to the 32-bit codepoint. Many implementations just read ASCII, so that would be fine, if that's easier.
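
As a rough illustration of the UTF-8 behavior described above, here is a minimal Python sketch that reads one encoded character from a byte stream and stores its codepoint at an address. The names `read_codepoint` and `readc`, the dict used as a heap, and the choice to leave the heap untouched at end of input are all assumptions made for this sketch, not details taken from the reference interpreter.

```python
import io


def read_codepoint(stream):
    """Read one UTF-8 encoded character from a binary stream.

    Returns the codepoint as an int, or None at end of input.
    """
    first = stream.read(1)
    if not first:
        return None
    lead = first[0]
    if lead < 0x80:
        extra = 0          # single-byte (ASCII) character
    elif lead >> 5 == 0b110:
        extra = 1
    elif lead >> 4 == 0b1110:
        extra = 2
    elif lead >> 3 == 0b11110:
        extra = 3
    else:
        raise ValueError("invalid UTF-8 lead byte")
    rest = stream.read(extra)
    if len(rest) != extra:
        raise ValueError("truncated UTF-8 sequence")
    # bytes.decode validates continuation bytes and rejects overlong forms.
    return ord((first + rest).decode("utf-8"))


def readc(stream, heap, address):
    """Store the next codepoint at heap[address]; do nothing at end of input."""
    codepoint = read_codepoint(stream)
    if codepoint is not None:
        heap[address] = codepoint


if __name__ == "__main__":
    heap = {}
    source = io.BytesIO("\u00e9\n".encode("utf-8"))
    readc(source, heap, 0)
    readc(source, heap, 1)
    print(heap)
```

An ASCII-only implementation is the special case where `extra` is always 0.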

Does Ren-C have a facility for arbitrary-precision integers? If so, then fact.ws (factorial) would work for 21! and greater instead of overflowing a 64-bit integer.
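
The 21! threshold can be checked with a few lines of Python, whose integers are arbitrary-precision. The helper name `fits_in_int64` is just for this sketch.

```python
import math

INT64_MAX = 2**63 - 1  # largest signed 64-bit integer
INT64_MIN = -(2**63)


def fits_in_int64(value: int) -> bool:
    """True if the value is representable as a signed 64-bit integer."""
    return INT64_MIN <= value <= INT64_MAX


if __name__ == "__main__":
    for n in (20, 21):
        result = math.factorial(n)
        print(n, result, fits_in_int64(result))
```

20! still fits in a signed 64-bit integer, while 21! does not.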

I think it would be useful to add a separate LICENSE file so it can be automatically detected by GitHub and other tools. I notice that %whitespace.reb is MIT-licensed.

Yeah non-LF line endings aren't fun. I once had to deal with the edge cases of g++ and clang's LF, CRLF, and CR (Mac OS 9) line ending handling.

My philosophy with the Whitespace Corpus has been to collect all known Whitespace implementations and programs in a unified format. I avoid directly embedding project source code, instead using git submodules. This allows the author's repo to be the source of truth, and I prefer to direct improvements upstream. To date, only your repos and buyoh/nospace have accepted any of my PRs though, so I maintain several forks in the wspace org. Most projects before 2010 were distributed as source archives and many are only available on the Internet Archive; I have converted those to git repos, any of which I would transfer to the original authors if they ask. I sort the projects in reverse date order, using the date of first commit or first release, whichever is earlier.

I'm not sure when you last looked at wspace/corpus, but there is only one entry for your project. Before we got in touch, I had two hard forks, wspace/hostilefork-ren-c and wspace/hostilefork-rebol, which were the respective %whitespace.reb and README.md files extracted from hostilefork/whitespacers using git filter-repo. Once you moved development to hostilefork/rebol-whitespacers, I merged the two entries, using the date of your first release (2009-10-08) for ordering, and deleted the two now-redundant mirrors. Since the entry is a submodule pointing to hostilefork/rebol-whitespacers, I can't remove historical/whitespace-old.reb. I don't think it needs to be removed anyway: “historical” and “old” make its unsupported status clear, and it showcases Ren-C's R3-Alpha emulation capabilities.

Most Whitespace implementations are small weekend projects, written for coding challenges or school assignments, so spec compliance is not emphasized and bugs are frequent. In general, the Whitespace community produces untested, write-once software, and it's fine that way.

The interesting projects (to me) are those written in interesting languages (Rebol/Ren-C, jq, Idris, Whitespace[1][2], LOLCODE), with higher-level languages compiling to Whitespace (Nospace, Spitewaste, HaPyLi, WSC, Ruby subset), with optimizations (SSA form rewriting, AST rewriting, speculative JIT optimization), or with interesting compilation targets (LLVM[1][2][3][4], NASM, MASM, JVM, .NET MSIL). There are many more.

There are a lot of diverse projects that I want to do a better job of highlighting. The core of the corpus is a 5.5k line JSON file detailing the various implementations, along with some tools to generate the documentation. I think of it as a DSL in JSON. I need to write and generate more easily digestible comparisons.

hostilefork · July 10, 2021, 7:42pm

Ah, hadn't seen you'd done that. Great. I'll do a little surgery on that file and add a README to contextualize it better.

The idea of "reading a single character" and getting control back without needing to hit Enter isn't something that could work in R3-Alpha or early Ren-C consoles--especially on Windows--as they were using a high-level API that read an entire line at a time (including cursoring/etc.)

(Now the code is more granular, and is being slowly bubbled up to encode things as Rebol values...but the events aren't quite at the level of being exposed yet... although I have offered a hook if you press TAB to anyone who wants to implement what happens when you hit tab!)

So presuming the intent is to read a character without needing a newline, this would only have applied to piped input...which R3-Alpha was notoriously bad about. I did some egregious hacks to try and make it so that READ-LINE could work in piped or non-piped (console) situations...but READ-CHAR is a taller order.

How do other whitespace implementations handle READ-CHAR across REPL vs. non-REPL cases?

There is some experimental code using the bignums from mbedTLS. So in the "dependency minimization" philosophy the intent is to be able to share the same implementation used by the crypto.

But it's a fair ways off from being done; it's not particularly central to the more critical language design questions that need to be answered.

That seems like a pretty good way to study any emerging consensus on the instruction names or assembly syntax. I wonder if people could be lobbied to come to a standard? I'd like there to be an assembler which is also driven by the same spec being used to interpret.

Right now it's still pretty clunky. But ultimately where I want this implementation to go is to drive down the size of each feature, in a code golf kind of sense, and show the "force multiplier" that we have been seeking to achieve.

For instance, I just added a feature to run files from URL!s.

All told Ren-C is a hazy picture but it comes into focus a little at a time.

aarchi · July 10, 2021, 11:58pm

Most implementations, including the reference implementation, are line-buffered, so the easier way is actually spec-compliant. Line breaks also count as characters for readc and, if possible, CRLF isn't collapsed to LF.
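
A line-buffered reader of the kind described above might look like the following Python sketch. `LineBufferedInput` is a hypothetical class written for illustration; it is not code from any particular Whitespace implementation. It only asks the underlying stream for more data a whole line at a time, hands characters out one by one, and leaves CR LF uncollapsed.

```python
import io


class LineBufferedInput:
    """Hand out one character at a time from input that arrives line by line."""

    def __init__(self, stream):
        self.stream = stream   # text stream opened without newline translation
        self.pending = ""      # characters of the current line not yet consumed

    def read_char(self):
        """Return the next character, or None at end of input."""
        if not self.pending:
            self.pending = self.stream.readline()
            if not self.pending:
                return None
        char, self.pending = self.pending[0], self.pending[1:]
        return char


if __name__ == "__main__":
    # newline="" keeps CR LF intact instead of translating it to LF.
    source = io.StringIO("hi\r\nyo\n", newline="")
    reader = LineBufferedInput(source)
    print([reader.read_char() for _ in range(8)])
```

The line terminators come back as ordinary characters, so a program reading with readc sees them just like any other input.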

aarchi · August 12, 2021, 1:13am

By the way, rebol-whitespacers has no top-level LICENSE file. That would help GitHub detect the license and would clear up whether the MIT license mentioned in %whitespace.reb applies to only that file or the entire project.

hostilefork · August 15, 2021, 6:40am

Thanks, I've added one...

...aaaand I also got around to doing a bit more hacking on it tonight. It's really shaping up to be an interesting demonstration.

For a little insight into what's going on, we have OPERATION which might be something like:

mark-location: operation [
    {Mark a location in the program}
    space space [label: Label]
    <local> address
][
    address: offset? program-start instruction-end
    labels.(label): address
]

One by-product of this is the function to do the work. It would look like this:

mark-location-impl: func [
    label [text!]
    <local> address
][
    address: offset? program-start instruction-end
    labels.(label): address
]

Another by-product is the rule for generating an instruction invocation. That rule looks like:

[keep @mark-location-impl space space keep Label]

Commas may help readability, but they're not required:

[keep @mark-location-impl, space, space, keep Label]

Then each category aggregates these rules together, along with the IMP, and a COLLECT instruction:

[
    [lf]  ; IMP
    collect any @[
        [keep @mark-location space space keep Label]
        [keep @call-subroutine space tab keep Label]
        [keep @jump-to-label space lf keep Label]
        [keep @jump-if-zero tab space keep Label]
        [keep @jump-if-negative tab tab keep Label]
        [keep @return-from-subroutine tab lf]
        [keep @end-program lf lf]
    ]
]

Then, at a higher level, these rules are kept as alternates.

If a rule runs successfully, we expect to see a product. Label is a rule that matches a whitespace label sequence and synthesizes it as text, so the instruction block that the MARK-LOCATION rule would make might look like:

[mark-location-impl "  ^-  ^-"]

Where "^-" => tab.
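
For readers who don't know UPARSE, the matching behavior of that category rule can be approximated in Python. This is only a sketch of the idea (match the IMP, try each alternate in order, collect the operation name plus any Label), not a translation of the Ren-C code. The operation names are taken from the category listing above; the helper names are hypothetical, and the source is assumed to contain only space, tab, and LF, with labels terminated by LF as in the Whitespace language.

```python
SPACE, TAB, LF = " ", "\t", "\n"

FLOW_CONTROL_IMP = LF

# (command characters, operation name, takes a Label?) from the listing above.
FLOW_CONTROL = [
    (SPACE + SPACE, "mark-location", True),
    (SPACE + TAB, "call-subroutine", True),
    (SPACE + LF, "jump-to-label", True),
    (TAB + SPACE, "jump-if-zero", True),
    (TAB + TAB, "jump-if-negative", True),
    (TAB + LF, "return-from-subroutine", False),
    (LF + LF, "end-program", False),
]


def match_label(source, pos):
    """Match spaces/tabs up to a terminating LF; return (label, next pos) or None."""
    end = source.find(LF, pos)
    if end == -1:
        return None
    return source[pos:end], end + 1


def match_flow_control(source, pos=0):
    """Try each alternate in order; return (instruction, next pos) or None."""
    if not source.startswith(FLOW_CONTROL_IMP, pos):
        return None
    pos += len(FLOW_CONTROL_IMP)
    for command, name, takes_label in FLOW_CONTROL:
        if source.startswith(command, pos):
            after = pos + len(command)
            if not takes_label:
                return [name], after
            matched = match_label(source, after)
            if matched is None:
                return None
            label, after = matched
            return [name, label], after
    return None


if __name__ == "__main__":
    program = LF + SPACE + SPACE + "  \t  \t" + LF + LF + LF + LF
    first, pos = match_flow_control(program)
    second, pos = match_flow_control(program, pos)
    print(first, second)
```

The returned list plays the role of the collected instruction block: an operation name followed by whatever arguments its rule kept.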

I Think The Code Is Close To Essential Complexity Level

(...well, at least the part that's processing the specs.)

Here's the fully commented code for walking the spec and emitting a rule...

...but without comments it's actually quite short:

emit rule: collect [
    keep (compose [keep @(as word! name)])
    while any @[
        [<end>, stop]
        [ahead tag!, pos: <here>, (append args pos), to <end>, stop]
        [keep ^['space | 'tab | 'lf]]
        [
            into block! [
                param: set-word!, (param: to word! param)
                type: ['Label | 'Number]
            ]
            keep (compose [keep (type)])
            (append args compose [
                (param) (either (type = 'Label) '[text!] '[integer!])
            ])
        ]
        fail @["Malformed OPERATION spec"]
    ]
]
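
To make the shape of that walk easier to follow, here is a loose Python analogue. It reflects my reading of the snippet and rests on several assumptions: the spec is modeled as a Python list in which whitespace words are strings, a parameter block such as `[label: Label]` is a `(name, type)` tuple, a tag such as `<local>` is a string starting with `<`, and the description string has already been removed. `walk_spec` is a hypothetical name.

```python
WHITESPACE_WORDS = {"space", "tab", "lf"}
PARAM_TYPES = {"Label": "text!", "Number": "integer!"}


def walk_spec(name, spec):
    """Return (rule, args) for one operation spec."""
    rule = ["keep", "@" + name]
    args = []
    for index, item in enumerate(spec):
        if isinstance(item, str):
            if item.startswith("<"):
                # A tag such as <local>: pass it and the rest to the function spec.
                args.extend(spec[index:])
                break
            if item in WHITESPACE_WORDS:
                rule.append(item)
                continue
        elif (isinstance(item, tuple) and len(item) == 2
                and item[1] in PARAM_TYPES):
            param, kind = item
            rule.extend(["keep", kind])            # the rule keeps the matched value
            args.extend([param, [PARAM_TYPES[kind]]])  # the function gets a typed arg
            continue
        raise ValueError("Malformed OPERATION spec")
    return rule, args


if __name__ == "__main__":
    rule, args = walk_spec(
        "mark-location",
        ["space", "space", ("label", "Label"), "<local>", "address"],
    )
    print(rule)
    print(args)
```

A single pass over the spec yields both by-products: the rule that recognizes the instruction and the parameter list for the function that implements it.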

I cleaned it up a bit while writing this post. Looking at what's left, I think I'm starting to side with @rgchris that maybe the ^ should be reserved for "expert use". The ONLY keyword may also be better than QUOTE, because it helps convey that you aren't augmenting the value so that the augmentation gets kept; the augmentation is just a signal meaning "append as is":

[keep ^['space | 'tab | 'lf]]
=>
[keep only ['space | 'tab | 'lf]]

It's less intimidating.

On that note: regarding the growing appearances of @...well, it's just a fact of life there has to be some way to call out that a BLOCK! in parse isn't meant to be processed by the baseline block combinator. You can't use '[...] because generic quoting is taken for matching arbitrary content from the input series. I think it would be disastrous to say "anything goes" and start having combinators quote their block arguments and treat them as non-rule data, so being explicit feels unavoidable.

Anyway, for the size of this code, it certainly packs a punch. (Sorry for the long delay in getting back to you @razetime, but, this perhaps shows the code golf potential of the language...UPARSE is a killer. Someday maybe this will be a winning golfed whitespace interpreter! There's lots of stuff that could be skipped, like putting the [text!] or [integer!] type annotation on the parameter...I really would be interested to see how small this could go if that were the goal.)