Barriers to Source-To-Source Transformations

hostilefork · October 11, 2024, 2:23am

Doing source-to-source transformations seems like something Rebol would be great at.

Why not LOAD your file (powered by TRANSCODE), do some structural tweaks, and spit it back out again with SAVE (powered by MOLD)? What could be simpler?

But the first thing you'll notice is: Oops, all your comments are gone.

Next you'll see the indentation and spacing are almost all thrown out. Cells in arrays store a NEWLINE_BEFORE marker to record if they had a newline before them (and there's an additional NEWLINE_AT_TAIL marker on the array as a whole)...but that's all you get. MOLD just uses some heuristics to indent the code in a canonical style.
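
To make the loss concrete, here is a minimal C sketch of a molder that has only a per-cell newline bit to work with. The names (MoldCell, Mold_Cells) are hypothetical and much simpler than anything in the real interpreter; the point is only that every run of spaces collapses to a single space because nothing else was recorded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a cell: a word plus one layout bit. */
typedef struct {
    const char *text;     /* spelling of the word */
    bool newline_before;  /* the only layout fact that survives loading */
} MoldCell;

/* Writes the cells to `out`. Each cell after the first is preceded by a
 * newline if flagged, otherwise by exactly one space. Returns the number
 * of bytes written, not counting the terminator. */
size_t Mold_Cells(char *out, size_t cap, const MoldCell *cells, size_t n) {
    assert(out != NULL && cap > 0);
    size_t len = 0;
    for (size_t i = 0; i < n; ++i) {
        size_t word_len = strlen(cells[i].text);
        assert(len + word_len + 1 < cap);  /* separator, word, terminator */
        if (i > 0)
            out[len++] = cells[i].newline_before ? '\n' : ' ';
        memcpy(out + len, cells[i].text, word_len);
        len += word_len;
    }
    out[len] = '\0';
    return len;
}
```

Feeding it cells loaded from "a b  c   d" gives back "a b c d": the two-space and three-space gaps cannot be reproduced, since the cells never stored them.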

Vague Proposals Appear, Then Fizzle

If you wait long enough, someone will suggest that there be a mode in which TRANSCODE doesn't toss out the whitespace information, but keeps it around somehow.

One terrible version of this idea actually interleaves weird values into the arrays, where your processing code would have to skip them:

>> transcode:verbatim "a b  c   d"
== [a b #[space! 2] c #[space! 3] d]

That is a complete dead end, as it would mean any transformation code you used could not be shared with "normal code".

So what you'd need instead would be some way to expand out that NEWLINE_BEFORE flag into richer information. But cells use pretty much all their bits (although there are 32 bits free in the headers of cells in the 64-bit build...however I'd be loath to use it for this.)

One idea would be to load the data into the cells as in the :VERBATIM example above, except have operations skip over them as if they weren't there.

On the one hand, this would fundamentally screw with the "simple" design the language is aiming for.
On the other hand, this is essentially what UTF-8 Everywhere does for String in managing the series length (in codepoints) as being independent of the number of physical units (in bytes). It would just be extending that logic to Array, meaning arrays would have to be enumerated with a function like Skip_Element() or Step_Back_Element() instead of just incrementing a pointer with ++ and --.
I've actually thought about moving away from pure pointer incrementing and decrementing in order to allow arrays to be built out of discontiguous segments, which could help several scenarios in optimization...with reconciliation into contiguous bits being done on an as-needed basis.
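
A rough C sketch of what element-wise stepping might look like, assuming formatting is stored as hidden cells inside the array. Everything here (FmtCell, the KIND_ constants, Fmt_Skip_Element) is a hypothetical stand-in for illustration, not the interpreter's real cell layout; it only shows logical length diverging from physical slot count, the way codepoint length diverges from byte length in a UTF-8 string.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cell kinds: real values, hidden formatting, and an end marker. */
typedef enum { KIND_END, KIND_WORD, KIND_SPACING } FmtKind;

typedef struct {
    FmtKind kind;
    const char *text;  /* spelling, used by KIND_WORD */
    int count;         /* run length of spaces, used by KIND_SPACING */
} FmtCell;

/* Returns the first visible cell at or after `cell` (possibly the end marker). */
const FmtCell *Fmt_First_Element(const FmtCell *cell) {
    assert(cell != NULL);
    while (cell->kind == KIND_SPACING)
        ++cell;
    return cell;
}

/* Steps from one visible cell to the next, hopping over formatting cells.
 * Plays the role of the Skip_Element() idea; not valid on the end marker. */
const FmtCell *Fmt_Skip_Element(const FmtCell *cell) {
    assert(cell != NULL && cell->kind != KIND_END);
    return Fmt_First_Element(cell + 1);
}

/* Logical length: the number of visible cells, not physical slots. */
size_t Fmt_Array_Length(const FmtCell *head) {
    size_t n = 0;
    const FmtCell *c = Fmt_First_Element(head);
    while (c->kind != KIND_END) {
        ++n;
        c = Fmt_Skip_Element(c);
    }
    return n;
}
```

With the seven physical slots for `a b <2 spaces> c <3 spaces> d <end>`, Fmt_Array_Length reports 4, and ordinary code that only ever steps through Fmt_Skip_Element never sees the spacing cells. The cost is that a bare `++` on a cell pointer is no longer a safe way to advance.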

Another concept would be to have cells be able to "redirect" the Extra pointer in the second slot, and set a header flag saying "the Extra in this cell can be found by following the pointer in the extra slot". And then that can point to an allocation somewhere that holds the formatting information.
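
As a sketch of that redirection, assuming a drastically simplified cell with one header word and one extra slot: the flag name, the RedirFormat struct, and the accessor functions below are all made up for illustration and say nothing about how real cells are laid out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REDIR_FLAG_EXTRA_REDIRECTED 0x1u  /* hypothetical header bit */

/* Hypothetical side allocation: the displaced Extra plus layout details. */
typedef struct {
    uintptr_t real_extra;      /* what the cell's extra slot used to hold */
    uint16_t spaces_before;    /* horizontal whitespace preceding the cell */
    uint16_t newlines_before;  /* blank lines preceding the cell */
} RedirFormat;

/* Simplified cell: a header word, then an extra slot that holds either the
 * value itself or a pointer to a RedirFormat, depending on the header flag. */
typedef struct {
    uintptr_t header;
    union {
        uintptr_t direct;
        RedirFormat *redirect;
    } extra;
} RedirCell;

/* Every reader of Extra goes through this, so the redirect stays invisible. */
uintptr_t Redir_Cell_Extra(const RedirCell *cell) {
    if (cell->header & REDIR_FLAG_EXTRA_REDIRECTED)
        return cell->extra.redirect->real_extra;
    return cell->extra.direct;
}

/* Returns the formatting record, or NULL if the cell carries none. */
const RedirFormat *Redir_Cell_Format(const RedirCell *cell) {
    if (cell->header & REDIR_FLAG_EXTRA_REDIRECTED)
        return cell->extra.redirect;
    return NULL;
}

/* Moves the current Extra into `fmt` and points the cell at it. */
void Redir_Attach_Format(RedirCell *cell, RedirFormat *fmt,
                         uint16_t spaces, uint16_t newlines) {
    assert(!(cell->header & REDIR_FLAG_EXTRA_REDIRECTED));
    fmt->real_extra = cell->extra.direct;
    fmt->spaces_before = spaces;
    fmt->newlines_before = newlines;
    cell->extra.redirect = fmt;
    cell->header |= REDIR_FLAG_EXTRA_REDIRECTED;
}
```

The tradeoff in this sketch is one header bit and an extra pointer hop on every Extra read for redirected cells, in exchange for leaving array length and enumeration untouched.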

Not Happening Any Time Soon, But Worth Remembering

I can't hold Ren-C's interesting ideas hostage to such things; there's too much other stuff going on.

But noticing the similarity of the UTF-8 Everywhere transition to the idea of skipped formatting cells...and then tying that into what could be the advantages of discontiguous arrays...got me to think.

We could perhaps have discontiguous strings as well--leveraging illegal UTF-8 bytes as the signal of pointers embedded in the data. It might be interesting if the mechanics could be merged.
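
For the string half of that thought, a small C sketch under stated assumptions: the byte 0xFF never occurs in well-formed UTF-8, so here it is used as a marker meaning "the next sizeof(pointer) bytes are the address of the following segment". The marker choice, the segment layout, and the function names are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SEG_LINK 0xFF  /* 0xFF is never a valid byte in well-formed UTF-8 */

/* Writes a link marker at `at`, followed by the raw bytes of a pointer to
 * the next segment. Returns the number of bytes written. */
size_t Seg_Write_Link(unsigned char *at, const unsigned char *next) {
    assert(at != NULL && next != NULL);
    at[0] = SEG_LINK;
    memcpy(at + 1, &next, sizeof next);
    return 1 + sizeof next;
}

/* Counts codepoints across linked segments. The final segment ends with a
 * zero byte. Assumes the text bytes themselves are well-formed UTF-8. */
size_t Seg_Count_Codepoints(const unsigned char *p) {
    assert(p != NULL);
    size_t count = 0;
    while (*p != '\0') {
        if (*p == SEG_LINK) {
            const unsigned char *next;
            memcpy(&next, p + 1, sizeof next);  /* hop to the next segment */
            p = next;
            continue;
        }
        if ((*p & 0xC0) != 0x80)  /* not a continuation byte: new codepoint */
            ++count;
        ++p;
    }
    return count;
}
```

The walking loop has the same shape as stepping over hidden cells in an array: a reserved pattern in the data says "this is not content, follow it or skip it", which is where the two mechanisms might plausibly share machinery.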