Om: concatenative homoiconic language

hostilefork · June 13, 2024, 4:11am

(Sorry for the long delay in looking at this. Personal distractions, then technical issues with the DNS, etc., but we're back...)

This was a one-hit wonder on Hacker News a couple of years ago.

Finally sat down to read over this. And having not even heard the term "concatenative language" before... I also read the essay Why Concatenative Programming Matters.

Forth is claimed to be in this concatenative family, and is oft cited as a Rebol influence. I think @gchiu is the only person around here who still (maybe?) messes with Forth. He follows the pariah variant "8th", or at least used to. I look at Forth programs and can't reasonably consider them "source code" in any kind of human sense...an illegible Turing Tarpit "...in which everything is possible but nothing of interest is easy." At best it's compiler output: you can never find a place in the program that corresponds to intent...just implementation. Someone's off-the-cuff crafting of a bytecode.

My reading is that Om sought to be a "prefix-based lambda calculus of concatenative programming" (or something?) where expressions that lack enough operands to evaluate become... uh, pointfree, I think?

"If the computation cannot be completed (due to insufficient operands), the operator that names the operation is pushed onto the output program, followed by all remaining input terms."
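To make that quoted rule concrete for myself, here is a toy model in Python. It is only a sketch of the one sentence above, not Om's actual evaluator: the operator names (`dup`, `drop`, `swap`), their arities, and the choice to treat operands as literal terms are all assumptions made up for illustration.

```python
# Toy model of the quoted rule only. The operators, their arities, and the
# "operands are taken literally" behavior are hypothetical, not Om's design.
OPERATORS = {
    "dup": (1, lambda a: [a, a]),
    "drop": (1, lambda a: []),
    "swap": (2, lambda a, b: [b, a]),
}


def evaluate(terms):
    """Evaluate a flat list of prefix terms into an output program."""
    output = []
    i = 0
    while i < len(terms):
        term = terms[i]
        if term not in OPERATORS:
            # Plain terms pass through to the output unchanged.
            output.append(term)
            i += 1
            continue
        arity, fn = OPERATORS[term]
        operands = terms[i + 1 : i + 1 + arity]
        if len(operands) < arity:
            # Insufficient operands: push the operator itself, followed by
            # all remaining input terms, and stop.
            output.extend(terms[i:])
            break
        output.extend(fn(*operands))
        i += 1 + arity
    return output


print(evaluate(["dup", "a", "swap", "b", "c"]))
print(evaluate(["dup", "a", "swap", "b"]))
```

In this model the second call leaves `swap b` sitting in the output, which is the "becomes pointfree" effect as I understand it: the leftover program is a partially applied operation waiting for more input.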

The pitch seems like a purist prefix Forth that I'm less likely to be able to use than the Forth I already refuse to use.

I'll mention the odd terminology thing we got into, where the question of removing one level of quoting was distinguished from removing all levels of quoting.

@iArnold had his moment to shine in suggesting NOQUOTE, because DEQUOTE was too ambiguous.

>> unquote first ['''abc]
== ''abc

>> noquote first ['''abc]
== abc

So we simply avoid the term DEQUOTE.
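For anyone who wants the distinction spelled out outside of Ren-C, here is a rough Python model. The `Quoted` class and the `mold` helper are hypothetical names invented for this sketch, and raising an error when there is no quote level left to remove is an assumption, not a statement about what Ren-C does.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quoted:
    """A value plus a count of quote levels (hypothetical model)."""
    value: str
    depth: int


def unquote(q):
    """Remove exactly one level of quoting."""
    if q.depth < 1:
        # Assumption for this sketch: nothing to remove is an error.
        raise ValueError("value is not quoted")
    return Quoted(q.value, q.depth - 1)


def noquote(q):
    """Remove all levels of quoting, however many there are."""
    return Quoted(q.value, 0)


def mold(q):
    """Render the value with one apostrophe per quote level."""
    return "'" * q.depth + q.value


item = Quoted("abc", 3)        # models the '''abc picked out of the block
print(mold(unquote(item)))     # ''abc
print(mold(noquote(item)))     # abc
```

The point of having two names is that neither one has to guess which of the two behaviors "dequote" was supposed to mean.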

Of course, the craziest thing to absorb in Ren-C is the existence of "antiforms" (quoting level of -1) and quasiforms, which just show we're not in Kansas anymore.

A Justification of Generalized Isotopes

Rightly or wrongly, I'm now anchored to the semantics of an isotopic system.

"unicode-correct: any UTF-8 text (without byte-order marker) defines a valid Om program."
...
"Strings are automatically normalized to NFD, but can be explicitly normalized to NFKD using the normalize operation"

There is curious attention to Unicode-isms for a project that is so... otherwise minimal. I note that the implementation uses the obscenely complex dependency of ICU4C--which is not something that would be considered for the purposes of our project.

Ren-C also standardizes on UTF-8... but uses "UTF-8 Everywhere". This means it doesn't just enforce UTF-8 for source code, but also uses UTF-8 as the runtime representation of strings. (As far as I can tell, Om uses fixed-size codepoints at runtime for strings, despite linking to the UTF-8 Everywhere manifesto.)
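As a quick illustration of the storage difference between the two approaches, here is a small Python sketch. It says nothing about how Om or Ren-C actually lay out memory; `storage_sizes` is a hypothetical helper that just compares the UTF-8 encoding of a string with a fixed four-bytes-per-codepoint encoding (UTF-32).

```python
def storage_sizes(text):
    """Return (utf8_bytes, fixed_width_bytes) for a string.

    UTF-32 stands in for "fixed-size codepoints": 4 bytes each.
    """
    utf8 = len(text.encode("utf-8"))
    fixed = len(text.encode("utf-32-le"))  # -le avoids the BOM prefix
    return utf8, fixed


print(storage_sizes("hello"))   # (5, 20)
print(storage_sizes("héllo"))   # (6, 20)
```

The tradeoff is the usual one: the fixed-width form gives constant-time indexing by codepoint, while UTF-8 stays compact and matches what is on disk, at the cost of scanning to find the Nth codepoint.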

Normalization is something that so far we have not addressed, though I've written about the topic:

O Noes, Unicode Normalization

Om only mentions the decomposed forms NFD/NFKD. That is to say, if you had a single codepoint in your file on disk like an e with an accent (é), it would be loaded into memory as a two-codepoint sequence: an e followed by a combining accent.
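This is easy to see with Python's standard `unicodedata` module. The sketch below only shows what the normalization forms do to a string; it is not a claim about how Om calls into ICU4C. The `codepoints` helper is a name made up for the example.

```python
import unicodedata


def codepoints(text):
    """List the codepoints of a string as U+XXXX labels."""
    return [f"U+{ord(ch):04X}" for ch in text]


composed = "\u00e9"  # é as a single precomposed codepoint
decomposed = unicodedata.normalize("NFD", composed)

print(codepoints(composed))     # ['U+00E9']
print(codepoints(decomposed))   # ['U+0065', 'U+0301']

# NFKD additionally applies compatibility decompositions, which NFD leaves
# alone. The "fi" ligature (U+FB01) is a standard example.
ligature = "\ufb01"
print(codepoints(unicodedata.normalize("NFD", ligature)))    # ['U+FB01']
print(codepoints(unicodedata.normalize("NFKD", ligature)))   # ['U+0066', 'U+0069']
```

Note that the decomposed form is also longer in UTF-8: the precomposed é is two bytes, while e plus U+0301 is three.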

This got me to dig deeper and I seriously don't know at this point where we should plant our Amish stake when it comes to Unicode. I got in arguments with ChatGPT and Claude about why combining characters would ever come after the characters they modify, as it breaks anyone's chances of writing a stream sink. This despite the fact that Unicode was designed in a relatively modern time, when you would assume people had learned their lessons and run their ideas through some kind of filter of intelligent people. But apparently not.
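Here is a small Python sketch of the stream sink complaint. It is a simplification: it groups a base character with any following characters whose canonical combining class is nonzero, which is not the full grapheme cluster algorithm. The function names are invented for the example.

```python
import unicodedata


def clusters(stream):
    """Group each base character with the combining marks that follow it.

    Simplified: "combining" here means a nonzero canonical combining class.
    """
    pending = ""
    for ch in stream:
        if pending and unicodedata.combining(ch) == 0:
            # Only now do we know the previous cluster is finished.
            yield pending
            pending = ch
        else:
            pending += ch
    if pending:
        yield pending


def emission_log(text):
    """Record how many input characters had been read at each emission."""
    seen = []

    def feed():
        for ch in text:
            seen.append(ch)
            yield ch

    return [(len(seen), cluster) for cluster in clusters(feed())]


print(list(clusters("e\u0301a")))
print(emission_log("e\u0301a"))
```

The log shows the problem: the sink cannot emit the "e" when it arrives, or even when the accent arrives. It has to wait until it has read the *next* base character (or hit end of stream) to know no more marks are coming. If modifiers came first, a cluster would be complete the moment its base character showed up.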