Removing ISSUE! from the ANY-WORD! category

hostilefork · September 25, 2020, 4:24am

As with many other things right now...a bunch of gears in my head calculated across all the things we've seen and came up with a potential option that would split the difference and offer new and interesting possibilities...

Unify CHAR! and ISSUE! into non-series UTF-8 data type

I don't know what the best name for this would be. UTF-8! ? TOKEN! ? But whatever you called it...

This would mean that #a would fit into the bytes directly in a cell... as would short strings like #1234567. (There are 8 bytes of payload for UTF-8 on 32-bit platforms, in addition to enough space for the Unicode codepoint of the first character.) You have 16 bytes on 64-bit platforms. If you're outside the space restrictions, an immutable series node could be used.
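
To make the storage idea concrete, here is a rough sketch in C. It is not Ren-C's actual cell layout: TokenCell, make_token, token_utf8 and PAYLOAD_SIZE are hypothetical names, the payload is assumed to be two pointers wide (which gives the 8 and 16 bytes mentioned above), the terminator is assumed to count against the payload, and the cached first codepoint is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Assumption: payload is two pointers wide (8 bytes on 32-bit, 16 on 64-bit). */
#define PAYLOAD_SIZE ((size_t)(2 * sizeof(void *)))

/* Hypothetical cell, for illustration only. */
typedef struct {
    bool inline_utf8;                 /* true: bytes live in the cell itself */
    union {
        char bytes[PAYLOAD_SIZE];     /* UTF-8 plus terminating '\0' */
        const char *node;             /* stand-in for an immutable series node */
    } payload;
} TokenCell;

/* Store the UTF-8 inline when it fits with its terminator; otherwise
   point at it (the caller's storage plays the role of the series node). */
static TokenCell make_token(const char *utf8) {
    assert(utf8 != NULL);
    TokenCell cell;
    size_t size = strlen(utf8) + 1;
    if (size <= PAYLOAD_SIZE) {
        cell.inline_utf8 = true;
        memcpy(cell.payload.bytes, utf8, size);
    } else {
        cell.inline_utf8 = false;
        cell.payload.node = utf8;
    }
    return cell;
}

/* Read access looks the same to callers whichever storage was chosen. */
static const char *token_utf8(const TokenCell *cell) {
    return cell->inline_utf8 ? cell->payload.bytes : cell->payload.node;
}
```

Under these assumptions both #a and #1234567 stay inside the cell on either platform, and only longer tokens need the out-of-line node.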

When I say it's a non-series, that would mean that it is immutable and wouldn't break down into any further character components. So like ANY-WORD!, if you wanted to know its length or pick characters out of it, you would have to alias it as a TEXT! first. That provides the necessary atomicity, because you don't want this:

>> ch: pick "abc" 2
== #b  ; the new idea

>> pick ch 1
== #b  ; !!! This would be bad!

It's not just that this leads to infinite loops; there's also a sort of cognitive trap (that @earl always warned against)...when you break a composite into single-unit composites of itself...without acknowledging that "something changed". It's the reason why you don't want first #{1020} to be #{10}.

Any downsides of atomic immutability have been coped with

Since following Rebol2 is usually a big driver for Red, it might surprise you that they went with the R3-Alpha change:

red>> length? #123
** Script Error: length? does not allow issue for its series argument

I explain the reasons why having it be an ANY-WORD! is a timebomb, and that numbers are useful... like #(555)-207-6382 or #555.251.8517 being phone numbers. Or #0080FF as a good representation for an RGB color (better than 0.128.255 for most purposes).

This allows nicer-looking characters

I've always thought #"a" was ugly, and endorsed the use of TEXT! in PARSE...even though it was slower. Because the extra decoration was too noisy...

>> parse "abc" [#"a" #"b" #"c"]   ; yuck!
== []

>> parse "abc" ["a" "b" "c"]  ; slower, but prettier
== []

But with most characters not requiring quotes to escape, it becomes much less of a problem to use the "right" way:

>> parse "abc" [#a #b #c]   ; not bad...actually, maybe better than text!
== []

You'd be allowed to write it with the quotes or without, so it could be compatible.

Case-sensitivity makes sense for "issue-like" TOKEN!

With a CHAR! and ISSUE! unification, it would be fairly obvious to have the case-sensitive comparison apply to longer strings. We could do that today, but it's more of a slam dunk if they're the same type:

>> parse "aBc" [#aB #c]
== []

>> parse "aBc" [#Ab #c]
; null
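
The difference between the two matching rules can be sketched in C. This is only an illustration: match_token and match_text are hypothetical helpers, not PARSE internals, and the text rule assumes the usual Rebol default of case-insensitive matching, folded here for ASCII only.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Token rule: the bytes must match exactly, so case matters. */
static bool match_token(const char *input, const char *token) {
    assert(input != NULL && token != NULL);
    return strncmp(input, token, strlen(token)) == 0;
}

/* Text rule, for contrast: ASCII-only case folding before comparing. */
static bool match_text(const char *input, const char *text) {
    assert(input != NULL && text != NULL);
    for (; *text != '\0'; ++input, ++text) {
        if (tolower((unsigned char)*input) != tolower((unsigned char)*text))
            return false;   /* also stops safely when input runs out first */
    }
    return true;
}
```

With these helpers, match_token("aBc", "aB") succeeds and match_token("aBc", "Ab") fails, mirroring the two PARSE calls above, while match_text accepts both.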

Fixes API and command line escaping problems

I've faced situations where I'm passing strings in places where quotes are the only option...like C code or the bash shell. And currently, CHAR! has no representation that works without the ugly backslashing:

REBVAL *ch = rebValue("pick [#\"a\" #\"b\" #\"c\"] 2");

r3 --do "pick [#\"a\" #\"b\" #\"c\"] 2"

It's very annoying, and it limits the appeal of the character type. But this would tidy up:

REBVAL *ch = rebValue("pick [#a #b #c] 2");

r3 --do "pick [#a #b #c] 2"

Additionally, I've proposed that BINARY! be moved to using & so that it can come in &FFEE and &{FFEE} and &"FFEE" forms. If that were done, then you could use escaped issue-like TOKEN!s in such cases as well:

REBVAL *token = rebValue("pick [#{ab cd} #{ef gh}] 2");

(You'll need to support spaces in this datatype, else you wouldn't be able to get the "CHAR!" for space covered... #" " ... so this would offer #{ } as an alternate form.)

We can replace two not great names

Although CHAR! said what it was, it breaks the desired "non-abbreviations"...and fixing it would require the too-long CHARACTER!.

ISSUE! wasn't very descriptive, it was just a kind of arbitrary string with a pound in front of it. But if it's packed and immutable, then we might get some kind of meaning about fitness for purpose out of the name TOKEN!. I dunno. Whatever it's called, it feels like something could beat CHAR! and ISSUE! and combine them...

There's precedent for not having the atomic type range-limited

There's no BYTE! datatype. So when you ask for an item out of a BINARY! you get a value that is "indivisible" (i.e. an INTEGER!, not a single-element BINARY!). But the type itself ranges over more states than it's picking from:

r3-alpha>> bin: #{0304}
== #{0304}

r3-alpha>> pick bin 1
== 3

r3-alpha>> append bin 1020
** Script error: value out of range: 1020

The world didn't end because of the lack of a BYTE! type. Really, since it's just a constraint on an existing type, it's not clear that it needs its own representation in the data format.
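
The "constraint on an existing type" idea can be sketched in C. Binary, binary_pick and binary_append are hypothetical names for illustration and do not reflect how R3-Alpha implements BINARY!; the fixed capacity is just to keep the sketch short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed-capacity stand-in for BINARY!. */
typedef struct {
    uint8_t data[16];
    size_t len;
} Binary;

/* PICK hands back a plain integer (1-based index), or -1 when out of bounds. */
static long binary_pick(const Binary *bin, size_t index) {
    assert(bin != NULL);
    if (index < 1 || index > bin->len) return -1;
    return bin->data[index - 1];
}

/* APPEND takes the wider integer type and enforces the 0..255 constraint
   at the point of use, so no separate byte type is needed. */
static bool binary_append(Binary *bin, long value) {
    assert(bin != NULL);
    if (value < 0 || value > 255) return false;      /* "value out of range" */
    if (bin->len >= sizeof bin->data) return false;  /* sketch-only capacity limit */
    bin->data[bin->len++] = (uint8_t)value;
    return true;
}
```

The element type here is wider than what the container can hold, and the check lives in the operation rather than in a dedicated datatype.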

I think we can look at CHAR! the same way. How many functions are there out there that want to do one thing with a CHAR! and a different thing with an ISSUE! ?

Changing BINARY! to & would still load historical BINARY!

It's not technically necessary that #{...} become one of the three token representations (#x #"x" #{x}). But it is made especially coherent with this proposal, and the concept has been on my mind for some time.

I've pointed out that if the TOKEN! (issue) loading rules permitted multi-line data, then you could still LOAD any historical data...you'd just have to look for cases that matched BINARY! and convert appropriately. All-hex issues would become BINARY! too, so you'd have to discern that by context. Or the system would have to keep track of some kind of "loaded with {} syntax" bit.

Anyway, point being this is a less drastic change than the earlier idea of using & for characters. I decided that never really looked good when the & abutted characters...it's too squiggly, so you need at least &"a" instead of &a. But # doesn't have that problem... #a looks fairly nice.

My HTML character codes idea thus would probably just have to fold into the historical escaping, as #"^(AElig)". Because you would want append "text" #AElig to give "textAElig", not interpret the characters directly.

It frees another internal datatype number

Cell bits are limited, and every superfluous usage of them that can be eliminated opens doors to new types, or optimizations that use cell state as a trick.

I'm feeling pretty stoked about this!

Of course, whenever I come up with something I type it up as a preface to writing it, to see if I can talk myself out of it before going and wrangling the whole codebase. So I may hit some problem. But I'm not seeing it yet....

Most code could probably get away with CHAR? being a test for single-character tokens, and ISSUE? being a test for tokens longer than one character. Redbol module compatibility could be helped by the idea I mentioned of preserving whether there were quotes at load-time...if that stayed glued to the value, it might be enough.
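
As a rough sketch of what those two tests might boil down to, assuming the token's bytes are valid UTF-8 (utf8_codepoints, is_char_token and is_issue_token are hypothetical helper names, not existing API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Count codepoints in valid UTF-8 by skipping continuation bytes (10xxxxxx). */
static size_t utf8_codepoints(const char *utf8) {
    assert(utf8 != NULL);
    size_t count = 0;
    for (const unsigned char *p = (const unsigned char *)utf8; *p != '\0'; ++p) {
        if ((*p & 0xC0) != 0x80) ++count;
    }
    return count;
}

/* CHAR? as "exactly one codepoint", ISSUE? as "more than one". */
static bool is_char_token(const char *utf8)  { return utf8_codepoints(utf8) == 1; }
static bool is_issue_token(const char *utf8) { return utf8_codepoints(utf8) > 1; }
```

Note the count is in codepoints rather than bytes, so a single non-ASCII character still counts as a "char". An empty token satisfies neither test in this sketch; how that case should behave is left open here.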