Removing ISSUE! from the ANY-WORD! category

hostilefork · March 8, 2019, 10:22am

A controversial move in R3-Alpha was changing ISSUE! from being a versatile string type to a more strict WORD!.

http://www.rebol.net/r3blogs/0108.html

Not everyone was a fan. It created several porting hassles, and it raised a question: if #123 was a legal ISSUE!, why couldn't you convert it to a WORD!?

http://www.rebol.net/cgi-bin/r3blog.r?view=0108#comments

One argument for it came from efficiency--that preprocessor references would commonly use the same text over and over, and that WORD!s were stored more efficiently as symbols. As with the use of tags like <opt> in function specs, we want to be able to do this efficiently for ANY-STRING!. Ren-C has gotten more efficient on this front by being able to store short string series in the series node...and with UTF-8 everywhere that means even longer strings can be stored in the node.

But another argument was the desire to have access to another "bindable" type.

Getting bindings by putting WORD! in PATH!/TUPLE!

One consequence of getting rid of the REFINEMENT! datatype is that a refinement is now really just a PATH! with a blank at the head. This provides cool options we did not have before:

'/one
'/one/two
'''/one
/one:
/one/two:
:/one
:/one/two
''':/one/two   ; etc. etc.

And words in those paths get bindings, which can be used or not used as you wish.

Generalized-TUPLE! is on the horizon, and that would give another option:

.one
.one.two

There's no technical reason not to allow SET-TUPLE! and GET-TUPLE!. One thought I had for what it might do--which would make sense in light of the predicate usage--would be a way of doing a CHAIN or specialization of some kind:

odd?: get .not.even?
odd?: :.not.even?

So with PATH! and TUPLE!, we now have two inert bindable types that look kind of like WORD! and would have the same efficiency profile, .foo and /foo. But then, we could perhaps get more with #.foo and #/foo. And depending on how we think of FILE! (still debated) maybe %.foo and %/foo could have bindings too.

In any case, it just feels to me like there's something unnatural about the ISSUE!-as-WORD! choice. And now that REFINEMENT! has been taken away, there's a pretty clear meaning of ANY-WORD! as being WORD!, SET-WORD!, and GET-WORD!...which makes the category name actually make sense.

Does anyone want to speak in favor of ISSUE! being an ANY-WORD!?

From what I understood, most everyone hated it. If it goes back to being an ANY-STRING!, does anyone see a problem?

I can't think of any existing usages that would break (in a meaningful way, e.g. nothing uses the binding).

hostilefork · March 8, 2019, 1:25pm

Because binding is involved in this...I'll point out this thread, where the question of relationship between strings and binding is raised:

If you're writing a preprocessor system you might argue that you'd want some context information along with #foo. But off the top of my head, there are more pressing examples where templatize {some $(foo) and $(bar)} would like to be able to know what foo and bar you mean, without having to pass those in explicitly as templatize/with {some $(foo) and $(bar)} [foo bar].

It just doesn't seem to make sense that ISSUE! should be an ANY-WORD!, when precedent clearly favors #123 being an ISSUE! and having string-like properties. If new ideas come along that make it easier to associate bindings with strings, that would be a good development. And we should be open to how such tricks could be done. But if those ideas don't come, I don't see what justifies making ISSUE! a word type as the exception to the rule.

hostilefork · September 25, 2020, 4:24am

As with many other things right now...a bunch of gears in my head calculated across all the things we've seen and came up with a potential option that would split the difference and offer new and interesting possibilities...

Unify CHAR! and ISSUE! into non-series UTF-8 data type

I don't know what the best name for this would be. UTF-8! ? TOKEN! ? But whatever you called it...

This would mean that #a would fit into the bytes directly in a cell... as would short strings like #1234567. (There's 8 bytes of payload for UTF-8 on 32-bit platforms, in addition to enough space for the unicode codepoint of the first character.) You have 16 bytes on 64-bit platforms. If you're outside the space restrictions, an immutable series node could be used.
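As a rough sketch of the packing decision described above (not Ren-C's actual cell layout): the `Token` struct, `TOKEN_INLINE_CAPACITY`, `make_token`, and `token_utf8` are hypothetical names, and tying the capacity to twice the pointer size and reserving one byte for a terminator are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Assumed inline capacity: 8 bytes on 32-bit platforms, 16 on 64-bit,
   approximated here as twice the pointer size. */
#define TOKEN_INLINE_CAPACITY (2 * sizeof(void *))

typedef struct {
    bool is_inline;                            /* true if the text lives in the cell */
    char inline_bytes[TOKEN_INLINE_CAPACITY];  /* UTF-8 bytes plus terminator */
    const char *series;                        /* stand-in for an immutable series node */
} Token;

/* Store the UTF-8 text in the cell when it fits along with its terminator;
   otherwise only point at the caller's data, which is assumed immutable. */
Token make_token(const char *utf8) {
    assert(utf8 != NULL);
    Token t;
    memset(&t, 0, sizeof t);
    size_t size = strlen(utf8);                /* size in bytes, not codepoints */
    if (size + 1 <= TOKEN_INLINE_CAPACITY) {
        t.is_inline = true;
        memcpy(t.inline_bytes, utf8, size + 1);
    } else {
        t.series = utf8;
    }
    return t;
}

/* Read access works the same way for both storage strategies. */
const char *token_utf8(const Token *t) {
    return t->is_inline ? t->inline_bytes : t->series;
}
```

Under these assumptions, `make_token("a")` and `make_token("1234567")` stay inline on both 32-bit and 64-bit builds, while a longer string falls back to the pointer.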

When I say it's a non-series, that would mean that it is immutable and wouldn't break down into any further character components. So like ANY-WORD!, if you wanted to know its length or pick characters out of it, you would have to alias it as a TEXT! first. That provides the necessary atomicity, because you don't want this:

>> ch: pick "abc" 2
== #b  ; the new idea

>> pick ch 1
== #b  ; !!! This would be bad!

It's not just about leading to infinite loops, there's a sort of cognitive trap (that @earl always warned against)...when you break a composite into single unit composites of itself...without acknowledging that "something changed". It's the reason why you don't want first #{1020} to be #{10}.

Any downsides of atomic immutability have been coped with

Since following Rebol2 is usually a big driver for Red, it might surprise you that they went with the R3-Alpha change:

red>> length? #123
** Script Error: length? does not allow issue for its series argument

I explain the reasons why having it be an ANY-WORD! is a timebomb, and that numbers are useful... like #(555)-207-6382 or #555.251.8517 being phone numbers. Or #0080FF as a good representation for an RGB color (better than 0.128.255 for most purposes).

This allows nicer-looking characters

I've always thought #"a" was ugly, and endorsed the use of TEXT! in PARSE...even though it was slower. Because the extra decoration was too noisy...

>> parse "abc" [#"a" #"b" #"c"]   ; yuck!
== []

>> parse "abc" ["a" "b" "c"]  ; slower, but prettier
== []

But with most characters not requiring quotes to escape, it becomes much less of a problem to use the "right" way:

>> parse "abc" [#a #b #c]   ; not bad...actually, maybe better than text!
== []

You'd be allowed to write it with the quotes or without, so it could be compatible.

Case-sensitivity makes sense for "issue-like" TOKEN!

With a CHAR! and ISSUE! unification, it would be fairly obvious to have the case-sensitive comparison apply to longer strings. We could do that today, but it's more of a slam dunk if they're the same type:

>> parse "aBc" [#aB #c]
== []

>> parse "aBc" [#Ab #c]
; null

Fixes API and command line escaping problems

I've faced situations where I'm passing strings in places where quotes are the only option...like C code or the bash shell. And currently, CHAR! has no representation that works without the ugly backslashing:

REBVAL *ch = rebValue("pick [#\"a\" #\"b\" #\"c\"] 2");

r3 --do "pick [#\"a\" #\"b\" #\"c\"] 2"

It's very annoying, and it limits the appeal of the character type. But this would tidy up:

REBVAL *ch = rebValue("pick [#a #b #c] 2");

r3 --do "pick [#a #b #c] 2"

Additionally, I've proposed that BINARY! be moved to using & so that it can come in &FFEE and &{FFEE} and &"FFEE" forms. If that were done, then you could use escaped issue-like TOKEN!s in such cases as well:

REBVAL *token = rebValue("pick [#{ab cd} #{ef gh}] 2");

(You'll need to support spaces in this datatype, else you wouldn't be able to get the "CHAR!" for space covered... #" " ... so this would offer #{ } as an alternate form.)

We can replace two not great names

Although CHAR! said what it was, it breaks the desired "non-abbreviations"...and fixing it would require the too-long CHARACTER!.

ISSUE! wasn't very descriptive, it was just a kind of arbitrary string with a pound in front of it. But if it's packed and immutable, then we might get some kind of meaning about fitness for purpose out of the name TOKEN!. I dunno. Whatever it's called, it feels like something could beat CHAR! and ISSUE! and combine them...

There's precedent for not having the atomic type range-limited

There's no BYTE! datatype. So when you ask for an item out of a BINARY! you get a value that is "indivisible" (e.g. not a single-element BINARY!). But the type itself ranges over more states than it's picking from:

r3-alpha>> bin: #{0304}
== #{0304}

r3-alpha>> pick bin 1
== 3

>> append bin 1020
** Script error: value out of range: 1020

The world didn't end because of the lack of a BYTE! type. Really, since it's just a constraint on an existing type, it's not clear that it needs its own representation in the data format.
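The same "constraint on an existing type" idea can be sketched in C. This is only an illustration of the pattern, not how any Rebol interpreter implements BINARY!; the `Binary` struct, `binary_pick`, and `binary_append` are hypothetical names, and the fixed 16-byte capacity exists only to keep the sketch short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t data[16];   /* fixed capacity, for the sketch only */
    size_t len;
} Binary;

/* Picking yields a plain integer rather than a one-byte Binary.
   The index is 1-based, as in the console session above. */
int64_t binary_pick(const Binary *b, size_t index) {
    assert(index >= 1 && index <= b->len);
    return b->data[index - 1];
}

/* Appending accepts the wider integer type and enforces the byte range
   itself, so no separate "byte" type is needed. */
bool binary_append(Binary *b, int64_t value) {
    if (value < 0 || value > 255)
        return false;                 /* corresponds to "value out of range" */
    if (b->len >= sizeof b->data)
        return false;                 /* capacity limit of this sketch */
    b->data[b->len++] = (uint8_t)value;
    return true;
}
```

The range check lives in the operation that needs it, which is the sense in which the integer type "ranges over more states" than the container it is picked from.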

I think we can look at CHAR! the same way. How many functions are there out there that want to do one thing with a CHAR! and a different thing with an ISSUE! ?

Changing BINARY! to & would still load historical BINARY!

It's not technically necessary that #{...} become one of the three token representations (#x #"x" #{x}). But it is made especially coherent with this proposal, and the concept has been on my mind for some time.

I've pointed out that if the TOKEN! (issue) loading rules permitted multi-line data, then you could still LOAD any historical data...you'd just have to look for cases that matched BINARY! and convert appropriately. All-hex issues would become BINARY! too, so you'd have to discern that by context. Or the system would have to keep track of some kind of "loaded with {} syntax" bit.
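To make the ambiguity concrete, here is a hedged sketch of a check a converter might run over loaded token text to flag candidates for BINARY!. The function name `looks_like_hex_binary` is hypothetical, and the rules it applies (skip whitespace, require only hex digits, require an even digit count) are assumptions about what "matched BINARY!" would mean, not a documented loader rule.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true when the text could be read as hex-encoded bytes:
   nothing but hex digits and whitespace, with an even number of digits. */
bool looks_like_hex_binary(const char *text) {
    assert(text != NULL);
    size_t digits = 0;
    for (const unsigned char *p = (const unsigned char *)text; *p != '\0'; ++p) {
        if (isspace(*p))
            continue;                 /* assumed: data may span spaces and lines */
        if (!isxdigit(*p))
            return false;
        ++digits;
    }
    return digits % 2 == 0;
}
```

Note that a check like this accepts text such as `ab cd` just as readily as `FFEE`, which is why context or a "loaded with {} syntax" bit would still be needed to decide what the author meant.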

Anyway, the point is that this is a less drastic change than the earlier idea of using & for characters. I decided that never really looked good when the & abutted characters...it's too squiggly, so you need at least &"a" instead of &a. But # doesn't have that problem... #a looks fairly nice.

My HTML character codes idea thus would probably just have to fold into the historical escaping, as #"^(AElig)". Because you would want append "text" #AElig to give "textAElig", not interpret the characters directly.

It frees another internal datatype number

Cell bits are limited, and every superfluous usage of them that can be eliminated opens doors to new types, or optimizations that use cell state as a trick.

I'm feeling pretty stoked about this!

Of course, whenever I come up with something I type it up as a preface to writing it, to see if I can talk myself out of it before going and wrangling the whole codebase. So I may hit some problem. But I'm not seeing it yet....

Most code could probably get away with CHAR? being a test for single-character tokens, and ISSUE? for greater than 1. Redbol module compatibility could be helped by that idea I mention of preserving whether there were quotes at load-time...if that stayed glued to the value, it might be enough.
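A minimal sketch of that split, assuming the token's text is well-formed, NUL-terminated UTF-8. The names `utf8_codepoints`, `token_is_char`, and `token_is_issue` are hypothetical and do not correspond to actual Ren-C functions; a NUL codepoint cannot be represented with this simplification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Count codepoints in well-formed UTF-8: every byte that is not a
   continuation byte (bit pattern 10xxxxxx) starts a new codepoint. */
size_t utf8_codepoints(const char *utf8) {
    assert(utf8 != NULL);
    size_t count = 0;
    for (const unsigned char *p = (const unsigned char *)utf8; *p != '\0'; ++p) {
        if ((*p & 0xC0) != 0x80)
            ++count;
    }
    return count;
}

/* CHAR?-style test: exactly one codepoint, however many bytes it takes. */
bool token_is_char(const char *utf8) {
    return utf8_codepoints(utf8) == 1;
}

/* ISSUE?-style test: more than one codepoint. */
bool token_is_issue(const char *utf8) {
    return utf8_codepoints(utf8) > 1;
}
```

The count has to be in codepoints rather than bytes, since a single character like Æ occupies two bytes in UTF-8 but should still pass the single-character test.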

IngoHohmann · September 25, 2020, 5:32pm

Just thinking out loud...
What if tokens could have non-active bindings? (So only to numbers, strings, and other issues).
Then you could say
>> #AElig: "Æ"
>> #AElig
== "Æ"
>> #aelig
== #aelig

Just a wild idea ...

hostilefork · September 25, 2020, 6:54pm

I wanted to avoid the idea of #123 having a binding at all, to keep from the troubles the R3-Alpha and Red ISSUE! had. So if it did have a binding, it would have to be only if it were non-numeric. And I also thought it would carry the codepoint cached, which (today) lives where the binding lives.

But further, a binding would run up against one strong point of this idea: it allows the efficient creation of things like #000000 and #0000FF and #FF0001 ad nauseam without causing any allocation...due to packing into the cell. (If you crossed your platform's cell payload size limit and needed an immutable series allocation, it would still work, just without that advantage.) 7-character ASCII on 32-bit and 15-character ASCII on 64-bit would fit.

As a wacky thing enabled by the UTF-8 non-conflict with pointers trickery, read-only text that aliases a TOKEN! could actually use their payloads as immediate text as well. This would mean as text! on a character or short hex sequence wouldn't perform any allocations either.

This idea is looking like a home run from so many different angles.

IngoHohmann · October 13, 2020, 5:36pm

Things to think about ...
Should it be possible to put issue!s into paths, or
should "/" be a valid character in issue!s?

(Currently #a/b is an issue!, and a/#b is a syntax error).

hostilefork · October 13, 2020, 5:40pm

I believe it is now settled that #a/b is an issue, and #/ is the slash "char!".

What we might consider instead is if strings in paths are immutable, e.g.:

>> path: 'a/"bc"/d
>> second path
== #bc

The notion taking shape is that a PATH! is not meant as a generic container for putting types in. It's closer in spirit to a WORD!. So it is a distant relative of GROUP! and BLOCK!.