"ISSUECHAR!" (=> TOKEN!) is Hitting It Out of the Park

hostilefork · October 12, 2020, 5:29am

The unification of the ISSUE! and CHAR! types is having all kinds of better-than-expected consequences. It is undeniably a good change. (Though don't be surprised if Red finds a way to deny it... they still don't even bring up the existence of the concept of UTF-8 Everywhere, even as it stares them in the face as the answer to things they need!)

I've kept the ISSUE! name for now, just to not change too many things at once. And CHAR! is a "type constraint" of a single-character token. So although the mechanics behind this sort of constraint are destined to change...you can still use CHAR? and match single-character tokens in PARSE via a CHAR! rule:

>> parse [#a] [&char?]
== [#a]

>> parse [#aa] [&char?]
** Error: ...

So don't go changing all your CHAR! to ISSUE!. If you mean CHAR!, say it. Just bear in mind that it's a constraint on tokens... not its own unique data type.

The biggest change is this:

Use CODEPOINT OF to get the codepoint of a single-character ISSUE!, instead of to integer!. (Currently you can also PICK the codepoints off of a token positionally, so you could also use FIRST. But that's not as clear, and I'm also not sure if the behavior is desirable.)
There's a hack in so that MAKE CHAR! of an INTEGER! will still work (for now) to get a single-character issue form. But TO CHAR! of an INTEGER! no longer works. There's also now AS ISSUE! to do this...but I'm not 100% confident that this is what "AS" will mean. Using MAKE CHAR! is probably the best way to keep the callsites identified for migrating to a final solution.

Putting More "Everywhere" Into "UTF-8 Everywhere"

With the character type actually being an immutable class of string, it's no longer being stored in a cell as an integer codepoint. It's encoded UTF-8 bytes, living in the memory inside of a cell.

When all your strings are encoded as UTF-8, you don't really need the codepoint number all that often. Usually you're just looking for a character in a string, or appending it to one. That means having the encoding at hand is more important...and in those rare times you need CODEPOINT OF you pay the cost of decoding the UTF-8 into an integer.

It's the better option.
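To make the tradeoff concrete, here is a minimal C sketch of the two directions: encoding a codepoint into UTF-8 bytes (what would live in the cell) and decoding it back (the cost paid by something like CODEPOINT OF). The names `utf8_encode` and `utf8_decode` are hypothetical and are not the functions Ren-C actually uses; the decoder assumes well-formed input.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: encode one codepoint as UTF-8 into `out`, which
// must have room for 4 bytes. Returns the number of bytes written, or 0
// if `cp` is a surrogate or beyond U+10FFFF.
size_t utf8_encode(uint32_t cp, unsigned char *out) {
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;  // surrogates have no UTF-8 encoding
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

// Hypothetical helper: decode the first codepoint from well-formed UTF-8.
// No validation is done; this only illustrates the work a decode costs.
uint32_t utf8_decode(const unsigned char *bytes) {
    assert(bytes != NULL);
    unsigned char b0 = bytes[0];
    if (b0 < 0x80)
        return b0;
    if ((b0 & 0xE0) == 0xC0)
        return ((uint32_t)(b0 & 0x1F) << 6) | (uint32_t)(bytes[1] & 0x3F);
    if ((b0 & 0xF0) == 0xE0)
        return ((uint32_t)(b0 & 0x0F) << 12)
            | ((uint32_t)(bytes[1] & 0x3F) << 6)
            | (uint32_t)(bytes[2] & 0x3F);
    return ((uint32_t)(b0 & 0x07) << 18)
        | ((uint32_t)(bytes[1] & 0x3F) << 12)
        | ((uint32_t)(bytes[2] & 0x3F) << 6)
        | (uint32_t)(bytes[3] & 0x3F);
}
```

Appending a character to a UTF-8 string only needs the encoded bytes, so in this sketch the decode step is paid only when the integer is actually requested.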

Efficient Storage of Short Issues

There are 7 bytes of space in a cell on 32-bit platforms for UTF-8 encoding (1 byte for terminator), and 15 bytes on 64-bit platforms. That's enough space for the UTF-8 encoding of any single codepoint, and if you're just using ASCII that can be enough for many short strings.

So the benefits aren't just limited to single characters. The work done to support encoding in cells supports what would have been short issues too. Notably, RGB patterns like #FFFFFF come in at 6 characters...so they'll fit in a cell even on 32-bit platforms.
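As a rough illustration of the size check, here is a C sketch. It reads the figures above as 7 and 15 usable bytes with the terminator stored separately, which is my assumption. `InlineToken`, `fits_inline`, and `store_inline` are hypothetical names, not the actual cell layout or API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

// Assumed usable byte counts (terminator not included), per the text.
enum { INLINE_MAX_32 = 7, INLINE_MAX_64 = 15 };

// Hypothetical stand-in for the part of a cell that holds UTF-8 bytes.
typedef struct {
    char bytes[INLINE_MAX_64 + 1];  // room for the largest case + '\0'
} InlineToken;

// True if the UTF-8 string fits in `usable` bytes of in-cell storage.
bool fits_inline(const char *utf8, size_t usable) {
    assert(utf8 != NULL);
    return strlen(utf8) <= usable;
}

// Copy the string (with terminator) into the token if it fits.
bool store_inline(InlineToken *tok, const char *utf8, size_t usable) {
    assert(tok != NULL && usable <= INLINE_MAX_64);
    if (!fits_inline(utf8, usable))
        return false;  // a real implementation would allocate instead
    memcpy(tok->bytes, utf8, strlen(utf8) + 1);
    return true;
}
```

Since a single codepoint never needs more than 4 bytes in UTF-8, every character passes the 32-bit check, and so does a 6-byte RGB pattern like FFFFFF.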

Lighter Notation For Characters

Not needing the quotes around most characters makes a much nicer impression:

; with quotes

    >> first "abc"
    == #"a"

    >> if #"a" = first "abc" [print "match"]
    match

; without quotes

    >> first "abc"
    == #a

    >> if #a = first "abc" [print "match"]
    match

We know #a=b is different from #a = b due to whitespace significance, and it's nice to play a card from that hand here.

Of course not all characters fit the pattern, so people will hopefully be sensitive to the idea that #] is a space token up against a bracket, and #"]" is the bracket "token". Seems learnable.

Intuitive Similarity between #"1" and #1 Becomes Actual Similarity

Having TO INTEGER #"1" and TO INTEGER #1 act differently is just one of many examples of why this shouldn't be a difference.

Quote Notation Opens Up Forms for "ISSUE!"

Before, #"a b c" was an "invalid character". Now it's a valid token. This enables more dialect usages of a unique stringlike type.

...more...

It's nice being able to solve situations where you want characters in quotes for things like command lines, and can say "char: #a" instead of "char: #\"a\"". But taking the #{} form from BINARY! would be needed to really generalize that victory.

With tokens being generic enough to hold ANY-STRING! content, I observed that they might be able to be used in an answer for what to do when you wanted a map key to be case-sensitive. But same-cased ISSUE! and strings haven't compared equal even in Rebol2:

rebol2>> "abc" = #abc
== false

This means you'd have to turn your string into a token before looking it up in the map. But since tokens are immutable, that would force your string to become read-only in that process.

It's something predicates might solve, by letting you use a different predicate for lookup than common equality. Yet the whole premise of map lookups is that they are based on a hash function that reliably produces the same result...so if a map was hashed with one notion of lookup, you can't just pick an arbitrary function...

Anyway, there are things to think through and opportunities that this might create. I'll be interested to see. But already, it's a win.

hostilefork · October 12, 2020, 6:07pm

I thought I'd point out a cool example from the console source...where C and Ren-C are getting mixed.

There is a part where WORD!s are being used as virtual keycodes, and then there is C code that wants to react to those codes. To make it clear/easy/efficient, what happens is that a Rebol SWITCH translates the WORD!s into `char` values that can be used in a C switch statement.

        uint32_t ch = rebUnboxChar(
            "to char! switch", rebQ(e), "[",
                "'escape ['E]",

                "'up ['U]",
                "'down ['D]",
                "'ctrl-b",  // Backward One Character (bash)
                    "'left ['L]",
                "'ctrl-f",  // Forward One Character (bash)
                    "'right ['R]",
                ...

      switch (ch) {
          case 0:  // Ignored (e.g. unknown Ctrl-XXX)
            break;

          case 'E':  // ESCAPE
            Term_Abandon_Pending_Events(t);
            line = rebBlank();
            break;

Notice that previously, the branches in the Rebol SWITCH didn't use CHAR!. They used WORD!s, because escaping the chars inside a C string would be too annoying (e.g. "'escape [#\"E\"]"). So the WORD!s were translated to chars outside the SWITCH with TO CHAR!.

Now using the ISSUE!s directly in the SWITCH is painless, and there's no need for the conversion:

    uint32_t ch = rebUnboxChar(
        "switch", rebQ(e), "[",
            "'escape [#E]",

            "'up [#U]",
            "'down [#D]",
            "'ctrl-b",  // Backward One Character (bash)
                "'left [#L]",
            "'ctrl-f",  // Forward One Character (bash)
                "'right [#R]",
            ...

It's a cool and sneaky way to push through that cross-language barrier. Just a reminder that these design points are "deep thoughts"...not change for the sake of change!

hostilefork · September 20, 2024, 7:30am

So I never renamed this type to TOKEN! because I've been a chicken.

Part of what makes me hesitant is the idea of people working on codebases where they are considering WORD! to be a token in their dialect, and so they would say token: 'foo.

Calling it UTF-8! might seem good, but all strings are in the category ANY-UTF8? (Should that be "ANY-UTF-8?", with the hyphen?)

And ANY-UTF8? as a category does not imply immutability. But these are immutable.

So...CHARS! or CHARACTERS! or CODEPOINTS!?

The term "RUNE" is getting some traction for meaning "codepoint" in Go and C# (excluding some illegal ranges like surrogate pairs).

We Could Call It `hashtag!`... But...

When we say:

>> second "abc"
== #b

Do we really want to say that "single-codepoint hashtags are our currency for characters"?

I don't think so.

Maybe Just Go With My Gut... Like With "Keyword"

TOKEN! is what has been on my mind since the beginning of the unification of ISSUE!+CHAR!, and it hasn't gone away.

What we name our fundamental types isn't the end-all-be-all of how a word needs to be used.