TEXT! vs. STRING!

hostilefork · May 10, 2018, 11:08am

With the libRebol API, there is the question of what to call the function that makes a string, and its shortcuts. So far:

Since UTF-8 is the default, I thought that rebString() would take UTF-8 data as a char*, and look for a '\0' byte to automatically decide how long it was.
Then rebSizedString() would take a char* and a size parameter, counting in bytes.
The UCS-2 variations I feel are probably best named with W because they are really likely to be used by Windows API programmers, where W is the settled-upon convention for the wide-character APIs. rebStringW() is more succinct than rebStringUCS2(). Since wchar_t is different sizes on other platforms, then if people want to read it as "the W is for Windows, vs. Wide" that is fine...because it really is UCS2, not wchar_t.
Shorthands for making auto-releasing values would be like rebS(), to easily create a temporary STRING! for things like rebRun("print", rebS(cstring_utf8), END); and then not have to rebRelease it later, because the rebRun does so as it passes by.

All very nice, except one thing. I believe it would be better to call it TEXT! Why rock the boat? A few reasons:

Category as member of Category is bad

I've never liked the idea of naming categories and members of the category the same thing. Saying "STRING! is an ANY-STRING!" has a loop in it, just as "BLOCK! is an ANY-BLOCK!". Categories are supposed to be ways of talking about common properties, and when you name the member of the set the same as the set...you're sort of prohibiting anything being in the set.

Sentences that sound natural to me are things like "Unlike other languages, Rebol has several string types... TEXT!, TAG!, URL!..." Saying it has several string types, but then saying STRING! is one of them is awkward. Kind of like saying "FUNCTION is a function! that generates a function!"...you can only do this so many times before people think you don't know how to taxonomize anything.

Need a name that spans WORD!, TAG!, SET-WORD!, ISSUE!...

There has been great consternation over questions like "should ISSUE! be an ANY-STRING! or an ANY-WORD!". It changed from Rebol2 to R3-Alpha and chaos ensued. Fights over it raised a question in my mind: why aren't they all unified into one category?

Certainly they can have different behaviors in the evaluator. But what is it--really--that a REFINEMENT! has in common with a WORD! that it doesn't have with a TAG!? It can optionally carry a binding, like a WORD! can. But it just evaluates to itself, like a TAG! does. You can't append to a WORD!, but if it were unbound, might you want to be able to? You can't append to a LOCK'd or PROTECT'd TAG!, either...so its not like every string is mutable in the first place.

So I have a hope to figure out a grand unified theory that will pull all their implementations together. It would speed comparisons (all using UTF-8). It would cut down on unnecessary conversions (why can't I copy/part skip 'some-word 5 4?). We've seen already that some people wanted issues to carry bindings even though they weren't evaluator-active...why not let a tag do it too?

But when they're all in the same category, what would they be called? I think STRING is a great name for the superclass. You can say "WORD!, TAG!, ISSUE!, TEXT!...these are some of Rebol's many string types...in the category ANY-STRING!"

TEXT! is more familiar for non-CS people

Certainly there's just sort of the question of going against the status quo. If all the other programming languages in the world say "foo bar" is a "string", does one want to be the only language that seems to not do that (besides VARCHAR() in SQL, or other esoterics)?

But Rebol is different in thinking of string as a category of different parts. And that is a central aspect which should be reinforced wherever possible. If early on someone is exposed to the idea that string is a category and not a concrete type, they might quickly become more adventurous in thinking about how to apply the different concrete types. I think it's a good teachable moment.

Plus: to non-programmers, string is a ball of yarn. Float is something you do in an inner-tube. Carl's initial bias in Rebol led him to err on the side of common language, and chose DECIMAL! for floating point numbers...even if this would rub computer science people the wrong way. Because other languages, types named "decimal" tend to be when you want a fixed precision model of some kind. It's specifically used to mean not floating-point.

Red pushed back on that and called the numeric type FLOAT!. So they're willing to rock the boat, but in the other direction...biasing to the CS term. I haven't really come to a conclusion myself about decimal/float, I see both sides. When pressured over it in the past, Carl was firm that he liked DECIMAL! and wasn't changing it.

But in this case, there's not as much CS culture controversy. TEXT! isn't wanted for other things, it's generally just a synonym for strings. Perhaps it would compete in a GUI as a shorthand for TEXTEDIT! or something, if it used the same type namespace.

It wouldn't break that much

Just like with the ACTION! renaming, you could say string!: text! | string?: :text? and be nearly unaffected. The name STRING! wouldn't be taken for anything else, just as FUNCTION! isn't. It would only be taken as the category name ANY-STRING!.

Pushing on the switch evaluative mechanics would make it fine to switch type of x [string! [...] just the same as switch type of x [text! [...]

So the biggest impact is probably on the C code for the interpreter itself. And as mentioned--the API.

In any case, there isn't really a particular rush, but I am having to think these things out. Because I was going to use rebT as a shorthand for making temporary variables, when I remembered this change...which means that I'd want it for splicing text in.

So something to be aware of, and if anyone wants to protest it then they may.

hostilefork · May 15, 2018, 6:26am

As with the action change, I think this is change is turning out to be quite good, and makes me wonder why I waited so long. There's not much downside, and the benefits show up quickly.

Consider making a parameter previously for [string! tag! file!]. What would you name it? It seems deceptive to call it string [string! tag! file!]. A reader who got down to the middle of the routine and saw something called string but found out it was a TAG! might be confused. So you wind up either going more generic/meaningless with a name like value, or having to be more correct with something like any-string. Which is a clunky name for a variable.

But under this revised model, string [text! tag! file!] seems just fine, doesn't it?

Also:

TEXT is shorter by 2 characters than STRING. It gets you double savings on text [text!] vs. string [string!], and possibly more in the descriptions.
"foo will be converted to string" is ineloquent, so one typically writes "foo will be converted to A string" Yet "foo will be converted to text" sounds fairly natural.

Not that big a deal, but it helps when you're trying to keep code brief.

And a very significant change that will be invisible to most, is what it does to help with grokking the C code for the interpreter. When you have a lot of routines like Make_XXX_String() which operate on the underlying series, but then be used to initialize a TAG! or FILE!. It's much better when you don't have that conflict--similar problems were addressed early on with "BLK" in the source (where it applied to what could be a BLOCK! or GROUP! or PATH!).

Seeing the change live-and-in-person, it's a good thing.

hostilefork · February 17, 2024, 7:50am

I certainly do prefer Ren-C's choices here...

Rebol2 / Red:

 ANY-BLOCK!:
    [block! paren! path! lit-path! set-path! get-path! hash!]

 ANY-STRING!:
    [string! file! url! tag! email! ref!]

Ren-C

ANY-ARRAY?:
    [block! group! (set-block!, set-group!, etc.)]

ANY-SEQUENCE?:
    [path! tuple! (set-path, set-tuple!, etc.)]

ANY-STRING?:
    [text! file! url! tag! email!]

But I just noticed that for whatever reason I never got around to changing ANY-WORD! to ANY-SYMBOL!, and I didn't remember the justification for not doing that.

Then I remembered that I'd considered there being a SYMBOL! type, which would be distinct from WORD! because it was only available in the non-sigil form, and could not be used in sequences. This would be things like / or :: or -> or whatever. You'd have to assign them like:

 [::]: func [...] [...]
 ; or
 set inside [] ':: func [...] [...]
 ; or
 set $ ':: func [...] [...]

I wavered and instead decided to try a generalized way of letting arbitrary words be legal, e.g. with spaces and such, but I think this is a dead end (and I'll write it up). So that puts the SYMBOL! proposal back on the table.

(What it would actually probably be would be that ANY-SYMBOL? would be the class that included all words as well as this weird outlier, and it would be named something else...GLYPH! or RUNE! or something...)

Something about ANY-WORD isn't as bothersome anyway, because all members of the class have "word" in their name. I don't hate it the way I hated ANY-BLOCK! describing GROUP! and BLOCK! (in fact, ANY-BLOCK! now describes BLOCK!, SET-BLOCK!, GET-BLOCK!, etc. so there's another instance of "it's okay to say ANY-XXX! if all members of the class are XXX")

This category exists, and it's now called (perhaps obviously) ANY-UTF8.

But ANY-STRING specifically covers only series types which can be mutable and have an index. ISSUE! and URL! and EMAIL! do not, and this immutability avoids problems like:

 >> x: http://hostilefork.com

 >> type of x
 == &[url]

 >> reverse x
 == moc.krofelitsoh//:ptth

 >> type of x
 == &[url]

 >> clear x
 ==

So series operations aren't available on them, but you can pass them anywhere that ANY-UTF8 is accepted.