Case Insensitivity vs. Case-Preservation (can't have both?)

hostilefork · December 23, 2020, 9:25am

Being case-insensitive for binding and equality comparisons makes Rebol/Red pretty unusual. It is certainly more costly in the implementation and harder to write. And it opens a huge can of worms to try and define what case insensitivity means at a system level...especially with UTF-8.

When I think about it, I've almost never used it (on purpose), and all it does is get in the way. (Admittedly I did for a time try typing FAIL [...] in all caps to draw attention to it even though it was defined as fail...however I found this ugly, and stopped doing it.)

Pretty much all of the reasons I cite in "Making the Case for Caselessness" could be covered just by a best practices document saying "don't use mixed-case names in your code, even though that seems to open up more unique space for identifiers". And you're done.

A Current Example of Caselessness "Getting In the Way"

I was trying to do some cleverness based on the idea of not redundantly storing the string in bound WORD!s...but just letting the name come out of the object it was bound to. This could reclaim one pointer per cell in bound words and would be extremely useful for something I am doing. (It's actually more than useful, it's probably critical to get that slot.)

But it had this Bad Effect (tm):

 >> obj: make object! [Some-Name: 10]

 >> block: [some-name, some-other-name]

 >> bind block obj
 >> block
 == [Some-Name, some-other-name]  ; ack, where'd my case go?!

That's obviously an unacceptable consequence. But when you think about it, you're not really far from having a similar "lossiness" just about anywhere in the system. It's super easy to look things up in tables and go "oh, it's there" and then fetch it back with it actually being different than what you put in.
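The same lossiness is easy to reproduce outside of Rebol. Here is a minimal Python sketch, assuming a hypothetical `CaselessTable` that folds keys to lowercase for lookup but remembers the first spelling it was given. It only illustrates the effect and is not how Ren-C stores words:

```python
class CaselessTable:
    """Hypothetical case-insensitive table that keeps the first spelling stored."""

    def __init__(self):
        self._entries = {}  # lowercased key -> (original spelling, value)

    def put(self, key, value):
        folded = key.lower()
        # Keep whichever spelling arrived first; only the value is updated.
        spelling = self._entries.get(folded, (key, None))[0]
        self._entries[folded] = (spelling, value)

    def lookup(self, key):
        # Returns the *stored* spelling, which may differ from what was asked for.
        return self._entries[key.lower()]


table = CaselessTable()
table.put("Some-Name", 10)
spelling, value = table.lookup("some-name")
print(spelling, value)  # Some-Name 10
```

The caller asked for `some-name` and got `Some-Name` back, which is the same surprise as in the BIND example.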

A Losing Battle?

It's tempting to just punt on the whole thing and change the rules of the system to strip out all of the case-insensitivity. (I recall on a wiki about map case sensitivity, DocKimbel had said that if it weren't for historical compatibility with Rebol2 practices, he'd have probably wanted to make Red case-sensitive.)

So I wonder:

Is the case-insensitivity battle the right battle to be choosing, when it's different from basically all other current languages?
Is it even a winnable battle, if chosen?

It Doesn't Technically Have To Be All-Or-Nothing

The system today follows a fair number of hybridized rules. We could just bend it a bit and say that the BIND operations are based on == (is) equality, and that = still is case-insensitive.

However: this would lead to a world where o: make object! [A: 10, a: 20] is an object with two distinct fields. So you'd presumably want o/A and o/a to get those different fields. Hence some of these decisions seem tied together. Although we could say that / is case-insensitive for historical Rebol compatibility, while the . operator is case-sensitive, which might be a nice compromise...?

(It's worth noting that JavaScript+JSON are case-sensitive to both field names and data, so being able to have unique cased fields is a compatibility aspect with JavaScript objects. Though they can have spaces in key names too, so there's multiple issues to fret over.)

That doesn't change the fact that in such a world--as always--you want a case sensitive and case-insensitive comparison. It's just a question of ('A is 'a) vs. ('A = 'a). And you're saying that pathing and binding lookup uses whichever version of that is case-sensitive.

Does Anyone Object to Case-Sensitive for Just Binding?

It's independent of what = thinks about case, that can be decided separately.

Best-practices could still recommend avoiding the use of mixed case or all-caps in object keys...saving it only for situations where one is trying to achieve compatibility with some external demand. So most people wouldn't notice a difference.

We might try making . accesses case sensitive and leaving / access case-insensitive (if that's seen as necessary? Is it? No reason to do it if no one actually thinks this is important...and they can focus on more interesting distinctions.)

iArnold · December 23, 2020, 11:04am

If however the word however is not at the start of your sentence, it is not written with a capital letter 'H' at the beginning, however it is the same word however as the other however that is however spelled with all lower case letters.

I consider the example A: 10 a: 20 to be a "case" of bad practice programming, using the same word (by coincidence this time a word only of length one).

One of the charms of Rebol is its case insensitivity where even PeopleThatWantToKeepCamels are welcome and treated the same as peoplethatwanttokeepcamels (who don't like hitting shift keys the whole time).

Sure, this is the base behaviour; case sensitivity is needed in some places, and the language must be able to handle such cases.

So on case basis Rebolers can be very reasonable, but if Rebol starts to look like Java or ... then the cause of the rebellion is lost forever :-/

Let's hear some more opinions.

hostilefork · December 23, 2020, 12:44pm

One of the first examples that I hit in trying out the change (besides residuals from my FAIL experiment) is that modules have been named with capital letters, e.g. system/modules/Event

I should point out that while we might tend to dislike mixed casing, if we allowed such things to be distinct it does open up pretty important space...and module names for top level scope are actually a pretty good example.

Event: import %some-event-module.reb

; you can refer to things like `Event.xxx` and still have local variables
; named `event`.

This may be an important direction, along with my suggestion that we might be able to couple ACTION! and OBJECT! in a way such that something like math [...] or math/ref1/ref2 [...] could invoke a function with refinements...while math.some-constant could be a field and math.some-other-function/ref1/ref2 arg1 arg2 could invoke another function.

Capitalizing datatypes is another concept that is popular in some languages, which might read better to some people's tastes:

make Object [x: 10, y: 20]  ; you have to hit SHIFT to get the "O"

make object! [x: 10, y: 20]  ; ...but you have to hit SHIFT to get the !

Anyway...having an open mind about casing may be necessary to be competitive in the limited space of words. There are only so many. :-/

Note: I'll also point out that if I really just wanted all-caps FAIL to act like fail, I could just say that specific thing...

FAIL: :fail

Important to remember that's available.

UPDATE: In a little less than 4 hours I was able to make the change and get a booting system, implementing the rule that PATH! access and SELECT+FIND are still caseless by default when looking for keys. This means that if there are multiple cases of the same word, they just return the first.

The biggest source of trouble in getting it to boot was the headers, because it's typical to write Rebol [Title: {My Module}] and not Rebol [title: {My Module}]. The problem was that the default header object was defined with lowercase keys. So when you make default-header block it had things like the default title: {Untitled} and then the Title: {My Module} came later...meaning it was the untitled key that was found.

I changed the default object to use capitalized terms, but it shows the kind of issue that would come up. Sometimes case-sensitivity sucks, sometimes case-insensitivity sucks...but for binding (and hence object keys), case-sensitivity provides more flexibility. And there's a value to getting everyone on the same page for what case to use in their headers.
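The header problem can be modeled in a few lines. This is a rough Python sketch, assuming the object's fields are an ordered list of pairs and that caseless selection returns the first match; `select_caseless` and `select_case` are made-up names for illustration, not Ren-C functions:

```python
def select_caseless(pairs, name):
    """Return the value of the first key that matches `name`, ignoring case."""
    for key, value in pairs:
        if key.lower() == name.lower():
            return value
    return None


def select_case(pairs, name):
    """Return the value of the key spelled exactly like `name`."""
    for key, value in pairs:
        if key == name:
            return value
    return None


# The lowercase default comes first; the user's capitalized field is added later.
header = [("title", "Untitled"), ("Title", "My Module")]

print(select_caseless(header, "Title"))  # Untitled
print(select_case(header, "Title"))      # My Module
```

With distinct keys and first-match caseless lookup, the default shadows what the user wrote, which is why the defaults had to be re-cased to match.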

IngoHohmann · December 24, 2020, 12:02am

I'm not sure how deeply ingrained case insensitivity really is, but I myself always use the same case for all words meaning the same thing, and would only use a different case if I meant something different.
And, though Rebol should never look like JavaScript, interoperability with JSON seems really important.

hostilefork · December 24, 2020, 5:48am

To reiterate an earlier point: There's something inconsistent about saying case-preservation is important, but then systemically not heeding case. When an optimization caused a historical case-preservation to lose it, I called this bad:

>> obj: make object! [Some-Name: 10]

>> block: [some-name, some-other-name]

>> bind block obj
>> block
== [Some-Name, some-other-name]  ; ack, where'd my case go?!

I think it would be also bad if everything got lowercased automatically by the system:

>> block: [Some-Name, Some-Other-Name]
== [some-name, some-other-name]

SO...if we can agree both of those situations are bad...then why wouldn't we agree that this r3-alpha behavior is bad?

r3-alpha>> load/header "Rebol [Title: {My Title}]"
== [make object! [
        title: "My Title"  ; Hey, I said `Title:` !
        name: none
        type: none
        ; ...
    ]]

As is Rebol2's habit of going the other way:

rebol2>> print mold load/header {Rebol [title: {my title}]}
    make object! [
        Title: "my title"  ; This time I said `title:` !
        Date: none
        Name: none
        ; ...
    ]

It just goes to show you can't have it both ways. Case-Preservation and Case-Insensitivity are fundamentally at odds.

But this is unfortunate:

>> [code header]: load "Rebol [title: {my title}]"
>> header
== make object! [
     Title: "Untitled"  ; ... huh?
     Date: _
     Name: _
     ; ...
     title: "my title"  ; ... grrr.
 ]

If we tuned OBJECT! to use the same trick that MAP! does at the moment, it could error when you do a case-insensitive access:

>> select header 'title
** Object has different key cases for `title`, use SELECT/CASE

>> select/case header 'title
== "my title"

>> select/case header 'Title
== "Untitled"

Doing that test efficiently would require keeping track of whether object keys have synonyms; so each object expansion would need to re-check that and update some bits.
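As a sketch of what that bookkeeping might look like, here is a small Python model. The `Obj` class, its `add` and `select` methods, and the `case` flag are all hypothetical stand-ins for OBJECT!, object expansion, and SELECT/CASE; the real implementation would presumably use flag bits rather than a side table:

```python
class Obj:
    """Hypothetical object that tracks which keys differ only by case."""

    def __init__(self):
        self._fields = {}     # exact spelling -> value
        self._spellings = {}  # lowercased spelling -> list of exact spellings

    def add(self, key, value):
        # Each expansion updates the synonym table, so lookups stay cheap.
        if key not in self._fields:
            self._spellings.setdefault(key.lower(), []).append(key)
        self._fields[key] = value

    def select(self, key, case=False):
        if case:
            return self._fields.get(key)
        variants = self._spellings.get(key.lower(), [])
        if len(variants) > 1:
            raise KeyError(f"different key cases for `{key}`, use case=True")
        return self._fields[variants[0]] if variants else None


header = Obj()
header.add("Title", "Untitled")
header.add("title", "my title")

print(header.select("title", case=True))  # my title
print(header.select("Title", case=True))  # Untitled
```

Calling `header.select("title")` without `case=True` raises, while an object with only one spelling of a key still answers caseless lookups normally.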

FAIL was the only example I had of deliberately using case to "stand out". I found no others, and apparently I've always stuck to the capitalization in file headers.

The best way to maintain sanity might be to couple my "error if multiple cases exist by default" above with "force case to match by default." It could give intelligible errors:

>> header/date
** Object does not have `date` field, but has `Date`

But we can't let multiple cases break binding in a case-sensitive world, because then if you have multiple cases of the same word anywhere in the user context it would conflict.

>> o: make object! [Title: {Thing}]

; `Title` is now in the user context, because all words are bound into the
; user context *before* the code runs (and makes it a field in the object).
; If you are unclear on this point, re-read:
;
; https://forum.rebol.info/t/the-real-story-about-user-and-lib-contexts/764

>> title: "hello!"
>> title
** If this errors because there's already a `Title`, then that's bad

So binding would have to be one of the /CASE tolerant operations, which makes sense in this concept.

We have some interesting options for saying "I mean it", like doubling up slashes in the path:

>> header/title
** Object has different key cases for `title`, use SELECT/CASE
    ; ^-- more than likely, this generates an "uh oh" and people would then
    ; look and say "why is there more than one case"

>> header//title
== "my title"
   ; ^-- could be a nice syntax for getting things case-sensitively out of
   ; MAP! as well

And I've already mentioned there might be nuances between . and /, though I don't want any nuances that make me less likely to use . because I think it is going to be my preferred field selector. So above I'd probably want header.title to error on the ambiguity, because that's the safer behavior, and say header..title if I meant there's a Title: too and I'm aware of that fact.

Altogether, I'm just about convinced about case-sensitive binding. There are some epicycles to deal with, but it's rather telling that I got a working system so quickly. We know somewhere that case-insensitive comparisons have to be offered, but I think you don't want it anywhere that comes into conflict with the ability to do case-preservation, which means object keys have to preserve case...hence binding itself has to be case-sensitive.

johnk · December 25, 2020, 11:33pm

The use of UTF-8 seems a good reason alone for introducing more case sensitive behaviour. The rules around case insensitive comparison appear quite complicated and could cause more problems in the long run.
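For a concrete taste of how complicated those rules get, Python's built-in string methods show that simple lowercasing is not enough for caseless comparison once you leave ASCII. This is only an illustration of the Unicode rules, not of anything Rebol does:

```python
upper_eszett = "ß".upper()
print(upper_eszett)  # SS  (one character uppercases to two)

# Lowercasing both sides does not make these equal...
lower_equal = "Straße".lower() == "STRASSE".lower()
print(lower_equal)  # False

# ...but full Unicode case folding does.
folded_equal = "Straße".casefold() == "STRASSE".casefold()
print(folded_equal)  # True
```

So "case-insensitive" has to pick between at least two different algorithms, and they disagree about which identifiers are the same.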

hostilefork · December 26, 2020, 9:59am

...for some definition of "behavior".

As with the "CR LF" => "LF" policy...my pitch is to take "strong bets" on trends that are going to be guaranteed to still be relevant, and put the costs of edge cases on those few who demand them.

Be sure to read over this thread regarding unicode normalization:

johnk · December 29, 2020, 3:20am

Thanks. The Unicode normalisation thread is a good read. What a mess! The é example with different behaviours in apps and filesystems is quite an eye opener.
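The é case can be shown with Python's standard `unicodedata` module. This small sketch only demonstrates the two encodings of the same visible character; how a given app or filesystem treats them is a separate matter:

```python
import unicodedata

composed = "\u00e9"     # é as a single code point
decomposed = "e\u0301"  # e followed by a combining acute accent

same_raw = composed == decomposed
same_nfc = unicodedata.normalize("NFC", decomposed) == composed

print(same_raw)  # False
print(same_nfc)  # True
```

Two strings that render identically compare as different unless something normalizes them first.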

hostilefork · December 29, 2020, 12:43pm

Two more exceptions. :-/

Rebmu

I thought Rebmu would be unaffected, since the decoding of the mixed-case input just produces entirely lowercase tokens. There's no binding of anything uppercase involved...but once case-sensitive identifiers exist, you'd have a hard time referring to them, because they'd be broken up (MixedCase => m: ixed c ase).

Just prohibiting mixed-case identifier usage isn't any particular problem for this domain, though there could be some exception syntax (e.g. leading backslash, like \MixedCase meaning honor the case of the next word). It's not really a big deal either way.

CSCAPE

The templating language in CSCAPE uses a weird rule to decide if the result of a code insertion should be uppercased, lowercased, or left alone.

>> items: ["lowercase" "MixedCase" "UPPERCASE"]

>> cscape "Lowercasing $<second items>"
== "Lowercasing mixedcase"

>> cscape "Uppercasing $<SECOND ITEMS>"
== "Uppercasing MIXEDCASE"

>> cscape "Leaving $<Second Items> alone"
== "Leaving MixedCase alone"

While it's pretty weird, I think it's kind of clever, and it works with the domain. There could be of course alternate shorthands like $L<...> for lowercase and $U<...> for uppercase with the default leaving it alone. But it doesn't visually cue you quite as well when you're looking at what's being put together (frequently these are fragments of #define declarations in C or things like that, and it reads much better when the case of the splice cues you to what the result will look like).

Since it's starting from a string, you might say "well, then just do the detection of which it is...then convert the string to lowercase...and load it." However, that means you would also lowercase any embedded strings in the code.
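The casing rule, and the embedded-string problem, can be sketched in Python. The thread does not spell out CSCAPE's exact rule, so this assumes it looks only at the letters in the escape's source text; `splice_case` is a hypothetical name:

```python
def splice_case(expr_text, result):
    """Apply the casing of the escape's source text to the spliced result."""
    letters = [c for c in expr_text if c.isalpha()]
    if letters and all(c.islower() for c in letters):
        return result.lower()
    if letters and all(c.isupper() for c in letters):
        return result.upper()
    return result  # mixed case: leave the result alone


print(splice_case("second items", "MixedCase"))  # mixedcase
print(splice_case("SECOND ITEMS", "MixedCase"))  # MIXEDCASE
print(splice_case("Second Items", "MixedCase"))  # MixedCase

# Lowercasing the whole source before loading it also hits string literals:
source = 'UNSPACED ["T_" NAME]'
print(source.lower())  # unspaced ["t_" name]
```

The last line shows the catch: the `"T_"` prefix becomes `"t_"` even though only the identifiers were meant to change.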

The only thing this broke looked like:

cast(CFUNC*, ${"T_" Hookname T 'Class}),  /* generic */

Strangely enough, Hookname is an enfix function which pulls in the left hand side to build an unspaced full identifier name. You might think I could have just written:

cast(CFUNC*, T_${Hookname T 'Class}),  /* generic */

But as it turns out, if the class is NULL it wants the whole hookname to be nullptr (not T_Nullptr). Which is why I did it in this weird way.

That's the only case, and I fixed it by re-uppercasing the prefix. :-/ You're not really supposed to write an essay inside the CSCAPE escapes in the first place. It's being nice by letting you put a bit of code vs. just variables in the first place.

It's a strange application, and just converting the input to lowercase works. But I'm trying to inventory every place that I hit where case insensitivity was being leveraged somehow.

CR and LF

The character constants CR and LF were defined as uppercase. This is typical of their notation in ASCII tables.

With case-sensitive binding, they either need to be referred to as CR and LF ... redefined to be lowercase cr and lf ... or have synonyms.

Not sure how I feel about this one. I'm so used to seeing it capitalized that I feel you lose communication ability if you force it to lowercase. Having synonyms feels a bit wrong. I kind of would go with wanting these to just be uppercase. Anyone else have opinions?

rgchris · January 3, 2021, 6:51pm

I notice that Red is only case insensitive for the 26:26 unaccented characters.

>> make object! [café: "Coffee" cafÉ: "Scones" Café: "Tea"]
== make object! [
    café: "Tea"
    cafÉ: "Scones"
]

As is Rebol 2, but not Rebol 3 (or Ren-C).

In terms of my own case insensitivity, I use initial caps for headers (including Rebol []) as it has a formality to it but would balk at having to use said caps in accessing that information anywhere in the script system/script/header/title. The other place I use it in a mixed way is representing HTTP headers: header-proto: make object! [Content-Type: "text/html"]—there are benefits when it comes to forming headers in such a way, but again, would feel icky to access them that way in paths: header-proto/content-type

This may be a parochial opinion, but I'd be fine with the 26:26 compromise.
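A "26:26" fold is simple to state precisely. Here is a minimal Python sketch, where `ascii_fold` is a hypothetical helper that maps only A-Z to a-z and leaves every other character untouched:

```python
import string

# Translation table covering only the 26 unaccented uppercase letters.
_ASCII_FOLD = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def ascii_fold(word):
    """Fold case for A-Z only; accented letters keep their case."""
    return word.translate(_ASCII_FOLD)


print(ascii_fold("Café") == ascii_fold("café"))  # True
print(ascii_fold("cafÉ") == ascii_fold("café"))  # False
```

This reproduces the Red result above: `café` and `Café` collapse into one key, while `cafÉ` stays distinct.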

As a side note, was just futzing with some JS code that capitalized its camel-cased class names and did not for its derivatives.

class GreenThing {...}
greenThing = new GreenThing(...)

Whatever funkiness currently exists with associating binding with cased representation, it is not as bad as this.

hostilefork · January 3, 2021, 7:28pm

It's best if you phrase your preferences in terms of a list of tests with desired output (or definitely not-desired output).

I pointed out the problem of case preservation...where if you make an object which already had an opinion of case on its fields, then if your derived object uses different cases you seem to have these options:

(1) Consider the cases equivalent, and collapse the definition to use one of the cases.
(2) Consider the cases not equivalent, and end up with keys for both.
(3) Raise an error that you're trying to mix cases of the same word...forcing the deriver to canonize their names to whatever the base used.

But the way things are set up, #1 can really only easily collapse the definition to what the parent used. So if you go with this option, you lose what the derived case said.

Also, the idea of making mixed cases in an object illegal won't work with case-insensitive binding, because (for instance) the user context needs to allow you to have Foo and foo word instances bound into it. Which means #3 would have to be limited to only some class forming tools...as opposed to a rule for contexts in general.

This is why right now, we have #2... and hence multiple cases of keys.
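To make the three options concrete as tests, here is a rough Python sketch. The `derive` function and its policy names are invented for illustration; fields are modeled as an ordered dict keyed by exact spelling:

```python
def derive(base, extra, policy):
    """Derive fields from `base` plus `extra` (lists of (key, value) pairs).

    policy: "collapse" (option 1), "distinct" (option 2), or "error" (option 3).
    """
    fields = dict(base)  # exact spelling -> value, insertion-ordered
    for key, value in extra:
        match = next((k for k in fields if k.lower() == key.lower()), None)
        if match is None or match == key or policy == "distinct":
            fields[key] = value
        elif policy == "collapse":
            fields[match] = value  # the parent's spelling wins
        else:
            raise ValueError(f"`{key}` conflicts with existing `{match}`")
    return fields


base = [("FooBar", 10)]
print(derive(base, [("FOOBAR", 20)], "collapse"))  # {'FooBar': 20}
print(derive(base, [("FOOBAR", 20)], "distinct"))  # {'FooBar': 10, 'FOOBAR': 20}
```

Under "collapse" the derived object's `FOOBAR` spelling is lost, and under "error" the same call raises instead of choosing.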

Mark-hi · January 7, 2021, 2:16pm

I don't buy any of these arguments.
Also, case-sensitivity ruins HELP.

hostilefork · January 7, 2021, 2:24pm

One line is not a rebuttal worthy of heeding.

The most comprehensive analysis of why case insensitivity might make sense for a language (despite not being a practice in pretty much ANY language that people use today) is written by me. In that analysis I did not address the tenuous relationship between case-preservation and case-insensitivity. In this thread I do.

Getting enough bits available in a word cell to do virtual binding at any level of efficiency--without increasing the cell size--is important. There are complex mechanics which might make it possible other ways than not storing a spelling variation pointer...they'll all have some cost, but the biggest cost is just complexity.

If case-insensitivity--something no other language gets itself involved in at identifier-level, especially in the Unicode era--is so mind-bendingly important, it needs a strong and completely thought-out defense. Extraordinary claims require extraordinary evidence. Not "I have some idea stuck in my head from 20 years ago that seems it might be good in the abstract, but about 2 minutes to devote to defining it now".

Not any more or less than anything else. I'd argue the impact can be much less, as there is also at hand a list of alternate spellings of the same WORD! (formerly called "synonyms")--which could be acted on to say "did you mean..."

Anything is possible, but the points need to be committed to and analyzed. That means explaining and defending a position on the case preservation of keys which I've explicitly called out twice here.

Mark-hi · January 7, 2021, 5:04pm

Of course, one-line rebuttals are not worthy. I always intended to expand upon it.

Here are two significant points to start with:
(1) There are powerful, well-used, and significant computer languages that in fact are case-insensitive. Firstly, Pascal. Then, in no particular order, Fortran, Ada, Basic (most of them), and SQL. Some SQLs even go so far as to treat the data itself in a case-insensitive manner!!
(2) Case-insensitivity is NOT a language design issue. It is a human utility issue. Languages (and file systems!) that are case sensitive are plagued with hard-to-debug errors caused not by the language, but by how hard humans actually find it is to work within case-sensitivity constraints. In fact, I would venture to say that anybody who is comfortable working in a case-sensitive computing environment has spent YEARS bending their brain into that shape, so much so that they no longer even see it as a problem, and can construct (non-human) arguments as to how it is in fact better. In case you are wondering, I am such a person, though I am now trying to at least partially undo that error from my past.

Finally, here are two links that go into some detail (some of it not so relevant, sorry) as to why case-preserving case-insensitivity is important, including replies and rebuttals and demolishing strawman arguments. Please at least peruse them:
(1) OddThinking » The Case for Case-Preserving, Case-Insensitivity
(2) The USS Quad Damage (which is in response to the above link)
If you look carefully, you will even see a position on the case preservation of keys explained and defended, specifically, that if an object has key/value 'FooBar:7' then searching for key 'foobar' should match it and show the matching key/value pair as 'foobar:7'. I understand that this may be difficult to implement.

IngoHohmann · January 7, 2021, 10:28pm

I've always liked case sensitivity in file systems (disclosure: I'm a long time Linux user).

Maybe I can attribute this to German being my native language, where all nouns have to be written with a capital letter, and morgen and Morgen are actually 2 different words.

Morgen = morning
morgen = tomorrow
(Though this is the only example I have).

I guess this makes case sensitivity normal for me.

And why should I write the same thing in differing casing? I wouldn't exchange letters as well.

Mark-hi · January 8, 2021, 2:16pm

@IngoHohmann,

The reason you could find only one example is because it's not really an example, and in fact what you are trying to say is going on does not happen in any language, for obvious reasons when you think about it.

"Morgen" and "morgen" are the exact same word with the exact same (set of) meaning(s). When used as a noun it means "morning", and when used as an adverb it means "in the morning". Just like the Spanish "mañana", or the Afrikaans "môre", or in fact the English word "morrow", though it is an archaic usage now.

Ref: dictionary - Why are "tomorrow" and "morning" the same in German? - German Language Stack Exchange

As for why you would want to change case but still mean the same thing, why would anyone ever NOT want to do that?

BlackATTR · January 8, 2021, 5:52pm

I feel that if a programming language is case-insensitive, it becomes rather important to have a decent code editor to warn you of potential clashes.

IngoHohmann · January 9, 2021, 12:25am

... aNd when useD as a noUn it has TO BE writTEn with a capItAl m is all i'M saYIng.

BeCause IT doesn't maKe sEnsE.

hostilefork · January 13, 2021, 3:11pm

It's important to distinguish the question of whether people are given tools to make case-insensitive dialects easier to make, vs. whether the whole underlying language itself is case-insensitive.

When you make the language itself case-insensitive, you are saying that those who wish to use identifiers case-sensitively--to expand the space of names--cannot do so in the main language. I'd dismissed this as "not important" but the more I've thought about the harsh limit of words, the more it seems someone might need that space.

Anyway, I don't know that the cited Interweb posts make any arguments that really move the needle, and if anything seem more convincing that the language should be case-sensitive. I still think my writeup is far more compelling, if there is an argument for case-insensitivity as a best-choice to be made.

It seems that if anything, when you are using similarly-cased identifiers...you should be given an error, to standardize you on a canon form.

As a sidenote @gchiu had used uppercase to say REPLPAD-WRITE/HTML based on the argument that the HTML acronym is uppercased. (Why this spread to the REPLPAD-WRITE instead of replpad-write/HTML I don't know). But that is an instance where the desire for case-insensitivity seems plausible. But maybe it should be /HTML and force everyone to use the same canon all-uppercase spelling, instead of having everyone writing it differently. :shrug:

(But of course, as those who edit things like Apache configs would know, there's plenty of places where html and HTML are different in the computer world. So I don't think a knee-jerk "A ha! See, it's plausible! In a case" makes for a slam-dunk winning argument where all other angles must be dismissed.)

There are implementation tricks which would cost some performance and complexity to bring the case insensitivity back. But I want us to continue taking a good hard look at it, and so living in a case-sensitive Rebol-flavored world is a good way to find things, such as Graham's example, to include in a big picture explanation of why we pay that cost and what exact rules we need for it.

@rgchris's suggestion about only heeding the 26 ASCII characters in case is also something that we need to understand, e.g. if we don't use such an optimization, why we don't.

hostilefork · January 14, 2021, 2:28am

Canonizing the keys is the only strategy that seems sane to me, if multiple cases are not considered an error. Lowercase seems the only option people would accept.

 >> obj1: make object! [FooBar: 10]
 == make object! [foobar: 10]

 >> obj2: make obj1 [FOOBAR: 20]
 == make object! [foobar: 20]

But is this a property of WORD! and OBJECT! only, for binding? If you make a MAP! do the keys act differently? Differently for words, or for strings?

If we give strings bindings, we might say you could round-trip WORD! => TEXT! => WORD! without losing the binding. Then if MAP! threw away case for words, you might get around it with string case sensitivity.

The question of string case sensitivity is a different one from words, and needs separate consideration.

I think this is a bigger fight to pick than most people realize, and the energy it takes to do it right is non-trivial. Ren-C is a framework for implementing anything that can be articulated coherently...there's just a lot of questions about that coherence.

I can do a trick parallel to the quoting to make it possible to store up to 3 spelling variants at word reference sites before needing to do an allocation to make the word reference sites bigger. We can presume this would be rare. But I still think getting the experience here is important.