What To Do About Horrible, Grievous, Unicode

At some point I picked up a test from Red, which is basically this:

str: " ^(A0) ^-a b  ^- c  ^(2000) "
assert ["a b  ^- c" = trim copy str]

That's some funny business.

R3-Alpha (and Ren-C) never had support for trimming these characters out. So the test fails.
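For reference, here's what a Unicode-aware trim does with that string. This is Python as a stand-in (not Ren-C): `str.strip()` removes anything `str.isspace()` considers whitespace, which includes both U+00A0 (what `^(A0)` denotes) and U+2000 (what `^(2000)` denotes), so it produces exactly the result the Red test expects:

```python
# The Red test string in Python notation: \u00a0 is NO-BREAK SPACE,
# \u2000 is EN QUAD, \t is what Rebol writes as ^- (tab).
s = " \u00a0 \ta b  \t c  \u2000 "

# Both "exotic" codepoints count as whitespace to a Unicode-aware trim:
assert "\u00a0".isspace() and "\u2000".isspace()

print(repr(s.strip()))  # 'a b  \t c' -- the result the test expects
```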

But it wound up flying under the radar, somehow.

(I think I didn't actually bring the file it was in into the test suite until some time after picking it up. And by the time I added it, a lot of things were breaking due to binding, and I was putting off fixing everything until it was time. Now it's that time--I'm going item by item, reviewing breakages and getting the tests in order.)

But when I got to this one, the log just said:

(   
    str: "   ^-a b  ^- c    "
    "a b  ^- c" = trim copy str
) "failed, test returned null"

Because I didn't go look up the test (I thought I had it right there), I didn't realize there was funny business--the display doesn't give you any indication. Neither does Red's, after the transcode:

red>> " ^(A0) ^-a b  ^- c  ^(2000) "
== "   ^-a b  ^- c    "

Even pasting it into VS Code (which I didn't, until just now) gives you terribly weak feedback that something weird is going on:

(image: VS Code screenshot of the pasted string)

Gee. Glad I had "show invisibles" turned on--that really did a lot for me there. :roll_eyes:

(Seriously, what is the point of that feature if that's what it's going to do?)

I Don't Want This Stuff In Source Files

We can't fix the world; it's going to keep doing what it's doing. This stuff is the currency of text, and you have to support it.

But we can set house rules. The default mode for Ren-C should only allow two invisible characters in source: space and newline. (And I'd like there to not be space at the end of lines.) This would be a hard rule for any script in official repositories.
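As a rough illustration of what enforcing that house rule could look like, here is a Python sketch (not an actual Ren-C scanner; the category test is my assumption about what counts as "invisible"--controls, format characters, and separators--with only space and newline whitelisted):

```python
import unicodedata

# House rule: the only invisible characters allowed in source are
# plain space and newline.
ALLOWED_INVISIBLES = {" ", "\n"}

def check_source(text):
    """Yield (line, column, codepoint) for each disallowed invisible."""
    line, col = 1, 1
    for ch in text:
        category = unicodedata.category(ch)
        # Cc = control, Cf = format, Z* = space/line/paragraph separators
        invisible = category in ("Cc", "Cf") or category.startswith("Z")
        if invisible and ch not in ALLOWED_INVISIBLES:
            yield (line, col, f"U+{ord(ch):04X}")
        if ch == "\n":
            line, col = line + 1, 1
        else:
            col += 1

# The test string flags its no-break space and its tab:
print(list(check_source('str: " \u00a0 \ta b"')))
# [(1, 8, 'U+00A0'), (1, 10, 'U+0009')]
```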

I'd have saved myself an hour of confused digging if there'd been an error when I pasted in the console, telling me I was dealing with a messed-up situation. There'd have to be some conscious shift into a mode to tolerate it... temporarily as some kind of way to import a string into the system.

Not Ready To Support This Test

There's a sparse bitset implementation that has been sitting on the shelf, and it's needed before we can create Unicode charsets for high codepoints.
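For what it's worth, a sparse bitset along the usual lines--fixed-size pages allocated only where bits are actually set--is a small amount of code. This is a generic Python sketch of the technique, not the shelved implementation:

```python
class SparseBitset:
    """Bitset over codepoints, storing 256-bit pages only where needed."""

    def __init__(self):
        self.pages = {}  # page index -> bytearray of 32 bytes (256 bits)

    def add(self, cp):
        page, bit = divmod(cp, 256)
        buf = self.pages.setdefault(page, bytearray(32))
        buf[bit >> 3] |= 1 << (bit & 7)

    def __contains__(self, cp):
        page, bit = divmod(cp, 256)
        buf = self.pages.get(page)
        return bool(buf and buf[bit >> 3] & (1 << (bit & 7)))

# A charset of high-codepoint whitespace uses three pages, not a
# dense bitmap spanning the whole codepoint range:
ws = SparseBitset()
for cp in (0x00A0, 0x2000, 0x3000):
    ws.add(cp)
print(0x2000 in ws, 0x2001 in ws, len(ws.pages))  # True False 3
```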

Anyway, there are higher priorities. But I definitely do feel like there should be some alarms going off when you are reading files with disruptive codepoints. You should have to say "Yes, I want ugly codepoints" or "Yes, I want emoji".

A totally permissive TO TEXT! operator shouldn't be what people are reaching for. You should have to be explicit. (decode [@utf-8, whitespace: all, emoji: all] blob). Principle of least privilege... conservative defaults.
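A hedged sketch of what that kind of explicit, least-privilege decode could look like (Python stand-in, not a real Ren-C API; the parameter names and the use of category `So` as a crude emoji check are my inventions):

```python
import unicodedata

def decode_utf8(blob, *, whitespace="ascii", emoji="none"):
    """Decode with conservative defaults: plain space and newline only,
    no emoji.  Callers must opt in to anything more permissive."""
    text = blob.decode("utf-8")              # rejects invalid UTF-8 outright
    for ch in text:
        cat = unicodedata.category(ch)
        if whitespace != "all" and cat.startswith("Z") and ch != " ":
            raise ValueError(f"disallowed whitespace U+{ord(ch):04X}")
        if emoji != "all" and cat == "So":   # crude stand-in for an emoji check
            raise ValueError(f"disallowed symbol U+{ord(ch):04X}")
    return text

print(decode_utf8("a b".encode()))                         # a b
print(decode_utf8("a\u00a0b".encode(), whitespace="all"))  # opted in: NBSP passes
```

By default a stray U+00A0 raises an error; you get it through only by saying `whitespace="all"` out loud. (Newline passes the default check because its category is `Cc`, not a separator.)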

The names for the specializations should help guide behavior. (decode @utf8-unchecked blob). (Unnatural?)

  • "What does that mean, unchecked (looks up documentation)"
  • "Oh, I have an option to have it screen out weird whitespace? Wow! Great! "

Something like utf8-basic would make conservative choices--the same ones used by default for source code.

I think this is a sensible choice. Whitespace other than the usual ASCII space just causes confusion in source code.

This, on the other hand, worries me. Both non-breaking spaces and en quads have legitimate uses in text — they’re hardly ‘ugly’ at all. (There’s even more obscure ones with legitimate uses: e.g. text in French uses the narrow non-breaking space, U+202F.)

In general, it feels to me that you’ve been trying to manage the complexities of Unicode by ignoring them. (First by trying to normalise everything, now by trying to remove annoying characters entirely.) The trouble is, Unicode is complex because the problem domain of text is complex. Speaking as someone who spends a considerable amount of time dealing with multilingual text, I can say with confidence that trying to avoid the issues does nothing but make them more complex when you inevitably stumble upon them.

As for this test itself: some parts of Unicode are complex, but not this. The Unicode character database lists precisely 17 characters with general category Zs (Space_Separator), plus one line separator and one paragraph separator. That’s 19 characters in total. I see no reason why Ren-C couldn’t recognise them.
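That count is easy to verify mechanically. In Python (shown here as a neutral stand-in; the exact figure depends on the Unicode version the runtime ships, but 17 has been stable across many recent versions):

```python
import sys
import unicodedata
from collections import Counter

# Tally the general category of every assigned-or-not codepoint:
counts = Counter(unicodedata.category(chr(cp))
                 for cp in range(sys.maxunicode + 1))

# Zs = Space_Separator, Zl = Line_Separator, Zp = Paragraph_Separator
print(counts["Zs"], counts["Zl"], counts["Zp"])  # 17 1 1
```

Nineteen codepoints in total, exactly as stated--a fixed, enumerable set, not an open-ended problem.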

Then you agree there are domains in which it makes sense to be conservative.

The question is: how often are usages of usermode string values based on domains where conservatism makes sense?

Maybe I'm unusual--in that I work on an interpreter, so most "usermode" strings I work with wind up being source code at some point. If I worked on other kinds of programs that had to pass-through arbitrary freeform HTML or emoji-laden SMS messages all the time, I might have a different bias.

I have strong feelings about Freedom To vs. Freedom From. e.g. I'm the sort of person who is completely convinced that allowing spaces in filenames caused far more harm than good to the field of software.

Anyway, for right now my principal agenda is to stop losing hours on this :poop: by not letting it bite me every time it comes up.

I don't know what the exact tradeoffs to be made here are, or which you'd agree with. But I give the example of pasting things into the console. That can be a situation where you type string: -{ <<PASTE>> }- and I don't want you to get arbitrary invisibles, without some warning, when you do that. Perhaps you'd think that falls under the "it's okay to be restrictive there, because it's source code".

Anyway, it doesn't necessarily need to be hard to put the console into a mode where it allows such pastes (console @utf8-unchecked)...I just don't want it to be the default. And I want there to be an established vocabulary for talking about these domains.

Once that vocabulary exists, I want people to use it. I've mentioned that I'm tired of all the implicitness in the system, it just breaks every time a change comes along. How the 32-bit to 64-bit change was handled in R3-Alpha and Red is a perfect example of how people love to just kick the can down the road with more bad choices after having just seen the prior methodology fail.


A COBOL programmer, tired of all the extra work and chaos caused by the impending Y2K bug, decides to have himself cryogenically frozen for a year so he can skip all of it.

He gets himself frozen, and eventually is woken up when several scientists open his cryo-pod.

"Did I sleep through Y2K? Is it the year 2000?", he asks.

The scientists nervously look at each other. Finally, one of them says "Actually, it's the year 9999. We hear you know COBOL."

Yes, I think this is the unusual domain showing. In my kinds of programs, by contrast, I very regularly have to deal with multilingual text and ‘unusual’ (to English-speakers) Unicode characters. I’d say my situation is probably more representative of the more common use-cases.

What helps me in writing these programs? Basically, a language which doesn’t get in my way when I manipulate these strings, but only does exactly what I tell it to. Arguably Haskell takes this to an unusual extreme: I can’t even directly read a file to a string, I need to read it to a byte array and then parse that as UTF-8. But it does make my job immensely easier — I can port my program to Windows, Linux, and WebAssembly without having to worry that locales or OS conventions will mess everything up. That’s a win in my book.
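The same discipline translates directly to other languages. A minimal Python rendering of read-bytes-then-parse (the byte literals are illustrative):

```python
# Get raw bytes first, then decode explicitly -- never rely on a
# locale-dependent default encoding.
raw = "café".encode("utf-8")   # stands in for bytes read in binary mode
text = raw.decode("utf-8")     # explicit and portable
assert text == "café"

# Malformed input fails loudly instead of silently producing mojibake:
try:
    b"\xff caf\xe9".decode("utf-8")   # 0xFF is never valid in UTF-8
except UnicodeDecodeError:
    print("rejected")  # rejected
```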

I lean towards allowing arbitrary Unicode in strings — especially ones with such distinctive delimiters. The benefits of being able to copy-paste arbitrary text outweigh the possibility of that text having an unexpected character. (Remember, this test case was designed to be unusual… most text isn’t like this.)

Instead, on reflection, I tend to see this case as a problem with printing to the console. When Haskell prints strings for debugging purposes, it escapes all non-ASCII characters by default:

ghci> "unicöðe texŧ"
"unic\246\240e tex\359"

Why can’t Ren-C do the same? It would nicely solve the problem of copying from the console, while leaving the language itself unaltered.
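Python offers the same capability through the built-in ascii(), which produces a repr with everything outside ASCII escaped--one plausible model for a debug-oriented console display:

```python
s = "unic\u00f6\u00f0e tex\u0167"   # "unicöðe texŧ"
print(ascii(s))  # 'unic\xf6\xf0e tex\u0167'
```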