O noes, Unicode Normalization

UTF8-Everywhere has been running along relatively well...even without any optimization on ASCII strings.

But there's a next level of bugaboo to worry about, and that's unicode normalization. This is where certain codepoint sequences are considered to make the same "glyph"...e.g. there can be a single precomposed codepoint for an accented character, or a two-codepoint sequence that is the unaccented character followed by the codepoint for a combining accent.

Since there's more than one form for the codepoints, one can ask what form you canonize them to. You can either try to get them to the fewest codepoints (for the smallest file to transmit) or the most (to make it easier to process them as their decomposed parts).
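For a concrete sense of those two directions, here's a quick illustration using Python's standard unicodedata module (purely for demonstration, nothing Ren-C-specific):

import unicodedata

composed = "\u00e9"        # é as a single precomposed codepoint
decomposed = "e\u0301"     # e followed by U+0301 COMBINING ACUTE ACCENT

print(len(composed), len(decomposed))                          # 1 2
print(unicodedata.normalize("NFC", decomposed) == composed)    # True (fewest codepoints)
print(unicodedata.normalize("NFD", composed) == decomposed)    # True (decomposed parts)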

On the plus side... unicode provides a documented standard and instructions for how to do these normalizations. The Julia language has what seems to be a nicely factored implementation in C with few dependencies, called "utf8proc". It should not be hard to incorporate that.

On the minus side... pretty much everything about dealing with unicode normalization. :roll_eyes: There have been cases of bugs where filenames got normalized but then were not recognized as the same filename by the filesystem with the shuffled codepoints...despite looking the same visually. So you can wind up with data loss if you're not careful about where and when you do these normalizations.

This could get arbitrarily weird, especially with WORD! lookups. Consider the kinds of things that can happen historically:

>> o: make object! [LetTer: "A"]
>> find words-of o 'letter
== [LetTer]

The casing has to match, but also be preserved...and this means you could get a wacky casing you weren't expecting pretty easily. Now add unicode normalization (I haven't even mentioned case folding).

  • Do we preserve different un-normalized versions as distinct synonyms?
  • When you AS TEXT! convert a WORD! does the sequence of codepoints vary from one same-looking word to another?
  • Are conversions between TEXT! <=> WORD! guaranteed not to change the bytes?

It's pretty crazy stuff. I'm tempted to say this is another instance where making a strong bet could be a win. For instance: say that all TEXT! must use the minimal canon forms--and if your file isn't in that format, you must convert it or process it as a BINARY!.

Or there could also be an alternative type, maybe something like UTF8!...which supported the full codepoint range?

Anyway...this stuff is the kind of thing you can't opt-out of making a decision on. I'm gathering there should probably be a conservative mode that lets you avoid potential damage from emoji and wild characters altogether, and then these modes relax with certain settings based on what kind of data you're dealing with.

For what it's worth, I strongly feel that differently-encoded-but-canonically-equal forms should compare equal. I understand that you won't be able to tell by looking at it what bytes will be in its to-binary form (unlike case-insensitivity), but if that is important to you then you should explicitly canonicalise everything the way you want. Particularly because there is no a priori reason to prefer one canonical form over the other (which is why there are two canonical forms actually, kind of a misnomer really). It's true, file names are a bastard, but they already are, even with just ASCII.

Most people consider languages that don't treat them as equal to be buggy. Though we might argue the bug is in the "standard" allowing more than one way to encode the same character. :-/

My point is just that if there are differently-normalized forms to be floating around all the time (let's just focus on WORD!s, for instance) then at any moment in time you might run down a code path that gives you one of the variants. It would bring in a random element.

Coming up with an agnostic hash that compares all the various alternate spellings the same for WORD! and looks them up properly...while allowing the variations to exist...is a performance issue as well.
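For reference, the Unicode notion of a "canonical caseless match" gives a flavor of what that kind of agnostic comparison involves. A minimal sketch in Python (illustrative only; it is not how Ren-C's WORD! hashing would actually be implemented):

import unicodedata

def canonical_caseless(s):
    # Unicode "canonical caseless matching": NFD(casefold(NFD(s))).
    # The inner NFD decomposes, casefold lowercases aggressively, and the
    # outer NFD re-normalizes anything the case folding expanded.
    nfd = unicodedata.normalize("NFD", s)
    return unicodedata.normalize("NFD", nfd.casefold())

# These compare equal despite differing in case AND in codepoint form:
assert canonical_caseless("Caf\u00e9") == canonical_caseless("cafe\u0301")
assert canonical_caseless("LETTER") == canonical_caseless("LetTer")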

I don't mind there being modes in which you can decompose characters into parts...that's just an API for picking a character apart. But I don't think those decomposed characters should be used as a medium of exchange. Especially because you'll get deceptive-seeming answers for the "length".

Being more hard-line would say the system only tolerated the compact normalized form. This would not be done silently...it would simply error unless you explicitly used a normalizing codec, so you knew you were throwing something away (a similar approach has been applied to the ^M atrocity).

But there's a difference here with ^M, because there'd have to be a constant running check on any binaries you tried to insert into strings to make sure they only used the canon forms. And these checks would have to look at the edges to make sure you didn't insert something that would be valid on its own next to something that would combine it into a new character. Combining would need to be an intentional act, where you identified the character you wanted to compose or decompose.

Jeebus.

https://metacpan.org/pod/Encode::UTF8Mac

" On OSX, utf-8 encoding is used and it is NFD (Normalization Form canonical Decomposition) form. If you want to get NFC (Normalization Form canonical Composition) character you need to use Unicode::Normalize's NFC() .

However, OSX filesystem does not follow the exact specification."

Linux doesn't seem to normalize at all:

$ cat /proc/version
Linux version 5.4.0-42-generic (buildd@lgw01-amd64-038)
(gcc version 9.3.0 (Ubuntu 9.3.0-10ubuntu2))
#46-Ubuntu SMP Fri Jul 10 00:24:02 UTC 2020

$ echo $'\xc3\xa9.txt' $'e\xcc\x81.txt'
é.txt é.txt

$ echo "foo" > $'\xc3\xa9.txt'

$ echo "bar" > $'e\xcc\x81.txt'

$ ls
é.txt  é.txt

Windows Console rendering makes these files look unique in the directory (e` vs. é), and the new Windows Terminal renders them both as é but makes the two-codepoint version take up two characters' worth of space (so there's a margin on either side of it). :-/

GitHub allows you to add those files to a repo and push it:

https://github.com/hostilefork/unicode-pathology

Each link is treated uniquely by Chrome and Firefox, despite showing up the same in the URL bar (gets percent-encoded on copying and pasting, as here):

https://github.com/hostilefork/unicode-pathology/blob/master/e%CC%81.txt
https://github.com/hostilefork/unicode-pathology/blob/master/%C3%A9.txt

This is nuts. If you are writing a terminal program that gets one complete codepoint at a time and writes it down a pipe for output, such that you want to output as you go...you won't get this "combining" behavior unless you go back in time and edit what you've already output.

I can see there is a usability issue for people who want to enter combining characters as a sequence. But this should be a property of the editor you are using, not automatically done by the display layer.

When do you have people on one side of a connection with a dumb terminal that can't go back and "upgrade" the codepoint stream they are sending to the composite character...and yet expect the person on the other side to have a smart terminal that will see the composite codepoint they intended?!

Still leaning toward the answer of warning about non-canon data when you get it, and having a special override to say "I meant to do that". I'd liken this to the situation with ^M, where the system stops the badness at the endpoints vs. contaminating every parse rule with having to support more than just plain newline. If you are unfortunate enough to have to work with tainted data, then make that taintedness as visible as you can; don't render distinct codepoint streams identically. Stop the madness.


Putting this into a concrete philosophy: if I were in charge of Chrome's URL-bar display formatting, I'd make it so only the canon composed form should render as é, and the decomposed form should look like e%CC%81.txt. I'd also say the old and "broken" Windows Terminal visual which shows the discrete codepoints is superior to the "fixed" form.

I feel this is kind of like in Unix text editors like vi where you see the ^M everywhere... at least you know that garbage is there.

I'm of the same opinion that all "invisibles" should have a rendering, albeit a faint one...this helps me see differences between tabs and spaces, or whitespace at ends of lines. The whole concept of "make underlyingly different things look indistinguishable" is flawed at its core. Whenever you do this, you have introduced potential for exploits and bugs.

What a mess!

An argument for the decomposed form as canonical would be that it'd be a smaller number of concerns when applying a transformative function such as UPPERCASE.

This is a fairly crude observation, and I think I'd need another read over. On filenames/URLs, I presume supporting non-UTF-8 sequences may still be a concern, e.g. url://foo-%de%ca%fb%ad or %%de%ca%fb%ad.bin—in the spirit of "if the filesystem supports it" (I don't know which ones would; again, just a cursory observation—it's certainly a valid URL). Thus the only concern would be how they are FORMed.

The W3C officially suggests using NFC (which is the composed form):

The best way to ensure that these match is to use one particular Unicode
normalization form for all authored content. As we said above, the W3C
recommends NFC.

Actually it seems pretty much unenforceable to try and get you to use a particular normalization at runtime. Because you might grab a string sequence that is legitimate on its own (something like just three accent marks in a row)...but then paste it into where it would act in a combining way (like after a vowel).

So I'll stick to my claim that the real villain here is the automatic combination. Every display surface (terminal, GUI, etc.) should strive to show as many codepoints as are there. If your editor wants to make it so you type e-then-accent as two steps, and get an e-with-accent, then it's the editor's job to do that upgrade (or give you a keystroke to ask it to combine or decombine a character).

(I'm not sure how this would apply if there are some visual glyphs that can only be produced as combining sequences, and have no canon representation as a single codepoint (?))

Stopping you from loading or saving data with non-NFC sequences in it by default seems about the best you can do, and then have you say what you want done with it. But I don't think we want to entangle the runtime with things like normalization-agnostic-hashing for WORD!-lookup. Going with the W3C's suggestion we can just say NFC is the standard.

Looking back at this in more detail, it seems that not all decomposed sequences have composed forms, and the precomposed forms exist mostly for compatibility. e.g. precomposed codepoints were made for the combinations actually used in existing scripts, but all the other combinations are legal as well. It's even legal to put combining characters on newlines and control codes (?!).

And there's apparently no limit to the number of combining characters in the spec, with numbers like 3 in Greek and 10 in Tibetan being actual legitimate cases. :frowning: The only standard that seems to exist sets a limit of 30, arbitrarily!
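To see how far the stacking can go, here's a quick Python illustration (how it renders depends on your font and terminal). Note that NFC can only fold in the first accent, since no precomposed codepoint exists for the rest:

import unicodedata

zalgo = "e" + "\u0301" * 5        # an e with five combining acute accents
print(len(zalgo))                                   # 6 codepoints
print(len(unicodedata.normalize("NFC", zalgo)))     # 5 -- only the first accent composes into é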

This tool is useful for experimenting: https://onlineunicodetools.com/add-combining-characters

So What About That Non-Backtracking Terminal?

If a terminal can't go back and erase what it has already printed, it would need a different philosophy. For example: "disallow combining characters on newline, and don't output a character until you get the start of the next non-combining character, or a request for input." This is the only way it could build a buffer for each full combined character.

Hence if the following code ran on a "coherent" terminal:

write-stdout "a"  ; e.g. PRIN
wait 10
write-stdout "newline"
wait 10
write-stdout "b"
wait 10
data: ask text!

You'd see nothing for the first 10 seconds, and then the a would appear (the arrival of the newline is what tells the terminal the a is complete). Then nothing for the second 10 seconds, and then the newline after the a would appear once the b arrived. Then ten more seconds would pass with nothing happening. Then "b" would be output and you'd input text at the prompt.
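To make the policy concrete, here's a rough sketch of such a buffering writer, in Python rather than Ren-C (the class and method names are hypothetical; it just holds each character until the next non-combining character, or a request for input, arrives):

import sys
import unicodedata

class CoherentWriter:
    """Hold a base character until we know no combining character follows it.
    (The "no combining characters on a newline" rule is left out for brevity.)"""
    def __init__(self, out=sys.stdout):
        self.out = out
        self.pending = ""   # last base character plus any combiners seen so far

    def write(self, text):
        for ch in text:
            if unicodedata.combining(ch) != 0 and self.pending:
                self.pending += ch              # attach combiner to the pending cluster
            else:
                if self.pending:
                    self.out.write(self.pending)  # previous cluster is now known complete
                self.pending = ch
        self.out.flush()

    def request_input(self, prompt=""):
        if self.pending:                        # a request for input also flushes
            self.out.write(self.pending)
            self.pending = ""
        return input(prompt)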

But as I point out, this uses a rule that's not even a rule...that you can't put combining characters on a newline. Also, there are now "combining emoji" which people are expecting to work, which is not covered by formal specs.

Anyway I'd have to write some tests to see what existing terminals do with such situations when combining characters come after a delay vs. not.

Should "CHAR!" be multi-codepoint ("Grapheme Cluster"?)

Since not all valid combining characters coalesce into single known codepoints, that tips the scales somewhat toward using the decomposed normalized form internally (vs. working with sporadic compressions that only get created sometimes, in the cases where a character is common enough to have earned its own codepoint).

However, it would be unfortunate if the in-memory representation of strings wasn't something you could just generally write directly out to a file in the format expected by reading. And the W3C advocates for the compressed form...plus that's what many text editors are going to favor, as users aren't entering things via combining marks generally.

It also seems like from the user's point of view, "CHAR!" should likely be multi-codepoint, pulling together base characters and their combining characters. The foundation for this is already laid somewhat, since the "token/issue" basis of storage is done as an encoding instead of an integer...and that encoding can (if necessary) exceed the size of a cell.

This would mean length of string would not give you a codepoint count, but the number of "indivisible" character+combiner units. You'd have to say nfc of string or nfd of string to get the codepoints as arrays of integers.

I'm not sure what codepoint of char would do, since there'd be no single answer. It seems it would have to either default to the compressed form, or the intersection of where the compressed and decompressed forms are the same. But if it were the latter, you could never get the codepoint of an accented character without being more specific. (codepoint/nfc of ?) But maybe forcing you to be explicit and aware of when multiple codepoints can be returned is good... codepoint vs. codepoints giving back an array of integers.
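For comparison, this is roughly what counting "indivisible units" looks like using the third-party Python regex module's \X grapheme-cluster pattern (illustrative only; CHAR! would do its own bookkeeping):

import regex   # third-party module; pip install regex

s = "e\u0301x"                       # decomposed é followed by x
print(len(s))                        # 3 codepoints
print(len(regex.findall(r"\X", s)))  # 2 grapheme clusters: 'é' and 'x'
print([hex(ord(c)) for c in s])      # ['0x65', '0x301', '0x78']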

We don't need to have all the answers (nor is it likely to be possible to, this is a very human construction and has a fair amount of incoherence). But it would be nice if the common cases worked and kept you on track...warned you when you were drifting out of the "easy" zone, and gave you enough tools to handle the harder cases if you come across them.

Another wrinkle is how one constructs charsets to capture these forms.

While reading up on some of the details of Unicode normalization, I found a writeup of how URLs could be significantly changed by the process:

Unicode normalization could change the structure of a URL · Issue #626 · whatwg/url · GitHub

That's nuts.

To quickly refresh everyone's memory on our URL! datatype: we now think of it as being the decoded form...like what people would copy and paste out of a browser's URL bar. That could contain forms not legal to GET or POST to a web server. So the percent-encoding is something one has to choose to do with a string conversion to or from a URL! if applicable:

Treat URL! as normal strings with no encoding behavior by hostilefork · Pull Request #655 · metaeducation/ren-c · GitHub

However the GitHub issue I link up top points out that normalization is another nuance that is a place for bugs / hacking / etc. This jumped out to me as a very simple--yet insidious--thing:

NFD and NFKD will normalize the following three characters to generate potentially forbidden code points:

  • \u2260 (≠ as one code point) to =\u0338 (≠ as two code points)
  • \u226E (≮ as one code point) to <\u0338 (≮ as two code points)
  • \u226F (≯ as one code point) to >\u0338 (≯ as two code points)

So you aren't supposed to have < or > or = in the URL, but they can pop up as part of the unicode normalization process if you decompose things.
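For what it's worth, this is easy to verify with Python's unicodedata:

import unicodedata

print(unicodedata.normalize("NFD", "\u2260") == "=\u0338")   # True: ≠ becomes '=' + combining slash
print(unicodedata.normalize("NFD", "\u226e") == "<\u0338")   # True: ≮ becomes '<' + combining slash
print(unicodedata.normalize("NFC", "<\u0338"))               # composes right back to ≮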

I've been pushing against automatic transformations as a broken concept--even when it was just carriage returns and line feeds.

Here I have the same instincts creeping in. So what do we know?

  • We should not be building a system that normalizes or denormalizes anything without being asked.

    • We either accept only one format as input, or accept more formats and act neutrally.

    • I am a fan of strong stances over neutral ones, in order to guide the ecosystem to a better place, even if it makes interoperability with things outside the ecosystem require more effort

      • The "automatic" things that permit "interoperability" are a false economy

      • You pay with confusion, bugs, and security holes

  • The bias of stored formats on disks and filesystems is NFC

    • As canonized forms go it is not just the most compact, but it is also the most typical way to find codepoints encoded in practice.

    • It's what the W3C suggests be used for transfer over the Internet, regardless of the internal forms a program might use

    • Chrome is the leading browser by miles, and for é it canonizes both the one-codepoint (NFC) and separate e-and-accent form (NFD) to a network request for e.g. http://example.com/%C3%A9.txt. Other browsers like Firefox don't canonize, hence the two-codepoint form shows as http://example.com/e%CC%81.txt, with the naked e followed by the accent codepoint's two UTF-8 bytes percent-encoded.

      • Chrome's outsized influence suggests where the wind is blowing.

  • When you think about streaming, e.g. bytes flowing one by one from your program to something like the Windows or Linux terminal to display, decomposed forms give rise to chaos

    • Let's say you are the terminal program, and over the pipe you get something like the bytes for a prompt, like > followed by >.

      • Should you display those characters on the screen? What if a composable character comes afterward, to transform it into ≯? (See the sketch just after this list.)

      • You might not have the power to cursor back and erase what you already printed, depending. But even if you can, that's a lot of complexity that has to be borne by every consumer of UTF-8 bytes
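As promised above, a tiny Python demonstration of why that prompt scenario is scary: a perfectly innocent > can be retroactively changed by the next codepoint that arrives in the stream:

import unicodedata

prompt = ">"
late_arrival = "\u0338"   # COMBINING LONG SOLIDUS OVERLAY shows up later in the stream
print(unicodedata.normalize("NFC", prompt + late_arrival))   # ≯ -- the > you already drew is now wrong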

This is making Latin1 seem like a saner choice, and making it seem like engaging with Unicode has actually been a mistake.

[image: the_history_of_unicode]

Is "NFC Everywhere" Possible?

We already enforce atomic manipulations to strings at the byte level so that you can only insert/remove byte sequences that leave the string as valid UTF-8.

The canonization process is standardized, and ostensibly we know what the combining characters are. Wouldn't be trivial to implement, but we could say all modifications must generate NFC.

>> string: "e"
== "e"

>> append string #{CC81}  ; the accent combining character in BINARY!
** Error: Modification to string would create non-canon form

That's different from today's answer, which lets you do it:

>> append string #{CC81}
== "é"

I'm not particularly afraid to go in this direction, but there are some glitches.

I gather that not all combining character sequences people might use canonize to a single codepoint. Some sequences that are meaningful to readers are only available as their decomposed forms, and normalization just ensures the combining marks end up in a canonical order.

This means that if you're in the position of the terminal implementer, then so long as you are dealing with a character that could be composed with something in non-canonized form you'd have to wait for a non-composable character (?) before you print anything... or be able to go back. It seems to me nonsensical that the combining characters come after the character you combine with. Maybe we should buck the trend with our own format that puts them first, at least in streaming cases. :slight_smile:
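That "same order" point can be seen with the classic example of a letter carrying both a dot-below and a dot-above mark; normalization sorts the marks by their combining class, whichever order they were typed in (Python again, just for illustration):

import unicodedata

a = "q\u0323\u0307"   # q + dot below + dot above
b = "q\u0307\u0323"   # q + dot above + dot below (marks reversed)
print(a == b)                                  # False
print(unicodedata.normalize("NFC", a) ==
      unicodedata.normalize("NFC", b))         # True -- both become q + U+0323 + U+0307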

Anyway, just wanted to write down some more thoughts about this heinous thing.

I can answer this question quite straightforwardly: please God, no!

Let’s start with the difficulties it creates for programmers. Making append string #{CC81} error would be a massive nightmare for any kind of string processing. Parsing and generation would become drastically more complicated as programs would now need to guard against an ‘invalid’ character ever being in the wrong place. My own Brassica (which I still hope to reimplement in Ren-C one of these days) would become near-impossible to write without crashing unexpectedly.

Then there’s the fact that normalising strings everywhere is conceptually the wrong thing to be doing: generally you want to keep strings in their original representation. ‘Normalisation’ may sound like a desirable thing, but in general it isn’t. It’s useful if you happen to have a need to test string equality, but pretty much nowhere else. And the consequences of over-normalising can be severe: normalisation can change strings in ways which aren’t necessarily obvious. (The URL problem you gave is just one example of this.)

Another point to consider is that there are four separate normalised forms. This should give you a clue that none of them is entirely sufficient on their own. Instead, which to pick depends on precisely what you want ‘string equality’ to mean in each specific case. Forcing one normalisation method would make it difficult if one ever needs to use any of the others.
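To illustrate why they're genuinely different beasts (Python, just to show the data): the compatibility forms will happily rewrite content that the canonical forms leave alone:

import unicodedata

s = "\ufb01le x\u00b2"                    # 'fi' ligature, superscript two
print(unicodedata.normalize("NFC", s))    # 'ﬁle x²' -- canonical form leaves them alone
print(unicodedata.normalize("NFKC", s))   # 'file x2' -- compatibility form rewrites them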

So, in sum: this is an extraordinarily bad idea, with almost no advantages. Don’t do it.

(Although, if you want, a ‘normalised string equality’ operator might not be a bad idea.)

I hope you agree that allowing append string #{CC} would be bad, and that today's behavior is good:

>> string: "e"
== "e"

>> append string #{CC}
** Script Error: String aliased as BINARY! can't become invalid UTF-8

If you want arbitrary bytes, you have to keep everything as BINARY!:

>> bytes: to binary! "e"
== #{65}

>> append bytes #{CC}
== #{65CC}

But TEXT! must be valid UTF-8 on every operation:

>> to text! bytes
** Script Error: invalid UTF-8 byte sequence found during decoding

I'm pleased with all of that.

Per my writing criticizing the robustness principle, the system would be mandating that input already be normalized and keeping things normalized at all times. This gives a saner foundation to the process.

I did suggest that perhaps there be a middle tier... where TEXT! enforces NFC as an additional constraint on top of UTF-8!, and UTF-8! does its enforcement on top of BINARY!.

Because forcing you to use BINARY! for all non-NFC would lose the advantages of the already-existing codepoint coherence. Seems like a waste.

But most of the system would use TEXT! as currency in canon form.

Because I see this as a direct analogy to the constraint of maintaining valid UTF-8, I don't see this as being oppressive. It keeps you sane.

The current state of things is what I find oppressive:

>> single: to text! #{C3A9}
== "é"

>> double: to text! #{65CC81}
== "é"

>> length of single
== 1

>> length of double
== 2

So paralleling the "UTF-8 Everywhere Manifesto", I'd say the "NFC Everywhere Manifesto" has the potential to make people's lives better and not worse.

Giving an oft-better answer for LENGTH OF makes yet another argument for why to pick NFC.

If you were assured that all the TEXT! in the system was in NFC, you would not have any troubles when you went searching for substrings, because the substrings would also be NFC.
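A quick demonstration of that substring point (Python for illustration):

import unicodedata

haystack = "cafe\u0301 menu"        # decomposed é in the text
needle = "caf\u00e9"                # precomposed é in the search term

print(needle in haystack)           # False -- same-looking text, no match
print(unicodedata.normalize("NFC", needle) in
      unicodedata.normalize("NFC", haystack))   # True once both are NFC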

Yes, I agree with this. In fact, I’ll go even further: appending arbitrary bytes to a string shouldn’t ever be allowed. You should be required to use an encoding to convert it to a string before appending.

(This is what Haskell does. You can append Text to Text, and ByteString to ByteString, but to append a ByteString to Text you need to first convert it specifying an encoding.)

‘Keeping things normalised at all times’ is a good general guideline. But it’s only a guideline — not the right thing to do in all cases!

I think the key question which needs to be asked here is, ‘why does one normalise’? That link I sent you in the other thread explains the key points, but it boils down to ‘making illegal states unrepresentable’. That is, normalisation is useful in situations where un-normalised data can become nonsensical or inconsistent.

But is non-normalised Unicode really an ‘illegal state’? I would strongly argue that it isn’t, in general. Perhaps it’s useful in specific circumstances, but precisely what needs to be normalised away differs between different cases.

In fact, even Unicode itself doesn’t prescribe a single normalised form for all text — it has four, and what is normalised in one is completely un-normalised in another! So, if you really want to go down the road of ‘mandating normalisation’, you’d need four separate string types, plus a fifth for the inevitable situations where you need to process user input without further normalisation (e.g. as I do for Brassica). I don’t think it’s even close to being worth the extra complication.

Haskell is great in terms of the overall "wow isn't this great" of Haskell, but, if you're building an oddity out of modeling clay then the rules change.

There's no semantic value in forcing people to convert something to a string prior to appending bytes if it's all happening at runtime. The type system isn't checking you, so disallowing legal sequences is just stopping people from doing what they want.

I feel this is one of the moments to remind you that I did not like Rebol when I saw it, it breaks rules.

But art also breaks rules. The Weeping Woman, by Picasso, may not be my favorite piece of art, but I like the rule breaking:

[image: Picasso, The Weeping Woman (Tate T05010)]

I have said before, that if Haskell didn't already have a following, I'd be all over it.

But it does already. It's like walking into a room full of people explaining why M.C. Escher is the greatest artist. And, maybe? But also maybe that's boring.

Anyway, yes. Haskell, or Idris, or whatever succeeds that. But I'm from C++ and still really respect C++, and, when I slip into the edges of madness of "why can't we paint code like this", I mess around with this stuff.

I kind of wish that a random standard for representing human written language hadn't gotten this far in terms of deployed systems; I'm just not sure what the best way to mitigate that is.


Oh, sure; I was just explaining how it does things, and why it does them that way.

Unicode is far from being a ‘random standard’. It’s complicated because human written language is complicated. It may seem simple if you restrict yourself to English, but once you start to look at Hebrew or Arabic or Tamil or Khmer or Pollard — or even some less well-known Latin-based writing systems — it gets a lot less so. Unicode may have its flaws, but most of its complexity is simply humans being humans.