Realistically Migrating Rebol to "UTF8 Everywhere"


#1

History of Rebol Unicode

Rebol2 lacked unicode support, and one of the agenda items for Rebol3 was to support unicode. This was an era that predated emoji--where codepoints that would take up more than two bytes were rare. And as a cross-platform program that had to run on Windows (where Unicode support meant UCS-2 or UTF-16 or whatever people think Windows supports), the lowest denominator of 16-bit characters probably seemed fine to Carl.

The byte-sized string standard of Rebol2 had no particular prescription of what the bytes meant. As in the DOS days, the low 127 characters were assumed to be ASCII, with 128-255 being whatever was relevant in your country--or box drawing characters--or what-not. R3-Alpha adapted this by having two string formats:

  • byte-sized strings where basically the bytes encoded the first 255 codepoints of Unicode, which are ASCII plus the Latin-1 supplement

  • wide strings where the codepoints can be 0-65535, in an interpretation basically as UCS-2

Which kind of string you have was simply detected by looking at the element width of the series the TEXT! cell points at. If the series node says its elements are size 1, you have a byte-size string. If the series node says its elements are size 2, then it's a wide string.

Hence byte-size strings aren't UTF-8 encoded. You might have to re-encode a byte-size string to get UTF-8, if it contains any characters in the 128-255 range. (Clearly all wide strings must be re-encoded to get UTF-8.)

Inserting a character into a byte-sized string must thusly look at the width of the string and the codepoint, and potentially perform a "widening". The same goes for inserting any string into another--depending on how "optimal" one is trying to be, not all insertions of wide strings into byte-size strings need cause an expansion, only those which are inserting a substring that actually contains codepoints over 255.

On the subject of "trying to be optimal", there's a question of removing substrings. Once a wide string has removals, you could scan the remaining codepoints to decide if it could be compacted back to a byte-sized string again. But as it happens the code doesn't try to do this--once a string is widened, it stays wide.

Then there is an additional problem: codepoint use above 65535 has actually become commonplace, with emojis.

Is there a better way?

The kneejerk reaction to solve the lack of support for high codepoints might be to continue the widening process. That would mean 1, 2, and 4 byte string variants, with progressive widening as the need arises...and presumably little-if-any thinning. This is what Python did, and also what Red does (or at least was the plan):

Red's Plan For Unicode Support (2012)

However, I've felt uneasy about going down this road. If Rebol's goal is to eliminate unnecessary complexity and get real leverage, and UTF-8 support is a given anyway, what if UTF-8 were the only representation used?

I was leaning that way despite some technical hurdles. Then I was further emboldened by the UTF-8 Everywhere Manifesto, which articulated good arguments for why what I'd been feeling wasn't crazy. It's a good read, for anyone who hasn't read it:

The UTF-8 Everywhere Manifesto

If Rebol switched to using UTF-8 as the sole internal representation for strings, there would be tremendous advantages:

  • It would (obviously) give the missing support for high codepoints--which has been a sticking point for @IngoHohmann and others.

  • Storage would stay efficient because there'd be no "widening" of strings to undo.

  • PARSE could reuse the UTF-8 based scanner on any strings, so parse "<< 1020 {Vidlice je úžasné} >>" [thru "<<" set i integer! set s string! thru ">>"] could expose the scanner .

  • Unification of ANY-STRING! types and ANY-WORD! types would come even closer. Ren-C converted WORD!s from using their own strange static tables into UTF-8 based series, which gave them the ability to be GC'd...and now that would match what strings used.

  • Comparisons and conversions (to string! word | string = spelling-of word) would speed up across the board. With only one format, there's only one format to compare.

  • It's already the case that rebDo("print reverse", block, END) takes advantage of the assumption that C string literals are valid UTF-8, to distinguish them from other Rebol-headered-types and instructions. But even though the default ANY-STRING! REBVAL in the API containing "print reverse" would be spliced as a string literal in the values, with it being UTF-8 a modifier could allow rebDo(rebCode(strPrintReverse), block, END) to treat the STRING! as code just like it had come from a C literal.

  • Since UTF-8 support is a sunk cost already for scanning and conversions, it would definitely mean more code would be deleted than added.

If it's so great, why hasn't this been done already?

One of the main problems that the UTF-8 manifesto cites, Rebol doesn't actually have...which is quickly determining of the length of a string. The length in codepoints can be cached in the series node, and since STRING! values contain their offset in codepoints, a length calculation is simply a matter of subtraction (as it is today for an ANY-SERIES! at any position)

But picking out of strings would be relatively slow. If you want the 100th character of an ANY-STRING! today, as in i: 100 | string/:i, that merely does array access into the string data...a simple addition. With UTF-8 only, in the worst case with no optimization, each index access must scan from the head of the data for the string. (Due to UTF-8's design, it's actually relatively easy to skim forwards and backwards by character over an encoded sequence, but it's still work and has to read a lot of bytes along the way to its position).

This might not seem that much of a killer in Rebol, as people use the iterative next and back anyway...which we might imagine could cache the real byte offset of the character position in the value cell. Yet the real problem is that next and back can't presume safety in navigation of the byte stream if it's mutated out from under it. For a given ANY-STRING! there may be many potentially mutating values referencing into the same bytes, and if any one of them changed the bytes then it would mean that they could corrupt that cache...leading to incorrect or corrupt positions in the byte stream.

So the Achilles Heel of the idea is: even if we're fine letting string/:i be relatively inefficient, it would be intolerable if the following code were O(N^2) with respect to the length of the string:

 pos: string
 while [not tail? pos] [
     print first pos
     pos: next pos
 ]

But I think there's a way to finesse this. Basically, build in a mutation generation counter to each ANY-STRING! series. Then use the currently unused extra platform-pointer-sized chunk in mutable string values (which are at least 32-bits) to store some information about that stamp along with the byte offset of the character index when that stamp was taken.

So imagine if 16 bits were used for the mutation stamp, and 16 bits were used for the real byte offset. The real byte offset could be compressed as a delta from the character offset, because if character 1000 is as byte offset 2000 then storing the difference as 1000 is using fewer bits. There are also header bits which could be borrowed to further modify interpretation of the offset (bits to signal you should "multiply it by 2 and add 1", for example).

Now the problem of false-positives when the mutation counter rolls over. Well, basically you have to visit all the value cells that might have outdated mutation stamps and update the cell to say it has to re-seek. But that's basically visiting all the nodes a GC would visit, so you might as well GC while you're doing it. For that matter, each GC (or occasional GCs) could be going around and resetting these counters on strings with mutations anyway, making the need to trigger a synchronous GC on string mutation less likely.

This isn't foolproof, and there are going to be some pathological situations. I'm imagining some string algorithm that wants to insert data through one STRING! reference at one position, then write it through another STRING! reference at another position, and keep swapping back and forth. You'll wind up re-seeking every time, and these mutation checks will add overhead even to that.

Two thoughts:

  • It seems obviously useful to have a bit on strings that are known as having only ASCII codepoints, so that you can just do the byte-offset without worrying. The mutation stamp can remain unmodified until it transitions to containing at least one non-ASCII codepoint. It would be worth it to try and keep this bit fresh by any operation that incidentally has to go through all the bytes of a string for some other reason. Then those strings won't have this kind of problem.

  • If someone really does have a crazy string algorithm, Rebol's efficiency and cost curve is such that it's not that egregious to make a BLOCK! of CHAR! and do your algorithm on that. Not only does the uniform cell size of a block mean O(1) access to the elements, but the algorithm can make use of all kinds of tricks that might make it more efficient than bookkeeping a bunch of string pointers. (You could put BLANK!s in cells you plan to clear out at the end, for instance, without worrying about the problems of corrupting embedded nulls if you tried that on a raw ANY-STRING!)

I'd thought of the signature idea before and felt like it was probably workable, but it's these two last ideas that give me the final boost of "yes, this is the right plan". I'm anticipating wild success. :slight_smile:


#2

Cool! Can't wait to see how this unfolds in practice. Looks promising.


#3

I agree when you really do need to perform some extensive algorithm on your string, perhaps a string is not the correct structure for your data.

The benefits of UTF-8 support outway the downsides imho, and if in time other formats need to be supported, those can and will be added later.


#4

I wanted to write up some notes about progress on UTF-8 Everywhere, since a couple of weeks ago it was showing the first signs of life for codepoints which would previously cause errors:

And it's not just booting--it's able to do a reasonable amount. It could do this, too.

>> bincat: to-binary {C😺T}
== #{43F09F98BA54}

>> parse bincat [{C😺T}]
== true

>> parse bincat [{c😺t}]
== true

>> parse/case bincat [{c😺t}]
== false

>> test: to-binary {The C😺T Test}
== #{5468652043F09F98BA542054657374}

>> parse test [to {c😺t} copy x to space to end]
== true

>> x
== #{43F09F98BA54}

>> to-string x
== "C😺T"

As promising as that looks, I'd say it would be an optimistic to call it "half done". :-/ What needs to be done next is the optimizations avoiding needing to scan strings from the beginning every time you want to translate an index into a character position.

One of the main villains here is the GET_ANY_CHAR() and SET_ANY_CHAR(), which originated in R3-Alpha. They were intended to abstract across BINARY!, Latin1-STRING!, and wide character STRING!. So for instance, PARSE could operate generically on BINARY! and STRING! by using these macros and indexing with them.

But as long as PARSE code tries to use that generality, it's going to be N^2 while iterating. And since PARSE frequently has to iterate its searches repeatedly, that quickly becomes N^3. It's unacceptably slow for day to day use.

So that code must be converted to use iterators, and that iteration has to also be smart about mutation. It's going to take some time, and working on the API and emscripten is higher priority for the moment. But the two tasks kind of tie together, and I've had to rebase parts out of UTF-8 Everywhere a bit at a time...which kind of makes me wish I had more discipline of breaking down commits to being single changes. :-/


In looking around at languages, most are hinged on 2-bytes-per-codepoint, because that's what Windows and Java did. The only one I could find so far that's gone the UTF-8 Everywhere route is Rust. Here's a bit about their "UTF-8 encoded, growable string."

https://doc.rust-lang.org/std/string/struct.String.html

UCS-2 certainly seems an artificial and awkward liability in systems which are trying to handle emojis and such, in systems that only need to support UTF-8. So I was curious if JavaScript implementations were going toward the idea of UTF-8 internally. I found this essay on why when confronted with the need for optimization, they added Latin1 (similar to what R3-Alpha did) instead of just switching to UTF-8:

https://blog.mozilla.org/javascript/2014/07/21/slimmer-and-faster-javascript-strings-in-firefox/

At this point you may ask: wait, it’s 2014, why use Latin1 and not UTF8? It’s a good question and I did consider UTF8, but it has a number of disadvantages for us, most importantly:

  • Gecko is huge and it uses TwoByte strings in most places. Converting all of Gecko to use UTF8 strings is a much bigger project and has its own risks. As described below, we currently inflate Latin1 strings to TwoByte Gecko strings and that was also a potential performance risk, but inflating Latin1 is much faster than inflating UTF8.
  • Linear-time indexing: operations like charAt require character indexing to be fast. We discussed solving this by adding a special flag to indicate all characters in the string are ASCII, so that we can still use O(1) indexing in this case. This scheme will only work for ASCII strings, though, so it’s a potential performance risk. An alternative is to have such operations inflate the string from UTF8 to TwoByte, but that’s also not ideal.
  • Converting SpiderMonkey’s own string algorithms to work on UTF8 would require a lot more work. This includes changing the irregexp regular expression engine we imported from V8 a few months ago (it already had code to handle Latin1 strings).

So although UTF8 has advantages, with Latin1 I was able to get significant wins in a much shorter time, without the potential performance risks. Also, with all the refactoring I did we’re now in a much better place to add UTF8 in the future, if Gecko ever decides to switch.

So it's interesting to see how tied their hands are, and to consider the benefit of being agile enough to do this change now, before its too late.


ANY-WORD! and ANY-STRING!: The Limits of Unification
2018 Retrospective: Elevating the Art
The About-1/3-Of-2018-is-Over Report
#5

Coming back to it to try and get this out the door, I think "half done" was about right. It took a week or two more work...and it's going to take a bit more work before it's "done"...it's a bit on the slow side right at the moment.

Nevertheless, I've gone ahead and committed it....UTF-8 Everywhere Lives!

There's a whole new set of interesting angles to how BINARY! and TEXT! can intermix in PARSE:

I also imported a file from the W3C to the tests, and got things started on how more purposeful tests might be written:

This approach isn't impossible...but it hinges on having a value cell in your hand at the moment of doing the lookup. A lot of places have series nodes that aren't paired with any value, so there'd be no caching.

For the moment, the main caching is just done on the series itself. Small series don't bother with a cache, larger ones could have several. It leads to orders of magnitude in speedup, and a collection of large parses (like source analysis) is reduced to the scale of "minutes" instead of the scale of "a day".

This is where the big speedup is going to come from, but I figured it would be better to phase it. Not only will the extra processing in parse help give this a test for a while, but also people can adapt to the first set of necessary changes before being hit with needing to rewrite any parse rules that expected to modify the iterated series without using parse keywords.