History of Rebol Unicode
Rebol2 lacked unicode support, and one of the agenda items for Rebol3 was to support unicode. This was an era that predated emoji--where codepoints that would take up more than two bytes were rare. And as a cross-platform program that had to run on Windows (where Unicode support meant UCS-2 or UTF-16 or whatever people think Windows supports), the lowest denominator of 16-bit characters probably seemed fine to Carl.
The byte-sized string standard of Rebol2 had no particular prescription of what the bytes meant. As in the DOS days, the low 127 characters were assumed to be ASCII, with 128-255 being whatever was relevant in your country--or box drawing characters--or what-not. R3-Alpha adapted this by having two string formats:
-
byte-sized strings where basically the bytes encoded the first 255 codepoints of Unicode, which are ASCII plus the Latin-1 supplement
-
wide strings where the codepoints can be 0-65535, in an interpretation basically as UCS-2
Which kind of string you have was simply detected by looking at the element width of the series the TEXT! cell points at. If the series node says its elements are size 1, you have a byte-size string. If the series node says its elements are size 2, then it's a wide string.
Hence byte-size strings aren't UTF-8 encoded. You might have to re-encode a byte-size string to get UTF-8, if it contains any characters in the 128-255 range. (Clearly all wide strings must be re-encoded to get UTF-8.)
Inserting a character into a byte-sized string must thusly look at the width of the string and the codepoint, and potentially perform a "widening". The same goes for inserting any string into another--depending on how "optimal" one is trying to be, not all insertions of wide strings into byte-size strings need cause an expansion, only those which are inserting a substring that actually contains codepoints over 255.
On the subject of "trying to be optimal", there's a question of removing substrings. Once a wide string has removals, you could scan the remaining codepoints to decide if it could be compacted back to a byte-sized string again. But as it happens the code doesn't try to do this--once a string is widened, it stays wide.
Then there is an additional problem: codepoint use above 65535 has actually become commonplace, with emojis.
Is there a better way?
The kneejerk reaction to solve the lack of support for high codepoints might be to continue the widening process. That would mean 1, 2, and 4 byte string variants, with progressive widening as the need arises...and presumably little-if-any thinning. This is what Python did, and also what Red does (or at least was the plan):
Red's Plan For Unicode Support (2012)
However, I've felt uneasy about going down this road. If Rebol's goal is to eliminate unnecessary complexity and get real leverage, and UTF-8 support is a given anyway, what if UTF-8 were the only representation used?
I was leaning that way despite some technical hurdles. Then I was further emboldened by the UTF-8 Everywhere Manifesto, which articulated good arguments for why what I'd been feeling wasn't crazy. It's a good read, for anyone who hasn't read it:
The UTF-8 Everywhere Manifesto
If Rebol switched to using UTF-8 as the sole internal representation for strings, there would be tremendous advantages:
-
It would (obviously) give the missing support for high codepoints--which has been a sticking point for @IngoHohmann and others.
-
Storage would stay efficient because there'd be no "widening" of strings to undo.
-
PARSE could reuse the UTF-8 based scanner on any strings, so
parse "<< 1020 {Vidlice je úžasné} >>" [thru "<<" set i integer! set s string! thru ">>"]
could expose the scanner . -
Unification of ANY-STRING! types and ANY-WORD! types would come even closer. Ren-C converted WORD!s from using their own strange static tables into UTF-8 based series, which gave them the ability to be GC'd...and now that would match what strings used.
-
Comparisons and conversions (
to string! word | string = spelling-of word
) would speed up across the board. With only one format, there's only one format to compare. -
It's already the case that rebDo("print reverse", block, END) takes advantage of the assumption that C string literals are valid UTF-8, to distinguish them from other Rebol-headered-types and instructions. But even though the default ANY-STRING! REBVAL in the API containing "print reverse" would be spliced as a string literal in the values, with it being UTF-8 a modifier could allow rebDo(rebCode(strPrintReverse), block, END) to treat the STRING! as code just like it had come from a C literal.
-
Since UTF-8 support is a sunk cost already for scanning and conversions, it would definitely mean more code would be deleted than added.
If it's so great, why hasn't this been done already?
One of the main problems that the UTF-8 manifesto cites, Rebol doesn't actually have...which is quickly determining of the length of a string. The length in codepoints can be cached in the series node, and since STRING! values contain their offset in codepoints, a length calculation is simply a matter of subtraction (as it is today for an ANY-SERIES! at any position)
But picking out of strings would be relatively slow. If you want the 100th character of an ANY-STRING! today, as in i: 100 | string/:i
, that merely does array access into the string data...a simple addition. With UTF-8 only, in the worst case with no optimization, each index access must scan from the head of the data for the string. (Due to UTF-8's design, it's actually relatively easy to skim forwards and backwards by character over an encoded sequence, but it's still work and has to read a lot of bytes along the way to its position).
This might not seem that much of a killer in Rebol, as people use the iterative next and back anyway...which we might imagine could cache the real byte offset of the character position in the value cell. Yet the real problem is that next and back can't presume safety in navigation of the byte stream if it's mutated out from under it. For a given ANY-STRING! there may be many potentially mutating values referencing into the same bytes, and if any one of them changed the bytes then it would mean that they could corrupt that cache...leading to incorrect or corrupt positions in the byte stream.
So the Achilles Heel of the idea is: even if we're fine letting string/:i
be relatively inefficient, it would be intolerable if the following code were O(N^2) with respect to the length of the string:
pos: string
while [not tail? pos] [
print first pos
pos: next pos
]
But I think there's a way to finesse this. Basically, build in a mutation generation counter to each ANY-STRING! series. Then use the currently unused extra platform-pointer-sized chunk in mutable string values (which are at least 32-bits) to store some information about that stamp along with the byte offset of the character index when that stamp was taken.
So imagine if 16 bits were used for the mutation stamp, and 16 bits were used for the real byte offset. The real byte offset could be compressed as a delta from the character offset, because if character 1000 is as byte offset 2000 then storing the difference as 1000 is using fewer bits. There are also header bits which could be borrowed to further modify interpretation of the offset (bits to signal you should "multiply it by 2 and add 1", for example).
Now the problem of false-positives when the mutation counter rolls over. Well, basically you have to visit all the value cells that might have outdated mutation stamps and update the cell to say it has to re-seek. But that's basically visiting all the nodes a GC would visit, so you might as well GC while you're doing it. For that matter, each GC (or occasional GCs) could be going around and resetting these counters on strings with mutations anyway, making the need to trigger a synchronous GC on string mutation less likely.
This isn't foolproof, and there are going to be some pathological situations. I'm imagining some string algorithm that wants to insert data through one STRING! reference at one position, then write it through another STRING! reference at another position, and keep swapping back and forth. You'll wind up re-seeking every time, and these mutation checks will add overhead even to that.
Two thoughts:
-
It seems obviously useful to have a bit on strings that are known as having only ASCII codepoints, so that you can just do the byte-offset without worrying. The mutation stamp can remain unmodified until it transitions to containing at least one non-ASCII codepoint. It would be worth it to try and keep this bit fresh by any operation that incidentally has to go through all the bytes of a string for some other reason. Then those strings won't have this kind of problem.
-
If someone really does have a crazy string algorithm, Rebol's efficiency and cost curve is such that it's not that egregious to make a BLOCK! of CHAR! and do your algorithm on that. Not only does the uniform cell size of a block mean O(1) access to the elements, but the algorithm can make use of all kinds of tricks that might make it more efficient than bookkeeping a bunch of string pointers. (You could put BLANK!s in cells you plan to clear out at the end, for instance, without worrying about the problems of corrupting embedded nulls if you tried that on a raw ANY-STRING!)
I'd thought of the signature idea before and felt like it was probably workable, but it's these two last ideas that give me the final boost of "yes, this is the right plan". I'm anticipating wild success.