AS BINARY! view of strings is back...with UTF-8! (that's only /PART!)

hostilefork · July 23, 2019, 9:33pm

Mutate strings aliased as BINARY!, and vice-versa, as in Rebol2!

Rebol2's AS-BINARY and AS-STRING provided a convenient aliasing between binary and string as Latin1 single-byte characters:

rebol2>> b: as-binary s: "hello"
== #{68656C6C6F}

rebol2>> append b #{68}
== #{68656C6C6F68}

rebol2>> s
== "helloh"  ; binary mutation reflected in original string

rebol2>> append s "ello"
== "hellohello"

rebol2>> b
== #{68656C6C6F68656C6C6F}

That was lost when R3-Alpha's internal string format became too unpredictable (swinging between Latin1 and UCS2) and was only canonized as UTF-8 for I/O. Red suffered a similar fate.

But with UTF-8 Everywhere as the fixed internal format of strings, Ren-C has done some voodoo to bring it back.

It offers a more generic AS operation, along with higher-than-UCS2 codepoint support:

>> b: as binary! s: "hello"
== #{68656C6C6F}

>> to binary! "🐱"
== #{F09F90B1}

>> append b #{F09F90B1}  ; add that high-codepoint cat!
== #{68656C6C6FF09F90B1}

>> s
== "hello🐱"

>> append s "hello🐱"
== "hello🐱hello🐱"

>> b
== #{68656C6C6FF09F90B168656C6C6FF09F90B1}

But a binary alias of a string is constrained to staying as valid UTF-8:

>> append b #{FEFEFEFE}
** Internal Error: invalid UTF-8 byte sequence found during decoding
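
For anyone who wants to poke at the idea outside of Ren-C, here is a rough Python sketch of "one allocation, two views, and the binary view may not break UTF-8". It is not how Ren-C implements it; the `Utf8Buffer` class and its method names are made up for illustration, and the only assumption is that the string's storage is a single UTF-8 byte buffer.

```python
class Utf8Buffer:
    """Hypothetical stand-in for a string stored as one UTF-8 byte buffer."""

    def __init__(self, text: str) -> None:
        # The single shared allocation that both "views" read and write.
        self.data = bytearray(text.encode("utf-8"))

    def as_text(self) -> str:
        # The "string view": decode the shared bytes.
        return self.data.decode("utf-8")

    def append_bytes(self, raw: bytes) -> None:
        # The "binary view": the storage always holds complete, valid UTF-8,
        # so the result stays valid exactly when the new bytes are valid on
        # their own. A strict decode raises UnicodeDecodeError otherwise,
        # and nothing is appended.
        raw.decode("utf-8")
        self.data += raw

    def append_text(self, text: str) -> None:
        # Appending through the string view is always safe.
        self.data += text.encode("utf-8")


buf = Utf8Buffer("hello")
buf.append_bytes(bytes.fromhex("F09F90B1"))  # the cat, added as raw bytes
print(buf.as_text())
```

Calling `buf.append_bytes(bytes.fromhex("FEFEFEFE"))` raises `UnicodeDecodeError` and leaves the buffer untouched, which mirrors the error above.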

You can actually alias WORD! as BINARY! also, without doing a separate allocation. But it will be a read-only view, so all you're doing is saving on memory and GC load:

>> b: as binary! 'immutable-word
== #{696D6D757461626C652D776F7264}

>> append b #{1020}
** Access Error: series is source or permanently locked, can't modify

Similarly, you can alias words as strings...again without making a new allocation, but with the same read-only constraint:

>> t: as tag! 'append
== <append>

>> append t "nope"
** Access Error: series is source or permanently locked, can't modify

The /PART refinement has just been implemented for UTF-8

The controversial behavior can be discussed on issue #2096. But what R3-Alpha and Red choose to (buggily) implement is that it applies to the target series only...and is thus measured in the units of that series:

>> append/part "abc" [100 "de" "fg"] 2
== "abc10"  ; 2 string units, not "abc100de" from 2 block units

The argument is that COPY/PART on the source series gives you that form of /PART if you need it, so this is "strictly more powerful". Rightly or wrongly... Ren-C is now doing it hopefully less buggily (though almost certainly with its own bugs), but with UTF-8 Everywhere support.
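
To make the "units of the target series" rule concrete, here is a small Python sketch. It only models the behavior shown in the example above; `append_part` is a hypothetical helper, and using `str()` to turn block items into text is an assumption standing in for however the items get formed.

```python
def append_part(target: str, items: list, limit: int) -> str:
    """Append the text form of `items` to `target`, adding at most
    `limit` characters (the limit counts units of the TARGET)."""
    formed = "".join(str(item) for item in items)
    return target + formed[:limit]


items = [100, "de", "fg"]

# Limit measured in target units: 2 characters get added.
print(append_part("abc", items, 2))  # abc10

# Limiting the SOURCE instead is just a slice (a copy) taken beforehand,
# which is the role COPY/PART plays in the argument above.
print("abc" + "".join(str(item) for item in items[:2]))  # abc100de
```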

If you like, you can limit how much of a binary you extract from UTF-8, counted in bytes:

>> to binary! "🐱"
== #{F09F90B1}

>> append/part #{} "🐱" 2  ; e.g. 2 bytes (half a cat)
== #{F09F}

Extracting bytes from UTF-8 will always work. Going the other way, not all binaries are valid UTF-8. But as long as the characters you ask for come from a valid section of the binary, invalid bytes elsewhere aren't a problem...you only get an error when the part you ask for reaches into the invalid region:

>> append/part "" #{F09F90B1F09F90B1FEFEFEFE} 2  ; e.g. 2 characters
== "🐱🐱" 

>> append/part "" #{F09F90B1F09F90B1FEFEFEFE} 3
** Internal Error: invalid UTF-8 byte sequence found during decoding
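
Both directions can be sketched in Python to show why the trailing junk only matters when you reach it. This is an illustration under assumptions, not Ren-C's code: `take_bytes`, `take_chars` and `sequence_length` are hypothetical names, and the approach (read a lead byte, strictly decode just that one character, move on) is simply one way to avoid looking at bytes past the requested count.

```python
def take_bytes(text: str, count: int) -> bytes:
    """First `count` bytes of the UTF-8 encoding; always succeeds."""
    return text.encode("utf-8")[:count]


def sequence_length(lead: int) -> int:
    """Length in bytes of the UTF-8 sequence that starts with `lead`."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    raise ValueError(f"invalid UTF-8 lead byte 0x{lead:02X}")


def take_chars(raw: bytes, count: int) -> str:
    """Decode exactly `count` characters from the front of `raw`.

    Bytes after those characters are never inspected, so invalid data
    further along does not cause an error.
    """
    pos = 0
    for _ in range(count):
        if pos >= len(raw):
            raise ValueError("ran out of bytes before reaching the count")
        end = pos + sequence_length(raw[pos])
        # Strict check of this one character only; raises
        # UnicodeDecodeError (a ValueError) if it is malformed or truncated.
        raw[pos:end].decode("utf-8")
        pos = end
    return raw[:pos].decode("utf-8")


data = bytes.fromhex("F09F90B1F09F90B1FEFEFEFE")
print(take_bytes("🐱", 2).hex().upper())  # F09F
print(take_chars(data, 2))
```

Asking for a third character with `take_chars(data, 3)` raises `ValueError`, because the walk finally lands on the `FE` bytes.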

If a binary is actually an alias of a UTF-8 string, this can be more efficient by not rescanning... (though the code is still in its early life, so it has a number of areas for improvement).

Pretty cool, eh?

hostilefork · July 23, 2019, 9:33pm

The /PART refinement has just been implemented for UTF-8

Enough people have been confused about the purpose of /PART that I wonder if /LIMIT makes more sense. It seems /PART usually has more possibilities than just an INTEGER! of how many series units to add to the destination...where negative numbers can mean something other than "nothing" and series positions can be used.

Philosophically I kind of agree that if /PART makes its way into too many functions where COPY/PART would suffice, it may have a viral nature like the /INTO virus. Finding a more general mechanic--such as being able to make "series slices" without copying them--could keep every routine from thinking it needs it.