In trying to implement UTF-8 everywhere, there is a strong need for the C code to distinguish between a number of bytes and a number of codepoints, preferably with different C datatypes and also with different naming.
It's not clear that the distinction between LENGTH and SIZE will make it to Rebol userspace, because the fact that strings have an underlying byte size different from their number of codepoints shouldn't be visible. So the standard LENGTH should be kept. However, choices on both issues will affect API users.
Here are a few observations:
- Rebol has traditionally used LENGTH to mean the number of conceptual units in a series. This is reinforced by operations like SER_LEN(s), which does not return a number of bytes (unless the series units are byte-sized, e.g. a BINARY! series).
- C has a standard datatype for measuring sizes in bytes, called size_t. It is defined as the type of the result of the sizeof operator.
- C has traditionally used the function strlen() to count the bytes up to a null terminator byte, given a char* (a pointer to char, which may be signed or unsigned depending on the implementation). strlen() returns size_t.
When applied to UTF-8 data, the appearance of "len" in the strlen() call (or in the names of variables holding its result) conflicts with the semantics of "length" for the strings being operated on. Currently the variables in question are a mix of len_bytes, num_bytes, and the plainly misleading len.
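To make the mismatch concrete, here is a minimal sketch in C. The helper utf8_codepoint_count() and the example string are hypothetical names made up for illustration, not existing Rebol code, and the helper assumes well-formed UTF-8 (it does no validation).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of any existing codebase): counts the
 * codepoints in well-formed, null-terminated UTF-8 by counting every byte
 * that is NOT a continuation byte (bit pattern 10xxxxxx).
 * Assumption: the input is valid UTF-8; nothing is validated here. */
static size_t utf8_codepoint_count(const char *utf8)
{
    assert(utf8 != NULL);

    size_t count = 0;
    for (const unsigned char *p = (const unsigned char *)utf8; *p != '\0'; ++p) {
        if ((*p & 0xC0) != 0x80)
            ++count;
    }
    return count;
}

/* "héllo": the "é" is U+00E9, encoded in UTF-8 as the two bytes C3 A9,
 * so the string has 5 codepoints but occupies 6 bytes. */
static const char example[] = "h\xC3\xA9llo";
```

With these definitions, strlen(example) gives the byte count while utf8_codepoint_count(example) gives what Rebol would call the LENGTH, so a variable named len that holds the strlen() result is naming the wrong quantity.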
I think that probably the best way to attack this is to stick with the Rebol standard of LENGTH being used when talking about "number of abstract units" and SIZE being used when talking specifically about "number of bytes".
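This isn't the direction C++ went. For containers like std::vector (which we might think of as analogous to series), they defined a type name separate from size_t for each container, called container::size_type (in practice it is usually an alias for size_t). The container's size() returns that type, and it counts elements, not bytes.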
So where Rebol has been conflating "length", C++ is conflating "size".
To square another circle here, I'll suggest doing #define strsize strlen, then using strsize() on UTF-8 data, and always using size_t when describing byte lengths.
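As a rough sketch of how that convention might read in practice: the strsize alias is the one suggested above, while the len_t typedef and the utf8_prefix_size() helper are hypothetical names introduced only for this example. The helper assumes well-formed UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The suggested alias: same function as strlen(), but the name says
 * "size in bytes" instead of "length". */
#define strsize strlen

/* Hypothetical typedef for counts of codepoints ("length"), kept distinct
 * in name from size_t, which is reserved here for counts of bytes. */
typedef unsigned int len_t;

/* Hypothetical helper: returns the SIZE in bytes of the first `length`
 * codepoints of a null-terminated UTF-8 string. If the string has fewer
 * codepoints than requested, the size of the whole string is returned.
 * Assumption: the input is well-formed UTF-8; nothing is validated. */
static size_t utf8_prefix_size(const char *utf8, len_t length)
{
    assert(utf8 != NULL);

    const unsigned char *start = (const unsigned char *)utf8;
    const unsigned char *p = start;

    while (*p != '\0' && length > 0) {
        ++p;                              /* consume the lead byte */
        while ((*p & 0xC0) == 0x80)       /* then any continuation bytes */
            ++p;
        --length;
    }
    return (size_t)(p - start);
}
```

A call site would then pair each name with its unit, e.g. size_t size = strsize(data); for bytes and len_t length = 2; for codepoints, so a signature like utf8_prefix_size(data, length) shows at a glance which argument is which.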