Length, size, num_bytes, num_chars...?

hostilefork · November 27, 2017, 4:47pm

In trying to implement UTF-8 everywhere, there is a strong need to discern in the C code the difference between number of bytes and number of codepoints. Preferably with different C datatypes, but also preferably with different naming.

It's not clear that the distinction between LENGTH and SIZE will make it to Rebol userspace, because the fact that strings have an underlying byte size different from their number of codepoints shouldn't be visible. So the standard LENGTH should be kept. However, choices on both issues will affect API users.

Here are a few observations:

Rebol has traditionally used LENGTH to mean the number of conceptual units in the series. This is reinforced by operations like SER_LEN(s), which does not return a number of bytes (unless the series units are byte-sized, e.g. a BINARY! series).
C has a standard datatype for measuring sizes in bytes, called size_t. It is defined as the datatype returned by sizeof().
C has traditionally used the function strlen() to count the bytes in a char* string up to (but not including) the null terminator byte, where char may be signed or unsigned depending on the platform. strlen() returns size_t.

When applied to UTF-8 data, the appearance of "len" in the strlen() function call--or in names of variables holding the result--conflicts with the semantics of "length" for the strings being operated on. Currently the variables in question are a mix of len_bytes, num_bytes, and just a plain old (misleading) len.
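To make the mismatch concrete, here is a minimal sketch. The helper utf8_num_codepoints() is a hypothetical name, not something from the Rebol sources, and it assumes its input is valid UTF-8:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the Rebol sources): counts codepoints in a
 * null-terminated string, assuming it is valid UTF-8. Every byte that is
 * not a continuation byte (bit pattern 10xxxxxx) starts a new codepoint. */
static size_t utf8_num_codepoints(const char *utf8) {
    size_t count = 0;
    for (; *utf8 != '\0'; ++utf8) {
        if (((unsigned char)*utf8 & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For the string "caf\xC3\xA9" ("café", with the é encoded as two bytes), strlen() reports 5 while the helper reports 4, so a variable called len holding the strlen() result would be off by one from what a Rebol user calls the length.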

I think that probably the best way to attack this is to stick with the Rebol standard of LENGTH being used when talking about "number of abstract units" and SIZE being used when talking specifically about "number of bytes".
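As a rough sketch of what that convention could look like when paired with distinct C datatypes: the type names Length, Size, and Utf8View and the function utf8_view() below are made up for illustration, and valid UTF-8 input is assumed.

```c
#include <stddef.h>

/* Hypothetical wrapper types (illustrative names only). Because each is a
 * distinct struct, passing a Size where a Length is expected fails to
 * compile, which two plain typedefs of size_t would not catch. */
typedef struct { size_t n; } Length;  /* number of abstract units (codepoints) */
typedef struct { size_t n; } Size;    /* number of bytes */

/* Hypothetical view of UTF-8 data that carries both measurements. */
typedef struct {
    const char *data;
    Length len;   /* LENGTH: codepoints */
    Size size;    /* SIZE: bytes, not counting the null terminator */
} Utf8View;

/* Measures a null-terminated string in one pass, assuming valid UTF-8. */
static Utf8View utf8_view(const char *data) {
    Utf8View view = { data, { 0 }, { 0 } };
    for (const char *p = data; *p != '\0'; ++p) {
        ++view.size.n;                           /* every byte counts toward SIZE */
        if (((unsigned char)*p & 0xC0) != 0x80)  /* non-continuation byte */
            ++view.len.n;                        /* starts a new codepoint */
    }
    return view;
}
```

The struct wrappers cost a bit of syntax (view.len.n instead of a bare integer), so whether the extra type checking is worth it is a judgment call; the naming split works with or without them.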

This isn't the direction C++ went. For containers like std::vector (which we might think of as analogous to series), they went and defined a different type from size_t, unique to each container, called container::size_type. The container's size() member returns that type as a count of elements, not bytes.

So where Rebol has been conflating "length", C++ is conflating "size".

To square another circle here, I'll suggest doing #define strsize strlen, then using strsize() on UTF-8 data, and always using size_t when describing sizes in bytes.
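A small sketch of how that might read in practice. Only the #define comes from the suggestion above; utf8_dup() is a hypothetical helper added for illustration:

```c
#include <stdlib.h>
#include <string.h>

/* The alias suggested above: same function, but the name says "bytes". */
#define strsize strlen

/* Hypothetical helper: duplicates a null-terminated UTF-8 string.
 * Allocation and copying need the byte size, never the codepoint length.
 * Returns NULL if allocation fails; the caller frees the result. */
static char *utf8_dup(const char *utf8) {
    size_t size = strsize(utf8);     /* bytes, excluding the terminator */
    char *copy = malloc(size + 1);   /* +1 for the terminator */
    if (copy == NULL)
        return NULL;
    memcpy(copy, utf8, size + 1);    /* copies the terminator too */
    return copy;
}
```

Since strsize is an object-like macro, it expands to strlen wherever it appears, so the generated code is identical; the only change is that a reader sees "size" next to a size_t and does not mistake it for a codepoint count.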