Moving Away From "NULL termination" (END!) of BLOCK!s

hostilefork · December 30, 2020, 10:29am

Ren-C preserved an idea from R3-Alpha...which was that there would be a cell type byte reserved to signal the end of an array. This is a bit like how null terminators are used with C strings. However, arrays also tracked their length. So it was redundant information.

In R3-Alpha, the special cells were given the END! datatype. Sometimes you would see bugs that would leak the existence of this internal type to the user. Ren-C hid it more effectively, by not making it an actual "type".

On the plus side, this provides a clean-looking way to walk through the values in an array:

Cell* item = Array_Head(array);  // first cell pointer in the array
for (; Not_End(item); ++item) {
    ...
}

However, there are several downsides:

You have to pay for a dereference on each step. item is a pointer, and you have to follow that pointer to its memory location to read a byte there to see if you've reached the end. This probably isn't that bad, because odds are you are going to be working with that memory inside the loop anyway. But maybe you aren't...and you certainly aren't going to be for the last cell.
You typically wind up paying a cell's worth of cost for this convenience. If your array is empty, it still needs space for at least one cell. If your array has one cell, it needs space for two. If it has two it needs space for three, etc. This isn't just an extra byte (as in C '\0' termination)...it's 4 platform pointers. So 32 bytes of oft-wasted space on 64-bit platforms for a mostly empty cell.
But rounding up by 1 is even worse than wasting one cell... because it propagates to rounding up in the memory pool block size, and memory pools are sized in multiples of 2 (2, 4, 8, 16, etc). So if what you really want is a two-cell array--e.g. enough for a/b, you move up to the next size and take a chunk from the 4-cell pool. A 4-cell array needs to come from the 8-cell pool. Etc.

Should We Scrap This Idea?

It's bothered me for a while, but since it might make enumeration faster in some cases I've let it hang around. Having a terminator has helped catch out of bounds cases more easily.

But I think the time has come to demote termination to a debug-build-only practice. It's gotten in the way of too many interesting optimizations.

Data point: Red doesn't do it. They just store the pointer to the tail of the data (in the slot where R3-Alpha stored the length). It works either way since you can calculate the length by subtracting the head from the tail...or calculate the tail by adding the length to the head. I'd imagine the tail is needed more often.

The code isn't usually that much worse:

Cell* item = Array_Head(array);
Cell* tail = Array_Tail(array);
for (; item != tail; ++item) {
    ...
}

But sometimes there were cases that a function would be passed a Cell* resident in an array, without passing the array also. And then it would enumerate that value forward until it reached an end. Such routines aren't all that common, but a few do exist. They'd need to be revisited.

It's not that huge a deal, and kind of trivial in the scheme of things. But it would touch a lot of code. But, as usual in Ren-C...the asserts can keep it running.

END signals would still exist

The END cell type is important for other reasons. It's used in rebEND as a terminator for C va_list arguments, and that's not going away. There are other applications which are beyond the scope of this post to explain.

And as I say, termination of some kind would probably continue in debug builds. So they might over-allocate to have enough room at the tail to put an end cell, just to get errors to trigger if you went past the limit.

So let's not malign the END marker too much. It has been a valuable contributor.

hostilefork · December 30, 2020, 6:50pm

A post was split to a new topic: Should There Be Distinct Debug/Release Builds?

hostilefork · February 14, 2021, 7:23am

This is now in place. It saves a decent amount of memory, as short series are pretty common (e.g. PATH!s like a/1 have two cells in them, and it's better to be able to fit them in the 2 cell pool instead of needing to put them in the 4 cell pool).

It's probably going to hit some bugs here or there for a while. But still, the code continues to be fairly robust during such reorganizations.