Where the Series Ends: Simplifying Out of Bounds Rules

hostilefork · April 9, 2019, 12:16am

Most languages with arrays have you keep the index separate. Languages that do something fancier typically abstract the process into "iterators"...which may be more complicated than simple integers.

But Rebol made a strange decision to fold an integer into its arrays and strings (the ANY-SERIES!). It is summarized by someone who didn't like it on the C2 Wiki page "Why isn't Rebol Popular":

>> s: "I hate this approach"
== "I hate this approach"
>> s: next s
== " hate this approach"

But this 'I' isn't lost.

>> back s
== "I hate this approach"
>> s
== " hate this approach"
>> s: head s
== "I hate this approach"

Whether you love it or hate it, having a hidden index opens a tremendous can of worms.

If you ask around, few people would know the answers to semantic questions involving this. You get an inkling of how complex it is if you just COPY a series that's not at its head...the data before the index is not copied.

>> x: next "abcd"
== "bcd"

>> y: copy x
== "bcd"

>> back x
== "abcd"

>> back y
== "bcd"

This issue crops up everywhere. What about when you REDUCE a block into another? What about if you COMPOSE?

(UPDATE 1/23/2021: Red has some acknowledgment of the category of problems, see Red Issue #4810 and the linked tickets in, where differences between MAKE and COPY are mentioned)

It also means ANY-SERIES! are actually iterators on data that can be mutated through other references. If you are pointing into a string or block at index 1000, and someone clears all the data out of that string, your value cell still holds the index at 1000. What should the semantics be?

Firstly: why does Rebol have this feature at all?

For such a weird thing you'd have to think it would be good in some way. And it does have several concrete advantages:

It cuts memory use by more than half for series + index. The index is slipped into an otherwise-unused spot in a Rebol "cell". That spot is the size of a platform pointer. (R3-Alpha actually had another spot available, but Ren-C utilizes this for "binding", which is how blocks representing function bodies can stay connected to the specific instance of a function invocation they represent, to give "specific binding" on the words nested underneath them). Storing an independent index would require another INTEGER! cell. But it's worse than that, because that cell would need to live in a variable--meaning there'd have to be a context key cell for it.
It means you don't need multiple return values to return a series and an index.
It reduces the amount of code you have to write. It's not just a matter of putting the series and index into the same variable for the storage size that represents. It's all the storage and processing for the units of code in the references. Where today you can just pass series, you would have to pass series index.

So weird though it may be, it's pragmatic. The language would be pretty different without it, and would need an iterator concept or it would be far worse.

Beyond saving space, there's no "magic" involved

There's not any kind of "strong theoretical basis" for Rebol's inclusion of an index in an ANY-SERIES! value. It has all the weaknesses of an independent integer index in a C/JavaScript/Python-type language. Sticking it in the value itself solves nothing.

So I think it's a bad idea to have the behavior be any different from if the index was being held independently.

This means series should be able to hold an arbitrary integer... negative, or past the end. back back back "abc" should take 3 steps back from the head at 1, to be at index -2. And it should take next next next to get it forward to the head.

Every operation that can be done on a series with its internal index should thus be a synonym for doing that operation on at series index with an external index. This means instead of defining two sets of behaviors, we only have to define one.

hostilefork · November 3, 2019, 2:41pm

Many moons after I made the proposal, @hiiamboris has proposed the same idea for Red (in a way that is a lot harder to read):

https://github.com/red/red/wiki/[PROP]-Series-evolution

He agrees generally on the need for more invariants, but specifically the idea of an unconstrained index:

"An idea is to allow series' offset ( index? ) take any integer value."

I've already tried to tackle the problem with the FIND matrix a bit:

hostilefork · October 22, 2020, 12:02am

Looking at some implementation code recently, I was again struck by just how easy it is to crash the system with out-of-bound series indices.

Because functions like NEXT and SKIP don't generate indices that are out of bound by default, you tend not to see series values which hold an index not supported by the data they point to. This means it usually only arises when you have a non-head reference to a series somewhere when another reference does modifications.

To see what happened, I tried changing the index to an unsigned value (an intptr_t) and created two separate accessors: one for unchecked access (VAL_IDX()), and left the typical reads as triggering failure if used and out of range (VAL_INDEX()).

Not much happened. Most of the tests that errored were fairly esoteric, and fixable:

str: "12"
same? str at str -2147483648

So this is just a case where the SAME? routine was written to compare the VAL_INDEX() of two series. The system has no way to know that the code wasn't planning on taking the extracted VAL_INDEX() and using it to index into the data array...so it erred on the side of caution. SAME? would be an example of one of the functions that would be specifically edited to test the raw indices.

I'm pretty much okay with getting errors out of most routines:

 >> copy at [a b c] -10
 ** Error: operating on series with index < 0

If you do index math and your indices wind up out of range...or if someone edits a series out from under you...then I think it's good to get an error rather than cover it up.

For the moment I made the answer to length of be #[void] if the series is out of range. At this time, I don't know if that's better or worse than raising an error.

hostilefork · October 22, 2020, 9:31pm

Just to make this more concrete, let's look quickly at R3-Alpha.

Here is the definition for BLK_SKIP():

 #define BLK_SKIP(s, n)     (((REBVAL *)((s)->data))+(n))

That is a raw access of data by integer index. If the data pointer s->data contains 10 things and your n is 11, you just went out of bounds.

Then there is VAL_BLK_DATA() which does such an access using the stored series and index inside a cell:

#define VAL_BLK_DATA(v)    BLK_SKIP(VAL_SERIES(v), VAL_INDEX(v))

If we look at the definition of VAL_INDEX() we see there's nothing checking that the index is in range of the data for that series...it just fetches the value in the cell:

#define VAL_INDEX(v)    ((v)->data.series.index)

The only way this could be safe would be if the value cells somehow maintained a rule that the index would never get out of sync with the length of the data. But there's no such rule... and as the code is written, there can't be.

If you look at a function like Cmp_Block() for block comparison, you'll see it blindly running VAL_BLK_DATA() on two series without checking to see if the indices are out of bounds.

You may wonder why something so clearly broken wouldn't crash more often. The answer boils down to two things:

Making out of bounds series typically only happens when a series is modified through two references.
Series memory is not typically reclaimed in R3-Alpha.

So try this one out:

r3-alpha>> data: [a b c d e f g]     
== [a b c d e f g]

r3-alpha>> ref: skip data 4
== [e f g]

r3-alpha>> clear data
== []

r3-alpha>> [e f g] = ref
== true

r3-alpha>> ref
== []

The MOLD of the reference data was sensitive to the idea that the index was past the updated series length, but the EQUAL? was not...so it used an out-of-bounds reference to compare. All the CLEAR operation did was put an end marker at the head of the series...leaving the rest of the data intact, so the comparison succeeded on out of bounds memory.

Red doesn't have this particular behavior...but there's nothing about their model that's any better. So it's really kind of like how the rendering of ref knew to take the past-boundary condition into account. Glancing at it, I gather their EQUAL? does a length calculation based on the index and the present series position before comparing item-wise.

Systemic Problems Need Systemic Solutions. I wish there were some kind of smarter answer than paying for the cost of a length and bounds check on data accesses...and conservatively failing by default. But the only better answers involve immutable arrays (e.g. Clojure).

(Note: This means that functions like "VAL_BLK_DATA()" (now named VAL_ARRAY_AT()) necessarily need to be involved in checking the length--vs getting a pointer and trusting it can just run until it sees a terminator. So I wondered if being index-based instead of terminator-based was not such a bad idea...you've already got the length in hand, so why not count it down...and then arrays don't have to be allocated at one past their needed size. But I tried changing the code for this, and I didn't like it...much more obfuscating and risky.)