Modifying While Iterating: Crash, Nonsense, Predictable, or Illegal?

Every imperative language has this problem: when you modify a data structure, you run the risk of confusing other code that is looking at that data structure at the same time. Whether it's threads competing for access, or just a loop whose body modifies the data being looped over... the problem comes up.

The options for dealing with it are:

  • Crash - You can just crash. R3-Alpha had this in several places--for instance PARSE, where a data pointer (REBVAL*) into the rules series was incremented along the rules...even though code was permitted to change the rules as you went (via GROUP!s). The rule series might grow and be retargeted at a larger memory block, leaving the old pointer dangling into memory that had been freed or reused.

  • Nonsense - You can make the iteration never crash, by only permitting modifications that the iteration is able to detect. This means additional checking must run in the iteration to make sure it picks up on the signal that something has changed. For instance, in the PARSE example above there could be a check of the length and a re-fetch of the pointer by index; if the index is out of bounds, it could error. If still in bounds--even if now pointing in between rules--it could take a shot at continuing, and just fail as if the nonsense had been how the rules were written all along. (A sketch of this re-check-and-refetch idea appears after this list.)

  • Predictable - Some languages offer guarantees about what kinds of modifications are safe, and which are not. For instance, it may say that if you are iterating a map, anything you remove that you haven't visited yet won't be seen... or that anything you insert won't cause something already visited to be seen again. They might tell you explicitly when all bets are off--saying the effects may vary, or even that it may crash. (C++ has container guarantees, and things outside the guarantees may crash.)

  • Illegal - This has been Ren-C's strategy so far... to simply say you cannot modify things that are being iterated. Some routines (like FOR-NEXT) aren't considered to be iterating, because they are just running NEXT on a variable each time, and so have the same between-iteration invariants as if you had done it yourself.
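
As a concrete (and purely illustrative) sketch of the Nonsense approach--with made-up types and names, not R3-Alpha's actual code--the idea is to drop the raw pointer and re-derive the position from an index on every step:

#include <stddef.h>

/* Stand-in for a series whose data buffer may move or shrink while the
 * loop body runs. */
typedef struct { int* data; size_t length; } Block;

/* Returns the next item, or NULL when the (possibly shrunken) block is
 * exhausted--so the caller can raise an error instead of crashing.  Whatever
 * happens to be at the slot now is treated as if it had been written there
 * all along. */
int* Fetch_Next_Rule(Block* rules, size_t* index) {
    if (*index >= rules->length)      /* length re-checked every time, since */
        return NULL;                  /* the body may have changed the rules */
    return &rules->data[(*index)++];  /* address re-derived from the index */
}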


It's likely that most people agree that crashing is not a good choice. Python says:

"I'm not saying it's uncrashable. I'm saying that if you crash it, it's a bug unless proven harebrained."

So what they get are exceptions, e.g.:

my_list = [1,2,3,4,5,6,7,8,9] 
for i in range(len(my_list)): 
    if my_list[i] == 8: 
        del my_list[i] 

 IndexError: list index out of range

Rebol's (and Red's) history seems to aspire to non-crashing Nonsense. There's not a lot of emphasis on what the promised invariants are, but lip service is given to the idea that "crashing is bad". So when a crash is pointed out, a patch is made at that point based on some testable property (index out of bounds, etc.) and a non-crashing behavior is picked. Repeat as crashes are discovered.

I've felt like Illegal has worked out pretty well for Ren-C. It's conservative...but the general rule of thumb is that modification during iteration is a so-called "code smell". You can always do it another way. And if this ever became a real performance problem, a dedicated native catering to a certain explicit, stylized modification could optimize for that particular case (e.g. REMOVE-EACH).

But stackless introduces a problem for locking series: it means locks are no longer necessarily taken in a stack-like manner, so locking cannot be done via a single bit.

Previously, each series had a single bit on it... SERIES_INFO_HOLD. This bit is in the same bitflag set as other sources of read-only-ness, such as SERIES_INFO_PROTECTED (can be turned on and off) and SERIES_INFO_FROZEN (permanently immutable). So all three could be conveniently checked in one masking operation on modification.

An operation that wanted to take a hold would look to see if the series already had SERIES_INFO_HOLD. If so, it would do nothing. If not, it would set the bit, and note in its own state that it had taken a hold it needed to release...releasing it when it was done.
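
Roughly speaking, that protocol looks like the following. This is a minimal sketch using the flag names from above, but the struct layouts and helper names are made up--it is not the actual Ren-C source:

#include <stdbool.h>
#include <stdint.h>

typedef struct { uint32_t info; /* ...rest of the series node... */ } Series;
typedef struct { bool took_hold; /* ...rest of the frame state... */ } Frame;

enum {
    SERIES_INFO_HOLD      = 1 << 0,  /* temporary, while an iteration is live */
    SERIES_INFO_PROTECTED = 1 << 1,  /* user PROTECT, can be switched off */
    SERIES_INFO_FROZEN    = 1 << 2   /* permanently immutable */
};

/* All three sources of read-only-ness checked in one masking operation */
bool Is_Read_Only(const Series* s) {
    return (s->info
        & (SERIES_INFO_HOLD | SERIES_INFO_PROTECTED | SERIES_INFO_FROZEN)) != 0;
}

/* Take the hold only if nobody deeper on the stack has it already... */
void Take_Hold(Frame* f, Series* s) {
    if (s->info & SERIES_INFO_HOLD)
        f->took_hold = false;        /* an enclosing iteration owns it */
    else {
        s->info |= SERIES_INFO_HOLD;
        f->took_hold = true;         /* remember we must clear it */
    }
}

/* ...and clear it only if we were the one who set it. */
void Release_Hold(Frame* f, Series* s) {
    if (f->took_hold)
        s->info &= ~SERIES_INFO_HOLD;
}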

This does not work in a stackless model: the enumeration that set the bit may finish (or be discarded) while other enumerations of the same series are still alive, leaving them unprotected. At a minimum, the series would need a lock count that gets incremented and decremented. The miserly design for series does not currently have space to put that in the series node. Something will have to grow, somewhere.
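
A counted variant might look like this--again just a sketch, where the hold_count field is an assumed addition that the current node layout has no room for:

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t info;
    uint8_t hold_count;   /* assumed new field, needs space somewhere */
} Series;

void Take_Hold(Series* s) {
    ++s->hold_count;                 /* every iterator counts itself... */
}

void Release_Hold(Series* s) {
    assert(s->hold_count > 0);
    --s->hold_count;                 /* ...in whatever order it finishes */
}

bool Is_Held(const Series* s) {
    return s->hold_count > 0;        /* modifications fail while nonzero */
}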


Python's approach seems to please people enough... Stack Overflow questions about modifying while iterating are generally answered with "avoid it". There are no locks, and the code just does what it does.

This is a deeper question for Rebol than for Python, because the locking questions apply to source code itself (source blocks, parse rules, etc.)...not just user data structures.

I hate to give up putting iterative holds on series and letting chaos win. We wouldn't want that in our filesystems, so why allow it in a programming language? The problem of releasing locks is something I think the DEFER mechanism can probably handle pretty well.

I've been thinking about some hybrid approaches which use a bit if it's sufficient, but only break out into a locking table if there's more than one iterator. Build options could decide: no locks (allow crashes), cheap locks (reference counts), heavy locks (be able to tell you which frame took the lock).
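
One shape such a hybrid could take--continuing the definitions from the single-bit sketch above, with the side-table API here being entirely assumed--is to let the bit cover the common single-iterator case and spill additional holders into a table keyed by the series node:

/* Assumed side-table API (not real Ren-C functions): maps a series node to
 * a count of holders beyond the first one. */
int  Holds_Table_Count(const Series* s);
void Holds_Table_Increment(Series* s);
void Holds_Table_Decrement(Series* s);

void Take_Hold(Series* s) {
    if (!(s->info & SERIES_INFO_HOLD))
        s->info |= SERIES_INFO_HOLD;      /* first holder: just the bit */
    else
        Holds_Table_Increment(s);         /* second+ holder: side table */
}

void Release_Hold(Series* s) {
    if (Holds_Table_Count(s) > 0)
        Holds_Table_Decrement(s);         /* some other holder remains */
    else
        s->info &= ~SERIES_INFO_HOLD;     /* last one out clears the bit */
}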

I'll probably give an enhanced locking method a shot, and stay the course on prohibiting modification during iteration (unless using an iterator that specifically accounts for that).


Out of curiosity I decided to go ahead and crack open Red's concept of FOREACH and how it is implemented.

I found FOREACH* here in %natives.reds. (Remember that it is Red/System code; it's best to think of it as being sort of C-like, despite the syntax looking Rebol-ish.)

It has two helper functions, based on whether the iteration variable spec is a BLOCK! or not: foreach x ... vs. foreach [x y] ... (that is, whether the words are given as a block--not whether the data being iterated is a block). These helpers are called FOREACH-NEXT and FOREACH-NEXT-BLOCK.

Looking at the more fundamental case first of FOREACH-NEXT, the primary logic is:

result: loop? series
if result [
    _context/set word actions/pick series 1 null
    series/head: series/head + 1
]
result

Roughly we see that this is built on top of PICK mechanics with an argument of 1. A copy of the series cell is held by the FOREACH, and it keeps updating the head in that cell as it goes. RESULT is a local variable which says whether there is an element to pick; if there is, the word gets set to it and the index advances... and RESULT is also returned to the higher-level FOREACH* caller to tell it when to stop evaluating the body.

So the "safety" in this process comes from LOOP?...which just essentially asks if the series index is past the tail of the data buffer. The buffer is retrieved each time, and everything here is speaking in terms of indices instead of pointers.

This reveals an implementation detail: maps are just BLOCK!s of data with some specially marked missing elements. (That was true of R3-Alpha and Ren-C as well... the accelerated performance comes from a side structure associated with the block.) It means those missing elements must be skipped, which is done by calling a map-specific "enumerate-and-set" variation of the pick-based SET-MANY helper, named MAP/SET-MANY. It knows how to skip over deleted keys.
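
To illustrate the general pattern (this is not Red's actual code, just a generic C sketch of enumerating a flat key/value array that uses tombstones for removal):

#include <stdbool.h>
#include <stddef.h>

typedef struct { const char* key; int value; } Pair;   /* key == NULL means deleted */

/* Advance *pos past tombstones; return false when only tombstones (or
 * nothing) remain--the equivalent of the enumerate-and-set helper reporting
 * "stop the loop" even though a naive length check thought there was data. */
bool Next_Live_Pair(const Pair* pairs, size_t len, size_t* pos, Pair* out) {
    while (*pos < len) {
        if (pairs[*pos].key != NULL) {   /* skip MAP_KEY_DELETED-style slots */
            *out = pairs[*pos];
            ++*pos;
            return true;
        }
        ++*pos;
    }
    return false;
}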

At first I thought that for this to work, they'd have to maintain an invariant that there are no MAP_KEY_DELETED entries at the tail of the series... otherwise their LOOP? would say "yes, there's more data" for the enumeration when all it would find is MAP_KEY_DELETEDs. Since their map removal doesn't account for this, I wondered why you don't get a bug from it, until I saw the reason:

result: map/set-many blk as red-hash! series size

So although the LOOP? check said there was data in this case, a run of nothing but MAP_KEY_DELETEDs gives SET-MANY the chance to send a failed result back to FOREACH and stop the loop.

So what does this mean?

It's dirt-simple code. It refreshes buffers and pointers on each iteration, and offers non-crashing and relatively obvious invariants. If you add an item to a map, you may or may not see it in the enumeration of upcoming elements... based on whether it hashes to a slot before or after your enumeration point.

For instance, let's try deleting the a key from a map and then adding a y key while enumerating:

red>> m: make map! [a: 10 b: 20 c: 30]
red>> remove/key m 'a
red>> n: 0
red>> foreach [k v] m [n: n + 1 if n = 1 [m/y: 40] print [k v]]
b 20
c 30
y 40

That didn't hash into the same bucket as a, so it didn't reuse the slot... leaving it as MAP_KEY_DELETED. The easiest way to pick something that hashes into the same slot as a is just to use a itself again, so let's add that during the enumeration and see the difference:

red>> m: make map! [a: 10 b: 20 c: 30]
red>> remove/key m 'a
red>> n: 0
red>> foreach [k v] m [n: n + 1 if n = 1 [m/a: 40] print [k v]]
b 20
c 30

Promising such behavior may not leave the door open for much more sophisticated implementations of map... e.g. a sophisticated implementation might have trouble promising you won't see the same key twice in an enumeration if you reshuffle the map. Maybe that's okay, I don't know.

One thing I do know is that this kind of approach would be a problem for UTF-8 Everywhere and strings. We do not guarantee being able to rapidly get to a byte address in a string by index (although there is a strategy for trying to mitigate that through caching). But it becomes a good reason for FOR-EACH and friends to put an enumeration lock on the string.
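
The caching strategy is only alluded to above; purely as a hypothetical sketch of the shape it might take (none of these names or layouts are from the Ren-C source), a single "bookmark" mapping one codepoint index to its byte offset makes sequential access cheap--and shows why any mid-enumeration modification would be so disruptive, since it invalidates the bookmark:

#include <stddef.h>
#include <stdint.h>

typedef struct {
    const uint8_t* utf8;   /* variable-width encoded data */
    size_t num_bytes;
    size_t cached_index;   /* codepoint index of the bookmark */
    size_t cached_offset;  /* byte offset corresponding to cached_index */
} String;

/* Find the byte offset of codepoint `index`, scanning forward from the
 * bookmark when possible (UTF-8 continuation bytes look like 10xxxxxx). */
size_t Byte_Offset_Of_Index(String* s, size_t index) {
    size_t i = 0, offset = 0;
    if (index >= s->cached_index) {        /* reuse the bookmark */
        i = s->cached_index;
        offset = s->cached_offset;
    }
    while (i < index && offset < s->num_bytes) {
        ++offset;
        while (offset < s->num_bytes && (s->utf8[offset] & 0xC0) == 0x80)
            ++offset;                      /* skip continuation bytes */
        ++i;
    }
    s->cached_index = i;                   /* update the bookmark */
    s->cached_offset = offset;
    return offset;
}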

Not having a locking system also means every bit of evaluator code has to be paranoid all the time. If any piece of user code can modify anything anywhere, then every time you run the evaluator it has the potential to invalidate the assumptions the calling evaluator was in the middle of relying on.

I've given a lot of examples of this, and they are trivial to come up with:

red>> obj: make object! [x: 10]
red>> do code: [obj/x: (clear code recycle 20)]
**crrrrash**

Being able to put holds on these series makes Ren-C robust.

ren-c>> obj: make object! [x: 10]
ren-c>> do code: [obj/x: (clear code recycle 20)]
** Access Error: series has temporary read-only hold for iteration

(You also get a performance benefit, because you can use direct pointers without worrying about them changing. Look at how Red has to re-fetch the buffer and check the indexes... a direct pointer that you can trust won't change out from under you is faster.)

In such cases, you're witnessing what happens when the evaluator and usermode code are in contention over the same data. I feel like not letting people actively interrupt extant enumerations--while a step toward the "immutability religion" of pure functional programming--isn't all that restrictive in practice. When you look at the big picture, this just seems like a better answer for the system overall.


We also do not want the chaos to be in our databases.

There is no way of telling beforehand what the programmer is trying to achieve, but making this impossible is the other extreme. I think it is a good thing to be aware of.

Well, I don't like crashing any more than the next guy. But we are aiming for interpreted, not compiled (unlike Red), so there is a further choice: break into the debugger. This, I like. And it completely avoids the (in my opinion, at best unending) prevent-guns-from-shooting-feet "problem".



Q. What should the following code do?

do code: [append code [print "addendum"]]

Here's Rebol2 and Red's answer:

rebol2/red>> do code: [append code [print "addendum"]]
== [append code [print "addendum"] print "addendum"]

While R3-Alpha does this:

r3-alpha>> do code: [append code [print "addendum"]]
addendum

So R3-Alpha noticed the new addition of code to what it was executing, and Rebol2 and Red didn't.

While either answer may seem reasonable in some way, the general case of having the array shuffled more drastically does not have a sensible answer. As outlined in my post, I've liked Ren-C's answer better:

ren-c>> do code: [append code [print "addendum"]]
** Access Error: series has temporary read-only hold for iteration

...but I broke it, and I don't know exactly how to fix it.

I was stitching together arrays into one continuous execution. So you could have one array that says [1 +] and another that says [2 * 3] and put them together to act as [1 + 2 * 3]. The problem is that protections only applied to the segment of the array you were currently processing.

Implementation-wise, there was a hold on an array as long as you were enumerating. It would drop the hold after the enumeration was done... but before the last function ran.

So you wouldn't be able to write:

>> do code: [clear code print "Can't do this"]

But you would be able to write:

>> do code: [print "This was allowed" clear code]

...because by the time it actually ran the CLEAR command, the enumeration was finished.

I worked around it in a sort of dumb way, but this just points out the continued need for a solid philosophy about this topic. We can't really hit perfection with this model, but we can quantify the imperfection and say what the guarantees are and aren't. And I don't consider "there are no promises, it's all pretty much random" to be acceptable.