Remove-each x [for-each map-each remove-each] [x = 'remove-each]

hostilefork · August 22, 2017, 1:01am

TL;DR => How do people feel about REMOVE-EACH locking the array which is to be removed from during the removal process (so it appears unmodified while running the loop body, even if removals have been indicated). Then all the removals would be effected at the end. Else, we are going to have to get rid of it as a native, and it will be written as a mezzanine with FOR-NEXT or something. I explain below.

Most iterating constructs in Ren-C apply protections to the iterated series. This ranges from DO, to PARSE, to CASE, etc. While it may seem "more flexible" to not put a temporary lock on a series during such iterations, this introduces quite a number of headaches.

Lack of a lock means the code doing the iterating must have some kind of safety across modifications, and that's not free. If you're lucky enough not to crash, you still wind up with a hodgepodge of semantics.

For instance, what should this do?

cases: [
    true [
        print "hi"
        loop 10000 [insert cases [true [print "expanded"]]]
    ]
    true [
        print "tail"
    ]
]

case/all cases

R3-Alpha did not cache the series tail at the beginning of the CASE, nor is it caching pointers to values in the block. It rereads the series length each time through the loop after user code has run (which has a "cost"), and compares the current index to that. Then, Do_Next() has to speak in terms of the index and block, re-fetching it each time. Hence you get hi hi hi hi... until you run out of memory.

Those wacky folks over at Red do cache the length. Which means you get hi and expanded and that's it. If you're curious about what happens when you shorten the array being iterated, well...

>> cases: [1 [print "1" clear cases] 2 [print "2"] 3 [print "3"]]

>> case/all cases
1
2
3

>> cases
== []

And yes...there it goes, reading beyond the tail, into essentially invalid memory. If they had a garbage collector and it ran, it would presumably not protect the values beyond the tail...and these stale reads could crash on GC'd pointers inside those values. These kinds of bugs were everywhere in R3-Alpha. My crystal ball suggests Red will probably try patching with something that fetches the length each time--but again, each such patch adds a bit of cost, to preserve a nonsensical invariant, and... I'll just stop before I get on too much of a rant here.

I'm glad to say that the workings of frames in Ren-C do the right locking and permit caching, plus generally having workarounds for getting the dynamism people were seeking with the above kinds of manipulations.

But this brings me to the topic of the question: the problematic primitive remove-each. By its very definition it removes things from a series without creating another series. That means that series can't be "locked" during the enumeration...at least not if you expect to see the effect of the removals during the enumeration.

Moreover, if you're going to see the effects during the enumeration it is very inefficient, effectively O(N^2). Each time you do a removal, you have to slide all the data in the underlying array to fill in the gap and leave a coherent array.

Basically, the native's hands are tied to where it gets nearly no benefit from being a native at all. One might as well implement it as a FOR-NEXT, and then at least the fact that the construct is all usermode would mean that one isn't fiddling with trying to write it in C and do all the bookkeeping of caches.

It is technically possible that the array could be locked during enumeration, flip some bits on the values in the array as it went, and then when REMOVE-EACH is finished do a compaction all in one step. That means that you would not see the effects of the removals until after the loop was done. We don't exactly have a ton of bits we can use for this inside the values themselves; so you might wind up with a parallel dynamic allocation of a bit array to do this.

hostilefork · August 22, 2017, 3:13am

Ah, I just realized a problem...REMOVE-EACH on a non-array (e.g. just a string or binary or similar) has nowhere to sneak a "remove later" bit on the bytes of the string or binary.

So at least in the general case, you can't suspend the removals until the end of the enumeration without some allocation proportional to the size of the series. As long as you're doing that, the implementation of REMOVE-EACH might as well just be building the new series data block as it goes, and just do a memory trick of a low-level swap of the underlying pointer for the series when it's complete.

Which is probably how it should have been written in the first place. But, so long as it's arrays of Rebol values and not a BINARY! or STRING! or such, it may be possible to do it in place with one sliding step...if we can chew out a header bit for the purpose. I'll try it.

Brett · August 22, 2017, 12:00pm

I don't recall writing code to add items during a remove-each, possible but probably not. Seems like something that would be likely to break with an implementation change (funnily enough), and it seems contrary in sense to the operation.

Hm, idle thought having read your explanation: should a remove-each like operation take a bitset as an argument for which elements should be removed? It's an idle thought, because I don't really use bitset for anything other than what charset returns.

hostilefork · August 22, 2017, 9:00pm

Indeed, you can't use it effectively unless the semantics are pinned down. And pinning down the semantics means making a commitment to something that ties the hands of the implementation in some ways.

Immutability is conservative in terms of the semantic burden on the construct. The only thing it requires is maintaining and checking a locked bit (well, doesn't require, some systems say "don't change it or all bets are off", which given the lack of commitment from R3-Alpha and Red is effectively what you're winding up with). But if we were going to have any locking at all (e.g. PROTECT) then checking those bits was a sunk cost systemically.

I think it's probably best to just generate a new series for the strings and binaries, and it may even be better in general for the arrays...to push kept elements to the data stack, and build an array from them. Because it would mean the new allocation would be an exact right size for the data. This bias of trying to size data exactly right when possible exists elsewhere in the system. But there are several options once we go with the immutability during, and it plugs the holes I'm looking to plug.

The deferment does introduce some semantics questions though. What should happen if there is a BREAK mid-way through? What about a FAIL?

hostilefork · August 23, 2017, 3:52am

We have options here, but I went ahead with "go ahead and do the removals" for all exit paths...failing included: