Stopping the /INTO Virus

hostilefork · June 23, 2018, 1:27am

In R3-Alpha, an /INTO option was added to REDUCE and COMPOSE. It blended the functionality of INSERT into these routines, so as to avoid the overhead of creating an intermediate series that would just be thrown away:

>> data: copy [a b c]
>> insert data reduce [10 + 20 30 + 40]
>> data
[30 70 a b c]

>> data: copy [a b c]
>> reduce/into [10 + 20 30 + 40] data
>> data
[30 70 a b c]

So no new functionality is added...this is a refinement whose sole purpose is to be a lower-overhead way of doing what you could do already.

But...it's narrower. There's no /PART refinement, so you're going to get all of the reduced data inserted if you use /INTO. There's no /DUP, so you'll get one copy only. There's no /ONLY, so arrays will be spliced in. And from a Ren-C perspective, there's no /LINE (which APPEND+MODIFY+INSERT have now)...so you can't have the inserted data give a newline marker.
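To make the narrowing concrete, here is a minimal sketch of what the plain INSERT form can still do with reduced data. It assumes R3-Alpha behavior for /ONLY, /DUP and /PART; the names data1, data2 and data3 are only for illustration. None of these variations can be expressed with REDUCE/INTO.

```rebol
; Assumption: R3-Alpha semantics for INSERT's refinements.

data1: copy [a b c]
insert/only data1 reduce [10 + 20 30 + 40]    ; reduced block inserted as one value
probe data1                                   ; [[30 70] a b c]

data2: copy [a b c]
insert/dup data2 reduce [10 + 20] 2           ; reduced contents inserted twice
probe data2                                   ; [30 30 a b c]

data3: copy [a b c]
insert/part data3 reduce [10 + 20 30 + 40] 1  ; only the first reduced value is inserted
probe data3                                   ; [30 a b c]
```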

Plus, /INTO just has INSERT semantics, and returns the tail of the operation. You can't do MODIFY. And if you want to optimize append data reduce [...] you'd have to generally say head reduce/into [...] tail data. Noting that each function call in the evaluator has cost, and path dispatch takes longer than ordinary dispatch in the first place, one might wonder just how much it's saving...?

I don't want to get into a bunch of artificial examples of /INTO usage to show what it's faster or slower at...but to make my point, here is some R3-Alpha timing of the APPEND form versus the /INTO form:

>> data: copy [a b c]
>> delta-time [loop 1000000 [append data reduce [10 + 20 30 + 40]]]
== 0:00:00.481017

>> data: copy [a b c]
>> delta-time [loop 1000000 [head reduce/into [10 + 20 30 + 40] tail data]] 
== 0:00:00.397192

I'm sure you can craft some situations where it can be shown to perform better...especially when the GC is taken into account. But what I'm trying to get at is that I think this is the wrong place to be looking for optimization.

1. It's asking users to write their code in an unnatural form with more limited options than they'd expect from APPEND+INSERT+MODIFY. (While trying to write the above example I got mixed up and tried to write reduce/into data [stuff to reduce] because saying where to put the reduced data first feels more natural...)
2. It creates a cognitive uneasiness ("am I doing this right? should I have used a REDUCE/INTO"), leading people to write less clear code in pursuit of a performance benefit that may or may not materialize.
3. Everyone writing an operation that generates a new series will now wonder if they have to make an /INTO version as well.
4. REDUCE's increased complexity means more documentation, more refinements to fulfill in the frame, more checking of those refinements, more cost to evaluate the PATH! when the refinement is used, etc.
5. This is all for no increase in functionality.

While all these points are a bit grim, it's actually #3 that worries me most. An even more frightening thought is if people start worrying about adding the missing /PART, /DUP, /LINE components. We can see that /INTO is a precedent that suddenly creeps into the source level consciousness of everyone writing code.

There are a lot of areas to look at for making the system run on the whole faster without the downsides that /INTO brings to the table. There should be a heavy skepticism about introducing these kinds of things. Maybe it can speed some things up, but for what collateral damage?

So I want /INTO to die, and tackle optimization in more systemic ways. And it's easier to tackle systemic optimization when there's less code, following more predictable rules.

Note: Just looking at the one use of COMPOSE/INTO in R3-Alpha for a moment:

opt [#":" copy s2 digits (compose/into [port-id: (to integer! s2)] tail out)]

...it's a good data point on this being a bad direction. Compare with:

opt [#":" copy s2 digits (append out compose [port-id: (to integer! s2)])]

I think playing to Rebol's strengths means doing everything possible to make the second case acceptable enough in performance that such distortions aren't worth it within the domains it would be applied to.

(I'd also like to add that use of CHAR! literals like #":" in PARSE rules that are processing ANY-STRING! is another area where we need to work to make sure there isn't enough of a performance benefit to warrant doing that instead of plain ":". Those are the kinds of things that need to be solved at acceptable cost as well, to get the source to be as clear as it can. You should only be using character literals if matching CHAR! values in BLOCK!s...)

hostilefork · June 23, 2018, 12:04pm

I'm confident enough that /INTO is bad that the deed is done: /INTO is gone.

Though I had one little moment of pause when noticing that COLLECT/INTO was able to KEEP into a string:

>> collect/into [keep 'A keep [B C]] x: copy ""
== "ABC"

What it's doing is basically equivalent to the following:

>> x: to text! collect [keep 'A keep [B C]]
== "ABC"

The difference is, the /INTO version "folds" any data it gets into the string as it goes. Using TO TEXT! after-the-fact means that if you do thousands of KEEPs, you'll wind up with a collected BLOCK! that's thousands of elements long...that can't be GC'd as you go. COLLECT/INTO would contribute material to the string one element at a time with no intermediate block.

Having a string target for /INTO when the source material is a block is thus materially different. When the target is a block anyway, you wind up gathering the same amount of total state...and the GC will clean up any intermediates. But targeting a block instead of folding a string means you might run up against an operation your system had enough memory to do with the fold, but not with the block.

Still...COLLECT/INTO an ANY-STRING! is a false economy

The folding nature of /INTO might make it sound good on paper. But if we look at the big picture, we can see why it's not very compelling. I'm going to write this all out because I want there to be no doubt that exterminating it was the right move.

You only get the TO TEXT! semantics of the underlying INSERT. There are a lot of ways to turn blocks into text, or to fold strings together. How often is what INSERT would do exactly what you wanted?
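As a small illustration of that point, the sketch below contrasts the text INSERT/APPEND produce from a block with what FORM produces from the same block. It assumes R3-Alpha behavior, and the names folded and formed are just for the example. A string-targeting /INTO can only ever give you the first result.

```rebol
; Assumption: R3-Alpha semantics when a BLOCK! is appended to a string.

folded: append copy "" ["a" #"b" 10]   ; items are joined with no delimiters
probe folded                           ; "ab10"

formed: form ["a" #"b" 10]             ; FORM puts a space between every item
probe formed                           ; "a b 10"
```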

Most of the time I've used COLLECT with strings I find myself writing spaced collect [...]. But SPACED treats single characters differently from strings. It assumes you don't want them to participate in the delimiting but want them treated as-is:

>> spaced ["a" "b" #"c" "d"]
== "a bcd"

This is actually very important, because it means newlines don't get delimited since they are a CHAR!. You see weird behaviors on that in R3-Alpha/Rebol2/Red:

red> form reduce ["a" newline "b"]
== "a ^/b"

If a COLLECT/INTO is just folding text and throwing the material it has away, it can't do such subtleties.

You're losing the power of the evaluator. KEEP itself doesn't evaluate, so once something is folded into a string...you don't have any bindings or anything to evaluate anymore. That's lame. One thing about KEEP is that you don't have your hands tied with it... you can say things like:

x: unspaced collect [
   for-each [flag string] data [
       ...
       if flag [keep 'reverse]
       keep string
       ...
    ]
]

But if you folded it as you went, you'd wind up needing some nested collect:

x: collect/into [
   for-each [flag string] data [
       ...
       keep reduce collect [
           if flag [keep 'reverse]
           keep string
       ]
       ...
    ]
] result: copy ""

The second is less clear than the first. And as I said in the first point, it's not like UNSPACED is even the logic you'll be wanting.

The KEEP result doesn't return what you added and may not be that interesting. You get what was passed to keep passed through just like when you're adding to a block:

>> collect/into [print ["KEEP returned:" mold keep [A B C]]] result: copy ""
KEEP returned: [A B C]
== "ABC"

It's just another axis of "your application might have different ideas of what's useful"

If you actually have an issue with scale, your problem is probably complicated enough you'd prefer your own emitter. Stressing the point again: there are tons of ways one might want to make strings from blocks, and the odds COLLECT/INTO picked the one matching your scenario are slim. Probably even slimmer if you're dealing with some gigantic bunch of data that requires folding as you go.

COLLECT is deliberately not a hard function to write (and it's even less so without /INTO complicating it!) You can roll your own emitter very easily:

result: copy ""
emit: specialize 'append [series: result]

Now your EMIT has /ONLY and /DUP and /LINE and /PART. That was quick. But what if we wanted a different return value than the result we've accumulated so far (which is what APPEND returns)? How about an ENCLOSE?

result: copy ""
emit: enclose (specialize 'append [series: result]) func [f [frame!]] [
    f/value: to-text f/value ;-- will be the return result due to ELIDE
    elide do f ;-- runs the append (F/VALUE unavailable after DO F)
]

So now your emit [A B C] gives you "ABC" as a result, the part of the string it added. Sky's the limit.
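For interpreters without SPECIALIZE or ENCLOSE, roughly the same emitter can be hand-rolled with plain FUNC. This is only a sketch under the assumption of R3-Alpha semantics; EMIT, added and first-part are hypothetical names for illustration, and unlike the specialized version it does not inherit APPEND's refinements.

```rebol
; Assumption: R3-Alpha. A hypothetical hand-rolled EMIT, not the Ren-C version above.

result: copy ""

emit: func [
    "Append VALUE to RESULT, returning only the text that was added"
    value
    /local added
][
    added: append copy "" value   ; fold VALUE to text the same way APPEND would
    append result added           ; accumulate into the shared buffer
    added                         ; return just the newly added portion
]

first-part: emit [A B C]          ; "ABC"
emit #"-"
emit "tail"
probe result                      ; "ABC-tail"
```

If you need different folding rules (delimiters, special handling of CHAR!, and so on), the line that computes added is the only place that has to change.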

And that's why you shouldn't shed a tear for the folding /INTO

Summarized simply:

It will always be clearer and more powerful to express a string-targeting COLLECT/INTO as a block-targeting collect with operations that then process that accumulated block.

Thus, the only reason you would have ever favored a string-targeting COLLECT/INTO would be if your problem was one of performance or scale such that holding that intermediate block (and elements it holds) would be excessive. But anyone in that situation probably has complex needs that the limited COLLECT/INTO behavior wouldn't meet anyway!

So it's better to have people learn how to use the powerful tools for designing COLLECT-like abstractions themselves and keep COLLECT simple.

So you shouldn't miss it, nor should you be missing the following bugs, all of which are now "resolved" in Ren-C:

#2081 Make REDUCE/INTO and COMPOSE/INTO work when targeting any-string
#2061 REDUCE/into of non-block doesn't insert into target
#2062 COMPOSE/into of non-block doesn't insert into target
#1748 REDUCE/into and COMPOSE/into bypass PROTECT
#620 EXTRACT /into buffer option
#623 READ /into buffer option.
#709 MAP-EACH /into buffer option
#621 Change COLLECT /into option to use insert semantics

And you can see why I call it a virus. It's not adding functionality, but all of a sudden here it is being tacked on to everything and anything.

hostilefork · December 2, 2020, 4:39am

^-- me in June 2018

Gregg Irwin in December 2019: "TL;DR Remove /into everywhere in Red."

Is /into a good idea at all, anywhere? I vote No. It came about in R3, and I think was championed by Brian Hawley. I'm biased, as I never liked it (Brian I like fine, just not /into). AFAIK, it has never been a proven win. It's an unnecessary complication.

Not that talking sense is going to make much impact over there...

rgchris · December 8, 2020, 5:49pm

/INTO has always seemed a bit awkward. At a native level, it would seem marginally better (more readable) to have INSERT/REDUCE or APPEND/REDUCE, but as REDUCE is kin to COLLECT and COMPOSE, you wouldn't want to have refinements for each of those plus the infinite variants of composers that one could conceive. Whether as mezzanines or within an 'optimization' module, it may be better still to have dedicated functions more suited to this: COLLECT-INTO/COMPOSE-INTO/REDUCE-INTO. For tidiness, if perhaps less optimized, there could be an INTO function that takes a prefix argument for the composition method: COLLECT INTO.

I'm not quite sure how I feel about this as REDUCE is a fairly essential function and I'm reducing it to a specialization of a REDUCE-like function that takes a block argument as a target. Perhaps it should ever have been thus.

This goes back to another pet of mine and that is to have consistent natives within some type of Core module so as not to pollute the 'global' namespace. Essentially a new user can do help "collect" and only have the prescribed one appear, but can still access the more elemental one by doing core/collect copy [] [...collect body...] (or by another name: primitive/collect make block! 16 [...collect body...])