Semantics of UPARSE's FURTHEST

Upon the announcement of UPARSE, @Brett listed as his secondmost missing feature the idea of knowing "how far a parse got":

  1. An ability to return the furthest input point matched and the rule that caused rollback from there on parse failure. During development of rules this generally indicates the rule that is not properly specified.

I implemented a wrapper around combinators which basically got control before each combinator, and after. Every parser that succeeded I'd look at how far it said it got in the input...and if that was further than the previous "furthest" it would bump it.

But what happens when a combinator calls a parser that succeeds, but that success isn't used?

Examples would be things like AHEAD, or TO:

 >> [result furthest]: uparse "aabbcc" [to "bb"]
 ; how can FURTHEST be "bbcc" and not "cc"

TO iterates, checking to see if "aabbcc" matches, then if "abbcc" matches, then finds a match at "bbcc".

But the reason it found a match at "bbcc" is that it ran "bb" on that and got "cc". The "bb" combinator has the wrapper, and so FURTHEST got pushed.

TO and AHEAD actually used to work on accident, before changing multi-returns, because they didn't ask for the remainder from the parser parameter that they called.

Previously the nature of multi-return was that you'd actually get the variable to write to as the multi-return parameter. This made them compatible with historical refinements like /NEXT of DO that passed variables, and was just an alternative syntax for that. It's far superior to do proxying to those variables at the end of the function...and allow the multi-return to be read and written as a normal variable.

But when you could see the variable, there was the ability to choose a different behavior when it was a "BLACKHOLE" indicator (#) that you weren't planning to use the result.

So the TO combinator would pass in a blackhole saying it wasn't planning on using the remainder of the parser it was calling on each iteration. This meant the wrapper actually didn't find out how far the combinator TO called got, because there was no variable to write to. If it tried to read the index of it would fail, so the protection of not updating furthest "just did the right thing".

But that's not a general solution...

There was a parallel problem with rollback, where not all succeeding parsers were supposed to contribute to an in-progress COLLECT. This led to the PENDING mechanism, where some parsers had to explicitly route the output and say "yes, I want this to be collected".

(Hence combinators don't "roll back" collections, they just don't commit to routing what they aggregate from component parsers until they are finished.)

Which leads to the discovery of a quirk... TO does not manage its pending explicitly...it uses the default "auto-routing". Hence the success of the last parser it calls--whose advancement it doesn't want--counts in a collect:

>> uparse "aabbcc" [collect to [some keep "b"], elide [2 "b" 2 "c"]]
== ["b" "b"]

Is that right? (cc: @rgchris)

If so, it means "I want to consider the called parser's pending results, yet I don't want to incorporate its advancement in FURTHEST."

That would rule out something I was pondering, which was if there was some way of tying those two things together (only advance furthest if the pendings are used).

Pathology Studies: How About MINMATCH?

I made MAXMATCH as a case study of different approaches to influence on COLLECT:

maxmatch.parse.test.reb

Similarly we could ask about MINMATCH, and what its participation with FURTHEST should be.

MINMATCH wouldn't have the crutch of not having asked for the remainder from its component parsers. So it wouldn't have worked accidentally.

It could call two parsers...have both succeed...and then only advance the smaller amount of the two. We might say this "foils" a wrapper-based approach to updating furthest, as it would be advanced by the larger amount.

At The Moment, TO and AHEAD Skew FURTHEST a Bit Far

I can't stop the presses and not incorporate the multi-return change. So right now, you get this:

>> [result furthest]: uparse "aabbcc" [to "bb"]
; null

>> furthest
== "cc"

I'm hesitant to burden combinator authors with another parameterization just for this feature.

I am a little bit wondering if it's worth it to convey to a function that a multi-return result is "requested, but will not be written back to a variable". There are some optimizations functions might make that are pertinent to that. So we might be able to go back to how it used to be where if the remainder is requested as a "blackhole" then the furthest is not updated. It could fix TO and AHEAD.

But MINMATCH would still be the kind of thing that would raise questions.

2 Likes