Semantics of UPARSE's FURTHEST

hostilefork · July 9, 2022, 12:21pm

Upon the announcement of UPARSE, @Brett listed as his secondmost missing feature the idea of knowing "how far a parse got":

An ability to return the furthest input point matched and the rule that caused rollback from there on parse failure. During development of rules this generally indicates the rule that is not properly specified.

What I've done is make it so combinators use a common generator COMBINATOR. This generator takes the function body you give it, and stuffs in some boilerplate parameters (like the INPUT and STATE). But it also wraps your code with some more boilerplate that can run before and after the parser.

The current idea of the "parser state" is just to pass around the FRAME! of the UPARSE operation itself. So if you have any global state you want visible to the parse you can put it there. Hence the state gives every combinator access to the arguments, return values, and locals of the invocation.

I made FURTHEST a multi-return value. The hooked combinators are run, and then if they succeed they're checked to see if they got further than any previous combinator. If so they update furthest.

>> [_ furthest]: uparse "aaabbb" [some "a" some "c"]
; null

>> furthest
== "bbb"

I Notice TO and AHEAD Skew FURTHEST a Bit Far...

Consider the case of the TO combinator. It's supposed to move the parse position to right before an instance of the matching rule.

But the subtlety of backing up that position is lost on FURTHEST...which just noticed that a successful parser run occurred, and updates the high water mark:

>> [result furthest]: uparse "aabbcc" [to "bb"]
; null

>> furthest
== "cc"  ; not "bbcc"

It's a problem that's kind of a parallel with rollback. Which leads to the discovery of a quirk! TO does not manage its "pending" list explicitly...it uses the default "auto-routing". Hence the success of the last parser it calls--whose advancement it doesn't want--counts in a collect:

>> uparse "aabbcc" [collect to [some keep "b"], elide [2 "b" 2 "c"]]
== ["b" "b"]

Is that right? (cc: @rgchris) If it's not right, then it would seem that any KEEPs inside a TO rule should never have an effect. That seems strictly less powerful than being able to grab things when you find an ahead-match, so I think it's okay.

But with FURTHEST it's less clear.

Pathology Studies: How About MINMATCH?

I made MAXMATCH as a case study of different approaches to influence on COLLECT:

maxmatch.parse.test.reb

Similarly we could ask about MINMATCH, and what its participation with FURTHEST should be.

It could call two parsers...have both succeed...and then only advance the smaller amount of the two. We might say this "foils" a wrapper-based approach to updating furthest, as it would be advanced by the larger amount.

I'm hesitant to burden combinator authors with another parameterization just to express distinctions for the purposes of FURTHEST. But it's a good thought experiment for what the limits are.

I'm not really sure what the behavior here would be.

Maybe you can look at how the furthest detection works and explain in the context of the code what you would want.

hostilefork · November 2, 2023, 4:54am

hostilefork:

I made FURTHEST a multi-return value. The hooked combinators are run, and then if they succeed they're checked to see if they got further than any previous combinator. If so they update furthest.
>> [_ furthest]: uparse "aaabbb" [some "a" some "c"]
; null

 >> furthest
 == "bbb"

Giving back FURTHEST as a multi-return had to be axed, for several reasons.

For one thing: PARSE now raises a definitional error when it does not reach the end of input. Returning an error isotope is fundamentally incompatible with unpacking return values from a returned block isotope, you can't do both!

And secondly: PARSE is designed to allow you to use the full bandwidth of return values, including multi-returns. This means combinators themselves might be multi-returns (I've suggested TALLY could be a multi-returner, giving the count as a primary product but also including the synthesized product):

 >> [count result]: parse "aaa" [tally ["a" ('a)] | tally ["b" ('b)]]]
 == 3

 >> count
 == 3

 >> result
 == a

Or you can just explicitly evaluate to a muti-return PACK from a group evaluation, for whatever reason...

parse "..." [... accept (pack [<a> 2]) ...]

Stop-Gap Measure: a "Crappy" FURTHEST Hook

An early adaptation of @BlackATTR's Query had used the FURTHEST return value from UPARSE. But that needed to backtrack into painful inclusion of furthest: <here> markers in every rule. That's no good.

But UPARSE's generic hookability allows for a rule-stepwise debugger

So of course, simply asking how far it got is not hard to ask for. Just use a hook which writes to a variable you specify. I hacked this up separately as PARSE-FURTHEST:

>> parse-furthest "aaabbb" [some "a" some "c"] 'far except [
       print ["Furthest was:" mold far]
   ]

Furthest was: "bbb"

Well, it's better than nothing!

But What Do We Really Want?

One way of looking at this could be that FURTHEST is a field in the error that gets returned.

>> parse "aaabbb" [some "a" some "c"] except e -> [
       assert [e.id = 'incomplete-parse]  ; only raised error?
       print ["Furthest was:" mold e.furthest]
   ]

That's not bad, other than it seems there might be a lot of things you want to know about a failed parse besides just FURTHEST. You'd want to know the actual parse position, but other aspects about the rule state as well. Is it too much stuff to pack into an error object?

You might have been keeping running track of the line and column (which I've wanted to find a way to expose via <line> and <column> combinators, where applicable). That has to be updated as the parse proceeds, you wouldn't want to have to recalculate that in case of an error...?

So it seems to me that there may be need for a parse state object which can be made available to reflect after a parse that's successful -or- unsuccessful. The PARSE function would be a convenient wrapper on top of this, but you could dig deeper for more serious needs.

For right now, the PARSE-FURTHEST hack keeps the mechanism behind the functionality tested, but there's clearly a lot to think about here (and I believe looking at Haskell for inspiration on these kinds of questions is wise).