Getting Hooks Into "Events" during PARSE

hostilefork · August 11, 2021, 9:55pm

All right... hold onto your little blue coffee cups... because I've got something rather awesome working!!!

TL;DR - Show Me the Code

Here is MAXMATCH-D (default "rollback") and MAXMATCH-C (custom "rollback"), implemented for your parsing pleasure!

(Note that this also shows how you can define a COMBINATOR in your own files and call it directly. This lets it step in to fill the feature hole that PARSING-AT had to patch around...

...Also see the nice demonstrations of how BLOCK! grouping can be reorganized freely without bugs or worries. You can use them or don't, and the engine sorts it out!)

How It Works

Quick Refresher: The role of the UPARSE engine is to wire up the parser combinators (you might think of them as the "keywords"), each of which can take some number of rules as parameters. Combinators are used to create parsers, which just process the input and return a synthesized result and a remainder.

(For instance: BETWEEN is a combinator which takes in the input and two parsers. But once an instance of BETWEEN is "combinated" it becomes specialized with the right parser instances and becomes a parser itself. When called from the outside it appears as a parser with only the input parameter...as its combinator parameters have been fixed as the appropriate parser functions.)
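To make the combinator-versus-parser distinction concrete, here is a rough Python model (not Ren-C, and not how UPARSE is implemented). The names `literal`, `between`, and `parenthesized` are made up for illustration, and `functools.partial` stands in for specialization:

```python
from functools import partial

# In this model a "parser" is any function: input -> (result, remainder),
# or None on failure.

def literal(text, inp):
    """Hypothetical combinator: match `text` at the start of `inp`."""
    if inp.startswith(text):
        return text, inp[len(text):]
    return None

def between(left, right, inp):
    """Hypothetical BETWEEN-like combinator: two parsers plus the input.

    Returns whatever lies between a `left` match and the next `right` match.
    """
    start = left(inp)
    if start is None:
        return None
    _, rest = start
    for i in range(len(rest) + 1):
        end = right(rest[i:])
        if end is not None:
            return rest[:i], end[1]
    return None

# "Combinating": fix the parser parameters so only the input remains.
open_paren = partial(literal, "(")
close_paren = partial(literal, ")")
parenthesized = partial(between, open_paren, close_paren)

print(parenthesized("(abc)def"))  # ('abc', 'def')
```

From the outside, `parenthesized` takes only the input, which is the sense in which a combinated BETWEEN "becomes a parser itself".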

I spoke about adding a third return result representing whatever "pending material" a combinator has accrued. For the moment let's just say this is a BLOCK! of values of various types. Combinators like COLLECT or GATHER will filter through these blocks and pull out the parts they think are relevant to them. But most combinators just want to aggregate results together and pass them through.

I've called this third result "pending". When you write a COMBINATOR spec block, the engine notices whether or not you explicitly mention the pending: return result. If you do not mention it, it is assumed you want the "default aggregating behavior".

So what is the default aggregating behavior? Well, when the UPARSE engine fills in the parser parameters to your combinator, they all start out with their own pending return result. But what the common combinator prelude can do is specialize out that parameter and wire it up as a sequential aggregator.

Hence if you don't indicate you have a pending: return result from your combinator, the parsers your combinator is specialized with will appear not to have pending return results either! They'll only return the 2 classical results: the synthesized value and the remainder of input to process. All the wiring happens behind the scenes.

But if you do say you have a pending: result, the parsers your combinator receives will all demand to return that third pending parameter. And you need to make the decisions of what to do with each pending result to produce your combinator's own result.

Everybody Got That?

Executive summary that I gave @BlackATTR:

By default, all parser invocations you call that succeed add to the result, in the order they are called. Failed parsers don't add to the result. This is good enough for a lot of things, like SOME for instance.
- If SOME calls the one parser it takes as a parameter and that parser succeeds 5 times, then fails on the 6th call, SOME still succeeds (since it matched at least one time). It's happy enough just giving back the aggregate of those 5 successful calls in order. So it does not mention pending: in its spec, and just gets the automatic behavior.
Such a default is not good enough for the BLOCK! combinator. If it calls uparse "ab" [keep "a" keep "a" | keep "a" keep "b"] then it doesn't want ["a" "a" "b"]. Mere success of the parsers it calls is not enough: it has a higher-level idea of "alternates", and a whole alternate group must succeed to keep its result.
- So it mentions pending: in its spec, which means the parsers it gets don't have their pending results specialized away. It builds the appropriate result.
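The two behaviors in that summary can be modeled in Python. This is only a sketch of the idea, not the UPARSE implementation: `combinate`, `keep`, `some`, `naive_block`, and `alternates` are hypothetical names, and the `handles_pending` flag stands in for mentioning pending: in a combinator spec.

```python
def keep(text):
    """Hypothetical KEEP-like parser: match `text` and add it to pending."""
    def parser(inp):
        if inp.startswith(text):
            return text, inp[len(text):], [text]
        return None
    return parser

def combinate(combinator, *parsers, handles_pending=False):
    """Hypothetical engine prelude: build a parser from a combinator."""
    if handles_pending:
        # The combinator sees three-result parsers and builds pending itself.
        return lambda inp: combinator(inp, *parsers)

    def parser(inp):
        aggregate = []

        def wrap(p):
            def two_result(i):
                r = p(i)
                if r is None:
                    return None            # failures contribute nothing
                value, rest, pending = r
                aggregate.extend(pending)  # successes accrue in call order
                return value, rest
            return two_result

        r = combinator(inp, *[wrap(p) for p in parsers])
        if r is None:
            return None
        value, rest = r
        return value, rest, aggregate
    return parser

def some(inp, parser):
    """SOME-like: one or more matches. Never mentions pending."""
    r = parser(inp)
    if r is None:
        return None
    while r is not None:
        value, inp = r
        r = parser(inp)
    return value, inp

def naive_block(inp, a1, a2, b1, b2):
    """[a1 a2 | b1 b2] written without touching pending, for contrast."""
    for first, second in ((a1, a2), (b1, b2)):
        r = first(inp)
        if r is not None:
            r = second(r[1])
            if r is not None:
                return r
    return None

def alternates(inp, *sequences):
    """BLOCK!-like: keep pending only from an alternate that fully succeeds."""
    for seq in sequences:
        rest, pending, value = inp, [], None
        for p in seq:
            r = p(rest)
            if r is None:
                break
            value, rest, sub = r
            pending.extend(sub)
        else:
            return value, rest, pending
    return None

some_a = combinate(some, keep("a"))
print(some_a("aaaaab"))  # ('a', 'b', ['a', 'a', 'a', 'a', 'a'])

naive = combinate(naive_block, keep("a"), keep("a"), keep("a"), keep("b"))
print(naive("ab"))       # ('b', '', ['a', 'a', 'b'])

block = combinate(alternates, [keep("a"), keep("a")], [keep("a"), keep("b")],
                  handles_pending=True)
print(block("ab"))       # ('b', '', ['a', 'b'])
```

The `naive` result shows the stray "a" from the abandoned first alternate leaking through the automatic aggregation, which is why the BLOCK!-like combinator has to take charge of pending itself.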

Now To Build More Features On It!

The thing is that this block of "pending" results is kind of a monolith: it accrues everything, which could be a mixed bag of stuff.

What I was thinking is that any QUOTED! items are assumed to be for COLLECT. We can just say COLLECT is probably one of the more common things. Then, maybe any GROUP!s are deferred code. (e.g. code that will only run if the entire parse succeeds...or some marked phase, maybe)

Fanciful example, where the GROUP! combinator will delay any code when prefaced with the <delay> tag. Let's assume if the group was just (<delay>) it's an error to guide you to say ('<delay>) if you actually want to evaluate to the TAG! delay...

>> uparse "aaabbb" [
    some "a" (print "hey A!") (<delay> print "delay A!")
    some "b" (print "hey B!") (<delay> print "delay B!")
]
hey A!
hey B!
delay A!
delay B!

>> uparse "aaaccc" [
    some "a" (print "hey A!") (<delay> print "delay A!")
    some "b" (print "hey B!") (<delay> print "delay B!")
]
hey A!
; null

I mused that a phase would just be a bracket that says "you don't have to wait until the absolute end, go ahead and run the delays you've accumulated now":

>> uparse "aaabbbccc" [
    phase [
        some "a" (print "hey A!") (<delay> print "delay A!")
        some "b" (print "hey B!") (<delay> print "delay B!")
    ]
    some "c" (print "hey C!")
]
hey A!
hey B!
delay A!
delay B!
hey C!
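
The three transcripts above can be reproduced with a small Python model of the fanciful design. It is a sketch of the intended semantics only, not Ren-C: `run_steps` and `parse` are hypothetical names, a tuple `(text, now, later)` stands for `some text (print now) (<delay> print later)`, and a nested list stands for a `phase [...]` bracket.

```python
def some(text, inp):
    """Match one or more copies of `text`; return the remainder or None."""
    if not inp.startswith(text):
        return None
    while inp.startswith(text):
        inp = inp[len(text):]
    return inp

def run_steps(steps, inp, out):
    """Run steps in order.

    Returns (remainder, still-pending delayed messages), or None on failure.
    Immediate messages are appended to `out` as soon as their rule matches.
    """
    pending = []
    for step in steps:
        if isinstance(step, list):           # a phase bracket
            r = run_steps(step, inp, out)
            if r is None:
                return None
            inp, delayed = r
            out.extend(delayed)              # phase succeeded: run delays now
        else:
            text, now, later = step
            inp = some(text, inp)
            if inp is None:
                return None                  # accumulated delays are dropped
            out.append(now)
            if later is not None:
                pending.append(later)
    return inp, pending

def parse(steps, inp):
    """Return (matched?, messages in the order they would print)."""
    out = []
    r = run_steps(steps, inp, out)
    if r is not None and r[0] == "":
        out.extend(r[1])                     # whole parse succeeded
        return True, out
    return False, out

rules = [("a", "hey A!", "delay A!"), ("b", "hey B!", "delay B!")]
print(parse(rules, "aaabbb"))
# (True, ['hey A!', 'hey B!', 'delay A!', 'delay B!'])
print(parse(rules, "aaaccc"))
# (False, ['hey A!'])
print(parse([rules, ("c", "hey C!", None)], "aaabbbccc"))
# (True, ['hey A!', 'hey B!', 'delay A!', 'delay B!', 'hey C!'])
```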

Anyway so this mishmash of a PENDING block could contain mixes. If it sees [(print "delay A") '(print "delay A")] it knows that the unquoted thing is something the delay mechanism pays attention to, while the QUOTED! thing is something for COLLECT.

Then maybe BLOCK! is used for emit, e.g. the WORD! and value for the object. This could be stuck in the stream with everything else, so that emit x: integer! gives something like [(print "delay A") '(print "delay A") [x: 1020]] ... basically this big glommed-together bunch of results that are being collected, filtered, and discarded.
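
As a rough illustration of the "filter by type" idea, here is a Python model where each consumer pulls out the items it cares about and passes the rest along. It is not Ren-C: the classes `Quoted`, `Delayed`, and `Emit` are hypothetical stand-ins for QUOTED!, GROUP!, and BLOCK! items, and the function names are made up for the example.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Quoted:            # stand-in for a QUOTED! item (meant for COLLECT)
    value: Any

@dataclass
class Delayed:           # stand-in for a GROUP! of deferred code
    action: Callable[[], None]

@dataclass
class Emit:              # stand-in for a BLOCK! like [x: 1020] (meant for GATHER)
    name: str
    value: Any

def split_pending(pending, kind):
    """Return (items of `kind`, everything else), preserving order."""
    mine = [p for p in pending if isinstance(p, kind)]
    rest = [p for p in pending if not isinstance(p, kind)]
    return mine, rest

def collect(pending):
    """Take the quoted items, unquote them, and pass the rest through."""
    quoted, rest = split_pending(pending, Quoted)
    return [q.value for q in quoted], rest

def gather(pending):
    """Take the emitted name/value pairs and pass the rest through."""
    emits, rest = split_pending(pending, Emit)
    return {e.name: e.value for e in emits}, rest

def run_delayed(pending):
    """Run the deferred actions in order and pass the rest through."""
    delayed, rest = split_pending(pending, Delayed)
    for d in delayed:
        d.action()
    return rest

log = []
pending = [Delayed(lambda: log.append("delay A!")), Quoted("a"),
           Emit("x", 1020), Quoted("b")]

collected, pending = collect(pending)
obj, pending = gather(pending)
pending = run_delayed(pending)

print(collected)  # ['a', 'b']
print(obj)        # {'x': 1020}
print(log)        # ['delay A!']
print(pending)    # []
```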

The arrays that are produced give up their ownership, so when your combinator gets a pending array back you can party on it all you want. So COLLECT might look at a pending array and realize it's all QUOTED! items, and just go through and unquote them and return that as the collected array--without needing another allocation.

(I think these will be common cases. Anyway the GLOM mechanic is what I introduced to try and make it cheap so that you can also just return BLANK! So you're not paying for parses that don't use any of this to make empty arrays at every level of every combinator... otherwise a rule like [100 "a"] would be generating 100 empty arrays for no reason.)
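
The cost argument can be sketched in Python, with `None` playing the role of BLANK!. This is a model of the behavior as described in this post, not the actual GLOM implementation, and the exact rules it follows are an assumption:

```python
def glom(accumulator, result):
    """Merge `result` into `accumulator`, allocating only when necessary.

    None stands in for BLANK!: "nothing pending", with no array behind it.
    """
    if result is None:
        return accumulator                 # nothing to add, nothing allocated
    if accumulator is None:
        # First real contribution: take over the incoming list as-is.
        return result if isinstance(result, list) else [result]
    if isinstance(result, list):
        accumulator.extend(result)
    else:
        accumulator.append(result)
    return accumulator

# A rule like [100 "a"] with no KEEPs never creates a list at all.
pending = None
for _ in range(100):
    pending = glom(pending, None)
print(pending)  # None

# The first non-empty result is reused rather than copied.
first = ["a"]
pending = glom(pending, first)
print(pending is first)  # True

pending = glom(pending, ["b", "c"])
print(pending)  # ['a', 'b', 'c']
```

Reusing the incoming list is only safe because, as noted above, the arrays that are produced give up their ownership.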

I Think UPARSE Is On Track to be the !

The amazing thing is just how well slinging usermode code of frames and functions and specializations around is working. "It's programming Jim...but not as we know it..." -- it's the language malleability that Rebol has tried to promise but was just not feasible in usermode until Ren-C.

But of course, there's a lot of work to do here. It's slow and the errors are bad. But this experimental implementation is passing the barrage of tests--to which I've added everything from the rebol-issues database, and closed many issues and wishes that are all now solved and answered!