Getting Hooks Into "Events" during PARSE

Brett · July 27, 2021, 7:42am

When thinking about an automated collection mechanism for parsing in Rebol 2 in the dim dark past, I thought I needed every matching/unmatching rule to be tracked for stack push and pop. I decided I couldn't achieve that without Rebol providing some sort of hook facility for it. I asked and was ignored.

Nevertheless pressing on I realised a few things:

I could have a practically useful result by tracking success/failure for particular rules.
Embedding some sort of bound output code in a rule, breaks the reusablity of the rule. I didn't want to have to rewrite the same rules for the same data format, just because I wanted to extract different information for a different application.
Theoretically, I also wanted to hook and leverage some other user's rules or Rebol2's built-in rules too.
Some rules are used to seek for a particular pattern, or test multiple patterns at the same position, some of these are sometimes meant to fail - in order to progress the overall match. We probably don't want to collect this class of rules outside of a debugging situation.
To solve the above I ended up with requiring rules to be tracked to be defined as words, allowing me to automatedly hook, track and emit them into a structure (load-parse-tree.r). Not rocket science, nor perfect, but it was useful.
Something of an aside, reflecting back on the usefulness of the output, it would have been nice to have had a tree type in rebol, a node that was combined object (properties - its head?) and series (contents or body) as one type that could perform the role of one or other or both. So many data structures look like this.
While definitely nice, I idly wonder now whether a collect mechanism is complete enough to achieve the things people want to use parse for, that perhaps parse should be able to emit actions to be evaluated by some sort of user specified machine.

Anyway, I guess I'm alluding to the point that separating some sort of user specified specifications or actions outside of the rules themselves might be useful.

hostilefork · July 28, 2021, 9:40am

When looking at the collect design I wanted to put that in terms of something general enough to support other things, like the GATHER/EMIT. So I was thinking that what would actually happen would be that the collected items would be tagged somehow with what flavor of thing was being aggregated, and the controlling construct would sift through for the kinds of things it wanted... GATHER would look for items gather-taged, COLLECT would look for items collect-tagged.

But to your point on getting "events", the idea of having a generic UNDO-ON-FAIL seems interesting:

>> uparse "aaa" [undo-on-fail ([print "Undoing!"]) some ["a" (print "a!")] "b"]
a!
a!
a!
Undoing

There's no top-level construct doing the orchestration there, like the COLLECT or GATHER. So it's like UNDO-ON-FAIL wants to subscribe to the block combinator's rollback of alternates.

Definitely!

But when one speaks about reusable and generic code, it might be hard for UPARSE itself to serve two masters. Even though it might seem like grammars are "lower level" than something as flexible-seeming as UPARSE, it might be that they're actually a distinct domain. So building something like GRAMMAR on top of UPARSE could ultimately actually be the "right" answer.

This in particular I'd like us to look at, and see what we can do about the pain points...asking "what features in UPARSE would make this easier to write." (I've already mentioned that "parsing-at" is subsumed by just writing your own COMBINATOR.)

Hence getting %load-parse-tree.r ported to Ren-C and having tests seems like a good case study! I'll take a crack at it when I get a moment (going to be driving today...) Do you have more tests than the one in the comment hanging around?

hostilefork · July 29, 2021, 6:41am

A post was merged into an existing topic: Generalizing Unwinding in UPARSE

Brett · July 29, 2021, 2:44am

Um, not really. I used it a lot in an ad-hoc fashion, extracting structured data from webpages and things where a pattern was evident. I'll include a different example further down.

At the risk of talking about too much old stuff, I'm going to add some more thoughts on complex parse work. I realise, that these may not be your priorities, but I'm adding them here because this is a reasonably relevant time to contribute to thinking about "things people might want to do with parsing" and what facilities could support those goals.

A complex use of load-parse-tree was part of a process to automatedly extract and translate Augmented Backus–Naur form (ABNF) from RFC documents to Rebol 2 parse rules (abnf-parser.r). The block returned by load-parse-tree was rewritten to become parse rules.
It occurred to me at some point that allocating new memory for constant patterns (literals) was a waste of memory, particularly for large input sources. I took a stab at distinguishing non-terminals, terminals and literals in an early ren-c version of get-parse-tree.r. It's the same essential logic as load-parse-tree but returns a more complex result. It has some rudimentary tests.
Once input was broken into tokens, it was hard to easily represent these tokens in a linear parse-able stream but type them. It seems like a facility to be able break up the inputs and annotate the parts in a more efficient way might be interesting.

Note: See also post on Visual Parse Debugging

I think these things encompass most of the things I've found interesting with parse over the years.

For the sake of completeness and to give another "test" for load-parse-tree.r I'll add the commands and sources below.

Here is the setup. (Note that json-structure is defined in https://github.com/codebybrett/rebol2/blob/master/working/misc/json-structure.r)

t: read %example.json
j: context json-structure

Running load-parse-tree:

>> print mold load-parse-tree/ignore j/grammar [parse/all t j/grammar/json] [wsp string value]
[
    json [
        object [
            member [
                json-string "menu" 
                object [
                    member [
                        json-string "id" 
                        json-string "file"
                    ] 
                    member [
                        json-string "value" 
                        json-string "File"
                    ] 
                    member [
                        json-string "popup" 
                        object [
                            member [
                                json-string "menuitem" 
                                array [
                                    object [
                                        member [
                                            json-string "value" 
                                            json-string "New"
                                        ] 
                                        member [
                                            json-string "onclick" 
                                            json-string "CreateNewDoc()"
                                        ]
                                    ] 
                                    object [
                                        member [
                                            json-string "value" 
                                            json-string "Open"
                                        ] 
                                        member [
                                            json-string "onclick" 
                                            json-string "OpenDoc()"
                                        ]
                                    ] 
                                    object [
                                        member [
                                            json-string "value" 
                                            json-string "Close"
                                        ] 
                                        member [
                                            json-string "onclick" 
                                            json-string "CloseDoc()"
                                        ]
                                    ]
                                ]
                            ]
                        ]
                    ]
                ]
            ]
        ]
    ]
]

Example.json:

{"menu": {
    "id": "file",
    "value": "File",
    "popup": {
      "menuitem": [
        {"value": "New", "onclick": "CreateNewDoc()"},
        {"value": "Open", "onclick": "OpenDoc()"},
        {"value": "Close", "onclick": "CloseDoc()"}
      ]
    }
  }}

hostilefork · July 31, 2021, 6:01pm

I did a little twist to UPARSE so that when a WORD! is not recognized from the keywords list, it dispatches to a "WORD! combinator". The spec is pretty simple, it looks like:

word!-combinator: combinator [
    return: "Result of running combinator from fetching the WORD!"
        [<opt> <invisible> any-value!]
    value [word!]
    <local> result
]

So it has the usual (implicit) parameters of the INPUT, and the job of synthesizing a return result and the REMAINDER of how much input is left to process. But since it's a datatype combinator it receives the WORD! to process as the value parameter

So what it does is:

Errors if the word looks up to NULL, because I'm pretty sure we've decided that...as being what's valuable about the ~unset~/NULL/BLANK! distinction.
Acts as a no-op and synthesizes a ~void~ isotope if it's a BLANK!, which makes it act just like an empty block rule would. Again, because I think we've decided that is the best behavior...this ties into the value of the "3 intents"
For combinators that don't need parameters it will delegate to them. I'm not sure how many types to allow here, but have decided INTEGER! variables in PARSE are too obfuscating...and really, there wouldn't be any parser here to pass to the integer combinator to call N times (the rules aren't a parameter, and not available here).

So with a combinator to handle these non-keyword-WORD!s in hand...I got the idea that wrapping the default WORD! combinator with ENCLOSE could be an elegant replacement for %load-parse-tree.r.

The TRACKED-WORD! Combinator

Don't belabor the details of the following code too much.

It just adds a little prelude that marks the position of the stack it's gathering at the beginning, and pushes a line describing the word that is fetching the rule. The epilogue notices success or failure...and on failure pops everything added since the mark (including the line the prelude pushed). On success it pushes a line on the stack that says what part of the input was spanned.

tracked-word!: enclose :default-combinators.(word!) func [
    f [frame!]
    <static> indent (0)
][
    let input: f.input  ; save to use after DO F
    let remainder: f.remainder  ; (same)

    let mark: tail stack  ; save where the stack was
    append stack spaced [collect [repeat indent [keep tab]], f.value, "["]

    indent: me + 1
    let result': ^(do f)
    indent: me - 1

    if result' <> null [  ; rule fetched by word succeeded
        let consumed: mold copy/part input get remainder
        append stack spaced [
            collect [repeat indent [keep tab]], "] =>" consumed
        ]
    ] else [clear mark]  ; roll back stack

    return unmeta result'
]

(I'm sorry if the ^ and unmeta are confusing but they are the easiest way to handle the piping of things like unsets and void around...)

Now Let's Make a PARSE Variant That Uses It

We didn't HIJACK the actual WORD! combinator, because then all UPARSEs would do this. We created a new function that can act as a combinator (since it has the same interface)

Now let's make a PARSETREE function that uses a combinator set that includes it:

tracked-combinators: copy default-combinators
tracked-combinators.(word!): :tracked-word!

trackparse*: specialize :uparse [combinators: tracked-combinators]

trackparse: enclose :trackparse* func [f [frame!]] [
    stack: copy []
    do f also [
        for-each line stack [print line]
        clear stack
        return ~none~  ; suppress visibility of successful result
    ]
]

Let's Try It Out!

>> foo-rule: [some "f"]
>> bar-rule: [some "b" foo-rule]

>> trackparse "fff" [foo-rule]
foo-rule [
] => "fff"

>> trackparse "bbbfff" [bar-rule]
bar-rule [
    foo-rule [
    ] => "fff"
] => "bbbfff"

>> trackparse "bbbfffbbbfff" [2 bar-rule]
bar-rule [
    foo-rule [
    ] => "fff"
] => "bbbfff"
bar-rule [
    foo-rule [
    ] => "fff"
] => "bbbfff"

We can keep going...

>> meta-rule: [some bar-rule]

>> trackparse "bbbfffbbbfff" [meta-rule]
meta-rule [
     bar-rule [
         foo-rule [
         ] => "fff"
     ] => "bbbfff"
     bar-rule [
         foo-rule [
         ] => "fff"
     ] => "bbbfff"
] => "bbbfffbbbfff"

Neat! But is it ACTUALLY Working?

Well, er, no. And it goes back to the same issue that COLLECT had.

The problem is that what we care about for keeping or rejecting is contextually based on who asked, not based on the mere success or failure of the rules themselves.

So we're back to that same point where it's like we are just running code unconditionally in a GROUP!, with no concern about whether the overall rule succeeds or not.

>> trackparse "fffyyy" [foo-rule some "x" | foo-rule some "y"]
foo-rule [
] => "fff"
foo-rule [
] => "fff"

So @Brett, do you agree that the desired output there should have just been one FOO-RULE?

Either way, I hope you think this is kind of impressive. You can try the above in the web REPL...

Brett · August 1, 2021, 1:33am

Yes.

With load-parse-tree.r, I struck this. I found that in practice there's often a way a way forward to be found within the grammar of what you're parsing, by finessing the definitions. For example, maybe foo-rule here is like some sort of "guard rule" that perhaps we don't actually want to collect. Nevertheless, that's not a general solution which was annoying.

As a Rebol 2 user, I had no access to the internals of parse, so hooking the rules by name was a neat interface to the tool, but really the only way forward that was available to me at the time if I wanted to use Rebol's parse. The downside being that it was an incomplete representation of the parse giving rise to the problem.

I'd hoped though having access to the operations of parse, a general solution becomes available. I mean for example that to the parser, a block represents one or more alternative sequences and that only one branch of these alternatives succeeds - therefore the result of the "alternatives" operation is simply the result of the alternative sequence that succeeded. On this basis [foo-rule some "x" | foo-rule some "y"] must only return one sequence result.

In the same way, a user might desire that for some applications, only a successful parse which is only known once parse has completed, should return a tree. I'll call this the "functional view".

As an aside. From my fallible memory, I think Carl wasn't interested in a mathematically correct parser (the functional view), I believe he was more interested in practically processing dialect instructions. Could call this the "side effect view".

If the two views represent two different types of Parse, then "events" might be emitted at different times by the two parsers.

I do!

hostilefork · August 1, 2021, 6:40am

Yay!

I really feel that the fabric of the FRAME!...and all these tools surrounding it (specialization, enclosing, adapting) is serving the system as well as I predicted or better.

Given how many people seem to enjoy the dynamism of Redbols even in the primitive pre-FRAME! form, I think once laypersons find they can do language design black magic with it...they'll be hooked.

Red has continued in this tradition--e.g. by not being interested in rolling back the KEEPs in COLLECT.

But from this side of the fence (@rgchris, you, me, @BlackATTR, presumably @IngoHohmann, anyone else who would actually use the feature...) that makes COLLECT seem nearly useless.

This is certainly an odd line to be walking in the parsing world. We know from experience that being able to just ram some code into a PARSE rule is extremely helpful. You have to twist things into a pretzel in a Haskell Parsec to get any effectful code wedged in--which is an important discipline for maintaining complex systems. But it's a pain for a throwaway script where you just want to get something done.

Today what's easy to find out in the "side-effect-view" (as in my hooked WORD! combinator above) is "did this rule run in a way such that any GROUP!s in it would have been executed".

On the surface it appears what the "functional view" wants to know is "did this rule ultimately contribute to success". Though I mentioned this can't be automatically intuited. (With the BLOCK! combinator it's a higher-level concept than whether each individual parser called succeeds, it has to be a member of a complete alternate group that succeeded... and my hypothetical MAXMATCH gave another example of "multiple successful parsers, but pick just one" to show it's not a problem unique to BLOCK!.)

I mentioned that lining up that "functional view" for COLLECT would require a third result piped by the combinators...but maybe the collected things should be pending imperative code fragments. You either want the fragments to run or not based on the semantics of your operation.

This might lead us to ask... is running a GROUP! inline in a rule that doesn't ultimately succeed really that useful? Should they be queued instead? Well...running synchronously powers certain things like the whitespace interpreter I wrote...which relies on other "side-effect view" aspects like marking positions and seeking.

But running immediately may be the less useful thing, and queuing should be the default for GROUP!? Those who disagree could use a different combinator set...

How About a Generic Queued Imperative Code Mechanism

I think it's worth a shot, to just see what kind of results can be accomplished that serve COLLECT, GATHER, load-parse-tree, etc. If one (...relatively...) simple mechanism can cover those three, it's already better than anything that has existed prior.

We're not going to get attribute grammars or other pure FP type things out of this approach. But that's not what we're aiming at. Also, that already exists.

When @(...) groups come back on the scene, they already have an intended meaning... "literal match". So I don't think any symbols will be left for "queue imperative code" (maybe /(...) ?) Would have to be a quoting combinator in order to say something like queue (code)... and queue is a bad keyword since it's such a common variable name.

Whatever you call it, the mechanic behind this QUEUE combinator would be the same thing used by KEEP. There would be no actual appending to the collection done by KEEP, just a queuing of the append to the collection...which would not run unless the rule ultimately succeeded. It would be in line in the order as orchestrated by the piping of the third return result...and be in effect if it ultimately made it piped out of the combining parser.

I'll make a stab at it.

hostilefork · August 11, 2021, 9:55pm

All right... hold onto your little blue coffee cups... because I've got something rather awesome working!!!

TL;DR - Show Me the Code

Here is MAXMATCH-D (default "rollback") and MAXMATCH-C (custom "rollback"), implemented for your parsing pleasure!

(Note that this also shows how you can define a COMBINATOR in your own files and call it directly. This lets it step in to fill the feature hole that PARSING-AT had to patch around...

...Also see the nice demonstrations of how BLOCK! grouping can be reorganized freely without bugs or worries. You can use them or don't, and the engine sorts it out!)

How It Works

Quick Refresher: The role of the UPARSE engine is to wire up the parser combinators (you might think of them as the "keywords") that can take parameters of some number of rules. They are used to create parsers which just process the input and return a synthesized result and a remainder.

(For instance: BETWEEN is a combinator which takes in the input and two parsers. But once an instance of BETWEEN is "combinated" it becomes specialized with the right parser instances and becomes a parser itself. When called from the outside it appears as a parser with only the input parameter...as its combinator parameters have been fixed as the appropriate parser functions.)

I spoke about adding a third return result representing whatever "pending material" a combinator has accrued. For the moment let's just say this is a BLOCK! of values of various types. Combinators like COLLECT or GATHER will filter through these blocks and pull out the parts they think are relevant to them. But most combinators just want to aggregate results together and pass them through.

I've called this third result "pending". And what I have done is I notice when you write a COMBINATOR spec block if you explicitly mention the pending: return result or not. If you do not mention it, it is assumed you want the "default aggregating behavior".

So what is the default aggregating behavior? Well, when the UPARSE engine fills in the parser parameters to your combinator, they all start out with their own pending return result. But what the common combinator prelude can do is specialize out that parameter and wire it up as a sequential aggregator.

Hence if you don't indicate you have a pending: return result from your combinator, the parsers your combinator is specialized with will appear not have pending return results either! They'll only return the 2 classical results: the synthesized value and the remainder of input to process. All the wiring happens behind the scenes.

But if you do say you have a pending: result, the parsers your combinator receives will all demand to return that third pending parameter. And you need to make the decisions of what to do with each pending result to produce your combinator's own result.

Everybody Got That?

Executive summary that I gave @BlackATTR:

By default all parser invocations you call that succeed add to the result, in the order they are called. Failed parsers don't add to the result. This is good enough for a lot of things, like SOME for instance.
- If SOME calls its one parser it takes as a parameter 5 times and it succeeds, and then one time and it fails, it will succeed (since it matched at least one time). But it's happy enough just giving back the aggregate of those 5 successful calls in order. So it does not mention pending: in its spec, and just gets the automatic behavior.
Such a default is not good enough for the BLOCK! combinator. If it calls uparse "ab" [keep "a" keep "a" | keep "a" keep "b"] then it doesn't want ["a" "a" "b"]. Mere success of the parsers it calls is not enough, it has a higher-level idea of "alternates" and a whole alternate group must succeed to keep its result.
- So it mentions pending: in its spec, which means the parsers it gets don't have their pending results specialized away. It builds the appropriate result.

Now To Build More Features On It!

The thing is that this block of "pending" results is kind of a monolith, it accrues everything which could be a mixed bag of stuff.

What I was thinking is that any QUOTED! items are assumed to be for COLLECT. We can just say COLLECT is probably one of the more common things. Then, maybe any GROUP!s are deferred code. (e.g. code that will only run if the entire parse succeeds...or some marked phase, maybe)

Fanciful example, where the GROUP! combinator will delay any code when prefaced with the <delay> tag. Let's assume if the group was just (<delay>) it's an error to guide you to say ('<delay>) if you actually want to evaluate to the TAG! delay...

>> uparse "aaabbb" [
    some "a" (print "hey A!") (<delay> print "delay A!")
    some "b" (print "hey B!") (<delay> print "delay B!")
]
hey A!
hey B!
delay A!
delay B!

>> uparse "aaaccc" [
    some "a" (print "hey A!") (<delay> print "delay A!")
    some "b" (print "hey B!") (<delay> print "delay B!")
]
hey A!
; null

I mused that a phase would just be a bracket that says "you don't have to wait to the absolute end, go ahead and run the delays you've accumulated now":

>> uparse "aaabbbccc" [
    phase [
        some "a" (print "hey A!") (<delay> print "delay A!")
        some "b" (print "hey B!") (<delay> print "delay B!")
    ]
    some "c" (print "hey C!")
]
hey A!
hey B!
delay A!
delay B!
hey C!

Anyway so this mishmash of a PENDING block could contain mixes. If it sees [(print "delay A") '(print "delay A")] it knows that the unquoted thing is something the delay mechanism pays attention to, while the QUOTED! thing is something for COLLECT.

Then maybe BLOCK! is used for emit, e.g. the WORD! and value for the object. This could be stuck in the stream with everything else, like emit x: integer! => [(print "delay A") '(print "delay A") [x: 1020]] ... basically this big glommed together bunch of results that are being collected filtered and discarded.

The arrays that are produced give up their ownership, so when your combinator gets a pending array back you can party on it all you want. So COLLECT might look at a pending array and realize it's all QUOTED! items, and just go through and unquote them and return that as the collected array--without needing another allocation.

(I think these will be common cases. Anyway the GLOM mechanic is what I introduced to try and make it cheap so that you can also just return BLANK! So you're not paying for parses that don't use any of this to make empty arrays at every level of every combinator... otherwise a rule like [100 "a"] would be generating 100 empty arrays for no reason.)

I Think UPARSE Is On Track to be the !

The amazing thing is just how well slinging usermode code of frames and functions and specializations around is working. "It's programming Jim...but not as we know it..." -- it's the language malleability that Rebol has tried to promise but was just not feasible in usermode until Ren-C.

But of course, there's a lot of work to do here. It's slow and the errors are bad. But this experimental implementation is passing the barrage of tests--to which I've added everything from the rebol-issues database, and closed many issues and wishes that are all now solved and answered!

Brett · August 12, 2021, 2:38am

Lol, took me a while to get the reference.

I've got something rather awesome working!!!

That does appear to be rather awesome! I like the "do more work only if required" approach with the pending results.

The thing is that this block of "pending" results is kind of a monolith, it accrues everything which could be a mixed bag of stuff.

So I guess this is the part where an application will get specific about how the parse will be interpreted like how might the same underlying parse rules be mixed with different application needs to produce different result structures. I'm imagining a user sharing a set of parse rules with others who then achieve different resulting structures or calculations for their own needs. Is that a layer of rules-with-emits that call into rules-to-parse-the-syntax or something different?

I realise case studies would be of value. C-lexicals and the build process might be one. Json type response data might be another. Will keep it in mind.

hostilefork · August 12, 2021, 6:04am

So I've gone ahead with my proposed implementation (the version where GROUP! can be prefixed with (<delay> ...)) and implemented the PHASE construct.

You can see how simple the GROUP! combinator is. It looks for the <delay> tag, and if it finds it, it uses NEXT to skip past the tag, and adds the remaining group data to the pending list. PHASE just filters anything that's a group out of the pending list and runs it. There's a PHASE automatically added to the top-level of every UPARSE operation.

Because this is the understood protocol of meaning of GROUP!s in the pending list, any combinator can stick deferred code into that list--just by adding groups to it.

So...with that tool in hand, I went back to tackle the problem that we saw above with the hooked-word-combinator demo (I'm now calling it "TRACKPARSE" as opposed to parsetree...will save that for something more closely giving your results). As a refresher, the problem was:

>> trackparse "fffyyy" [foo-rule some "x" | foo-rule some "y"]
foo-rule [
] => "fff"
foo-rule [
] => "fff"

We didn't want the FOO-RULE to be contributing to the stack log when it was a member of the alternate that failed.

The way I approach it now is to push groups of deferred code for appending strings to the stack. It actually sticks one string append before all the pendings that are returned by the WORD!'s processed rule, and then one string after all the pendings.

(By no means am I suggesting this is an ideal way to do this, but it is just a corrected version of the off-the-cuff code from before!)

It achieves the desired result!

It's probably easiest to just look at the implementation to see what's happening in the COLLECT, GATHER, and PHASE cases.

COLLECT is siphoning out the QUOTED! values from the pending material. It unquotes them.
GATHER gets the BLOCK!s, all of which are assumed to have the form [var: value]
I've already shown that PHASE (which also brackets the entirety of a UPARSE) gets the GROUP!s

(Efficiency sidenote: I might should have used something like @[...] blocks for GATHER, and use BLOCK!s for KEEP to complement the QUOTED!s. This would save splicing until the very end COLLECT when you have a better idea of how big the total series will be.)

In any case, this strategy will obviously run out of datatypes at some point. So once the common "lightweight" values are spoken for, an ecology based around something like an OBJECT! which uses a key like combinator: as a tag to know whether to pay attention to something is probably the safest bet. Perhaps something lighter weight like EVENT! which could put a label in the cell spot where a block index would usually be would be of use here.

What I'm trying to do here--though--is to make this intrinsically hackable. If you want QUOTED! to mean something else in the pending list, it's not like you can't rewrite COLLECT and KEEP. Everything is supposed to be modular and comprehensible.

I realise case studies would be of value. C-lexicals and the build process might be one. Json type response data might be another. Will keep it in mind.

I definitely want to get scenarios worked through before going down the rabbit hole of optimizing all this with native code. The %examples/ directory in the parse tests can hopefully be home to some good challenges of the model. Throw hardballs at it!

And if there's any chance you can just take a little time to skim through the parse tests and see if everything "jibes" that would be great:

https://github.com/metaeducation/ren-c/tree/master/tests/parse

Anything you want to test should work (very slowly) in the web REPL. I'm trying to be good about taking basically every experiment I type down and making sure it gets incarnated as a test instead of just tried once and forgotten about...