Introducing The Hackable Usermode PARSE ("UPARSE")

hostilefork · February 28, 2021, 11:43am

!!! BIG NEWS !!!

I've griped about historical PARSE's weaknesses for some time. To name a few:

Users can't add new keywords, or tweak existing ones. It's a monolithic chunk of C code that you have to recompile to get any changes to. And since Rebol2's heyday, there have arisen a competitive landscape of "parser combinators" which allow users to build new parse "keywords" and compose parsers out of each other.
- Haskell has several to choose from I've written about (Parsec, Attoparsec, Megaparsec). These offer composability, performance, rigor, and even error messages to guide you to correcting your input where the rule failed.
- ...but search the web for "parser combinator [L]" for any language [L]. You'll find plenty of other people who have figured out how to break down your parse into components and avoid RegEx... "nom" for Rust, "kern" for Clojure
Parse rules are very hard to reuse. Because executing a rule doesn't produce a "result value", any side-effects a rule has wind up being written in some global variable.
- The finite set of keywords users are given for data capture are limited to capturing data at the level of the input series (or elements of it).

The R3-Alpha codebase for PARSE developed organically over a long period of time, and that shows.

While there's lip service to the idea that you can write fluid code using delimiters only when you want to, the irregular implementation gives you errors... if you are lucky (if you're unlucky it just churns along)

r3-alpha>> parse ["aaaa"] [and any-string! into some 2 "a"]   ; no AHEAD
 ** Script error: PARSE - invalid rule or usage of rule: into

r3-alpha>> parse ["aaaa"] [and any-string! into [some 2 "a"]]
== false  ; !!! WHAT? (it doesn't even know what it can't do) 

r3-alpha>> parse ["aaaa"] [and any-string! into [some [2 "a"]]]
== true

Despite being an all new codebase, Red is similarly irregular:

red>> parse ["aaaa"] [ahead any-string! into some 2 "a"]
*** Script Error: PARSE - unexpected end of rule after: into

red>> parse ["aaaa"] [ahead any-string! into [some 2 "a"]]  ; needs block
== true

The best way to start drafting a new-and-much-needed architecture for PARSE is to write it first as usermode code. It will be slow as molasses...but if the design is done right, then it could be gradually optimized. Those optimizations are more likely to give cross-cutting advantages to other code in the system.

Not only that: if the architecture is good enough, it should be able to create backwards-compatible variations... like PARSE2 for Rebol2 PARSE semantics, PARSE3 for R3-Alpha, REDPARSE for Red...

Meet the new PARSE! (codenamed UPARSE)

I actually did write a protozoan draft implementation in a day-ish. What better way to convince people that they can make powerful dialects than to show them that they--too--could have cooked up PARSE in a day?

But let me be completely clear: the ability to even approach writing such a thing is fundamentally built on Ren-C core mechanics. Most notably FRAME! interchangeability with functions and partial specialization. You wouldn't want to try something like this in Rebol2, R3-Alpha, or Red.

Since the original prototype, it has been built up and become a pretty thorough effort!

%uparse.r

(UPDATE: While once actually called uparse, the function has now taken the name parse...with the old parse codebase running as parse3. For performance reasons, some code still uses parse3.)

Killer Feature: It Can Synthesize ANY-VALUE!

UPARSE isn't limited to return results of true and false. It actually raises an error if it cannot process its rules completely:

>> parse "abd" ["a" "b" "c"]
** Error: BLOCK! combinator at tail or not thru

This opens up all normal values as fair game for return! Every "combinator function" (historically what you might have thought of as "keywords") has the option of returning a result, and the BLOCK! combinator that succeeds will return its last result.

Literal matches return the rule itself:

>> parse "abc" ["a" "b" "c"]
== "c"

A GROUP! is a combinator that always succeeds, and returns its result:

>> parse "abc" ["a" "b" "c" (1000 + 20)]
== 1020

When you use a SET-WORD! in a rule, what you're actually capturing is the result of the rule on the right.

>> parse "aaab" [stuff: collect [some keep "a"], "b"]  ; commas optional, but cool!
== "b"

>> stuff
== ["a" "a" "a"]

The sky is the limit with this approach. When you consider that things like COLLECT are not "built-in" but the sort of combinator you can create after-the-fact, the true power starts to hit you. Here's GATHER, which you can use with EMIT to make objects:

>> parse [x 10 y 20] [gather [
       emit reaction: (join "WOW" "!!!")
       some [name: word!, emit (name): integer!]
   ]]
== make object! [
    reaction: "WOW!!!"
    x: 10
    y: 20
]

Customizable, and Easier To Understand Than You'd Think!

UPARSE can take an optional MAP! of "combinators". This is a map from keywords to stylized "parser functions". These functions take the input and other parsers as parameters...then calculates how much of the input was consumed and returns it.

An easy one to look at would be the TALLY combinator, which just counts up the number of times a rule matches. This post isn't intended to be a complete guide to the mechanics, but just to show you the scale of what it takes to add a feature:

tally: combinator [
    {Iterate a rule and count the number of times it matches}
    return: "Number of matches (can be 0)"
        [<opt> integer!]
    parser [action!]
    <local> count pos
][
    append state.loops binding of 'return  ; registration needed so BREAK works

    count: 0
    cycle [
        [^ pos]: parser input except e -> [  ; definitional error if parse mismatch
            take/last state.loops  ; unregister from the BREAK-able stack
            remainder: input
            return count
        ]
        count: count + 1
        input: pos
    ]
    fail ~unreachable~
]

And voila!

>> parse "abbcccabb" [tally ["a" repeat 2 "b" | repeat 3 "c"]]
== 3

But what is this PARSER function parameter to TALLY? It's basically a black box. The UPARSE engine itself figures out what that parameter should be, just like DO usually figures out expressions. So a PARSER is what you get when a COMBINATOR has had its own dependent parser parameters specialized out, to then only require the input.

(If that sounds brain-bending, the only part you need to know is that TALLY is able to get its work done without knowing anything about how to interpret ["a" repeat 2 "b" | repeat 3 "c"]. UPARSE does the behind-the-scenes work of "fusing" the BLOCK! combinator, TEXT! combinators, and INTEGER! combinator...and passes the implied parser representing that operation.)

Composable Architecture Avoids Redbol's Ad-Hoc Pitfalls

Something impressive about UPARSE is that it uses partial specialization via FRAME! to build coherent rules without blocks. It "just works".

In the introduction I pointed out Red and R3-Alpha stumbling over INTO usage. UPARSE's SUBPARSE (like Topaz'z INTO) is arity-2 so it has an implicit AHEAD as the first parameter, but here we can see it working in the free-form fashion:

>> parse ["aaaa"] [subparse any-string! some repeat 2 "a"]
== "a"  ; Success! (e.g. not an error!)

DATATYPEs Have Combinators Too!

The implementation is so modular that every datatype is implemented through its own combinator. And you can override them when you run UPARSE.

Let's try doing something totally weird just to GROUP!, leaving everything else the same

weird-combinators: copy default-combinators  ; start with the default list

weird-combinators.(group!): combinator [
    {Weird group combinator: test if equal to next input element}
    return: "Return a block saying whether it matched or not"
        [block!]
    value [group!]
    <local> result block
][
    result: do value
    block: if (first input) = :result [
        compose [match: (value) = (first input)]
    ] else [
        compose [mismatch: (value) != (first input)]
    ] 
    remainder: next input  ; skip input either way
    return block 
]

>> weird-parse: specialize :uparse [combinators: weird-combinators]

>> weird-parse [10 20 30] [collect [keep (2 + 8) integer! keep (5 + 15)]]
== [match: (2 + 8) = 10 mismatch: (5 + 15) != 30]

Whoa... that's wild. I thought I'd demonstrate something radical just to get people thinking, but more obviously useful variants would do things like hook the WORD! combinator to get a running log of rules that trigger.

Most interesting is that BLOCK! itself is a combinator. So you can hijack the parse at its most fundamental level. You could dynamically pre-process the block and pass it to the default combinator.

And in fact, this is already in use to create PARSE2, a Rebol2-compatible PARSE.

Wild New Combinators Are Finessing Old Problems

UPARSE is providing a playground for attacking classic annoyances.

r3-alpha>> parse "We all remember <<this problem>> (?) don't we?" [
     thru "<<" copy thorn to ">>" skip skip  ; ugh. [to ">>" ">>"] is also ugh
     space "(?)" to end
]
== true

r3-alpha>> thorn
== "this problem"

With value-synthesis, problems like this can be cast in terms of combinators inspired by other languages (and beyond). Here's BETWEEN, a 2-argument rule...which I'll put in brackets even though I don't need to:

>> parse "We all remember <<this problem>> (?) don't we?" [
     thru "<<" rose: [between <here> ">>"]
     space "(?)" to <end>
 ]

>> rose
== "this problem"

You could have just written that as THRU BETWEEN, but I wanted to show off how it solves the problem from the point of the old understandings. Now I'll show off the ellipsis combinator too.
>> parse "We all remember <<this problem>> (?) don't we?" [
      ...  rose: between "<<" ">>" space "(?)" ...
   ]

>> rose
== "this problem"

@rgchris and I are of the opinion that COLLECT is nigh-useless without backtracking. e.g. in Red:

red>> parse "ac" [collect [
    "a" keep (<a1>) "b" keep (<b>)
    | "a" keep (<a2>) "c" keep (<c>)
]]
== [<a1> <a2> <c>]  ; Bah.

So a careful generic design has been made for that, which is extremely clever (if I do say so). COLLECT is only one of the clients, but an obvious one:

>> parse "ac" [collect [
    "a" keep (<a1>) "b" keep (<b>)
    | "a" keep (<a2>) "c" keep (<c>)
]]
== [<a2> <c>]

There's Much, Much More...

...and you are invited to help throw in your ideas, as well as document and test this huge new design. It's slow for now, while features like stream parsing continue to be prototyped. But when it's ready, combinators will be native and optimizations pursued!

BlackATTR · February 28, 2021, 1:35pm

Wow!! This should prove incredibly useful for some kinds of dialects.

A challenging aspect of dialect design can be interpreting sequences of values into valid expressions. We're not talking about large volumes of input here; this is something which benefits more from flexibility of PARSE than raw speed.

I'm curious to see if this will tie into user-defined datatypes. This opens up doors to interesting new possibilities. I'm thinking custom interpreters. And maybe something like DELECT could be revisited here?

hostilefork · March 2, 2021, 4:14am

Where I really want to focus here is on making this an exemplary codebase for how someone would build a dialect.

So that means tackling the details of things like error delivery...and the ever-looming cloud of debugging.

But also, what role binding plays. Here I'm breaking up the dialect's behavior into a map of keywords which is supplied to the dialect engine...that maps those keywords into units that the dialect understands (combinators). Is this concept typical? Is it how BITWISE should work?

In terms of adding features, I really want to see it work with streaming. We need some test examples, like a PORT! that just sits there generating random IP addresses...and then run a UPARSE on that which prints them out...where the amount of memory used stabilizes and does not grow without bound.

So there's a ton to worry over just by making UPARSE a good example without pushing too far into the imaginary...

Brett · March 2, 2021, 4:40am

Very nice. This looks to be an excellent direction.

Things I suggest should be supported if not already:

An ability to return a parse tree result (terms matched + segments matched). This is fabulous for data extraction formed into a structured result.
An ability to return the furthest input point matched and the rule that caused rollback from there on parse failure. During development of rules this generally indicates the rule that is not properly specified.
Maybe uparse can provide some user evaluation context services passed through to the user actions, so that once parse has finished successfully, your result could be the evaluation of the parse tree whatever that means in your application, without having to code all that context tracking yourself. I'm unsure about this, because if you have #1 in a decent form maybe that's all you need. Still having access to things uparse keeps track of, like the successful branch of terms you're currently on, maybe useful.

hostilefork · March 2, 2021, 8:19am

Glad you like it, and hopefully enough that you'd want to come back and work on it...this is the democratization of the design of PARSE, right here!

I took a crack at it! See: Semantics of UPARSE's FURTHEST

This is something where maybe hacking up a demo example would be useful. I don't really know what you would be looking for with this, but maybe you can discern by looking at the gist of uparse.reb whether it's the sort of thing that would be feasible to mock up.

I added GATHER and EMIT pretty quickly, and I think they might point in the direction of filling in some of the holes in the design.

Brett · March 3, 2021, 11:16am

An ability to return a parse tree result is demonstrated in Rebol2 here: http://codeconscious.com/rebol-scripts/load-parse-tree.txt (also linked from comment here: Trello)

While the example shown demonstrates a grammar, I've used this structural output a lot with extraction from various text and html documents - very useful, but with much room for improvement.

However, you're GATHER and EMIT are really interesting.

The ability to extend parse with better context and flexibility like you're Uparse framework is something I'd wanted for a very long time. I'd made comments to Carl back in the day, but I lacked the language and examples to make a convincing case at the time, and as you have pointed out, changing parse was probably prohibitive for testing these ideas at the time.

hostilefork · March 3, 2021, 2:54pm

I hadn't noticed until just now, but topaz.red has an OBJECT keyword in it with similar behavior. However, it makes all SET-WORD!s into "emits", which seems a bit restrictive.

Interesting, though the case you give is stylized to give statically known names for all the rules. If your rules are not provided as a table in this way, then you have nameless rules...and now dynamically generated rules (e.g. by :(...)).

Even for the rules with names, there's currently no hook on rules-fetched-through-WORD!. This is because that fetching process doesn't use a combinator...it's the process by which combinators are themselves selected.

Rules that do SEEK or mutations would raise questions in providing this tree product.

It may be that the form you've done it in is the right one...as a feature that demands some amount of stylization of the input rules, and if you are willing to limit yourself to that style then you can get a meaningful output from it. But since it requires that stylization, it may not make sense to be a feature in the stock parse...but be built above it as you have done.

We'd certainly want to do better with TRACE features. The /VERBOSE option in UPARSE is currently quite pathetic.

>> uparse/verbose "aaabbb" [some "a", some "b"]
RULE: [some "a", "some b"]
INPUT: "aaabbb"
---
RULE: [, "some b"]
INPUT: "bbb"
---
RULE: ["some b"]
INPUT: "bbb"
---
== "aaabbb"

Maybe the generic version of your parse tree feature--that would work on any convoluted parse rules that came along--would be more like a trace log?

hostilefork · May 15, 2022, 7:50pm

2 posts were split to a new topic: What Were APROPOS and PARSING-AT About?