Introducing UPARSE: The Hackable Usermode PARSE

:eyes: :eyes: :eyes: Big News! :eyes: :eyes: :eyes:

Recently I've talked about making a few tweaks to PARSE...like changing how SET-WORD! acts.

But we'd still need a Redbol version. And even small backwards-compatibility changes are not possible in the way PARSE has been historically done. It's a monolithic chunk of C code that you have to recompile to get any changes to.

This is a weakness... because when you look at "parser combinators" in other languages they are generally usermode codebases. Haskell has several to choose from I've written about (Parsec, Attoparsec, Megaparsec). Search the web for "parser combinator L" for any language L. You'll find plenty of other people who have figured out how to break down your parse into components and avoid RegEx... "nom" for Rust, "kern" for Clojure..

The best way to start drafting a new-and-much-needed architecture for PARSE is to write it first as usermode code. It will be slow as molasses...but if the design is done right, then it could be gradually optimized. Those optimizations are more likely to give cross-cutting advantages to other code in the system.

And what better way to convince people that they can make powerful dialects than to show them that they--too--could have cooked up PARSE in a day?

Meet UPARSE!

I actually did write a draft implementation in a day-ish. But over time, it has been built up and become a pretty thorough effort. Here it is:

%uparse.reb

(Note: The web build doesn't have it built in, so you have to run do <uparse> to get it. Plain executables include it with the mezzanines. But even still, it's factored as a separate script so you can mess with it and run it to overwrite the implementation.)

To be clear: the plan here is ultimately that the design of UPARSE would become the architecture of real PARSE. It would just be backed by more native code so it would perform better. But since it would be modular, you could interchange native and usermode functions on a piece-by-piece basis.

Something impressive about it is that it uses partial specialization via FRAME! to build coherent rules without blocks. You can write things like:

>> uparse ["aaaa"] [into some 2 "a"]
== ["aaaa"]

That kind of composite rule wasn't historically possible without blocks. In Red, INTO requires a block:

red>> parse ["aaaa"] [into some 2 "a"]
*** Script Error: PARSE - unexpected end of rule after: into

red>> parse ["aaaa"] [into [some 2 "a"]]
== true

R3-Alpha is worse, because its irregular implementation means it doesn't even know what it can't do...so it misses error reporting cases and just churns along:

r3-alpha>> parse ["aaaa"] [into some 2 "a"]     
** Script error: PARSE - invalid rule or usage of rule: into

r3-alpha>> parse ["aaaa"] [into [some 2 "a"]]
== false  ; ??!!! What

r3-alpha>> parse ["aaaa"] [into [some [2 "a"]]]
== true

Customizable, and Easier To Understand Than You'd Think!

UPARSE can take an optional MAP! of "combinators". This is a map from keywords to stylized "parser functions". These functions take the input and other parsers as parameters...then calculates how much of the input was consumed and returns it.

An easy one to look at would be the WHILE combinator:

func [
    {Any number of matches (including 0)}
    return: [<opt> any-series!]
    input [any-series!]
    parser [action!]
][
    let pos: input
    cycle [
        if tail? pos: any [parser pos, return pos] [
            return pos
        ]
    ]
]

It stops when the tail is reached, or when the parser function it is given returns null.

But what is this PARSER function parameter? It's basically a black box. The UPARSE engine itself figures out what that parameter should be, just like DO usually figures out expressions. Because the routines defer that process to a vetted mechanism, you get nice coherencies like the into some 2 "a" working, without SOME or INTO needing to belabor the details of how that chain is made.

Not Just Keywords...But Datatypes Too

So a concept here is that datatypes get functions in the map as well. Here's one for GROUP!:

func [
    return: [any-series!]
    input [any-series!]
    value [group!]
][
    do value
    return input  ; just give back same input passed in
]

It's nothing particularly special. But it shows that you could...at runtime...override it in your parse variant.

Most interesting is that BLOCK! itself is a combinator. So you can hijack the parse at its most basic level. You could dynamically pre-process the block, to transform it from something that is an augmented parse form into something the parse would accept.

Demonstrating the Power: Adding BETWEEN

As a simple demo of why this is so powerful, let's add a feature. We do this by making our own combinator collection.

Since you're familiar with Redbol-PARSE, let's start with the Redbol parse combinator set. This is the one where GET-WORD! is rigged to capture the parse position, SET and COPY take WORD! arguments, etc.

We'll copy those combinators, and then add our new one:

myparse-combinators: copy redbol-combinators

Now we'll add BETWEEN to the mix:

myparse-combinators/('between): func [
    return: [<opt> any-series!]
    input [<opt> any-series!]
    'var [word!]
    parser-left [action!]
    parser-right [action!]
][
    if not let start: parser-left input [return null]  ; advance left parser

    let limit: start  ; end copying at start of any right match
    cycle [
        if input: parser-right limit [  ; found it
            set var copy/part start limit
            return input
        ]
        if tail? limit [return null]
        limit: next limit  ; check next position for start of right match
    ]
    return null
]

Now all you need is a specialization of UPARSE that uses the map of combinators with your new entry. If you're using the web build, you have to run: do <uparse>

 myparse: specialize :uparse [combinators: myparse-combinators]

Will it work? Let us see...trying from the web console:

>> myparse "aaaa(((How cool is this?))aaaa" [
    some "a", between x some "(" some ")", some "a"
]
== "aaaa(((How cool is this?))aaaa"

>> x
== "How cool is this?"

Boom. Notice how it distinguished the idea of getting arguments literally out of the rule (the X) because the parameter was quoted in the combinator spec...whereas the non-quoted parameters were put together as rules.

It's Even Cooler Than That...

That syntax for the BETWEEN above is fairly ugly. Stuck in the historical style.

But UPARSE is already a step ahead...with the new SET-WORD! behavior. Plus it has its nicer BETWEEN in the box:

>> uparse "aaaa(((How cool is this?))aaaa" [
    some "a", x: between some "(" some ")", some "a"
]

>> x
== "How cool is this?"

The way it works is that not only do combinators update their position, but they can also yield a result separate from that update. SET-WORD! is able to be to the left of any combinator that has that additional result (a multi-return).

But This Is Was Just One Day Of Work, So...

UPDATE: Obviously much has changed since the original post!

...of course, one day of work standing on implementations of FRAME! and partial specializations that have been belabored and thought over for years, looking toward culminating in this kind of design...

So tons of big questions here. But I think this is an extremely good start for how to be thinking about these problems. Certainly much easier to study than C code, and it can be optimized once it's understood better.

When designs work this well, it puts us back on the map.

3 Likes

Wow!! This should prove incredibly useful for some kinds of dialects.

A challenging aspect of dialect design can be interpreting sequences of values into valid expressions. We're not talking about large volumes of input here; this is something which benefits more from flexibility of PARSE than raw speed.

I'm curious to see if this will tie into user-defined datatypes. This opens up doors to interesting new possibilities. I'm thinking custom interpreters. And maybe something like DELECT could be revisited here?

Where I really want to focus here is on making this an exemplary codebase for how someone would build a dialect.

So that means tackling the details of things like error delivery...and the ever-looming cloud of debugging.

But also, what role binding plays. Here I'm breaking up the dialect's behavior into a map of keywords which is supplied to the dialect engine...that maps those keywords into units that the dialect understands (combinators). Is this concept typical? Is it how BITWISE should work?

In terms of adding features, I really want to see it work with streaming. We need some test examples, like a PORT! that just sits there generating random IP addresses...and then run a UPARSE on that which prints them out...where the amount of memory used stabilizes and does not grow without bound.

So there's a ton to worry over just by making UPARSE a good example without pushing too far into the imaginary...

Very nice. This looks to be an excellent direction.

Things I suggest should be supported if not already:

  1. An ability to return a parse tree result (terms matched + segments matched). This is fabulous for data extraction formed into a structured result.

  2. An ability to return the furthest input point matched and the rule that caused rollback from there on parse failure. During development of rules this generally indicates the rule that is not properly specified.

  3. Maybe uparse can provide some user evaluation context services passed through to the user actions, so that once parse has finished successfully, your result could be the evaluation of the parse tree whatever that means in your application, without having to code all that context tracking yourself. I'm unsure about this, because if you have #1 in a decent form maybe that's all you need. Still having access to things uparse keeps track of, like the successful branch of terms you're currently on, maybe useful.

1 Like

Glad you like it, and hopefully enough that you'd want to come back and work on it...this is the democratization of the design of PARSE, right here!

Being able to know the farthest point reached is something that's on the surface not as easy to do with the process being so decentralized. But I gave it a shot.

What I've done is I've made it so combinators are made with a common generator COMBINATOR. This generator takes the function body you give it, and stuffs in some boilerplate parameters (like the INPUT and STATE). But it also wraps your code with some more boilerplate that can run before and after the parser.

The current idea of the "parser state" is just to pass around the FRAME! of the UPARSE operation itself. So if you have any global state you want visible to the parse you can put it there. Hence the state gives every combinator access to the arguments, return values, and locals of the invocation.

I made FURTHEST: an optional return value. The hooked combinators are run, and then if they succeed they're checked to see if they got further than any previous combinator. If so they update furthest. (This is only done if the furthest was requested.)

In the web build:

>> do <uparse>

>> uparse/furthest "aaabbb" [some "a" some "c"] 'far
; null

>> far
== "bbb"

(Note: you can also use [value progress furthest]: uparse ... or opt out of the first two with [_ _ furthest]: uparse ..., but refinements work as well)

I'm not really sure what the behavior here would be.

Maybe you can look at how the furthest detection works and explain in the context of the code what you would want.

This is something where maybe hacking up a demo example would be useful. :slight_smile: I don't really know what you would be looking for with this, but maybe you can discern by looking at the gist of uparse.reb whether it's the sort of thing that would be feasible to mock up.

I added GATHER and EMIT pretty quickly, and I think they might point in the direction of filling in some of the holes in the design.

That fulfills the request for identifying how far the parse got before failing. Nice.

An ability to return a parse tree result is demonstrated in Rebol2 here: http://codeconscious.com/rebol-scripts/load-parse-tree.txt (also linked from comment here: Trello)

While the example shown demonstrates a grammar, I've used this structural output a lot with extraction from various text and html documents - very useful, but with much room for improvement.

However, you're GATHER and EMIT are really interesting.

The ability to extend parse with better context and flexibility like you're Uparse framework is something I'd wanted for a very long time. I'd made comments to Carl back in the day, but I lacked the language and examples to make a convincing case at the time, and as you have pointed out, changing parse was probably prohibitive for testing these ideas at the time.

2 Likes

You tended to try and build parse extensions like "APROPOS" and "PARSING-AT" that I did not understand at the time. Perhaps it would be a good time to explain what those (or other ideas) were so that they could be tested as combinators.

(Note: Be sure to look at the post on resolving SET inconsistencies, if you're not subscribed to the PARSE forum section...)

I hadn't noticed until just now, but topaz.red has an OBJECT keyword in it with similar behavior. However, it makes all SET-WORD!s into "emits", which seems a bit restrictive.

Interesting, though the case you give is stylized to give statically known names for all the rules. If your rules are not provided as a table in this way, then you have nameless rules...and now dynamically generated rules (e.g. by :(...)).

Even for the rules with names, there's currently no hook on rules-fetched-through-WORD!. This is because that fetching process doesn't use a combinator...it's the process by which combinators are themselves selected.

Rules that do SEEK or mutations would raise questions in providing this tree product.

It may be that the form you've done it in is the right one...as a feature that demands some amount of stylization of the input rules, and if you are willing to limit yourself to that style then you can get a meaningful output from it. But since it requires that stylization, it may not make sense to be a feature in the stock parse...but be built above it as you have done.

We'd certainly want to do better with TRACE features. The /VERBOSE option in UPARSE is currently quite pathetic.

>> uparse/verbose "aaabbb" [some "a", some "b"]
RULE: [some "a", "some b"]
INPUT: "aaabbb"
---
RULE: [, "some b"]
INPUT: "bbb"
---
RULE: ["some b"]
INPUT: "bbb"
---
== "aaabbb"

Maybe the generic version of your parse tree feature--that would work on any convoluted parse rules that came along--would be more like a trace log?

1 Like

Apropos was about exploring late rebinding of blocks into different contexts.

The whole words, contexts and tokens thing awaits further exploration. For example, in some recent scripts when I emit words into a block as data, I want them as symbols deliberately stripped of default bindings. I want that block data to have meaning applied by the receiving context - not baked in and not having to use Parse to do it, which also means some custom reduction and block rewriting techniques (reduce-if-bound otherwise leave-as-is, etc).

parsing-at was a rule generation function allowing a rebol expression to determine a parsing match.

E.g. this should succeed.

an-odd: parsing-at x [if attempt [odd? x/1] [next x]]
parse [3] an-odd

Looking "at the biggest issue to sort out in UPARSE"

A valuable exercise.

Umm.. taken literally that's seems limiting.

Taken more broadly as interpretation, evaluation of a tree, maintaining the context I referred to earlier, then yes.

Perhaps combinators can return the tree structure, if I understand right.

Maybe. A log that emits a bunch of parse actions that can be rebound and reduced or played back to build a tree or extract or whatever, or rebuild the original input.
So maybe this is relevant, maybe not, but this is how I'm processing CSV now - it's cute and I get a human readable log as part of the bargain.

>> c: {1,2,3^/4,5,6}
== "1,2,3^/4,5,6"
print mold s: scan-excel-text #"," c pipe collect-script
[
    opn
    itm "1"
    itm "2"
    itm "3"
    eol
    itm "4"
    itm "5"
    itm "6"
    eol
    cls
]

That bunch of instructions can be played or piped through some interpreter to yield a result specific to the interpreter - one will create a tabbed delimited file, another to build a bunch of block records, another to rebuild the original CSV, etc.

Maybe a parse trace log like that could be similarly evaluated in different ways - whether it's powerful to do so....I don't know.

[now going back to a considerable number of photographic assignments with imminent due dates...]

1 Like

Under the new framework, this seems like COMBINATOR. Although right now, combinators don't let you choose the name for INPUT, they're all stylized to assume INPUT is the name.

We'll definitely have something like this, that you can use for making a rule that you reuse...or just for calling directly.

GET-GROUP! is looking like it's going to stay on track as a way of saying "evaluate this group and then splice the result into the parse stream". Since LOGIC! true means keep parsing (as does NULL), and LOGIC! false means stop parsing, this is useful for many cases.

You could always capture, and say:

[x: capturing rule, :(test x)]

And that gives you one way of doing things. If the rule fails you'll be rolled back to where you were before the capture. Though it's not letting you scan and move the output position arbitrarily.

Okay, thanks for reminding me what APROPOS was. Not particularly related to parsing, but it does relate to binding.

Ren-C continues to strive to be a substrate for exploring binding ideas...I did this little writeup to try and look at it from the big picture:

Binding Re-Examined From First Principles

The technical parts are trying to get better while staying functional. I'm not that worried about slowing things down if I can get at some answer that is right for some definition of right that lets you do cool things. I just feel historical Red and Rebol has only let you do broken things...