Haskell parser combinators

hostilefork · March 12, 2024, 5:24am

Thank you for summarizing! I've written a little here before:

Parser combinators were not an influence on the original Rebol PARSE (circa 90s), but were what I drew inspiration from for UPARSE.

One thing to notice is that UPARSE doesn't have a sequencing combinator. Instead, it has a BLOCK! combinator.

And it's the BLOCK! combinator that generally does the "combinating". It can technically be replaced, but it is the most complex combinator and so difficult to replace entirely... it's almost like the block combinator is UPARSE. So it is more likely that it would be augmented vs. people wanting to rewrite it entirely.

When one of the parsers the block combinator has "combinated" fails, it then scans literally for the next | in the input and starts applying rules there. This is true also historical Redbol, in that you can write any old (LOAD-able) gibberish after a failing rule, due to the skipping:

red>> parse "aaa" [some "b" fhqwhgads ??? | some "a"]
== true

(Should a debug mode exist that runs through and ensures these regions will combinate, just not call the parsers? Probably not in this language paradigm. Mismatching rules can guard against dereferencing undefined variables and such, and that's kind of how the whole language works... it's modeling clay.)

UPARSE adds in another element to BLOCK!'s recognized-literally vocabulary of ||, that is a synonym for putting everything to the left of it in a block:

[rule1 | rule2 | rule3 || rule4]
=>
[[rule1 | rule2 | rule3] rule4]

[rule1 | rule2 || rule3 | rule4 || rule5]
=>
[[rule1 | rule2] [rule3 | rule4] rule5]

(Well, it's a synonym so long as you haven't redefined the BLOCK! combinator itself to other meanings, but if you did you might also redefine || while you're at it.)

Anyway, I'm rather fond of ||.

Seems unfortunate that the default is biased that way, and you have to tack on more to get the likely-intended meaning.

There was significant discussion on what to call the "0 or more" combinator. Historical Redbol called this ANY and I did not like that (instead pursuing an ANY that is aligned with the ANY construct in the language).

But I do not like MANY either--it might be worse. Ultimately I decided that OPTIONAL SOME (or OPT SOME) is just what you would write out.

On the linguistic front, we don't have a word for "zero or more" in English, and maybe there is a reason for that. What I found in going through replacing the old "ANY"=>"OPT SOME" was that there were a lot of cases where someone reached for ANY when it actually wasn't a zero-or-more situation, so I liked it being explicit.

Many cases I looked at tidied up (and by many I mean many, not zero). I found this code removing 0 or more newlines at the head of a series via ANY:

parse series [
    remove [any newline]
    ...
]

But when you rephrase this with OPT SOME it suggests a better factoring:

parse series [
    remove [opt some newline]
    ...
]

It reads clearest when you bring the OPT outside, to say you're optionally removing some newlines:

parse series [
    opt remove [some newline]
    ...
]

Plus it's now more obvious that the whole expression will be NULL in the case when no newlines are removed, and leverage that.

Note that the block is not necessary there:

parse series [
    opt remove some newline
    ...
]

And shows a bit of the dynamics of choice we like about the medium.

To prematurely return a result from UPARSE (without requiring reaching the end of input), the ACCEPT combinator can be used.

>> parse [a b c 10 20 <d>] [some [word! | integer!]]
** Error: PARSE partially matched the input, but didn't reach the tail

>> parse [a b c 10 20 <d>] [some [word! | accept integer!]]
== 10

>> parse [a b c 10 20 <d>] [some [word! | accept [let i: integer! accept (i * 3)]]]
== 30

R3-Alpha added this as "RETURN", but Red doesn't seem to have carried that forward. I like using a different word.

UPARSE combinators can synthesize antiforms, including NIHIL. So there's an ELIDE combinator.

>> parse [a b a b b c c] [accumulate ['a | 'b], some 'c]
== c

>> parse [a b a b b c c] [accumulate ['a | 'b], elide some 'c]
== [a b a b b]

This is particularly helpful to reach for in cases where an ACCEPT that returned prematurely would not complete your match. e.g. you synthesized your result, but you still have more matching to do, and want to avoid the inconvenience of making and naming a variable for the synthesized result and then evaluating to it at the end.

(Note that COMMA! doesn't do anything, they are skipped by the BLOCK! combinator so long as they are between combinated parsers. some "a", some "b" is legal but not some, "a" some, "b").

Rebol PARSE has lots of performance issues. e.g. when doing something like a seek, a block rule is re-applied at each location:

red>> parse "aaaaab" [thru [(print "testing") "b"]]
testing
testing
testing
testing
testing
testing
== true

There are attempts in historical Rebol and Red to try and optimize where possible, e.g. the implementation of to "b" optimizes the find to do a faster seek for the pattern. But you lose that optimization as soon as you change it to to ["b"], and it recurses the parser.

Ren-C dials things up to 11, including the crappy performance. UPARSE is knowingly painfully turtle slow, because it is really just a sketch of the architecture for how parser combinators would be done. But as its broken up into functions, those functions can be rewritten as native C when the time comes (though I want to make sure the usermode implementations are kept in sync and can be swapped in and still work equivalently--so I imagine maintaining each usermode combinator and its C implementation in the same file).

I also want to make sure all the right hooks are in place to do things like stepwise debugging:

ReplPad Visual PARSE Debugger

But I'm trying to put that together as part of a bigger story of what kind of infrastructure exists for dialect debugging and debugging in the main language (e.g. you'd be able to "step in" to a dialect at the Rebol instructions--like an assembly view--or at the level the dialect offers).

The most useful thing is bringing this outside perspective to the sort of "people who grew up on C64 BASIC" who tend to populate and design Rebol. (You've already usefully pointed out that OPTIONAL is a perfectly acceptable word and explained the Haskell sense of TRY...)

Like I say above, the UPARSE design is still very much prototype-y. It's at the "amazing it works at all" phase, though it's already the most tested PARSE variant by far (e.g. by subsuming Red's PARSE tests). But by being all usermode code, it's still at the point where there's plasticity in the design to incorporate new ideas.

The big areas where ideas are needed most are in error delivery, and what the story is for binding tools within parse to help dialect authors accomplish cool things easily. And any architectural ideas to bolster performance (while still keeping the pleasing ergonomics) would be welcome.