Haskell parser combinators

I’ve seen a few parsing-related threads recently in this forum. I haven’t been commenting much on them, since I don’t have enough experience with UPARSE (or for that matter historical PARSE) to say anything helpful.

On the other hand, I do have a lot of experience with parser combinators in Haskell. In terms of structure and purpose, they’re probably the concept most similar to UPARSE that I’ve seen from another language. So I thought I’d write this post in the hope that something, somewhere, might eventually prove useful somehow.

High-level overview

The first and most important thing to realise about parser combinators is that they’re not built into the language. They’re simply ordinary libraries which make it easy to construct parsers compositionally.

(I mean, for that matter, UPARSE isn’t built in either. But there’s a different level of integration with the rest of the language.)

The basic approach is to define a new datatype for parsers, usually called something like Parser a. This is a parser which can process a string (or other datatype), returning a result of type a if it succeeds. Alongside this there will be a set of primitive parsers, most notably char (to parse single characters), return (which always succeeds with a result), and empty (which always fails).

(Of course, practical libraries will have more primitives. megaparsec has a nice selection, with good comments.)

The most interesting bit is the way these parser combinators are combined to create larger parsers. This, of course, relies on the standard abstractions of Haskell. Two operators are particularly important:

  • Sequencing: p >>= f is a parser which runs p and passes the result to function f, then runs the resulting parser.
  • Choice: p <|> q is a parser which runs p, then backtracks and runs q if that failed.

From these basic elements you can define a very wide range of generic combinators, for instance these useful ones:

-- sequencing ignoring result
(>>) :: Parser a -> Parser b -> Parser b
p >> q = p >>= \_ -> q

-- Match a whole string
string :: String -> Parser ()
string "" = return ()
string (c:cs) = char c >> string cs

-- 1 or more
some :: Parser a -> Parser [a]
some p =
    p >>= \first ->
    many p >>= \rest ->
    return (first : rest)

-- 0 or more
many :: Parser a -> Parser [a]
many p = some p <|> return []

A basic implementation

This is all a little abstract, though. Probably the easiest way to understand parser combinators is to implement them.

A basic implementation is quite simple. A parser is just a function which takes in a string, and on success returns the result alongside the unparsed portion of the string:

newtype Parser a = Parser (String -> Maybe (String, a))

You then run the parser simply by unwrapping the function and applying it to your input:

runParser :: Parser a -> String -> Maybe (String, a)
runParser (Parser p) input = p input

Simple parsers work as you’d expect:

char :: Char -> Parser ()
char c = Parser $ \input -> case input of
    (c':cs) | c == c' -> Just (cs, ())
    _ -> Nothing

return :: a -> Parser a
return a = Parser $ \input -> Just (input, a)

empty :: Parser a
empty = Parser $ \_ -> Nothing

The combinators require more elaborate state-threading, but are still straightforward:

(>>=) :: Parser a -> (a -> Parser b) -> Parser b
p >>= f = Parser $ \input ->
    case runParser p input of
        Nothing -> Nothing
        Just (partiallyParsed, a) ->
            runParser (f a) partiallyParsed

(<|>) :: Parser a -> Parser a -> Parser a
p <|> q = Parser $ \input ->
    case runParser p input of
        Just result -> Just result
        Nothing -> runParser q input

This simple code suffices for a surprisingly wide variety of tasks. For instance, a slight variant is present in the base library as ReadS. (The variation is that ReadS can produce multiple results.)

Managing backtracking

Unfortunately, this has serious problems with any larger-scale use. For one thing, it can’t do error reporting beyond ‘it failed somewhere’. But there are even bigger problems with this implementation of the choice combinator (<|>):

  1. It behaves unpredictably: any error anywhere will result in the parser backtracking to the last choice, and so on until every single choice has been exhausted. This is bad for the time complexity, plus it leaves you no chance of ever reporting errors nicely.
  2. It holds onto the input string during the whole time p is executing. This creates a space leak, which grows larger the more nested choices you have.

The first parser combinator library to solve these issues was Parsec. Its design is described in this paper (which seems very readable). I won’t describe its implementation here… but the basic idea is simple to summarise: each parser keeps track of what it’s consumed, and commits to a parser as soon as it’s consumed any characters.

Some parsers require no changes under this new semantics. For instance, string "first word" <|> string "second word" works without problems. If it sees an input character f, the first choice string "first word" matches immediately, meaning the library can commit to the first parser, and know that the second parser never needs to run. On the other hand, upon seeing an input character 's', that first parser will fail on that very first character: since no characters have been consumed, control passes over to the second parser to succeed or fail.

On the other hand, more elaborate lookahead becomes more difficult. For instance, string "word 1" <|> string "word 2" no longer works: as soon as this parser sees 'w', it matches the first parser, so the second parser will never be run.

To solve this, Parsec introduces a new primitive combinator: try, which creates a parser which never consumes characters on failure. The effect of this is to allow backtracking over whatever is in the try. For this example, try (string "word 1") <|> string "word 2" would work as expected.

The nice thing about this approach is the control it gives you. For instance, I can write things like:

(try (string "App") >> string "le") <|> string "Apricot"

Here, as soon as the parser reaches a second 'p', it can commit to the first branch. But if it sees another character, it’s still able to backtrack through the string "App" and move on to the second branch. Of course, this is a contrived example — but in general, being able to specify the control flow of your parser like this is exceedingly useful in larger parsers.

What use is this for Rebol?

Quite probably, none at all.

But like I said, I’m hoping potentially some of this might end up useful somewhere. If it can act as inspiration which helps solve any problems, I’m happy.

And, of course, now that we have this thread for it, feel free to ask me any questions about parser combinators you might have!

2 Likes

Thank you for summarizing! I've written a little here before:

PARSE vs. Haskell's (X)PARSEC

Parser combinators were not an influence on the original Rebol PARSE (circa 90s), but were what I drew inspiration from for UPARSE.

One thing to notice is that UPARSE doesn't have a sequencing combinator. Instead, it has a BLOCK! combinator.

And it's the BLOCK! combinator that generally does the "combinating". It can technically be replaced, but it is the most complex combinator and so difficult to replace entirely... it's almost like the block combinator is UPARSE. So it is more likely that it would be augmented vs. people wanting to rewrite it entirely.

When one of the parsers the block combinator has "combinated" fails, it then scans literally for the next | in the input and starts applying rules there. This is true also historical Redbol, in that you can write any old (LOAD-able) gibberish after a failing rule, due to the skipping:

red>> parse "aaa" [some "b" fhqwhgads ??? | some "a"]
== true

(Should a debug mode exist that runs through and ensures these regions will combinate, just not call the parsers? Probably not in this language paradigm. Mismatching rules can guard against dereferencing undefined variables and such, and that's kind of how the whole language works... it's modeling clay.)

UPARSE adds in another element to BLOCK!'s recognized-literally vocabulary of ||, that is a synonym for putting everything to the left of it in a block:

[rule1 | rule2 | rule3 || rule4]
=>
[[rule1 | rule2 | rule3] rule4]

[rule1 | rule2 || rule3 | rule4 || rule5]
=>
[[rule1 | rule2] [rule3 | rule4] rule5]

(Well, it's a synonym so long as you haven't redefined the BLOCK! combinator itself to other meanings, but if you did you might also redefine || while you're at it.)

Anyway, I'm rather fond of ||.

Seems unfortunate that the default is biased that way, and you have to tack on more to get the likely-intended meaning.

There was significant discussion on what to call the "0 or more" combinator. Historical Redbol called this ANY and I did not like that (instead pursuing an ANY that is aligned with the ANY construct in the language).

But I do not like MANY either--it might be worse. Ultimately I decided that OPTIONAL SOME (or OPT SOME) is just what you would write out.

On the linguistic front, we don't have a word for "zero or more" in English, and maybe there is a reason for that. What I found in going through replacing the old "ANY"=>"OPT SOME" was that there were a lot of cases where someone reached for ANY when it actually wasn't a zero-or-more situation, so I liked it being explicit.

Many cases I looked at tidied up (and by many I mean many, not zero). I found this code removing 0 or more newlines at the head of a series via ANY:

parse series [
    remove [any newline]
    ...
]

But when you rephrase this with OPT SOME it suggests a better factoring:

parse series [
    remove [opt some newline]
    ...
]

It reads clearest when you bring the OPT outside, to say you're optionally removing some newlines:

parse series [
    opt remove [some newline]
    ...
]

Plus it's now more obvious that the whole expression will be NULL in the case when no newlines are removed, and leverage that.

Note that the block is not necessary there:

parse series [
    opt remove some newline
    ...
]

And shows a bit of the dynamics of choice we like about the medium.

To prematurely return a result from UPARSE (without requiring reaching the end of input), the ACCEPT combinator can be used.

>> parse [a b c 10 20 <d>] [some [word! | integer!]]
** Error: PARSE partially matched the input, but didn't reach the tail

>> parse [a b c 10 20 <d>] [some [word! | accept integer!]]
== 10

>> parse [a b c 10 20 <d>] [some [word! | accept [let i: integer! accept (i * 3)]]]
== 30

R3-Alpha added this as "RETURN", but Red doesn't seem to have carried that forward. I like using a different word.

UPARSE combinators can synthesize antiforms, including NIHIL. So there's an ELIDE combinator.

>> parse [a b a b b c c] [accumulate ['a | 'b], some 'c]
== c

>> parse [a b a b b c c] [accumulate ['a | 'b], elide some 'c]
== [a b a b b]

This is particularly helpful to reach for in cases where an ACCEPT that returned prematurely would not complete your match. e.g. you synthesized your result, but you still have more matching to do, and want to avoid the inconvenience of making and naming a variable for the synthesized result and then evaluating to it at the end.

(Note that COMMA! doesn't do anything, they are skipped by the BLOCK! combinator so long as they are between combinated parsers. some "a", some "b" is legal but not some, "a" some, "b").

Rebol PARSE has lots of performance issues. e.g. when doing something like a seek, a block rule is re-applied at each location:

red>> parse "aaaaab" [thru [(print "testing") "b"]]
testing
testing
testing
testing
testing
testing
== true

There are attempts in historical Rebol and Red to try and optimize where possible, e.g. the implementation of to "b" optimizes the find to do a faster seek for the pattern. But you lose that optimization as soon as you change it to to ["b"], and it recurses the parser.

Ren-C dials things up to 11, including the crappy performance. UPARSE is knowingly painfully turtle slow, because it is really just a sketch of the architecture for how parser combinators would be done. But as its broken up into functions, those functions can be rewritten as native C when the time comes (though I want to make sure the usermode implementations are kept in sync and can be swapped in and still work equivalently--so I imagine maintaining each usermode combinator and its C implementation in the same file).

I also want to make sure all the right hooks are in place to do things like stepwise debugging:

ReplPad Visual PARSE Debugger

But I'm trying to put that together as part of a bigger story of what kind of infrastructure exists for dialect debugging and debugging in the main language (e.g. you'd be able to "step in" to a dialect at the Rebol instructions--like an assembly view--or at the level the dialect offers).

The most useful thing is bringing this outside perspective to the sort of "people who grew up on C64 BASIC" who tend to populate and design Rebol. (You've already usefully pointed out that OPTIONAL is a perfectly acceptable word and explained the Haskell sense of TRY...)

Like I say above, the UPARSE design is still very much prototype-y. It's at the "amazing it works at all" phase, though it's already the most tested PARSE variant by far (e.g. by subsuming Red's PARSE tests). But by being all usermode code, it's still at the point where there's plasticity in the design to incorporate new ideas.

The big areas where ideas are needed most are in error delivery, and what the story is for binding tools within parse to help dialect authors accomplish cool things easily. And any architectural ideas to bolster performance (while still keeping the pleasing ergonomics) would be welcome.

Interesting, thanks! After reading it, I do have a few thoughts, but I’ll reply in that thread.

Sure, but I wasn’t aiming to describe Rebol there.

This certainly is neat. Reminds me a bit of Haskell’s ($) operator, but more flexible.

But on the other hand, this approach greatly improves performance, and makes control flow far more manageable. From my experience using parser combinators, the trade-off is well worth it.

In Haskell, many is just the conventional term which has stuck. Haskell tends to care much less about names than Rebol does, probably because the types and the syntax carry so much more information.

Apologies, I should have made this clearer. return in Haskell does not prematurely return a result. It simply takes a value and hoists it into a monadic context. In this case, return value is a parser which consumes no characters and always succeeds, yielding the result value as it does so.

(Note that return has the alias name pure. Arguably that’s a less confusing name, and I should have used it.)

Again, this is something I should have been clearer about. Haskell (>>) is simply the equivalent of Rebol COMMA!.

In this case, (>>) is used to write the equivalent of your first example, not your second. In Haskell, these would be:

parser1 = many (char 'a' <|> char 'b') >> some (char 'c')
parser2 = many (char 'a' <|> char 'b') >>= \result -> (some (char 'c') >> return result)

Although in practice we have another operator (<*) to make that second example a bit shorter: many (…) <* some (char 'c').

Yep, you understand my intentions!

Some parser combinator libraries have interesting strategies for error handling, so that might be worth a look. I know that the paper I linked summarises the basic ideas around how that works. (If I get time, perhaps I can investigate further and write it up here.)

Haskell is well-suited in this regard, because it has so many well-specified abstractions for combining things together. (e.g., many libraries implement parsers as monad transformers.) And of course parsers are simply ordinary values, so it’s easy to manipulate them using the rest of the language.

By contrast, I’ve always felt this is the limitation of dialecting: the elements of a dialect are just words in a block, rather than first-class values of the language. Rebol has parse specifications, but the ‘parsers’ themselves don’t actually exist outside the call to uparse. There may be some solution to this problem, but I’m not sure what it is.