I’ve seen a few parsing-related threads recently in this forum. I haven’t been commenting much on them, since I don’t have enough experience with UPARSE (or for that matter historical PARSE) to say anything helpful.
On the other hand, I do have a lot of experience with parser combinators in Haskell. In terms of structure and purpose, they’re probably the concept most similar to UPARSE that I’ve seen from another language. So I thought I’d write this post in the hope that something, somewhere, might eventually prove useful somehow.
High-level overview
The first and most important thing to realise about parser combinators is that they’re not built into the language. They’re simply ordinary libraries which make it easy to construct parsers compositionally.
(I mean, for that matter, UPARSE isn’t built in either. But there’s a different level of integration with the rest of the language.)
The basic approach is to define a new datatype for parsers, usually called something like Parser a
. This is a parser which can process a string (or other datatype), returning a result of type a
if it succeeds. Alongside this there will be a set of primitive parsers, most notably char
(to parse single characters), return
(which always succeeds with a result), and empty
(which always fails).
(Of course, practical libraries will have more primitives. megaparsec
has a nice selection, with good comments.)
The most interesting bit is the way these parser combinators are combined to create larger parsers. This, of course, relies on the standard abstractions of Haskell. Two operators are particularly important:
- Sequencing:
p >>= f
is a parser which runsp
and passes the result to functionf
, then runs the resulting parser. - Choice:
p <|> q
is a parser which runsp
, then backtracks and runsq
if that failed.
From these basic elements you can define a very wide range of generic combinators, for instance these useful ones:
-- sequencing ignoring result
(>>) :: Parser a -> Parser b -> Parser b
p >> q = p >>= \_ -> q
-- Match a whole string
string :: String -> Parser ()
string "" = return ()
string (c:cs) = char c >> string cs
-- 1 or more
some :: Parser a -> Parser [a]
some p =
p >>= \first ->
many p >>= \rest ->
return (first : rest)
-- 0 or more
many :: Parser a -> Parser [a]
many p = some p <|> return []
A basic implementation
This is all a little abstract, though. Probably the easiest way to understand parser combinators is to implement them.
A basic implementation is quite simple. A parser is just a function which takes in a string, and on success returns the result alongside the unparsed portion of the string:
newtype Parser a = Parser (String -> Maybe (String, a))
You then run the parser simply by unwrapping the function and applying it to your input:
runParser :: Parser a -> String -> Maybe (String, a)
runParser (Parser p) input = p input
Simple parsers work as you’d expect:
char :: Char -> Parser ()
char c = Parser $ \input -> case input of
(c':cs) | c == c' -> Just (cs, ())
_ -> Nothing
return :: a -> Parser a
return a = Parser $ \input -> Just (input, a)
empty :: Parser a
empty = Parser $ \_ -> Nothing
The combinators require more elaborate state-threading, but are still straightforward:
(>>=) :: Parser a -> (a -> Parser b) -> Parser b
p >>= f = Parser $ \input ->
case runParser p input of
Nothing -> Nothing
Just (partiallyParsed, a) ->
runParser (f a) partiallyParsed
(<|>) :: Parser a -> Parser a -> Parser a
p <|> q = Parser $ \input ->
case runParser p input of
Just result -> Just result
Nothing -> runParser q input
This simple code suffices for a surprisingly wide variety of tasks. For instance, a slight variant is present in the base library as ReadS
. (The variation is that ReadS
can produce multiple results.)
Managing backtracking
Unfortunately, this has serious problems with any larger-scale use. For one thing, it can’t do error reporting beyond ‘it failed somewhere’. But there are even bigger problems with this implementation of the choice combinator (<|>)
:
- It behaves unpredictably: any error anywhere will result in the parser backtracking to the last choice, and so on until every single choice has been exhausted. This is bad for the time complexity, plus it leaves you no chance of ever reporting errors nicely.
- It holds onto the input string during the whole time
p
is executing. This creates a space leak, which grows larger the more nested choices you have.
The first parser combinator library to solve these issues was Parsec
. Its design is described in this paper (which seems very readable). I won’t describe its implementation here… but the basic idea is simple to summarise: each parser keeps track of what it’s consumed, and commits to a parser as soon as it’s consumed any characters.
Some parsers require no changes under this new semantics. For instance, string "first word" <|> string "second word"
works without problems. If it sees an input character f
, the first choice string "first word"
matches immediately, meaning the library can commit to the first parser, and know that the second parser never needs to run. On the other hand, upon seeing an input character 's'
, that first parser will fail on that very first character: since no characters have been consumed, control passes over to the second parser to succeed or fail.
On the other hand, more elaborate lookahead becomes more difficult. For instance, string "word 1" <|> string "word 2"
no longer works: as soon as this parser sees 'w'
, it matches the first parser, so the second parser will never be run.
To solve this, Parsec introduces a new primitive combinator: try
, which creates a parser which never consumes characters on failure. The effect of this is to allow backtracking over whatever is in the try
. For this example, try (string "word 1") <|> string "word 2"
would work as expected.
The nice thing about this approach is the control it gives you. For instance, I can write things like:
(try (string "App") >> string "le") <|> string "Apricot"
Here, as soon as the parser reaches a second 'p'
, it can commit to the first branch. But if it sees another character, it’s still able to backtrack through the string "App"
and move on to the second branch. Of course, this is a contrived example — but in general, being able to specify the control flow of your parser like this is exceedingly useful in larger parsers.
What use is this for Rebol?
Quite probably, none at all.
But like I said, I’m hoping potentially some of this might end up useful somewhere. If it can act as inspiration which helps solve any problems, I’m happy.
And, of course, now that we have this thread for it, feel free to ask me any questions about parser combinators you might have!