PARSE vs. Haskell's (X)PARSEC

Rebol's PARSE might seem unique to those coming from "mainstream" languages. But when talking to those who survey alternative languages, it's important not to be ignorant of the work of others! For instance, take a skim through:

Parser Combinators: Parsing for Haskell Beginners

The author of that page makes many of the same arguments Rebol users make about how RegEx is gibberish...and that it's much nicer when you can name and factor out rules. But there are some big differences worth knowing about.

Haskell has formulated its parsers so you can reuse patterns from the rest of the language inside the parse. It's as if in Rebol there didn't need to be syntax in the PARSE dialect for "repeat this rule N times"; you'd just use the same REPEAT native that runs any other code a certain number of times. So making up new "parse keywords" is just writing new functions.

If that sounds super-slick, it is. How did they create a framework so general that it applies to both "function" and "parser plugin"? Well, you can't use just any function this way...you have to use various "monad-aware" operations; e.g. the form of REPEAT you can use is called replicateM. Explanations of this are beyond the scope of this post and/or my understanding...but the upshot is that it's still all quite neat.
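To make that concrete, here's a minimal sketch of the idea using Megaparsec (the `areaCode` name and `Parser` alias are mine, just for illustration):

```haskell
import Control.Monad (replicateM)
import Data.Void (Void)
import Text.Megaparsec (Parsec, parseTest)
import Text.Megaparsec.Char (digitChar)

type Parser = Parsec Void String

-- Exactly three digits, using the ordinary replicateM from
-- Control.Monad; no parser-specific "repeat" keyword needed.
areaCode :: Parser String
areaCode = replicateM 3 digitChar

main :: IO ()
main = parseTest areaCode "415"   -- prints "415"
```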

And as a consequence of making each subparser "just another function": all the optimizations the compiler can do for functions apply to the parse too, and it's blazing fast. Continuing the REPEAT example...if the compiler notices that something is actually a REPEAT of 1, it can optimize that out. While each subrule you give a name to and factor out in Rebol parsing slows it down a little bit at runtime, Haskell can optimize that away at compile-time.

There are a lot of standard parsers defined...so people who want to get something like a signed decimal number out of a stream can do so. If you want hexadecimal numbers, you've got it. (The lack of standard definitions like these has always been a weak point in Rebol.)
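For example, Megaparsec ships these in its lexer module. A small sketch of using them (the `signedDecimal` and `hexNumber` names are made up here):

```haskell
import Data.Void (Void)
import Text.Megaparsec (Parsec)
import Text.Megaparsec.Char (space)
import qualified Text.Megaparsec.Char.Lexer as L

type Parser = Parsec Void String

-- A signed decimal integer, e.g. "-42"; the 'space' argument says
-- what may appear between the sign and the digits.
signedDecimal :: Parser Integer
signedDecimal = L.signed space L.decimal

-- Hex digits, e.g. "1A3F". (L.hexadecimal doesn't consume a "0x"
-- prefix; a format that has one would parse it separately.)
hexNumber :: Parser Integer
hexNumber = L.hexadecimal
```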

Also: Haskell's parsers have a lot of nice features we wish we had. Look at the list from Megaparsec, the current leader of the pack of Haskell parsers. A few:

  • Megaparsec can show the line on which parse error happened as part of parse error.

  • Megaparsec has more powerful combinators and can parse languages where indentation matters out-of-the-box.

  • Megaparsec can recover from parse errors “on the fly” and continue parsing.

  • Megaparsec allows us to conditionally process parse errors inside your parser before parsing is finished. In particular, it's possible to define regions in which parse errors, should they happen, will get a “context tag”, e.g. we could build a context stack like “in function definition foo”, “in expression x”, etc.

How Can Redbols Compete?

Forget the idea of ever competing on performance or rigor. No way.

But since Haskell is compiled, you don't get the opportunity to build parse rules dynamically. So that's something to really stress. I think Ren-C's ability to indicate the GROUP!s that you want to COMPOSE is the sort of thing that would play into a real "magic" demo of how clean it could be to build these things at runtime.

(Note: Clojure has an interesting edge here, because it has monadic parsing libraries, but you can build those parse "rules" and compile them at runtime too. This is the sort of thing Red could, in theory, do.)

The monadic parsers can't have side effects, so each little parsing subrule is a function that gives a specific type of result. The top-level rule you pass in builds a structure out of all these subresults.

Being able to stick in GROUP!s to run arbitrary code and use SET and COPY to write arbitrary variables really does differentiate Rebol...and it can seem very nice. But maybe it would be good if the foundations of Rebol PARSE permitted "building an aggregate value cleanly", with these imperative abilities thrown in as a tool to use if that didn't fit your use case.
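For comparison, here's roughly what that "build an aggregate value cleanly" style looks like in Megaparsec...a hypothetical `Person` record assembled from typed subresults:

```haskell
import Control.Applicative (some)
import Data.Void (Void)
import Text.Megaparsec (Parsec)
import Text.Megaparsec.Char (char, letterChar)
import qualified Text.Megaparsec.Char.Lexer as L

type Parser = Parsec Void String

data Person = Person { name :: String, age :: Integer }
  deriving Show

-- Each subrule yields a typed piece; the top-level rule assembles
-- them into a Person. Nothing is written to outside variables.
person :: Parser Person
person = do                -- e.g. parses "alice:42"
  n <- some letterChar
  _ <- char ':'
  a <- L.decimal
  pure (Person n a)
```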

I thought Haskell parsers were network-stream-capable out of the box...e.g. that they would have an intelligent formulation so you could parse an infinite number of data items one at a time. But nearly all of the parsers--including Megaparsec--require all the data to be in memory from the beginning (the way Rebol PARSE does today). Even the parsers that are "incremental" hold onto all the historical input in case they need to backtrack...whether the rules need it or not. I've found some research pointing to architectures that do this better...see Section 4 of this paper.

Conclusion

Taking time to work in other languages is helping to give me perspective, and get rid of some of the "tunnel vision" of exploring the vagaries of C, C++, and Rebol for too long. I'm looking for the best of what's out there for anything I can steal: type systems, streaming for a composable PORT! design, you-name-it.

Even so...while I do enjoy aspects of hacking around in Haskell, doing so certainly convinces me that there are very few people on the planet who would ever be able to put up with programming in it. Some parts of it do feel fairly natural for my line of thinking...but some of it is just really frustrating and dizzying, and it doesn't help that people really seem to go for absolute brevity in their code instead of labeling their subexpressions. (The compiler will optimize it out, so what are they afraid of?)

Finding out that Haskell parsers don't support incrementality for network streaming out of the box was a bit of a shock to me. I'm going to keep digging until I can find some neat version of a "snapped together" system that can parse a PORT! which has a decompressor codec snapped onto it, running over an https connection. If I can piece together a state-of-the-art system that does that, I'll try to approximate it.


Yes, this is key. Peculiarities aside, Rebol syntax is quite friendly. I wouldn't be interested in big changes to get more performance or rigor. Syntaxes like REBIN or Red/System make me wonder why one wouldn't just switch to Go, Rust, Java etc. to get those features plus community/docs.

I think Java 8 has Streams, which might be worth looking at.

As a Haskeller, I don’t actually think this is valid. I think there is definite untapped potential for Ren-C to be better than Haskell, at least in one of these metrics.

Firstly: note that parser combinators were not always ‘blazing fast’. There was a time when they were considered almost unusably slow. As I’ve said, the Haskell community needed to find the right trade-offs to get the speed up to where it is now… and I’ve definitely met people who think those trade-offs aren’t worth it (even if I personally do).

But also: it’s worth thinking a little closely about how Haskell parser combinators get that ‘rigor’. In large part, they do it by building everything off a set of well-defined abstractions: Monad, Alternative, and so on. This gives a flexible and reliable toolkit for composing parsers together, from which flow many other advantages.
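To give a flavour of what I mean, here’s a tiny sketch (the `bool` parser is invented for illustration). Note that `(<|>)` comes from the Alternative abstraction itself, not from Megaparsec:

```haskell
import Control.Applicative ((<|>))
import Data.Void (Void)
import Text.Megaparsec (Parsec)
import Text.Megaparsec.Char (string)

type Parser = Parsec Void String

-- (<|>) is Alternative's choice operator, shared with lists, Maybe,
-- STM, and so on; the parser library just supplies an instance.
bool :: Parser Bool
bool = (True <$ string "true") <|> (False <$ string "false")
```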

But this approach has some disadvantages as well. The most prominent is that these particular abstractions can be, in some sense, ‘too big’: they can give parsers capabilities you might not want. In this case, the main culprit is Monad, which gives parsers the ability to depend arbitrarily on the results of earlier parsers. On the one hand, this allows context-sensitive parsing; on the other, it makes static analysis impossible.
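Here’s a sketch of that monadic power in action, a made-up netstring-style parser:

```haskell
import Control.Monad (replicateM)
import Data.Void (Void)
import Text.Megaparsec (Parsec, anySingle)
import Text.Megaparsec.Char (char)
import qualified Text.Megaparsec.Char.Lexer as L

type Parser = Parsec Void String

-- A netstring-style payload, e.g. "5:hello": the length read at
-- runtime decides how much the rest of the parser consumes, so the
-- grammar's shape can't be known without running it.
netstring :: Parser String
netstring = do
  n <- L.decimal
  _ <- char ':'
  replicateM n anySingle
```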

By contrast, Rebol parsers are completely specified ahead of time. You can inspect the BLOCK! which is passed to PARSE or UPARSE, and see ahead-of-time the entire control flow of the parser. That gives you the opportunity to do clever optimisations… such as, let’s say, precompiling the whole parser into an ultra-fast LR lookup table. Even in Haskell, many people prefer to use such parser generators over parser combinators, despite the clunkiness it adds; a parser generator embedded into the language itself would be pretty amazing to use.

You can think of more creative uses too, of course. For instance, the Red people have used this capability to generate railroad diagrams, another thing Haskell parser combinator libraries can’t do.

I’m not quite sure what you’re referring to here… but if I understand you correctly, this sounds equivalent in power to what the Monad interface gives you. Most parser combinator libraries will allow you to alter the parser on-the-fly in crazy ways, if you really want to do so.

This is another one of those tradeoffs. To support arbitrary backtracking in a parser combinator system, you need to store the text to backtrack over. You can just not store that text, but then you can’t do any backtracking (or in more technical terms, you’re restricted to LL(1) grammars). So this is another place where Rebol potentially has an advantage, since it can analyse the parser before running it.
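A small Megaparsec sketch of the trade-off (the `number` example is invented):

```haskell
import Control.Applicative ((<|>))
import Data.Void (Void)
import Text.Megaparsec (Parsec, try)
import qualified Text.Megaparsec.Char.Lexer as L

type Parser = Parsec Void String

-- On "123", L.float eats the digits and then fails wanting a '.';
-- input has been consumed, so plain (<|>) refuses to fall back.
-- 'try' saves the consumed text so the decimal branch can rerun
-- over it: exactly the storage cost being discussed.
number :: Parser Double
number = try L.float <|> (fromInteger <$> L.decimal)
```

Leave out the `try` and `number` fails on plain integers, because the float branch has already consumed the digits by the time it gives up.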

A constrained parse dialect could be a parser generator, and the constrained dialect could probably subset the unconstrained one pretty closely.

But unconstrained PARSE as-we've-known-it is quite dynamic. I've tried to nail down some minimal rules for the sake of coherence:

When Should PARSE Notice Changes

My main point of view is that where this fits into the "Loom of Language" is in packaging this kind of methodology in a way that brings it to the common programmer, using naught but mostly-C89. A kind of "organic, locally-sourced software" without inscrutable additives and preservatives. I want it to be constructed obviously, and my performance goals are modest--I see it almost as a toy more than anything.

This is in contrast to the fairly bold claims about the pervasive applicability of the medium believed by Carl and the Red team.

Ultimately if it gets used by others, it won't be up to me. I don't know if you read Graydon Hoare's bit on Rust going in a direction he didn't envision--but that his vision probably had no realistic future:

https://graydon2.dreamwidth.org/307291.html

Maybe my more humble framing isn't the way the world will take it. I don't know, but I'm glad you see potential in it!

One subtle thing to note here is that it is possible to inspect a parse block to determine how dynamic it is. This is something which Haskell simply cannot do at all.

I think this is a really important thing to do. The reason Haskell’s parser combinators are so extensible is that they’re built on abstractions with very well-defined laws, so each combinator can make assumptions about how other parser combinators behave. By contrast, it seems to me that many UPARSE combinators have somewhat idiosyncratic behaviour, which makes them more difficult to combine. Having a more rigorous foundation for UPARSE combinators would be well worth the effort, I think.
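For reference, these are the kind of laws I mean, plus a generic combinator written against nothing but those abstractions (this is essentially how `choice` is defined in the parser-combinators library):

```haskell
import Control.Applicative (Alternative, empty, (<|>))

-- Laws every lawful Alternative must satisfy (read === as
-- "behaves the same"; p, q, r are parsers):
--
--   empty <|> p      ===  p                   -- left identity
--   p <|> empty      ===  p                   -- right identity
--   (p <|> q) <|> r  ===  p <|> (q <|> r)     -- associativity
--
-- Because generic combinators assume only these laws, they compose
-- predictably with parsers they have never seen:
choice :: Alternative f => [f a] -> f a
choice = foldr (<|>) empty
```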

This would explain a lot of our disagreements, then… I don’t particularly like ‘toy languages’, except insofar as they have interesting ideas I can use elsewhere. I find it much more rewarding to think about how to design languages to solve real-world problems.


Certainly "real" problems are being attacked. I just question the idea of solving problems at scale with such a tool.

But many real-world problems don't need to work at scale. I probably shouldn't say they're "toy" problems just because they're kind of small, but it certainly opens the field to using quirkier interpreted languages.