!!! BIG NEWS !!!
I've griped about historical PARSE's weaknesses for some time. To name a few:
-
Users can't add new keywords, or tweak existing ones. It's a monolithic chunk of C code that you have to recompile to get any changes to. And since Rebol2's heyday, there have arisen a competitive landscape of "parser combinators" which allow users to build new parse "keywords" and compose parsers out of each other.
-
Haskell has several to choose from I've written about (Parsec, Attoparsec, Megaparsec). These offer composability, performance, rigor, and even error messages to guide you to correcting your input where the rule failed.
-
...but search the web for "parser combinator [L]" for any language [L]. You'll find plenty of other people who have figured out how to break down your parse into components and avoid RegEx... "nom" for Rust, "kern" for Clojure
-
-
Parse rules are very hard to reuse. Because executing a rule doesn't produce a "result value", any side-effects a rule has wind up being written in some global variable.
- The finite set of keywords users are given for data capture are limited to capturing data at the level of the input series (or elements of it).
-
The R3-Alpha codebase for PARSE developed organically over a long period of time, and that shows.
-
While there's lip service to the idea that you can write fluid code using delimiters only when you want to, the irregular implementation gives you errors... if you are lucky (if you're unlucky it just churns along)
r3-alpha>> parse ["aaaa"] [and any-string! into some 2 "a"] ; no AHEAD ** Script error: PARSE - invalid rule or usage of rule: into r3-alpha>> parse ["aaaa"] [and any-string! into [some 2 "a"]] == false ; !!! WHAT? (it doesn't even know what it can't do) r3-alpha>> parse ["aaaa"] [and any-string! into [some [2 "a"]]] == true
-
Despite being an all new codebase, Red is similarly irregular:
red>> parse ["aaaa"] [ahead any-string! into some 2 "a"] *** Script Error: PARSE - unexpected end of rule after: into red>> parse ["aaaa"] [ahead any-string! into [some 2 "a"]] ; needs block == true
-
The best way to start drafting a new-and-much-needed architecture for PARSE is to write it first as usermode code. It will be slow as molasses...but if the design is done right, then it could be gradually optimized. Those optimizations are more likely to give cross-cutting advantages to other code in the system.
Not only that: if the architecture is good enough, it should be able to create backwards-compatible variations... like PARSE2 for Rebol2 PARSE semantics, PARSE3 for R3-Alpha, REDPARSE for Red...
Meet the new PARSE! (codenamed UPARSE)
I actually did write a protozoan draft implementation in a day-ish. What better way to convince people that they can make powerful dialects than to show them that they--too--could have cooked up PARSE in a day?
But let me be completely clear: the ability to even approach writing such a thing is fundamentally built on Ren-C core mechanics. Most notably FRAME! interchangeability with functions and partial specialization. You wouldn't want to try something like this in Rebol2, R3-Alpha, or Red.
Since the original prototype, it has been built up and become a pretty thorough effort!
(UPDATE: While once actually called uparse, the function has now taken the name parse...with the old parse codebase running as parse3. For performance reasons, some code still uses parse3.)
Killer Feature: It Can Synthesize ANY-VALUE!
UPARSE isn't limited to return results of true and false. It actually raises an error if it cannot process its rules completely:
>> parse "abd" ["a" "b" "c"]
** Error: BLOCK! combinator at tail or not thru
This opens up all normal values as fair game for return! Every "combinator function" (historically what you might have thought of as "keywords") has the option of returning a result, and the BLOCK! combinator that succeeds will return its last result.
Literal matches return the rule itself:
>> parse "abc" ["a" "b" "c"]
== "c"
A GROUP! is a combinator that always succeeds, and returns its result:
>> parse "abc" ["a" "b" "c" (1000 + 20)]
== 1020
When you use a SET-WORD! in a rule, what you're actually capturing is the result of the rule on the right.
>> parse "aaab" [stuff: collect [some keep "a"], "b"] ; commas optional, but cool!
== "b"
>> stuff
== ["a" "a" "a"]
The sky is the limit with this approach. When you consider that things like COLLECT are not "built-in" but the sort of combinator you can create after-the-fact, the true power starts to hit you. Here's GATHER, which you can use with EMIT to make objects:
>> parse [x 10 y 20] [gather [
emit reaction: (join "WOW" "!!!")
some [name: word!, emit (name): integer!]
]]
== make object! [
reaction: "WOW!!!"
x: 10
y: 20
]
Customizable, and Easier To Understand Than You'd Think!
UPARSE can take an optional MAP! of "combinators". This is a map from keywords to stylized "parser functions". These functions take the input and other parsers as parameters...then calculates how much of the input was consumed and returns it.
An easy one to look at would be the TALLY combinator, which just counts up the number of times a rule matches. This post isn't intended to be a complete guide to the mechanics, but just to show you the scale of what it takes to add a feature:
tally: combinator [
{Iterate a rule and count the number of times it matches}
return: "Number of matches (can be 0)"
[<opt> integer!]
parser [action!]
<local> count pos
][
append state.loops binding of 'return ; registration needed so BREAK works
count: 0
cycle [
[^ pos]: parser input except e -> [ ; definitional error if parse mismatch
take/last state.loops ; unregister from the BREAK-able stack
remainder: input
return count
]
count: count + 1
input: pos
]
fail ~unreachable~
]
And voila!
>> parse "abbcccabb" [tally ["a" repeat 2 "b" | repeat 3 "c"]]
== 3
But what is this PARSER function parameter to TALLY? It's basically a black box. The UPARSE engine itself figures out what that parameter should be, just like DO usually figures out expressions. So a PARSER is what you get when a COMBINATOR has had its own dependent parser parameters specialized out, to then only require the input.
(If that sounds brain-bending, the only part you need to know is that TALLY is able to get its work done without knowing anything about how to interpret ["a" repeat 2 "b" | repeat 3 "c"]
. UPARSE does the behind-the-scenes work of "fusing" the BLOCK! combinator, TEXT! combinators, and INTEGER! combinator...and passes the implied parser representing that operation.)
Composable Architecture Avoids Redbol's Ad-Hoc Pitfalls
Something impressive about UPARSE is that it uses partial specialization via FRAME! to build coherent rules without blocks. It "just works".
In the introduction I pointed out Red and R3-Alpha stumbling over INTO usage. UPARSE's SUBPARSE (like Topaz'z INTO) is arity-2 so it has an implicit AHEAD as the first parameter, but here we can see it working in the free-form fashion:
>> parse ["aaaa"] [subparse any-string! some repeat 2 "a"]
== "a" ; Success! (e.g. not an error!)
DATATYPEs Have Combinators Too!
The implementation is so modular that every datatype is implemented through its own combinator. And you can override them when you run UPARSE.
Let's try doing something totally weird just to GROUP!, leaving everything else the same
weird-combinators: copy default-combinators ; start with the default list
weird-combinators.(group!): combinator [
{Weird group combinator: test if equal to next input element}
return: "Return a block saying whether it matched or not"
[block!]
value [group!]
<local> result block
][
result: do value
block: if (first input) = :result [
compose [match: (value) = (first input)]
] else [
compose [mismatch: (value) != (first input)]
]
remainder: next input ; skip input either way
return block
]
>> weird-parse: specialize :uparse [combinators: weird-combinators]
>> weird-parse [10 20 30] [collect [keep (2 + 8) integer! keep (5 + 15)]]
== [match: (2 + 8) = 10 mismatch: (5 + 15) != 30]
Whoa... that's wild. I thought I'd demonstrate something radical just to get people thinking, but more obviously useful variants would do things like hook the WORD! combinator to get a running log of rules that trigger.
Most interesting is that BLOCK! itself is a combinator. So you can hijack the parse at its most fundamental level. You could dynamically pre-process the block and pass it to the default combinator.
And in fact, this is already in use to create PARSE2, a Rebol2-compatible PARSE.
Wild New Combinators Are Finessing Old Problems
UPARSE is providing a playground for attacking classic annoyances.
r3-alpha>> parse "We all remember <<this problem>> (?) don't we?" [
thru "<<" copy thorn to ">>" skip skip ; ugh. [to ">>" ">>"] is also ugh
space "(?)" to end
]
== true
r3-alpha>> thorn
== "this problem"
With value-synthesis, problems like this can be cast in terms of combinators inspired by other languages (and beyond). Here's BETWEEN, a 2-argument rule...which I'll put in brackets even though I don't need to:
>> parse "We all remember <<this problem>> (?) don't we?" [
thru "<<" rose: [between <here> ">>"]
space "(?)" to <end>
]
>> rose
== "this problem"
You could have just written that as THRU BETWEEN, but I wanted to show off how it solves the problem from the point of the old understandings. Now I'll show off the ellipsis combinator too.
>> parse "We all remember <<this problem>> (?) don't we?" [ ... rose: between "<<" ">>" space "(?)" ... ] >> rose == "this problem"
@rgchris and I are of the opinion that COLLECT is nigh-useless without backtracking. e.g. in Red:
red>> parse "ac" [collect [
"a" keep (<a1>) "b" keep (<b>)
| "a" keep (<a2>) "c" keep (<c>)
]]
== [<a1> <a2> <c>] ; Bah.
So a careful generic design has been made for that, which is extremely clever (if I do say so). COLLECT is only one of the clients, but an obvious one:
>> parse "ac" [collect [
"a" keep (<a1>) "b" keep (<b>)
| "a" keep (<a2>) "c" keep (<c>)
]]
== [<a2> <c>]
There's Much, Much More...
...and you are invited to help throw in your ideas, as well as document and test this huge new design. It's slow for now, while features like stream parsing continue to be prototyped. But when it's ready, combinators will be native and optimizations pursued!