When UPARSE began it couldn't do much. So it had one test file %uparse.test.reb, that grew as the number of combinators grew.
This "one test file for all parse functions" approach was all the Saphirion tests ever had. And the ever-growing-file method is also used by the Red parse tests.
But UPARSE now aims higher: each combinator gets its own test file. While some tests will not fit precisely into that--especially larger examples that use many features--it's a better general idea for most of the tests.
Hence you can see all the nice files in the %tests/parse/
directory. And now that UPARSE is capable enough, I've even mined the R3-Alpha PARSE tests and sorted through them...adjusting them for UPARSE and making sure they work.
Mining Red Tests for Insight
This morning I hacked up a converter to translate Red's rather verbose test format to the more spare Rebol test form. At some point we'll be able to run Red's test suite directly "via Redbol", but I just wanted to go through and mine their parse tests for anything useful...and have them translated into the UPARSE way of saying things.
So... MEET %PARSE-TESTS-FROM-RED.TEST.REB!
(Note: Capture was done as of the last change to that file on Oct 16, 2020... commit 32c30072ff215fd4efc0200ab3572ffd7afc8e9f...curious they haven't added any parse tests for a year...?)
No small feat to go through those, but. Here we are. Most of them work--but BREAK and REJECT still need to be thought through as combinators. So the tests that don't work are the ones that use those. I'm starting them out as a commit of one file just to show what was taken if there's any question about that. But the next step is to break them out into the per-combinator files.
They licensed their tests BSD-3 which is Apache-2 Compatible to include or extend. Taking Apache-2 code back to BSD-3 isn't allowed by default. But they have my permission to take whatever tests as BSD-3, if they care. (They can in fact take any of my ideas. Should they ever do so, then them merely knowing that their ideas don't work and mine do is plenty punishment for them. But the non-test-code is LGPL...that includes the code for UPARSE and the web REPL. Borrowing from that means being subject to the LGPL license.)
Changes to remember are:
-
UPARSE uses OPT SOME (or MAYBE SOME) instead of ANY/WHILE (it has a better meaning for ANY and a more consistent meaning for WHILE!). SOME does not have a progress requirement. Use FURTHER if progress is mandated.
-
SET-WORD! must be combined with
<here>
to capture a position, vs. that weird old behavior of set-word alone. Seeking positions is done with the SEEK combinator, not a GET-WORD!. -
UPARSE replaces END with
<end>
and SKIP with<any>
. I think these TAG! combinators are working out great...it's nice to have END free for a variable name...and it opens up the space for more nouns that are "out of band" from variable names. Needing to say'<tag>
to actually match a tag is a small price to pay. -
General Ren-C renamings (string! => text!, number! => any-number!, none! => blank!), historical TRY is TRAP (with a much neater meaning for TRY)...
-
Stopping an alternate match is just done with FALSE. FAIL is reserved for the "raise an error" sense of failing, and by letting LOGIC! decide if the parse should go on or not we have a nice ability to use splicing rules like
:(condition = whatever)
to put a true to go on or a false to stop matching.
Thoughts follow.
You Never Find Out If Red PARSE COLLECT fails
Once a COLLECT keyword is hit, your result will be an array. It will contain whatever got collected up to the point of failure.
red>> parse [1 2 3 <bomb>] [collect [some keep integer!] word!]
== [1 2 3]
red>> parse [1 2 3 <bomb>] [collect [some keep integer! word!]]
== [1 2 3]
red>> parse [1 2 3 <bomb> 4] [collect [some keep integer! word!]]
== [1 2 3]
Ren-C's system lets you have your cake and eat it too... the COLLECT result can be the result and you can even elide matches outside:
ren-c>> parse [1 2 3 ta-da!] [collect [some keep integer!] elide word!]
== [1 2 3]
ren-c>> parse [1 2 3 ta-da!] [collect [some keep integer!] word!]
== ta-da!
ren-c>> parse [1 2 3 <bomb>] [collect [some keep integer!] word!]
; null
And of course you can always store rule results--any rule synthesized result--into a variable:
ren-c>> parse [1 2 3 ta-da!] [block: collect [some keep integer!] word!]
== ta-da!
ren-c>> block
== [1 2 3]
Every option is on the table (except having rules you write as if they must match fail, and not tell you!)
Red PARSE COLLECT/KEEP Is Wacky About Splicing
Here's a collect test that shows some nasty inconsistencies:
red>> parse [a b b b] [collect [skip keep some 'b]]
== [[b b b]]
First of all, that KEEP is keeping a BLOCK!. We know that in ordinary COLLECT if you keep a block it will splice...but this PARSE KEEP is acting like a Rebol2 KEEP/ONLY.
Secondly, the some 'b rule is returning a BLOCK!. But that's not what it does in general. Try an assignment:
red>> parse [a b b b] [skip set var some 'b]
== true
red>> var
== b
So KEEP SOME 'B has a different logic for what SOME 'B synthesizes than SET SOME 'B.
Ren-C is consistent on both fronts. KEEP does as-is by default, and SPREAD is used to request splicing. SOME always synthesizes the value of its last rule unless you ask for a copy. Currently that is done with ACROSS (may name change to COPY after a settling period...the distinction is currently helpful)
ren-c>> parse [a b b b] [collect [<any>, keep across some 'b]]
== [[b b b]]
ren-c>> parse [a b b b] [collect [<any>, keep spread across some 'b]]
== [b b b]
ren-c>> parse [a b b b] [collect [<any>, keep some 'b]]
== [b]
But...how do you splice in Red KEEP?
Er... KEEP PICK and KEEP COPY variants are explained (?) here:
red>> parse [x -- ] [collect [keep to '-- ]]
== [x]
red>> parse [x y -- ] [collect [keep to '-- ]]
== [[x y]]
red>> parse [x -- ] [collect [keep pick to '-- ]]
== [x]
red>> parse [x y -- ] [collect [keep pick to '-- ]]
== [x y]
red>> parse [x -- ] [collect [keep copy _ to '-- ]]
== [[x]]
red>> parse [x y -- ] [collect [keep copy _ to '-- ]]
== [[x y]]
(Sidenote: the use of the _ as a "word that is thrown away" shows that being forced to name arguments to COPY is a bad idea. It's better as a combinator that synthesizes a result that may or may not be stored in a variable.)
The best I can imagine is that this is an attempt to avoid generating large intermediate series. Since Red doesn't worry about "rollback" then if it wants to append things as it goes to the collecting array it can do so with this KEEP PICK.
It doesn't seem to even make any sense...the above suggests that KEEP PICK splices, but apparently not if the result comes from a GROUP! (??). See Red Issue #4198:
red>> parse [][collect keep pick ('a)]
== [a]
red>> parse [][collect keep pick ([a b])]
== [[a b]]
UPARSE is going down a vastly more consistent/usable road. But if rollback isn't a requirement there's no reason the append-with-no-intermediate-series behavior couldn't be mimic'd in the UPARSE2 emulation. Don't know who'd want it, though.
Explicit Advancement Requirement Is Good
These all infinite loop in UPARSE:
parse [a a] [some ['c | not 'b] repeat 2 <any>]
parse "aa" [some [#c | not #b] repeat 2 <any>]
parse "bx" [some [not "b" | <any>]]
parse #{0A0A} [some [#"^L" | not #{0B}] repeat 2 <any>]
And they're trickier than usual, because if you just change that to some further
they won't work... because you're demanding rules like further ['c | not 'b]
make progress some number of times.
*It's a really convoluted way of thinking of what you're doing here as "some number of matches... including a non-match that doesn't advance of 'b counting as at least one match. So the SOME doesn't fail to match, -but- if it doesn't advance, even though the one time succeeded count it as a break of the iteration and yield success"*.
Who wants to think like that? It doesn't really make sense. If you insist on using SOME you have to use an OPT and FURTHER:
parse [a a] [opt some further ['c | not 'b], repeat 2 <any>]
Or write it more coherently by just testing for the NOT in sequence, vs inside the looped alternate:
parse [a a] [opt some 'c, not 'b, repeat 2 <any>]
I know it's just a test, but, I have a feeling that most of these "infinite rules need to break" rules have saner expressions...which makes for a more understandable SOME. Red embraces the advancement rule on purpose; these tests represent "fixes" to what they consider bugs, e.g. Red Issue #3927.
But I think requiring advancement fundamentally limits the perception of what PARSE can do. It can be the control structure of an application or state machine... SOME is a loop. You might want to re-run a rule so long as input is pending on a network port. Especially with GROUP! rule splicing that can inject "true" or "false", you can mix and match rules with imperative code... and having a successful rule decide it was "too successful" is just disruptive.
Having a good debugger someday (it's always "someday"...? :-/) will make it easy enough to find the infinite loops.
We Don't Do /INTO... nor COLLECT INTO
Red is concerned about low level series optimization, even though Gregg has indicated he agrees with me that this is a bad thing to focus on:
Not going to rewrite that post here. But I'll say the idea of being able to make series discontiguous in memory is an interesting one; kind of like how filesystems can split files into chunks. It seems to me that if series could be "chunked" in this way at a system level, we could worry less about these /INTO matters. I'd rather look into that kind of answer vs. burdening users with /INTO.
So there's no COLLECT INTO pattern in UPARSE. They also have a COLLECT AFTER which uses the position after the current series position instead of before the current series position to insert. :-/ The need for these permutations just seems to further demonstrate what an awkward thing this is to build in.
Someone can put this stuff in Redbol's PARSE built on UPARSE with different combinators--if they really want it. Don't think that will be me!
Using Things as Non-Rules Usually Requires GROUP!
In Red, this works:
red>> parse blk: [] [insert 1]
== true
red>> blk
== [1]
Originally I asked "Why treat that 1 as a number, instead of as a rule?" Because it was common to say that things like 3 rule were a repeat count. I pointed out the inconsistency.
My argument against inconsistency still holds. But in this case: I think the literal integer interpretation was the right bias to take. So UPARSE's INTEGER! combinator now evaluates to the integer literally with no implicit repeat.
Red's PARSE INSERT Is Either Arity-1 or Arity-2
As I show above, Red lets you do insert 1
... which makes it look like INSERT takes a single argument. That's the thing to insert. You don't need to tell it what series or position, because it presumes you mean the current series at the parse position.
But weirdly enough, you can provide a position as the first argument to insert as a word. In this case it takes two parameters.
>> parse (data: [a b c]) [
pos-head: ; capture head postion to variable
to end ; seek to end of series
pos-tail: ; capture tail position to variable
insert pos-head 1 ; use arity-2 form of insert, arg 1 is where
]
== true ; parse position was moved past insertion to end
>> data
== [1 a b c] ; data changed as expected
>> pos-head
== [1 a b c] ; pos-head did not move to consistently point at a
>> pos-tail ; pos-tail also did not move, no longer tail
== [c]
There's a little bit of this mechanism that would be hard to do yourself. If you tried to save the parse position, do a SEEK to where you want to insert, then insert, and jump back to the parse position you saved... your parse position wouldn't take into account the size of the insertion. So you'd have to do something more like:
- Save the parse position
- Seek to where you want to insert
- Save the insertion position
- Do the insert
- If parse position was before insertion position, jump back to it, else
- Save the after-insertion position
- Count distance between after-insertion position and before insertion position
- Seek to the saved parse position plus that distance
So this form of INSERT does that for you. But as shown above, the only position that is getting adjusted in this way is your parse position...all other saved positions in the parse will have the wrong index. This is just a general Rebol issue since it's nothing more than arrays and indices at heart.
Here's an example weird Red test of this:
red>> series: [a b c]
red>> letters: [x y z]
red>> parse series [
mark: 'a insert mark letters insert only mark letters 'b 'c
]
== true
red>> series
== [[x y z] x y z a b c]
Here's me rotely translating that parse rule according to my formula above:
[
mark: <here>
'a
; Try equivalent of Red's `insert mark letters`
pos: <here>
seek (mark)
insert spread (letters)
after: <here>
seek (skip pos (index of after) - (index of mark))
; Try equivalent of Red's `insert only mark letters`
pos: <here>
seek (mark)
insert (letters)
after: <here>
seek (skip pos (index of after) - (index of mark))
'b 'c
]
The mechanic isn't rocket science, but it's sure a pain to do by hand. Another approach would be PUSH-POSITION and POP-POSITION operators, where the positions on the stack get updated for insertions.
CHANGE is weirder still, because if you pass it a position in the series it changes between the current parse position and what you pass it.
red>> parse blk: [a b c 1 2 3] [
mark:
some word!
change mark "like so"
some integer!
]
== true
red>> blk
== ["like so" 1 2 3]
Gabriele dropped mutating operators like CHANGE/INSERT/REMOVE from Topaz PARSE entirely. They are troublemakers, for sure. I've kept an open mind but this is of course the kind of debacle you're going to have with them.
But no matter how this is dealt with, I don't think variable-arity INSERT is the right answer. This needs more careful thought if mutations are going to be allowed.
Weird Implicit KEEP on Nested COLLECT
This is the consequence of the fact that COLLECT doesn't really know where it's writing things when you don't use INTO. It bubbles its result out the top of the parse if you merely mention COLLECT. So if you have a nested collect and no INTO, where else would it go?
red>> parse [a a [1 1] b b] [
collect [some [
keep word!
| ahead block! into [collect [some keep integer!]]
]]
]
== [a a [1 1] b b]
It's nicer to have the option to do what you want with it. In Ren-C, such a COLLECT would just be thrown out...you'd have to KEEP it in the outer collect. At which point you could splice it, or not...
ren-c>> parse [a a [1 1] b b] [
collect [some [
keep word!
| subparse block! [keep spread collect [some keep integer!]]
]]
]
== [a a 1 1 b b]
ren-c>> parse [a a [1 1] b b] [
collect [some [
keep word!
| subparse block! [keep collect [some keep integer!]]
]]
]
== [a a [1 1] b b]
Weird TO END Behavior With Strings
In Red Issue #2561 it was observed that the null terminator of strings was getting captured by KEEP TO END in PARSE collect. Clearly a bug.
But the fix yields this weird behavior:
red>> parse "" [collect [keep to end]]|
== []
red>> parse "" [collect [keep pick to end]]|
== []
I ask the usual question of why KEEP should be different from anything else. Let's try:
red>> parse "" [test: to end]
== true
red>> test
== ""
So if that's what TO END synthesized, why isn't it what KEEP would keep?
Often it is said in the Red camp that I "make things too complicated" (or whatever they say, it's patently untrue)--it would be so much easier if they'd just follow the UPARSE schematic. Rules synthesize values, BLOCK! rules synthesize the value of their last synthesized alternate, GROUP!s consume no input and just synthesize a value. And that value is what gets up to KEEP or set in a variable or returned from the overall operation.