New, More Powerful Arity-2 INTO in UPARSE

hostilefork · March 13, 2021, 3:27pm

So INTO had a proposal on the table for R3-Alpha to become arity-2, and take a datatype. This simplifies a common pattern of wanting to say what you're parsing into:

ahead text! into [some "a"]  ; arity-1 form
=>
into text! [some "a"]  ; arity-2 form

Neither R3-Alpha nor Red went with this...but Topaz did. (Perhaps it was Gabriele who made the proposal in the first place?)

On the surface it seems to have some pros and some cons, but to be mostly equivalent. But since UPARSE has value-bearing rules, we can take this one step further... instead of limiting the first argument to a datatype!, we can make it any rule that bears a series as its result!

Datatype counts for that, but there's more...

Parsing INTO a Generated Series

This means you're not just restricted to going INTO a series that existed at the start of the parse. You can parse into products.

For example:

uparse "((aaaa)))" [into [between some "(" some ")"] [some "a"]]

The first rule is a BETWEEN, which is somewhat like traditional COPY in that it generates a new series... in this case a series that doesn't have the leading or trailing parentheses.
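To make the mechanics concrete, here is a minimal sketch of the idea in Python rather than Ren-C. It is a toy model under stated assumptions, not how UPARSE is implemented: a rule is a function taking `(series, pos)` and returning `(new_pos, value)` or `None` on failure, and the helper names `lit`, `some`, `between`, `into` and `parse` are hypothetical stand-ins for the combinators discussed above.

```python
# Toy model (hypothetical names): a rule is (series, pos) -> (new_pos, value), or None on failure.

def lit(text):
    """Match a literal string at the current position."""
    def rule(series, pos):
        if series.startswith(text, pos):
            return pos + len(text), text
        return None
    return rule

def some(inner):
    """Match `inner` one or more times; the value is that of the last match."""
    def rule(series, pos):
        result = inner(series, pos)
        if result is None:
            return None
        while True:
            nxt = inner(series, result[0])
            if nxt is None or nxt[0] == result[0]:
                return result
            result = nxt
    return rule

def between(left, right):
    """Match `left`, then capture everything up to the first place `right` matches."""
    def rule(series, pos):
        opened = left(series, pos)
        if opened is None:
            return None
        start = opened[0]
        for end in range(start, len(series) + 1):
            closed = right(series, end)
            if closed is not None:
                # The value is a NEW series, without the delimiters.
                return closed[0], series[start:end]
        return None
    return rule

def into(series_rule, sub_rule):
    """Run `series_rule` for its value, then require `sub_rule` to match all of that value."""
    def rule(series, pos):
        outer = series_rule(series, pos)
        if outer is None:
            return None
        new_pos, product = outer
        inner = sub_rule(product, 0)
        if inner is None or inner[0] != len(product):
            return None
        return new_pos, inner[1]
    return rule

def parse(series, rule):
    """Return the input on a full match, else None."""
    result = rule(series, 0)
    return series if result is not None and result[0] == len(series) else None

rule = into(between(some(lit("(")), some(lit(")"))), some(lit("a")))
print(parse("((aaaa)))", rule))  # prints ((aaaa)))
```

The point of the model is that `into` never looks at a datatype: it only needs its first rule to hand back a series, which is then parsed from its own position zero to its own end.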

If your first rule is complex, this can introduce a somewhat long separation between the parameters. But a general tool exists now for addressing that: the SYM-XXX! substitutions. You can capture the product in a variable first, and then pass it to INTO with @.

Here's why the @ is important:

uparse [| | any any any | | |] [
    content: between some '| some '|
    into @content [some 'any]
]

The BETWEEN captured a block as [any any any]. You don't want the INTO to be using that block as a rule; you're trying to use it as-is, as the series to parse into.

The Current State of SYM-XXX! In Parse

When I first framed the idea of SYM-WORD! I was thinking of it for matching against a literal item:

 data: [not a rule]
 parse [[not a rule]] [@data]  ; the previous idea: match input against literal

But what it's acting like now is not for matching input. Instead, it is a value-bearing rule that consumes no input.

So it's like this:

 >> uparse "aaa" [x: @("not for match"), some "a"]
 == "aaa"

 >> x
 == "not for match"

I threw in a twist, which is that the @ rule will fail if it gets NULL:

>> x: <before>

>> uparse "aaa" [x: @(if false ["not for match"]), some "a"]
; null

>> x
== <before>

This still gives you the power to take the null and keep going. Just combine it with the OPT rule!

>> x: <before>

>> uparse "aaa" [x: opt @(if false ["not for match"]), some "a"]
== "aaa"

>> x
; null
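The three console sessions above can be modeled with a small Python sketch. This is only an illustration under assumptions: rules are functions `(series, pos) -> (new_pos, value)` or `None` on failure, Python's `None` value stands in for NULL, a dictionary stands in for the variable, and every helper name (`group_value`, `opt`, `setvar`, `seq`, ...) is hypothetical rather than part of UPARSE.

```python
# Toy model (hypothetical names): a rule is (series, pos) -> (new_pos, value), or None on failure.

def lit(text):
    def rule(series, pos):
        return (pos + len(text), text) if series.startswith(text, pos) else None
    return rule

def some(inner):
    def rule(series, pos):
        result = inner(series, pos)
        if result is None:
            return None
        while True:
            nxt = inner(series, result[0])
            if nxt is None or nxt[0] == result[0]:
                return result
            result = nxt
    return rule

def seq(*rules):
    """Run rules in order; fail if any fails. The value is that of the last rule."""
    def rule(series, pos):
        value = None
        for r in rules:
            result = r(series, pos)
            if result is None:
                return None
            pos, value = result
        return pos, value
    return rule

def group_value(fn):
    """Model of @(...): consumes no input, and FAILS if the evaluation gives None."""
    def rule(series, pos):
        value = fn()
        return None if value is None else (pos, value)
    return rule

def opt(inner):
    """Model of OPT: if `inner` fails, succeed anyway with a None value and no advance."""
    def rule(series, pos):
        result = inner(series, pos)
        return result if result is not None else (pos, None)
    return rule

def setvar(env, name, inner):
    """Model of a SET-WORD!: assign env[name] only if `inner` succeeds."""
    def rule(series, pos):
        result = inner(series, pos)
        if result is not None:
            env[name] = result[1]
        return result
    return rule

def parse(series, rule):
    result = rule(series, 0)
    return series if result is not None and result[0] == len(series) else None

env = {"x": "<before>"}

# @(...) evaluating to null fails the whole parse, so x is never assigned.
failed = parse("aaa", seq(setvar(env, "x", group_value(lambda: None)), some(lit("a"))))
print(failed, env["x"])  # prints None <before>

# Wrapping it in OPT lets the parse continue, and x receives the null.
passed = parse("aaa", seq(setvar(env, "x", opt(group_value(lambda: None))), some(lit("a"))))
print(passed, env["x"])  # prints aaa None
```

In this model the "fail on null" behavior lives entirely in `group_value`, and `opt` is the only thing that converts that failure back into a successful, null-valued match.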

That makes for good cooperation with KEEP, when you find that you want to keep some calculated material, and you can frame your rule to opt out. So check this out, @rgchris:

>> uparse "aaa" [x: collect [some [
        keep opt @(if false [<not kept>])
        keep skip  ; or whatever we wind up calling "consume next series item"
        keep @(if true [<kept>])
    ]]]

>> x
== [#a <kept> #a <kept> #a <kept>]

Contrast of `@(...)` and `(...)`

The (...) form and @(...) form are similar in the sense that they do not advance the input...or look at the input at all. But (...) is not "value-bearing":

>> uparse "" [x: (1 + 2)]
** Error: UPARSE can't use SET-WORD! with non-value bearing rule (1 + 2)

We don't technically need to distinguish the @(...) and (...) forms, but I think there are good reasons to do so. I think the "fail the rule if null" behavior is a rather neat twist--and you wouldn't want that for a rule that wasn't intended to have its result used (it might incidentally return null, e.g. be an IF whose branch didn't run). But the most obvious justification for differentiation is helping guide the user reading the rules to know what they are looking at.

A value-bearing group risks contaminating aggregate captures:

  x: [integer! (...) | text! (...)]

It would be awkward if you had to explicitly disavow those groups with ELIDE (which may take over the name SKIP):

  x: [integer! elide (...) | text! elide (...)]

It's nice to get a heads up when you are getting things wrong, when you think you're producing a used value but it's getting discarded:

>> uparse "aaa" [some "a", @(if true [<what?>])]
** Error: Result of @(...) rule not consumed, use (...) for non-value-bearing

Plus, there's a natural correlation between the @word and @pa/th forms. Seeing all these as a family instead of having (...) be the odd duck is helpful.

I'm on the fence as to whether it's worth "wasting" @[bl oc k] as a synonym for @([bl oc k]). It has the nice property of not looking at the input like other things in the family, and might help KEEP in particular be lighter in adding material to its collection.

Contrast with `:(...)`

The :(...) form means "use the product of this group as a match rule". That goes along with :word which I am suggesting means "fetch word and use as a rule, just like with ordinary words, but with the exception that it means override any keyword with the same name."

Because this is something like COMPOSE-on-the-fly, the case of returning NULL doesn't fail here. It just splices no rule. You have the option of making a failing rule by evaluating to #[false]...so there's still an option, and that's a popular one, e.g. :(mode = 'some-state)

This obviates the need for IF, or similar constructs. It also leaves open the question of what GET-BLOCK! would mean, since a BLOCK! is already interpreted as a rule...so we can keep thinking about that.

How To Achieve Meaning of "Match Literally"?

We have what appears to be a bit of a hole here, on how we are supposed to match an item of input literally. Let's go back to the example:

 data: [not a rule]
 parse [[not a rule]] [??? data]  ; how to do this match?

Right now we have at least one option: generate a rule that adds a quote level.

 data: [not a rule]
 parse [[not a rule]] [:(quote data)]

That acts as if we said:

parse [[not a rule]] ['[not a rule]]

Using :(quote ...) is not super ideal, so the pattern might need a keyword. But the keyword would need to cooperate with the @ form, because literally data would have data turned into a block combinator by the parse engine before LITERALLY saw it. You'd have to say literally @data to get the block itself passed.

BlackATTR · March 13, 2021, 4:59pm


IngoHohmann · March 15, 2021, 2:22pm

Is this the correct behaviour?

>> uparse "baaabccc" [into [between "b" "b"] [some "a" end] to end]
== "baaabccc"

>> uparse "baaabccc" [into [between "b" "b"] ["a" end] to end]
; null

>> uparse "baaabccc" [into [between "b" "b"] ["a"] to end]
; null

>> uparse "baaabccc" [into [between "b" "b"] ["a" to end] "c" to end]
== "baaabccc"

I think it is, and it is how it works. I just want us all to be on the same page here.

IngoHohmann · March 15, 2021, 2:24pm

I think this should work, too.

>> uparse "aaabccc" [into [to "b"] [some "a"] to end]
** Error: BLOCK! combinator did not yield any results

hostilefork · March 15, 2021, 2:38pm

Looks right to me, can be added to the tests.

Initially I was trying to avoid treating TO X as a "value-bearing combinator", because I didn't like the value it gives in Red:

red>> parse "aaab" [set x to "b", to end]
== true

red>> x
== #"a"

So I made that an error, instead:

 >> uparse "aaab" [x: to "b", to end]
 ** Error: SET-WORD! in UPARSE needs result-bearing combinator

i.e. I wanted to force you to use something like COPY (now ACROSS):

 >> uparse "aaab" [x: across to "b", to end]
 == "aaab"

 >> x
 == "aaa"

However, we could consider the ACROSS to be implicit with TO and THRU. I don't know quite what all the ramifications of that would be offhand.

hostilefork · March 15, 2021, 3:21pm

I just thought of another weird possibility... parsing into the same series with `into here`:

>> uparse "aaabbb" [
     some "a",
     into here ["bbb" (print "yep, Bs")]
     "bbb" (print "Bs again")
 ]
yep, Bs
Bs again
== "aaabbb"

It's like AHEAD in that it doesn't move the parse position, but the subrule has to match to the end. :-/
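A small Python sketch may help show why `into here` behaves like AHEAD with an extra "must reach the end" requirement. It is a toy model under assumptions, not UPARSE's implementation: rules are functions `(series, pos) -> (new_pos, value)` or `None`, and the names `here`, `into`, `seq`, `to_end` and the rest are hypothetical stand-ins.

```python
# Toy model (hypothetical names): a rule is (series, pos) -> (new_pos, value), or None on failure.

def lit(text):
    def rule(series, pos):
        return (pos + len(text), text) if series.startswith(text, pos) else None
    return rule

def some(inner):
    def rule(series, pos):
        result = inner(series, pos)
        if result is None:
            return None
        while True:
            nxt = inner(series, result[0])
            if nxt is None or nxt[0] == result[0]:
                return result
            result = nxt
    return rule

def seq(*rules):
    def rule(series, pos):
        value = None
        for r in rules:
            result = r(series, pos)
            if result is None:
                return None
            pos, value = result
        return pos, value
    return rule

def here(series, pos):
    """Model of HERE: consume nothing; the value is the input from the current position on."""
    return pos, series[pos:]

def to_end(series, pos):
    """Model of TO END: jump to the end of the input, bearing no value."""
    return len(series), None

def into(series_rule, sub_rule):
    """Parse into the value of `series_rule`; `sub_rule` must reach the end of that value."""
    def rule(series, pos):
        outer = series_rule(series, pos)
        if outer is None:
            return None
        new_pos, product = outer
        inner = sub_rule(product, 0)
        if inner is None or inner[0] != len(product):
            return None
        return new_pos, inner[1]  # new_pos is wherever series_rule left off
    return rule

def parse(series, rule):
    result = rule(series, 0)
    return series if result is not None and result[0] == len(series) else None

# HERE does not advance, so the "bbb" after the INTO still has input to match.
ok = seq(some(lit("a")), into(here, lit("bbb")), lit("bbb"))
print(parse("aaabbb", ok))  # prints aaabbb

# With trailing input, the subrule no longer reaches the end, so the INTO fails.
short = seq(some(lit("a")), into(here, lit("bbb")), lit("bbb"), lit("ccc"))
print(parse("aaabbbccc", short))  # prints None
```

In this model the only difference from a lookahead is the `inner[0] != len(product)` check, so a subrule that ends with something like `to_end` turns `into here` back into a plain lookahead.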

IngoHohmann · March 15, 2021, 4:07pm

And I thought:

>> uparse "aaabbbccc" [
     some "a",
     into here ["bbb" to end (print "yep, Bs")]
     "bbb" (print "Bs again")
     "ccc" (print "Here be Cs")
 ]
yep, Bs
Bs again
Here be Cs
== "aaabbbccc"

or better yet

>> uparse "aaabbbccc" [
     some "a",
     into here ["bbb" done (print "yep, Bs")]
     "bbb" (print "Bs again")
     "ccc" (print "Here be Cs")
 ]
yep, Bs
Bs again
Here be Cs
== "aaabbbccc"

Where done means: If I got this far, I don't care about the rest of the input and call it a match.

hostilefork · April 11, 2021, 10:19pm

I gave my argument for why I didn't like to "b" being the kind of rule that automatically gave a value back...because if it generated a product, then that product would frequently need to be discarded.

But with the way the rules are currently working, they can know whether they are supposed to be producing a value or not. This avoids having to make separate "cheap" rules with skip in their name, distinct from "expensive" ones that don't have skip in the name, as some parser combinator libraries do.

Hence I figured there was no harm in having TO become able to bear a value if it wanted to. Since INTO was the kind of rule that was asking for a place to look, it would qualify for switching TO into that mode. Easier to type than into [across to "b"]. So I made TO bear a value if there was potential for its usage.

But given what I'm looking at with yielding the last result of a PARSE rule by default for var: [your rule here], I think this is the wrong tradeoff.

That takes away the certainty of "rules knowing when they need to generate a product or not". We can tell with a rule with no set-word that it won't assign a value, but once you combine a block with a SET-WORD!, no rule in the block a priori knows if it's the last. Trying to enforce that it could know when it was the last--by means of lookahead and analysis to check if everything afterward was invisible--would mean we'd have to prohibit rules with side effects during argument gathering (as well as prohibit the phenomenon I call "opportunistic invisibility", which is less critical). I think that for the kind of freeform experience this is shooting for, we want to err on the side of imperative freedom...even if it costs us optimality.

But even so: if casual usages of seeking/matching rules are synthesizing new values when they're thrown away 90% of the time, that's worse than calling out the 10% where you do want it.

Instead, what I might propose is that @[...] become the syntax for ACROSS. This means instead of writing:

uparse "aaabccc" [into [across to "b"] [some "a"] to end]

You could say:

uparse "aaabccc" [into @[to "b"] [some "a"] to end]

This would become a means of converting any rule into a value-bearing rule, even if all of its component rules were "invisibles".

So I'm going to undo my addition that allowed the into [to "b"], and substitute the ability to say into @[to "b"]. Then see how well things can work with parse rules evaluating to their last value...
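As a rough illustration of the tradeoff, here is a Python sketch in which seeking rules bear no value by default and an explicit wrapper copies the matched span only when asked. It is a toy model under assumptions: rules are functions `(series, pos) -> (new_pos, value)` or `None` on failure, a `None` value means "no product", and `across`, `to`, `into` and the other names are hypothetical stand-ins, with `across` playing the role proposed for @[...].

```python
# Toy model (hypothetical names): a rule is (series, pos) -> (new_pos, value), or None on failure.

def lit(text):
    def rule(series, pos):
        return (pos + len(text), text) if series.startswith(text, pos) else None
    return rule

def some(inner):
    def rule(series, pos):
        result = inner(series, pos)
        if result is None:
            return None
        while True:
            nxt = inner(series, result[0])
            if nxt is None or nxt[0] == result[0]:
                return result
            result = nxt
    return rule

def seq(*rules):
    def rule(series, pos):
        value = None
        for r in rules:
            result = r(series, pos)
            if result is None:
                return None
            pos, value = result
        return pos, value
    return rule

def to(target):
    """Model of TO: seek to where `target` matches without consuming it. Bears no value."""
    def rule(series, pos):
        for at in range(pos, len(series) + 1):
            if target(series, at) is not None:
                return at, None  # cheap: nothing is copied
        return None
    return rule

def to_end(series, pos):
    return len(series), None

def across(inner):
    """Model of ACROSS (the proposed @[...]): the value is a copy of whatever `inner` spanned."""
    def rule(series, pos):
        result = inner(series, pos)
        if result is None:
            return None
        return result[0], series[pos:result[0]]  # the copy happens only here
    return rule

def into(series_rule, sub_rule):
    def rule(series, pos):
        outer = series_rule(series, pos)
        if outer is None:
            return None
        new_pos, product = outer
        if product is None:
            # Assumption: the model raises, loosely mirroring the error shown earlier in the thread.
            raise ValueError("rule did not yield a series to parse INTO")
        inner = sub_rule(product, 0)
        if inner is None or inner[0] != len(product):
            return None
        return new_pos, inner[1]
    return rule

def parse(series, rule):
    result = rule(series, 0)
    return series if result is not None and result[0] == len(series) else None

rule = seq(into(across(to(lit("b"))), some(lit("a"))), to_end)
print(parse("aaabccc", rule))  # prints aaabccc
```

Under this model a bare `to` never allocates, and the cost of synthesizing a series is paid only where `across` is written out, which is the 10% case being called out above.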

IngoHohmann · April 13, 2021, 4:58pm

As always, gradual improvements move in the right direction in the end.