Cheap UPARSE Returns: Rule or Item?

hostilefork · April 17, 2021, 11:20am

The new concept behind UPARSE rules is that all rules (besides a few invisibles like ELIDE) give back values. Then BLOCK! rules that match evaluate to the last rule result.

The catch is that we don't want rules commonly used for matching and seeking to synthesize complicated results. What they give back should be cheap...either something they already have on hand, or an isotope if they can't think of anything useful.

Imagine a BINARY! that has a lot of leading zeros, like maybe a megabyte's worth, and you want to skip that:

parse giant-binary-leading-zeroes [some #{00}, data: copy to end]

It would be annoying if SOME #{00} returned a megabyte-long binary of zeros, that just got discarded. So it has to return something else.

I mentioned that one useful thing that can be returned by a rule like #{00} is the value itself:

 >> uparse #{00} [x: #{00}]
 >> x
 == #{00}

The reason this isn't useless is that it at least tells you which alternative matched in cases like this:

 >> uparse #{00} [x: [#{00} | #{FF}]]
 >> x
 == #{00}

That could fit into the puzzle of being something helpful to know. What's neat though is that you can override this if you like, just as you would in any other evaluation:

 >> uparse #{00} [x: [#{00} (<zero>) | #{FF} (<max>)]]
 >> x
 == <zero>

So now... what should SOME return? Right now I'm thinking the last match:

 >> uparse #{00FF} [x: some [#{00} | #{FF}]]
 >> x
 == #{FF}

Notice that it is returning the #{FF} from the rule... because it can't return the #{FF} in #{00FF} since generically that series could stretch arbitrarily on past the rule.
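
To make the cost argument concrete, here is a rough sketch in Python rather than Ren-C. It is a toy model and not how UPARSE is implemented: the names `lit`, `alt` and `some` are made up for illustration, and a "rule" is assumed to be a function that takes the input and a position and gives back the new position plus a result. It also assumes every inner rule consumes input, so the loop terminates.

```python
# Toy model only (not UPARSE's combinator API): a rule is a function
# (data, pos) -> (new_pos, result), or None if it does not match.

def lit(value):
    """Match `value` at the current position; the result is the rule's own value."""
    def rule(data, pos):
        if data.startswith(value, pos):
            return pos + len(value), value   # hand back what is already on hand
        return None
    return rule

def alt(*rules):
    """Ordered choice: the first alternative that matches wins."""
    def rule(data, pos):
        for r in rules:
            m = r(data, pos)
            if m is not None:
                return m
        return None
    return rule

def some(inner):
    """One or more matches; the result is the last match, nothing is accumulated."""
    def rule(data, pos):
        m = inner(data, pos)
        if m is None:
            return None
        while m is not None:
            pos, result = m
            m = inner(data, pos)
        return pos, result
    return rule

ZERO, MAX = b"\x00", b"\xff"

# A megabyte of leading zeros: SOME skips it without building a megabyte result.
giant = ZERO * 1_000_000 + b"payload"
end_pos, x = some(lit(ZERO))(giant, 0)

# The last alternative that matched is what comes back.
_, last = some(alt(lit(ZERO), lit(MAX)))(b"\x00\xff", 0)
```

In this sketch `x` is the very same one-byte object that was written in the rule, so skipping the zeros allocates nothing proportional to the input.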

However...it would be possible to return information from the input data if it were matching a block input:

 >> uparse [#{00} #{FF}] [x: some [#{00} | #{FF}]]
 >> x
 == #{FF}  ; There's a choice of which #{FF} to return

This is a bit of a conundrum. It seems that the more useful thing to capture would be the data from the input, though that would be inconsistent with what we are forced to do when matching string and binary series.

If we went the consistent route, then getting the input value becomes more complex:

 >> uparse [#{00} #{FF}] [x: some [ahead #{00} skip | ahead #{FF} skip]]
 >> x
 == #{FF}  ; This would be the #{FF} from the input, not the rule

(There could be an operator that encapsulates the AHEAD RULE SKIP pattern more succinctly.)
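
Here is what such an operator might look like in a toy Python model (not Ren-C, and not a proposal for the actual combinator). The names `item`, `ahead`, `skip`, `seq` and `from_input` are hypothetical. Mutable `bytearray` values stand in for BINARY! so that "the one in the rule" and "the one in the input" are equal but distinguishable objects.

```python
# Toy model only: rules over a list ("block") input are functions
# (items, pos) -> (new_pos, result), or None if they do not match.

def item(value):
    """Match one element equal to `value`; the result comes from the rule."""
    def rule(items, pos):
        if pos < len(items) and items[pos] == value:
            return pos + 1, value
        return None
    return rule

def ahead(inner):
    """Succeed if `inner` matches, but do not advance the position."""
    def rule(items, pos):
        m = inner(items, pos)
        return None if m is None else (pos, m[1])
    return rule

def skip(items, pos):
    """Consume one element; the result is that element from the input."""
    if pos < len(items):
        return pos + 1, items[pos]
    return None

def seq(*rules):
    """Run rules in order; the result is the last rule's result."""
    def rule(items, pos):
        result = None
        for r in rules:
            m = r(items, pos)
            if m is None:
                return None
            pos, result = m
        return pos, result
    return rule

def from_input(inner):
    """Hypothetical shorthand for the AHEAD RULE SKIP pattern."""
    return seq(ahead(inner), skip)

rule_ff = bytearray(b"\xff")            # the value written in the rule
input_ff = bytearray(b"\xff")           # an equal value sitting in the input
block = [bytearray(b"\x00"), input_ff]

_, consistent = item(rule_ff)(block, 1)                  # rule's own value
_, captured = from_input(item(rule_ff))(block, 1)        # input's value
```

The plain rule hands back `rule_ff`, while the wrapped one hands back `input_ff`, which is the distinction at stake for block input.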

Alternatively, there could be an acceptance of the inconsistency: block rules capture an item from the input, while string and binary rules get the series out of the rule itself. I'm leaning against this, however...I prefer consistency.

Hopefully getting more experience will inform this...I just wanted to mention the issue.

hostilefork · April 18, 2021, 6:23am

I should point out that this leads to a potential confusion...that KEEP SOME and SOME KEEP are different:

 >> uparse "aaa" [return collect [some keep "a"]]
 == ["a" "a" "a"]  ; one keep each time through the SOME

 >> uparse "aaa" [return collect [keep some "a"]]
 == ["a"]  ; one keep of the overall SOME result

They're also different in Red, though the KEEP SOME case appears more useful:

red>> parse "aaa" [collect [some keep "a"]]  ; Red COLLECT is the implicit RETURN
== [#"a" #"a" #"a"]

red>> parse "aaa" [collect [keep some "a"]]
== ["aaa"]

But Red clearly isn't making every SOME rule generate a series. Instead, KEEP is implicitly doing a COPY across the rule it is given.

It's ad-hoc, though...it depends on how many characters you match. If you match one unit in a string you get a CHAR!, if you match more than one you get a STRING!:

red>> data: parse "ababab" [collect [some keep "ab"]]
== ["ab" "ab" "ab"]

red>> append data/1 "c"
== "abc"

red>> data
== ["abc" "ab" "ab"]

Note you get copies without having asked for them. This is not how plain COLLECT and KEEP work in Red or other Redbols:

red>> data: collect [loop 3 [keep "ab"]]
== ["ab" "ab" "ab"]

red>> append data/1 "c"
== "abc"

red>> data
== ["abc" "abc" "abc"]

Ren-C wouldn't allow mutation in that case, because the LOOP would be an iterative context with an implicit CONST. PARSE should be too, so if you're KEEP-ing the rule directly, you wouldn't be able to modify it. If you actually wanted copies you'd ask for them.
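
The sharing-versus-copying difference in the two Red transcripts can be mimicked in Python. This is only an analogy: `bytearray` is used as a mutable stand-in for a Rebol string, and the variable names are made up.

```python
# Analogy only: bytearray plays the role of a mutable Rebol string.
rule_text = bytearray(b"ab")

# Like plain COLLECT/KEEP in a loop: every slot refers to the same series.
shared = [rule_text for _ in range(3)]

# Like Red's PARSE KEEP of a multi-character match: each slot is a fresh copy.
copied = [bytearray(rule_text) for _ in range(3)]

copied[0].extend(b"c")   # changes only the first copy
shared[0].extend(b"c")   # changes the one series all three slots point at
```

Afterwards `copied` holds "abc", "ab", "ab", while `shared` shows "abc" three times and `rule_text` itself has been modified.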

But also, just compare the behavior of SET of a rule:

red>> parse "aaa" [set x some "a"]
== true

red>> x
== #"a"

To how KEEP of that rule acts:

red>> parse "aaa" [collect [keep some "a"]]
== ["aaa"]

The Uniform Approach Of UPARSE Seems Saner

It certainly seems clear to me that it does not make sense for keep "a" and keep "ab" to keep distinct datatypes. It makes even less sense to have one generate a new copy of data while the other does not. And having what SET gleans from a rule be different than what you would KEEP seems wrong as well.

I think it's alright to ask for a copy if you want one:

>> uparse "aaa" [return collect [keep across some "a"]]
== ["aaa"]

Note the reason I want to call this ACROSS vs. COPY has to do with being less ambiguous about whether you're operating on the rule or the value result of the rule. If the result of SOME "A" that succeeds is "A", you might think COPY SOME "A" would just give you a copy of "A"...that could be useful in and of itself. ACROSS helps you know that you're really talking about the span.

I'm also trying out a shorthand for that with @[...] to get ACROSS behavior on a block rule:

>> uparse "aaa" [return collect [keep @[some "a"]]]
== ["aaa"]

If the rule was already in a block to begin with, it's just one character extra.

It's a tough duality to play with the notion of "rule span" vs. "synthesized rule product", but I think KEEP pretty clearly wants to work with synthesized product. The UPARSE strategy gives a nice uniformity which can tackle complex composition.

>> uparse "abbbbabbab" [return collect [
       some [keep "a", keep/only collect [some keep "b" keep (<hi>)]]
   ]]
== ["a" ["b" "b" "b" "b" <hi>] "a" ["b" "b" <hi>] "a" ["b" <hi>]]

...and yes, there is now rudimentary support for KEEP/ONLY in UPARSE. I'm trying to figure out generic support for refinements in combinators. Each of these things is a giant problem space unto itself.

Long Story Short: KEEP SOME "A" Just Keeps "A"

I think KEEP [X] as a pattern is best seen as keeping the synthesized rule product, without any weird edge cases or special exceptions that do something else.

And I've explained why SOME "A" should not synthesize a new series; it would be wasteful in the general case.

Hence KEEP SOME "A" should not synthesize a new series.
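
The three combinations discussed so far can be lined up in a toy Python model. This is a sketch under assumptions, not UPARSE's implementation: the names `lit`, `some`, `across`, `keep` and `collect` are illustrative, a rule is assumed to be a function `(text, pos, kept)` giving back a new position and a result, and the model does not undo keeps when a later rule fails.

```python
# Toy model only: a rule is a function (text, pos, kept) -> (new_pos, result),
# or None on failure. `kept` is the list a surrounding COLLECT would give back.

def lit(value):
    def rule(text, pos, kept):
        if text.startswith(value, pos):
            return pos + len(value), value    # result is the rule's own value
        return None
    return rule

def some(inner):
    def rule(text, pos, kept):
        m = inner(text, pos, kept)
        if m is None:
            return None
        while m is not None:
            pos, result = m
            m = inner(text, pos, kept)
        return pos, result                    # last match, no accumulation
    return rule

def across(inner):
    def rule(text, pos, kept):
        m = inner(text, pos, kept)
        if m is None:
            return None
        return m[0], text[pos:m[0]]           # explicitly requested span copy
    return rule

def keep(inner):
    def rule(text, pos, kept):
        m = inner(text, pos, kept)
        if m is not None:
            kept.append(m[1])                 # keep the synthesized product
        return m
    return rule

def collect(rule, text):
    kept = []
    return kept if rule(text, 0, kept) is not None else None

some_keep = collect(some(keep(lit("a"))), "aaa")            # one keep per pass
keep_some = collect(keep(some(lit("a"))), "aaa")            # one keep overall
keep_across = collect(keep(across(some(lit("a")))), "aaa")  # span on request
```

Here KEEP has no special cases: it appends whatever its rule evaluated to, and only ACROSS ever builds a new string.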

And it's important to remember that the reason all this is coming together this way is about having good generic answers to things like what happens if you keep a BLOCK! rule.

Compare Red's answer for the following:

red>> parse "aabab" [collect [some keep ["ab" (<AB>) | "a" (<A>)]]]
== [#"a" "ab" "ab"]

to the UPARSE plan...which shows the flexibility:

>> uparse "aabab" [return collect [some keep ["ab" (<AB>) | "a" (<A>)]]]
== [<A> <AB> <AB>]
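
The same result falls out of a small Python sketch in which a block's value is simply the value of the last thing in it. Again this is a toy model with made-up names (`lit`, `then`, `alt`, `collect_some_keep`), with tags represented as plain strings, and it is not how UPARSE is actually built.

```python
# Toy model only: a rule is a function (text, pos) -> (new_pos, result) or None.

def lit(value):
    def rule(text, pos):
        if text.startswith(value, pos):
            return pos + len(value), value
        return None
    return rule

def then(inner, product):
    """Model of a rule followed by a (group): match `inner`, evaluate to `product`."""
    def rule(text, pos):
        m = inner(text, pos)
        return None if m is None else (m[0], product)
    return rule

def alt(*rules):
    """Ordered choice: the first alternative that matches wins."""
    def rule(text, pos):
        for r in rules:
            m = r(text, pos)
            if m is not None:
                return m
        return None
    return rule

def collect_some_keep(inner, text):
    """Model of collect [some keep rule]: keep each pass's synthesized product."""
    kept, pos = [], 0
    m = inner(text, pos)
    while m is not None:
        pos, result = m
        kept.append(result)
        m = inner(text, pos)
    return kept

overridden = alt(then(lit("ab"), "<AB>"), then(lit("a"), "<A>"))
plain = alt(lit("ab"), lit("a"))

tags = collect_some_keep(overridden, "aabab")
texts = collect_some_keep(plain, "aabab")
```

With the groups in place the kept values are the tags; without them the kept values are the matched rule strings, and in neither case does the kept type depend on how many characters were matched.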

And avoiding that implicit weird "copy if more than one unit of series match, otherwise char!" behavior feels like it's for the best. If you want to KEEP ACROSS, then KEEP ACROSS.