Should "Arity-2 INTO" in UPARSE actually just be... PARSE?

There's a cool feature: UPARSE's INTO is willing to take a rule that produces the data you want to parse into, even if that data has to be synthesized:

>> uparse "(1 1 1)" [
    into [between "(" ")"] [data: <here> (print mold data) some integer!]
]

That will give you:

"1 1 1"  ; notice no ")" because the INTO copied the data spanning the rule
== 1

(I threw in the INTEGER! transcode for fun there. Note that Red allows transcoding rules for datatypes as well, but only on BINARY! input series. That's because they don't have UTF-8 Everywhere; they'd have to rewrite their scanner to process variable-width strings. One of the uncountable Ren-C design advantages...)

More generally, you can pass any variable you want to INTO.

>> uparse [1 2 3] [some integer! into ("aaa") some "a"]
== "a"

But... Couldn't We Just Call That PARSE?

This arity-2 INTO takes an input, and rules. Why isn't that just PARSE?

The difference is that since it's inside a parse already, its first parameter will be treated as a rule and use the synthesized result...unless you put it in a GROUP!. But that's implicit. Maybe call it SUBPARSE to be clear?
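To illustrate that implicitness, here's a sketch (the variable name DATA is just for illustration). The GROUP! evaluates the word to get the input series; without the GROUP!, the word would be looked up as a *rule* instead:

    data: "aaa"

    ; GROUP! evaluates DATA, so its value becomes the sub-parse input
    uparse [1 2 3] [some integer! into (data) some "a"]

    ; without the GROUP!, DATA would be fetched and treated as a rule,
    ; not as the input series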

It would free up the keyword INTO, perhaps to be compatible with the historical single-arity version, for cases that have already determined they're at a series value and don't want to repeat themselves by giving a rule that matches where they already know they are.
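For reference, the historical arity-1 INTO (as in Rebol2 and Red) assumes the current parse position holds a series value and descends into it, a sketch:

    ; historical arity-1 INTO: current position must be at a series,
    ; which becomes the input for the sub-rules
    parse [outer [1 2 3]] ['outer into [3 integer!]]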


I would just call it parse. We're in a different dialect there, so one has to be aware of that anyway.

Dialects all the way down.


I think SUBPARSE makes it clearer what's happening here. I think there's potential for confusion if it's just called PARSE.

Another argument for why to call it SUBPARSE is my proposal for functions automatically becoming combinators if they fit a certain pattern.

That pattern is to take an INPUT and then include something indicating progress among their return results.

The function would then act like a combinator...having one less parameter than usual (e.g. the INPUT is now implicit, coming from the data being parsed).
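As a hypothetical sketch of that pattern (the function name, the use of PACK for the multi-return, and the exact return shape are all assumptions here, not a settled interface):

    ; HYPOTHETICAL: a function that takes an INPUT series and returns
    ; a synthesized result plus a remainder indicating progress, which
    ; could be auto-promoted to act as a combinator
    first-two: func [input [any-series!]] [
        if 2 > length of input [return null]  ; no match
        return pack [copy/part input 2, skip input 2]  ; [result, remainder]
    ]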

This would make PARSE an arity-1 synonym for SUBPARSE <HERE>, not plain SUBPARSE.

Whether UPARSE actually chooses to bend its interface to be willing to act as that combinator or not, it's still a point of coherence. I'd say that pushes it over the edge to say it shouldn't be called just PARSE. SUBPARSE seems good.


There's another nuance here...

PARSE has a variant (currently called MATCH-PARSE) that returns the parse input when the rules match, and null if it does not:

>> parse "aaa" [some #a]
== #a

>> parse "aaa" [some #b]
** Error: PARSE BLOCK! combinator did not match input

>> match-parse "aaa" [some #a]
== "aaa"

>> match-parse "aaa" [some #b]
== ~null~  ; anti

One concern here is that MATCH-PARSE has its parameters backwards from MATCH:

  • MATCH purposefully takes the "rules" of what to match first, because it expects the expression producing the data to be longer than the match rule in the general case.

  • PARSE purposefully takes the "rules" of what to match second, because it expects the rules to be an entire mini-program.
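For instance, contrasting the two argument orders (using MATCH's actual behavior of passing through its input on success):

    >> match integer! 1020          ; rules first, data second
    == 1020

    >> parse "aaa" [some "a"]       ; data first, rules second
    == "a"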

So I'm not sure how good a name it is in the first place. Then, we'd need MATCH-SUBPARSE if we're going to use it as a combinator to operate on series encountered during the parse.

I was thinking of changing the name to just VALIDATE (it's implicit in other dialects like DESTRUCTURE that they are built on top of parse, so why not make the default meaning of LIB.VALIDATE be parse-oriented as well?)

And it's appealing to be able to just say VALIDATE in the PARSE rules.

parse [x: [$y z] "a" <b> %c] [
    words: accumulate [
        &any-word?
        | spread validate block! [some &any-word?]
    ]
    strings: accumulate &any-string?
]

>> words
== [x &y z]

>> strings
== ["a" <b> %c]

Saying SUBVALIDATE doesn't make sense, because we're not in an outer validation. We're in an outer parse.

The reverse applies if you're in a VALIDATE (or a DESTRUCTURE, etc.)... or anything that doesn't make it completely obvious that it's being driven via PARSE.

SUBVALIDATE irks me more than SUBPARSE did, and it moves the needle to where I think this is probably the right way to look at it.

But I'll hold off on renaming SUBPARSE just yet, and see what the consequences are with using the name VALIDATE as both combinator-keyword and top-level function.

I'll add that it might seem you don't need VALIDATE because you could just write instead:

spread subparse block! [some &any-word? <subinput>]

But it's worth mentioning a few reasons why VALIDATE as its own primitive is good...

It's not always as simple as adding to the end:

subparse block! ["foo" "bar" | some &any-word? <subinput>]  ; oops

You need to make sure all the alternatives have finished:

subparse block! [["foo" "bar" | some &any-word?] <subinput>]
subparse block! ["foo" "bar" | some &any-word? || <subinput>]

The rule may not be inline, so you could go from not needing a block at all to needing one, which means adding a tag AND a block:

subparse block! element-rule
subparse block! [element-rule <subinput>]

But overall, it's generally just very helpful to know ahead of time that you expect the input back out if it matches. That aids comprehensibility, in particular if your rules are long:

x: subparse block! [[
      ...
      pages of stuff that scroll off the screen
      ...
 ] <subinput>]

A reader has much more of a clue what's going on with:

x: validate block! [
      ...
      pages of stuff that scroll off the screen
      ...
 ]

They know X can only be the block you just matched and passed to VALIDATE, or the rule fails and the parse moves on.