How To Resolve the SET Inconsistencies in PARSE?

hostilefork · March 3, 2021, 7:30am

Background

The general concept of a parser combinator is that it processes some amount of input and returns the remainder...but also typically returns a value of arbitrary type.

So if you want to get an IP address structure from a string, you would build a combinator for that out of a combinator that gave an integer from a string...and a combinator that could detect a dot and skip over it (if it gave a result you'd throw it out).

In Haskell this is rather straightforward. e.g. with Attoparsec:

-- define the structure of an IP address
data IP = IP Word8 Word8 Word8 Word8

parseIP :: Parser IP
parseIP = do  -- Haskell "do" is syntax sugar for "call these steps in order"
  d1 <- decimal
  char '.'
  d2 <- decimal
  char '.'
  d3 <- decimal
  char '.'
  d4 <- decimal
  char '\n'
  return $ IP d1 d2 d3 d4

As in UPARSE, you see that there's a certain amount of "magic parameterization" going on behind the scenes with these functions. You don't have to explicitly pass the input data to the decimal or char combinators, they effectively "inherit" that from being "inside the Parser monad".

Point being: how much input a combinator consumes and the evaluative product of the combinator are two completely different things. You tend to bubble up to get more and more complex types from simpler ones. And the parse pattern can even proceed to higher levels; you can convert a stream of text into a stream of IP addresses and then parse that stream at a higher level.
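To make the "remainder plus product" idea concrete outside of Haskell, here is a minimal sketch in Python. It is only a model under assumptions: a combinator is taken to be a plain function from (text, position) to (product, new position), and every name in it (decimal, char, parse_ip) is made up for this illustration rather than taken from Attoparsec or UPARSE.

```python
# Hypothetical model: a combinator is a function (text, pos) -> (product, new_pos),
# or None when it fails to match.

def decimal(text, pos):
    """Consume one or more ASCII digits; the product is an int, not a string."""
    end = pos
    while end < len(text) and text[end] in "0123456789":
        end += 1
    if end == pos:
        return None
    return int(text[pos:end]), end

def char(c):
    """Build a combinator that matches exactly one given character."""
    def combinator(text, pos):
        if pos < len(text) and text[pos] == c:
            return c, pos + 1
        return None
    return combinator

def parse_ip(text, pos=0):
    """Four decimals separated by dots; the products of the dots are discarded."""
    dot = char(".")
    octets = []
    for i in range(4):
        if i > 0:
            result = dot(text, pos)
            if result is None:
                return None
            _, pos = result  # only the advance matters here
        result = decimal(text, pos)
        if result is None:
            return None
        value, pos = result
        octets.append(value)
    return tuple(octets), pos

print(parse_ip("192.168.0.1 rest"))  # ((192, 168, 0, 1), 11)
```

The product is a tuple of integers while the input was text, and the position 11 is reported separately. That separation is the whole point.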

But Historical Rebol/Red Has No "Combinator Product"

The only known result from a parse rule is "advanced input". Hence the product has to be determined from that.

If you use SET, you get just the first element of however far the rule got:

 rebol2>> parse "abcde" [set x some skip]
 == true

 rebol2>> x
 == #"a"

Rebol2 errors if you try to use SET with a literal series match, but R3-Alpha/Red chose to allow it...again capturing only a single element:

r3-alpha>> parse "hello" [set x "hello"]
== true

r3-alpha>> x
== #"h"

If you use COPY, you get all of the elements for how far the rule got:

 rebol2>> parse "abcde" [copy x some skip]
 == true

 rebol2>> x
 == "abcde"

When adding COLLECT, Red seemed to go with the COPY semantics for KEEP:

>> parse "aaaa" [collect [keep some "a"]]
== ["aaaa"]

But notice that the exact phrasing of the rule doesn't matter...the only thing that's paid attention to is the absolute amount of progress made on the input:

>> parse "aaaa" [collect [keep some 2 "a"]]
== ["aaaa"]  ; e.g. not ["aa" "aa"]
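A rough way to see why the phrasing of the rule can't matter is to model a historical rule as something that reports only how far it advanced. The Python sketch below does that; it is an assumption-laden model, not how any Rebol or Red implementation is actually written, and the helper names (lit, skip, repeat, some, set_capture, copy_capture) are hypothetical.

```python
# Hypothetical model: a historical rule returns only the new position, or None.

def lit(s):
    """Match a literal string."""
    def rule(data, pos):
        return pos + len(s) if data.startswith(s, pos) else None
    return rule

def skip(data, pos):
    """Match any single element."""
    return pos + 1 if pos < len(data) else None

def repeat(n, inner):
    """Match `inner` exactly n times."""
    def rule(data, pos):
        for _ in range(n):
            pos = inner(data, pos)
            if pos is None:
                return None
        return pos
    return rule

def some(inner):
    """Match `inner` one or more times."""
    def rule(data, pos):
        pos = inner(data, pos)
        if pos is None:
            return None
        while True:
            nxt = inner(data, pos)
            if nxt is None or nxt == pos:
                return pos
            pos = nxt
    return rule

def set_capture(rule, data):
    """SET-like capture: the first element of the span the rule covered."""
    end = rule(data, 0)
    return data[0] if end else None

def copy_capture(rule, data):
    """COPY-like (and KEEP-like) capture: the whole span the rule covered."""
    end = rule(data, 0)
    return None if end is None else data[:end]

print(set_capture(some(skip), "abcde"))                   # a
print(copy_capture(some(skip), "abcde"))                  # abcde
print(copy_capture(some(lit("a")), "aaaa"))               # aaaa
print(copy_capture(some(repeat(2, lit("a"))), "aaaa"))    # aaaa
```

Since the only thing a rule hands back is a position, the capture has nothing but the start and end of the span to work with, so "some a" and "some 2 a" are indistinguishable.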

UPARSE Adds The Combinator Product Twist

What I wanted with UPARSE was to split off this idea of what was produced by a "rule" from having to be the same type as the series elements. If you don't allow that, you wind up having to write labor-intensive code just to extract integers out of a string.

It seems reasonable that a SET-WORD! in UPARSE means you're asking for a result from whatever follows. You don't have to ask for it...if you use a rule without one, it will just consume the input.

I'd imagine that if you say x: a-rule that whatever gets put in X would be the same thing that is kept when you say keep a-rule...which is to say, the combinator product.

But should all rules have combinator products, or should some refuse? Rebol2 wouldn't allow you to do set x thru ... while R3-Alpha and Red do. It doesn't look good to me:

 red>> parse "abcde" [set x thru "e"]
 == true

 red>> x
 == #"a"

But before we go banning "weird sets" we should look at a genuinely useful pattern of being able to set single elements out of alternates:

 rebol2>> parse [1] [set x [tag! | integer!]]
 == true

 rebol2>> x
 == 1

That looks coherent, but it quickly gets non-coherent:

 rebol2>> parse [1 2] [set x [tag! tag! | integer! integer!]]
 == true

 rebol2>> x
 == 1  ; d'oh, what about 2?

This might get you to think that the right thing to do here would be to collect those up as a BLOCK!, e.g. [1 2]. This brings us back to the question of whether every combinator has a result or not. Does TO? THRU? Literals? GROUP!s?

rebol2>> parse [<test>] [set x [(10 + 20) tag!]]
== true

rebol2>> x
== <test>

If the goal was to collect up the combinator products, and GROUP!s had a product, then that would be [30 <test>] instead. But this seems to be getting away from what's useful.

This Is The Current Biggest Issue To Sort Out in UPARSE

I hope I've motivated it to where it's clear why we don't want all the parse process to be trapped speaking only in terms of the elements of the series it's processing. The goal is extraction. You need to be able to abstract that extraction...or this simply is not competitive with the other offerings.

Right now, I've made it so GROUP!, TO, THRU, and literals are combinators with no product. This means they cannot be used with SET, and would not show up in the results gathered by a BLOCK! rule.

OPT will mirror whether the thing it is used with has a product. So x: opt integer! is legal but x: opt "abc" is not, since x: "abc" is not.
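One possible way to model "some combinators have no product, and OPT mirrors its argument" is to carry a flag alongside each combinator. The Python sketch below is only that: a model. The Combinator class, the has_product flag, and the choice to have a non-matching opt yield None are all assumptions made for illustration, not a description of how UPARSE is implemented.

```python
# Hypothetical model of product-bearing vs. product-less combinators.

class Combinator:
    def __init__(self, run, has_product):
        self.run = run                  # (data, pos) -> (product, new_pos) or None
        self.has_product = has_product  # can this be used with an assignment?

def literal(s):
    """Matches literal text; in this model it bears no product."""
    def run(data, pos):
        return (None, pos + len(s)) if data.startswith(s, pos) else None
    return Combinator(run, has_product=False)

def integer():
    """Reads ASCII digits; the product is an int."""
    def run(data, pos):
        end = pos
        while end < len(data) and data[end] in "0123456789":
            end += 1
        return (int(data[pos:end]), end) if end > pos else None
    return Combinator(run, has_product=True)

def opt(inner):
    """Always succeeds; mirrors whether the inner combinator has a product."""
    def run(data, pos):
        result = inner.run(data, pos)
        return result if result is not None else (None, pos)  # None: assumed
    return Combinator(run, has_product=inner.has_product)

def assign(variables, name, inner):
    """Model of `name: rule`; refuses combinators that have no product."""
    if not inner.has_product:
        raise TypeError("rule has no product to assign")
    def run(data, pos):
        result = inner.run(data, pos)
        if result is not None:
            variables[name] = result[0]
        return result
    return Combinator(run, has_product=True)

variables = {}
assign(variables, "x", opt(integer())).run("42", 0)
print(variables["x"])  # 42

try:
    assign(variables, "y", opt(literal("abc")))
except TypeError as error:
    print("refused:", error)  # refused: rule has no product to assign
```

Checking the flag when the rule is built, rather than when it runs, is a design choice in this sketch: it makes the legality of x: opt "abc" independent of whether the input happens to match.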

A similar premise could be applied to SOME and ANY. They could default to gathering their products in a block. Here would be an example with a string as input, but that gathered a block out:

>> uparse "(((1 2 3)))" [some "(", x: some integer!, some ")"]
>> x
== [1 2 3]

If you didn't want products gathered, but actually wanted to just get the input across the span, you would use COPY:

>> uparse "(((1 2 3)))" [some "(", x: copy some integer!, some ")"]
>> x
== "1 2 3"

This makes me question if COPY is the right name for the combinator that is "only interested in how much input was consumed". Perhaps that should be CONSUMED? EATEN? SPAN? ACROSS?

You could get both:

>> uparse "(((1 2 3)))" [some "(", a: across x: some integer!, some ")"]
>> x
== [1 2 3]
>> a
== "1 2 3"

Point being that if you have [1 2 3] and "COPY" it, you would usually expect another BLOCK!, not a textual span "1 2 3" from the input.
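The difference between "gather the products" and "give me the span" can be sketched in a few lines of Python. This is a model under assumptions, not UPARSE itself: all names are hypothetical, the integer combinator is assumed to skip leading spaces (so the span comes out as "1 2 3"), and the product of a sequence is deliberately left unmodeled.

```python
# Hypothetical model: a combinator is (data, pos) -> (product, new_pos) or None.

def integer(data, pos):
    """Skip leading spaces (an assumption), then read digits as an int."""
    start = pos
    while start < len(data) and data[start] == " ":
        start += 1
    end = start
    while end < len(data) and data[end] in "0123456789":
        end += 1
    return (int(data[start:end]), end) if end > start else None

def lit(s):
    """Match a literal; no meaningful product in this model."""
    def rule(data, pos):
        return (None, pos + len(s)) if data.startswith(s, pos) else None
    return rule

def some(inner):
    """One or more matches; the product is a list of the inner products."""
    def rule(data, pos):
        products = []
        while True:
            result = inner(data, pos)
            if result is None or result[1] == pos:
                break
            products.append(result[0])
            pos = result[1]
        return (products, pos) if products else None
    return rule

def across(inner):
    """Ignore the inner product; the product is the span of input consumed."""
    def rule(data, pos):
        result = inner(data, pos)
        return None if result is None else (data[pos:result[1]], result[1])
    return rule

def capture(store, name, inner):
    """Model of `name: rule`: record the product and pass it through."""
    def rule(data, pos):
        result = inner(data, pos)
        if result is not None:
            store[name] = result[0]
        return result
    return rule

def sequence(*rules):
    """Run rules in order; the sequence's own product is not modeled here."""
    def rule(data, pos):
        for r in rules:
            result = r(data, pos)
            if result is None:
                return None
            pos = result[1]
        return None, pos
    return rule

store = {}
rule = sequence(
    some(lit("(")),
    capture(store, "a", across(capture(store, "x", some(integer)))),
    some(lit(")")),
)
result = rule("(((1 2 3)))", 0)
print(store["x"])  # [1, 2, 3]
print(store["a"])  # 1 2 3
```

Here across never looks at what some integer produced. It only compares the position before and after, which is exactly what historical COPY did.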

Something that would be tough would be how to merge single results with multiple results in a block rule.

 >> uparse [<a> 1 2 3] [x: [tag! some integer!]]

Would that be [<a> [1 2 3]] or [<a> 1 2 3]? Mechanically it would pretty much have to be the former, because there's not really any information to tell you whether to do a merge or not. Perhaps there could be a merging operator in the BLOCK! combinator itself...something like ++ to indicate it should stick things together? Going with my proposal that the N rule always makes a BLOCK!...you might ask those blocks to be stuck together somehow:

 >> uparse [<a> 1 2 3] [x: [1 tag! ++ some integer!]]
 >> x
 == [<a> 1 2 3]

Also...this may speak to the need for an ELIDE operation in UPARSE, which stops the BLOCK! combinator from accruing something when you only wanted to match it. :-/

 >> uparse [<a> 1 2 3] [x: [elide tag!, some integer!]]
 >> x
 == [1 2 3]

But I think that implicitly eliding GROUP!s seems to make sense.

Remember the reason this is being reasoned through despite the existence of COLLECT and KEEP... there's a situation with x: [integer! | text! | ...] which has worked historically and needs some kind of reasoning. Maybe this is the wrong line of thinking, and that should be done with a specialized rule for that kind of situation...like MATCH:

 >> uparse [<tag>] [x: match [integer! tag!]]

I don't know yet, but at least we can try some experiments. I just did:

 >> uparse [1 2 3] [x: [integer!, elide integer!, integer!]]
 == [1 2 3]

 >> x
 == [1 3]

It's at least worth giving this kind of idea a shot to see if it's worth considering.
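For anyone who wants to poke at the gathering idea without a UPARSE build, here is a small Python model of a BLOCK! combinator that accrues the products of its steps, plus an ELIDE that suppresses one. It is a sketch under assumptions: the input block is a Python list, tags are modeled as strings like "<a>", and every name (gather, elide, ELIDED and so on) is hypothetical.

```python
# Hypothetical model: a combinator is (data, pos) -> (product, new_pos) or None.

ELIDED = object()  # sentinel: the step matched but contributes no product

def tag(data, pos):
    """Match one tag, modeled here as a string wrapped in angle brackets."""
    if pos < len(data) and isinstance(data[pos], str) \
            and data[pos].startswith("<") and data[pos].endswith(">"):
        return data[pos], pos + 1
    return None

def integer(data, pos):
    """Match one integer element."""
    if pos < len(data) and isinstance(data[pos], int):
        return data[pos], pos + 1
    return None

def some(inner):
    """One or more matches; the product is a list of the inner products."""
    def rule(data, pos):
        products = []
        while True:
            result = inner(data, pos)
            if result is None:
                break
            products.append(result[0])
            pos = result[1]
        return (products, pos) if products else None
    return rule

def elide(inner):
    """Match `inner` but hide its product from the enclosing block."""
    def rule(data, pos):
        result = inner(data, pos)
        return None if result is None else (ELIDED, result[1])
    return rule

def gather(*rules):
    """Gathering BLOCK! model: collect each step's product, in order."""
    def rule(data, pos):
        products = []
        for r in rules:
            result = r(data, pos)
            if result is None:
                return None
            product, pos = result
            if product is not ELIDED:
                products.append(product)
        return products, pos
    return rule

print(gather(integer, elide(integer), integer)([1, 2, 3], 0))
# ([1, 3], 3)
print(gather(tag, some(integer))(["<a>", 1, 2, 3], 0))
# (['<a>', [1, 2, 3]], 4)
```

The second result shows the nesting problem mechanically: gather has no information telling it to splice the list from some into its own list, so without something like a merge operator the inner block stays nested.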

Well, There's At Least A Few Thoughts In All Of That

I'll call this writeup progress. Figuring out why you might want ELIDE in PARSE after all is a bit of a revelation.

This also bolsters my opinion that plain GROUP!s should not be value-bearing...which is a longstanding point that I've tried to make. I think that keep do [...] solves that pretty well, by letting DO be the value-bearing executable form.

If people have samples of code they believe must work a certain way, or any killer new feature ideas, that's useful data.

hostilefork · March 14, 2021, 4:28am

hostilefork:

But before we go banning "weird sets" we should look at a genuinely useful pattern of being able to set single elements out of alternates:

rebol2>> parse [1] [set x [tag! | integer!]]
== true

rebol2>> x
== 1

That looks coherent, but it quickly gets non-coherent:

rebol2>> parse [1 2] [set x [tag! tag! | integer! integer!]]
== true

rebol2>> x
== 1  ; d'oh, what about 2?

The ultimate answer for this was that a BLOCK! rule will evaluate to the result of the last match.

>> uparse? [1 2] [x: [tag! tag! | integer! integer!]]
== #[true]

>> x
== 2

Ultimately this is the return value that bubbles out of UPARSE, you don't even need an assignment:

>> uparse [<a> <b>] [[tag! tag! | integer! integer!]]
== <b>

While it wasn't immediately obvious this should be the answer (for some reason), now that it has been seen it cannot be unseen...and is serving very well!
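The "last match wins" rule is simple enough to model in a few lines of Python. As with any such model, this is a sketch under assumptions rather than the UPARSE implementation: the block is a Python list, tags are strings like "<a>", all names are hypothetical, and this toy uparse requires the rule to reach the end of the input and uses None for failure.

```python
# Hypothetical model: a combinator is (data, pos) -> (product, new_pos) or None.

def tag(data, pos):
    """Match one tag, modeled here as a string wrapped in angle brackets."""
    if pos < len(data) and isinstance(data[pos], str) \
            and data[pos].startswith("<") and data[pos].endswith(">"):
        return data[pos], pos + 1
    return None

def integer(data, pos):
    """Match one integer element."""
    if pos < len(data) and isinstance(data[pos], int):
        return data[pos], pos + 1
    return None

def sequence(*rules):
    """Run rules in order; the product is that of the last rule matched."""
    def rule(data, pos):
        product = None
        for r in rules:
            result = r(data, pos)
            if result is None:
                return None
            product, pos = result
        return product, pos
    return rule

def alternates(*options):
    """Try each option in turn; the first that matches supplies the result."""
    def rule(data, pos):
        for option in options:
            result = option(data, pos)
            if result is not None:
                return result
        return None
    return rule

def uparse(data, rule):
    """Return the rule's product if it matches through the end, else None."""
    result = rule(data, 0)
    if result is None or result[1] != len(data):
        return None
    return result[0]

block_rule = alternates(sequence(tag, tag), sequence(integer, integer))

print(uparse([1, 2], block_rule))        # 2
print(uparse(["<a>", "<b>"], block_rule))  # <b>
```

Note how little machinery this needs compared with gathering products into a block: sequence just remembers the most recent product, and alternates passes through whatever its winning option produced.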