Parsing Alternates: Should "Must Match To End" Be Considered?

I didn't quite absorb that the following was the case in all PARSEs we know of:

 >> parse [a b] [word! word!]
 == truthy

 >> parse [a b] [word!]
 == falsey

 >> parse [a b] [word! | word! word!]
 == falsey

Distinctly Self-Aware "Terminal Blocks", Could They Be Good?

>> uparse [a b] [word! | word! word!]
== b

I'd argue that there already are "two types of blocks":

  1. There are blocks that give a truthy result when they don't reach the end, but don't have any match failures

  2. There are blocks that can succeed on every match but not reach the end, but be an overall failure

Right now we know these blocks by context. The main rule block you give to PARSE is of type 2, and so is the block given to an INTO.

It's a very small semantic difference to say that "they're the same kind of block, with the decisions about them being made by their caller". Why not allow them to be different kinds of blocks?

It Would Be Weird If It Propagated

Here's an example of the kind of weirdness you'd get into if we said it wasn't a property strictly of the root blocks, but rather "any block that found itself at the end of a chain":

>> parse [] [[(print "A") | (print "B")] [(print "C") | (print "D")]]
A
C
D

Being a "category 2 block", we see how the outer rule block would be creating some irritating asymmetry by saying that any block that knew it wasn't going to reach the end got the privilege.

So I definitely don't like that.

But I'm suggesting a sticky property of blocks, that they effectively already had, being allowed to influence one thing besides whether they are forced to reach the end to succeed... that those blocks also get to try all their alternates before saying they failed.

It's like because they're under more pressure they are given a resource to fall back on to succeed :slight_smile:

Looking for Reasons Why It Would Break

It definitely benefits cases like how I was writing CIRCLED, and now that I think about it I've certainly encountered others.

It does break a kind of universality of understanding, like:

rule: [word! | word! word! (print "breaks faith this can never print")]

Yet the understandings in the PARSE world are a little fuzzy. You might say the existing paradigms break the understanding that if that rule came across [a b] as input that it would succeed. It won't if it's an outermost block...

The new understanding would be "Common subsequences in your rule may wind up being matched alternately in top-level parse contexts." If you don't want it, you have an out...double up your block!

rule: [[word! | word! word! (print "breaks faith this can never print")]]

Now your rule won't be subject to the toplevel alternates exception...if you can think of a really good reason why you wouldn't want it.

But, UPARSE is Configurable, So Why Worry Too Much?

I might like it, and other people might not. So we'll see.

I think I'm at least going to try it out, because it looks like it serves common tasks. I'll get some more data and report back.

A more general idea might be a shorthand for [<end> |], a new kind of alternate.

Right now || is taken for another purpose which is useful...to effectively shift all rules to the left into their own block. I've called it the Inline Sequencing Operator

["a" | "b" || "c"] <=> [[ "a" | "b" ][ "c" ]]

We could have -| for saying match only applies if it's at the end of data, and |- for saying a match only applies if it's at the beginning of data. :-/ Or =| and |=. Or the infamous "flags".

circled: lambda [block [block!]] [
    parse block [return [
         thru into group! [<any> <| (fail "Circle One")]
         maybe [thru group! (fail "Circle One")]
    ]]
]

One of the risks of doing this with a new kind of alternate is that it would wind up used at the end, stuck to the last thing:

 [word! <| word! word! <|]

Something could be picked as a symbolic shorthand for end, and used more places:

 [word! # | word! word! #]

We could reverse the doubled-blocks semantics and say a doubled block explicitly means keep running alternates if the end is not matched:

 [[word! | word! word!]]

@rgchris has historically asked that PARSE not enforce reaching the end as a default, so perhaps:

>> parse [a b] [word!]
== a

>> parse [a b] [[word!]]
; null

It's a never-been-suggested use of the @block type, that doesn't really go with anything else:

>> parse [a b] @[word!]
null

I think I still would favor the idea of having it be implicit from the kind of parser situation you're in, and using [[...]] to break out of the rule. But I will have to try it. Just wanted to throw a few more options out.

1 Like

You asked for my input, maybe you'll regret that. :laughing:

Perhaps I misunderstand but I've always believed that unless the historic parse successfully reached the end of the rule block, and then recursively back up to the top of the rule stack with success (or with return), there's no guarantee that the input conforms to the grammar. So that's one type of information - is the input valid.

The second type of parsing is about matching and extracting parts of the input for information, that's where it it's probably fine to not require parse to reach the end.

I don't really understand. I feel like you're talking about parse looking at the rules holistically (true of uparse??). The sequence of matches in every block in the historic parse will succeed or fail regardless of context. Historic parse doesn't see rule blocks holistically, it's a dumb greedy one-at-a-time instruction machine, which has the benefits of being predictable and unambiguous such that this is an obvious miscoding:

parse [a b] [word! | word! word!]

I wonder if a world in which the rule above is not miscoded, is one with ambiguous grammars, unless maybe there's some longest match operator in use, but I can't claim to be a computer scientist.

Maybe what I'm looking for is a different operator, which might try to integrate into parse.

Typically what this is called is "destructuring", where there are patterns you match and when they match you grab the information out of it.

I had a proposal in progress for something like that, in the vein of making SWITCH more like Rust's MATCH

>> destructure [a b] [
    [word!] => [print "Case one"]
    [word! word!] => [print "Case two"]
]
Case two

There were some other concepts, like using ellipsis to mean "anything after here", and also labeling variables:

>> block: [1020 a 304 <any> #other "items"]

>> destructure block [
    x: [integer!] y: [integer!]

    [x 'a y ...] => [print ["Both ints" x y]]
]
Both ints 1020 304

If DESTRUCTURE (or something like it) were also a UPARSE keyword then maybe that would be the place to get the "try all alternates" from, vs. trying to shoehorn it into some variant behavior of |

1 Like