TAG!s as PARSE keywords vs. Literal Matches

Previously I brought up the idea of "TAG! combinators". Here's what I said, moved from another thread where it was kind of a tangent:

[I saw Haskell...] uses EOF instead of END. END is literate, but one often wants to call variables things like "begin" or "end", or "start" and "end".

This makes me wonder if perhaps we should be a bit more creative in the use of datatypes. If you want to match a WORD! in a dialect, you have to use a tick mark. What if you had to use a tick mark to match TAG!s, and then an ordinary TAG! could have meaning as a rule...such as <end>?

parse "aaa" [data: copy to <end>]

parse "<div>stuff</div>" [x: between '<div> '</div>]

Anyway, that could open up a whole new category of combinators... tag combinators. Maybe <here> is another example, or perhaps <input> if you want to pass the original input position through to a function.

A unifying concept here could be that you'd use it for properties that you don't want to have collide with the names of variables. Consider for example if PARSE tracks the line number, you might want to say something like line: <line> in the middle of a rule.

If you want to match tags by their stringness, it's not like it's all that hard to just say "<div>" in the first place. But quoting is even briefer. Remember that being inert in typical evaluation is not enough in PARSE to mean it's not a rule... INTEGER!, BLOCK!, BLANK! (previously NONE!) and now LOGIC! all have to be quoted to mean their actual literal thing. And quotes are needed on things like WORD!, GROUP!, GET-WORD!, SET-WORD!...and much more.

So is it worth it to get another dialect part, by making you have to quote your tags if you want them to match literally? I kind of feel like it would be. Of course, the concept with UPARSE is that people could disagree and make entirely different answers...

(Note: a downside here is that since TAG!s are strings and not symbols, the comparison costs could be (slightly) higher. However, I've been thinking that to speed up string comparisons they might cache a symbol as part of the comparison process...and clear the symbol cache on each mutation. Then comparisons of strings to symbols could become very fast...so long as the string isn't changing. Wouldn't help if it were looked up in a map, but the optimized native version could do a fast check before hitting the map.)
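To make the caching idea in that note a bit more concrete, here is a rough sketch in Python (not Ren-C, and not how the interpreter actually stores strings). The class name `CachedString` and its methods are made up for illustration; `sys.intern` stands in for a symbol table. The string computes its symbol lazily, keeps it until the next mutation, and compares against symbols by identity.

```python
import sys


class CachedString:
    """Hypothetical mutable string that caches its interned symbol."""

    def __init__(self, text):
        self._chars = list(text)
        self._symbol = None  # nothing cached until the first comparison

    def append(self, more):
        self._chars.extend(more)
        self._symbol = None  # any mutation clears the symbol cache

    def symbol(self):
        # Intern on demand; later calls reuse the cached object
        if self._symbol is None:
            self._symbol = sys.intern("".join(self._chars))
        return self._symbol

    def matches_symbol(self, sym):
        # SYM is assumed to be interned already, so identity is enough
        return self.symbol() is sym


END = sys.intern("end")
tag = CachedString("end")
```

The first `tag.matches_symbol(END)` pays for the interning; repeated checks are a pointer comparison until someone calls `append`.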

I've decided this is too good an idea to pass on.

I particularly like that it makes a new namespace for nouns. It's mean to take away words like "end" from "start/end" or "begin/end" that people might want to use for variables.

So TAG! will not match string content in PARSE/UPARSE. If you want to literally match a tag in a block, use QUOTED! (as you would for other types):

>> uparse [<a> <a> <a>] [some '<a>]
== <a>  ; remember, rules like SOME don't synthesize anything

>> uparse [<a> <a> <a>] [copy some '<a>]
== [<a> <a> <a>]
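For anyone who wants the rule spelled out mechanically, here is a toy model in Python (not Ren-C, and nothing like the real UPARSE internals). `Tag`, `Quoted`, `match_one` and `parse` are names invented for this sketch, and only an `<end>` keyword is modeled. The point is just the split: a quoted value is compared against the input, while a plain tag is looked up by name as a keyword.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tag:
    name: str  # Tag("end") models <end>


@dataclass(frozen=True)
class Quoted:
    value: object  # Quoted(Tag("a")) models '<a>


def match_one(rule, data, pos):
    """Return the new position if RULE matches DATA at POS, else None."""
    if isinstance(rule, Quoted):
        # Quoted values match the next input item literally
        if pos < len(data) and data[pos] == rule.value:
            return pos + 1
        return None
    if isinstance(rule, Tag):
        # Plain tags are keywords, dispatched by name
        if rule.name == "end":
            return pos if pos == len(data) else None
        raise ValueError(f"no tag combinator for <{rule.name}>")
    raise TypeError(f"unsupported rule: {rule!r}")


def parse(data, rules):
    """Apply RULES in sequence; succeed only if all input is consumed."""
    pos = 0
    for rule in rules:
        pos = match_one(rule, data, pos)
        if pos is None:
            return False
    return pos == len(data)
```

In this model `parse([Tag("a")], [Quoted(Tag("a")), Tag("end")])` succeeds, while passing a bare `Tag("a")` as a rule raises, because there is no keyword by that name.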

There's no particular reason a quoted TAG! couldn't also work on string input, matching the tag's molded form. But in a string you can simply write the tag text as a quoted string:

>> uparse "<a>stuff</a>" [between "<a>" "</a>"]
== "stuff"

Isn't that block-result-is-the-result-of-UPARSE convention awesome?

Remember that Combinators are Customizable

If you don't like this idea, you can change it...but I think the TAG!-as-parsing-NOUN concept is something we'll get mileage from.

I mention <line> and <file>. It's nice to have these kinds of things not competing.

And I think it's likely going to look better for the likes of <end>. I'm sympathetic to the complaint that to <end> is a bit more typing than to end, but it seems pretty good.

Maybe a good name for SKIP (matching any token) is <any> ?

>> uparse? [xxx "what about this?" yyy] ['xxx, item: <any>, 'yyy]
== #[true]

>> item
== "what about this?"

It's a little bit more typing, but I've been pretty displeased with the weakness of symbols like ? and *.

And now that the ANY => WHILE transition is done we can kind of go back to reclaiming ANY for its natural sense.

Power users will be able to override all of this.

uparse: specialize :uparse [
    ;
    ; Ugly way of extending a MAP!, there should be nicer ways.
    ;
    combinators: append copy default-combinators reduce [
        'end :default-combinators/<end>
        '* :default-combinators/<any>
    ]
]

>> uparse ['x "y" #z . . .] [word!, demo: *, issue!, to end]
>> demo
== "y"
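The shape of that override, copying the default table, adding aliases, and fixing the table as a parameter, can be sketched in Python (again a toy model, not Ren-C). The dictionary keys, `end_combinator`, `any_combinator` and this `uparse` signature are all assumptions made up for the example; `functools.partial` plays the role of SPECIALIZE.

```python
from functools import partial


def end_combinator(data, pos):
    # Succeeds without consuming anything, only at the end of input
    return pos if pos == len(data) else None


def any_combinator(data, pos):
    # Consumes exactly one item, whatever it is
    return pos + 1 if pos < len(data) else None


DEFAULT_COMBINATORS = {"<end>": end_combinator, "<any>": any_combinator}


def uparse(data, rules, combinators=DEFAULT_COMBINATORS):
    """Toy parser: each rule is just a key into the combinator table."""
    pos = 0
    for name in rules:
        pos = combinators[name](data, pos)
        if pos is None:
            return False
    return pos == len(data)


# Copy the defaults so the aliases stay local to the specialization
custom = dict(DEFAULT_COMBINATORS)
custom["end"] = DEFAULT_COMBINATORS["<end>"]
custom["*"] = DEFAULT_COMBINATORS["<any>"]

my_uparse = partial(uparse, combinators=custom)
```

Because the table is copied first, `my_uparse` understands both spellings while the default `uparse` still rejects a bare `end`.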

But let's try starting to use it. Here are the changes to the tests to get an idea of how this looks. I've left the old END and HERE and SKIP in for now, but I think we should move in this direction.

I also think getting <line> and <column> are pretty important, so that should get worked on...


TAG!s are a good visual cue, but I confess I’m lazy and don’t like that I need to press SHIFT for the angle brackets.

This Has Turned Out To Be Gold

Across the board it is good. <any> has replaced SKIP, and item: <any> makes much more sense than item: skip. (If you're skipping it, why would you be assigning it?)

Note: There were some detours with other considerations. <?> seemed to work better than a lot of things:

>> uparse [1 2] [x: <?>, y: <?>]
>> x 
== 1

RegEx-style dot was also considered.

>> uparse [1 2] [x: <.>, y: <.>]
>> x 
== 1

But <any> won out for its literacy. You can still override it and make your own combinator aliases, the way Topaz uses * for it.

Quoting TAG!s To Match Literally Is No Hassle

There would still be the option of matching TAG!s with quoted tags, as with other quoted types.

uparse [<a> <a>] [some '<a>]

You already have to do this with WORD!s, INTEGER!s, etc. So the minor inconvenience of quoting another type is nothing compared to the value of opening up a new space for parsing keywords.

I think strings being the exception is good: a string rule provides coverage for pretty much all the other types, when all you are doing is matching their stringification anyway:

uparse "<a></a>" [{<a>} {</a>}]

The Lazy Crowd Will Still Have Options...

I have removed the non-TAG! variations from the default.

But it is currently looking like all you need to say is:

end: <end>

And it will work in your file. Right now any plain WORD! that isn't a combinator is looked up as a rule, and if that rule takes no arguments it will work.

That rules out a WORD! that holds an INTEGER! (since the INTEGER! combinator takes a rule as an argument to repeat). So you have to say:

>> num: 3

>> uparse "aaa" [repeat (num) "a"]
== "a"

...as opposed to just [num "a"]. But for the moment, only rules which take arguments are forbidden to be looked up via WORD!.
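Here is how I picture that lookup rule, as a toy Python model rather than actual UPARSE code. `Tag`, `Word`, `arity`, `resolve` and the `KEYWORDS` set are all invented for the sketch, and the arities are assumptions that simply mirror the description above: a tag like <end> needs no arguments, while an integer needs a rule to repeat.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tag:
    name: str  # Tag("end") models <end>


@dataclass(frozen=True)
class Word:
    name: str  # Word("end") models the plain word end


KEYWORDS = {"repeat", "some"}  # words that are combinators themselves


def arity(value):
    """How many rule arguments the value's combinator consumes (assumed)."""
    if isinstance(value, Tag):
        return 0  # <end> stands on its own
    if isinstance(value, int):
        return 1  # 3 "a" needs the rule to repeat
    raise TypeError(f"no combinator modeled for {value!r}")


def resolve(word, variables):
    """Fetch the rule a non-keyword word refers to, if that is allowed."""
    if word.name in KEYWORDS:
        raise ValueError(f"{word.name} is a combinator, not a variable")
    value = variables[word.name]
    if arity(value) != 0:
        raise ValueError(f"{word.name} holds a rule that takes arguments")
    return value


variables = {"end": Tag("end"), "num": 3}
```

So `resolve(Word("end"), variables)` hands back the <end> tag, while `resolve(Word("num"), variables)` is refused, which is the situation that forces the explicit REPEAT.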

In any case: I think that opening up TAG! as a whole category of subdispatched combinator is a winning idea, which will serve in the long run.


In @rgchris's JSON parser, he actually uses HERE as the name of a marked location.

With TAG! combinators, you can do that.

here: <here> (...code that uses here...)

But I actually hadn't retrofitted the R3-Alpha parse to support the tags, so it was still just using the keyword HERE. This reaffirms my belief that tag combinators are the right way to handle these nounish things.

I am doing the retrofit now...which will also turn END to <end>...but I'm defining a PARSE2 mode of the old native parse to make it easier to do bootstrapping. There will be a UPARSE2 as well, so people can tinker with the combinators to their liking (if that's the old way, then fine).
