TAG!s as PARSE keywords vs. Literal Matches

hostilefork · March 14, 2021, 4:35am

Previously I brought up the idea of "TAG! combinators". Here's what I said, moved from another thread where it was kind of a tangent:

[I saw Haskell...] uses EOF instead of END. END is literate, but one often wants to call variables things like "begin" or "end", or "start" and "end".

This makes me wonder if perhaps we should be a bit more creative in the use of datatypes. If you want to match a WORD! in a dialect, you have to use a tick mark. What if you had to use a tick mark to match TAG!s, and then an ordinary TAG! could have meaning as a rule...such as <end>?

parse "aaa" [data: copy to <end>]

parse "<div>stuff</div>" [x: between '<div> '</div>]

Anyway, that could open up a whole new category of combinators... tag combinators. Maybe <here> is another example, or perhaps <input> if you want to pass the original input position through to a function.

A unifying concept here could be that you'd use it for properties that you don't want to have collide with the names of variables. Consider for example if PARSE tracks the line number, you might want to say something like line: <line> in the middle of a rule.

If you want to match tags by their stringness, it's not like it's all that hard to just say "<div>" in the first place. But quoting is even briefer. Remember that being inert in typical evaluation is not enough in PARSE to mean it's not a rule... INTEGER!, BLOCK!, BLANK! (previously NONE!) and now LOGIC! all have to be quoted to mean their actual literal thing. And quotes are needed on things like WORD!, GROUP!, GET-WORD!, SET-WORD!...and much more.

So is it worth it to get another dialect part, by making you have to quote your tags if you want them to match literally? I kind of feel like it would be. Of course, the concept with UPARSE is that people could disagree and make entirely different answers...

(Note: a downside here is that since TAG!s are strings and not symbols, the comparison costs could be (slightly) higher. However, I've been thinking that to speed up string comparisons they might cache a symbol as part of the comparison process...and clear the symbol cache on each mutation. Then comparisons of strings to symbols could become very fast...so long as the string isn't changing. Wouldn't help if it were looked up in a map, but the optimized native version could do a fast check before hitting the map.)
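
The caching idea in that note could look roughly like the sketch below. This is a Python model of the concept, not Ren-C's internals: CachedString, Symbol, symbol() and the slow_compares counter are all made-up names for illustration, and the assumption is that symbols are interned so two symbols with the same spelling are the same object.

```python
_SYMBOLS = {}  # spelling -> the single Symbol object for that spelling


class Symbol:
    """Hypothetical interned symbol; compare these by identity."""

    def __init__(self, spelling):
        self.spelling = spelling


def symbol(spelling):
    """Intern a spelling so equal symbols are always the same object."""
    sym = _SYMBOLS.get(spelling)
    if sym is None:
        sym = _SYMBOLS[spelling] = Symbol(spelling)
    return sym


class CachedString:
    """Mutable string that remembers the last symbol it was found equal to."""

    def __init__(self, text):
        self._chars = list(text)
        self._symbol = None      # cached symbol, or None if not known
        self.slow_compares = 0   # counts character-by-character comparisons

    def append(self, more):
        self._chars.extend(more)
        self._symbol = None      # any mutation clears the cache

    def matches(self, sym):
        if self._symbol is not None:
            # Fast path: symbols are interned, so identity decides it.
            return self._symbol is sym
        self.slow_compares += 1
        if "".join(self._chars) == sym.spelling:
            self._symbol = sym   # remember the symbol for next time
            return True
        return False
```

Once a tag like <end> has been compared against the "end" symbol a single time, later comparisons are identity checks until the string is mutated. Only a successful match fills the cache, so a string that never equals any symbol keeps taking the slow path.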

I've decided this is too good an idea to pass on.

I particularly like that it makes a new namespace for nouns. It's mean to take away words like "end" from "start/end" or "begin/end" that people might want to use for variables.

So TAG! will not match string content in PARSE. If you want to literally match a tag in a block, use QUOTED! (as you would for other types):

>> parse [<a> <a> <a>] [some '<a>]
== <a>  ; remember, rules like SOME don't collect their matches, you just get the last one

>> parse [<a> <a> <a>] [copy some '<a>]
== [<a> <a> <a>]

There's no particular reason not to have it work in strings too for finding the molded form:

>> parse "<a>stuff</a>" [between '<a> '</a>]
== "stuff"

But when the input is a string, you can just write the tag as a string in quotes, which may be clearer:

>> uparse "<a>stuff</a>" [between "<a>" "</a>"]
== "stuff"

Isn't that block-result-is-the-result-of-UPARSE convention awesome?

Remember that Combinators are Customizable

If you don't like this idea, you can change it...but I think the TAG!-as-parsing-NOUN concept is something we'll get mileage from.

I mention <line> and <file>. It's nice to have these kinds of things not competing.

And I think it is likely going to look better for the likes of <end>. I'm sympathetic to the point that "to <end>" is a bit more typing than "to end", but it seems pretty good.
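
To make the split between keyword tags and literal tags concrete, here is a small model of the dispatch. It's a Python sketch under my own assumptions, not how UPARSE is implemented: Tag, Quoted, match_rule, parse and DEFAULT_COMBINATORS are hypothetical names, and the "parser" only handles a flat sequence of rules over a list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tag:
    name: str        # models a TAG! value such as <end> or <a>


@dataclass(frozen=True)
class Quoted:
    value: object    # models a quoted value such as '<a>


def end_combinator(data, pos):
    # Succeeds only at the end of the input; consumes nothing.
    return pos if pos == len(data) else None


def here_combinator(data, pos):
    # Always succeeds; consumes nothing.
    return pos


DEFAULT_COMBINATORS = {"end": end_combinator, "here": here_combinator}


def match_rule(rule, data, pos, combinators=DEFAULT_COMBINATORS):
    """Return the new position on success, or None on failure."""
    if isinstance(rule, Quoted):
        # Quoted values match literally, tags included.
        if pos < len(data) and data[pos] == rule.value:
            return pos + 1
        return None
    if isinstance(rule, Tag):
        # Plain tags are keywords, looked up by name in the combinator map.
        combinator = combinators.get(rule.name)
        if combinator is None:
            raise KeyError(f"no tag combinator named <{rule.name}>")
        return combinator(data, pos)
    raise TypeError(f"unsupported rule: {rule!r}")


def parse(data, rules, combinators=DEFAULT_COMBINATORS):
    """Apply the rules in order; succeed only if all input is consumed."""
    pos = 0
    for rule in rules:
        pos = match_rule(rule, data, pos, combinators)
        if pos is None:
            return False
    return pos == len(data)
```

With this model, parse([Tag("a"), Tag("a")], [Quoted(Tag("a")), Quoted(Tag("a")), Tag("end")]) succeeds, while an unquoted Tag("a") in the rules raises an error because there is no combinator by that name. Passing a different map as the combinators argument is the analogue of customizing the combinator set.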

Power Users Can Override It

parse: specialize :parse [
    ;
    ; Ugly way of extending a MAP!, there should be nicer ways.
    ;
    combinators: append copy default-combinators reduce [
        'end :default-combinators.<end>
        '* :default-combinators.one
    ]
]

>> parse ['x "y" #z . . .] [word!, demo: *, issue!, to end]
>> demo
== "y"

But let's try starting to use it. Here are the changes to the tests to get an idea of how this looks. I've left the old END and HERE and SKIP in for now, but I think we should move in this direction.

I also think getting <line> and <column> is pretty important, so that should get worked on...

BlackATTR · March 14, 2021, 1:14pm

TAG!s are a good visual cue, but I confess I’m lazy and don’t like that I need to press SHIFT to type the angle brackets.

hostilefork · August 7, 2021, 12:17pm

This Has Turned Out To Be Gold

Quoting TAG!s to match is literally no hassle.

uparse [<a> <a>] [some '<a>]

You already have to do this with WORD!s, INTEGER!s, etc. So the minor inconvenience of quoting another type is nothing compared to the value of opening up a new space for parsing keywords.

I think strings being the exception is good. It provides coverage for pretty much all the other types...when all you are doing with them is stringification anyway:

uparse "<a></a>" [{<a>} {</a>}]

The Lazy Crowd Will Still Have Options...

I have removed the non-TAG! variations from the default.

But it is currently looking like all you need to say is:

end: <end>

And it will work in your file. Right now any plain WORD! that isn't a combinator is looked up as a rule, and if that rule takes no arguments it will work.
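
The lookup described here can be modeled in a few lines. This is a Python sketch with made-up names (resolve, combinators, variables), and it represents words and tags as plain strings; it only shows the fallback order, not argument handling.

```python
def resolve(rule, combinators, variables):
    """Follow word -> rule bindings until a combinator is found.

    `combinators` maps names like "<end>" to combinator functions;
    `variables` maps plain words to whatever rule they were assigned.
    """
    seen = set()
    while rule not in combinators:
        if rule in seen or rule not in variables:
            raise LookupError(f"{rule!r} is neither a combinator nor a rule")
        seen.add(rule)           # guard against a: b, b: a cycles
        rule = variables[rule]   # e.g. "end" -> "<end>"
    return combinators[rule]
```

So with variables = {"end": "<end>"}, asking for "end" finds the same combinator as asking for "<end>" directly, which is the effect of writing end: <end> in your file.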

In any case: I think that opening up TAG! as a whole category of subdispatched combinator is a winning idea, which will serve in the long run.

hostilefork · November 29, 2021, 5:25am

In @rgchris's JSON parser, he actually uses HERE as the name of a marked location.

With TAG! combinators, you can do that.

here: <here> (...code that uses here...)

But I actually hadn't retrofitted the R3-Alpha parse to support the tags; it just used the keyword HERE. This reaffirms my belief that tag combinators are the right way to handle these nounish things.

I am doing the retrofit now...which will also turn END to <end>...but I'm defining a PARSE2 mode of the old native parse to make it easier to do bootstrapping. There will be a UPARSE2 as well, so people can tinker with the combinators to their liking (if that's the old way, then fine).