Combinator or Plain Function? Reusing Features In PARSE.

hostilefork · March 2, 2021, 8:19pm

Early on, I wondered why we couldn't invoke an ordinary function in PARSE, where its arguments were gathered by rules.

For instance: What if you want to call SPLIT on part of the input. Here's a particularly laborious way of doing it:

parse "(((1|2|3)))" [
    some "("
    let start: here
    to ")"
    let limit: here
    items: (split/part start "|" limit)
    some ")"
]

But imagine if we could avoid creating those intermediate START and LIMIT variables, with something more like:

parse "(((1|2|3)))" [some "(", items: split [copy to ")"] ("|"), some ")"]

The concept would be that the arguments to SPLIT would be gathered by PARSE rules. BLOCK!s would behave as rules and pass series, GROUP!s would go through DO.

Another idea could be that any function which takes a /PART would have that /PART implicitly provided by the limit of the parse range of the parameter.

Or even weirder: Maybe functions like SPLIT are expressed as COMBINATOR to begin with, and we then have a distinct behavior for combinators when called outside of PARSE. That would need more thinking through.

Another Example: ENBIN and DEBIN

A frequent lament of mine was how BINARY! parsing has really underperformed the ideal of what it should be capable of. The addition of ENBIN and DEBIN points in the right direction. (Going to use the $ format to start warming you up to the idea of the change, so that #{...} can be used for an alternative TOKEN! representation...I actually think it looks better!)

If we left it as a function that needed a rule to pass as the series parameter, it might wind up looking like:

parse ${FFFF0001FFFFFF} [
    some ${FF}
    value: debin ([BE + 2]) here
    some ${FF}
]

If it were a combinator, then we could assume it communicated the knowledge of how far it processed back to us vs. being solely a consumer-of-input-and-producer-of-value. And perhaps it could be stylized to quote a BLOCK! and know it wasn't to be interpreted as a rule (?)

parse ${FFFF0001FFFFFF} [
    some ${FF}
    value: debin [BE + 2]
    some ${FF}
]

As another issue: we can abstract this rule:

uint2: [debin [BE + 2]]

But then we might need to think about how would we abstract that to a UINT rule that took an N parameter, so we could write:

parse ${FFFF0001FFFFFF} [
    some ${FF}
    value: uint 2
    some ${FF}
]

How Permissive Should PARSE Syntax Get?

Carl was concerned about the idea that PARSE rules were hard enough to understand as-is...and allowing you to call functions could make it worse.

I definitely empathize with the idea that one should be careful here. And up until now, I've been saying that the mechanism for doing abstraction of this sort is that you would always go through some kind of "abstraction operator"...tentatively :(...)

parse ${FFFF0001FFFFFF} [
    some ${FF}
    value: :(uint 2)
    some ${FF}
]

I like having the option of being able to build a rule dynamically like that...off the cuff...from regular code. The example of being able to make LOGIC! expressions that way that control the continuation-or-not of the PARSE is particularly inspired, imo.

But I don't like being disallowed from doing such invocations without decorating them. Combinators are offering a way forward...so maybe if something is declared with the COMBINATOR generator then it is allowed if it is found via just normal binding.

A Lot To Think About...

It should be easy to encode and decode things into parse streams, and people should be able to plug in new decoders easily. Nice to be able to experiment and hack up ideas quickly...(!)

hostilefork · March 3, 2021, 4:32pm

This may be the most promising idea...although it is strange. For starters:

When a plain function wants to operate on part of the input, it uses a refinement /PART and passes a position it calculates.
When a parser combinator wants to operate on part of the input, the indication of how much input to process typically comes from another combinator which says how far to advance.

As a simple example, consider COPY. When used within UPARSE it doesn't need an argument to know what series to copy from...that's implicit (the INPUT series). But that has to be explicit when called normally. Then the UPARSE argument actually corresponds to the /PART refinement of a normal invocation...expressed as a rule. But it would be a refinement called normally.

Next we have to consider the result. The first implementation of combinator made every combinator return an updated position as the main return...and then a combinator result an optional second result. It made sense to break down the results this way because it's necessary to be able to test whether a combinator had a result requested or not...otherwise you'd be wasting the generation of results.

However, a plain call wants the combinator product. If we rearrange it so that product is the main return, we lose the ability to tell whether it was requested. That means the combinator protocol would need a separate refinement signal to see whether it was being /IGNORE-d or not (the default sense would have to be not-ignored for it to work as an ordinary function).

Next there's the position of the input series if it's not implicit. I give the example of debin above, where it takes a BLOCK! that gives the debin specification. When calling that outside of a parse rule, I think it's best if the block is the first parameter, with the input series next. But where the input series appears could vary.

With the combinator product turned into the main return (with an /IGNORE refinement to ask for it not to be given), the updated position has to become a separate result. We might be generous and allow a function that takes a /PART refinement to avoid giving back a result, and assume the new position is the end of the part...but that would break for cases like CHANGE.

It's puzzling, but I think worth thinking about. Otherwise we'll end up creating two parallel versions of every function that processes series input and produces values...the parse version, and the non-parse version. DEBIN is an excellent example to think about, because it is so useful for binary parsing.

In terms of finding a namespace for these operations, it makes me wonder about the idea of being able to unify the default combinator set with the specific instance of the parser.

parse: make-into-combined-map-and-function :parse-core default-combinators

parse "aaa" [some "a", some "b"]  ; run the PARSE function

parse.debin [BE + 2] ${0001}  ; direct call to a single combinator
; equivalent to `(let x, parse ${0001} [x: debin [BE + 2]], x)`

I'll point out that this is also touching on the question of CODECs...which may overlap with combinators.

Lots to think about. :-/

hostilefork · April 13, 2021, 9:12am

So the philosophy I've shifted to on this is that common combinators (e.g. TO and SOME, as opposed to COLLECT) return results that are already on hand and do no allocations:

>> did uparse "aaabbb" [x: some "a", y: some "b"]
== #[true]

>> x
== "a"

>> y
== "b"

What it's actually passing back to you is the rule it matched; the argument to SOME. This might seem less useful than passing back something like x = "aaa" or x = [#"a" #"a" #"a"] that actually gave you more information... and yes, it is less useful. But it is cheap.

The reason cheap is good is because quite often you aren't capturing things:

>> did uparse "aaabbb" [some "a", some "b"]
== #[true]

And if we're going to unify combinators and ordinary functions, it becomes more practical to do so when we make the "value bearing" result correspond to the result of the function.

Why Don't Functions Know If Their Result is Being Used?

We can't reliably tell functions whether their main result is being used because the evaluator doesn't reliably know. When you write an expression like:

  eval [collect [keep <keep> keep <me>], (elide print "Vanish!)]

There's no psychic way for us to tell the COLLECT that we're going to be evaluating to the [<keep> <me>] block. It's an imperative language that flows left to right, and we don't know until after the ELIDE has run that it didn't produce anything.

We haven't been too historically bothered about the fact that if you call COLLECT and then don't store its result anywhere that you wasted work. Because...what did you expect?

And I would say that a similar philosophy would probably apply for calling COLLECT in a PARSE rule and then not using it anywhere. No one is going to lose sleep over that being inefficient. (Though maybe there should be some sort of runtime tool to help identify places that generated data is "underused")

But I think you'd be rather displeased if you were skipping through things like thousands of zero bytes with some #{00} and that was invisibly creating a gigantic binary of zeros without you asking it to.

Note also that historical PARSE was also on the side of saying if you wanted new series to be created, you had to ask for it:

r3-alpha>> parse "aaaa" [set x [some "a"]]
== true

r3-alpha>> x
== #"a"

There's no COPY, so you get back something cheap: a value from the original input (a character). Whether it's the rule you matched or the character is something to debate, but either way the point is you need an explicit instruction to fabricate a new series that could be arbitrarily sized.

One Step Closer?

Anyway...this means changing the UPARSE mechanics so the "value bearing" portion is the main function result. The combinator does not know whether you're going to use the result or not. This makes the rules a little "dumber" but I think considered holistically this is for the best.