Converting TRIM To UPARSE for Testing And Inspiration

hostilefork · August 13, 2021, 11:09pm

A long time ago, @Brett converted the circuitous native code for TRIM from R3-Alpha to PARSE-based usermode code.

Since we have that code--and some tests for it--I thought it would be a good idea to go ahead and try running it under UPARSE. This would be another way of testing UPARSE...as well as to see if the new features gave it any kind of leg up. We could also look for inspirations for new features...

New Features: `<index>` and MEASURE Combinators

There was a calculation of indentation done for the TRIM/AUTO feature. It uses PARSE* which is the version that doesn't require matching to the end of the input. (Though since it doesn't check the result and doesn't do any operations which would roll back, it doesn't make a difference.)

indent: _
if auto [
    parse* series [
        ; Don't count empty lines, (e.g. trim/auto {^/^/^/    asdf})
        remove [while LF]

        (indent: 0)
        s: <here>, some rule, e: <here>
        (indent: (index of e) - (index of s))
    ]
]

My first thought was about TAG! combinators. We lost the ability to match TAG!s without quoting them (you now have to write [some '<tag>])...but in exchange we have a nice noun-space to play with that doesn't interfere with variable name nouns. So what if <index> gave you the index position in the current series?

That makes it a bit nicer:

indent: _
if auto [
    parse* series [
        ; Don't count empty lines, (e.g. trim/auto {^/^/^/    asdf})
        remove [while LF]

        s: <index>, while rule, e: <index>, (indent: e - s)
    ]
]

I also changed the SOME to a WHILE, which always succeeds...and since <index> always succeeds there's no need to pre-emptively set the indent to 0.

But wouldn't this pattern make a nice combinator in and of itself? Something that can tell you how long a matched range is. Well, uparse fans, meet MEASURE!

indent: _
if auto [
    parse* series [
        ; Don't count empty lines, (e.g. trim/auto {^/^/^/    asdf})
        remove [while LF]

        indent: measure while rule
    ]
]

And look how easy the combinator is to write (it's one of those that can just use the default rollback):

measure: combinator [
    {Get the length of a matched portion of content}
    return: "Length in series units"
        [<opt> integer!]
    parser [action!]
    <local> s e
][
    ([# (remainder)]: parser input) else [return null]  ; ignore result

    e: index of get remainder
    s: index of input

    if s > e [  ; could also return something like ~bad-seek~ isotope
        fail "Can't MEASURE region where rules did a SEEK before the INPUT"
    ]

    return e - s
]
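For readers who don't have a Ren-C build handy, the idea behind MEASURE can be modeled in a few lines of Python. This is only a sketch under assumptions: the `Parser` type, `literal`, `while_`, and the tuple return value are hypothetical stand-ins for illustration, not how UPARSE combinators are actually implemented.

```python
from typing import Callable, Optional

# Toy model: a parser takes (text, position) and returns the new position,
# or None on failure. (Hypothetical shape, not the UPARSE combinator API.)
Parser = Callable[[str, int], Optional[int]]

def literal(s: str) -> Parser:
    def parse(text: str, pos: int) -> Optional[int]:
        return pos + len(s) if text.startswith(s, pos) else None
    return parse

def while_(rule: Parser) -> Parser:
    """Zero or more matches; always succeeds, like WHILE in the post."""
    def parse(text: str, pos: int) -> Optional[int]:
        while True:
            nxt = rule(text, pos)
            if nxt is None or nxt == pos:  # stop on failure or no progress
                return pos
            pos = nxt
    return parse

def measure(rule: Parser):
    """Run `rule`; return (length matched, new position), or None on failure."""
    def parse(text: str, pos: int):
        end = rule(text, pos)
        if end is None:
            return None
        if end < pos:  # the wrapped rule moved backwards past the start
            raise ValueError("Can't MEASURE a region that ends before INPUT")
        return end - pos, end
    return parse

indent, _ = measure(while_(literal(" ")))("    asdf", 0)
print(indent)  # 4
```

Because the wrapped rule here is a WHILE, the measurement always succeeds, which mirrors why the post no longer needs to pre-set the indent to 0.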

That's A Pretty Good Start!

It seems to me that what the TRIM code needs is probably a bit better definition of the semantics. TRIM/AUTO is a bit strange:

>> utrim/auto "  x^/ y^/   z^/"
== "x^/ y^/ z^/"

It measures the indentation of the first non-empty line and removes that much from each line...but that raises the question of what to do with a later line that is less indented. The rule for processing lines was:

line-start-rule: compose/deep [
    remove [((if indent [[opt repeat (indent)]] else ['while])) rule]
]

The indent not being a BLANK! implies TRIM/AUTO.

That's a /DEEP compose that does splicing (signified these days by `((...))`). I rewrote the rule to be a bit clearer:

line-start-rule: compose [
    remove (if indent '[opt repeat (indent) rule] else '[while rule])
]

That's more pleasing to me, as well as more efficient. It's a nice use of the quoted branches!

But back to the semantics: is this right? It could also slam the less indented lines to the left by moving the OPT.

line-start-rule: compose [
    remove (if indent '[repeat (indent) opt rule] else '[while rule])
]

That would make the y flush with the left:

>> utrim/auto "  x^/ y^/   z^/"
== "x^/y^/ z^/"
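The difference between the two placements of OPT can be modeled outside of PARSE. The Python sketch below is an illustration under assumptions: the helper names are hypothetical, only spaces count as indentation, and it models just the line-start rule rather than everything TRIM does.

```python
def count_indent(text: str) -> int:
    """Indentation of the first non-empty line (leading newlines skipped)."""
    body = text.lstrip("\n")
    return len(body) - len(body.lstrip(" "))

def strip_all_or_nothing(line: str, indent: int) -> str:
    # Models `opt repeat (indent) rule`: remove exactly `indent` spaces, or none.
    return line[indent:] if line.startswith(" " * indent) else line

def strip_up_to(line: str, indent: int) -> str:
    # Models `repeat (indent) opt rule`: remove at most `indent` leading spaces.
    n = 0
    while n < indent and n < len(line) and line[n] == " ":
        n += 1
    return line[n:]

def trim_auto(text: str, strip) -> str:
    indent = count_indent(text)
    return "\n".join(strip(line, indent) for line in text.split("\n"))

sample = "  x\n y\n   z\n"
print(repr(trim_auto(sample, strip_all_or_nothing)))  # 'x\n y\n z\n'
print(repr(trim_auto(sample, strip_up_to)))           # 'x\ny\n z\n'
```

The first variant leaves an under-indented line untouched; the second pulls it flush left, matching the two console results shown above.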

Anyway... let's keep those UPARSE test cases coming! It's at the point now where UPARSE is more reliable than both the R3-Alpha-derived native PARSE (which I'm calling PARSE3) and Red's PARSE. So it's revealing the bugs and inconsistencies in those codebases, not vice versa.

hostilefork · August 14, 2021, 4:02am

And now a little more magic to hopefully keep @Brett in the impressed-zone...

Let's take a second look at the LINE-START-RULE:

line-start-rule: compose [
    remove (if indent '[opt repeat (indent) rule] else '[while rule])
]

I began adding ranged REPEATs with blocks, such as repeat ([2 3]) integer! - this is more hygienic than 2 3 integer! which creates semantic problems by not being equivalent to 2 [3 integer!].

Also: if your min and max are in variables, it's clearer to have the REPEAT construct there vs. opaquely reading foo bar rule and not knowing that's going to wind up with FOO and BAR being integers that are iteration limits. Writing repeat (:[foo bar]) rule makes that a lot clearer. (Remember that GET-BLOCK!s reduce now.)

But I had a thought that BLANK! should be able to opt-out of REPEAT to make it a no-op:

>> num-b: _

>> uparse? "aaa" [repeat (num-b) "b", some "a"]
== #[true]

We allow this in the REPEAT iterative loop to opt out, so why not here? That made me wonder about opting out of the BLOCK! form. Presumably repeat ([_ _]) rule would be a no-op also. But what if you only half opted out? What's repeat ([_ 3]) rule or repeat ([3 _]) rule?

Furthermore, that thought had a tricky cousin...

What About Opting In ?

If you could "opt in" to a REPEAT... like all the way in... you could say that it was a synonym for WHILE. Basically have some token that represented an arbitrarily large integer.

So I used #, which is what's being used elsewhere to opt-in. It's what refinements with no arguments have as a parameter value when they are requested. It's how you indicate a multi-return value you don't want to give a name to, but still want to have the semantics of requesting. And now it means "no limit".

Under this set of rules, repeat ([3 #]) rule will match the rule at least 3 times, and any number of times beyond that. Meanwhile, repeat ([3 _]) rule is just a synonym for repeat (3) rule...which can come in handy when you're trying to write code that generically uses min and max but has the ability to decay to a single count.
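To make the opt-out and opt-in cases concrete, here is a small Python model of how the REPEAT argument could be normalized into a (min, max) pair. It is a sketch, not the actual combinator: `None` stands in for `_`, the string `"#"` stands in for `#`, and treating a blank minimum (as in `[_ 3]`) as zero is an assumption, since the post raises that case without settling it.

```python
import math

BLANK = None      # stands in for `_` (opt out)
UNBOUNDED = "#"   # stands in for `#` (opt in, no limit)

def repeat_bounds(times):
    """Return (min, max) repetition counts, or None when REPEAT is a no-op."""
    if times is BLANK:
        return None                      # repeat (_) rule: no-op
    if times == UNBOUNDED:
        return (0, math.inf)             # repeat (#) rule: same as WHILE
    if isinstance(times, int):
        return (times, times)            # repeat (3) rule: exactly 3
    lo, hi = times
    if lo is BLANK and hi is BLANK:
        return None                      # repeat ([_ _]) rule: no-op
    if lo is BLANK:
        lo = 0                           # ASSUMPTION: not decided in the post
    if hi is BLANK:
        hi = lo                          # repeat ([3 _]) decays to repeat (3)
    elif hi == UNBOUNDED:
        hi = math.inf                    # repeat ([3 #]): at least 3
    return (lo, hi)

def repeat(times, char, text, pos=0):
    """Match `char` between min and max times; return new position or None."""
    bounds = repeat_bounds(times)
    if bounds is None:
        return pos
    lo, hi = bounds
    count = 0
    while count < hi and pos < len(text) and text[pos] == char:
        pos += 1
        count += 1
    return pos if count >= lo else None

print(repeat([3, "#"], "a", "aaaaa"))   # 5
print(repeat([3, None], "a", "aaaaa"))  # 3
print(repeat(None, "b", "aaa"))         # 0
```

Keeping the normalization separate from the matching loop shows why a single variable holding `_`, `#`, or an integer can drive all three behaviors.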

Using this idea and defaulting indent to # instead of _, we can rewrite our rule from:

line-start-rule: compose [
    remove (if indent '[opt repeat (indent) rule] else '[while rule])
]

To something simpler:

line-start-rule: [opt remove repeat (indent) rule]

Still Elevating The Art in 2021 !

Brett · August 14, 2021, 5:52am

Don't think I can claim that. I think that was your effort, with a little modification and test work from me.

Careful, people might find this parsing flexibility indispensable and want it to run at production speed.

Brett · August 14, 2021, 5:54am

This power of expressiveness to simplify to the essence of the intent is great.

hostilefork · August 14, 2021, 7:28am

It is indispensable, and I think that speeding it up will be not just within feasibility, but have the appealing effect of speeding up the general facilities of the language!

But the next great challenge I'd like to be able to prototype is to be able to run on top of an abstracted pipe vs. marching along directly in memory. If we could do that, we could run on network data...or be processing a continuously feeding pipe from a CALL session that is being scripted and reacted to.

It might be another one of those "you get what you pay for" situations with the combinators. If the combinator only accepts ANY-SERIES! input and not PORT!, then it might be more limited. But we'd want the stock combinators to be able to run on either.

Point being: still more prototyping work to be done before the optimizing!
