UPARSE Case Study: Escaping In Strings

hostilefork · March 27, 2021, 1:44pm

I wanted to make a REWORD variation that would look for escaped parts of strings and extract them as words. So:

Input: "abc$(def)ghi"
Output: ["abc" def "ghi"]

It's a common-seeming and not entirely trivial task. The first thing I came up with is a bit convoluted...perhaps because I tried to not repeat the "$(" and ")" strings in the rule:

parse text [
    collect [
        try some [
            not <end>
            (capturing: false)
            try keep between <here> ["$(" (capturing: true) | <end>]
            :(if capturing '[
                let inner: between <here> ")"
                keep (as word! inner)
            ])
        ]
    ]
]

It basically alternates between a capturing mode and a non-capturing mode. It decides if it needs to run a capture mode with a variable.
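
The alternation may be easier to see outside the parse dialect. Here is a rough Python sketch (not Ren-C, and not how UPARSE works internally) of the same two-mode loop. The name `split_escapes` and the `("word", ...)` tuple standing in for a WORD! are made up for illustration. Raising an error on an unterminated "$(" is an assumption of the sketch; the post doesn't say what should happen in that case.

```python
def split_escapes(text, opener="$(", closer=")"):
    """Split text into literal strings and ("word", name) pairs."""
    out = []
    pos = 0
    while pos < len(text):                  # plays the role of NOT <END>
        capturing = False
        start = text.find(opener, pos)
        if start == -1:
            out.append(text[pos:])          # literal run up to the end
            pos = len(text)
        else:
            out.append(text[pos:start])     # literal run, possibly ""
            pos = start + len(opener)
            capturing = True
        if capturing:                       # the conditional second rule
            stop = text.find(closer, pos)
            if stop == -1:
                raise ValueError("unterminated escape")  # assumption
            out.append(("word", text[pos:stop]))
            pos = stop + len(closer)
    return out


print(split_escapes("abc$(def)ghi"))
```

Like the rule above, the sketch keeps whatever precedes each "$(", even when that is an empty string.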

It has to throw in a NOT <END>, for reasons I explain in another post: it's running alternating rules that may both opt out.

I use a GET-GROUP! spliced conditional rule, as UPARSE doesn't have any loop-interrupting constructs yet. So you can't say "Stop running this rule, but consider it to have matched." There's only LOGIC! of #[false], which means what FAIL used to mean...i.e. the overall rule did not match (so any collected material would be forgotten).

Since it can't break out of the rule and report success, it has to have a way to skip over a rule. So the rule for capturing inside the parentheses conditions itself out with an IF statement and a generated rule. I could have instead written that as an alternate rule, where if NOT CAPTURING is true it bypasses the normal code:

parse text [
    collect [
        try some [
            not <end>
            (capturing: false)
            try keep between <here> ["$(" (capturing: true) | <end>]
            [:(not capturing) |
                let inner: between <here> ")"
                keep (as word! inner)
            ]
        ]
    ]
]

That feels more convoluted to me because of the inverse logic of the NOT, though.

It produces more empty strings than I would like:

Input: "$(abc)$(def)$(ghi)"
Output: ["" abc "" def "" ghi]

It would technically be possible for a rule like BETWEEN to succeed and give a NULL result if there were no content, instead of an empty string:

>> parse "()" [between "(" ")"]
== ~null~  ; anti

But this then means you can't get a good distinction of what happened in the case of an optional rule.

>> parse "" [try between "(" ")"]
== ~null~   ; anti...so were there parentheses or not?

So I guess it's another situation where if you want to filter out the empty strings, you have to capture into a variable and filter it.
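
As a rough illustration of "capture, then filter" (in Python rather than Ren-C, and only a sketch), the version below checks each literal piece before keeping it. It leans on the documented behavior of `re.split` with one capturing group, which returns literal and captured pieces in alternation. The name `split_dropping_empties` and the `("word", ...)` tuple are hypothetical stand-ins.

```python
import re

# One capturing group: the text between "$(" and the next ")".
ESCAPE = re.compile(r"\$\((.*?)\)")


def split_dropping_empties(text):
    """Like the COLLECT rule, but empty literal strings are not kept."""
    out = []
    # With a capturing group, split() yields: literal, captured, literal, ...
    for index, piece in enumerate(ESCAPE.split(text)):
        if index % 2 == 1:
            out.append(("word", piece))     # odd slots are the escaped parts
        elif piece:                         # the filter: skip "" literals
            out.append(piece)
    return out


print(split_dropping_empties("$(abc)$(def)$(ghi)"))
```

The filter is the single `elif piece` test, which is the step the parse rule would need a variable for.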

I think UPARSE helps out here...but it's not quite the slam dunk I'd hope for.

Because it has two rules that may both opt themselves out, it's a thought piece for asking if the NOT END makes sense with TRY SOME. Or is it better off baking that into the TRY SOME rule and having another construct? Intuitively I feel like the tax of having two slightly different versions and explaining the use of one vs. the other is worse than just having the more general construct.

If there were a loop-ending construct that indicated the overall rule was a success (e.g. didn't discard the KEEPs), then we might avoid the capturing flag:

parse text [
    collect [
        try some [
            try keep between <here> ["$(" | <end>]
            [<end> break |
                let inner: between <here> ")"
                keep (as word! inner)
            ]
        ]
    ]
]

But I don't know if BREAK is the right name for a loop-accepting operation (in regular evaluator loops like DO's WHILE, BREAK typically causes the loop to return NULL). So I'd expect a parse BREAK to perhaps discard anything kept. Perhaps STOP would be more consistent, and it could be value-bearing as well: stop (...)

hostilefork · February 8, 2024, 11:08pm

String interpolation is back on the plan now with Pure Virtual Binding II. So this question is back, too.

We could start improving on the old code with UPARSE's new meaning of WHILE.

parse text [
    collect while not <end> [
        (capturing: false)
        try keep between <here> ["$(" (capturing: true) | <end>]
        :(if capturing '[
            let inner: between <here> ")"
            keep (as word! inner)
        ])
    ]
]

Slightly better. I'll point out a binding question:

Should a spliced-in parse rule via GET-GROUP! run a bind operation on its material? I'm speaking about the soft-quoted splice of the block material from the IF. Interestingly, this doesn't affect the parse "keywords" (like BETWEEN or KEEP) because they are looked up in a map. And it doesn't affect INNER because it's a LET-variable. What gets affected is the AS and WORD! lookup from the LIB context.
- I think it's clear that a value spliced via GET-GROUP! acts as if it had been written verbatim where it was found. So the quoted block, once unquoted, would receive the binding of the PARSE ruleset in progress at that moment, since it was unbound. But curiously, this means you could return an already-bound block as another choice. Here that would make no difference between if capturing @[...] and if capturing '[...], but if you had (let x: 10 if capturing '[...]) it would affect the visibility of that X: you would not see it if you used the quote, but would see it with the @.

Beyond that I'm a little puzzled over how to do this better.

I feel like the "right answer" wouldn't need a capturing mode variable, but could express this as the difference of a complete rule with the between "$(" being its own line.

Following that line of thinking gets this:

parse text [collect while not <end> [
    keep any [
        [let inner: between "$(" ")" (as word! inner)]
        between <here> [<end> | ahead "$("]
    ]
]]

If we believe the pattern of COLLECT plus one KEEP of the body should be ACCUMULATE that could be:

parse text [accumulate any [
    [let inner: between "$(" ")" (as word! inner)]
    further between <here> [<end> | ahead "$("]
]]

(That's a little tricky, because you need the FURTHER to avoid infinitely collecting empty strings once you reach the end with between <here> <end>.)

I still don't like the repetition (e.g. repeating "$(" in ahead "$(").

This might need its own combinator that's a relative of BETWEEN which says what to do with the material that's not between.
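
To make the shape of that idea concrete, here is a speculative Python sketch (not a proposed Ren-C design). The hypothetical helper `map_between` names each delimiter once and takes one callback for the material between them and another for the material that's not between. Every name in it is invented for illustration, and treating an unterminated opener as an error is an assumption.

```python
def map_between(text, opener, closer, inside, outside):
    """Apply inside() to delimited spans and outside() to everything else."""
    out = []
    pos = 0
    while pos < len(text):
        start = text.find(opener, pos)
        if start == -1:
            out.append(outside(text[pos:]))     # trailing literal run
            break
        if start > pos:                         # skip empty literal runs
            out.append(outside(text[pos:start]))
        begin = start + len(opener)
        stop = text.find(closer, begin)
        if stop == -1:
            raise ValueError("unterminated escape")  # assumption
        out.append(inside(text[begin:stop]))
        pos = stop + len(closer)
    return out


result = map_between(
    "abc$(def)ghi", "$(", ")",
    inside=lambda s: ("word", s),   # stand-in for AS WORD!
    outside=str,
)
print(result)
```

Because the helper owns both delimiters, the caller never repeats "$(", and it also skips the empty strings between adjacent escapes.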