Optimizing TRANSCODE Usage in String/Binary PARSE

As written, the DATATYPE! combinator in UPARSE may do wasteful value loading when operating on string input.

Consider this case:

>> parse "[some big block ...] 10" [collect some [keep integer! | block!]]
== [10]

Pretty impressive that it works. (Red will only do this on BINARY! input, but Ren-C's UTF-8 Everywhere approach lets it work on strings too!)

But at the combinator level, it's wasteful. What happens is:

  • It hits the INTEGER! combinator, which scans the next element, loading [some big block ...] as a series into memory.

    • It then checks the type, notices it's not an integer, and gives back a rejection...so the combinator for the rule block moves on to the next alternate.

  • That alternate is the BLOCK! combinator, which scans the big block all over again.

    • This time the type matches, so the parser returns success and the synthesized block.

    • But the synthesized block isn't actually wanted (there's no KEEP on that alternate), so it is thrown away.

  • The next iteration scans the 10 as an INTEGER! and keeps it.

Why Does It Work This Way?

It's based on TRANSCODE, and does essentially what I just described:

; scan the next value out of the input (or propagate scanner errors)
[item remainder]: transcode input except e -> [return raise e]

; the whole value was loaded just to perform this type check
if datatype != type of item [
    return raise ["Could not TRANSCODE" datatype "from input"]
]
return item

If we could pass a datatype to TRANSCODE when using the /NEXT option (e.g. requesting a remainder, as we are above), then it could short-circuit and we wouldn't need that test.
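As a rough illustration of the idea (a toy Python scanner that only knows integers, words, and blocks; none of this is Ren-C's actual implementation), passing an expected type lets the scan bail out on the first character, before any series gets built:

```python
def classify(src):
    """Cheaply guess the type of the next token without materializing it."""
    src = src.lstrip()
    if not src:
        raise ValueError("no more input")
    ch = src[0]
    if ch == '[':
        return 'block'
    if ch.isdigit() or (ch in '+-' and src[1:2].isdigit()):
        return 'integer'
    return 'word'

def transcode_next(src, expect=None):
    """Return (value, remainder); if EXPECT is given and the next token
    isn't that type, fail *before* doing any expensive loading."""
    kind = classify(src)
    if expect is not None and kind != expect:
        raise TypeError(f"next value is {kind}, not {expect}")
    src = src.lstrip()
    if kind == 'block':
        # Expensive path: walk the whole block to find its matching end.
        depth = 0
        for i, ch in enumerate(src):
            depth += (ch == '[') - (ch == ']')
            if depth == 0:
                return src[:i + 1], src[i + 1:]
        raise ValueError("unterminated block")
    # Cheap scalar path: split off one whitespace-delimited token.
    parts = src.split(None, 1)
    tok, rest = parts[0], (parts[1] if len(parts) > 1 else '')
    return (int(tok), rest) if kind == 'integer' else (tok, rest)
```

A call like `transcode_next("[some big block] 10", expect='integer')` fails immediately on seeing the `[`, which is exactly the short-circuit the DATATYPE! combinator would want.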

Red Has Looked At This Kind of Problem

Red's TRANSCODE function has grown a bunch of new refinements:

USAGE:
     TRANSCODE src

DESCRIPTION: 
     Translates UTF-8 binary source to values.
     Returns one or several values in a block. 

ARGUMENTS:
     src          [binary! string!]
     {UTF-8 input buffer; string argument will be UTF-8 encoded.}

REFINEMENTS:
     /next        => Translate next complete value (blocks as single value).
     /one         => Translate next complete value, returns the value only.
     /prescan     => Prescans only, do not load values. Returns guessed type.
     /scan        => Scans only, do not load values. Returns recognized type.
     /part        => Translates only part of the input buffer.
         length       [integer! binary!] "Length in bytes or tail position."
     /into        => Optionally provides an output block.
        dst          [block!] 
     /trace       => 
        callback     [function! [
                        event [word!]
                        input [binary! string!]
                        type [word! datatype!]
                        line [integer!]
                        token
                        return: [logic!]
                      ]] 

RETURNS:
    [block!]

I'm not sure exactly how useful the /PRESCAN option is (what good is a "guess" of the type?), but the /SCAN option would offer some bit of efficiency.

It would mean that instead of one call to TRANSCODE followed by a datatype test, there'd be two calls:

  • The first, a TRANSCODE/SCAN, to get the datatype (without synthesizing a value from it)

  • A second call to scan again and get the value

We assume the idle mode of scanning without producing anything can be fast.

I would suggest the scan feature be TRANSCODE/TYPES, so it would work more generally, not just with /NEXT:

>> transcode/types [1 a [b]]
== [#[datatype! integer!] #[datatype! word!] #[datatype! block!]]

(When I figure out the story of datatypes, there are going to be a lot of forum posts fixing up the above ugly notation.)
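Here's a toy Python sketch of what such a types-only pass could do (hypothetical, not Red's or Ren-C's code): classify each top-level token and skip past it, never constructing a value:

```python
def scan_types(src):
    """Report the type of each top-level token without loading any values."""
    types = []
    src = src.lstrip()
    while src:
        if src[0] == '[':
            types.append('block!')
            # Skip the block by bracket counting; nothing is materialized.
            depth = 0
            for i, c in enumerate(src):
                depth += (c == '[') - (c == ']')
                if depth == 0:
                    src = src[i + 1:]
                    break
            else:
                raise ValueError("unterminated block")
        else:
            parts = src.split(None, 1)
            tok = parts[0]
            types.append('integer!' if tok.lstrip('+-').isdigit() else 'word!')
            src = parts[1] if len(parts) > 1 else ''
        src = src.lstrip()
    return types
```

The skipping is just bracket counting and token splitting, which is the sense in which an "idle mode" of scanning that produces nothing can be fast.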

But What About The Synthesis Of Unused Values?

This is a bit of a pickle. We don't know if you're going to use the product or not.

UPARSE's design has values bubbling out the top, and no line of communication to be aware of whether what it produces will be used:

>> uparse "[a] (b)" [block! group!] 
== (b)

You might think that when the BLOCK! rule is about to run, UPARSE could notice it isn't at the end of the rules and send some kind of signal to the BLOCK! combinator that it doesn't have to synthesize an output. But there's no a priori psychic power saying that GROUP! hasn't been configured to evaluate to void. Until the combinator gets looked up and run, it's potentially the same situation as this:

>> uparse "[a] (b)" [block! void] 
== [a]

It Seems We Have Two Choices

  1. We can assume that a plain DATATYPE! intends to synthesize a value, and use a different combinator to say you only want to match the type:

    >> uparse "[a b c]" [scan block!]
    == #[datatype! block!]  ; cheap (but useful) return value, no series synthesis
    
    >> uparse "[a b c]" [block!]
    == [a b c]
    
  2. We can reverse it and say that by default it does the cheap thing, and you have to explicitly ask to get the expensive thing:

    >> uparse "[a b c]" [block!]
    == #[datatype! block!]
    
    >> uparse "[a b c]" [scan block!]
    == [a b c]
    

Looked at in isolation, it might seem like (2) would be the obvious winner.

The thorn is that this would be a pretty notable divergence from how array parsing works, which I would basically call non-negotiable:

>> uparse [[a b c]] [x: block!]

>> x
== [a b c]

So is there actually an option 3?

  3. Make a lone DATATYPE! an error, and have two distinct operations for transcoding:

    >> uparse "[a b c]" [block!]
    ** Error: On string input, use either TRANSCODE BLOCK! or SCAN BLOCK!
    
    >> uparse "[a b c]" [transcode block!]
    == [a b c]
    
    >> uparse "[a b c]" [scan block!]
    == [a b c]
    

Urg. That kind of sucks.

I think the answer is to accept option (1)'s suboptimal performance, and allow those who are performance-minded to tune it. There's no overt harm in scanning things you throw away; it's just wasteful.
