Incomplete TRANSCODEs: Actually an Optimization Problem

hostilefork · August 22, 2022, 3:09pm

Ren-C has a multi-return interface for TRANSCODE. Without /ONE, you get the whole thing:

>> transcode "abc def"
; first in pack of length 2
== [abc def]

If you ask for a remainder in this case, it will be empty:

>> [value pos]: transcode "abc def"
== [abc def]

>> pos
== ""

With the /ONE refinement, it will go one item at a time. Because the main return (value) can come back NULL, there may not be a remainder, so POS needs to be optional:

>> [value /pos]: transcode/one "abc def"
== abc

>> pos
== " def"

You know that you're at the end of the input when it returns null as a main result, with all the benefits of easy reactions to NULL with IF and ELSE and friends:

>> [value /pos]: transcode/one ""
== ~null~  ; anti

>> pos
== ~null~  ; anti

Writing foolproof loops to process items is a breeze:

while [true] [
    [item /utf8]: transcode/one utf8 else [break]
    print mold item
]

; or for the THEN/ELSE haters out there (you know who you are :-P)

while [true] [
    if null? [item /utf8]: transcode/one utf8 [
        break
    ]
    print mold item
]

This Runs Circles Around Red and R3-Alpha

For starters: neither supports strings as input--because the scanner is built for reading UTF-8 files...and both R3-Alpha and Red unpack strings into fixed-width encodings. So if you have string input, you have to pay for a copy encoded as UTF-8 via TO BINARY!. (Ren-C's UTF-8 Everywhere wins again!)

R3-Alpha unconditionally returns a block with the last element as a remainder, whether you ask for one item via /NEXT or not:

r3-alpha>> transcode to binary! "abc def"
== [abc def #{}]

r3-alpha>> transcode/next to binary! "abc def"
== [abc #{20646566}]

r3-alpha>> transcode/next to binary! ""
== [#{}]

So if you were transcoding an entire input, you have to TAKE/LAST an always-empty binary off of the result.

But if you are using /NEXT, you have to PICK out the element from the start of the array and the remainder from the end. And you need to notice the exception of no-value-produced, where the block is length 1 instead of 2.

That's awkward, but as usual... Red somehow manages to make an incompatible interface that is as much worse as it is better:

The better part is that if you don't ask for /NEXT you just get the block back, like in Ren-C:

red>> transcode to binary! "abc def"
== [abc def]

But the /NEXT interface is outright broken:

red>> transcode/next to binary! "abc def"
== [abc #{20646566}]

red>> transcode/next to binary! ""
== [[] #{}]

It might look better because you don't have to guess about which position to find the remainder in--it's always in the second slot. But it has a fatal flaw: you can't distinguish the result of scanning "[]" from the result of scanning any string with nothing but comments and whitespace.

Consider this very basic loop to scan one item at a time and print it:

red>> utf8: to binary! "abc def"

red>> while [not tail? utf8] [
     set [item utf8] transcode/next utf8
     print mold item
]
abc
def

You get two items. But what if you had something that was--say--a comment:

red>> utf8: to binary! "; I'm just a comment"

red>> while [not tail? utf8] [
     set [item utf8] transcode/next utf8
     print mold item
]
[]

You get one spurious item. (They chose BLOCK! for the item, but it wouldn't matter what it was--a NONE! would be just as bad, you're just losing the distinction between empty strings and "#[none]" then.)
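To make the collision concrete outside of any particular Rebol, here is a minimal sketch in Python (not Red or Ren-C code). The toy scanner and the names `scan_one` and `scan_one_in_band` are hypothetical, made up for illustration: it only knows words, `[]`, whitespace and `;` comments. The point is just the shape of the return value, not how either real scanner is implemented.

```python
import re

# Toy grammar (an assumption for illustration): the only tokens are
# alphabetic words and "[]"; whitespace and ";" comments are skipped.
_SKIP = re.compile(r"(?:\s+|;[^\n]*)*")
_TOKEN = re.compile(r"\[\]|[A-Za-z]+")

def scan_one(text):
    """Return (item, rest), or None if only whitespace/comments remain."""
    pos = _SKIP.match(text).end()
    m = _TOKEN.match(text, pos)
    if m is None:
        if pos < len(text):
            raise ValueError("unrecognized input: " + text[pos:])
        return None  # out-of-band "nothing here" signal
    item = [] if m.group() == "[]" else m.group()
    return item, text[m.end():]

def scan_one_in_band(text):
    """Red-style shape: always a pair, with [] standing in for 'no item'."""
    result = scan_one(text)
    return result if result is not None else ([], "")

print(scan_one_in_band("[]"))           # ([], '')
print(scan_one_in_band("; a comment"))  # ([], '')
print(scan_one("[]"))                   # ([], '')
print(scan_one("; a comment"))          # None
```

With the in-band placeholder, the first two calls are indistinguishable. With an out-of-band result (None here, NULL in Ren-C) the caller can tell "scanned an empty block" apart from "there was nothing to scan".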

If I were prescribing a solution for Red I'd say:

- Make /NEXT take a variable to write the next position into
- Error on #{} input, so anyone doing a TRANSCODE/NEXT knows they are responsible for testing for TAIL? before they call (if they're not sure their input is non-empty)
  - This way an empty remainder returned in the /NEXT variable will uniquely signal the reached-end state
- Make the synthesized product at the tail something ugly but assignable (so not an unset!)
  - an ERROR! saying "end of input" is at least informative in case it winds up getting treated as an actual value somewhere

That would at least give them patterns like:

if not tail? utf8 [  ; needed if you're not sure it's non-empty
    while [true] [
        item: transcode/next utf8 'utf8
        if tail? utf8 [break]
        print mold item
    ]
]

Ren-C Also Thrashes R3-Alpha and Red In Error Handling

Ren-C TRANSCODE has these potential behaviors:

- RETURN a BLOCK! (if plain TRANSCODE)
- RETURN an ANY-VALUE! or NULL (if TRANSCODE/ONE)
- It can do a "hard FAIL"
  - This would happen if you asked for something fundamentally incoherent...like asking to TRANSCODE with an input that was non-UTF-8...like a GOB!, or something like that
  - Such errors are only interceptible by a special SYS.UTIL.ENTRAP method--they are not supposed to be easy to gloss over, and are unlikely to have meaningful mitigation. So only special sandboxing situations (like writing consoles that print out the error) are supposed to trap them.
- It can RETURN an isotopic ERROR! ("raised error") if something went wrong in the transcoding process itself
  - This would be something like a syntax error, like if you asked transcode "a bc 1&x def"
  - These will be promoted to a hard FAIL if the immediate caller doesn't do something to specially process them.
  - You can casually ignore or intercept these, because you can be confident that it was a formal return result of the thing you just called--not some deeper problem like a random typo or other issue.

I won't rehash the entire "why definitional errors are foundational" post, but TRANSCODE was one of the first functions that had to be retrofitted to use them.

>> transcode "a bc 1&x def" except e -> [print ["Error:" e.id]]
Error: scan-invalid

The definitionality is extremely important! I lost a long time today because the bootstrap shim had a variation of TRANSCODE...parallel to this in R3-Alpha:

r3-alpha>> transcode: func [input] [
               prnit "My Transcode Wrapper"  ; oops, typo
               return transcode input
           ]

r3-alpha>> if not attempt [transcode to binary! "abc def"] [print "Bad input"]
Bad input

But the input isn't bad!!! This leads to a nightmare of trying to figure out what was going wrong. I had just one of those nightmares today in the bootstrap executable when tinkering with the shim implementation of TRANSCODE. A bug in the shim was leading to silently skipping work that should have been done, because the caller wanted to be tolerant of bad transcode input.

There's simply no practical way of working on code of any complexity without something like definitional failures, and experience has proven this day after day.
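Here is a rough Python sketch of the same trap, for anyone who wants to poke at it outside of Rebol. Everything in it is hypothetical: `ScanError`, the toy `transcode`, `wrapper`, and the two `attempt` variants are stand-ins made up for illustration. Python exception classes are also only an approximation of definitional errors: a narrow `except` can still catch a `ScanError` raised from somewhere deeper, whereas a Ren-C raised error is tied to the call you just made.

```python
class ScanError(Exception):
    """Stand-in for a syntax error reported by the scanner itself."""

def transcode(text):
    # Toy scanner (assumption): items are whitespace-separated, and any
    # item containing "&" counts as a syntax error.
    items = text.split()
    for item in items:
        if "&" in item:
            raise ScanError(item)
    return items

def wrapper(text):
    prnit("My Transcode Wrapper")  # deliberate typo, as in the R3-Alpha example
    return transcode(text)

def attempt(fn):
    """Blanket trap, like R3-Alpha's ATTEMPT: any failure becomes None."""
    try:
        return fn()
    except Exception:
        return None

def attempt_scan(fn):
    """Narrow trap: only the scanner's own error becomes None."""
    try:
        return fn()
    except ScanError:
        return None

# The blanket trap reports perfectly good input as bad:
if attempt(lambda: wrapper("abc def")) is None:
    print("Bad input")

# The narrow trap still tolerates real syntax errors...
print(attempt_scan(lambda: transcode("a bc 1&x def")))  # None

# ...but lets the typo surface as a NameError instead of hiding it.
try:
    attempt_scan(lambda: wrapper("abc def"))
except NameError as e:
    print("Bug surfaced:", e)
```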

Getting Incomplete Results Via R3-Alpha's /ERROR

R3-Alpha offered this feature:

/error -- Do not cause errors - return error object as value in place

The intended use is that you might want the partial result of what had been successfully scanned so far. Without /ERROR, if the scan raised an error, you could trap that error. But you wouldn't have any of the scanned items.

It would put any ERROR! as the next-to-last item in the block, with the remainder after that:

>> transcode/error to binary! "a bc 1&x def"
== [a bc make error! [
    code: 200
    type: 'Syntax
    id: 'invalid
    arg1: "pair"
    arg2: "1&x"
    arg3: none
    near: "(line 1) a bc 1&x def"
    where: [transcode]
] #{20646566}]

>> to string! #{20646566}
== " def"  ; wait...why isn't 1&x part of the "remainder"

It's clumsy to write the calling code (or to read it)...testing to see if the next-to-last item is an ERROR! and reacting to that.

(Also: What if there was some way to represent ERROR! values literally in source? This would conflate with such a block that was valid...but just incidentally had an ERROR! and then a BINARY! in the last positions.)

But the thing that had me most confused about it was the remainder. Notice above you don't get 1&x as the start of the stuff it couldn't understand.

Was it trying to implement some kind of recoverable scan? What would that even mean?

Ultimately I think this was just a leaking of an implementation detail, as opposed to any reasonable attempt at a recoverable scanner. The only reason it didn't tell you where the exact tail of the successfully scanned material was is that it did not know.

The scanning position is based on token consumption, so if you started something like a block scan and it saw a [ then it forgets where it was before that. Then if something inside the block goes bad, it will just give you a remainder position somewhere inside that--completely forgetting about how many nesting levels deep it was.

So what you were getting was a crappier implementation of scanning one by one, and remembering where you were before the last bad scan:

pos: input
error: null
block: collect [
   while [true] [
       keep [# pos]: transcode pos else [
           break
       ] except e -> [
           error: e
           break
       ]
   ]
]

That gives you a proper version, setting error if something happened and giving you the block intact.
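The same pattern as a Python sketch, in case it helps to see it without the Ren-C machinery. The toy `scan_one` and the name `scan_all_tolerant` are hypothetical (whitespace-separated items, with anything containing "&" treated as a syntax error). Because the position only advances after a successful item, the remainder starts exactly at the item that failed, which is what R3-Alpha's /ERROR couldn't tell you.

```python
def scan_one(text):
    """Toy one-item scanner: returns (item, rest), or None at end of input.

    Assumption for illustration: items are whitespace-separated, and any
    item containing "&" is a syntax error.
    """
    stripped = text.lstrip()
    if not stripped:
        return None
    item = stripped.split(None, 1)[0]
    if "&" in item:
        raise ValueError(f"invalid token: {item}")
    return item, stripped[len(item):]

def scan_all_tolerant(text):
    """Collect items until end of input or the first error.

    Returns (items, rest, error). On error, rest is the position before
    the item that could not be scanned.
    """
    items, pos, error = [], text, None
    while True:
        try:
            result = scan_one(pos)
        except ValueError as e:
            error = e  # keep what was collected so far
            break
        if result is None:
            break
        item, pos = result  # only advance after a successful scan
        items.append(item)
    return items, pos, error

items, rest, error = scan_all_tolerant("a bc 1&x def")
print(items)  # ['a', 'bc']
print(repr(rest))  # ' 1&x def'
print(error)  # invalid token: 1&x
```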

So Finally... We See It's An Optimization Problem

The question is whether there's some way of folding this into TRANSCODE, so it's doing the looping and collecting efficiently for you.

But this interface wants to get back a "remainder". And I kind of hate to sacrifice the property that TRANSCODE's asking for a remainder means scan one element. :-/

I guess we could say that there's a logical process you follow:

- The output parameter is called REST
- An additional output parameter is added for ERROR
- If you ask for the REST and don't ask for an ERROR, that suggests you want to transcode a single item
  - You could have just intercepted the error if you wanted it
  - Nothing is lost because there wouldn't be any partial results to miss (if you're only doing one item, there will always be zero items completed before it)
- If you ask for the REST and do ask for an ERROR, then it assumes you must not want the one-item-only semantics after all.

It's a little bit awkward because it conflates partial output with fully successful output:

>> [block rest error]: transcode "a bc"
== [a bc]

>> error
; null

>> [block rest error]: transcode "a bc 1&x def"
== [a bc]  ; no indication something failed

>> error? error  ; you'd have to remember to check this
== #[true]

That's not a deal breaker, and Ren-C makes it easy to work with: "circling" an output in a multi-return (with @) makes it the primary return result:

>> [block rest @error]: transcode "a bc"
; null

>> [block rest @error]: transcode "a bc 1&x def"
== make error! [...]

What's much more jarring to me is the flipping back and forth of whether you're asking for a full transcode or not.

>> x: transcode "abc def"
== [abc def]

>> [x y]: transcode "abc def"
== abc

>> [x y z]: transcode "abc def"
== [abc def]

Ick. Should I be willing to bend on the transcode "requested parameter" behavior in this case, by adding a /ONE refinement?

>> [block rest]: transcode "abc def"
== [abc def]

>> rest
== ""  ; kind of useless, but honest

>> [block rest]: transcode/one "abc def"
== abc

>> rest
== " def"

That would make me feel grief, as it loses one of the first showcases of return value sensitivity. And it irks me to think that the beauty is ultimately being given up for the sake of what amounts to an optimization.

Answer For Now: Kill Off /ERROR

- The answer /ERROR has been giving back in error cases for the remainder is sketchy, and I don't want to figure out how to fix it.
- You can get the behavior reliably just by intercepting errors going one transcode item at a time.
- This is a good opportunity to write tests of item-by-item scanning with error handling.
- Red added a bunch of refinements on transcode [/next /one /prescan /scan /part /into /trace], and they didn't pick up /error themselves.

Speaking of adding lots of refinements: I also want to get away in general from investments in weird C scanner code and hooks (especially if it's just an optimization).

What we should be investing in is a more fluid mixture of PARSE of strings/binary with the scanner. e.g. we should have ways of knowing what line number you're at during the parse for any combinator, and just generally pushing on that. Adding TRANSCODE parameters up the wazoo isn't a winning strategy.