Lessons from Encapping: Improving BINARY! PARSE

The Atronix R3 version that Ren-C started from had a bunch of OS-specific C code (that wasn't in R3-Alpha) pertaining to encapping.

I saw this as a liability. Also, it was inflexible...you'd have to have a Windows EXE to do encapping on Windows, and a Linux executable to do encapping on Linux, etc.

Encapping seemed like a perfect opportunity to be moved fully to usermode...to exercise BINARY! parsing. The well-documented ELF, PE, and Mach-O formats would be a great test of how well their specs could be translated into a resource-rewriting tool.

As a first task, I undertook the ELF binary. And it was painful. The code did not feel good.

Shixin was somehow inspired and did the Windows PE version (perhaps on his own time over a weekend, if I recall correctly). In this case, he actually did it in a more Rebol-like way than I did: his implementation abstracted the PARSE rules into his own GEN-RULE dialect, so it was a bit less redundant. But it was still far from ideal.

The source today is not pleasant (%encap.reb). However, I think it has potential, and can evolve into a showcase of better practices as features get shored up.

Basic Problem: No Composition Of Higher-Level Captures

A huge problem I've identified with historical PARSE is that SET and COPY operations are limited by the type of your input. If your input is a string, the only thing you can directly capture are characters or portions of strings. If your input is a block, the only thing you can directly capture are values or ranges out of that block.

This doesn't give you any abstraction power. It's harrowing.

Let's say you want to capture a little-endian 16-bit unsigned integer. You make a rule, which has to associate with a buffer variable and then a variable to hold the translated value:

u16: _
buf: _
u16-le-rule: [copy buf 2 skip (u16: debin [LE + 2] buf)]

This is already worlds better than what we had, because it's using DEBIN (which did not exist back then).
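For readers who haven't used DEBIN, a rough Python analogue of that decoding step might look like the following (the `u16_le` helper name is mine, purely for illustration; DEBIN's dialect covers many more encodings than this single call):

```python
import struct

def u16_le(buf: bytes) -> int:
    # Decode two bytes as a little-endian unsigned 16-bit integer,
    # roughly what [LE + 2] asks DEBIN to do.
    (value,) = struct.unpack("<H", buf)
    return value

print(u16_le(b"\x34\x12"))  # 0x1234 = 4660
```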

Even so, you're not done. Now if you want to capture a uint16 little-endian, you need to run the rule...and then grab this mysterious variable:

parse bin [u16-le-rule (my-var: u16)]

This is not competitive with the parser combinators people are familiar with in other languages! For starters, it would at least have to be much more like:

parse bin [my-var: u16-le-rule]

But I'd argue that capturing various forms of integers with different byte widths and signedness and endianness is so foundational that this should basically be in the box.

parse bin [my-var: debin [LE + 2]]

-or-

u16: [debin [LE + 2]]
parse bin [my-var: u16]

I don't know if that's exactly right. But it shouldn't be any harder than that.

Combinators Should Be Able To Return ANYTHING

The main point to absorb here is that abstraction isn't going to work unless a "value-bearing" parse rule is able to return whatever it wants...disconnected from the input type of the series. It's a major oversight in PARSE.

UPARSE is seeking to fix this.
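To make the combinator idea concrete, here is a minimal sketch in Python (not UPARSE itself, and all names are mine): a parser is a function from `(input, position)` to `(value, new-position)` or `None` on failure. The key property is that the returned value can be any type at all, disconnected from the element type of the input series:

```python
def u16_le(data, pos):
    # A value-bearing rule: input is bytes, but the synthesized
    # value is an INTEGER, not a slice of the input.
    if pos + 2 > len(data):
        return None
    return int.from_bytes(data[pos:pos + 2], "little"), pos + 2

def seq(*parsers):
    # Sequencing combinator: run each parser in order, collecting
    # whatever values they synthesize.
    def rule(data, pos):
        values = []
        for p in parsers:
            result = p(data, pos)
            if result is None:
                return None
            value, pos = result
            values.append(value)
        return values, pos
    return rule

# Two u16 fields decode straight to integers--no buffer variables.
print(seq(u16_le, u16_le)(b"\x01\x00\x02\x00", 0))  # ([1, 2], 4)
```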

There's No Pattern For Reading and Writing Duality

When you describe a file format, you have all the information you need to read it...but also all the information to write it. How do you deal with that?

My awkward answer was to lay out rules like this, looking for fields (e_xxx) in the binary:

    4 skip ; e_flags
    2 skip ; e_ehsize
    begin: here, 2 skip (handler 'e_phentsize 2)
    begin: here, 2 skip (handler 'e_phnum 2)
    begin: here, 2 skip (handler 'e_shentsize 2)

"Uninteresting" data was skipped. "Interesting" data would laboriously mark the start position, skip to the end position, and then call a handler with the name of the field and the size. If in write mode, that handler would CHANGE the span to the encoding of a variable with that name. If in read mode, it would decode into that variable.

I was planning on writing a higher-level abstraction for this. But I didn't.

The handler looks like:

handler: func [name [word!] num-bytes [integer!]] [
    either mode = 'read [
        let bin: copy/part begin num-bytes
        set name debin [(either endian = 'little ['le] ['be]) +] bin
    ][
        let val: ensure integer! get name
        change begin enbin [
            (either endian = 'little ['le] ['be]) + (num-bytes)
        ] val
    ]
]
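For comparison, here is a rough Python analogue of that handler, using `struct` (the function and variable names here are illustrative, not part of %encap.reb): a single field description drives both reading and writing.

```python
import struct

def handle(mode, fields, buf, name, offset, num_bytes, endian="little"):
    # One description of a field serves both directions:
    # "read" decodes the bytes into fields[name];
    # "write" re-encodes fields[name] back into the buffer.
    fmt = ("<" if endian == "little" else ">") + {2: "H", 4: "I"}[num_bytes]
    if mode == "read":
        (fields[name],) = struct.unpack_from(fmt, buf, offset)
    else:
        struct.pack_into(fmt, buf, offset, fields[name])

fields = {}
buf = bytearray(b"\x00\x38\x00\x00")
handle("read", fields, buf, "e_phentsize", 0, 2)   # decode field
fields["e_phentsize"] = 56                         # tweak it
handle("write", fields, buf, "e_phentsize", 0, 2)  # re-encode in place
```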

I guess a question to answer is whether PARSE should have any particular facilitation of bidirectional specification of a rule. It might be able to pick values out of a buffer, or build a buffer from values, or just tweak particular values in an existing buffer.

This may be a job for a higher-level dialect. I'll also point out the TLS EMIT dialect (and Oldes' take on it) for something to consider that's related.

Plenty of Room for Improvement

I don't know what the timeline is for making encap better, but now that it has at least some testing, maybe it will be a good place to try new ideas.


So UPARSE is now being used in %unzip.reb for BINARY! parse, and it's leveraging the improvements that I talk about here.

It's not just a switch to UPARSE; it also fixes up some code Giulio added to solve a problem with streaming zips that rebzip couldn't decode. So it's more robust, has lots of comments, and is better factored. Hence even though the UPARSE features make the expressions shorter, the file got longer, because it's doing more (and saying more).

I went ahead and changed SEEK to take a rule, as well. I still think we should think about the question of which combinators break the rules and do things like quote variables and such. But I actually did hit a case here of wanting to be able to seek a calculated expression, so SEEK of a GROUP! was important...and so long as I was needing that I figured it might as well use a rule vs. doing something weird and quoting groups.


Sidenote on various frustrations: "ZIP: How not to design a file format"
