"Bincode"

'Bincode' is the working name I've given to a pair (well, trio) of functions designed to work with binary formats. I know my general approach here isn't altogether unique, but I think the implementation has some qualities of its own that I, at least, have found endearing.

Consume (+ Advance)

CONSUME takes a reference to a BINARY! and a named binary datatype (such as 'SIGNED-32, 'FLOAT-64, etc.) or an INTEGER!, returns that value (or errors out if there's not enough input), and updates the reference to point just past said value.

source: #{010003}
consume source 'signed-16
; returns 256, source => #{03}
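For cross-checking, the same read expressed with Python's struct module (big-endian byte order, matching the other examples in this thread):

```python
import struct

# a big-endian signed 16-bit read of the first two bytes, leaving the
# rest behind (mirrors how CONSUME updates the reference it was given)
source = bytes.fromhex("010003")
value = struct.unpack(">h", source[:2])[0]   # 0x0100 -> 256
remainder = source[2:]                       # b"\x03"
```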

Additionally CONSUME can take a BLOCK! in which shorthand functions for the various datatypes are available:

source: #{03010203}
values: collect [
    consume source [
        loop unsigned-8 [
            keep unsigned-8
        ]
    ]
]
; values => [1 2 3], source => #{}
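The same length-prefixed read can be sketched in Python — a loose analogy with hypothetical names, not the Bincode implementation itself:

```python
import struct

class BinaryInput:
    """Loose Python analogue of a CONSUME-style reader (hypothetical)."""
    def __init__(self, data: bytes):
        self.data = data

    def consume(self, fmt: str):
        size = struct.calcsize(fmt)
        if len(self.data) < size:
            raise ValueError("not enough input")
        (value,) = struct.unpack(fmt, self.data[:size])
        self.data = self.data[size:]   # advance past the consumed bytes
        return value

source = BinaryInput(bytes.fromhex("03010203"))
count = source.consume(">B")                        # unsigned-8 length prefix
values = [source.consume(">B") for _ in range(count)]
# values == [1, 2, 3]; source.data == b""
```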

ADVANCE simply skips a given INTEGER! amount.

Accumulate

ACCUMULATE goes in the reverse direction using the same shorthand functions:

accumulate #{} [
    utf-8 65
    utf-8 8212
    float-64 pi
    repeat x 3 [
        unsigned-8 x
    ]
]
; => #{41E28094400921FB54442D18010203}
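To sanity-check that output, here is the same accumulation written with Python's struct module (UTF-8 encoding for the codepoints, big-endian for the double):

```python
import math
import struct

out = bytearray()
out += chr(65).encode("utf-8")       # utf-8 65   -> "A" (41)
out += chr(8212).encode("utf-8")     # utf-8 8212 -> em dash (E28094)
out += struct.pack(">d", math.pi)    # float-64 pi (400921FB54442D18)
for x in (1, 2, 3):
    out += struct.pack("B", x)       # unsigned-8 x
```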

This snippet is another pass at creating a single-pixel PNG image (with an assist from an R2 DEFLATE wrapper):

chunkify: func [
    target [binary!]
    header [word!]
    data [binary!]
    /compress
][
    header: as-binary form header

    if compress [
        data: deflate/envelope data 'zlib
    ]

    accumulate target [
        unsigned-32 length? data
        accumulate header
        accumulate data
        accumulate crc32-checksum-of join header data
    ]
]

probe accumulate png: #{} [
    accumulate #{
        89504E47
        0D0A1A0A
    }

    chunkify png 'IHDR #{
        00000001
        00000001
        08 02 00 00 00
    }

    chunkify/compress png 'IDAT #{
        00 CC0000
    }
    
    chunkify png 'tEXt join #{} [
        "Title" null "Single Pixel!"
    ]

    chunkify png 'IEND #{}
]
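The same chunk layout can be sketched in Python with zlib — a rough analogue of CHUNKIFY above, not the Rebol code itself (and the exact compressed IDAT bytes may differ between DEFLATE implementations):

```python
import struct
import zlib

def chunkify(target: bytearray, name: bytes, data: bytes, compress: bool = False):
    # a PNG chunk is: big-endian length, 4-byte type, data, CRC-32 of type + data
    if compress:
        data = zlib.compress(data)   # zlib-enveloped DEFLATE
    target += struct.pack(">I", len(data))
    target += name
    target += data
    target += struct.pack(">I", zlib.crc32(name + data))

png = bytearray(bytes.fromhex("89504E470D0A1A0A"))   # PNG signature
chunkify(png, b"IHDR", bytes.fromhex("00000001000000010802000000"))
chunkify(png, b"IDAT", bytes.fromhex("00CC0000"), compress=True)
chunkify(png, b"tEXt", b"Title\x00Single Pixel!")
chunkify(png, b"IEND", b"")
```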

Notes

Fairly sure this is bait for a lot of where Ren-C has gone with uparse/streaming/ports etc. That's fine; what I'm looking for is a vocabulary that stretches over a handful of common file formats (including Zip). I don't necessarily think this is a silver bullet, though it has worked well enough in that domain to make adjustments and retain readability. It doesn't seem too much of a stretch to consider it sitting over a stream, or even a compressed stream, too.


Related: ENBIN and DEBIN:

BINARY! Dialected Encoding/Decoding instead of TO conversions?

They make you be explicit about the endianness, and I thought using +/- to indicate signedness was cute.

If I get the gist of your rethink (?), it seems a bit like you are thinking that reinventing basic primitives in each dialect is a losing battle... because you'll miss some. So you want to inherit all the control-flow types of operations by default vs. reinventing them in each dialect.

For other-language inspiration on this front: Haskell doesn't quite do that, but offers a certain reusable form of operations that can be used in its "dialects" (monads, which you might think of as being dialect-like, in some school of thought).

Haskell's equivalent of REPEAT for use in parser combinators is the monadic operator replicateM:

https://hackage.haskell.org/package/base-4.14.0.0/docs/Control-Monad.html#v:replicateM

You might think of it like this: some forms of REPEAT know they have state from the "dialecting context" that they need to tunnel through to their bodies, while a more fundamental form lacks that context.
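A toy Python sketch of that idea (hypothetical names, just to illustrate the shape): a replicate that tunnels parser state through each run of its body, analogous to replicateM over a stateful step:

```python
def replicate(n, step):
    # run a stateful parsing step n times, tunneling the input
    # position (the "dialecting context") through each call
    def repeated(data):
        values = []
        for _ in range(n):
            value, data = step(data)
            values.append(value)
        return values, data
    return repeated

# a step takes input and returns (value, remainder)
uint8 = lambda data: (data[0], data[1:])

values, rest = replicate(3, uint8)(b"\x01\x02\x03\x04")
# values == [1, 2, 3], rest == b"\x04"
```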

Fairly sure this is bait for a lot of where Ren-C has gone with uparse/streaming/ports etc.

The merging of UPARSE with evaluator services is definitely a thing, because they have so much in common...like "current expression" for telling you where an error is, feeding along the code.

So I see the likelihood that UPARSE is one of those "parameterized evaluators" I was speaking of.

But I also was thinking of implicitly allowing any function with a certain signature to act as a combinator... if it takes an input and has a named output saying how much it consumed, then you'd have an instant combinator with one fewer parameter (the input becomes implicit).

This made me think maybe ENBIN and DEBIN could be those kinds of functions.
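That signature convention could be sketched like this (hypothetical names, not Ren-C's actual API): any function that takes an input and returns a value plus remainder is usable as a combinator, with a driver filling in the input from the current position and advancing by the remainder:

```python
def uint8(data: bytes):
    # takes input, returns (synthesized value, remainder)
    return data[0], data[1:]

def run(steps, data: bytes):
    # driver: fills in each step's input from the current position
    # and updates the position from the remainder the step reports
    results = []
    for step in steps:
        value, data = step(data)
        results.append(value)
    return results, data

values, rest = run([uint8, uint8], b"\x01\x02\x03")
# values == [1, 2], rest == b"\x03"
```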


As there is a fair amount of overlap between the primitives each format uses, and since a lot of these formats work together, it seems inefficient to reinvent them for each one. Where you get quirky types, such as dates in Zip, I start with the primitives and work up from there.

Indeed, I have a less mature version of this module for R3C that uses them.

One consideration for ENBIN/DEBIN (that I'm not sure has been addressed since versions I'm familiar with) is whether they could work without copying from the source. Perhaps that's overkill for such small units of data, I don't know.

A combinator would require it (e.g. a combinator has to report how much of the input it consumed in the process of producing its synthesized result). I should add it.

The combinator calls the secondary output "remainder". TRANSCODE calls it "next", though the variables it lands in tend to be named rest.

The idea I had was that if the signature on a function had something called input and then this remainder/rest then UPARSE would be willing to call it...filling in input from the current position, and updating the position based on the remainder.

Here's a semi-related note on the subject of copying and INTO


Do you have tests for these?

Rebol2 can run under 64-bit Windows, so it would be possible to use a GitHub Actions runner to test them.

It's possible to make multiple workflows per GitHub repo. I typically do things fairly granularly...so each script gets its own repository to have a separate issue tracker, and then I use the workflows for different platforms or other variations.

But...if you wanted, you could also have a monolithic Scripts repository and then make separate workflows for Bincode, Zip, Pdf. (Though if it were me, I'd make r3c-bincode, r3c-zip, r3c-pdf repositories...)

What I'd like to do is to port these experiments to run under emulated Redbol. Having a smattering of tests would be a big help in that.

I do, but my test framework is still a little in flux (I have a concept for a testing format that I haven't yet put into code, but for another day). Is there a particular format that would work best?

I've had people suggest one-repo-per-script before. I'm not necessarily against it; it just seems a lot harder to track, and doesn't reflect well how interdependent these scripts are as small parts of a larger whole.

Working in separate modules for development, and then snapshotting copies of the files in a master repository, seems a good compromise if you're trying to have one-repo-to-rule-them-all in a known good state for deployment.

But in terms of breaking it all the way to one script per repository, maybe that's excessive, and it should just be r3c-zip and r3c-pdf, where common files pick one home or the other.

I'm still using the super-basic "just write expressions that return true" format, as with Saphirion tests.

Though one interesting concept you might have missed is that the Ren-C test suite is now a client of the "make your own module" syntax: it isolates all test files into modules, and then within each file everything in a BLOCK! is made into a module.

So due to Sea of Words you can run all the thousands of tests making hundreds of these modules, and nothing winds up added to the user context.

But that's really the only area of innovation there. ISSUE! and URL! are allowed as literals to serve as comments without needing semicolons.