Kaitai Struct Declarative Language for Binary Formats

This is an interesting declarative language in YAML designed to generate classes for binary formats. There are many formats defined:

https://formats.kaitai.io/

That's a lot of cases...but as one instructive example, you can look at how it describes a ZIP file--with little snippets of code in it for extracting the values:

ZIP archive file format spec for Kaitai Struct

The C++ code generated is like this:

ZIP archive file: C++11/STL parsing library

But it can also be used to generate code for C#, JavaScript, Python, Ruby, Nim, PHP, Lua, Perl... (though we'd assume that if you escape code into the spec, that part will only be available for that language)

The regimentation of YAML forces the typical repetition of the "dialect". This is the same as the Rebol complaint about JSON--not really leveraging "parts of speech", but repeating tags over and over, like id: and type:

  - id: version
    type: u2
  - id: flags
    type: gp_flags
    size: 2
  - id: compression_method
    type: u2
    enum: compression

But it's still pretty hard to compete with, especially when you consider this is giving a compilable specification...so the performance is going to be much better.

I always thought BINARY! parse was something that Rebol would have a unique story for, and Ren-C's UPARSE makes that a stronger story (by allowing rules to synthesize arbitrary results via extraction)... but seeing this kind of stuff reminds me that there are diminishing returns.


@hostilefork,

This is an open, unsolved problem in computer science and engineering. Broadly speaking, the tools we have to speak abstractly about the binary that our Turing Machines actually use are incredibly limited. From source code to compiler to linker to loader to execution to debugger/disassembler/perf tools, we lose a lot of visibility into the behavior of the software, particularly with regard to data. Data today is treated as either static or executable. Some simple proofs that it is both: a variable-sized packet with a length field ahead of the payload; the sign bit of a signed integer; and the ARM instruction E3A0F010, which writes 0x10 to the Program Counter (PC). The data has meaning about what is to come. Without that schema, the data is meaningless noise. With it, it might become information to either the machine or the end user.
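The length-prefixed packet case can be made concrete with a small Python sketch (function name is illustrative):

```python
import struct

def parse_packet(buf: bytes) -> tuple[bytes, bytes]:
    """Split one length-prefixed packet off the front of buf.

    The first two bytes are a big-endian length field describing how
    many payload bytes follow -- the data ahead tells the reader how
    to interpret the data that comes after it.
    """
    (length,) = struct.unpack_from(">H", buf, 0)
    payload = buf[2:2 + length]
    if len(payload) != length:
        raise ValueError("truncated packet")
    return payload, buf[2 + length:]

# b"\x00\x03abcXYZ" -> payload b"abc", remainder b"XYZ"
```

Without the schema (the knowledge that the first two bytes are a length), those same bytes are just noise.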

Google Protobuf famously does not support specifying the binary format. I do not know why. Though I do know that no one has added the functionality which means that people who do hardware-level protocols for a living, like IEEE for Ethernet, don't care about the abstract benefits provided by Protobuf. Fair enough. Google tools are not for everyone.

Kaitai Struct is a Java-first approach to the binary decoding problem. The initial design was decode-only. Serialization was added later for Java and Python due to "overwhelming" demand, though it remains secondary in the design and has not been merged into mainline.

Further, C is not supported as a decode target.

The C ABI is the lingua franca, strategically significant for interop. Further, Dennis Ritchie and Ken Thompson wrote C to interface with digital systems. C is one of the more complete approaches to data. Hence its continued widespread use 53 years after its initial release in 1972.

Taking a strategic step back, what is the value of writing a dynamic YAML representation of binary protocols if it cannot write? Everything necessary to pack the structure is already described. All you need is to populate the data elements and execute the logical assembly. Further, the benefit of proving that the transform is lossless when inverted back to the original structures seems too good to pass up. To be fair, there are implementation simplifications available to a read-only design, but not enough to outweigh the benefit of serialization.
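That lossless-inversion property is cheap to check once both directions exist. A minimal Python sketch, with the field layout borrowed from the Kaitai snippet earlier in the thread (function names are illustrative):

```python
import struct

def pack_entry(version: int, flags: int, method: int) -> bytes:
    # Three little-endian u16 fields: version, flags, compression_method.
    return struct.pack("<HHH", version, flags, method)

def unpack_entry(data: bytes) -> tuple[int, int, int]:
    return struct.unpack("<HHH", data)

# The round-trip proof: unpack(pack(x)) == x.
fields = (20, 0, 8)
assert unpack_entry(pack_entry(*fields)) == fields
```

With only a decoder generated from the spec, this check cannot even be stated, let alone automated.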

Wrapping up, I want to encourage thinking and solutions for bridging the hardware-software divide, so that the magical registers whose bits change the world have their reads and writes performed by software in a human-comprehensible fashion.

When I work in this space, I see the need to describe the metadata of the binary representation: for each field, the bit length, the format (signed, unsigned, floating point, fixed point, and so on), a name/description, and the causal effect, expressed as a human-readable description or a callback function.

I've written an Extended Struct for Python which extends struct.pack/unpack() to handle arbitrary bit-length fields with the 'v' format character. '4v' is 4 bits. Assumes Big Endian. Works well enough for the little hacking I need to do for now. Still looking for a way to describe a binary format once, instead of writing the serializer/deserializer in each language that touches the data.
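I haven't seen the Extended Struct code itself, but a big-endian, arbitrary-bit-width unpack along those lines might look something like this sketch (the 'Nv' format handling here is my reconstruction of the idea, not the actual library):

```python
import re

def unpack_bits(fmt: str, data: bytes) -> list[int]:
    """Unpack big-endian fields of arbitrary bit widths.

    fmt is a sequence of 'Nv' tokens, e.g. '4v12v' means a 4-bit field
    followed by a 12-bit field. Hypothetical reconstruction of an
    Extended-Struct-style API, not the real thing.
    """
    widths = [int(n) for n in re.findall(r"(\d+)v", fmt)]
    bits = int.from_bytes(data, "big")
    total = len(data) * 8
    out, pos = [], 0
    for w in widths:
        shift = total - pos - w        # fields are packed MSB-first
        out.append((bits >> shift) & ((1 << w) - 1))
        pos += w
    return out

# unpack_bits("4v12v", b"\xAB\xCD") -> [0xA, 0xBCD]
```

The hard part, as you say, isn't the bit twiddling; it's avoiding rewriting this in every language that touches the data.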

Cheers,
Joe

Some might consider this more a proof of failure of the educational system vs. a proof of the primacy of C... :slight_smile:

...and so you've got some competing demands here, if you are talking about "each language"...

It's a super-common pattern in classical imperative programming, where humans are bad at writing forward and reverse transformations and maintaining that code.

I'm not sure you'll get "reversibility" from a specification if the language that interprets it isn't expressive enough in its type system to properly box and unbox everything under the right conditions relative to the spec. We've spoken about this before--about why FP manages to let you express things that aren't intrinsically reversible, yet lets you write your program in a way that sort of pretends they are... as if the reverse transformation exists, even though it doesn't. It never actually has to reverse anything; it just sifts out the meaningful ordering (in terms of things like bind and return, if monads--but not everything is a monad).

(Take the above imprecise language with a grain of salt.)

What you need are combinators, and to get those combinators you need to lean on a type system. Or failing that, something like Ren-C that is so paradigm-neutral that you wind up having to tailor it ad hoc--to the point where someone might ask why you are reinventing the wheel. I'm just trying to make it unusually easy for laypeople who want to play with a Minecraft-like programming model to do these things.

("We must not forget that the wheel is reinvented so often because it is a very good idea; I've learned to worry more about the soundness of ideas that were invented only once." --David Parnas)

@bradrn would know more than me about the specifics if you want to draw inspiration from Haskell. But if you're looking for the "right" answer here you need something "combinator-like". Maybe cereal or Data.Binary.Parser

Since my Haskell is weak I asked an AI to give an example, e.g. for ZIP, here's a bit of a suggestion:

data ZipEntry = ZipEntry {
    signature   :: Word32,
    compression :: Word16,
    filename    :: String,
    payload     :: ByteString  -- "data" is a reserved word in Haskell
} deriving (Show, Eq)

zipEntryFormat :: BinaryFormat ZipEntry
zipEntryFormat = do
    sig <- field "signature" (constant 0x04034b50)
    comp <- field "compression" word16LE
    fname <- field "filename" (lengthPrefixed word16LE utf8String)
    dat <- field "data" (remainingBytes)
    pure $ ZipEntry sig comp fname dat

Core mechanism something like this:

data BinaryFormat a = BinaryFormat {
    parseFrom :: ByteString -> Either Error (a, ByteString),
    serialize :: a -> ByteString
}

You can ask your nearest AI to explain how it works. But here's some fake C++:

// This is NOT real C++, but illustrates the concept
template<typename T>
class BinaryFormat {
public:
    virtual T parse(const std::vector<uint8_t>& bytes) = 0;
    virtual std::vector<uint8_t> serialize(const T& value) = 0;
};

// Each primitive must implement both directions
class Word16Format : public BinaryFormat<uint16_t> {
public:
    uint16_t parse(const std::vector<uint8_t>& bytes) override {
        return bytes[0] | (bytes[1] << 8);
    }
    std::vector<uint8_t> serialize(const uint16_t& value) override {
        return {uint8_t(value), uint8_t(value >> 8)};
    }
};

// Composing formats preserves bidirectionality
template<typename T1, typename T2>
class PairFormat : public BinaryFormat<std::pair<T1, T2>> {
    BinaryFormat<T1>* first;
    BinaryFormat<T2>* second;
    // ... implementation that uses both parse and serialize from sub-formats
};

So far, what we've done is more like Kaitai than it is like combinators:

But, as UPARSE has shown, Ren-C can tackle higher-order, combinator-like problems.

If you want an introduction to combinator-based thinking, this article is pretty good. :slight_smile:

Parsing Huge Simulated Streams In Attoparsec | AE1020: Lazy Notebook

Only just saw this, sorry…

I agree that combinators are the best approach here. Unfortunately, the AI code isn’t the way to do it: I’m almost certain that the BinaryFormat type is not a monad. It’s still possible to make a bidirectional parser/serialiser, but you have to be a bit clever to get it working. It looks like the key is to go through profunctors: see e.g. Kowainik’s article on tomland for details.

A simpler approach is to just get the user to write the parser and serialiser separately. This is slightly more error-prone, but not by much, especially since both of them can be given their own easy-to-use combinators. This is what cereal does, as well as the other popular packages binary and serialise.
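The write-both-directions-separately approach is easy to picture even outside Haskell. Here's a Python sketch, pairing the two hand-written functions so the round-trip can at least be property-tested (names are illustrative, not cereal's API):

```python
import struct
from typing import NamedTuple

class Entry(NamedTuple):
    compression: int
    filename: str

def put_entry(e: Entry) -> bytes:
    """Hand-written serialiser: u16 compression, u16 name length, name bytes."""
    name = e.filename.encode("utf-8")
    return struct.pack("<HH", e.compression, len(name)) + name

def get_entry(buf: bytes) -> Entry:
    """Hand-written parser: must mirror put_entry field by field."""
    comp, n = struct.unpack_from("<HH", buf, 0)
    return Entry(comp, buf[4:4 + n].decode("utf-8"))

# The two directions are maintained separately, but checking
# get_entry(put_entry(e)) == e keeps them honest.
e = Entry(8, "hello.txt")
assert get_entry(put_entry(e)) == e
```

The error-proneness is exactly the mirror-image maintenance of the two functions; the round-trip check catches most drift cheaply.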
