Concept: Filters

rgchris · January 5, 2023, 3:49am

Filters are a means of incrementally transcoding data from source to brokered output (bytes/binary! or characters/text!). A goal is to provide a standard API for transcoding that can be implemented and used as efficiently as possible (e.g. extracting a portion of encoded data without extracting the whole; native transcoders for Deflate). A filter could conceivably be implemented as a distinct type that has object-like properties (as the PORT! type does) and could thus be acted upon by the appropriate Rebol actors (COPY/SKIP/NEXT/TAIL, etc.).

I've alluded to a similar idea in an earlier post, however this concept is more focussed on transcoding one series of numbers/characters to another. Filters are NOT scanners/tokenizers/lexers.

A filter source can be:

BINARY! or TEXT! values
PORT! values that stream BINARY! or TEXT! (including files/network resources)
filter values (i.e. filters can be layered)

Examples of filter types:

Retrieves binary contained within a file/network resource
Decodes text encoded as UTF-8, UTF-16, ISO-8859-1, CP-1252, etc. (or even unspecified using something like Chardet)
Decodes binary compressed per Deflate, LZW, etc.
Decodes binary encoded as 'text' per Base64, Ascii85, Hexadecimal, etc.
Decrypts binary encrypted per e.g. Rebol 2 ENCLOAK/DECLOAK (but obviously more)
Decodes text encoded mostly literally but with escape sequences, e.g. JSON strings, Rebol strings, XML/HTML data sequences/attribute values

Filters should have at least the following capabilities:

Copy all encoded data
Copy part of the encoded data
Skip part of the encoded data (Deflate could potentially iterate faster if it wasn't emitting simultaneously)

Filters should possibly have the following capabilities:

BACK/HEAD/negative SKIP support
TAKE/REMOVE/CLEAR as a means of clearing buffers

Functions that consume data should support filters as a pseudo-series type, e.g.

Parse
BINARY/READ (from Oldes/Rebol3)
CONSUME (from rgchris/bincode)

Filter values are exhausted when:

An end-of-content signal/delimiter has been found e.g. self-terminating formats such as Deflate; quote marks ending a JSON string
A filter cap has been reached e.g. the filter has a specified length
An unrecoverable error occurs (e.g. invalid UTF-8 sequences in strict mode; the 'g' character in a hexadecimal stream)
The source has been exhausted

It should be possible to recover the current source at the corresponding index within a filter value though this may require additional state info, e.g. in Deflate or Base64 where a byte within an encoding has information pertaining to more than one decoded byte

Filter algorithms can be native (e.g. Deflate tied to Zlib, UTF-8) or in user-mode (thus extensible).

rgchris · January 5, 2023, 4:01am

I'll spitball samples/test cases in replies. The example from the ports/streams/iterator posts was a layered encoding where an compressed encoding is in turn encoded as text:

encoded: make filter! [
    ; implied type: 'text
    source: "start F3DI7!! end"
]

consume encoded "start"

encoded-ascii85: make filter! [
    type: 'ascii85
    source: encoded
]

encoded-deflate: make filter! [
    type: 'deflate
    source: encoded-ascii85
]

result: copy encoded-deflate

consume encoded " end"
=> true

Extracting from a large file

big-file: open/direct %big-file.txt

file-content: make filter! [
    ; implied type: 'stream
    source: big-file
]

copy/part skip file-content 1'000'000 25

As mentioned, ideally filter algorithms would have mechanisms to skip content without buffering, so you could do the following as fast as any other way of doing it:

big-file: open/direct %big-file.txt

file-content-as-utf-8: make filter! [
    type: 'utf-8
    source: make filter! [
        ; implied type: 'stream
        source: big-file
    ]
]

copy/part skip file-content 1'000'000 25
; skips 1'000'000 UTF-8 characters without copying/buffering

rgchris · January 5, 2023, 3:06pm

One interesting file format is that of Xara (XAR) vector images. They consist of an eight-byte header followed by a collection of records. Records consist of two little-endian 32-bit integers denoting the record type and size followed by the record content. Presumably easy to traverse—however: the wrinkle here is that there is one record type that indicates that subsequent records are encoded in a single Deflate stream that terminates with both the end of the Deflate stream and a record type that indicates the end of the compressed section.

This is a simplified way of looking at it (each tag representing a whole record):

---- regular byte stream ----
<8-byte header>
<title>
<metadata>
<start-compressed-section>
---- deflate compressed byte stream ----
<thing>
<thing>
...
<end-compressed-section>
---- regular byte stream ----
<thing>
<thing>
---- eof ----

This type of switch could be handled with filters.

file-content: make filter! [
    source: open/direct %image.xar
]

content: file-content

consume content #{... magic number ...}

until [
    type: consume content 'unsigned-32-le
    size: consume content 'unsigned-32-le

    switch type [
        types/start-compressed-section [
            content: make filter! [
                type: 'deflate
                source: file-content
            ]
        ]

        types/end-compressed-section [
            assert [
                tail? content
            ]

            content: file-content
        ]
    ]

    ... record dispatcher ...

    type == types/end-of-file
]

hostilefork · January 5, 2023, 6:16pm

Related threads:

I'm curious about this distinction... why wouldn't a filter that takes in UTF-8 and produces a block of transcoded data be covered here? What's the missing capability?

The umbrella term I tend to myself use above "Filter"/"Codecs"/"Ports" is "Streaming". This is because of the use of the term in C++ and in Haskell in the cases I find interesting. But even in Haskell it branches off into different models, such as Conduit, Pipes and Machines

I'm glad you're putting down some concrete examples and experimenting with them. Because if one stays fully general it gets hard to know what the limits are that would inform a useful design.

Maybe we could get examples worked up and see some contrast with how those would be done e.g. in C++ (with Boost.Asio) or in Conduit/Pipes/Machines/Streams of Haskell? I feel that really helped with UPARSE to drive the design from ideas that had already been worked out.

rgchris · January 5, 2023, 8:33pm

I think it's ultimately scope creep. In my proposal in the other thread, I'd conceived as such a mechanism as tokenizing as well, but in hashing out many of the data points I've been considering, it just became too convoluted. I feel in isolating binary/text it's really a clean separation that lends well to optimization and doesn't fall into the Codec black box that I think is ultimately problematic.

'Filter' is the term used in PDF and I think lends itself to this specific isolation. 'Streaming' would apply, but could also be saved for more appropriate use, such as the mechanism for moving data in from external sources. And one can apply filters to a stream...

I'm still following through on my line of inquiry here and see if I can get some success in streamlining some of the 'Codec' work I've been exploring. Certainly some of this thinking comes, again, from working a bit with Iterators in Javascript. I still think tokenization can be done in an iterative fashion as well, but that's a bit down the line.

I'm inclined to think this concept offers something new and quite compelling that I don't see in these alternatives (or at least, Pipes would seem similar but I find the examples quite opaque). The ability to work through layered encoding in an incremental manner that would only place in memory the essential parts of a stream, and in a way that is really quite minimal in terms of its impact on the language. The mechanics for obtaining the stream are where they should be and it encourages a style of handling incoming data in a way that is incremental, precise and relatively non-blocking—180º from the traditional Rebol family model if nothing else.

Of course, it's easy to say so. Need to actually flex it a bit to back that up. And I haven't got to PostScript yet...