Filters are a means of incrementally transcoding data from source to brokered output (bytes/binary! or characters/text!). A goal is to provide a standard API for transcoding that can be implemented and used as efficiently as possible (e.g. extracting a portion of encoded data without extracting the whole; native transcoders for Deflate). A filter could conceivably be implemented as a distinct type that has object-like properties (as the PORT! type does) and could thus be acted upon by the appropriate Rebol actors (COPY/SKIP/NEXT/TAIL, etc.).
I've alluded to a similar idea in an earlier post; however, this concept is more focused on transcoding one series of numbers/characters to another. Filters are NOT scanners/tokenizers/lexers.
A filter source can be:
- BINARY! or TEXT! values
- PORT! values that stream BINARY! or TEXT! (including files/network resources)
- filter values (i.e. filters can be layered)
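To make the layering idea concrete, here is a minimal sketch in Python rather than Rebol, since no filter API exists yet. Every name in it (`BytesSource`, `HexDecode`, `Utf8Decode`, `read`) is hypothetical and only illustrates how one filter can pull from another on demand; it is not a proposed interface.

```python
import codecs


class BytesSource:
    """Innermost source: wraps an in-memory bytes value (hypothetical)."""

    def __init__(self, data: bytes):
        self.data = data
        self.index = 0

    def read(self, count: int) -> bytes:
        chunk = self.data[self.index:self.index + count]
        self.index += len(chunk)
        return chunk


class HexDecode:
    """Filter: pulls hex digits from its source, yields decoded bytes."""

    def __init__(self, source):
        self.source = source

    def read(self, count: int) -> bytes:
        # Two hex digits per decoded byte. bytes.fromhex raises ValueError
        # on invalid digits or a dangling odd digit.
        digits = self.source.read(count * 2)
        return bytes.fromhex(digits.decode("ascii"))


class Utf8Decode:
    """Filter: pulls bytes from its source, yields text."""

    def __init__(self, source):
        self.source = source
        self.decoder = codecs.getincrementaldecoder("utf-8")()

    def read(self, count: int) -> str:
        out = ""
        while len(out) < count:
            byte = self.source.read(1)
            if not byte:
                break  # underlying source exhausted
            # The incremental decoder returns "" until a full
            # character has been assembled.
            out += self.decoder.decode(byte)
        return out


# Text <- UTF-8 bytes <- hexadecimal digits <- raw bytes
layered = Utf8Decode(HexDecode(BytesSource(b"48c3a9212121")))
first = layered.read(2)   # only as much source as needed is consumed
rest = layered.read(10)   # whatever remains
```

Nothing is decoded until the outermost filter is asked for characters, and each layer consumes only as much of its source as it needs to satisfy the request.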
Examples of filter types:
- Retrieves binary contained within a file/network resource
- Decodes text encoded as UTF-8, UTF-16, ISO-8859-1, CP-1252, etc. (or even text of unspecified encoding, using something like Chardet to detect it)
- Decodes binary compressed per Deflate, LZW, etc.
- Decodes binary encoded as 'text' per Base64, Ascii85, Hexadecimal, etc.
- Decrypts binary encrypted per e.g. Rebol 2 ENCLOAK/DECLOAK (but obviously more)
- Decodes text encoded mostly literally but with escape sequences, e.g. JSON strings, Rebol strings, XML/HTML data sequences/attribute values
Filters should have at least the following capabilities:
- Copy all encoded data
- Copy part of the encoded data
- Skip part of the encoded data (Deflate could potentially iterate faster if it weren't emitting output at the same time)
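The three capabilities above could look something like the following sketch, again in Python and using a hexadecimal filter because it is the simplest case. `HexFilter`, `copy` and `skip` are assumed names for illustration only, loosely mirroring Rebol's COPY, COPY/PART and SKIP.

```python
class HexFilter:
    """Hypothetical filter decoding hexadecimal text to bytes on demand."""

    def __init__(self, source: str):
        self.source = source
        self.index = 0  # position within the encoded source

    def _available(self) -> int:
        # Number of whole decoded bytes left in the source.
        return (len(self.source) - self.index) // 2

    def skip(self, count: int) -> int:
        """Advance past up to `count` decoded bytes without emitting them.

        For hexadecimal this is pure arithmetic: nothing is decoded,
        so invalid digits in the skipped region go unnoticed.
        """
        skipped = min(count, self._available())
        self.index += skipped * 2
        return skipped

    def copy(self, part=None) -> bytes:
        """Decode `part` bytes, or everything remaining if `part` is None."""
        available = self._available()
        wanted = available if part is None else min(part, available)
        digits = self.source[self.index:self.index + wanted * 2]
        self.index += wanted * 2
        return bytes.fromhex(digits)


f = HexFilter("DECAFBAD00FF")
head = f.copy(2)      # copy part: decodes "DECA"
skipped = f.skip(2)   # skip: passes over "FBAD" without decoding
tail = f.copy()       # copy all: decodes the remaining "00FF"
```

The skip path shows why skipping deserves to be a distinct operation: the filter can move its index without building any output, which is the saving hinted at for Deflate.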
Filters should possibly have the following capabilities:
- BACK/HEAD/negative SKIP support
- TAKE/REMOVE/CLEAR as a means of clearing buffers
Functions that consume data should support filters as a pseudo-series type, e.g.
- Parse
- BINARY/READ (from Oldes/Rebol3)
- CONSUME (from rgchris/bincode)
Filter values are exhausted when:
- An end-of-content signal/delimiter has been found, e.g. in self-terminating formats such as Deflate, or at the quote mark ending a JSON string
- A filter cap has been reached, e.g. the filter was given a specified length
- An unrecoverable error occurs (e.g. invalid UTF-8 sequences in strict mode; the 'g' character in a hexadecimal stream)
- The source has been exhausted
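As a rough illustration of the four exhaustion conditions, the Python sketch below drains a hexadecimal source and reports why it stopped. The function name `drain_hex`, its parameters and the reason strings are all made up for this example; a real filter would presumably expose the reason as part of its state rather than as a return value.

```python
HEX_DIGITS = "0123456789abcdefABCDEF"


def drain_hex(source: str, cap=None, terminator='"'):
    """Decode hex pairs until exhausted; return (decoded bytes, reason)."""
    out = bytearray()
    index = 0
    while True:
        if cap is not None and len(out) >= cap:
            return bytes(out), "cap"        # specified length reached
        if index >= len(source):
            return bytes(out), "source"     # nothing left to read
        if source[index] == terminator:
            return bytes(out), "delimiter"  # end-of-content marker found
        pair = source[index:index + 2]
        if len(pair) < 2 or any(ch not in HEX_DIGITS for ch in pair):
            return bytes(out), "error"      # e.g. a 'g' in the stream
        out.append(int(pair, 16))
        index += 2


results = [
    drain_hex('4142"43'),        # stops at the quote mark
    drain_hex("414243", cap=2),  # stops at the cap
    drain_hex("41g2"),           # stops at the invalid digit
    drain_hex("4142"),           # stops when the source runs out
]
```

In every case the bytes decoded before the stopping point are still returned, so a consumer can decide for itself whether a partial result is usable.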
It should be possible to recover the current source at the corresponding index within a filter value, though this may require additional state info, e.g. in Deflate or Base64, where a byte within an encoding carries information pertaining to more than one decoded byte.
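Base64 makes the extra-state problem easy to see: each encoded character carries six bits, so after decoding a single byte the filter has consumed two characters and is holding four bits that belong to the next byte. The Python sketch below is one possible way to capture that, with the assumed names `Base64Filter`, `copy` and `state`; the state is the source index plus the undelivered bits.

```python
import base64

ALPHABET = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
            "abcdefghijklmnopqrstuvwxyz"
            "0123456789+/")


class Base64Filter:
    """Hypothetical Base64 filter whose position can be saved and resumed."""

    def __init__(self, source: str, index=0, bits=0, bit_count=0):
        self.source = source
        self.index = index          # next unread character in the source
        self.bits = bits            # bits read but not yet emitted
        self.bit_count = bit_count  # how many such bits there are

    def copy(self, part: int) -> bytes:
        out = bytearray()
        while len(out) < part:
            while self.bit_count < 8:
                if (self.index >= len(self.source)
                        or self.source[self.index] == "="):
                    return bytes(out)  # source exhausted or padding reached
                # ALPHABET.index raises ValueError on an invalid character.
                value = ALPHABET.index(self.source[self.index])
                self.bits = (self.bits << 6) | value
                self.bit_count += 6
                self.index += 1
            self.bit_count -= 8
            out.append((self.bits >> self.bit_count) & 0xFF)
            self.bits &= (1 << self.bit_count) - 1  # keep undelivered bits
        return bytes(out)

    def state(self):
        """Everything needed to resume from the same source elsewhere."""
        return self.index, self.bits, self.bit_count


encoded = base64.b64encode(b"Rebol").decode("ascii")
f = Base64Filter(encoded)
first = f.copy(1)             # one decoded byte, two characters consumed
saved = f.state()             # source index plus the four pending bits
resumed = Base64Filter(encoded, *saved)
remainder = resumed.copy(10)  # continues mid-character-group
```

The source index alone would not be enough here: resuming at index 2 without the four pending bits would misalign every byte that follows.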
Filter algorithms can be native (e.g. Deflate tied to Zlib, UTF-8) or in user-mode (thus extensible).