Backtrack: Zip

In contemplating what I'm looking to achieve with R3C, I've been backtracking over the ways in which I/we have approached working with common formats.

I've mused over Zip before and thought I'd revisit it first, this time in Rebol 2 (along with a core-friendly Deflate implementation).

Unpacking

The first goal here is versatility: not "unzip and be done", but rather isolating the atomic steps of retrieval so as to offer more control over the process. Thus:

archive: zip/load %archive.zip

Returns an object representing the archive and

entry: zip/step archive

will return the next entry (or none at the end). So far nothing has been decompressed, so this is a relatively cheap operation. Metadata for that entry, such as its filename and date, is readily available:

entry/filename
entry/date

As is the content itself (decompressed on demand):

content: zip/unpack entry

With these building blocks, it's easy enough to extract a whole archive or target specific files:

doc: zip/load %document.odt

while [
    file: zip/step doc
][
    if file/filename == %mimetype [
        probe zip/unpack file
        ; => "application/vnd.oasis.opendocument.text"
    ]
]
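
Extracting everything follows the same pattern. A minimal sketch (assuming entries are plain files and any needed directories already exist; WRITE/BINARY keeps the content byte-exact):

archive: zip/load %archive.zip

while [
    entry: zip/step archive
][
    write/binary entry/filename zip/unpack entry
]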

Packing

This presented a few conceptual challenges as to how one represents archives/entries under construction. Essentially an archive is a block of entry objects sharing the same structure as above (conceptually you could just throw those extracted entry objects into a block and pack them, though in writing this I don't recall if I implemented that; need to go back and look, would be cool if so). A speculative round-trip sketch follows the example below.

new-archive: reduce [
    zip/prepare %thing #{01234567}
]

write/binary %thing.zip zip/pack new-archive
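
And if extracted entry objects are indeed packable as-is, a round trip might look like this (speculative: it mirrors the API above, but per the note earlier it may not actually be implemented):

old: zip/load %document.odt
entries: copy []

while [
    entry: zip/step old
][
    append entries entry
]

write/binary %copy.zip zip/pack entries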

Again, a key here is to keep the steps atomized so they can be handled in different ways. One such way is a wrapper that provides shorthand contextual functions (in one sense a 'dialect', though it's just regular code).

write/binary %thing.zip zip/build [
    add-file %mimetype "application/x-rebol+zip"

    repeat count 3 [
        add-file join %thing- count #{01234567}
    ]

    add-comment "A possibly useful note"
]

Relevance

I don't think anything here is particularly revolutionary, only that it's a departure from the typical Rebol way of doing things that I think opens up some possibilities. This isn't an end point; there's a bit more digging to do.


I think it's a good approach (and a good tool for building higher-level zip dialects).

(Maybe it could be called ZIPPER or something like that to speak of its generality, and save ZIP for higher-level tools that could be very situationally specific to what people are doing in a particular project? Doesn't matter as everyone can redefine everything.)

I would put in a plug for wherever this goes supporting the creation of a URI! for generalized reading, e.g. extracting a single file, which I've mentioned before:

data: read zip://wherever/something.zip/folder/file.dat

But that would be built on top of this.

I've been skeptical of the arity-2 unzip, because I feel like the "unzip and be done" UNZIP should take a file/url/binary and dump it in the "current directory". (But the current directory should be able to live in memory, in a virtual filesystem...)

archive: zip/load %archive.zip

My first thought was "hey, that's kind of like a generator or yielder" (note: generators aren't merged into mainline builds yet, though the stackless abilities they depend on are).

So I thought "maybe it could be an 'unzipper'"

archive: unzipper %archive.zip  ; hmmm

But that makes ARCHIVE a function, which is hard to name. Your approach is probably better to have the archive separately represent the state, and then call operations on it.

That said, a generator/yielder might be useful in the internal implementation. It may be easier to write ZIP/STEP if your enumeration can stay in the middle of whatever loop it's in and YIELD the data at each step, then communicate with the stepper.
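
Roughly sketched, with GENERATOR/YIELD as proposed (READ-NEXT-ENTRY is a hypothetical helper that parses one entry's header and advances DATA):

stepper: generator [
    while [not tail? data] [
        yield read-next-entry data  ; pauses here until the next call
    ]
]

entry: stepper  ; each call resumes the loop at the YIELD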

write/binary %thing.zip zip/build

So zips can not only be read as streams, but also written as streams. Let me remark on something interesting about that, involving a bug fixed in Ren-C's unzip:

; NOTE: The original rebzip.r did decompression based on the local file
; header records in the zip file.  But due to streaming compression
; these can be incomplete and have zeros for the data sizes.  The only
; reliable source of sizes comes from the central file directory at
; the end of the archive.  That might seem redundant to those not aware
; of the streaming ZIP debacle, because a non-streaming zip can be
; decompressed without it...but streaming files definitely exist!

I guess since streamed zip writing came later in the lifetime of zip, there were a lot of files that had all the compressed sizes up front. Enough so that some decompressors (at least unzip.reb) presumed that whoever wrote the file was able to go back and patch the bytes at the earlier point in the file.
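
For illustration, a reader can spot a streamed entry from the general-purpose flags in its local header (a sketch in Rebol 2 terms; HEADER is assumed to be a binary positioned at the start of a raw local file header):

; offsets 0-3: signature #{504B0304}; 4-5: version; 6-7: flags
flags: first skip header 6  ; low byte of the little-endian flags word

either zero? and~ flags 8 [
    ; bit 3 clear: the sizes in this header can be trusted
][
    ; bit 3 set: a streamed entry, so the size fields here are
    ; zero and only the data descriptor/central directory is reliable
]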

Anyway, the point being that streaming zip writing is a thing. Since you're looking at more granular means of not putting everything in memory at once, that might be relevant to consider... in terms of how the ZIP/BUILD might be able to emit information to a port as it goes.

Of course that's all in the "we don't know quite how to do this"...but stackless and green threads are things I believe will play in big with the streaming puzzle.

I think in-memory is important when approaching Zip as a file format (ODT, ePub, etc.) as opposed to an archive, so having a skeleton that can go in different directions is key.

Yes, but (and this is not a showstopper), within ZIP/BUILD individual functions return the entry so it can be manipulated:

write/binary %thing.zip zip/build [
    mimetype: add-file %mimetype "application/x-rebol+zip"
    mimetype/date: now - 10
    mimetype/is-executable: true
]

I mean, sure: ADD-FILE could be refined to handle options. Just food for thought.
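
Visualized (purely hypothetical, just to show the shape such a refinement might take):

write/binary %thing.zip zip/build [
    add-file/options %mimetype "application/x-rebol+zip" [
        date: now - 10
        is-executable: true
    ]
]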

That is a good ability. I guess I'd have to review to see how many of these attributes can be deferred to the directory at the end of the stream... maybe all of them can?

If so, ADD-FILE could already have sent the file over a network or whatever (maybe even be done by the time it returned, if it was quick). Then if it keeps a memory of these metadata objects, it could write all your changes at the end...if it waited to write the directory until the very end of the build.

But anything that needs to be on the entry itself would have to be part of the input to ADD-FILE or you couldn't assume you could start streaming it yet.

(I haven't looked at zip in a while, so my knowledge is rusty.)

You can possibly get away with deferring everything to the index; however, I think it is considered best practice to have the index and entries match ("A better Zip Bomb" is my guardrail for this).


There are a few areas where this would apply, and I'm increasingly leaning toward taking the format name (noun) and using that as a namespace for all operations pertaining to that format (where that makes sense). I know that ZIP(/UNZIP) is familiar as a verb, but I'm tending to find that starting from there leads to burning such words on such a narrow subset.

I'd add DEFLATE(/INFLATE) as another data point: Deflate (the format) has more nuance to it than thing <-> compressed-thing. The current INFLATE verb, for instance, doesn't take advantage of Deflate's self-terminating quality; thus, while it meets the most common need, more words are required for subtler usage.

One solution could be to group such words in a single context, e.g. formats/zip/unpack. I don't know; it depends how messy that gets in code, I guess.
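
A rough sketch of that grouping (hypothetical names, only to show the shape):

formats: context [
    zip: context [
        load: func [source [file! url! binary!]] [
            ; ...load the archive here...
        ]
        unpack: func [entry [object!]] [
            ; ...decompress the entry here...
        ]
    ]
]

formats/zip/load %archive.zip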


Maybe encode zip and decode zip could be a literate way of asking for the operation, leaving zip.xxx available as a namespace...

(This puts the burden on ENCODE and DECODE for explaining exactly what all it is they bring to the table as verbs, and what the "value-add" is over zip.encode and zip.decode as plain functions...but articulating that "value-add" is needed system-wide for things like READ and such as well.)
