Make COMPRESS/DECOMPRESS default to gzip

The "deflate" algorithm is pretty much the gold standard for lossless compression of generic data. However, it is very bare bones. Among the things left up to the container format is whether to record the length of the uncompressed data.

If you don't have the length up front, then it's a bit more expensive to do the decompression. Heuristics need to be used to guess how big a buffer to make to decompress into. So the buffers are either allocated bigger than they have to be (then you have to worry whether to return the data in an oversized allocation or to reallocate it and shrink it) or they have to undergo the cost of resizing as they go along.
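To make the cost concrete, here is a rough sketch in Python (not Rebol; its stdlib `zlib` and `gzip` modules are just convenient for illustration). The helper names `inflate_growing` and `gzip_declared_size` are made up for this example. The first inflates a raw deflate stream while growing the output a step at a time; the second reads the size that gzip records in its trailer, which a decompressor could use to allocate once.

```python
import gzip
import struct
import zlib


def inflate_growing(raw: bytes, step: int = 1024) -> bytes:
    """Inflate a raw deflate stream with no size hint, growing the output as we go."""
    d = zlib.decompressobj(-15)          # negative wbits: raw deflate, no header
    out = bytearray()
    data = raw
    while data:
        out += d.decompress(data, step)  # produce at most `step` bytes per call
        data = d.unconsumed_tail         # input that did not fit in this step
    out += d.flush()                     # whatever output is still pending
    return bytes(out)


def gzip_declared_size(gz: bytes) -> int:
    """Read ISIZE: the last 4 bytes of a single-member gzip file, little-endian.

    This is the uncompressed size modulo 2**32.
    """
    return struct.unpack("<I", gz[-4:])[0]
```

With the size known up front, the growth loop (and the question of what to do with an oversized buffer) goes away.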

Hence it's not surprising that Rebol's COMPRESS sticks a length onto the end of the zlib data... to make DECOMPRESS more efficient. But this trivial-seeming decision still constitutes a "container format". And it's one that competes with nearly the same thing... the gzip format.
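For comparison, a sketch of how close the two layouts are, again in Python for illustration. `legacy_rebol_compress` is a hypothetical helper, and the assumption that the appended length is a 32-bit little-endian integer is mine; check it against the actual implementation before relying on it.

```python
import gzip
import struct
import zlib


def legacy_rebol_compress(data: bytes) -> bytes:
    """zlib stream followed by a 4-byte length (assumed 32-bit little-endian)."""
    return zlib.compress(data) + struct.pack("<I", len(data) & 0xFFFFFFFF)


payload = b"hello hello hello hello"

legacy = legacy_rebol_compress(payload)
gz = gzip.compress(payload)

# The legacy format ends with just the length.
legacy_len = struct.unpack("<I", legacy[-4:])[0]

# gzip ends with a CRC-32 of the uncompressed data, then the length (ISIZE).
gzip_crc, gzip_len = struct.unpack("<II", gz[-8:])
```

Both formats finish with the uncompressed length in the last four bytes; gzip just happens to be documented and widely supported.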

I think it's good to include the length by default. But not good to deal in yet-another-file-format.

So I propose that we make COMPRESS and DECOMPRESS default to gzip. There will be zlib and raw deflate support, so it's not hard to support the old way in code that needs it. For decompression at least, a usermode compatibility ADAPT-ation could sniff the binary to see if it was gzip-like... and if not, trim off the length and do ordinary zlib decompression.
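The sniffing idea could look something like this, sketched in Python rather than as an actual ADAPT. `compat_decompress` is a hypothetical name, and the 4-byte little-endian length tail on the old format is an assumption. The sniff itself is safe: a gzip file starts with the bytes 1F 8B, while the low nibble of a zlib stream's first byte must be 8, so a zlib stream can never start with 1F.

```python
import gzip
import struct
import zlib

GZIP_MAGIC = b"\x1f\x8b"


def compat_decompress(data: bytes) -> bytes:
    """Accept either gzip or the old "zlib + length tail" format."""
    if data[:2] == GZIP_MAGIC:
        return gzip.decompress(data)

    # Old format (assumed): zlib stream followed by a 4-byte little-endian length.
    body, tail = data[:-4], data[-4:]
    (expected,) = struct.unpack("<I", tail)
    out = zlib.decompress(body)
    if len(out) != expected:
        raise ValueError("length tail does not match decompressed size")
    return out
```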

The other reasonable option would be to make COMPRESS and DECOMPRESS default to zlib (a barely-there container) or pure deflate/inflate, neither of which includes a length. Basically, whatever the default is, it shouldn't be an unnamed format that isn't published anywhere else.
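For anyone who hasn't looked at how thin these containers are, here is a small Python sketch producing all three from the same data. zlib's `wbits` parameter selects the wrapper; `deflate_with` is just a helper name for this example.

```python
import zlib


def deflate_with(data: bytes, wbits: int) -> bytes:
    """Compress `data`, letting `wbits` choose the container around deflate."""
    c = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION, zlib.DEFLATED, wbits)
    return c.compress(data) + c.flush()


payload = b"abc" * 1000

raw = deflate_with(payload, -15)  # raw deflate: no header, no checksum, no length
zl = deflate_with(payload, 15)    # zlib: 2-byte header + Adler-32 trailer, no length
gz = deflate_with(payload, 31)    # gzip: header + CRC-32 + length (ISIZE) trailer

# Decompression uses the same wbits convention.
assert zlib.decompress(raw, -15) == payload
assert zlib.decompress(zl, 15) == payload
assert zlib.decompress(gz, 31) == payload
```

Only the gzip variant carries the uncompressed length, which is the property being argued for above.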

I'm happy with defaulting to gzip - :thumbsup:

Sounds like a plan to me.

Yes, sounds good.

Good we seem to agree that having "Rebol format compression" only confuses things. But the next question I guess would be whether we want to use this opportunity to generalize COMPRESS and DECOMPRESS or not.

compress/method asdf 'deflate

compress [%abc.txt %def.txt] ;-- default to ZIP?

This means putting in a dispatcher of some kind. If we went with this, the bare bones deflate might make more sense by default, then gzip and gunzip could be specializing COMPRESS and DECOMPRESS with 'gzip.
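As a rough picture of the dispatcher-plus-specialization shape, here is a Python sketch, with `functools.partial` standing in for SPECIALIZE. The names `METHODS`, `compress`, and `gzip_bytes` are made up for the example, and defaulting to raw deflate is just the option floated above, not a decision.

```python
import gzip
import zlib
from functools import partial


def _raw_deflate(data: bytes) -> bytes:
    c = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION, zlib.DEFLATED, -15)
    return c.compress(data) + c.flush()


# Hypothetical method table: name -> encoder function.
METHODS = {
    "deflate": _raw_deflate,
    "zlib": zlib.compress,
    "gzip": gzip.compress,
}


def compress(data: bytes, method: str = "deflate") -> bytes:
    """Dispatch on the method name; bare deflate is the default here."""
    try:
        encoder = METHODS[method]
    except KeyError:
        raise ValueError(f"unknown compression method: {method!r}") from None
    return encoder(data)


# The "specialized" form: same dispatcher with the method pre-filled.
gzip_bytes = partial(compress, method="gzip")
```

The explicitly named function stays a one-liner, and the generic COMPRESS keeps the bare-bones default.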

I'm not really all that sure. I kind of like the idea that gzip and gunzip be explicitly named such, so that could shift the bias to COMPRESS and DECOMPRESS defaulting to deflate with no zlib header...

Main thing--again--killing the Rebol format (!)

How about falling in line with COMPRESS as used on Unix (file.Z) and X-internet (x-compress), which use LZW compression?

Example usage:

compress asdf    ;;  defaults to LZW 

compress/with asdf 'gzip

;; also with block you can add options like this
compress/with asdf [gzip level: "fast"]
compress/with asdf [gzip level: 5]
compress/with asdf [gzip level: "best"]

NB. This suggestion is not because I need LZW but more just synergy with COMPRESS history/nomenclature.
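To show what handling such an option block might involve, here is a small Python sketch. Everything in it is hypothetical: `compress_with`, the `(name, options)` tuple standing in for the block, and the mapping of "fast" and "best" to levels 1 and 9 (borrowed from the gzip command line's --fast and --best). Only gzip is wired up, since Python's stdlib has no LZW codec.

```python
import gzip

# Assumed meaning of the symbolic levels, following gzip's --fast / --best.
NAMED_LEVELS = {"fast": 1, "best": 9}


def compress_with(data: bytes, spec) -> bytes:
    """`spec` is either a codec name or a (name, options) pair."""
    if isinstance(spec, str):
        name, options = spec, {}
    else:
        name, options = spec

    if name != "gzip":
        raise ValueError(f"codec not supported in this sketch: {name!r}")

    level = options.get("level", 6)
    if isinstance(level, str):
        if level not in NAMED_LEVELS:
            raise ValueError(f"unknown level name: {level!r}")
        level = NAMED_LEVELS[level]

    return gzip.compress(data, compresslevel=level)
```

Usage would mirror the three forms above: `compress_with(data, "gzip")`, `compress_with(data, ("gzip", {"level": 5}))`, and `compress_with(data, ("gzip", {"level": "best"}))`.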

It's a sensible line of thinking, though Rebol is trying to commandeer generic words. One might wonder, then, whether encode {data} 'gzip is even more generic.

compress: specialize 'encode [codec: 'lzw]
gunzip: specialize 'decode [codec: 'gzip]
zip: specialize 'encode [codec: 'zip]

The question is whether there's already a big enough can of worms with codecs (providing options to those codecs, registration/unregistration, detection of file extensions) that we could just bring compression under the umbrella of encoding/decoding.

Hm. Well it's a thought.

It's a good thought so if it's doable then it gets a :thumbsup: from me!

So potentially this is how it would work then:

load %file.Z                    ;; uncompress (LZW)
save %file.gz "xxxx"            ;; gzip
load/type %file 'gzip           ;; forces gunzip

So where the compression codec has configurable parts then something like this may be needed?

save/type %file "xxxx" [gzip level: 5]
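A sketch of the extension-detection part, in Python for illustration only. The `save` and `load` functions, the `CODECS` and `EXTENSIONS` tables, and the `codec` argument (standing in for /type) are all hypothetical, and the `.zz` extension for zlib is an arbitrary choice for the example. LZW and `.Z` are left out because Python's stdlib has no codec for them.

```python
import gzip
import zlib
from pathlib import Path

# Hypothetical registry: codec name -> (encode, decode).
CODECS = {
    "gzip": (gzip.compress, gzip.decompress),
    "zlib": (zlib.compress, zlib.decompress),
}

# Hypothetical extension table used when no codec is forced.
EXTENSIONS = {".gz": "gzip", ".zz": "zlib"}


def _pick(path, codec):
    """An explicit codec wins; otherwise fall back to the file extension."""
    name = codec or EXTENSIONS.get(Path(path).suffix)
    if name is None:
        return (lambda b: b, lambda b: b)  # no codec found: pass bytes through
    return CODECS[name]


def save(path, data: bytes, codec=None) -> None:
    encode, _ = _pick(path, codec)
    Path(path).write_bytes(encode(data))


def load(path, codec=None) -> bytes:
    _, decode = _pick(path, codec)
    return decode(Path(path).read_bytes())
```

Here `save("file.gz", b"xxxx")` would pick gzip from the extension, and `load("file", codec="gzip")` would force it. Codec options like a level would need the registry entries to accept them, which is where it starts to overlap with the general codec design.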

ref: trello - Encodings used by LOAD and SAVE are based on FUNCTION!s that can be either native or otherwise, allowing usermode "codecs"