How to Compress Files in POSIX

This was quite an amusing one to read about. On one hand, the results kind of surprised me, but then, on the other... What exactly did I expect?

Anyway! How does one compress files in a POSIX-compliant system?

By the power of Hinchliffe's rule, I say: you don't. Wait, what kind of tutorial is this?

The standard way


POSIX defines three utilities related to compression, but let's focus on two of them: compress and uncompress. They have quite descriptive names and are incredibly simple to use. Just give them the names of the files to process and they will do the work. The result is stored in a *.Z file:

$ ls
archive.tar
$ compress archive.tar
$ ls
archive.tar.Z
$ uncompress archive.tar.Z
$ ls
archive.tar

By default, the input file is replaced by the output. This can be avoided with the -c option, which redirects the result to standard output:

$ compress -c archive.tar >archive.tar.Z
$ uncompress -c archive.tar.Z >archive.tar.bak
$ ls
archive.tar archive.tar.bak archive.tar.Z

And of course standard input can be used as well, with the special filename: -.
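For instance, a tar stream can be piped straight through compress without an intermediate file. This is only a sketch of what the standard describes - the directory name is a made-up example, and some implementations also accept plain stdin redirection without the - operand:

```shell
# Stream a tarball through compress; "src" is a hypothetical directory name.
# "-" names standard input, per the POSIX description of compress.
tar -cf - src | compress -c - > src.tar.Z

# And back: decompress to standard output, then list the archive contents.
uncompress -c - < src.tar.Z | tar -tf -
```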

So far it all looks good. That's because we're discussing an imaginary implementation of these utilities, exactly as they are described by the POSIX standard. However, that's not the real world.

The actual way

If you, like me, come from more of a Linux background, then prepare for disappointment. If you are coming from BSD, then I have great news for you: you can stop here, because your system actually implements the standard.


Instead of compress, most Linux distributions come with gzip(1) - usually GNU Gzip. The reason for that is, of course, legal work and patent issues. The full reasoning is covered in No GIF Files. However, this is all in the past now, because the LZW patents have already expired.

Let's put the story and the reasons aside. What we have is an inability to conform to a standard due to legal reasons, and this inability itself became a standard. And so on Linux systems you will end up using gzip, or xz(1), or bzip2(1), or really anything else:

$ ls
archive.tar
$ gzip archive.tar
$ ls
archive.tar.gz
$ gzip -d archive.tar.gz
$ ls
archive.tar

You can replace gzip with any of the mentioned utilities - they have very similar interfaces. Not only that, they are also partially compatible with the interface of compress defined by the POSIX standard. Each of them has an additional un* command (e.g., gunzip, unxz) that can be used instead of the -d option. If you feel adventurous, you could even try symlinking them in place of compress and uncompress (especially gzip, since it can also decompress the LZW-based .Z format).
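To make the equivalence concrete (the file name is arbitrary), the two spellings below do the same thing for gzip; xz and bzip2 follow the same pattern with unxz and bunzip2:

```shell
printf 'some data\n' > notes.txt
gzip notes.txt          # creates notes.txt.gz and removes notes.txt
gunzip notes.txt.gz     # identical in effect to: gzip -d notes.txt.gz
```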

This raises a question about format compatibility, but that's a comparison big enough to deserve its own article.

In the end, if you want to use POSIX compress in GNU/Linux - you don't. Unless...

The other way

Unless you use ncompress, which provides both compress(1) and uncompress(1). What's more, it descends directly from the original implementation. But there is one thing you need to know about it.

It's bad. Yes, a detailed comparison of compression algorithms is yet another huge and interesting topic, but this particular case really can be summed up as: it's bad. It's OK with text. At least it implements the POSIX standard and is most likely available in your distribution's repository.

Here are the results of compressing an arbitrary tarball, containing mostly source code and some resources, all done with default options:

Source      22M
bzip2       5.4M
gzip        5.9M
xz          2.6M
ncompress   9.1M

In other words, it's not terrible, but it lags behind (more) modern programs. This could also be an additional reason why it is not used, or even installed by default, in most Linux distributions. I didn't check BSD's implementation, but I expect rather good results.

The main takeaway from this article is that if you plan to write anything portable across POSIX-compliant or semi-compliant systems, then you need to give compression slightly more attention.