Archiving with POSIX utilities

The usual answer is tar. As you may have noticed, I intentionally linked to GNU Tar. If you are a *BSD user, then you use some other implementation. Both follow and extend the POSIX standard for the tar utility. Or so you would think.

Right now there is no POSIX tar utility. It was marked as legacy as early as 1997 and disappeared from the standard soon after. Its place was taken by a behemoth called pax. The name gets even funnier when you consider the rationale and the size of this thing. But pax didn't come from tar alone. There was one more influence here, called cpio. You may know this one if you have ever tinkered with RPM packages or initramfs.

In other words, we have three utilities on today's table: tar, cpio and pax. According to Debian's popularity contest, their install frequency follows the exact same order, with tar in 8th place overall, cpio in 52nd, and pax in 6089th. I can't just talk about the least popular one, so I'll briefly explain how to use each of them in your usual Linux distribution while keeping in mind what POSIX had to tell us back in the day.

tar

As I've already mentioned, tarballs are the most popular. Not only that, they are commonly described as the easiest to use, although the interface is something you can find jokes about. All operations on tarballs are handled by the single tar utility.

Let's go through three basic operations: creating an archive, listing its content, and extracting it. Tar expects its first argument to match this regular expression: [rxtuc][vwfblmo]*. The first part is the function, and the second is a set of modifiers. I'll focus only on those necessary to accomplish the aforementioned tasks.

To create an archive, run:

$ tar cf ../archive.tar a_file a_directory

This creates an archive in the parent of the current working directory, containing a_file and, recursively, a_directory. Let's map every part of the command for clarity:

tar
Call tar
c
Create an archive
f
Use the first argument after cf as the path to the archive
../archive.tar
Path to the archive (without f it would be treated as another file to include in the archive)
a_file a_directory
Files to include in the archive

Now that you have an archive, you can see its content:

$ tar tf ../archive.tar
a_file
a_directory/
a_directory/another_file

As you have probably guessed, the t function writes the names of the files in the archive. f works exactly the same way: the first argument after tf points to the archive file.

To extract everything from the archive, run:

$ tar xf ../archive.tar

Or add more arguments to extract selected files:

$ tar xf ../archive.tar a_file

This one will extract only a_file from the archive.
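
If you want to try all three operations without touching your real files, here is a minimal sketch of a round trip in a scratch directory. The directory layout and file contents are made up for the demo, and mktemp -d is assumed to be available (it is not part of POSIX, but practically every system has it).

```shell
#!/bin/sh
# Round trip in a throwaway directory; every path here is made up for the demo.
set -e
work=$(mktemp -d)
mkdir -p "$work/src/a_directory" "$work/out"
echo "hello" > "$work/src/a_file"
echo "world" > "$work/src/a_directory/another_file"

# Create the archive next to src/, just like ../archive.tar above.
cd "$work/src"
tar cf ../archive.tar a_file a_directory

# List the content, then extract a single member somewhere else.
listing=$(tar tf ../archive.tar)
cd "$work/out"
tar xf ../archive.tar a_directory/another_file

# Only the requested file (and the directory it lives in) shows up in out/.
extracted=$(cat a_directory/another_file)
echo "$listing"
echo "extracted: $extracted"
```

Note that the member name passed to x has to match the name stored in the archive, which is why it helps to run t first.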

That's pretty much it for tar. There are two more functions: r, which appends new files to an existing archive, and u, which adds a file only if it is not yet in the archive or has been modified since it was archived. Note that the usual compression options are not available in POSIX; they are extensions.
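
Here is a small sketch of the r function, plus one way to get compression without any tar extension: write the archive to standard output and pipe it through a compressor. It assumes that your tar treats - as "standard output/input" for f (GNU tar and bsdtar do) and that gzip is installed; gzip itself is not a POSIX utility either, so treat it as a stand-in for whatever compressor you have. File names are made up.

```shell
#!/bin/sh
set -e
work=$(mktemp -d)
cd "$work"
echo "one" > first
echo "two" > second

# Start with a single file, then append another one with r.
tar cf archive.tar first
tar rf archive.tar second
members=$(tar tf archive.tar)
echo "$members"

# No compression flag needed: "-" sends the archive to standard output,
# and the compressor does the rest.
tar cf - first second | gzip > archive.tar.gz

# Reading works the same way in reverse.
gzip -dc archive.tar.gz | tar tf - > compressed_listing
cat compressed_listing
```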

cpio

Heading off from the usual routes we encounter cpio. It's a more frequent sight than pax, but it is still quite niche compared to tar's omnipresence. Frankly, I like this one the most because of the way it handles the input of file lists. Sadly, this also makes it slightly bothersome to use.

Now, now, cpio operates in three modes: copy-out, copy-in and pass-through. Our goals are still the same: to create an archive, list the files inside, and extract it somewhere else. For that we'll only need the first two modes.

To create an archive, use the copy-out mode, as in: copy to the standard output:

$ find a_file a_directory | cpio -o >../archive.cpio

By now you have probably noticed that cpio doesn't accept files as arguments. In copy-out mode it expects a list of files on standard input and writes the formatted archive to standard output. Here is a more or less step-by-step explanation:

find a_file a_directory |
List files, directories and their content from arguments and pipe the output to the next command
cpio
Call cpio (duh!)
-o
Use copy-out mode
>../archive.cpio
Redirect standard output of cpio to a file
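
Because the file list comes from standard input, anything that can print file names can decide what goes into the archive. A minimal sketch, assuming cpio is installed (the script stops early if it is not) and using made-up file names: only the .txt files end up in the archive.

```shell
#!/bin/sh
set -e
# Stop early when cpio is not installed.
command -v cpio >/dev/null 2>&1 || { echo "cpio not found"; exit 0; }

work=$(mktemp -d)
mkdir -p "$work/src/a_directory"
cd "$work/src"
echo "notes" > a_directory/notes.txt
echo "blob" > a_directory/image.bin

# find decides what goes in; cpio only packs what it is told to.
find a_directory -type f -name '*.txt' | cpio -o > ../text_only.cpio 2>/dev/null

# List the result; the block count goes to standard error, so it is dropped here.
listing=$(cpio -it < ../text_only.cpio 2>/dev/null)
echo "$listing"
```

With tar you would need an implementation-specific option or some xargs gymnastics to get the same effect.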

You now have an archive file called archive.cpio in the parent directory. To see its content type in:

$ cpio -it <../archive.cpio
a_file
a_directory
a_directory/another_file
1 block

Nice! What's left is extraction. You do it with copy-in mode like this:

$ cpio -i <../archive.cpio
1 block

Huh? What's that? Listing files and extracting both use copy-in mode? That's right. Just as "copy-out" means "copy to standard output", "copy-in" can be understood as "copy from standard input". The t option prevents cpio from writing or creating any files; the archive is still read from standard input, but it is translated into a list of files on standard output. Some extended implementations let you use t as the sole option and imply copy-in mode.

You can also use patterns when extracting to select files:

$ cpio -i a_file <../archive.cpio
1 block

You can extract nested files if you use the d option:

$ cpio -id a_directory/another_file <../archive.cpio
1 block

This option tells cpio that it's allowed to create directories whenever it is necessary.

pass-through

Bonus! Pass-through mode can be used to copy the files listed on standard input to a specified directory. It doesn't create an archive at all.

$ ls ../destination
$ ls
a_directory  a_file
$ find a_file a_directory | cpio -p ../destination
0 blocks
$ ls ../destination
a_directory  a_file
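
The d option from the copy-in section is useful in pass-through mode as well. In the sketch below (made-up file names, cpio assumed to be installed) find lists only regular files, so a_directory itself never reaches cpio and has to be created on the destination side.

```shell
#!/bin/sh
set -e
# Stop early when cpio is not installed.
command -v cpio >/dev/null 2>&1 || { echo "cpio not found"; exit 0; }

work=$(mktemp -d)
mkdir -p "$work/src/a_directory" "$work/destination"
cd "$work/src"
echo "data" > a_directory/another_file

# Only regular files are listed; d lets cpio create the missing a_directory.
find a_directory -type f | cpio -pd ../destination 2>/dev/null

copied=$(cat ../destination/a_directory/another_file)
echo "$copied"
```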

pax

Finally, at the destination! This one lives up to the name of this post as it's still part of POSIX. The fun part is that you probably don't even have it installed, but don't worry, I didn't have it until like two days ago. It truly feels like a compromise forced on you and your siblings by your parents. Jokes aside, I actually started to like it, bulky but kind of cute.

Anyway, let's see what this coffee machine can do for us; same goals as previously. This will be confusing, because this utility is a compromise, and so it supports both usage styles: tar-like and cpio-like.

To create an archive you can use either:

$ pax -wf ../archive.pax a_directory a_file
$ find a_file a_directory | pax -wd >../archive.pax
$ find a_file a_directory | pax -wdf ../archive.pax

They are equivalent. You can mix the styles as much as you want; as long as it doesn't become a mess, it's quite handy. As for which option does what:

-w
Indicates that pax will act in write mode (tar's c and cpio's -o)
f ../archive.pax
The argument after f is the path to the archive; note that it behaves slightly differently compared to tar: it always takes the next argument instead of the first path that appears after the flags. It means you can't put any options between -f and the path.
a_directory a_file
find a_file a_directory |
Both of these accomplish the same goal of letting pax know what files should be in the archive. They are mutually exclusive! If there is at least one argument pointing to a file, then standard input is not supposed to be read.
d
This one is used to prevent recursively adding files that are in a directory, so that the behaviour is the same as in cpio:
$ find a_directory a_file | pax -wvf ../archive.pax
a_directory
a_directory/another_file
a_directory/another_file
a_file
pax: ustar vol 1, 4 files, 0 bytes read, 10240 bytes written.
$ find a_directory a_file | pax -wvdf ../archive.pax
a_directory
a_directory/another_file
a_file
pax: ustar vol 1, 3 files, 0 bytes read, 10240 bytes written.

The v option increases the verbosity of the output written to standard error. You can find similar functionality in most command line utilities, including tar and cpio.

To list the files in an archive, you can again use both styles:

$ pax <../archive.pax
a_directory
a_directory/another_file
a_file
$ pax -f ../archive.pax
a_directory
a_directory/another_file
a_file

Yes, that's the default behaviour of pax and you don't need to specify any argument (in case of cpio-like style). Sweet, isn't it?

To extract the archive use one of:

$ pax -r <../archive.pax
$ pax -rf ../archive.pax

For selecting files to extract use the usual patterns:

$ pax -r a_file -f ../archive.pax
$ pax -r a_directory/another_file <../archive.pax

That's all for the most basic use cases. There's more; for instance, pax supports a mode similar to the pass-through mode we already know from cpio. But there is something more important to mention about pax: it's supposed to easily support various formats.

POSIX says that pax should support the pax, cpio and ustar formats. I installed GNU pax and it seems to support: ar, bcpio, cpio, sv4cpio, sv4crc, tar and ustar. The default format for my installation is ustar, as you have probably noticed in the verbose output in one of the examples above. The pax format is an extension of ustar, which is most likely why it's usually omitted.

You can select the format with the -x option; for supported formats please refer to your manual. Also note that explicitly specifying the format should only be needed when writing an archive. When reading, pax can identify the archive's format on its own:

$ find a_file a_directory | cpio -o >../archive.cpio
$ pax -vf ../archive.cpio
-rw-rw-r--  1 ignore   ignore    0 Jul 22 22:30 a_file
drwxrwxr-x  2 ignore   ignore    0 Jul 22 22:30 a_directory
-rw-rw-r--  1 ignore   ignore    0 Jul 22 22:30 a_directory/another_file
pax: bcpio vol 1, 3 files, 512 bytes read, 0 bytes written.
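
To round things off, here is a rough sketch of -x in action together with the pass-through-like mode mentioned earlier, which pax calls copy mode (-r and -w together, with the destination directory as the last argument). It assumes your pax accepts ustar and cpio as -x arguments, which POSIX requires but your manual should confirm, and the script stops early if pax is not installed. File names are made up.

```shell
#!/bin/sh
set -e
# Stop early when pax is not installed (quite likely, as mentioned above).
command -v pax >/dev/null 2>&1 || { echo "pax not found"; exit 0; }

work=$(mktemp -d)
mkdir -p "$work/src/a_directory" "$work/destination"
cd "$work/src"
echo "data" > a_directory/another_file
echo "more" > a_file

# Write the same files in two formats; -x only matters when writing.
pax -w -x ustar -f ../archive.ustar a_file a_directory
pax -w -x cpio -f ../archive.cpio a_file a_directory

# Reading needs no -x: the format is detected from the archive itself.
from_ustar=$(pax -f ../archive.ustar)
from_cpio=$(pax -f ../archive.cpio)
echo "$from_ustar"
echo "$from_cpio"

# Copy mode: no archive at all, files go straight to the destination.
pax -rw a_file a_directory ../destination
copied=$(cat ../destination/a_directory/another_file)
echo "$copied"
```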

Final thoughts

Now then, it's time to finally wrap it all up. There is nothing left to say except: remember to always check your manual. All of those utilities have various implementations that comply with POSIX to varying degrees. Don't be naive and don't get tricked by them. I find pax the most reliable of them, as its "novelty" and an interface that was quite "modern" from the start resulted in decently compliant implementations. Moreover, it includes nice things one may know from both cpio and tar. Find a moment to check it out!

Let's pretend that ar doesn't exist. Thank you.

boo!