• Efficiently copying files to multiple destinations

    If you need to copy large files between two machines, combining nc (netcat) with pigz (parallel gzip) is an easy and fast option. But what if you need to copy the same set of files to multiple destinations? This is a common need here at Tumblr, for example when we need to spin up several additional replicas of a MySQL instance at once in a fast, automation-friendly way.
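
    For reference, the basic single-destination version looks something like this (port 1234 and the destination hostname are just placeholders):

    # On the destination box: listen, decompress, and extract
    nc -l 1234 | pigz -d | tar xvf -

    # On the source box: archive, compress, and send
    tar cv some_files | pigz | nc hostname_of_destination 1234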

    You could copy from a single source to two destinations serially, but that takes twice as long. Or you could copy to both destinations in parallel, but that isn’t any faster, because the two transfers contend for the source’s network uplink.

    Happily, there’s a better way, utilizing some standard UNIX tools. By adding tee and a FIFO to the mix, you can create a fast copy chain: each node in the chain saves the data locally while simultaneously sending it to the next server in the chain.

    First, set up the last destination box. This one just listens with nc (we’re using port 1234 in these examples), pipes the stream to pigz to decompress, and then to tar to extract:

    nc -l 1234 | pigz -d | tar xvf -

    Next, working backwards along the chain, set up the other destination boxes. We do the same listen/decompress/extract combo; but before the decompress/extract stages, we use the tee command to also redirect the output to a FIFO.  A separate shell pipeline reads from the FIFO and sends the data — which is already archived and compressed — to the next destination box in the chain:

    mkfifo myfifo                                  # create the FIFO the forwarding pipeline reads from
    nc hostname_of_next_box 1234 <myfifo &         # forward whatever lands in the FIFO to the next box
    nc -l 1234 | tee myfifo | pigz -d | tar xvf -  # receive, tee a copy into the FIFO, then decompress and extract locally
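
    This is easy to wrap in a small script. Here’s a rough sketch of what a relay-box helper might look like (the script name, variable names, and the cleanup step are just illustrative):

    #!/bin/bash
    # relay.sh (hypothetical): receive a compressed tar stream, extract it locally,
    # and forward an identical copy to the next box in the chain.
    set -e
    NEXT_HOST="$1"       # hostname of the next box in the chain
    PORT="${2:-1234}"    # port to listen on and forward to (1234 by default, as above)
    FIFO="$(mktemp -u)"  # pick an unused path for the FIFO

    mkfifo "$FIFO"
    nc "$NEXT_HOST" "$PORT" <"$FIFO" &                # forwarding pipeline reads the FIFO
    nc -l "$PORT" | tee "$FIFO" | pigz -d | tar xvf -
    wait                                              # let the forwarding nc finish draining
    rm -f "$FIFO"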

    Finally, on the source box, kick off the copy by sending the files to the first destination in the chain:

    tar cv some_files | pigz | nc hostname_of_first_box 1234
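
    To recap with a hypothetical three-box chain (source → replica1 → replica2), the commands run in this order, working backwards so that each listener is up before the box ahead of it starts sending:

    # On replica2 (the last box in the chain):
    nc -l 1234 | pigz -d | tar xvf -

    # On replica1 (the middle box):
    mkfifo myfifo
    nc replica2 1234 <myfifo &
    nc -l 1234 | tee myfifo | pigz -d | tar xvf -

    # On the source box:
    tar cv some_files | pigz | nc replica1 1234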

    In my testing, each box added to the chain incurred only a 3 to 10% performance penalty, which is a huge improvement over copying serially or in parallel from a single source.

    We’ve wrapped this technique in some Ruby code that automates the process and verifies every step. Besides using it to spin up database replicas, we also use it as part of our process to split shards that have grown too large — a topic we’ll delve into further in a future post.
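
    The Ruby tooling isn’t shown here, but as a rough sketch of the kind of verification involved, you could compare checksums of the source files and each destination’s copy once the transfer finishes (the file names below are just examples):

    # On the source box:
    find some_files -type f -exec sha256sum {} + | sort > source.sha256

    # On each destination box, after extraction (with source.sha256 copied over,
    # e.g. via scp):
    find some_files -type f -exec sha256sum {} + | sort > dest.sha256
    diff source.sha256 dest.sha256 && echo "copy verified"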