Hyper-Zipping With pigz

pigz is a seriously fast compression utility: it scales in parallel across every thread a machine has available. It is not, however, designed to be easy to use. Installing it is simple:

sudo apt install pigz -y
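
For a single file, pigz behaves like a parallel drop-in for gzip. A quick smoke test (the filename is just a placeholder):

pigz -k -1 somefile.log    # compress, keep the original (-k), fastest level (-1)
pigz -d somefile.log.gz    # decompress; the bundled unpigz does the same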

Using it can really trip you up. If you set out to use it like zip -r, you would expect it to work the same way, and it seemingly does:

pigz --fast -r coding_new.zip coding_new

Naturally, this looks like it would quickly create 'coding_new.zip', a new zip file recursively built from the coding_new directory.

  • The command runs without complaint, tricking you into thinking it is building an archive like 'zip -r'. What it is actually doing is recursively crawling the directory AND the coding_new.zip argument, compressing every file it finds into its own individual .gz file, and leaving those .gz files scattered through the directories it crawls!
  • This can be a disaster, since it litters the whole directory tree. The following script (see also the one-liner after this list) will clean it back up:
#!/bin/bash
# Recursively gunzip every stray .gz file pigz left behind.
# Null-delimited find output keeps filenames with spaces intact.
find . -type f -name "*.gz" -print0 | while IFS= read -r -d '' file; do
    printf 'Unzipping %s\n' "$file"
    gunzip -- "$file"
done
  • The above reverses an inadvertent pigz run, even one that has been repeated multiple times. Save it as unpigz.sh and make it executable with:
chmod +x unpigz.sh
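
If you only need the cleanup once, a find one-liner does the same job without a script, and it is likewise safe for filenames containing spaces (a minimal alternative, not from the original post):

find . -type f -name "*.gz" -exec gunzip -- {} +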

Here is the correct way to get a clean, single .gz file built recursively: pipe tar into pigz. The script below built a 3 GB .gz file from 73,000 files in about 12 seconds on a 24-core Ryzen 9. That's fast!

#!/bin/bash
# Fail the pipeline below if any stage (tar or pigz) fails,
# so the exit-status check at the end is meaningful.
set -o pipefail

# Check if pigz is installed
if ! command -v pigz &> /dev/null; then
    echo "Error: pigz is not installed. Please install it."
    exit 1
fi

# Check if tar is installed
if ! command -v tar &> /dev/null; then
    echo "Error: tar is not installed. Please install it."
    exit 1
fi

# Check if exactly two parameters are provided
if [ $# -ne 2 ]; then
    echo "Usage: $0 <directory_to_compress> <output_tar_gz>"
    exit 1
fi

INPUT_DIR="$1"
OUTPUT_FILE="$2"

# Check if the first parameter is a directory
if [ ! -d "$INPUT_DIR" ]; then
    echo "Error: '$INPUT_DIR' is not a directory."
    exit 1
fi

# Check if the second parameter ends with .gz
if [[ ! "$OUTPUT_FILE" =~ \.gz$ ]]; then
    echo "Error: Output file '$OUTPUT_FILE' must end with .gz"
    exit 1
fi

# Get number of CPU cores for pigz
THREADS=$(nproc)

# Compress the directory using tar and pigz
echo "Compressing '$INPUT_DIR' to '$OUTPUT_FILE' with $THREADS threads..."
tar -c -C "$INPUT_DIR" . | pigz -p "$THREADS" > "$OUTPUT_FILE"

# Check if compression was successful
if [ $? -eq 0 ]; then
    echo "Compression completed successfully. Output: $OUTPUT_FILE"
else
    echo "Error: Compression failed."
    exit 1
fi
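
Save the script as pigzip.sh and make it executable, just as before. On GNU tar, the same pipeline also collapses into a one-liner via --use-compress-program (a sketch assuming GNU tar; the name pigzip.sh matches the timing run below):

tar --use-compress-program="pigz -p $(nproc)" -cf coding.gz -C coding_new .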

Let's time it. This time the script behaves much like zip -r, except that it lights up all the cores:

time ./pigzip.sh coding_new coding.gz
Compressing 'coding_new' to 'coding.gz' with 24 threads...
Compression completed successfully. Output: coding.gz

real	0m15.006s

To get the size of the folder in exact bytes:

du -sb coding_new

Which gives us 6533415246 bytes / 15 s ≈ 435 MB/s, effectively the sequential read speed of the SSD feeding it.
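
The throughput arithmetic can be checked inline with shell arithmetic (the 15-second figure is the wall time from the run above, rounded):

BYTES=$(du -sb coding_new | cut -f1)
echo "$((BYTES / 15 / 1000000)) MB/s"    # 6533415246 / 15 / 1000000 = 435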

That's fast!

Now to unzip it, we can apply one final script that does the reverse:

#!/bin/bash
# Fail the pipeline below if any stage (pigz or tar) fails,
# so the exit-status check at the end is meaningful.
set -o pipefail

# Check if pigz is installed
if ! command -v pigz &> /dev/null; then
    echo "Error: pigz is not installed. Please install it."
    exit 1
fi

# Check if tar is installed
if ! command -v tar &> /dev/null; then
    echo "Error: tar is not installed. Please install it."
    exit 1
fi

# Check if exactly one parameter is provided
if [ $# -ne 1 ]; then
    echo "Usage: $0 <input_tar_gz>"
    exit 1
fi

INPUT_FILE="$1"

# Check if the input file exists
if [ ! -f "$INPUT_FILE" ]; then
    echo "Error: '$INPUT_FILE' does not exist."
    exit 1
fi

# Check if the input file ends with .gz
if [[ ! "$INPUT_FILE" =~ \.gz$ ]]; then
    echo "Error: Input file '$INPUT_FILE' must end with .gz"
    exit 1
fi

# Get number of CPU cores for pigz
THREADS=$(nproc)

# Extract the tar.gz file using pigz and tar
echo "Extracting '$INPUT_FILE' with $THREADS threads..."
pigz -dc -p "$THREADS" "$INPUT_FILE" | tar -x

# Check if extraction was successful
if [ $? -eq 0 ]; then
    echo "Extraction completed successfully."
else
    echo "Error: Extraction failed."
    exit 1
fi
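
As with compression, GNU tar can drive pigz directly if you prefer a one-liner (again a sketch assuming GNU tar, which invokes the named program with -d when extracting):

tar --use-compress-program=pigz -xf coding.gz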

The same process applies. Save the script as unzip.sh, make it executable, and run:

time ./unzip.sh coding.gz

On a fresh 8-core, 16-thread laptop it was blisteringly fast to rebuild the whole tree:

Extracting 'coding.gz' with 16 threads...
Extraction completed successfully.

real	0m21.059s
user	0m19.741s
sys	0m19.262s

Which again is saturating the SSD itself!
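
Before trusting any of these archives, pigz can also verify integrity without extracting anything, mirroring gzip's -t flag:

pigz -t coding.gz && echo "coding.gz is intact"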

Linux Rocks Every Day