Compression Benchmarks for PostgreSQL Archives

By far, my favorite RDBMS is PostgreSQL. Having worked with MySQL, MSSQL, Cassandra, Redis, and more, I find that the versatility of PostgreSQL continues to inspire me to write better and more complex SQL. I have a piece of software that reaches out to various news websites following best practices for crawling: scraping the content listed in each site's sitemap.xml and following the rules set forth by its robots.txt file. The content of these sites can be anything, but the general approach is to collect as much news as possible to see if I can develop a series of heuristics to use as technical indicators.

I've been running the software for about two years now, and the result is a massive PostgreSQL table. Today, I'd like to start making regular backups of the data in that table. This article focuses on benchmarking how long it takes to back up a table using the pg_dump program & Docker.

Standing on the shoulders of giants, I've found a comprehensive review of algorithms used for compression. I'm not concerned with parallel processing; I'll stick to evaluating the programs as provided. What I'm looking to glean is a definitive and consistent measurement of how long it'll take to export and compress information for the massive PostgreSQL table.

Benchmark Setup

Benchmark setup is fairly straightforward. A table called location tracks various metrics about URLs, such as crawl_delay, the domain of the URL, change_freq, and lastmod: properties provided by a site's sitemap.xml. These in turn allow for the development of an algorithm that selects more relevant pages, yet still lets the crawler search for archived content without overburdening the website. The location table will be used to identify the best compression algorithm for the much larger table. The timing approach is sketched below:

timestamp_start=$(date '+%Y-%m-%dT%H:%M:%S')
timestamp_start_interval=$(date -d "$timestamp_start" +"%s")
docker exec $container_id /usr/local/bin/pg_dump -h localhost -U dbuser -t location -a -d dbname > $host_filepath
timestamp_stop=$(date '+%Y-%m-%dT%H:%M:%S')
timestamp_stop_interval=$(date -d "$timestamp_stop" +"%s")
timestamp_diff=$(($timestamp_stop_interval - $timestamp_start_interval))
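
For context, here is a minimal sketch of what the location table described above might look like. The id and url columns and all of the types are assumptions on my part, not the actual schema.

# Hypothetical shape of the location table; real column names and types may differ.
docker exec -i postgresql-datastore psql -U dbuser -d dbname <<'SQL'
CREATE TABLE IF NOT EXISTS location (
    id          BIGSERIAL PRIMARY KEY,
    url         TEXT NOT NULL,
    domain      TEXT NOT NULL,
    crawl_delay INTEGER,
    change_freq TEXT,
    lastmod     TIMESTAMPTZ
);
SQL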

The database runs in Docker, specifically the postgis:15-3.3-alpine image. Port 5432 is exposed to the host, but we won't use it, because connecting to a port exposed to the host appears to route traffic through the LAN. Instead, we'll export data inside the container and compress the output of the docker exec command to a filepath on the host.
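
To make the distinction concrete, here are the two invocation styles. The host-port form is the one being avoided; it requires pg_dump installed on the host and, as noted, appears to route traffic over the LAN. The credentials and table name match the setup above.

# Avoided: connect through the published host port (needs pg_dump on the host).
PGPASSWORD=password pg_dump -h 127.0.0.1 -p 5432 -U dbuser -t location -a -d dbname > location.sql

# Used instead: run pg_dump inside the container and capture its stdout on the host.
docker exec postgresql-datastore /usr/local/bin/pg_dump -h localhost -U dbuser -t location -a -d dbname > location.sql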

How the database has been initialized using Docker:

docker run -p "5432:5432" --name postgresql-datastore -e 'POSTGRES_USER=dbuser' -e 'POSTGRES_PASSWORD=password' -e 'POSTGRES_DB=dbname' -d --shm-size=12g postgis:15-3.3-alpine
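
Before running any benchmarks, a quick sanity check that the container is up and accepting connections can help; pg_isready ships inside the image, so a small wait loop like this is enough.

# Wait until PostgreSQL inside the container reports it is accepting connections.
until docker exec postgresql-datastore pg_isready -U dbuser -d dbname; do
  sleep 1
done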

The compression algorithms to be tested are zstd, gzip, bzip2, lz4, and xz.
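
Each compressor runs on the host side of the pipe, so it's worth confirming they're all installed before kicking off a long run; a small check along these lines works.

# Fail fast if any of the compression tools are missing from the host.
for cmd in zstd gzip bzip2 lz4 xz; do
  command -v "$cmd" >/dev/null || { echo "missing: $cmd" >&2; exit 1; }
done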

Here is the full script to run the benchmark:

#!/usr/bin/env bash

set -e

TABLES="location"

timestamp=$(date '+%Y-%m-%dT%H:%M:%S')
dump_dir=/datastore/backups

for table in $TABLES; do
  echo "ZSTD Benchmark"
  filepath=$dump_dir/$table.$timestamp.sql-data.zstd
  echo "Dumping table '$table' to $filepath"
  container_id=$(docker ps|grep datastore|awk '{print $1}')
  timestamp_start=$(date '+%Y-%m-%dT%H:%M:%S')
  timestamp_start_interval=$(date -d "$timestamp_start" +"%s")
  docker exec $container_id /usr/local/bin/pg_dump -h localhost -U dbuser -t $table -a -d dbname | \
    zstd -T2 > $filepath
  timestamp_stop=$(date '+%Y-%m-%dT%H:%M:%S')
  timestamp_stop_interval=$(date -d "$timestamp_stop" +"%s")
  timestamp_diff=$(($timestamp_stop_interval - $timestamp_start_interval))
  filesize=$(du -sh $filepath|awk '{print $1}')
  echo "Benchmark for Table: $table, Compression: ztsd. Filesize: $filesize. Took $timestamp_diff seconds"

  echo "GZIP Benchmark"
  filepath=$dump_dir/$table.$timestamp.sql-data.gzip
  echo "Dumping table '$table' to $filepath"
  container_id=$(docker ps|grep datastore|awk '{print $1}')
  timestamp_start=$(date '+%Y-%m-%dT%H:%M:%S')
  timestamp_start_interval=$(date -d "$timestamp_start" +"%s")
  docker exec $container_id /usr/local/bin/pg_dump -h localhost -U dbuser -t $table -a -d dbname | \
    gzip -c > $filepath
  timestamp_stop=$(date '+%Y-%m-%dT%H:%M:%S')
  timestamp_stop_interval=$(date -d "$timestamp_stop" +"%s")
  timestamp_diff=$(($timestamp_stop_interval - $timestamp_start_interval))
  filesize=$(du -sh $filepath|awk '{print $1}')
  echo "Benchmark for Table: $table, Compression: gzip. Filesize: $filesize, Took $timestamp_diff seconds"

  echo "bzip2 Benchmark"
  filepath=$dump_dir/$table.$timestamp.sql-data.bzip2
  echo "Dumping table '$table' to $filepath"
  container_id=$(docker ps|grep datastore|awk '{print $1}')
  timestamp_start=$(date '+%Y-%m-%dT%H:%M:%S')
  timestamp_start_interval=$(date -d "$timestamp_start" +"%s")
  docker exec $container_id /usr/local/bin/pg_dump -h localhost -U dbuser -t $table -a -d dbname | \
    bzip2 -cz > $filepath
  timestamp_stop=$(date '+%Y-%m-%dT%H:%M:%S')
  timestamp_stop_interval=$(date -d "$timestamp_stop" +"%s")
  timestamp_diff=$(($timestamp_stop_interval - $timestamp_start_interval))
  filesize=$(du -sh $filepath|awk '{print $1}')
  echo "Benchmark for Table: $table, Compression: bzip2. Filesize: $filesize. Took $timestamp_diff seconds"

  echo "lz4 Benchmark"
  filepath=$dump_dir/$table.$timestamp.sql-data.lz4
  echo "Dumping table '$table' to $filepath"
  container_id=$(docker ps|grep datastore|awk '{print $1}')
  timestamp_start=$(date '+%Y-%m-%dT%H:%M:%S')
  timestamp_start_interval=$(date -d "$timestamp_start" +"%s")
  docker exec $container_id /usr/local/bin/pg_dump -h localhost -U dbuser -t $table -a -d dbname | \
    lz4 > $filepath
  timestamp_stop=$(date '+%Y-%m-%dT%H:%M:%S')
  timestamp_stop_interval=$(date -d "$timestamp_stop" +"%s")
  timestamp_diff=$(($timestamp_stop_interval - $timestamp_start_interval))
  filesize=$(du -sh $filepath|awk '{print $1}')
  echo "Benchmark for Table: $table, Compression: lz4. Filesize: $filesize. Took $timestamp_diff seconds"

  echo "xz Benchmark"
  filepath=$dump_dir/$table.$timestamp.sql-data.xz
  echo "Dumping table '$table' to $filepath"
  container_id=$(docker ps|grep datastore|awk '{print $1}')
  timestamp_start=$(date '+%Y-%m-%dT%H:%M:%S')
  timestamp_start_interval=$(date -d "$timestamp_start" +"%s")
  docker exec $container_id /usr/local/bin/pg_dump -h localhost -U dbuser -t $table -a -d dbname | \
    xz -cz > $filepath
  timestamp_stop=$(date '+%Y-%m-%dT%H:%M:%S')
  timestamp_stop_interval=$(date -d "$timestamp_stop" +"%s")
  timestamp_diff=$(($timestamp_stop_interval - $timestamp_start_interval))
  filesize=$(du -sh $filepath|awk '{print $1}')
  echo "Benchmark for Table: $table, Compression: xz. Filesize: $filesize. Took $timestamp_diff seconds"
done

Benchmark Results

Benchmark for Table: location, Compression: zstd. Filesize: 1.6G. Took 88 seconds
Benchmark for Table: location, Compression: gzip. Filesize: 1.7G. Took 200 seconds
Benchmark for Table: location, Compression: bzip2. Filesize: 1.3G. Took 1144 seconds
Benchmark for Table: location, Compression: lz4. Filesize: 2.5G. Took 88 seconds
Benchmark for Table: location, Compression: xz. Filesize: 1.3G. Took 2662 seconds

Understanding the Benchmark Results

A core trade-off when selecting a compression algorithm is the time taken relative to the compression ratio: the size of the file before compression divided by its size after compression.

$$ CompressionRatio = {UncompressedSize \over {CompressedSize}} $$

I'm more interested in the backup running quickly and am willing to accept slightly larger archive files. bzip2 and xz both produced the smallest files at 1.3GB, but the archives took roughly 19 and 44 minutes respectively. zstd took about a minute and a half and produced a slightly larger file at 1.6GB.

Data Stream

I'm not archiving files; I'm archiving a stream coming from a docker exec command, and that has additional overhead to consider. Rather than trying to quantify that overhead, I'll produce a metric for how many MB/s of information is being transmitted from the docker exec command.

Piping the same stream through dd for about a minute will tell us the maximum TransmissionRate of data being sent to each compression algorithm. Interprocess communication over a pipe is synchronous: the writer blocks once the pipe buffer fills. Therefore, the TransmissionRate will only be as fast as the compression algorithm can consume and compress the information.
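
The runs below were simply cut off after roughly a minute. One way to script that cutoff (an assumption on my part, not necessarily how the numbers below were produced) is to signal dd via timeout; GNU coreutils dd prints its statistics and exits when it receives SIGINT. The same pattern applies to each compressor.

container_id=$(docker ps|grep datastore|awk '{print $1}')
# Copy the stream for ~60 seconds, then interrupt dd so it reports bytes copied and MB/s.
docker exec $container_id /usr/local/bin/pg_dump -h localhost -U dbuser -t location -a -d dbname | \
  timeout -s INT 60 dd | zstd -T2 > /dev/null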

$ docker exec container_id /usr/local/bin/pg_dump -h localhost -U dbuser -t location -a -d dbname|dd| zstd -T2 > /dev/null
6886619648 bytes (6.9 GB, 6.4 GiB) copied, 65.8708 s, 105 MB/s

$ docker exec container_id /usr/local/bin/pg_dump -h localhost -U dbuser -t location -a -d dbname|dd| gzip -c > /dev/null
2551970816 bytes (2.6 GB, 2.4 GiB) copied, 61.7171 s, 41.3 MB/s

$ docker exec container_id /usr/local/bin/pg_dump -h localhost -U dbuser -t location -a -d dbname|dd| bzip2 -cz > /dev/null
422925824 bytes (423 MB, 403 MiB) copied, 63.8073 s, 6.6 MB/s

$ docker exec container_id /usr/local/bin/pg_dump -h localhost -U dbuser -t location -a -d dbname|dd| lz4 > /dev/null
6205816832 bytes (6.2 GB, 5.8 GiB) copied, 61.3169 s, 101 MB/s

$ docker exec container_id /usr/local/bin/pg_dump -h localhost -U dbuser -t location -a -d dbname|dd| xz -cz > /dev/null
704544768 bytes (705 MB, 672 MiB) copied, 61.2334 s, 3.0 MB/s

Calculate the UncompressedSize of each archive.

$$ UncompressedSize = Seconds \times \text{Average MB/s} $$
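
For example, plugging in the zstd run from above (88 seconds at an average of 105 MB/s):

$$ UncompressedSize_{zstd} = 88 \times 105.0 = 9240 \text{ MB} $$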

Calculate the CompressionRatio of each archive.

$$ CompressionRatio = {UncompressedSize \over {CompressedSize}} $$
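
Assuming bc is available on the host, a small helper reproduces the ratios shown in the table below from the seconds, average MB/s, and compressed sizes (in MB) measured above.

# ratio = (seconds * average MB/s) / compressed size in MB
compression_ratio() {
  local seconds=$1 rate_mbs=$2 compressed_mb=$3
  echo "scale=4; ($seconds * $rate_mbs) / $compressed_mb" | bc
}

compression_ratio 88 105.0 1600    # zstd  -> 5.7750
compression_ratio 200 41.3 1700    # gzip  -> 4.8588
compression_ratio 1144 6.6 1300    # bzip2 -> 5.8080
compression_ratio 88 101.0 2500    # lz4   -> 3.5552
compression_ratio 2662 3.0 1300    # xz    -> 6.1430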

Algorithm | Math | Ratio | Time
zstd | $$ {88 \times 105.0 \over 1600} $$ | 5.775 | 1 minute 28 seconds
gzip | $$ {200 \times 41.3 \over 1700} $$ | 4.8588 | 3 minutes 20 seconds
bzip2 | $$ {1144 \times 6.6 \over 1300} $$ | 5.808 | 19 minutes 4 seconds
lz4 | $$ {88 \times 101.0 \over 2500} $$ | 3.5552 | 1 minute 28 seconds
xz | $$ {2662 \times 3.0 \over 1300} $$ | 6.143 | 44 minutes 22 seconds

Compression ratios can vary based on a multitude of factors, most notably the amount of repeated information in the data being archived. For instance, text data often repeats while random data does not, so we'll achieve higher compression ratios on archives of text data (such as a PostgreSQL dump).
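
A quick way to see this effect is to compress highly repetitive text versus random bytes of the same size; the 50MB figure here is purely illustrative, not something measured for this article.

# Repetitive text compresses dramatically; random data barely compresses at all.
yes 'https://example.com/news/article' | head -c 50M | zstd -T2 | wc -c
head -c 50M /dev/urandom | zstd -T2 | wc -c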

Good Enough

The archival software can only produce information at a certain rate, and compression software is generally benchmarked on how quickly an archive can be generated from a file descriptor or streaming input. Rather than re-creating those test results, I went with what I'd seen in initial iterations: something quick that could be refactored at a later time to be more robust. I've also omitted decompression metrics, as the goal here was to identify how quickly the archive can be created, and I purposefully left out considerations of file-system performance.
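
As a sketch of what that later refactor might look like, the five near-identical blocks in the benchmark script could collapse into a single function driven by a list of compressor commands. The paths, credentials, and table name are the same assumptions as in the original script.

#!/usr/bin/env bash
set -euo pipefail

TABLES="location"
dump_dir=/datastore/backups
container_id=$(docker ps|grep datastore|awk '{print $1}')

# Dump a table through the given compressor command and report size and elapsed time.
benchmark() {
  local table=$1 name=$2; shift 2
  local filepath=$dump_dir/$table.$(date '+%Y-%m-%dT%H:%M:%S').sql-data.$name
  local start=$(date +%s)
  docker exec "$container_id" /usr/local/bin/pg_dump -h localhost -U dbuser -t "$table" -a -d dbname | \
    "$@" > "$filepath"
  local stop=$(date +%s)
  echo "Benchmark for Table: $table, Compression: $name. Filesize: $(du -sh "$filepath"|awk '{print $1}'). Took $((stop - start)) seconds"
}

for table in $TABLES; do
  benchmark "$table" zstd  zstd -T2
  benchmark "$table" gzip  gzip -c
  benchmark "$table" bzip2 bzip2 -cz
  benchmark "$table" lz4   lz4
  benchmark "$table" xz    xz -cz
done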

Article References