Benchmarking SnapFS

This page is for beta users who want to measure local scan performance before choosing scanner settings for a host.

The goal is not to produce one universal score. The goal is to compare:

one host against itself
small-file behavior against large-file behavior
one hash algorithm against another
cold-path runs against warmed or repeated runs

When To Benchmark

Benchmarking is especially useful when:

scan performance matters for a production agent host
you are deciding whether to install xxhash and use xxh64
you want to compare worker counts on a real dataset
your storage path may be the bottleneck and you want to confirm that

Install Benchmark Support

At minimum, install SnapFS:

python3 -m pip install -U snapfs

If performance matters on that host, prefer installing with xxhash support so you can benchmark xxh64 as well:

python3 -m pip install -U 'snapfs[xxhash]'

xxh64 is often much faster than the SHA-based algorithms on many-small-file, warm-cache, or CPU-limited scan workloads.
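To get a rough feel for raw hash cost on a host before running a full scan, an in-memory micro-benchmark can help. The sketch below is not part of SnapFS; the function names are hypothetical, and it only measures hashing of a buffer already in memory, so it says nothing about storage or per-file overhead. It assumes the xxhash Python package exposes xxh64 with the usual update/digest interface and skips that algorithm when the package is missing.

```python
import hashlib
import time

try:
    import xxhash  # optional; installed via the `xxhash` package
except ImportError:
    xxhash = None


def hash_throughput_mib_s(make_hasher, payload, rounds=10):
    """Return MiB/s for feeding `payload` into one hasher `rounds` times."""
    hasher = make_hasher()
    start = time.perf_counter()
    for _ in range(rounds):
        hasher.update(payload)
    hasher.digest()
    # Guard against a zero interval on very small payloads.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return (len(payload) * rounds) / (1024 * 1024) / elapsed


def available_hashers():
    """Map algorithm names to hasher factories available on this host."""
    hashers = {"sha1": hashlib.sha1, "sha256": hashlib.sha256}
    if xxhash is not None:
        hashers["xxh64"] = xxhash.xxh64
    return hashers


if __name__ == "__main__":
    payload = bytes(8 * 1024 * 1024)  # 8 MiB of zeros held in memory
    for name, factory in available_hashers().items():
        rate = hash_throughput_mib_s(factory, payload)
        print(f"{name:7s} {rate:10.1f} MiB/s")
```

If the numbers for xxh64 and sha1 are far apart here but close in a real scan, that is a hint the scan is limited by something other than hashing.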

Benchmark Shape

For useful results, benchmark at least two dataset shapes:

a small-file tree with many files
a large-file tree with relatively few files

Those two shapes often behave very differently.

Small-file runs tend to emphasize:

per-file overhead
metadata walking cost
scheduling overhead
hash CPU cost

Large-file runs tend to emphasize:

storage throughput
read-ahead behavior
whether the scan is I/O-bound instead of hash-bound
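If no representative dataset is at hand, synthetic trees for both shapes can be generated. This is a minimal sketch, not a SnapFS tool: make_tree, the directory layout, and the sizes in the usage comment are all hypothetical. Random bytes are used so that compressing or deduplicating filesystems do not flatter the result.

```python
import os
from pathlib import Path


def make_tree(root, file_count, file_size, chunk_size=1024 * 1024):
    """Create `file_count` files of `file_size` random bytes under `root`."""
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    for index in range(file_count):
        # Spread files over 16 subdirectories so the walk has some depth.
        subdir = root / f"dir{index % 16:02d}"
        subdir.mkdir(exist_ok=True)
        remaining = file_size
        with open(subdir / f"file{index:06d}.bin", "wb") as handle:
            # Write in chunks so large files do not need to fit in memory.
            while remaining > 0:
                step = min(chunk_size, remaining)
                handle.write(os.urandom(step))
                remaining -= step
    return root


# Example usage (hypothetical sizes; scale them to resemble real data):
# make_tree("/tmp/snapfs-bench/small-files", file_count=5000, file_size=16 * 1024)
# make_tree("/tmp/snapfs-bench/large-files", file_count=4, file_size=256 * 1024 * 1024)
```

Freshly written files are likely to sit in the page cache, so the first scan after generating a tree should be treated as a warmed run.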

Benchmark Tools

The public package includes the snapfs CLI, but the local benchmark helpers currently live only in the public repository:

https://github.com/snapfsio/snapfs

Clone the repo when you want to use the bundled benchmark scripts:

git clone --depth 1 https://github.com/snapfsio/snapfs
cd snapfs

The repo includes:

scripts/bench_scan.py for direct local scan-engine benchmarks
scripts/example_benchmark_matrix.json as a sample matrix to copy and edit
scripts/run_benchmarks.py to expand and execute a local benchmark suite

Quick Commands

Benchmark a single dataset directly:

python3 scripts/bench_scan.py /path/to/tree --force --workers 4 --algo sha1
python3 scripts/bench_scan.py /path/to/tree --force --workers 4 --algo xxh64
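To compare worker counts as well as algorithms, the same command can be repeated over a small grid. The sketch below only builds and runs command lines using the flags shown above; the helper names and the worker counts are illustrative assumptions, and the bundled matrix runner is likely the better fit for anything beyond a quick sweep.

```python
import itertools
import subprocess
import sys


def bench_commands(tree, algos=("sha1", "xxh64"), worker_counts=(1, 2, 4, 8)):
    """Build one bench_scan.py command per (algo, workers) combination."""
    return [
        [sys.executable, "scripts/bench_scan.py", str(tree),
         "--force", "--workers", str(workers), "--algo", algo]
        for algo, workers in itertools.product(algos, worker_counts)
    ]


def run_all(commands):
    """Run each command in order, stopping on the first failure."""
    for command in commands:
        print("+", " ".join(command))
        subprocess.run(command, check=True)


# Example usage, run from the root of the cloned repo:
# run_all(bench_commands("/path/to/tree"))
```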

Run the matrix runner:

mkdir -p /tmp/snapfs-bench
cp scripts/example_benchmark_matrix.json /tmp/snapfs-bench/benchmark_matrix.json
$EDITOR /tmp/snapfs-bench/benchmark_matrix.json
cd /tmp/snapfs-bench
python3 /path/to/snapfs/scripts/run_benchmarks.py

The matrix runner looks for ./benchmark_matrix.json in the current working directory by default.

Example Results

Representative example:

Dataset	Tool	Mode	Algo	Workers	Files	Bytes	Elapsed s	MiB/s	Files/s	Repeat	Notes
small-files	snapfs	force	sha1	4	717	0.69 GiB	0.813	872.5	881.827	2	warmed run
small-files	snapfs	force	xxh64	4	717	0.69 GiB	0.319	2223.5	2249.476	1	warmed run
large-files	snapfs	force	sha1	4	129	27.20 GiB	258.104	107.9	0.500	2	storage-limited
large-files	snapfs	force	xxh64	4	129	27.20 GiB	256.301	108.7	0.503	3	storage-limited

This is a common pattern:

xxh64 is clearly faster on the small-file workload
the large-file workload stays at nearly the same MiB/s with either algorithm

That usually means:

small-file performance is meaningfully affected by hash cost
large-file performance is limited by the storage path, not the hash algorithm
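The size of the gap is easier to judge as a ratio of elapsed times. A small worked calculation using the elapsed seconds from the example table (the speedup helper is only for illustration):

```python
def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate finished than the baseline."""
    return baseline_seconds / candidate_seconds


# Elapsed seconds copied from the example table: sha1 first, xxh64 second.
small = speedup(baseline_seconds=0.813, candidate_seconds=0.319)
large = speedup(baseline_seconds=258.104, candidate_seconds=256.301)

print(f"small-files: xxh64 finished {small:.2f}x faster than sha1")  # 2.55x
print(f"large-files: xxh64 finished {large:.2f}x faster than sha1")  # 1.01x
```

A ratio near 1.0 on the large-file tree is what a storage-limited scan looks like: changing the hash barely moves the elapsed time.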

How To Choose A Hash Algorithm

Current SnapFS options:

sha1: the current default
xxh64: available when xxhash is installed
sha256: supported when you want a SHA-256 hash specifically

Practical guidance:

use sha1 if you want the current default behavior and have not benchmarked yet
try xxh64 first when host throughput matters and xxhash is available
use sha256 when you specifically want SHA-256, even though it may cost more CPU

Interpreting Results

A few patterns to watch for:

if xxh64 is much faster than sha1 or sha256, hashing cost matters on that host
if xxh64 and SHA-based runs are almost identical on large files, the storage path is probably the bottleneck
if the first run is much slower than later runs, page cache or remote-storage warmup is probably affecting the result
if raising the worker count helps small-file runs only a little, the scan is probably dominated by metadata or scheduling overhead rather than hashing
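The cold-versus-warm pattern can be checked numerically when several repeats of the same run are available. This is a rough heuristic sketch, not a SnapFS feature: the function names are hypothetical and the 1.5x threshold is an arbitrary illustration that should be adjusted per host.

```python
import statistics


def warmup_ratio(elapsed_by_repeat):
    """Ratio of the first run's elapsed time to the median of later runs."""
    if len(elapsed_by_repeat) < 2:
        raise ValueError("need at least two repeats to compare")
    first, later = elapsed_by_repeat[0], elapsed_by_repeat[1:]
    return first / statistics.median(later)


def looks_cold(elapsed_by_repeat, threshold=1.5):
    """Flag a first run that is much slower than the repeats."""
    # The 1.5x default is an arbitrary illustration, not a SnapFS rule.
    return warmup_ratio(elapsed_by_repeat) >= threshold


# Example usage with made-up timings in seconds:
# looks_cold([4.0, 1.0, 1.2, 1.1])
```

When the first run is flagged, it is usually worth reporting cold and warmed numbers separately instead of averaging them.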

Be careful with:

NFS
SMB/CIFS
FUSE mounts
overlay or merger filesystems

On those, first-read results can be much less stable than warmed runs.
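On Linux, one way to check which kind of filesystem backs a benchmark path is to read /proc/mounts. The sketch below is a hedged, Linux-only illustration with hypothetical helper names; it does not handle escaped characters in mount points, and the prefix list is an assumption about common type names (for example nfs4, cifs, fuse.sshfs, overlay) that may not cover every setup.

```python
import os

# Assumed type-name prefixes for network, FUSE, and overlay filesystems.
NETWORK_OR_LAYERED = ("nfs", "cifs", "smb", "fuse", "overlay")


def parse_mounts(mounts_text):
    """Return (mount_point, fs_type) pairs from /proc/mounts-style text."""
    pairs = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            pairs.append((fields[1], fields[2]))
    return pairs


def fs_type_for(path, mounts):
    """Pick the fs type of the longest mount point that contains `path`."""
    path = os.path.abspath(path)
    best_point, best_type = "", None
    for point, fs_type in mounts:
        prefix = point.rstrip("/") + "/"
        # Later entries win ties because they shadow earlier mounts.
        if (path + "/").startswith(prefix) and len(point) >= len(best_point):
            best_point, best_type = point, fs_type
    return best_type


def needs_care(fs_type):
    """True when the type name matches one of the assumed prefixes."""
    return fs_type is not None and fs_type.startswith(NETWORK_OR_LAYERED)


# Example usage on a Linux host:
# with open("/proc/mounts") as handle:
#     mounts = parse_mounts(handle.read())
# print(fs_type_for("/path/to/tree", mounts))
```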

Next Step

After benchmarking, you can keep the current default settings or tune a host more aggressively:

install xxhash and use xxh64
increase worker count
re-run the systemd installer with the same scanner name to update an existing agent

For the agent install flow, continue with Installing an Agent.