Benchmarking SnapFS
This page is for beta users who want to measure local scan performance before choosing scanner settings for a host.
The goal is not to produce one universal score. The goal is to compare:
- one host against itself
- small-file behavior against large-file behavior
- one hash algorithm against another
- cold-path runs against warmed or repeated runs
When To Benchmark
Benchmarking is especially useful when:
- scan performance matters for a production agent host
- you are deciding whether to install xxhash and use xxh64
- you want to compare worker counts on a real dataset
- your storage path may be the bottleneck and you want to confirm that
Install Benchmark Support
At minimum, install SnapFS:
python3 -m pip install -U snapfs
If performance matters on that host, prefer installing with xxhash support so
you can benchmark xxh64 as well:
python3 -m pip install -U 'snapfs[xxhash]'
xxh64 is often much faster than the SHA-based algorithms on many-small-file,
warm-cache, or CPU-limited scan workloads.
Benchmark Shape
For useful results, benchmark at least two dataset shapes:
- a small-file tree with many files
- a large-file tree with relatively few files
Those two shapes often behave very differently.
Small-file runs tend to emphasize:
- per-file overhead
- metadata walking cost
- scheduling overhead
- hash CPU cost
Large-file runs tend to emphasize:
- storage throughput
- read-ahead behavior
- whether the scan is I/O-bound instead of hash-bound
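If you do not have two suitable real datasets on hand, a synthetic pair can stand in for a first pass. The following is a minimal sketch, not part of SnapFS: the function name make_tree, the directory layout, and the file counts and sizes are all illustrative assumptions, and real data is still the better benchmark target.

```python
import os
import tempfile
from pathlib import Path


def make_tree(root: Path, file_count: int, file_size: int, files_per_dir: int = 100) -> int:
    """Create file_count files of file_size bytes under root; return total bytes written."""
    total = 0
    for i in range(file_count):
        # Spread files across subdirectories so directory walking is exercised too.
        subdir = root / f"dir_{i // files_per_dir:04d}"
        subdir.mkdir(parents=True, exist_ok=True)
        # Random bytes avoid flattering results on compressing or deduplicating storage.
        (subdir / f"file_{i:06d}.bin").write_bytes(os.urandom(file_size))
        total += file_size
    return total


if __name__ == "__main__":
    base = Path(tempfile.mkdtemp(prefix="snapfs-bench-data-"))
    # Placeholder sizes (assumptions); scale them up for a meaningful run.
    small = make_tree(base / "small-files", file_count=500, file_size=4 * 1024)
    large = make_tree(base / "large-files", file_count=2, file_size=8 * 1024 * 1024)
    print(f"small-files: {small} bytes under {base / 'small-files'}")
    print(f"large-files: {large} bytes under {base / 'large-files'}")
```

Freshly written files are likely to be in the page cache already, so treat the first scan of a synthetic tree as a warmed run.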
Benchmark Tools
The public package includes the snapfs CLI, but the local benchmark helper
currently lives in the public repository:
https://github.com/snapfsio/snapfs
Clone the repo when you want to use the bundled benchmark scripts:
git clone --depth 1 https://github.com/snapfsio/snapfs
cd snapfs
The repo includes:
- scripts/bench_scan.py for direct local scan-engine benchmarks
- scripts/example_benchmark_matrix.json as a sample matrix to copy and edit
- scripts/run_benchmarks.py to expand and execute a local benchmark suite
Quick Commands
Benchmark a single dataset directly:
python3 scripts/bench_scan.py /path/to/tree --force --workers 4 --algo sha1
python3 scripts/bench_scan.py /path/to/tree --force --workers 4 --algo xxh64
Run the matrix runner:
mkdir -p /tmp/snapfs-bench
cp scripts/example_benchmark_matrix.json /tmp/snapfs-bench/benchmark_matrix.json
$EDITOR /tmp/snapfs-bench/benchmark_matrix.json
cd /tmp/snapfs-bench
python3 /path/to/snapfs/scripts/run_benchmarks.py
The matrix runner looks for ./benchmark_matrix.json in the current working
directory by default.
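For a quick sweep over worker counts and algorithms without editing a matrix file, the direct command can be generated in a loop. This is a minimal sketch that only builds and prints command lines, using the bench_scan.py flags shown above (--force, --workers, --algo). The helper name build_commands and the worker counts 1, 2, 4, 8 are assumptions chosen for illustration.

```python
import itertools
import shlex
import sys


def build_commands(tree, workers_list, algos, script="scripts/bench_scan.py"):
    """Return one bench_scan.py argument list per (workers, algo) combination."""
    commands = []
    for workers, algo in itertools.product(workers_list, algos):
        commands.append([
            sys.executable, script, tree,
            "--force", "--workers", str(workers), "--algo", algo,
        ])
    return commands


if __name__ == "__main__":
    # Print the commands only. To execute one, pass the list to
    # subprocess.run(cmd, check=True) from inside the cloned repository.
    for cmd in build_commands("/path/to/tree", [1, 2, 4, 8], ["sha1", "xxh64"]):
        print(shlex.join(cmd))
```

Include xxh64 in the sweep only on hosts where xxhash is installed.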
Example Results
Representative example:
| Dataset | Tool | Mode | Algo | Workers | Files | Bytes | Elapsed s | MiB/s | Files/s | Repeat | Notes |
|---|---|---|---|---|---|---|---|---|---|---|---|
| small-files | snapfs | force | sha1 | 4 | 717 | 0.69 GiB | 0.813 | 872.5 | 881.827 | 2 | warmed run |
| small-files | snapfs | force | xxh64 | 4 | 717 | 0.69 GiB | 0.319 | 2223.5 | 2249.476 | 1 | warmed run |
| large-files | snapfs | force | sha1 | 4 | 129 | 27.20 GiB | 258.104 | 107.9 | 0.500 | 2 | storage-limited |
| large-files | snapfs | force | xxh64 | 4 | 129 | 27.20 GiB | 256.301 | 108.7 | 0.503 | 3 | storage-limited |
This is a common pattern:
- xxh64 is clearly faster on the small-file workload
- the large-file workload stays near the same MiB/s
That usually means:
- small-file performance is meaningfully affected by hash cost
- large-file performance is limited by the storage path, not the hash algorithm
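The comparison above comes down to simple arithmetic on elapsed time. As a small sketch, the helpers below (hypothetical names, not part of SnapFS) derive MiB/s and files/s for a run and the speedup of one run over another, using the elapsed times from the example table.

```python
MIB = 1024 * 1024


def throughput(total_bytes, file_count, elapsed_s):
    """Return (MiB/s, files/s) for one run."""
    return total_bytes / MIB / elapsed_s, file_count / elapsed_s


def speedup(baseline_elapsed_s, candidate_elapsed_s):
    """How many times faster the candidate run finished than the baseline."""
    return baseline_elapsed_s / candidate_elapsed_s


if __name__ == "__main__":
    # Elapsed seconds copied from the example table, as (sha1, xxh64).
    runs = {"small-files": (0.813, 0.319), "large-files": (258.104, 256.301)}
    for name, (sha1_s, xxh64_s) in runs.items():
        print(f"{name}: xxh64 speedup over sha1 = {speedup(sha1_s, xxh64_s):.2f}x")
```

With the table's numbers this gives roughly 2.55x on small files and about 1.01x on large files, which is the hash-bound versus storage-bound split described above.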
How To Choose A Hash Algorithm
Current SnapFS options:
- sha1: the current default
- xxh64: available when xxhash is installed
- sha256: supported when you want a SHA-256 hash specifically
Practical guidance:
- use sha1 if you want the current default behavior and have not benchmarked yet
- try xxh64 first when host throughput matters and xxhash is available
- use sha256 when you specifically want SHA-256, even if it may cost more CPU
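To get a feel for raw hash cost on a host before running a full scan, you can time the algorithms on an in-memory buffer. This is a rough sketch that does not use SnapFS at all: it relies on Python's hashlib and, if present, the xxhash package from PyPI (assumed here to be what the snapfs[xxhash] extra installs). The function names and the 32 MiB buffer size are illustrative assumptions.

```python
import hashlib
import os
import time

try:
    import xxhash  # optional third-party package
except ImportError:
    xxhash = None


def available_hashers():
    """Map algorithm name to a constructor; xxh64 only if xxhash imports."""
    hashers = {"sha1": hashlib.sha1, "sha256": hashlib.sha256}
    if xxhash is not None:
        hashers["xxh64"] = xxhash.xxh64
    return hashers


def hash_rate_mib_s(factory, data, chunk_size=1024 * 1024):
    """Hash data chunk by chunk and return throughput in MiB/s."""
    # Slice before starting the clock so copying is not counted as hash time.
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    hasher = factory()
    start = time.perf_counter()
    for chunk in chunks:
        hasher.update(chunk)
    hasher.hexdigest()
    elapsed = max(time.perf_counter() - start, 1e-9)
    return len(data) / (1024 * 1024) / elapsed


if __name__ == "__main__":
    payload = os.urandom(32 * 1024 * 1024)
    for name, factory in available_hashers().items():
        print(f"{name}: {hash_rate_mib_s(factory, payload):.0f} MiB/s")
```

These numbers are an upper bound on hashing speed only. A real scan also pays for file opens, reads, and metadata walking, so confirm any choice with a scan benchmark on the actual dataset.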
Interpreting Results
A few patterns to watch for:
- if xxh64 is much faster than sha1 or sha256, hashing cost matters on that host
- if xxh64 and SHA-based runs are almost identical on large files, the storage path is probably the bottleneck
- if the first run is much slower than later runs, page cache or remote-storage warmup is probably affecting the result
- if worker count helps small files only a little, you may be dominated by metadata or scheduling overhead instead of hashing
Be careful with:
- NFS
- SMB/CIFS
- FUSE mounts
- overlay or merger filesystems
Those can make first-read results much less stable than warmed runs.
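One way to see how much warmup affects a given storage path is to read the same file several times and compare the first pass with later ones. The sketch below is independent of SnapFS, and its function names are hypothetical. It does not drop any caches, so the first pass is only cold if the file has not been read recently.

```python
import sys
import time


def timed_reads(path, repeats=3, chunk_size=1024 * 1024):
    """Read path end to end `repeats` times; return elapsed seconds per pass."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        with open(path, "rb") as handle:
            while handle.read(chunk_size):
                pass
        timings.append(time.perf_counter() - start)
    return timings


def warmup_ratio(timings):
    """First-pass time divided by the fastest later pass (needs >= 2 passes)."""
    fastest_later = max(min(timings[1:]), 1e-9)
    return timings[0] / fastest_later


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python3 warmup_check.py /path/to/large/file
    passes = timed_reads(sys.argv[1])
    print("seconds per pass:", [round(t, 3) for t in passes])
    print(f"first pass vs fastest later pass: {warmup_ratio(passes):.1f}x")
```

A ratio well above 1 suggests that later scan runs on that path are being served from cache, so compare warmed runs with warmed runs and first runs with first runs.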
Next Step
After benchmarking, you can keep the current default settings or tune a host more aggressively:
- install xxhash and use xxh64
- increase worker count
- re-run the systemd installer with the same scanner name to update an existing agent
For the agent install flow, continue with Installing an Agent.