How Merkle-tree dedup actually works in pinemere

The chunking strategy, the hash tree, and why fixed-size blocks were a dead end.


When I describe pinemere as using "Merkle-tree deduplication," I get two kinds of reactions. People familiar with git or IPFS nod and assume it works roughly the same way (it doesn't, quite). People who haven't encountered the term before think it sounds more complex than it is. This post is for both groups.

The problem with whole-file hashing

The simplest dedup strategy is to hash entire files. If two files have the same hash, they're identical — store one, point both paths at it. Rsync does something similar with its checksum mode.
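A minimal sketch of the whole-file approach (the store/index names are illustrative, not pinemere's actual layout, and SHA-256 from the standard library stands in for the hash):

    import hashlib
    from pathlib import Path

    store: dict[str, bytes] = {}   # content hash -> one stored copy of the bytes
    index: dict[Path, str] = {}    # file path -> content hash

    def add_file(path: Path) -> None:
        data = path.read_bytes()
        digest = hashlib.sha256(data).hexdigest()
        store.setdefault(digest, data)   # identical files are stored once
        index[path] = digest             # every duplicate path points at that copy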

This breaks down when files are almost identical. Append one byte to a 2 GB video file and the whole-file hash changes. Now you need to transfer 2 GB again, even though 99.99999% of the content is unchanged. Worse, if you have two copies of that video with different metadata headers (different container, same frames), whole-file hashing sees them as completely different.

Content-defined chunking

The fix is to split files into smaller pieces — blocks — and hash each block independently. If two files share a region of identical bytes, the blocks covering that region will have the same hash, and pinemere only stores/transfers the block once.

But how do you decide where to split? Fixed-size splitting (every 64 KB) has a fatal flaw: insert a single byte at the beginning of a file, and every block boundary shifts by one byte. Every block gets a new hash. You've defeated the purpose of block-level dedup.
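A toy demonstration of the boundary-shift problem, using 4-byte blocks so the effect is visible at a glance:

    import hashlib

    def fixed_chunks(data: bytes, size: int = 4) -> list[str]:
        # split at absolute offsets 0, size, 2*size, ... and hash each piece
        return [hashlib.sha256(data[i:i + size]).hexdigest()[:8]
                for i in range(0, len(data), size)]

    original = b"abcdefghijkl"
    edited = b"X" + original     # insert a single byte at the front

    print(fixed_chunks(original))   # hashes of abcd, efgh, ijkl
    print(fixed_chunks(edited))     # hashes of Xabc, defg, hijk, l: zero overlap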

Pinemere uses Rabin fingerprinting for content-defined chunking. The idea: slide a window across the file's bytes and compute a rolling hash. When the hash meets a certain condition (e.g., the low 16 bits are all zero), declare a block boundary. Because the boundary depends on the content around it (not on absolute position), inserting bytes only affects boundaries near the insertion point. Blocks far from the edit retain their original boundaries and hashes.
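Here's a simplified sketch of the idea in Python. It uses a plain polynomial rolling hash rather than a true Rabin fingerprint, and the window size, modulus, and minimum block length are illustrative guesses, not pinemere's actual parameters:

    WINDOW = 48                   # rolling window size in bytes (assumption)
    TARGET_BITS = 16              # low bits that must be zero: ~64 KB target
    MASK = (1 << TARGET_BITS) - 1
    PRIME = 1_000_003             # polynomial base (assumption)
    MOD = 1 << 61                 # keep hash values bounded

    def chunk(data: bytes) -> list[bytes]:
        blocks, start, h = [], 0, 0
        pow_out = pow(PRIME, WINDOW, MOD)   # weight of the byte leaving the window
        for i, byte in enumerate(data):
            h = (h * PRIME + byte) % MOD
            if i >= WINDOW:
                h = (h - data[i - WINDOW] * pow_out) % MOD
            # the boundary test depends only on the window's bytes, never on i,
            # so an insertion upstream can't shift boundaries far downstream
            if i - start + 1 >= WINDOW and h & MASK == 0:
                blocks.append(data[start:i + 1])
                start = i + 1
        if start < len(data):
            blocks.append(data[start:])     # trailing partial block
        return blocks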

Variable-size blocks

Content-defined splitting produces variable-size blocks. Pinemere's parameters work as follows.

The target size is controlled by the number of low bits in the Rabin hash that must be zero. For a 64 KB target, we check 16 bits (2^16 = 65,536 ≈ 64 KB expected distance between matches). For a 128 KB target, 17 bits. And so on.
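The mapping from bit count to expected block size is just a power of two:

    for bits in (15, 16, 17):
        print(f"{bits} low bits zero -> ~{(1 << bits) // 1024} KB expected blocks")
    # 15 low bits zero -> ~32 KB expected blocks
    # 16 low bits zero -> ~64 KB expected blocks
    # 17 low bits zero -> ~128 KB expected blocks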

In practice, block sizes follow a roughly geometric distribution with a mean near the target. On a real workload (my photo library, 48 GB, ~6,200 files), the actual mean block size was 61.3 KB with a standard deviation of 24.1 KB.

The Merkle tree

Once a file is chunked into blocks, pinemere builds a hash tree. Leaf nodes are the BLAKE3 hashes of individual blocks. Parent nodes hash the concatenation of their children's hashes. The root of the tree is the file's root hash — a single 32-byte value that uniquely identifies the file's entire content, regardless of how many blocks it contains.

         root hash
        /         \
     h(AB)       h(CD)
    /     \     /     \
  h(A)   h(B) h(C)   h(D)
   |      |    |      |
 block0 block1 block2 block3

The tree is binary and balanced. If the number of blocks isn't a power of two, the last level is padded with a sentinel hash. The branching factor could be higher (restic uses a fan-out of 16), but binary trees make the implementation simpler and the overhead is negligible — the tree metadata for a 10 GB file with 64 KB blocks is about 5 MB, stored in the SQLite manifest.
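A sketch of the construction, matching the diagram above. hashlib's BLAKE2b stands in for BLAKE3 (which needs a third-party package), and the all-zero sentinel used for padding is an assumption, not necessarily pinemere's choice:

    import hashlib

    SENTINEL = bytes(32)   # padding for non-power-of-two leaf counts (assumption)

    def h(data: bytes) -> bytes:
        # stand-in for BLAKE3, which the post names as pinemere's actual hash
        return hashlib.blake2b(data, digest_size=32).digest()

    def merkle_root(blocks: list[bytes]) -> bytes:
        level = [h(b) for b in blocks]            # leaf hashes (assumes >= 1 block)
        while len(level) & (len(level) - 1):      # pad leaves up to a power of two
            level.append(SENTINEL)
        while len(level) > 1:
            # each parent hashes the concatenation of its two children
            level = [h(level[i] + level[i + 1])
                     for i in range(0, len(level), 2)]
        return level[0]                           # 32-byte root hash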

Why this matters for sync

When pinemere syncs to a target, it doesn't compare file lists. It compares root hashes. If the root hashes match, the file is identical and is skipped. If they differ, walk the tree top-down: compare the two children, and recurse into any subtree whose hashes differ. This converges on the specific blocks that changed, without scanning the whole file.
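A sketch of that walk, assuming both sides materialize same-shaped trees as nested (hash, left, right) tuples with None children at the leaves; the representation, like the names, is illustrative:

    # A node is (hash, left, right); left and right are None at the leaf level.
    def changed_leaves(ours, theirs, path=""):
        our_hash, our_left, our_right = ours
        their_hash, their_left, their_right = theirs
        if our_hash == their_hash:
            return []                    # identical subtree: skip it entirely
        if our_left is None:
            return [path or "root"]      # a leaf that differs: this block changed
        # recurse only into children whose hashes differ
        return (changed_leaves(our_left, their_left, path + "0") +
                changed_leaves(our_right, their_right, path + "1"))

The accumulated bit string in path is the changed leaf's index in binary, which maps straight back to a block offset in the file.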

For a 2 GB file where one block changed, pinemere does O(log n) hash comparisons to find the changed block, then transfers 64 KB. Rsync's rolling checksum achieves something similar, but pinemere's approach generalizes across files — if the changed block happens to match a block in a different file that the target already has, it's not transferred at all.

Dedup ratios in practice

On my photo library (lots of duplicate RAW files, some edited JPEGs sharing EXIF headers):

On a code repository (monorepo, lots of vendored dependencies across branches):

The savings depend heavily on the data. Random-looking binary files (encrypted archives, compressed video) get almost no block-level dedup, because encryption and compression destroy the repeated byte patterns that chunk-level matching relies on. Text, source code, raw photos, and database dumps get excellent ratios.
