The Fascinating Application of Rolling Hashes in Backup Systems
Introduction
Rolling hashes play an important role in modern backup software such as Arq Backup, Duplicacy, and Kopia. With traditional incremental backups, restoring means restoring a full backup and then applying each incremental backup in order. These tools instead make a full backup every time and use de-duplication techniques to save space, so they don't have to store the parts that are identical between full backups.
File-based de-duplication
The simplest version of this is content-addressable storage (CAS), where each file is stored not by its name but by its hash (e.g. a SHA-256 hash). If we read a file, hash it, and see that we already have content saved under that hash, we don't need to back the file up again; we simply refer to the existing hash. Since the hash depends only on the content of the file, renaming or moving the file doesn't change its hash, and the de-duplication still works.
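As a rough sketch of the idea (not the code of any particular tool; the `store_file` function and the `cas-store` directory are just illustrative names), content-addressable storage can be as simple as:

```python
import hashlib
from pathlib import Path

def store_file(path: str, store_dir: str = "cas-store") -> str:
    """Store a file under its SHA-256 hash; skip the write if the content already exists."""
    data = Path(path).read_bytes()
    digest = hashlib.sha256(data).hexdigest()
    target = Path(store_dir) / digest
    if not target.exists():                      # content not seen before: store it
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)
    return digest                                # the backup only needs to record this hash
```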
Fixed-size chunk de-duplication
However, the disadvantage of that approach is that even if only a small part of a file changes, the whole file needs to be uploaded again. You can imagine why this will not work well with VM images or large database dumps. A slightly better approach is to split the file into multiple smaller chunks of fixed size.
Imagine we have this file and we split it every 5 characters (boundaries shown as spaces).
abcde fghij klmno pqrst uvwxy z
Now if only the second chunk changes:
abcde fDEij klmno pqrst uvwxy z
Only the second chunk needs to be re-uploaded. (We can use the same hashing technique to identify whether a chunk already exists.)
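As an illustrative sketch (using a toy 5-byte chunk size to match the example; real tools use much larger chunks), fixed-size chunking with per-chunk hashing might look like:

```python
import hashlib

def fixed_size_chunks(data: bytes, chunk_size: int = 5):
    """Split data into fixed-size chunks and hash each one."""
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        yield hashlib.sha256(chunk).hexdigest(), chunk

original = b"abcdefghijklmnopqrstuvwxyz"
modified = b"abcdefDEijklmnopqrstuvwxyz"
# Only the second pair of hashes differs, so only that chunk would be re-uploaded.
for (h1, c1), (h2, c2) in zip(fixed_size_chunks(original), fixed_size_chunks(modified)):
    print(c1, c2, "same" if h1 == h2 else "changed")
```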
Variable-size chunk de-duplication
A limitation of fixed-size chunking is that any insertion or deletion shifts all the following chunks. Continuing the previous example, if there's an insertion:
abcde fDEgh ijklm nopqr stuvw xyz
Now every chunk except the first has changed even though there's only a minor modification! It would be better to choose chunk boundaries in a way that does not depend on the position of the data within the file. For example, suppose we start a new chunk every time we encounter a vowel.
The original file would be chunked as:
abcd efgh ijklmn opqrst uvwxyz
The modified file would be chunked as:
abcd efD Egh ijklmn opqrst uvwxyz
Now the second chunk has been split into two chunks, but the magical part is that the rest of the chunks did not change, so we only have to re-upload two chunks rather than almost the entire file!
Of course, that was a simplistic example, and such a chunking algorithm would not work well for all documents. If a document has no vowels, it would become one big chunk; if it has only vowels, we would end up with far too many tiny chunks.
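For illustration only, here is a toy version of the vowel-based chunker; running it on the two strings above reproduces the chunk lists shown:

```python
def chunk_at_vowels(text: str):
    """Start a new chunk right before every vowel (a toy content-defined chunker)."""
    vowels = set("aeiouAEIOU")
    chunks, current = [], ""
    for ch in text:
        if ch in vowels and current:   # a vowel closes the current chunk and starts a new one
            chunks.append(current)
            current = ""
        current += ch
    if current:
        chunks.append(current)
    return chunks

print(chunk_at_vowels("abcdefghijklmnopqrstuvwxyz"))    # ['abcd', 'efgh', 'ijklmn', 'opqrst', 'uvwxyz']
print(chunk_at_vowels("abcdefDEghijklmnopqrstuvwxyz"))  # ['abcd', 'efD', 'Egh', 'ijklmn', 'opqrst', 'uvwxyz']
```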
Finding where to chunk
A simple solution
So the problem we need to solve is "finding where to chunk" based only on the content of the file. Before getting to the popular solution of rolling hashes, I would first like to discuss a solution that is simpler to understand.
Let’s say that we have a 1 GB file and we want to split it into chunks that are, on average, 8 MB. We need a function that can, for each point in the file, determine whether we should start a new chunk or not.
The function should have the following characteristics:
- It should not take the current position within the file as an input, as we do not want the chunking to change when the content is shifted.
- It can use the current byte and the previous bytes as input. However, the window of previous bytes it looks at should be limited, as we do not want a change far away to affect the current chunking. In other words, the chunking should be self-synchronizing.
- It should work for a variety of inputs; for example, it should work well for both text files and binary files.
- It should be fast, as we need to run it for every byte.
One way to do this is, for every byte in the file, take the last 64 bytes, run MD5 on them, and start a new chunk if the output meets a criterion with the right probability. Wanting chunks of 8 MB on average means we want the function to return true once every 8,388,608 bytes, and the log base 2 of 8,388,608 is 23. So if we look only at the first 23 bits of the MD5 output and return true when they are all 0, we get the probability we want.
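A minimal sketch of this approach (not the linked code; the function name and constants are illustrative) might look like:

```python
import hashlib

WINDOW = 64        # hash only the last 64 bytes at each position
TARGET_BITS = 23   # 2**23 = 8,388,608, so roughly one boundary every 8 MB

def md5_chunk_boundaries(data: bytes):
    """Return the offsets where new chunks start, using MD5 over a sliding window."""
    boundaries = [0]
    for i in range(WINDOW, len(data)):
        digest = hashlib.md5(data[i - WINDOW:i]).digest()
        # Take the first 3 bytes (24 bits) and test whether the first 23 bits are all zero.
        if int.from_bytes(digest[:3], "big") >> (24 - TARGET_BITS) == 0:
            boundaries.append(i)
    return boundaries
```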
Let’s try it out on a 1.26 GB sample file. You can find the code here.
There are 156 chunks in total. The smallest chunk is 18 KB and the largest is 38 MB. The median chunk size is 5.6 MB, and the average chunk size is around 8 MB, as expected. This size distribution is acceptable, but less than ideal.
However, the biggest problem is the time it took. Processing the file took around 8 minutes, so a backup using this technique would take forever!
A faster solution with rolling hash
The problem with MD5 is that for every byte, the hash has to be recomputed from scratch over the whole window, which takes a long time. What we need is a hash that can be updated incrementally as the window slides forward one byte at a time. That is called a rolling hash.
Here I’m going to try the cyclic polynomial hash, also called Buzhash. You can find the code here.
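A sketch of a Buzhash-based chunker (again, not the linked code; the random byte table, the 32-bit hash width, and testing the low 23 bits are choices made for this sketch) could look like:

```python
import random

WINDOW = 64             # same 64-byte window as before
MASK = (1 << 23) - 1    # testing 23 bits gives ~8 MB average chunks

# A fixed table mapping each byte value to a random 32-bit value (seeded for reproducibility).
random.seed(0)
TABLE = [random.getrandbits(32) for _ in range(256)]

def rotl(x: int, n: int, bits: int = 32) -> int:
    """Rotate a 32-bit integer left by n bits."""
    n %= bits
    return ((x << n) | (x >> (bits - n))) & ((1 << bits) - 1)

def buzhash_boundaries(data: bytes):
    """Return chunk-start offsets using a Buzhash (cyclic polynomial) rolling hash."""
    boundaries = [0]
    h = 0
    for i, byte in enumerate(data):
        if i < WINDOW:
            # Build up the hash of the initial window one byte at a time.
            h = rotl(h, 1) ^ TABLE[byte]
            continue
        # Roll forward: rotate, remove the byte leaving the window, add the new byte.
        h = rotl(h, 1) ^ rotl(TABLE[data[i - WINDOW]], WINDOW) ^ TABLE[byte]
        if h & MASK == 0:
            boundaries.append(i)
    return boundaries
```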
There are 145 chunks in total. The smallest chunk is 20 KB and the largest is 39 MB. The median chunk size is 6.7 MB, and the average chunk size is around 8.7 MB.
You can see that the result is very similar to the MD5 version, but it is a lot faster: processing the same file took only around 2 seconds.
Conclusion
This article described why variable-size chunking is valuable in backup applications and how rolling hash algorithms can be used to achieve it efficiently.
However, because the boundaries are chosen probabilistically, individual chunks can end up much bigger or much smaller than the desired size. Many backup programs avoid this by enforcing a minimum and maximum chunk size, which overrides the result of the chunking algorithm.
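For example, the limits could be layered on top of any boundary test along these lines (the 2 MB and 16 MB limits and the `is_boundary` predicate are made up for illustration):

```python
MIN_CHUNK = 2 * 1024 * 1024     # illustrative 2 MB floor
MAX_CHUNK = 16 * 1024 * 1024    # illustrative 16 MB ceiling

def clamped_chunks(data: bytes, is_boundary):
    """Split data using is_boundary(data, i), but keep every chunk within the size limits."""
    start = 0
    for i in range(1, len(data)):
        size = i - start
        # Only ask the chunker once the minimum size is reached; force a cut at the maximum.
        if (size >= MIN_CHUNK and is_boundary(data, i)) or size >= MAX_CHUNK:
            yield data[start:i]
            start = i
    yield data[start:]           # whatever remains becomes the final chunk
```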
Another interesting thing to investigate would be the ideal chunk size for various applications. A smaller chunk size increases overhead (metadata needs to be stored for each chunk), while a larger chunk size reduces de-duplication effectiveness (any change results in re-uploading a larger chunk).