Designing a Backup System (Addendum)
Recently, I’ve been studying how Kopia stores its data, and based on that, there are some changes I would like to make to the hypothetical backup system I designed.
Local Encrypted Index in each Pack
A Kopia pack is basically just the data from multiple chunks appended together, without any marker in between. The metadata recording where each chunk ends and the next one starts is stored in indices. The indices live in two places:
- A separate encrypted index file where indices from all packs are stored together. In normal operation, this is the index file used to find chunks without having to download all packs and scan them.
- The end of each pack, where that pack’s encrypted index is stored as well. If the separate index is lost, we can simply scan the end of each pack to rebuild it.
In my previous design, a “pack” was simply all chunks appended together with no special index. While that was nice and simple, the downside is that it leaks how many chunks are in the pack and what their compressed sizes are. I would switch to Kopia’s design of having an encrypted index instead.
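To make the idea concrete, here is a minimal sketch of a pack with a trailing index. This is not Kopia’s actual on-disk format: the layout (chunks, then a sealed index, then an 8-byte length footer), the JSON index encoding, and the names `build_pack`, `read_local_index`, `read_chunk` and `toy_seal` are all assumptions made for illustration. Encryption is passed in as a callable, and the XOR stand-in below is not real encryption. The chunks are treated as opaque bytes that have already been encrypted.

```python
import json
import struct
from typing import Callable, Dict, List

# Hypothetical layout (an assumption, not Kopia's real format):
#   [chunk bytes ...][sealed index][8-byte big-endian length of sealed index]
FOOTER = struct.Struct(">Q")


def build_pack(chunks: Dict[str, bytes], seal: Callable[[bytes], bytes]) -> bytes:
    """Append chunks back to back and store a sealed index at the end."""
    body = bytearray()
    index: Dict[str, List[int]] = {}
    for chunk_id, data in chunks.items():
        index[chunk_id] = [len(body), len(data)]  # [offset, length]
        body += data  # no marker between chunks
    sealed = seal(json.dumps(index).encode())
    return bytes(body) + sealed + FOOTER.pack(len(sealed))


def read_local_index(pack: bytes, unseal: Callable[[bytes], bytes]) -> Dict[str, List[int]]:
    """Recover the index from the end of a pack, e.g. if the separate index is lost."""
    (sealed_len,) = FOOTER.unpack(pack[-FOOTER.size:])
    start = len(pack) - FOOTER.size - sealed_len
    return json.loads(unseal(pack[start:start + sealed_len]))


def read_chunk(pack: bytes, index: Dict[str, List[int]], chunk_id: str) -> bytes:
    """Slice one chunk out of the pack using its index entry."""
    offset, length = index[chunk_id]
    return pack[offset:offset + length]


def toy_seal(data: bytes) -> bytes:
    """Placeholder for a real cipher. XOR is NOT encryption; it is its own inverse."""
    return bytes(b ^ 0x5A for b in data)


pack = build_pack({"c1": b"hello", "c2": b"world!"}, toy_seal)
index = read_local_index(pack, toy_seal)
second_chunk = read_chunk(pack, index, "c2")
```

Note that the plaintext length footer in this sketch still reveals the size of the index, so a real design would probably want to pad the index or hide its length as well.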
Strictly Append Only
While my previous design was already append-only, seeing how much work and complexity Kopia has put into maintenance reminds me how risky, and how much of a nightmare, mutable designs can be.
Thus, my ideal backup system would be 100% append-only. Of course, we still need to prune old backups, as we do not have infinite storage (or infinite money to pay for infinite storage). The pruning method I would use is delete-only, with no index updates or rewrites. It would work as follows:
- During the creation of a snapshot, if we are to re-use any chunks or packs, the last-modified-time of the chunk or pack should be updated to the current time.
- Delete any snapshots we no longer want.
- Delete any chunks or packs whose last-modified-time is older than X days and that are not referenced by any current snapshot.
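The three steps above can be sketched as follows. This is only an in-memory model under my own assumptions: the `Store` class, its fields, and the `min_age_days` parameter (the “X days” above) are hypothetical names, and a real system would read last-modified times from the storage backend instead of a dictionary.

```python
from dataclasses import dataclass, field
from typing import Dict, Set

DAY = 86400.0  # seconds


@dataclass
class Store:
    # object id (chunk or pack) -> last-modified time in seconds
    last_modified: Dict[str, float] = field(default_factory=dict)
    # snapshot id -> ids of the objects it references
    snapshots: Dict[str, Set[str]] = field(default_factory=dict)

    def create_snapshot(self, snap_id: str, object_ids: Set[str], now: float) -> None:
        """Step 1: new objects are written, re-used ones get their time bumped."""
        for oid in object_ids:
            self.last_modified[oid] = now
        self.snapshots[snap_id] = set(object_ids)

    def delete_snapshot(self, snap_id: str) -> None:
        """Step 2: drop a snapshot we no longer want."""
        del self.snapshots[snap_id]

    def prune(self, now: float, min_age_days: float) -> Set[str]:
        """Step 3: delete objects that are both old and unreferenced."""
        referenced: Set[str] = set().union(*self.snapshots.values())
        cutoff = now - min_age_days * DAY
        doomed = {
            oid
            for oid, mtime in self.last_modified.items()
            if mtime < cutoff and oid not in referenced
        }
        for oid in doomed:
            del self.last_modified[oid]  # delete only, nothing is rewritten
        return doomed


store = Store()
store.create_snapshot("s1", {"a", "b"}, now=0 * DAY)
store.create_snapshot("s2", {"b", "c"}, now=10 * DAY)  # re-uses "b"
store.delete_snapshot("s1")
deleted = store.prune(now=12 * DAY, min_age_days=7)  # only "a" is old and unreferenced
```

In this model, an object that is unreferenced but was modified recently survives the prune, so nothing is removed until it has been both untouched for X days and dropped by every snapshot.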
Some readers might already be thinking: what about packs in which only a small amount of the data is still in use? Won’t we keep them forever? My solution to that is:
- Put packs whose utilization is under X% into a “do not re-use” list.
- The next snapshot we make will no longer re-use chunks inside that pack but instead create a new pack.
- We cannot get rid of the old pack right away, but once we eventually prune the snapshots that use it, we will be able to delete it.
Of course, this would not work if you don’t prune the old snapshots. But if you don’t prune old snapshots, it’s unlikely that you will end up with a lot of partially unused packs in the first place.
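The “do not re-use” list could be computed roughly like this. It is a sketch under assumptions of my own: utilization is measured as referenced bytes divided by total bytes in the pack, and the function names and the `min_utilization` parameter (the “X%” above, written as a fraction) are hypothetical.

```python
from typing import Dict, Optional, Set


def utilization(pack_chunks: Dict[str, int], referenced: Set[str]) -> float:
    """Fraction of a pack's bytes still referenced by some snapshot."""
    total = sum(pack_chunks.values())
    live = sum(size for cid, size in pack_chunks.items() if cid in referenced)
    return live / total if total else 0.0


def do_not_reuse(
    packs: Dict[str, Dict[str, int]],  # pack id -> {chunk id: size in bytes}
    referenced: Set[str],
    min_utilization: float,
) -> Set[str]:
    """Packs whose utilization has dropped below the threshold."""
    return {
        pack_id
        for pack_id, chunks in packs.items()
        if utilization(chunks, referenced) < min_utilization
    }


def plan_chunk(chunk_id: str, chunk_to_pack: Dict[str, str], blocked: Set[str]) -> Optional[str]:
    """Return the pack to re-use, or None if the chunk must go into a new pack."""
    pack_id = chunk_to_pack.get(chunk_id)
    if pack_id is None or pack_id in blocked:
        return None
    return pack_id


packs = {"p1": {"a": 100, "b": 900}, "p2": {"c": 500, "d": 500}}
referenced = {"a", "c", "d"}  # "b" is no longer used by any snapshot
blocked = do_not_reuse(packs, referenced, min_utilization=0.5)
chunk_to_pack = {"a": "p1", "b": "p1", "c": "p2", "d": "p2"}
reuse_a = plan_chunk("a", chunk_to_pack, blocked)  # None: "a" gets written to a new pack
reuse_c = plan_chunk("c", chunk_to_pack, blocked)  # "p2": still fine to re-use
```

Here `p1` is only 10% utilized, so the next snapshot writes chunk `a` into a fresh pack, and `p1` becomes deletable once the older snapshots that reference it are pruned.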