First things first, yesterday I decided to write my own backup software, but after a bit more investigation I came across Kopia, which did most of what I wanted, has a great CLI and documentation, and is in active development. Thus, I decided to put the project on hold and use Kopia instead.
Nevertheless, I have already drafted some ideas on how I'd want my backup system to work, so it's probably a good exercise to write them down here in case I change my mind later.
- Data — Actual content within the file. Does not include metadata such as file name or modification time.
- Metadata — Data describing a file, such as file name, modification time, or permission.
- Chunk — A part of a file. A big file is split into multiple chunks for easier handling.
- Pack — Multiple chunks combined into one file for more efficient handling.
- Blob — File data (a chunk or a pack) stored on our backup medium.
- Snapshot — One backup at a point in time is called a snapshot. Making an update to the backup creates a second snapshot.
- Backup Medium — Where the backup is actually stored. It can be cloud storage or an external hard-drive.
- There should be a clear separation between the data and metadata. This allows the design of the metadata to evolve without re-uploading the blobs.
- Multiple snapshots should be mostly independent, except that they may happen to refer to the same blobs. Making a new snapshot should not touch the old snapshots at all. There should be no global index with a list of snapshots, chunks, or packs, as such an index is prone to corruption and synchronization issues.
- Use content-addressing as much as possible. For example, a chunk stored on our backup medium should have a file name equal to the hash of the chunk's content. This creates determinism (no matter how many times you upload a chunk, you end up with the same file name) and reduces the need for index files.
- It is OK to have some redundancy in the metadata as its size is relatively small. It is not OK for the entire backup to become unreadable because a tiny piece of metadata got corrupted.
The layout on the backup medium is as follows:
```
<backup root>
├── <blob directory>
│   ├── 7d6fd7774f0[...]d4.chunk (single chunk)
│   └── f2ca1bb6c7e[...]02.pack (pack file w/ multiple chunks)
└── <snapshot directory>
    └── ab03c34f1ec[...]c5.snapshot (snapshot metadata)
```
File data is stored either as a single chunk or inside a pack file (packing shall be explained later). The snapshot file contains all metadata. Only blobs are shared between snapshots; each snapshot has independent metadata even if snapshots share some files. The file name is a hash value and the extension is the file type. Note that snapshots live in their own directory to facilitate listing all snapshots.
The following steps are used to generate a backup:
- Get the list of files to be included in the backup.
- Read each file and split the file into smaller chunks if necessary.
- Calculate the hash of the chunk.
- Check whether the hash already exists in the backup. If it does, skip to the next chunk.
- If the hash doesn’t exist, compress and encrypt the chunk.
- If the compressed chunk is big enough, upload it as a single chunk file. If it is small, combine multiple chunks into a single pack file before uploading.
- After all chunks are processed, generate, compress, encrypt and upload the snapshot file.
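The steps above can be sketched end to end as a toy program. This is only an illustration: the 4-byte chunk size, the in-memory `stored` dict standing in for the backup medium, and the omission of encryption and packing are all simplifications, not part of the design.

```python
import hashlib
import hmac
import zlib

SECRET = b"key derived from the pass-phrase"  # placeholder secret
stored = {}  # stands in for the backup medium: blob name -> blob content

def chunk_hash(data: bytes) -> str:
    # Keyed hash so the blob name is deterministic but not publicly computable
    return hmac.new(SECRET, data, hashlib.sha256).hexdigest()

def backup(files: dict[str, bytes]) -> dict:
    snapshot = {"files": {}}
    for path, data in files.items():
        # Toy 4-byte chunks; a real chunker would use much larger chunks
        chunks = [data[i:i + 4] for i in range(0, len(data), 4)]
        hashes = []
        for c in chunks:
            h = chunk_hash(c)
            hashes.append(h)
            if h + ".chunk" in stored:
                continue  # de-duplicated: content already uploaded
            stored[h + ".chunk"] = zlib.compress(c)  # encryption omitted
        snapshot["files"][path] = hashes
    return snapshot
```

Running the backup twice over the same content uploads nothing new, which is exactly the de-duplication property we want.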
The following sections shall go into detail about each step.
Each file shall be chunked before processing. This can be either fixed-size chunking or content-defined chunking such as Buzhash. Small files will have a single chunk containing the entire file.
Chunking allows us to process only a part of a file at a time, limiting memory usage. For example, we may have a 1 GB file, but if we process only 10 MB chunks of that file at a time and discard each chunk from RAM after uploading, then we won't need 1 GB of memory to process and upload it.
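A minimal fixed-size chunker might look like this (the 10 MB chunk size matches the example above; the function name is just for illustration):

```python
from typing import BinaryIO, Iterator

CHUNK_SIZE = 10 * 1024 * 1024  # 10 MB

def iter_chunks(f: BinaryIO, chunk_size: int = CHUNK_SIZE) -> Iterator[bytes]:
    """Yield fixed-size chunks so only one chunk is in memory at a time."""
    while True:
        chunk = f.read(chunk_size)
        if not chunk:  # end of file
            return
        yield chunk
```

A content-defined chunker would replace the fixed `read()` with a rolling hash that decides the cut points, but the streaming shape stays the same.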
- Some other backup software (e.g. Duplicacy, Tarsnap) packs the files before chunking. This means the first or last part of a file may be combined with other files in a chunk, so renaming a file can change the resulting chunks.
- The chunking algorithm should not be changed after a backup is made as a different chunking algorithm will result in different chunks causing files to get re-uploaded.
Hashing allows us to uniquely refer to a chunk in a deterministic way. Different runs of the same backup on the same file should result in the same chunking and thus the same hash. This way, if we find that the hash already exists in the backup medium, we can skip the rest of the process because the content of the chunk is already stored.
We want to hash before compressing and encrypting since even if a chunk is compressed with a different algorithm and encrypted with a different key, it is still the same chunk and should still have the same name.
To avoid data leakage, we don't want to use a plain hashing algorithm like SHA-256 but instead the HMAC version of it with a secret key derived from the pass-phrase. With plain hashing, an attacker with access to the backup medium can tell whether you have a specific file, because its hash matches a file they already have. (This is similar to using a salt to prevent rainbow-table lookups of passwords.)
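In Python's standard library this keyed hashing looks roughly like the following; the scrypt parameters are illustrative, not a tuning recommendation:

```python
import hashlib
import hmac

def hash_key(passphrase: bytes, salt: bytes) -> bytes:
    # Derive the HMAC secret from the pass-phrase (parameters are illustrative)
    return hashlib.scrypt(passphrase, salt=salt, n=2**14, r=8, p=1, dklen=32)

def chunk_id(key: bytes, chunk: bytes) -> str:
    # Keyed hash: without the key, an attacker cannot precompute
    # the names of chunks for files they already have
    return hmac.new(key, chunk, hashlib.sha256).hexdigest()
```

The name stays deterministic for one pass-phrase, but differs from both the plain SHA-256 hash and the name another key would produce.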
We want to compress the chunk in case the chunk is highly compressible (such as being a text file) to save space on the backup medium.
If the compressor does not already provide a header, this step should prepend a small header so it is possible to find out which decompressor to use when we want to restore a backup.
- We cannot chunk after compressing because content-defined chunking completely breaks if we compress first: a small change in the file may cause the entire compressed output to change.
- We cannot compress after encryption because compression algorithms do not work well on encrypted data.
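A sketch of the compression step with such a prepended header, assuming hypothetical one-byte compressor ids (including a "stored" id for incompressible chunks):

```python
import zlib

# Hypothetical one-byte compressor ids; 0 means "stored uncompressed"
STORED, ZLIB = 0, 1

def compress_chunk(data: bytes) -> bytes:
    packed = zlib.compress(data)
    if len(packed) < len(data):
        return bytes([ZLIB]) + packed
    return bytes([STORED]) + data  # incompressible: store as-is

def decompress_chunk(blob: bytes) -> bytes:
    algo, body = blob[0], blob[1:]
    if algo == ZLIB:
        return zlib.decompress(body)
    if algo == STORED:
        return body
    raise ValueError(f"unknown compressor id {algo}")
```

New compressor ids can be added later without touching already-uploaded blobs, which is the point of carrying the id per chunk.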
We want to encrypt the file to prevent people from being able to access the files even if they have access to the storage.
The key to encrypt a chunk should be derived as follows:
- First, a “master key” is derived from the pass-phrase using a Password-Based Key Derivation Function (PBKDF) such as scrypt. This creates a strong key out of a pass-phrase that might not be as strong, and provides a level of protection against brute-force attacks.
- Then, for each chunk, the actual encryption key is derived from the “master key” using an HMAC-based Key Derivation Function (HKDF), with the HMAC hash value of the chunk's original content (computed in the previous step) as input. This reduces the risks of re-using one master key to encrypt a large amount of content. (For example, there are known issues with AES-GCM if an IV is eventually re-used with the same key.)
- A random IV can be used. The encrypted content does not need to be reproducible for the backup system to work. (It does not matter what the encrypted content is as long as we get the original chunk content after decryption.) Nevertheless, if we want reproducible encrypted content, I do not see any issue with using a static IV as the key is already different per chunk.
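A minimal sketch of this two-step derivation using only the standard library. The scrypt parameters are illustrative, and the HKDF is the single-block RFC 5869 construction (enough for one 32-byte key), with the chunk hash fed in as the extract salt:

```python
import hashlib
import hmac

def master_key(passphrase: bytes, salt: bytes) -> bytes:
    # PBKDF step; parameters are illustrative, not a recommendation
    return hashlib.scrypt(passphrase, salt=salt, n=2**14, r=8, p=1, dklen=32)

def chunk_key(master: bytes, chunk_hash: bytes) -> bytes:
    # Minimal HKDF (RFC 5869): extract with the chunk hash as salt,
    # then a single expand block for a 32-byte key
    prk = hmac.new(chunk_hash, master, hashlib.sha256).digest()  # extract
    return hmac.new(prk, b"\x01", hashlib.sha256).digest()       # expand
```

Each chunk gets its own key deterministically, so the same chunk always encrypts under the same key while two different chunks never share one.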
While we can already verify the authenticity of the original chunk data from the HMAC hash value, it is still desirable to use an authenticated encryption algorithm such as AES-GCM; otherwise the compressed data is not authenticated at all, and passing unverified data to the decompressor unnecessarily increases our attack surface.
After encryption, we can optionally apply a forward-error-correction (FEC) algorithm to protect against small errors in the backup medium. This is not required for cloud backups as correctness is guaranteed by the cloud provider but might help for backup to other media.
The final output should be formatted as:
- A static header identifying that the file belongs to this backup application (this is for failure recovery)
- The HMAC hash value of the original uncompressed unencrypted chunk content
- The encryption and FEC algorithm so we know which algorithm to use to decode the information
- The size of the encrypted content (this is for failure recovery)
- Whether this is a snapshot metadata or a normal chunk (this is for failure recovery)
- The encrypted content
- The authentication tag (this is an additional output from AES-GCM used to verify the authenticity of the encrypted data)
- Data for the FEC algorithm (optional)
Note that even though the hash value is already used as the blob's file name, we still want to include it in the header for two reasons:
- When packed, we do not have the original file name, so having it here allows us to know the hash value even when the file is a part of a pack. (See next section.)
- For failure recovery. (Will be described in a future section.)
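One hypothetical encoding of this layout with `struct` (the field widths, such as the 4-byte magic and 8-byte size, are assumptions; FEC data is omitted and the tag is simply appended):

```python
import struct

MAGIC = b"MYBK"  # hypothetical application identifier
# magic, chunk hash, encryption algo, FEC algo, encrypted size, blob type
HEADER = struct.Struct(">4s32sBBQB")
TYPE_CHUNK, TYPE_SNAPSHOT = 0, 1

def build_blob(chunk_hash: bytes, enc_algo: int, fec_algo: int,
               blob_type: int, encrypted: bytes, tag: bytes) -> bytes:
    header = HEADER.pack(MAGIC, chunk_hash, enc_algo, fec_algo,
                         len(encrypted), blob_type)
    return header + encrypted + tag

def parse_header(blob: bytes):
    magic, h, enc, fec, size, btype = HEADER.unpack_from(blob)
    if magic != MAGIC:
        raise ValueError("not one of our blobs")
    return h, enc, fec, size, btype
```

Because every field needed for recovery sits at a fixed offset after the magic, a scanner can parse a blob even without knowing its file name.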
Packing combines multiple small chunks into one large pack in order to increase efficiency when working with cloud storage providers which charge per API call.
There is no special structure for the pack. The small chunks are simply concatenated until we reach the “minimum pack size” or run out of new data. When the metadata refers to a pack, it points to an exact offset inside the pack.
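The concatenation and the offset map it produces can be sketched as follows (the “minimum pack size” value is a placeholder, and a real implementation would start a new pack instead of stopping):

```python
def build_pack(chunks: dict[str, bytes], min_pack_size: int = 4096):
    """Concatenate already-encrypted chunks, recording offsets for the metadata."""
    pack = bytearray()
    offsets = {}  # chunk hash -> (start, end) inside the pack
    for chunk_hash, data in chunks.items():
        start = len(pack)
        pack.extend(data)
        offsets[chunk_hash] = (start, len(pack))
        if len(pack) >= min_pack_size:
            break  # sketch: a real implementation would begin a new pack here
    return bytes(pack), offsets
```

The `offsets` mapping is exactly what the snapshot metadata stores so a chunk can later be fetched with a ranged download.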
The name of the pack should be an HMAC hash of the concatenated original hash values inside the pack. The exact choice does not matter too much because, unlike with chunking, the same pack hash is highly unlikely to be reproduced except when running against the exact same set of files. Thus, for de-duplication we must refer to the metadata instead of expecting the same pack hash to be generated.
Packing is done after compression because we do not know about the final size before compressing. Packing before compression still fulfills our requirement to combine multiple small files together before uploading. However, we may end up with packs that are way too small after compression. Packing after compression may have worse compression ratios but result in more stable pack sizes.
Packing is done after encryption because we want to be able to download only a part of a pack and fully decrypt it. This leaks some data about how many small files are inside the pack, but I think it is worth the flexibility. If leaking data is not desired, then encryption after packing is also possible.
As packing adds a level of indirection (we cannot efficiently access data within a pack without a lookup table), we want to do it only when required. Thus, we should pick a chunk size considerably larger than the “minimum pack size” to avoid chunks with average compression ratios getting too small after compression and requiring packing.
The snapshot file contains three parts:
- The metadata about the snapshot (e.g. snapshot time, source computer, backup volume name)
- List of files in the snapshot with all metadata (e.g. full file path, modification time, file size, file permission) as well as list of chunks for the file as a list of hashes.
- List of chunks that are inside packs. This should be a mapping from “chunk hash value” to “pack hash value with start and end offsets within the pack”.
The format of the snapshot file can simply be JSON. Other, more efficient formats such as MessagePack or Protocol Buffers can also be used, but as the snapshot file is relatively small, there may be little point in trying to optimize its size.
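A hypothetical JSON snapshot following the three parts above might look like this (all field names and values are invented for illustration):

```python
import json

snapshot = {
    # Part 1: metadata about the snapshot itself
    "metadata": {"time": "2023-01-01T00:00:00Z", "host": "laptop"},
    # Part 2: files with their metadata and ordered chunk hashes
    "files": [
        {"path": "/home/user/notes.txt", "mtime": 1672531200,
         "size": 11, "mode": 0o644, "chunks": ["7d6f...", "f2ca..."]},
    ],
    # Part 3: chunk hash -> pack hash with offsets inside the pack
    "packs": {
        "f2ca...": {"pack": "ab03...", "start": 0, "end": 1024},
    },
}
encoded = json.dumps(snapshot, indent=2).encode()  # then compress + encrypt
decoded = json.loads(encoded)
```

Everything a restore needs is in this one file, which is what keeps snapshots independent of each other.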
The snapshot should be compressed and encrypted the same way chunks are. The filename can be random as snapshots are unique (since they contain metadata such as snapshot time) so there is no point trying to derive the name in a deterministic way. However, the name should look like a valid hash value as we still use the hash value to derive the encryption key.
While the process for creating a backup is already described in the summary above, this section will go into more details.
Before running a backup, we first want to download all snapshot files so we can find out which files and chunks already exist on the backup medium, either as individual blobs or within a pack, and skip them accordingly.
After listing the files, as an optimization, for each file, we can compare the file size and modification time on the file system with the one in the last snapshot. If they match, we can skip hashing the file and simply copy the list of chunks as well as any pack metadata from the last snapshot into the current snapshot.
If the size and modification time are different (or if we are paranoid), we chunk the file and hash each chunk. For each chunk, if it has already been referenced in a previous snapshot, then we know it already exists on the backup medium and we can skip to the next chunk. If the chunk is in a pack, the pack metadata needs to be copied into the current snapshot. If the chunk was not referenced, it might still already be on the backup medium (due to an incomplete backup), and we should check for its existence before attempting to upload. (If we are paranoid, we can always check for the chunk's or pack's existence on the backup medium in case it got accidentally deleted by the pruning process.)
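The per-chunk decision just described can be sketched as a small function; the snapshot layout and the `exists_on_medium` callback are hypothetical:

```python
def plan_chunk(chunk_hash: str, prev_snapshots: list[dict],
               exists_on_medium) -> str:
    """Decide what to do with one chunk during an incremental backup."""
    for snap in prev_snapshots:
        if chunk_hash in snap.get("packs", {}):
            return "copy-pack-metadata"  # referenced inside a pack
        for chunks in snap.get("files", {}).values():
            if chunk_hash in chunks:
                return "skip"  # already stored as an individual blob
    # Never referenced: it may still exist from an interrupted backup
    if exists_on_medium(chunk_hash):
        return "skip"
    return "upload"
```

The existence check on the last path is what makes resumed backups cheap: an orphaned chunk from an interrupted run is found by name and never re-uploaded.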
Finally, after all chunks are uploaded, the snapshot metadata should be uploaded.
The software should be able to efficiently resume interrupted backups. As the snapshot metadata is not written to the backup medium until all the chunks are written, an interrupted backup simply results in orphaned chunks and packs written to the medium without a snapshot file. (It is assumed here that if a file is present on the medium, it was written completely. S3-compatible storage provides this guarantee, but for other media we may want to use the upload-then-rename technique.)
To resume, simply run the backup again; before uploading a chunk, first check whether that chunk already exists. If it does, we simply skip it.
Unfortunately, we cannot do the same with packs. Theoretically, if we run the backup with the exact same source files, then the same packs will be generated again, allowing the presence check to be used. However, if a small file is added to the backup source, it might shift the content within the packs, causing the names of all the packs to change. Thus it is useful to keep a local cache of the contents of successfully uploaded packs not yet included in a snapshot, which we can consult during a resume to avoid redundant uploads.
Restoring a backup without any previous local information can be done as follows:
- Download all snapshot files and decrypt using user-provided pass-phrase.
- Read the content of the snapshot files and ask the user to pick a snapshot.
- Ask the user to pick the list of files to restore.
- Download all the chunks. Chunks inside packs should be downloaded from the pack file referred to in the snapshot metadata. Chunks not listed in the pack mapping can be downloaded directly using the hash value as the file name.
- Assemble the original file from the chunks.
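The last two steps could be sketched like this; the callback names are hypothetical, and decryption/decompression of each chunk is omitted:

```python
def restore_file(chunk_hashes, pack_index, fetch_blob, fetch_range) -> bytes:
    """Reassemble one file from its ordered chunk list."""
    parts = []
    for h in chunk_hashes:
        if h in pack_index:  # chunk lives inside a pack: ranged download
            entry = pack_index[h]
            data = fetch_range(entry["pack"], entry["start"], entry["end"])
        else:  # individual blob, named by its hash
            data = fetch_blob(h)
        parts.append(data)  # decrypt + decompress omitted in this sketch
    return b"".join(parts)
```

Because each chunk is encrypted before packing, the ranged download of a packed chunk can be decrypted on its own, as noted in the packing section.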
Pruning Old Backups
Deleting old backups can be done by deleting the relevant snapshot file and then deleting the blobs which became unused as a result. To find out which blobs are unused, we must download all remaining snapshots to figure out which blobs are still being used.
It is important not to attempt blob pruning while another backup is running, since we may end up deleting a blob referred to by the new backup.
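The unused-blob computation is a simple mark-and-sweep over the remaining snapshots; this sketch assumes the hypothetical snapshot layout used earlier (a `files` chunk list plus a `packs` mapping):

```python
def unused_blobs(all_blobs: set[str], snapshots: list[dict]) -> set[str]:
    """Mark-and-sweep: a blob is live if any remaining snapshot references it."""
    live: set[str] = set()
    for snap in snapshots:
        packed = snap.get("packs", {})  # chunk hash -> pack entry
        for chunks in snap.get("files", {}).values():
            for h in chunks:
                if h in packed:
                    live.add(packed[h]["pack"] + ".pack")
                else:
                    live.add(h + ".chunk")
    return all_blobs - live  # everything unreferenced can be deleted
```

Note that a pack stays alive as long as any chunk inside it is referenced, so pruning never frees partially used packs in this sketch.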
From this design, we can recover from certain classes of failure.
First, let’s enumerate the classes of error we cannot recover from and how to possibly prevent them:
- Complete destruction of the backup medium — We can avoid this situation by having multiple separate backups (e.g. a local backup and a cloud backup).
- Corruption of backup data due to newer backup — This should never happen as newer backups do not modify older snapshots at all by design. Object locking on S3 can be enabled to further strengthen this guarantee.
- Missing metadata — If the metadata is missing, then while we can decrypt the chunks, it is very hard to put them together in the correct order. In practice, if we have multiple snapshots, then each snapshot has its own metadata so it is unlikely for the metadata for all snapshots to accidentally go missing.
- Lost pass-phrase or access key to cloud providers — If the pass-phrase is lost, then there is no way to recover the data. Thus it is crucial to store the pass-phrase securely. In extreme cases, you can use a secret sharing algorithm to share the secret with multiple friends (who do not know each other) in a way that enough parts of the secret need to be collected to restore the original secret.
Next are errors we can prevent with defensive coding:
- Corruption of data during processing or transfer — Data can be corrupted during processing or transfer (e.g. due to bad RAM). For cloud storage, we can catch this by doing a verification after the final encryption but before upload (to avoid download costs). First we compute an MD5 checksum of the final chunk or pack. Then we decrypt/decompress the chunk or pack and compare it with the original file, read from disk again, to see if they match. Finally, during upload, we send the MD5 to S3 and ask S3 to verify the checksum of the uploaded content. This makes sure the data on S3 is the same as the data we have just verified. For local backups, we should simply do a full verification by reading from the backup and comparing to the local files after the backup is complete. (FEC cannot help in this case, as by the time we apply FEC the data might already be corrupted.)
- Corruption of data during storage — This should not happen for cloud storage, but for local storage, we can use FEC to recover from minor corruptions.
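The pre-upload verification described above could be sketched as follows. Here `zlib` stands in for the real compress-then-encrypt pipeline, and the base64-encoded MD5 is the form S3 expects in the `Content-MD5` request header:

```python
import base64
import hashlib
import zlib

def verify_before_upload(original: bytes) -> tuple[bytes, str]:
    """Round-trip the processed blob before upload to catch bad-RAM corruption."""
    blob = zlib.compress(original)  # stands in for compress + encrypt
    if zlib.decompress(blob) != original:  # decode again and compare
        raise RuntimeError("corruption detected during processing")
    # Content-MD5 value for S3 to verify server-side after the upload
    content_md5 = base64.b64encode(hashlib.md5(blob).digest()).decode()
    return blob, content_md5
```

If S3 rejects the upload because the checksum does not match, the corruption happened in transfer and the upload can simply be retried.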
Next are errors we can partially recover from:
- Missing chunks — We cannot recover a missing chunk itself. However, we can be sure that missing chunks only affect the recovery of the files referencing them. They will not cause issues recovering other files.
- Corrupted file system — For local storage, if a file system got corrupted and cannot be mounted, we still have a chance to recover the data, unlike some full-disk encryption systems where recovery is impossible if the header is lost. The caveat is that only chunks stored contiguously on disk can be recovered. (This should mostly be the case since chunks are relatively small, but I'm not a file system expert, so don't quote me on it.)
The recovery steps for a corrupted file system are as follows:
- Scan the raw disk for our static headers and create a mapping from the hash value of each chunk to its offset on the disk. The hash value, chunk size, and whether it is a snapshot can be found inside the header. (Note that we can ignore the concept of packs, as their contents will be picked up as individual chunks anyway.)
- Decrypt all chunks marked as snapshots with the user-provided pass-phrase and ask the user which snapshots/files they want to recover. (Note that we explicitly do not use any encryption indirection such as generating a random key, encrypting it with the pass-phrase, and storing it. While that allows the pass-phrase to be changed, it makes it almost impossible to recover any data if we lose the encrypted key.)
- Follow standard restore procedure except instead of going to the backup medium for a specific chunk, get them from the chunk list created in step 1.
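The scanning step could look roughly like this; the magic value and the simplified header layout (magic, then a 32-byte hash, then a 4-byte big-endian size) are assumptions for illustration:

```python
def scan_for_headers(raw: bytes, magic: bytes = b"MYBK"):
    """Scan a raw disk image for our static header at arbitrary offsets.

    Assumes a simplified header layout: magic, a 32-byte chunk hash,
    then a 4-byte big-endian size of the encrypted content.
    """
    found = {}
    pos = raw.find(magic)
    while pos != -1:
        h = raw[pos + 4:pos + 36]
        size = int.from_bytes(raw[pos + 36:pos + 40], "big")
        found[h.hex()] = (pos, size)  # hash -> (offset on disk, payload size)
        pos = raw.find(magic, pos + 1)
    return found
```

The resulting mapping plays the role of the backup medium in the restore procedure: chunk lookups by hash become seeks into the raw disk.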
- Any useful operation requires downloading all snapshots to figure out the current state of the backup medium. For a large backup, this may lead to performance issues. Caching those files can speed up the process to a degree. I believe this will not be an issue for my use cases.