As more and more of our lives become digitized, I’m sure many of you will agree that it can be devastating to lose data: perhaps some important work that would take ages to re-do, or photos with sentimental value that are gone forever. Thus, it is crucial to keep backups of important data.
For my most important data, I have both offline and online backups.
I backup files to an external hard-drive. This is my main go-to location for restoring data as it is the fastest. It is also a “cold” backup: since it is not attached to any computer system, malware cannot destroy the data on it. Currently, I am using rsync to copy the files to the external hard-drive, which is encrypted with native OS functionality.
I backup files to Backblaze B2 cloud storage. This is a secondary backup in case something happens that prevents me from accessing my primary backup (e.g. it was destroyed in a fire together with my primary laptop). After investigating all the choices, I initially selected Arq Backup as my main backup software, but later switched to Duplicacy when I made Linux the primary OS on my personal laptop, as Arq Backup is only available on Windows and macOS.
With the release of Apple Silicon processors, I’m contemplating switching back to macOS for my personal computing. (I got an M1 MacBook Pro for work and I love it!) However, that means that I might need to change my backup system again:
- My offline backup uses ext4 and LUKS, which are Linux-only. If I switch to macOS, I will need to reformat the drive and re-do my backup. It is also annoying not to be able to read the backup from any OS. Thus, I’m planning to switch to an unencrypted NTFS/exFAT partition and have the backup software encrypt my files instead. As long as the backup software runs on a given OS, I can access the files from it.
- My online backup uses Duplicacy, and while Duplicacy also runs on macOS, I’m thinking it might not be the best idea to keep using it. Duplicacy is source-available, but it is not under an open-source license and is only free for personal use. It also has the issue that moving files around causes at least parts of those files to be re-uploaded. The next time I change my OS, I want to be able to move my files over and resume the backup with almost no re-uploads.
Thus, I thought, why not write my own backup software which does exactly what I want?
Wait, do you really want to write such critical software yourself?
Of course, I should not take this lightly. If my software has a bug, I might not be able to restore my backup and could lose important data. If there is existing software that does what I want at a reasonable cost, then I absolutely want to use it. However, it seems like the selection of open-source backup software has not changed much since the last time I did my research, which unfortunately means there’s no off-the-shelf software meeting my needs. Arq Backup is still the top contender, but the lack of Linux support makes it a no-go for me.
UPDATE: After publishing this article, I have discovered Kopia which does most of what I want, has good documentation and is under active development. I might switch to it instead.
So let’s list the requirements so I can come back later and check that I have covered all of them.
- Forever Incremental — Each backup should be a full backup but use de-duplication to save space and bandwidth. The solution should be able to prune old backups as needed without needing to re-upload everything again.
- Support Local and Cloud Storage — I intend to use this solution both with an external hard-drive and with Backblaze B2 and it should natively support both of these solutions. Storage should be pluggable so additional backends can be added at a later point.
- Cross-Platform — It should work on Windows, macOS and Linux. I intend to write it in Java, not only because it’s a language I’m familiar with but also because it’s cross-platform with decent enough performance. In the future, it could even be extended to run on Android, either to back up files on the phone or to retrieve backed-up files from my phone.
- Simple — Both the core code and the storage format should be simple to reduce the chance of bugs. In the worst case, I should be able to write a reader for existing data from scratch within a day.
- Easy Restores — Restores should be as simple as pointing to the backup location and providing the pass-phrase. All other metadata should be available from the backup storage itself. It should be possible to restore a single file or restore multiple files.
- Encryption — Files should be encrypted from the client-side. Anyone with access to the cloud storage or external hard-drive should not be able to read the content without the encryption key.
- Efficiency — The backup system should be efficient. Compression should be applied to reduce file size when possible. Small files should be bundled into one big file to reduce API calls to cloud storage systems. All processing should be able to handle file sizes larger than the available RAM.
- Verification — There should be a system to verify that my backup is correct; this reduces the risk of backups being unrestorable due to software bugs. There should be two modes: a metadata-only mode that verifies all the files are there without checking their content, and a full mode where a restore is attempted and the content is compared to the files on the local file system. The metadata-only mode is useful for verifying pruning, and the full mode is useful for catching issues such as data loss due to hash collisions.
- Multiple volumes in one bucket — One backup bucket should support multiple separate “volumes”. For example, I might have a volume for my home folder and another volume for my photos which are backed up with different cadences. I might backup my home folder every week but backup my photos only once a month as they rarely change and may not be on my laptop’s hard-drive.
- Moving files should never re-upload — Renaming files, moving files between folders or even between “volumes” should never trigger re-upload.
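To make the forever-incremental and “moving files should never re-upload” requirements concrete: if chunks are addressed by a hash of their content, the storage layer never sees file paths at all, so a rename or move only changes metadata. Below is a minimal Java sketch of the idea; the class and method names are my own invention, and the in-memory `Map` stands in for a real backend such as a local drive or B2.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

// Sketch of a content-addressed chunk store: chunks are keyed by the hex
// SHA-256 of their bytes, so identical data is uploaded exactly once no
// matter which file, folder, or "volume" it belongs to. Moving a file
// only rewrites the path-to-hash metadata, never the chunks themselves.
public class ChunkStore {
    // Stands in for a real storage backend (external drive, B2, ...).
    private final Map<String, byte[]> backend = new HashMap<>();
    private int uploads = 0;

    public static String sha256Hex(byte[] data) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(data);
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    // Store a chunk; skip the upload when the backend already has it.
    public String put(byte[] chunk) throws Exception {
        String key = sha256Hex(chunk);
        if (!backend.containsKey(key)) {
            backend.put(key, chunk.clone());
            uploads++;
        }
        return key;
    }

    public int uploadCount() { return uploads; }

    public static void main(String[] args) throws Exception {
        ChunkStore store = new ChunkStore();
        byte[] photo = "raw photo bytes".getBytes(StandardCharsets.UTF_8);
        String k1 = store.put(photo); // first backup uploads the chunk
        String k2 = store.put(photo); // "moved" file: same content, no upload
        System.out.println(k1.equals(k2));       // prints "true"
        System.out.println(store.uploadCount()); // prints "1"
    }
}
```

The same property gives pruning for free: a snapshot is just a list of hashes, and a chunk can be deleted once no snapshot references it.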
- Variable-Sized Chunking — Content-Defined Chunking of big files allows deduplication to still work when parts of a file change. This is not critical as most of my backups are photos, which never change. However, it can be useful for server database backups where, in an .sqldump file, some tables often change but others do not.
- Forward Error Correction — While cloud storage is usually reliable, this can be useful to improve reliability when storing backups on local hard-drives.
- Object Lock — When using cloud storage, it is useful to be able to use “object lock” features to prevent ransomware from reading the access key and then deleting backups from the cloud storage. This is not critical for my personal backups, as my external hard-drive is protected from ransomware by being disconnected; it is more useful for server backups, where it is more likely for the server to get hacked. Even if not implemented in the first version, the storage format should be designed with immutability in mind so this can easily be added in the future.
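The content-defined chunking idea can be sketched with a windowed rolling hash: a chunk boundary is declared wherever the low bits of a hash over the last few bytes are zero, so an insertion only disturbs boundaries near the edit and later chunks line up (and deduplicate) again. This is an illustrative Java sketch with made-up, untuned constants, not a production chunker:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustrative content-defined chunker. The hash depends only on the
// last WINDOW bytes, so boundary positions resynchronize shortly after
// any local edit, unlike fixed-size blocks where everything shifts.
public class Chunker {
    static final int WINDOW = 48;
    static final long BASE = 31;
    static final long POW;                  // BASE^WINDOW, to drop old bytes
    static final long MASK = (1 << 11) - 1; // ~2 KiB expected chunk size
    static final int MIN = 512, MAX = 8192;

    static {
        long p = 1;
        for (int i = 0; i < WINDOW; i++) p *= BASE;
        POW = p;
    }

    // Returns the lengths of the chunks that `data` splits into.
    public static List<Integer> chunkLengths(byte[] data) {
        List<Integer> lengths = new ArrayList<>();
        int start = 0;
        long hash = 0;
        for (int i = 0; i < data.length; i++) {
            // Slide the window: add the new byte, remove the oldest one.
            hash = hash * BASE + (data[i] & 0xff);
            if (i >= WINDOW) hash -= POW * (data[i - WINDOW] & 0xff);
            int len = i - start + 1;
            // Cut when the low bits are zero, respecting MIN/MAX bounds.
            if ((len >= MIN && (hash & MASK) == 0) || len >= MAX) {
                lengths.add(len);
                start = i + 1;
            }
        }
        if (start < data.length) lengths.add(data.length - start);
        return lengths;
    }

    public static void main(String[] args) {
        byte[] data = new byte[100_000];
        new Random(42).nextBytes(data);
        System.out.println("chunks: " + chunkLengths(data).size());
    }
}
```

Each chunk would then be fed to the content-addressed store, so for an .sqldump only the chunks covering changed tables get re-uploaded.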
- Prevent all information leakage — While I do not want anyone to be able to read the files without the encryption key (e.g. they may contain SSH keys), I do not mind small information leaks such as file sizes or compression ratios.
- Concurrent Backups — One of the main selling points of Duplicacy is its lock-free deduplication, which allows multiple machines to back up to the same store. As I’m the only one making backups, I do not need such support, and it would only add to the complexity of the backup system.
- GUI and Scheduling — I am OK with running my backups manually from the command-line on a schedule. In the future, a GUI or web application can be created from the core backup code if desired.
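As a sanity check that the Encryption and Easy Restores requirements fit together, here is a hedged Java sketch: the key is derived from the pass-phrase with PBKDF2 and each chunk is sealed with AES-256-GCM, with the salt and nonce stored next to the data so a restore needs only the backup storage plus the pass-phrase. The class name, iteration count, and blob layout are illustrative assumptions, not a committed format.

```java
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.PBEKeySpec;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import java.util.Arrays;

// Sketch of client-side chunk encryption. The salt (for key derivation)
// and per-chunk nonce live in the backup storage itself, so restoring
// requires nothing beyond the storage location and the pass-phrase.
public class ChunkCrypto {
    // Derive an AES-256 key from the pass-phrase; iteration count is a
    // placeholder, not a recommendation.
    static SecretKey deriveKey(char[] passphrase, byte[] salt) throws Exception {
        PBEKeySpec spec = new PBEKeySpec(passphrase, salt, 100_000, 256);
        byte[] keyBytes = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256")
                .generateSecret(spec).getEncoded();
        return new SecretKeySpec(keyBytes, "AES");
    }

    // Output layout: 12-byte random nonce || ciphertext || 16-byte GCM tag.
    static byte[] encrypt(SecretKey key, byte[] plain) throws Exception {
        byte[] nonce = new byte[12];
        new SecureRandom().nextBytes(nonce);
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, nonce));
        byte[] ct = c.doFinal(plain);
        byte[] out = new byte[12 + ct.length];
        System.arraycopy(nonce, 0, out, 0, 12);
        System.arraycopy(ct, 0, out, 12, ct.length);
        return out;
    }

    static byte[] decrypt(SecretKey key, byte[] blob) throws Exception {
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.DECRYPT_MODE, key,
               new GCMParameterSpec(128, Arrays.copyOfRange(blob, 0, 12)));
        return c.doFinal(Arrays.copyOfRange(blob, 12, blob.length));
    }

    public static void main(String[] args) throws Exception {
        byte[] salt = new byte[16];
        new SecureRandom().nextBytes(salt);
        SecretKey key = deriveKey("correct horse battery staple".toCharArray(), salt);
        byte[] blob = encrypt(key, "chunk contents".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(decrypt(key, blob), StandardCharsets.UTF_8));
        // prints "chunk contents"
    }
}
```

GCM also authenticates each chunk, which dovetails with the Verification requirement: a tampered or corrupted chunk fails decryption instead of silently restoring garbage.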
Now that I have the requirements, let’s come up with the design for the storage format.