Notes on exFAT and Reliability
Why am I writing this?
When you want to pick a file system for external storage that is compatible with multiple operating systems, the choices are:
- FAT32 — An oldie supported by all popular OSes, but no longer a good choice due to its lack of support for files larger than 4 GB.
- NTFS — The native file system for Windows, with great performance there. Linux has decent support via ntfs-3g, but FUSE makes the performance less than ideal. (This is changing, though, as Paragon has contributed a read-write kernel driver for NTFS.) macOS, however, only has read-only support, and a commercial third-party driver from Tuxera or Paragon is required for read-write support.
- exFAT — The newest file system of the three. It had a rough start due to Microsoft charging patent/licensing fees, and there was little incentive or demand for Linux support since NTFS was a viable alternative. This changed in 2019 when Microsoft opened up the spec, and Linux now has native read-write support in the kernel. macOS also has native read-write support, most likely because MacBooks have an SD-card reader and exFAT is the default file system for SDXC cards, so support must be present to claim SDXC compatibility.
For external hard drives, NTFS used to be a no-brainer: FAT32 has the 4 GB file-size limit, while exFAT support on Linux has historically not been very good. But with that problem solved, is exFAT a viable option now?
When I purchased my last external hard drive, it came formatted with exFAT. When I searched for recommendations on NTFS vs exFAT, many people recommended NTFS and complained about drives getting corrupted with exFAT due to its lack of journaling. Some even said that exFAT is less reliable than FAT32 because it has only one FAT while FAT32 has two.
Thus, I dived into the spec to see how much of a problem this really is.
The Structure of exFAT
Microsoft’s spec, while complete, is very verbose, making it quite hard to understand, so I will give a brief summary here.
The exFAT file system is divided into 3 main parts:
- The Boot Region is basically the header of the file system, containing basic information such as the length of the file system and the cluster size. Microsoft thinks this is so important that two copies are kept. This may be to handle a situation where an exFAT-unaware legacy device tries to mount an exFAT volume and messes up the main boot region, in which case exFAT-aware devices can fall back to the backup region.
- The File Allocation Table (FAT) is the namesake of the FAT file systems and is used to keep track of cluster allocation via a linked-list data structure. Wikipedia has a great example of how the FAT works, so I am not going to repeat it here. The size of the FAT depends on the number of clusters in the file system, and the number after "FAT" in FAT12, FAT16 and FAT32 denotes the size in bits of each FAT entry. exFAT still uses 32 bits per cluster, like FAT32.
- The Data Region covers the rest of the partition. It is like a “heap”, divided into units called clusters, and each cluster can be allocated for any purpose defined by the file system. If data is too large to fit in one cluster, the FAT is used to link multiple clusters together. (exFAT has an optimization, though: if a file is marked as non-fragmented, the FAT is not used at all.)
Inside the data region, we can find the following important data types:
- The Allocation Bitmap is an optimization introduced by exFAT. Previously, free clusters were tracked in the FAT by a value of 0. Tracking free space outside the FAT makes the data structure much smaller, as only one bit is required per cluster, and thus makes it faster to find free space of the desired size.
- The Directory Structure contains the content of each directory: a list of file names, sizes, types, properties, the fragmentation flag and a pointer to another cluster on the file system. In the case of a file, the pointed-to cluster contains the file’s raw data; in the case of a sub-directory, it contains yet another directory structure. Each entry has a fixed size of 32 bytes, and multiple entries can be used per file or directory, for example to store a long file name. As a side note, on Linux and other POSIX systems, each entry in this directory structure is called a dentry, or directory entry.
- The File simply contains the raw data of a file with no metadata.
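As a sketch, here is how a free-cluster search over such a bitmap might look. The bit layout here is simplified (in real exFAT, the first bit of the bitmap corresponds to cluster 2, the first data cluster), and the function name is illustrative:

```python
# Sketch: finding a free cluster in an allocation bitmap, one bit per
# cluster (1 = allocated, 0 = free). Simplified: cluster numbering
# starts at 0 here rather than 2 as in real exFAT.

def find_free_cluster(bitmap: bytes, cluster_count: int):
    """Return the index of the first free cluster, or None if the volume is full."""
    for i in range(cluster_count):
        byte, bit = divmod(i, 8)
        if not (bitmap[byte] >> bit) & 1:  # low-order bits come first
            return i
    return None

# 16 clusters; the first 10 are allocated.
bitmap = bytes([0b11111111, 0b00000011])
print(find_free_cluster(bitmap, 16))  # 10
```

With the FAT, the same search would have to scan 32-bit entries looking for 0; the bitmap does the same job with one bit per cluster.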
Each data type is stored in one or more clusters. If the data is too big to fit in one cluster, the behavior depends on the fragmentation (NoFatChain) flag. If the data is fragmented, the FAT is consulted for the next cluster; if not, the FAT is skipped and the adjacent clusters are read for the rest of the content.
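The two cases can be sketched as follows. Names here are illustrative, not a real driver API; a real implementation reads the FAT from disk and validates cluster bounds:

```python
# Sketch: enumerating a file's clusters depending on the NoFatChain flag.

END_OF_CHAIN = 0xFFFFFFFF  # exFAT's end-of-cluster-chain marker

def file_clusters(first_cluster, cluster_count, no_fat_chain, fat):
    if no_fat_chain:
        # Non-fragmented data: clusters are adjacent, the FAT is never read.
        return list(range(first_cluster, first_cluster + cluster_count))
    # Fragmented data: follow the linked list stored in the FAT.
    clusters, cluster = [], first_cluster
    while cluster != END_OF_CHAIN:
        clusters.append(cluster)
        cluster = fat[cluster]
    return clusters

print(file_clusters(7, 3, True, None))                            # [7, 8, 9]
print(file_clusters(7, 3, False, {7: 9, 9: 4, 4: END_OF_CHAIN}))  # [7, 9, 4]
```

Note that in the non-fragmented case the FAT argument is never touched, which is why an unfragmented exFAT volume barely depends on the FAT at all.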
You can see that as much data as possible is kept in the data region, as this allows the flexibility to put anything anywhere on disk and avoids preallocating space that might turn out too small or too large. (For example, FAT16 had the root directory structure outside the data region and supported only 512 entries in the root directory.)
An Example exFAT File System
Since the above section talked mostly in abstract terms, it may be hard to follow. Here, I will walk through an example: an exFAT-formatted 2 TB external hard drive.
The cluster size recommended by Microsoft for this disk size is 512 KB. Windows basically picks a cluster size that keeps the cluster count under 4 million for performance reasons.
This disk has 3,815,420 clusters at 512 KB per cluster for a total size of around 1.8 TB. To track 3,815,420 clusters with 32 bits (4 bytes) per cluster, the FAT has a size of around 14.5 MB. The allocation bitmap, however, requires only one bit per cluster and has a size of around 466 KB — much smaller than the FAT. The main and backup boot regions each have a fixed size of 12 sectors (6 KB), which is negligible in the grand scheme of things.
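These numbers can be reproduced with a few lines of arithmetic, using the cluster count and cluster size of the disk above:

```python
# Reproducing the sizes quoted above for a ~2 TB exFAT volume.

clusters = 3_815_420
cluster_size = 512 * 1024              # 512 KB per cluster

data_region = clusters * cluster_size  # total addressable data
fat_size = clusters * 4                # 32 bits (4 bytes) per FAT entry
bitmap_size = clusters / 8             # 1 bit per cluster

print(f"data region: {data_region / 2**40:.2f} TiB")  # 1.82 TiB
print(f"FAT:         {fat_size / 2**20:.2f} MiB")     # 14.55 MiB
print(f"bitmap:      {bitmap_size / 2**10:.2f} KiB")  # 465.75 KiB
```

The 30× size difference between the FAT and the bitmap is exactly the 32-bits-versus-1-bit-per-cluster ratio.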
Now that we understand the file system, it’s time to look at how different failures can occur and the blast radius of each (e.g. does it corrupt one file or render the file system unmountable?).
During normal operation, the following data structures are frequently updated.
- The VolumeDirty flag inside the boot region. The VolumeDirty flag is set before an operation is performed on the file system and reset to zero afterwards. If an operation is interrupted, the flag will still be set at mount time, and CHKDSK or fsck needs to be run to check the integrity of the file system. The worst that can go wrong is that, due to a bad implementation or write reordering, the volume is dirty but not marked as dirty. The potential effects are discussed under the other fields below.
- The PercentInUse field inside the boot region specifies how much of the disk has been used. Even if this field is wrong or corrupted, the worst that can happen is that the amount of free space displayed to the user is wrong. The value can always be recomputed from the allocation bitmap.
- The File Allocation Table (FAT) needs to be updated when a fragmented file is written to disk. If it is not updated, the old value might point to an incorrect cluster. The most benign outcome is that reading the file returns incorrect data. Things get more severe if we try to modify the file in question, as we may write to a region owned by another file and corrupt that file too! Note that interrupted writes cannot corrupt the entire FAT, as only the parts related to the file being written are changed.
- The Allocation Bitmap needs to be updated to show which clusters are in use. When deleting a file, if the allocation bitmap is not updated, the freed space will not become usable again. When creating or extending a file, if the bitmap is not updated, the cluster may later be allocated to another file, corrupting the original file. Note that the allocation bitmap can be completely recalculated by traversing all directories, so CHKDSK or fsck can put it back into the correct state and recover any lost space.
- The Directory Structure needs to be updated when creating, deleting or renaming a file. During creation, if the directory update fails, the file will not be visible. During deletion, if the directory update fails, the file will not actually be deleted. During renaming, depending on the operation order, either the file will disappear or two entries will point to the same location, which leads to undefined behavior when one of the files is modified. Each directory entry set has a checksum, so incomplete updates or torn writes can be detected.
- The File raw data is updated whenever a file is modified. There is nothing interesting here as interrupted writes can cause issues on any file system (except for those utilizing copy-on-write).
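The checksum on directory entry sets is, per Microsoft’s spec, a 16-bit rotate-right-and-add over every byte of the set, skipping the two bytes of the first entry where the checksum itself is stored. A minimal sketch:

```python
# Sketch of the exFAT entry-set checksum: rotate the 16-bit accumulator
# right by one, then add the next byte, skipping bytes 2-3 (the
# SetChecksum field of the first entry).

def entry_set_checksum(entries: bytes) -> int:
    checksum = 0
    for index, byte in enumerate(entries):
        if index in (2, 3):  # the checksum field excludes itself
            continue
        checksum = (((checksum >> 1) | (checksum << 15)) + byte) & 0xFFFF
    return checksum

# A directory entry set is (SecondaryCount + 1) * 32 bytes long.
entry_set = bytes(96)  # three zeroed 32-byte entries (toy data)
print(hex(entry_set_checksum(entry_set)))  # 0x0 for all-zero input
```

Any torn write inside the set changes the computed value, which is how fsck can detect a partially updated directory entry.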
The above may sound quite dangerous, but if the implementation follows the recommended write ordering of allocation bitmap → FAT → directory entry when creating a file and the reverse when deleting a file, then even interrupted writes do not pose an issue.
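To sketch why this ordering is safe, here is a toy “volume” (a purely hypothetical API, not real driver code) that records the order of writes when creating a file:

```python
# Sketch: the recommended write ordering when creating a file. The
# invariant: if interrupted at any step, the worst case is leaked
# clusters (reclaimable by fsck), never two files sharing a cluster,
# because the directory entry is published last.

class StubVolume:
    def __init__(self):
        self.log = []
    def allocate_in_bitmap(self, size):
        self.log.append("bitmap")
        return [2]                        # pretend cluster 2 was free
    def write_fat_chain(self, clusters):
        self.log.append("fat")
    def write_data(self, clusters, data):
        self.log.append("data")
    def write_directory_entry(self, name, first_cluster):
        self.log.append("dentry")         # the file becomes visible here

def create_file(vol, name, data):
    clusters = vol.allocate_in_bitmap(len(data))    # 1. reserve clusters
    vol.write_fat_chain(clusters)                   # 2. link them in the FAT
    vol.write_data(clusters, data)                  # 3. write the raw data
    vol.write_directory_entry(name, clusters[0])    # 4. publish the file

vol = StubVolume()
create_file(vol, "a.txt", b"hello")
print(vol.log)  # ['bitmap', 'fat', 'data', 'dentry']
```

Deletion runs the same steps in reverse, so an interruption there leaks clusters at worst as well.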
One or two FATs
Another interesting question is having one or two FATs. FAT32 has two mirrored FATs (they always have the same content), while exFAT has only one (except in TexFAT, which is only available on Windows CE). The FAT is quite important: if it is lost, fragmented files can no longer be fully read, so getting rid of the backup copy seems like a bad idea. But after more consideration, it may be a valid choice because:
- Having 2 FATs does not protect against implementation errors, which will write bad data to both FATs anyway.
- The FAT is not checksummed, so it is not possible to know when a FAT is corrupted and switch to the backup. The only scenario in which the backup FAT can be used automatically is when the underlying media returns a read error for the first FAT.
- On exFAT, as long as a file is not fragmented, the FAT is not used at all, so it now plays a much smaller role.
- Having to update 2 FATs reduces write performance, as all updates have to be done twice.
Conclusion
- The exFAT file system is not as fragile as anecdotes on the Internet may lead you to believe. Most failures are limited to the file being written, and interrupted writes do not corrupt the entire file system. Reports of corruption are more likely due to bad implementations of the file system than to the file system design itself.
- It is important to run CHKDSK or fsck after an interrupted write as invalid entries in the FAT or allocation bitmap may lead to future random data corruption.
- For mostly cold data requiring cross-platform compatibility, exFAT is a completely valid choice for a file system.
Nevertheless, you may not want to choose exFAT because:
- Due to its large allocation unit (cluster size), exFAT is quite inefficient if you have a lot of small files (I’m looking at you, node_modules). At 2 TB, the allocation unit for exFAT is 512 KB (any file size is rounded up to a multiple of 512 KB), while for NTFS it is 4 KB, and NTFS even has special handling for very small files, storing them directly in the MFT. For this use case, it is better to use the native file system of each OS.
- The tooling, even by Microsoft, is not as great as other file systems. For example, Windows does not provide a defrag tool or a resize tool for exFAT.
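To put a rough number on the small-file overhead mentioned above, here is a quick calculation assuming 100,000 files of 1 KB each (an arbitrary illustrative workload):

```python
# Rough on-disk cost of small files at different cluster sizes.

files, file_size = 100_000, 1024            # 100,000 files of 1 KB each
exfat_cluster = 512 * 1024                  # exFAT at 2 TB: 512 KB clusters
ntfs_cluster = 4 * 1024                     # NTFS default: 4 KB clusters

def on_disk(cluster):
    # Each file occupies a whole number of clusters.
    per_file = -(-file_size // cluster) * cluster
    return files * per_file

print(f"exFAT: {on_disk(exfat_cluster) / 2**30:.1f} GiB")  # 48.8 GiB
print(f"NTFS:  {on_disk(ntfs_cluster) / 2**30:.2f} GiB")   # 0.38 GiB
```

Roughly 100 MB of actual data ends up consuming almost 49 GB on exFAT, versus under half a gigabyte on NTFS.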