The problem with depending on backups of SSDs is: what if it’s data, not metadata, being corrupted, and you are silently backing up corrupted data? When will you notice?
I’ve always worried about SSDs. This brings no comfort. Especially since most laptops these days ship with this stuff built-in.
Karim Yaghmour June 16, 2015 23:20
Recursively going through the filesystem and checking md5sums against those from previous runs is one way to do it, but no one ever goes to that kind of trouble.
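For what it's worth, the md5sum comparison can be sketched in a few lines of Python. This is only a rough illustration, not a vetted tool. The function names and the manifest layout are made up for this example, and it assumes that a file whose hash changed while its size and modification time stayed the same is the interesting case, since a normal edit would usually change at least the mtime.

```python
import hashlib
import json
import os


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, reading in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each regular file under root (by relative path) to mtime, size, md5."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks, sockets, device nodes and the like.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            st = os.stat(path)
            manifest[os.path.relpath(path, root)] = {
                "mtime_ns": st.st_mtime_ns,
                "size": st.st_size,
                "md5": file_md5(path),
            }
    return manifest


def silently_changed(old, new):
    """List files whose hash changed although mtime and size did not."""
    suspects = []
    for rel, entry in new.items():
        prev = old.get(rel)
        if prev is None:
            continue  # new file, nothing to compare against
        same_stat = (prev["mtime_ns"] == entry["mtime_ns"]
                     and prev["size"] == entry["size"])
        if same_stat and prev["md5"] != entry["md5"]:
            suspects.append(rel)
    return sorted(suspects)


def save_manifest(manifest, path):
    """Write the manifest as JSON so the next run can compare against it."""
    with open(path, "w") as f:
        json.dump(manifest, f)


def load_manifest(path):
    """Read a manifest written by save_manifest."""
    with open(path) as f:
        return json.load(f)
```

A run would load the previous manifest, build a new one, print whatever `silently_changed` returns, then save the new manifest. The manifest itself should live somewhere other than the SSD being checked.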
John Pitney June 16, 2015 23:32
Would btrfs scrub be any help here?
Edward Morbius June 17, 2015 02:11
Would a RAID 1 mirror help here?
Cristian Gafton June 17, 2015 02:34
A checksumming filesystem like ZFS is far cheaper than RAID1 for this particular issue. Also, if you don’t want to play with ZFS, an encrypted block device setup would also protect you against this mess. Just choose the weakest and fastest level of encryption: it will have no impact on I/O performance, and it has the nice property of instantly blowing up when TRIM or something else does the wrong thing to the wrong block.
Cristian Gafton June 17, 2015 02:52
Of course, the most obvious solution here is to not abuse TRIM. Running TRIM every hour from cron is borderline stupid.
Details of the race condition that triggers this bug are also available on the Internet. Ironically, it is part of the optimization designed to deal with too many TRIM commands received for no good reason.
Michael K Johnson June 17, 2015 05:42
+Edward Morbius No, RAID1 would not only be more expensive, but also ineffective. The read from the corrupted drive happily provides the zeros it thinks it was asked to provide. There are checksums in the drive; the bug is that the drive thinks that you asked to zero out the wrong data. So adding checksums in another layer is the only effective solution: in the device (like Cristian’s suggestion of an encrypted block device), the filesystem, an external checksum registry (here it doesn’t matter that md5 is weak), or the file contents. Of course, only the first two work for all files; the last works only for files that contain checksums.
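A toy model may make the point concrete. It is only a sketch: `ToyDrive`, `ChecksummedStore`, and `buggy_trim` are invented names, the checksums are kept in memory rather than on disk as a real filesystem would keep them, and CRC32 stands in for whatever checksum a real layer would use. A raw read of a wrongly zeroed block succeeds and returns zeros, while a layer that recorded a checksum at write time notices the mismatch.

```python
import zlib

BLOCK = 16  # toy block size in bytes


class ChecksumError(Exception):
    """Raised when a block no longer matches the checksum recorded at write."""


class ToyDrive:
    """A drive modeled as a list of fixed-size blocks."""

    def __init__(self, nblocks):
        self.blocks = [bytes(BLOCK)] * nblocks

    def write(self, i, data):
        self.blocks[i] = data

    def read(self, i):
        # The drive has no idea the contents are wrong; it just returns them.
        return self.blocks[i]

    def buggy_trim(self, i):
        """Model of the bug: zero block i although it holds live data."""
        self.blocks[i] = bytes(BLOCK)


class ChecksummedStore:
    """A layer above the drive that records a CRC32 per block on write."""

    def __init__(self, drive):
        self.drive = drive
        self.sums = {}

    def write(self, i, data):
        self.drive.write(i, data)
        self.sums[i] = zlib.crc32(data)

    def read(self, i):
        data = self.drive.read(i)
        if zlib.crc32(data) != self.sums[i]:
            raise ChecksumError(f"checksum mismatch on block {i}")
        return data
```

Writing a block through `ChecksummedStore`, calling `buggy_trim` on the underlying drive, and reading again raises `ChecksumError`, whereas `ToyDrive.read` alone hands back the zeros without complaint. A plain mirror reading from that drive would see the same clean-looking zeros.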
Michael K Johnson June 17, 2015 06:24
I think I’m probably lucky.
My work laptop has a Samsung SSD and has encryption enabled (by policy).
My home laptop has an older Samsung SSD (EVO 840 Pro) with older, non-upgraded firmware that might not be subject to this bug, and 40% of the drive space sits in an unused partition. I use btrfs (which has data checksums) on the SSD and I don’t enable discard. I apparently got halfway through looking up how to enable fstrim on Fedora and got distracted, as I have no cron job or fstrim service defined. (It was originally an F20 install with the fstrim systemd unit not packaged.)
Michael K Johnson June 17, 2015 08:04
+John Pitney …and now I’m looking to see whether Fedora has an integrated btrfs scrub automation feature I can just turn on, or whether this is one of those things that everyone is expected to just do for themselves. Haven’t found one yet.
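In the do-it-yourself case, a small wrapper that a cron job or systemd timer could call might look like the sketch below. It assumes btrfs-progs is installed and that the script runs with enough privilege to scrub. The function names are made up for this example, and the scheduling itself is left out.

```python
import subprocess


def scrub_command(mountpoint):
    """Build the command line for a foreground scrub of one filesystem."""
    # -B keeps btrfs in the foreground so the exit status reflects the result.
    return ["btrfs", "scrub", "start", "-B", mountpoint]


def run_scrub(mountpoint):
    """Run the scrub; return (ok, combined output) for logging or mailing."""
    result = subprocess.run(
        scrub_command(mountpoint),
        capture_output=True,
        text=True,
    )
    # Treat any nonzero exit status as something a human should look at.
    return result.returncode == 0, result.stdout + result.stderr
```

Calling `run_scrub("/")` from a weekly job and reporting the output whenever `ok` is false would cover the basic case.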
Imported from Google+ — content and formatting may not be reliable