[dm-crypt] disappearing luks header and other mysteries
arno at wagner.name
Mon Sep 15 16:51:20 CEST 2014
On Mon, Sep 15, 2014 at 08:12:40 CEST, Ross Boylan wrote:
> There seem to be problems below the dm-crypt layer. Both LVM volume groups report
> Incorrect metadata area header checksum
> Since the architecture is, starting at the top and working down,
> file system
> crypto (sometimes)
> LVM volume <- LUKS at this level
> LVM volume group <- corruption at this level
> partition of the RAID virtual disk
> physical disk partition
> The RAID at least still seems intact.
> I'm at a loss about what could have corrupted both volume groups, on
> separate physical disks. Things did seem to start going bad after the
> snapshot filled (6:30), though it was an hour later I got the first
> file system error.
As it worked initially, and then stopped working after data was
written, I suspect containers are overlapping, possibly because
there is some misalignment between containers on different layers
and data that goes into one container corrupts another one.
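One way to check the misalignment theory is to compare the data offsets of
each layer against a common boundary. A minimal sketch, using made-up
values -- on the real system you would take the payload offset (in 512-byte
sectors) from `cryptsetup luksDump` and the PE start from `pvs -o pe_start`:

```shell
# Hypothetical values standing in for real luksDump / pvs output:
luks_payload_sectors=4096      # "Payload offset" from cryptsetup luksDump
pe_start_kib=192               # "pe_start" from pvs -o pe_start --units k

mib=$((1024 * 1024))
luks_offset_bytes=$((luks_payload_sectors * 512))
pe_start_bytes=$((pe_start_kib * 1024))

# Report whether each offset falls on a 1 MiB boundary:
for off in $luks_offset_bytes $pe_start_bytes; do
    if [ $((off % mib)) -eq 0 ]; then
        echo "offset $off bytes is 1 MiB aligned"
    else
        echo "offset $off bytes is NOT 1 MiB aligned"
    fi
done
```

With these example numbers the LUKS payload (2 MiB) is aligned but the
LVM extent start (192 KiB) is not; on your system the actual values will
differ, the point is only to compare them layer by layer.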
If I see this correctly, you have filesystem -> crypto -> LVM volume ->
LVM volume group -> RAID -> disk partitions all stacked on top of each other.
That is decidedly too many. KISS is not even in the building
anymore with that. I know, likely the distro gave you something
like this, but really it is a symptom of a failed engineering
mind-set that keeps stacking up complexity until things fail.
> Maybe the crash a couple of days ago corrupted some key
> operating system files.
That should not happen, even with an overly complex set-up with
RAID, LVM and LUKS. All have their meta-data on disk and it is
usually either not written at all or written atomically. And if you use user-space
RAID assembly (metadata 1.0, 1.1 or 1.2), that should either work
or fail, but not result in corruption. (Unless somebody did
something really, really stupid, like storing RAID geometry in
a file and then enforcing it ...)
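For reference, the md superblock locations are fixed per metadata version:
1.1 at the very start of the device, 1.2 at 4 KiB from the start, and 0.90
(like 1.0) near the end, in the last 64 KiB-aligned block. A small sketch of
that arithmetic, assuming a hypothetical 1 GiB device and, if I recall the
kernel's 0.90 layout correctly, its usual offset formula:

```shell
# Hypothetical device: 1 GiB = 2097152 sectors of 512 bytes.
dev_sectors=2097152

# metadata 1.1: byte 0; metadata 1.2: byte 4096 from the start.
# metadata 0.90: last 64 KiB-aligned block, i.e. round the device size
# down to a multiple of 128 sectors and step back one 128-sector block:
sb090_offset=$(( (dev_sectors & ~127) - 128 ))   # in 512-byte sectors

echo "0.90 superblock at sector $sb090_offset"
echo "1.2 superblock at byte 4096"
```

That fixed placement is why the RAID metadata itself normally survives
even when the layers above it get scrambled.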
> At any rate, if the volume groups are bad I suspect I'm toast and need
> to go to remote backups. There is some stuff about recovering the VG
> header on the net, but even if that succeeds it would be hard to trust
> the rest of the file systems.
Given the complexity, I don't think you can reasonably make
sure you repaired things even if you find errors.
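For completeness, the usual VG-header recovery attempt looks like the
sketch below (commands are echoed, not executed, and the archive file
name is hypothetical) -- but the point stands that even a "successful"
restore does not make the filesystems above it trustworthy:

```shell
# Echo the commands instead of running them, so nothing is touched:
run() { echo "would run: $*"; }

# LVM keeps plain-text backups of the VG metadata under /etc/lvm:
run vgcfgrestore --list vg0                           # list archived versions
run vgcfgrestore -f /etc/lvm/archive/vg0_00042.vg vg0 # hypothetical archive file
run vgck vg0                                          # sanity-check the result
```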
I would recommend a complete rebuild, and, if possible, without
LVM. The only real benefit of LVM here is things like dynamic
resizing, but as your experience shows that is more of a
theoretical thing anyways. I suspect it fails more often than
not and is only useful when you need to do it online, but also
have the time/budget to go through several test-runs on an
identical test machine before to make sure it works, and also
the time to make really, really sure it has worked by analyzing
things in detail after each test-run...
Really, partition->RAID->LUKS and partition->RAID should
be quite enough. I have used that for 12 years with
excellent reliability including in a cluster set-up. I would
also recommend going with the old superblock 0.90 format for
RAID and kernel-level autodetection (partition type 0xfd). That
increases reliability further as there is no dependency on
some user-space software or configuration for RAID assembly.
That also has the benefit that any kernel with RAID
auto-assembly can assemble the RAIDs, for example one from
a rescue system.
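Setting that stack up is short. A sketch, with hypothetical device names
and the commands echoed rather than executed so nothing gets overwritten:

```shell
# Echo the commands instead of running them:
run() { echo "would run: $*"; }

# 1. Mark both partitions as type 0xfd (Linux raid autodetect),
#    e.g. with fdisk, so the kernel can auto-assemble the array.
# 2. Create the mirror with the old 0.90 superblock format:
run mdadm --create /dev/md0 --level=1 --raid-devices=2 \
    --metadata=0.90 /dev/sda1 /dev/sdb1
# 3. Put LUKS directly on the array -- no LVM in between:
run cryptsetup luksFormat /dev/md0
run cryptsetup luksOpen /dev/md0 cr_md0
# 4. A filesystem straight on the mapping:
run mkfs.ext4 /dev/mapper/cr_md0
```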
Arno Wagner, Dr. sc. techn., Dipl. Inform., Email: arno at wagner.name
GnuPG: ID: CB5D9718 FP: 12D6 C03B 1B30 33BB 13CF B774 E35C 5FA1 CB5D 9718
A good decision is based on knowledge and not on numbers. -- Plato
If it's in the news, don't worry about it. The very definition of
"news" is "something that hardly ever happens." -- Bruce Schneier