[dm-crypt] LUKS header recovery attempt, bruteforce detection of AF-keyslot bit errors
listdump at depressiverobots.com
Fri Apr 28 17:51:25 CEST 2017
My improvised simulate-AF-bitflip-and-decode bruteforce program, based on
cryptsetup core code, has been working since yesterday. After finding
several issues in my initial code, including a failing call to the kernel
crypto API asking for an aes "xts-plain64" cipher (instead of the correct
aes "xts" version) that had me scratching my head and looking at strace
outputs for a while, I've managed to run it successfully against a
specially corrupted LUKS-encrypted disk containing an ext4 filesystem
(the first 512 bytes are guaranteed to be 0x00, as noted before) and
detected the error that was deliberately written to one of the AF-keyslot
sectors.
A run against the actual target device was unsuccessful, which could
have several reasons, including the fact that the LVM header might not
start with 512 bytes of 0x00, which is part of the decryption check
currently used. I will add a test against the magic "LABELONE" bytes in
the header for the LVM-specific case, but given the presence of ECC on
most SSDs, it's very unlikely I'll actually manage to find any
undetected simple error pattern and recover the disk data.
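The decryption check itself can stay cheap. A minimal sketch (the helper name is mine, not the actual bruteforce code) of the two plaintext tests mentioned above, assuming 512-byte sectors: accept a masterkey candidate if decrypted sector #0 is all 0x00, or if decrypted sector #1 carries the LVM label magic.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not the actual bruteforce code: decide whether
 * two decrypted 512-byte sectors look like plausible plaintext.
 * Sector #0 of the setups discussed here is expected to be all 0x00;
 * an LVM physical volume additionally carries the "LABELONE" magic at
 * the start of its label sector (the second sector). */
static int plaintext_plausible(const uint8_t sector0[512],
                               const uint8_t sector1[512])
{
    static const uint8_t zeros[512]; /* zero-initialized by the compiler */
    if (memcmp(sector0, zeros, sizeof zeros) == 0)
        return 1;
    return memcmp(sector1, "LABELONE", 8) == 0;
}
```

A false positive here is harmless, since a surviving candidate can still be verified against more sectors afterwards.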
The brute force speed yesterday was 250 AF-keyslot changes per second
and core against the target.
Today I've managed to improve this further, to about 285 iterations per
second and core, by limiting the first-stage decode of the AF-keyslot
sectors to exactly those that were influenced by the bit flip. These
values are for aes-xts-plain64, sha1 and a 512 bit masterkey.
Given that a 512-bit masterkey leads to 4000 * 512 bits of AF-keyslot
data, this gives (4000⋅512)/(285⋅60⋅4) = ~30 minutes of total bruteforce
duration on a modern quadcore CPU using four threads for trying a fixed
localized error pattern such as a single bit flip against the disk, as
long as the partial contents of a fixed sector are known, such as
ext4/LVM file header constants in sector #0 and #1.
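Spelled out as a back-of-the-envelope function (4000 AF stripes times 512 masterkey bits gives the number of single-bit-flip positions to test):

```c
/* Back-of-the-envelope check of the ~30 minute figure quoted above. */
static double bruteforce_minutes(void)
{
    const double flip_positions = 4000.0 * 512.0; /* AF stripes x masterkey bits */
    const double rate = 285.0 * 4.0;              /* iterations/s with 4 threads */
    return flip_positions / rate / 60.0;          /* ~30 minutes */
}
```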
The main processing time during bruteforce is spent in AF_merge(), which
in turn spends most of its time in the diffuse() function that is
concerned with hashing the AF-key data (with the sha1 hash, in our
case). Note that there is no adjustable number of hashing operations for
this step, as opposed to other steps of the LUKS header decryption. It
is therefore obvious that an adversary with access to hashing
accelerators (GPU, FPGA, ASIC; see bitcoin hardware developments for
specialized hashing) can easily gain several orders of magnitude in
performance over the speed obtainable on a CPU, with no parameter to
make this harder except switching to a more demanding hash algorithm.
(This has little impact on a CPU, as the speed differences between
hashes are within a factor of two.)
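For reference, the merge step has roughly this shape. This is a simplified sketch, not cryptsetup's actual code: the real diffuse() hashes the buffer in digest-sized blocks (sha1 here), which is replaced by a trivial stand-in below just to show the structure and why hashing dominates.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for cryptsetup's hash-based diffuse(); the real function
 * hashes the buffer in hash-digest-sized blocks, and those hash calls
 * are where nearly all of the bruteforce time goes. */
static void diffuse_stub(uint8_t *buf, size_t len)
{
    for (size_t i = 1; i < len; i++)
        buf[i] ^= buf[i - 1];
}

/* Simplified AF_merge() sketch: XOR each stripe into the accumulator
 * and diffuse after every stripe except the last. That is stripes-1
 * diffuse (i.e. hashing) passes per masterkey candidate. */
static void af_merge_sketch(const uint8_t *stripe_data, uint8_t *key,
                            size_t blocksize, unsigned stripes)
{
    memset(key, 0, blocksize);
    for (unsigned i = 0; i < stripes - 1; i++) {
        for (size_t j = 0; j < blocksize; j++)
            key[j] ^= stripe_data[(size_t)i * blocksize + j];
        diffuse_stub(key, blocksize);
    }
    for (size_t j = 0; j < blocksize; j++)
        key[j] ^= stripe_data[(size_t)(stripes - 1) * blocksize + j];
}
```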
IMHO this approach is only interesting for recovery purposes, not a
real attack vector in terms of significantly improved offline attacks
against a drive, as it only applies in the very rare cases where an
attacker has the complete password as well as almost all of the AF data
in its original form but lacks a few bits or bytes (depending on their
knowledge of the corrupted sections) that they now have to test against.
If the diffuse() function and therefore the hash have excellent
cryptographic properties (which might not be the case), "reconstructing"
AF-sections on disk that were properly overwritten on the physical level
is exponentially hard, in a sense that quickly makes a brute-force
attack against the actual symmetric XTS key the better choice, at least
for this "naive" way of regarding the AF-keyslot as a large symmetric key.
A quick example of this:
* overwriting the first 4 bytes of your AF-keyslot with 0x42 0x42 0x42
0x42 with the intention to make it unusable will allow a break it within
~43 days ( (255^4)/(285⋅4⋅60⋅60⋅24) ) on a single desktop machine and
the current performance, and half that time that on average
* overwriting the first 64 bytes will give this bruteforce algorithm a
similar complexity to trying masterkeys from scratch.
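Checking the first figure (using 255^4 as in the formula above; counting all 256 values per byte would give ~44 days, the same ballpark):

```c
/* Sanity check of the ~43 day estimate for a 4-byte overwrite. */
static double bruteforce_days(void)
{
    const double candidates = 255.0 * 255.0 * 255.0 * 255.0; /* per the formula above */
    const double per_day = 285.0 * 4.0 * 60.0 * 60.0 * 24.0; /* checks/day, 4 threads */
    return candidates / per_day;                             /* ~43 days */
}
```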
Note: There are likely other shortcuts, as hinted at before, so in
practice be sure to overwrite as many of the LUKS header bytes as
possible if you intend to destroy it.
Looking for further opportunities to optimize this bruteforce process,
it is clear that, given the highly-optimized hashing code of the nettle
library mentioned before, we're unlikely to push significantly more
unique hash operations through a given core. But most of the hashing
operations and their results will be identical between the different
fault positions we're testing: as long as the processed data hasn't
changed compared to the last iteration (i.e. the simulated fault is
further down in the AF-keyslot), they don't all have to be performed
again each time. Maybe I'll manage to squeeze even more performance out
of this.
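One way to exploit this (a toy sketch of the caching idea, not my actual code; names and sizes are illustrative): checkpoint the merge accumulator before each stripe during one full pass, then for a simulated fault in stripe f resume from checkpoint f instead of re-hashing stripes 0..f-1.

```c
#include <stdint.h>
#include <string.h>

#define BLK 8u        /* toy block size */
#define STRIPES 4u    /* toy stripe count */

/* Stand-in for the hash-based diffuse(). */
static void diffuse_stub(uint8_t *buf, size_t len)
{
    for (size_t i = 1; i < len; i++)
        buf[i] ^= buf[i - 1];
}

/* Full merge, saving the accumulator state before each stripe is
 * processed into chk[i]. chk must hold STRIPES blocks. */
static void merge_with_checkpoints(const uint8_t *src, uint8_t *key,
                                   uint8_t (*chk)[BLK])
{
    memset(key, 0, BLK);
    for (unsigned i = 0; i < STRIPES - 1; i++) {
        memcpy(chk[i], key, BLK);
        for (unsigned j = 0; j < BLK; j++)
            key[j] ^= src[i * BLK + j];
        diffuse_stub(key, BLK);
    }
    memcpy(chk[STRIPES - 1], key, BLK);
    for (unsigned j = 0; j < BLK; j++)
        key[j] ^= src[(STRIPES - 1) * BLK + j];
}

/* Resume the merge from the checkpoint taken before stripe `from`,
 * skipping all diffuse/hash work for the untouched stripes 0..from-1. */
static void merge_resume(const uint8_t *src, uint8_t *key,
                         uint8_t (*chk)[BLK], unsigned from)
{
    memcpy(key, chk[from], BLK);
    for (unsigned i = from; i < STRIPES - 1; i++) {
        for (unsigned j = 0; j < BLK; j++)
            key[j] ^= src[i * BLK + j];
        diffuse_stub(key, BLK);
    }
    for (unsigned j = 0; j < BLK; j++)
        key[j] ^= src[(STRIPES - 1) * BLK + j];
}
```

Since the checkpoints only depend on stripes before the fault position, the resumed result is bit-identical to a full re-merge, while amortizing the hashing of everything before the fault.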
On 23.04.2017 22:03, protagonist wrote:
> On 22.04.2017 20:02, protagonist wrote:
>> I've manually compiled https://github.com/mbroz/pbkdf2_cryptsetup_test
>> as well as cryptsetup itself to find possible improvements with
>> different crypto backends, gcc optimizations such as -Ofast and
>> -march=native, but I've been unsuccessful to improve on the 1.25M/s
>> number so far. Openssl beats gcrypt, but the default kernel backend
>> still seems faster.
> Update: the nettle library is the fastest crypto backend by a
> significant margin according to my tests. Also, contrary to my previous
> remark, the kernel crypto appears to be slower than openssl, at least
> under Debian Jessie with 3.16.
> The sha1-performance can be grouped like this:
> nettle > openssl > kernel 3.16 > gcrypt > nss
> cryptsetup benchmark with libnettle runs at 1.68M/s:
> PBKDF2-sha1 1680410 iterations per second for 256-bit key
> This value includes benefits by switching the CFLAGS from "-O2" to
> "-Ofast -march=native", which are mostly insignificant (<1% improvement)
> and probably rarely worth the effort.
> Switching from libnettle 4.7 to freshest libnettle 6.3 brings only minor
> improvements:
> PBKDF2-sha1 1702233 iterations per second for 256-bit key
> Given that they provide hand-optimized x86 assembler code for the sha
> computation, this is not entirely surprising.
>> One could go even further and replace the time-consuming steps 5) and 6)
>> with a routine that decrypts an encrypted part of the disk with the
>> masterKeyCandidate and compares it to a known plaintext (or a more
>> elaborate heuristic, such as several entropy checks looking for
>> "correctly" decryptable areas on disk leading to low-entropy output),
>> which might be a lot faster given AES-NI support and AES-xts throughput
>> speeds, but since we don't know much about actual disk content, this
>> seems to be too much of a gamble to be worth the effort at this point.
>> It's easily conceivable that one tests against several areas containing a
>> well-compressed movie or a similarly high-entropy area, leading to a
>> false-negative and missing the only correct master key.
> Correction: I meant omitting steps 4) and 5).
> While thinking about the disk layout, I've realized that the encrypted
> storage almost definitely contains a LVM scheme commonly used by
> full-disk encryption installer setups to store separate logical volumes
> for both root and swap. In the related case of a manual setup of an
> external storage device, this would often be plain ext4.
> Now, there is more "known plaintext" available than I initially
> suspected: both LVM and ext4 don't just contain special headers with
> magic numbers, labels and checksums in well defined positions, but they
> actually include at least one unused sector right in the beginning.
> According to my tests, those bytes are set (or kept) to 0x00.
> For LVM this is implied with the following specification:
> "The physical volume label is stored in the second sector of the
> physical volume." [^1]
> Given the SSD sector size of 512 bytes and the fact that they reserve
> four sectors for the label, we have at least three sectors à 512 bytes
> of known zeros at sector #0, #2 and #3, which should be more than plenty
> for some fast & simple decryption check that doesn't assume much about
> the specific version and configuration of the logical volumes.
> For ext4, the "Group 0 Padding" of 1024 Bytes would serve a similar
> purpose. [^2]
> Now that this shortcut seemed attractive, I've started cannibalizing the
> LUKS_open_key() function in lib/luks1/keymanage.c to host my bruteforce
> approach and made some progress already.
> Small-scale benchmarking with 10000 rounds of AF-decryption, AF_merge
> and (unfinished) block target decryption calls take about 50s combined,
> including "normal" cryptsetup program initialization, initial disk reads
> and most other necessary computations.
> This sets overall performance at something in the order of 150 to 200
> individual checks per second and core for the final routine, which is a
> good improvement over the naive bruteforce version once it's ready.
> [^2] https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout#Layout