[dm-crypt] ext3 + dm_crypt performance impact (CentOS 5.4 AMD64)

Robert.Heinzmann at deutschepost.de Robert.Heinzmann at deutschepost.de
Mon Dec 27 18:03:28 CET 2010

Hello list, 
I'm currently testing dm_crypt performance on CentOS 5.4 (AMD64, Kernel ...). One part of the evaluation is a simple end-to-end disk benchmark, for which I use the Linux kernel source tree. The routine is: 
   0) Create filesystem on the backend device and mount
   -- record starttime
   1) Extract Kernel (2.6.37)
   2) Copy the extracted kernel once
   3) Copy the extracted kernel once
   4) Copy the extracted kernel once
   5) Remove the copy 
   6) Remove the copy 
   7) Remove the copy 
   8) Remove the original
   9) call sync
   -- record endtime 
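The routine above can be sketched roughly as follows. This is only an illustration of the steps, not the script actually used for the measurements; the function name `run_benchmark`, the directory names and the `copies` parameter are hypothetical, and the filesystem from step (0) is assumed to be already created and mounted at `mountpoint`.

```python
import os
import shutil
import tarfile
import time


def run_benchmark(tarball, mountpoint, copies=3):
    """Run the extract/copy/remove/sync routine and return elapsed seconds."""
    original = os.path.join(mountpoint, "original")
    copy_dirs = [os.path.join(mountpoint, "copy%d" % i) for i in range(copies)]

    start = time.time()                      # -- record starttime
    with tarfile.open(tarball) as archive:   # 1) extract the kernel tarball
        archive.extractall(original)
    for target in copy_dirs:                 # 2-4) copy the extracted tree
        shutil.copytree(original, target)
    for target in copy_dirs:                 # 5-7) remove the copies
        shutil.rmtree(target)
    shutil.rmtree(original)                  # 8) remove the original
    os.sync()                                # 9) call sync
    return time.time() - start               # -- record endtime
```

Calling `run_benchmark("linux-2.6.37.tar.bz2", "/mnt/test")` once per backend would give comparable runtimes; both paths here are placeholders.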
I do this for a plain ext3 filesystem on /dev/sdf and for ext3 on an encrypted device (I use dm_crypt and truecrypt as backends). 
I found that if I run this test with an ext3 filesystem created in step (0), performance drops by ~40% for dm_crypt (runtime is 40% longer). The CPU load does not increase; only I/O wait goes up, not the overall CPU usage. To me this looks like a latency issue, because a lot of CPU is still free (almost all of it, in fact). If I capture /proc/diskstats before and after the test, I see that the I/O time is much longer with dm_crypt. 
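A minimal sketch of the before/after comparison, assuming the classic /proc/diskstats layout of major, minor, device name and eleven counters. The counter names are borrowed from the output shown in this mail; the helper names (`parse_diskstats`, `diskstats_delta`, `read_diskstats`) are made up for illustration.

```python
FIELDS = ("reads", "rmerge", "rsect", "rtimems",
          "writes", "wmerge", "wsect", "wtimems",
          "current", "iotime", "iotimeweighted")


def parse_diskstats(text):
    """Map device name -> dict of the eleven classic /proc/diskstats counters."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 14:
            continue  # blank lines, or partition lines with fewer counters
        # parts[0] and parts[1] are major/minor numbers, parts[2] is the name
        stats[parts[2]] = dict(zip(FIELDS, map(int, parts[3:14])))
    return stats


def diskstats_delta(before, after, device):
    """Per-counter difference for one device between two snapshots."""
    delta = {name: after[device][name] - before[device][name] for name in FIELDS}
    # "current" is the number of I/Os in flight, not a running total,
    # so report the final value instead of a difference
    delta["current"] = after[device]["current"]
    return delta


def read_diskstats(path="/proc/diskstats"):
    """Take one snapshot of the live counters."""
    with open(path) as handle:
        return parse_diskstats(handle.read())
```

Taking `before = read_diskstats()` ahead of the run and `after = read_diskstats()` afterwards, `diskstats_delta(before, after, "sdf")` yields per-run figures of the kind listed below.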
a) /proc/diskstats for one run with plain disk backend: 
  DISK sdf: reads=101 rmerge=0 rsect=808 rtimems=6366 writes=44052 wmerge=372381 wsect=3321608 wtimems=4993178 current=158 iotime=36041 iotimeweighted=5006161
b) /proc/diskstats for one run with dm_crypt disk backend (dm-1 is the dm_crypt device): 
  DISK sdf: reads=110 rmerge=0 rsect=880 rtimems=5084 writes=55169 wmerge=421079 wsect=3809984 wtimems=5275315 current=0 iotime=42127 iotimeweighted=5280393
  DISK dm-1: reads=110 rmerge=0 rsect=880 rtimems=13457 writes=476248 wmerge=0 wsect=3809984 wtimems=1467043364 current=0 iotime=42602 iotimeweighted=1467056820
I also see that the disk writes are split into 4k units: 
for a) the average write size is 3281192 / 48726 = 67.33 sectors = ~34 KB per write
for b) on dm-1 with ext3 the average write size is 3809984 / 476248 = 8 sectors = 4k per write. 
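The same arithmetic can be applied directly to the "DISK ..." lines quoted above. The following is a small sketch; the function names are hypothetical and the parser only assumes the `key=value` layout shown in this mail.

```python
SECTOR_BYTES = 512  # /proc/diskstats counts sectors in 512-byte units


def parse_disk_line(line):
    """Parse one 'DISK name: key=value ...' line into (name, counters)."""
    head, _, rest = line.strip().partition(":")
    name = head.split()[1]
    counters = {}
    for pair in rest.split():
        key, _, value = pair.partition("=")
        counters[key] = int(value)
    return name, counters


def avg_write_sectors(counters):
    """Average number of sectors per completed write request."""
    return counters["wsect"] / counters["writes"]
```

For the dm-1 line this gives exactly 8 sectors (4096 bytes) per write. For the sdf line quoted under a) it gives roughly 75 sectors per write rather than 67.33, so the figures used in a) presumably come from a different run than the quoted line.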
Because this is cached I/O (buffer cache), dm requests are processed in units of 4k (this seems to be implementation specific). These small requests are then merged again in the I/O scheduler for the /dev/sdf backend device. I would not expect this to be a big issue. I tested the system and it can do 80 MB/s encryption / decryption. The backend does ~50 MB/s writes. So I was expecting a performance impact of ~10 percent. However, it seems to be much more (40%). 
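One rough way to bracket the expected impact from the two throughput figures is a simple two-stage model. This model is an assumption added for illustration, not something measured here: either encryption and disk writes overlap completely, or every chunk is first encrypted and then written.

```python
def runtime_increase(backend_mb_s, crypto_mb_s, overlapped):
    """Relative runtime increase caused by adding an encryption stage.

    overlapped=True:  both stages run fully in parallel, the slower one
                      sets the pace.
    overlapped=False: each chunk is encrypted, then written, so the
                      per-MB times add up.
    """
    disk = 1.0 / backend_mb_s     # seconds per MB on the backend
    crypto = 1.0 / crypto_mb_s    # seconds per MB for encryption
    combined = max(disk, crypto) if overlapped else disk + crypto
    return combined / disk - 1.0
```

With 50 MB/s for the backend and 80 MB/s for encryption, the fully overlapped case gives no slowdown and the fully serialized case gives 62.5% longer runtime. The observed 40% lies between these two bounds, which would fit the picture of partly serialized processing.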
When I run the same test with ext2, I see an average request size of 22 sectors (~11k) to the dm device and merges on the backend (46 vs. 42 seconds, ~10% performance impact).
With XFS it is the same (95 seconds instead of 87, ~10% performance impact, average I/O size 34 sectors). Only ext3 is fixed at 4k requests.
I also measured the latency impact of dm_crypt using a dm_zero backend. It seems to add ~0.1 ms of latency for small I/Os (4k/8k) and up to 10 ms for 1M I/Os. It also looks like dm_crypt only scales to one CPU core per device. 
So there are now several questions. 
1) Can I force ext3 to use larger I/Os as well? 
2) Is my assumption correct that the cause of this issue could be accumulated, serialized latency? 
  - ext3 splits I/Os into units of 4k and submits them to the device mapper 
  - the first device mapper target in the stack receives the requests (in this case dm_crypt)
  - dm_crypt encrypts each 4k block individually and serially (because of the single workqueue) and submits them to the lower level device (in this case sdf); the added latency accumulates (10 x 4k blocks = +1 ms)
  - the sdf queue merges the requests, but not as efficiently anymore (55169 vs. 44052 writes) 
  - sdf sends the I/Os to the backend device 
In particular, step 3 (the serial encryption in dm_crypt) adds "significant" latency to the procedure and slows down the whole process considerably. 
Can this be the reason? 
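The accumulation hypothesis can be put into numbers with a small sketch. It assumes the worst case in which every request pays its dm_crypt latency strictly one after the other with no overlap; the function names are hypothetical.

```python
def serialized_latency_seconds(requests, per_request_ms):
    """Total added latency if every request is handled strictly in series."""
    return requests * per_request_ms / 1000.0


def latency_ms_per_mb(request_kb, per_request_ms):
    """Added latency per MB of data for a given request size."""
    requests_per_mb = 1024.0 / request_kb
    return requests_per_mb * per_request_ms
```

Ten 4k requests at 0.1 ms each give the +1 ms mentioned above. The 476248 writes seen on dm-1 would add up to about 47.6 seconds if none of that latency overlapped with anything else, so this is only an upper bound. Per MB of data, 4k requests at 0.1 ms cost 25.6 ms, compared with 10 ms for a single 1M request.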
3) For direct I/O the request size is flexible, so database workloads should not see a major performance impact (~10%) until the per-device CPU limit is hit. Is this a correct assumption? 
4) Cached I/O can be slowed down considerably, even if the average I/O rate is below the CPU limit, due to latency multiplication. Is this a correct assumption? 
It would be great if you would help me understand this issue :)

Mit freundlichen Grüßen / Kind Regards

Robert Heinzmann
