[dm-crypt] [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)

Mikulas Patocka mpatocka at redhat.com
Fri Apr 12 02:06:10 CEST 2013

On Thu, 11 Apr 2013, Tejun Heo wrote:

> On Thu, Apr 11, 2013 at 12:52:03PM -0700, Tejun Heo wrote:
> > If this becomes an actual bottleneck, the right thing to do is making
> > css ref per-cpu.  Please stop messing around with refcounting.
> If you think this kind of hackery is acceptable, you really need to
> re-evaluate your priorities in making engineering decisions.  In
> tightly coupled code, maybe, but you're trying to introduce utterly
> broken error-prone thing as a generic block layer API.  I mean, are
> you for real?
> -- 
> tejun

All I can tell you is that adding a semantically empty atomic operation 
"cmpxchg(&bio->bi_css->refcnt, bio->bi_css->refcnt, bio->bi_css->refcnt);" 
to bio_clone_context and bio_disassociate_task increases the runtime of a 
benchmark from 23 to 40 seconds.

Every single atomic reference in the block layer is measurable.

How did I measure it:

(1) use dm SRCU patches 
that replace some atomic accesses in device mapper with SRCU. The patches 
will likely be included in the kernel to improve performance.

(2) use the patch v2 that I posted in this thread

(3) add bio_associate_current(bio) to _dm_request (so that each bio is 
associated with a process even if it is not offloaded to a workqueue)

(4) change bio_clone_context to actually increase reference counts:
static inline void bio_clone_context(struct bio *bio, struct bio *bio_src)
{
        BUG_ON(bio->bi_ioc != NULL);
        if (bio_src->bi_ioc) {
                bio->bi_ioc = bio_src->bi_ioc;
                if (bio_src->bi_css && css_tryget(bio_src->bi_css)) {
                        bio->bi_css = bio_src->bi_css;
                        /* this bio now holds its own css reference */
                        bio->bi_flags |= 1UL << BIO_DROP_CGROUP_REFCOUNT;
                }
        }
}

(5) add "cmpxchg(&bio->bi_css->refcnt, bio->bi_css->refcnt, 
bio->bi_css->refcnt)" to bio_clone_context and bio_disassociate_task

Now, measuring:
- create 4GiB ramdisk, fill it with dd so that it is allocated
- create 5 nested device mapper linear targets on it
- run "time fio --rw=randrw --size=1G --bs=512 
	--filename=/dev/mapper/linear5 --direct=1 --name=job1 --name=job2 
	--name=job3 --name=job4 --name=job5 --name=job6 --name=job7 
	--name=job8 --name=job9 --name=job10 --name=job11 --name=job12"
(it was run on a 12-core machine, so there are 12 concurrent jobs)

If I measure kernel (4), the benchmark takes 23 seconds. For kernel (5) it 
takes 40 seconds.

