Commit e502187
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md
Pull MD updates from Shaohua Li:

 - Add the Partial Parity Log (PPL) feature found in Intel IMSM raid
   arrays, by Artur Paszkiewicz. This feature is another way to close
   the RAID5 write hole. The Linux implementation is also available for
   normal RAID5 arrays if a specific superblock bit is set.

 - A number of md-cluster fixes and enabling of md-cluster array resize
   from Guoqing Jiang

 - A bunch of patches from Ming Lei and Neil Brown to rewrite MD bio
   handling related code. Now MD doesn't directly access bio bvec or
   bi_phys_segments, and uses the modern bio API for bio split.

 - Improve the RAID5 IO pattern to improve performance for hard disk
   based RAID5/6, from me.

 - Several patches from Song Liu to speed up raid5-cache recovery and
   allow the raid5-cache feature to be disabled at runtime.

 - Fix a performance regression in raid1 resync, from Xiao Ni.

 - Other cleanups and fixes from various people.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: (84 commits)
  md/raid10: skip spare disk as 'first' disk
  md/raid1: Use a new variable to count flighting sync requests
  md: clear WantReplacement once disk is removed
  md/raid1/10: remove unused queue
  md: handle read-only member devices better.
  md/raid10: wait up frozen array in handle_write_completed
  uapi: fix linux/raid/md_p.h userspace compilation error
  md-cluster: Fix a memleak in an error handling path
  md: support disabling of create-on-open semantics.
  md: allow creation of mdNNN arrays via md_mod/parameters/new_array
  raid5-ppl: use a single mempool for ppl_io_unit and header_page
  md/raid0: fix up bio splitting.
  md/linear: improve bio splitting.
  md/raid5: make chunk_aligned_read() split bios more cleanly.
  md/raid10: simplify handle_read_error()
  md/raid10: simplify the splitting of requests.
  md/raid1: factor out flush_bio_list()
  md/raid1: simplify handle_read_error().
  Revert "block: introduce bio_copy_data_partial"
  md/raid1: simplify alloc_behind_master_bio()
  ...
2 parents 46f0537 + e265eb3, commit e502187
26 files changed: +3582 −1483 lines

Documentation/admin-guide/md.rst (+29 −3)

@@ -276,14 +276,14 @@ All md devices contain:
      array creation it will default to 0, though starting the array as
      ``clean`` will set it much larger.
 
-  new_dev
+  new_dev
      This file can be written but not read. The value written should
      be a block device number as major:minor. e.g. 8:0
      This will cause that device to be attached to the array, if it is
      available. It will then appear at md/dev-XXX (depending on the
      name of the device) and further configuration is then possible.
 
-  safe_mode_delay
+  safe_mode_delay
      When an md array has seen no write requests for a certain period
      of time, it will be marked as ``clean``. When another write
      request arrives, the array is marked as ``dirty`` before the write
@@ -292,7 +292,7 @@ All md devices contain:
      period as a number of seconds. The default is 200msec (0.200).
      Writing a value of 0 disables safemode.
 
-  array_state
+  array_state
      This file contains a single word which describes the current
      state of the array. In many cases, the state can be set by
      writing the word for the desired state, however some states
@@ -401,7 +401,30 @@ All md devices contain:
      once the array becomes non-degraded, and this fact has been
      recorded in the metadata.
 
+  consistency_policy
+     This indicates how the array maintains consistency in case of unexpected
+     shutdown. It can be:
 
+     none
+       Array has no redundancy information, e.g. raid0, linear.
+
+     resync
+       Full resync is performed and all redundancy is regenerated when the
+       array is started after unclean shutdown.
+
+     bitmap
+       Resync assisted by a write-intent bitmap.
+
+     journal
+       For raid4/5/6, journal device is used to log transactions and replay
+       after unclean shutdown.
+
+     ppl
+       For raid5 only, Partial Parity Log is used to close the write hole and
+       eliminate resync.
+
+     The accepted values when writing to this file are ``ppl`` and ``resync``,
+     used to enable and disable PPL.
 
 
 As component devices are added to an md array, they appear in the ``md``
@@ -563,6 +586,9 @@ Each directory contains:
      adds bad blocks without acknowledging them. This is largely
      for testing.
 
+  ppl_sector, ppl_size
+     Location and size (in sectors) of the space used for Partial Parity Log
+     on this device.
 
 
 An active md device will also contain an entry for each active device

Documentation/md/md-cluster.txt (+1 −1)

@@ -321,4 +321,4 @@ The algorithm is:
 
 There are somethings which are not supported by cluster MD yet.
 
-- update size and change array_sectors.
+- change array_sectors.

Documentation/md/raid5-ppl.txt (+44, new file)

@@ -0,0 +1,44 @@
+Partial Parity Log
+
+Partial Parity Log (PPL) is a feature available for RAID5 arrays. The issue
+addressed by PPL is that after a dirty shutdown, parity of a particular stripe
+may become inconsistent with data on other member disks. If the array is also
+in degraded state, there is no way to recalculate parity, because one of the
+disks is missing. This can lead to silent data corruption when rebuilding the
+array or using it as degraded - data calculated from parity for array blocks
+that have not been touched by a write request during the unclean shutdown can
+be incorrect. Such condition is known as the RAID5 Write Hole. Because of
+this, md by default does not allow starting a dirty degraded array.
+
+Partial parity for a write operation is the XOR of stripe data chunks not
+modified by this write. It is just enough data needed for recovering from the
+write hole. XORing partial parity with the modified chunks produces parity for
+the stripe, consistent with its state before the write operation, regardless of
+which chunk writes have completed. If one of the not modified data disks of
+this stripe is missing, this updated parity can be used to recover its
+contents. PPL recovery is also performed when starting an array after an
+unclean shutdown and all disks are available, eliminating the need to resync
+the array. Because of this, using write-intent bitmap and PPL together is not
+supported.
+
+When handling a write request PPL writes partial parity before new data and
+parity are dispatched to disks. PPL is a distributed log - it is stored on
+array member drives in the metadata area, on the parity drive of a particular
+stripe. It does not require a dedicated journaling drive. Write performance is
+reduced by up to 30%-40% but it scales with the number of drives in the array
+and the journaling drive does not become a bottleneck or a single point of
+failure.
+
+Unlike raid5-cache, the other solution in md for closing the write hole, PPL is
+not a true journal. It does not protect from losing in-flight data, only from
+silent data corruption. If a dirty disk of a stripe is lost, no PPL recovery is
+performed for this stripe (parity is not updated). So it is possible to have
+arbitrary data in the written part of a stripe if that disk is lost. In such
+case the behavior is the same as in plain raid5.
+
+PPL is available for md version-1 metadata and external (specifically IMSM)
+metadata arrays. It can be enabled using mdadm option --consistency-policy=ppl.
+
+Currently, volatile write-back cache should be disabled on all member drives
+when using PPL. Otherwise it cannot guarantee consistency in case of power
+failure.

block/bio.c (+13 −48)

@@ -633,20 +633,21 @@ struct bio *bio_clone_fast(struct bio *bio, gfp_t gfp_mask, struct bio_set *bs)
 }
 EXPORT_SYMBOL(bio_clone_fast);
 
-static struct bio *__bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
-				      struct bio_set *bs, int offset,
-				      int size)
+/**
+ * bio_clone_bioset - clone a bio
+ * @bio_src: bio to clone
+ * @gfp_mask: allocation priority
+ * @bs: bio_set to allocate from
+ *
+ * Clone bio. Caller will own the returned bio, but not the actual data it
+ * points to. Reference count of returned bio will be one.
+ */
+struct bio *bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
+			     struct bio_set *bs)
 {
 	struct bvec_iter iter;
 	struct bio_vec bv;
 	struct bio *bio;
-	struct bvec_iter iter_src = bio_src->bi_iter;
-
-	/* for supporting partial clone */
-	if (offset || size != bio_src->bi_iter.bi_size) {
-		bio_advance_iter(bio_src, &iter_src, offset);
-		iter_src.bi_size = size;
-	}
 
 	/*
 	 * Pre immutable biovecs, __bio_clone() used to just do a memcpy from
@@ -670,8 +671,7 @@ static struct bio *__bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
 	 * __bio_clone_fast() anyways.
 	 */
 
-	bio = bio_alloc_bioset(gfp_mask, __bio_segments(bio_src,
-			       &iter_src), bs);
+	bio = bio_alloc_bioset(gfp_mask, bio_segments(bio_src), bs);
 	if (!bio)
 		return NULL;
 	bio->bi_bdev = bio_src->bi_bdev;
@@ -688,7 +688,7 @@ static struct bio *__bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
 		bio->bi_io_vec[bio->bi_vcnt++] = bio_src->bi_io_vec[0];
 		break;
 	default:
-		__bio_for_each_segment(bv, bio_src, iter, iter_src)
+		bio_for_each_segment(bv, bio_src, iter)
 			bio->bi_io_vec[bio->bi_vcnt++] = bv;
 		break;
 	}
@@ -707,43 +707,8 @@ static struct bio *__bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
 
 	return bio;
 }
-
-/**
- * bio_clone_bioset - clone a bio
- * @bio_src: bio to clone
- * @gfp_mask: allocation priority
- * @bs: bio_set to allocate from
- *
- * Clone bio. Caller will own the returned bio, but not the actual data it
- * points to. Reference count of returned bio will be one.
- */
-struct bio *bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
-			     struct bio_set *bs)
-{
-	return __bio_clone_bioset(bio_src, gfp_mask, bs, 0,
-				  bio_src->bi_iter.bi_size);
-}
 EXPORT_SYMBOL(bio_clone_bioset);
 
-/**
- * bio_clone_bioset_partial - clone a partial bio
- * @bio_src: bio to clone
- * @gfp_mask: allocation priority
- * @bs: bio_set to allocate from
- * @offset: cloned starting from the offset
- * @size: size for the cloned bio
- *
- * Clone bio. Caller will own the returned bio, but not the actual data it
- * points to. Reference count of returned bio will be one.
- */
-struct bio *bio_clone_bioset_partial(struct bio *bio_src, gfp_t gfp_mask,
-				     struct bio_set *bs, int offset,
-				     int size)
-{
-	return __bio_clone_bioset(bio_src, gfp_mask, bs, offset, size);
-}
-EXPORT_SYMBOL(bio_clone_bioset_partial);
-
 /**
  * bio_add_pc_page - attempt to add page to bio
  * @q: the target queue

drivers/md/Makefile (+1 −1)

@@ -18,7 +18,7 @@ dm-cache-cleaner-y += dm-cache-policy-cleaner.o
 dm-era-y	+= dm-era-target.o
 dm-verity-y	+= dm-verity-target.o
 md-mod-y	+= md.o bitmap.o
-raid456-y	+= raid5.o raid5-cache.o
+raid456-y	+= raid5.o raid5-cache.o raid5-ppl.o
 
 # Note: link order is important. All raid personalities
 # and must come before md.o, as they each initialise

drivers/md/bitmap.c (+48 −11)

@@ -471,6 +471,7 @@ void bitmap_update_sb(struct bitmap *bitmap)
 	kunmap_atomic(sb);
 	write_page(bitmap, bitmap->storage.sb_page, 1);
 }
+EXPORT_SYMBOL(bitmap_update_sb);
 
 /* print out the bitmap file superblock */
 void bitmap_print_sb(struct bitmap *bitmap)
@@ -696,7 +697,7 @@ static int bitmap_read_sb(struct bitmap *bitmap)
 
 out:
 	kunmap_atomic(sb);
-	/* Assiging chunksize is required for "re_read" */
+	/* Assigning chunksize is required for "re_read" */
 	bitmap->mddev->bitmap_info.chunksize = chunksize;
 	if (err == 0 && nodes && (bitmap->cluster_slot < 0)) {
 		err = md_setup_cluster(bitmap->mddev, nodes);
@@ -1727,7 +1728,7 @@ void bitmap_flush(struct mddev *mddev)
 /*
  * free memory that was allocated
  */
-static void bitmap_free(struct bitmap *bitmap)
+void bitmap_free(struct bitmap *bitmap)
 {
 	unsigned long k, pages;
 	struct bitmap_page *bp;
@@ -1761,6 +1762,21 @@ static void bitmap_free(struct bitmap *bitmap)
 		kfree(bp);
 	kfree(bitmap);
 }
+EXPORT_SYMBOL(bitmap_free);
+
+void bitmap_wait_behind_writes(struct mddev *mddev)
+{
+	struct bitmap *bitmap = mddev->bitmap;
+
+	/* wait for behind writes to complete */
+	if (bitmap && atomic_read(&bitmap->behind_writes) > 0) {
+		pr_debug("md:%s: behind writes in progress - waiting to stop.\n",
+			 mdname(mddev));
+		/* need to kick something here to make sure I/O goes? */
+		wait_event(bitmap->behind_wait,
+			   atomic_read(&bitmap->behind_writes) == 0);
+	}
+}
 
 void bitmap_destroy(struct mddev *mddev)
 {
@@ -1769,6 +1785,8 @@ void bitmap_destroy(struct mddev *mddev)
 	if (!bitmap) /* there was no bitmap */
 		return;
 
+	bitmap_wait_behind_writes(mddev);
+
 	mutex_lock(&mddev->bitmap_info.mutex);
 	spin_lock(&mddev->lock);
 	mddev->bitmap = NULL; /* disconnect from the md device */
@@ -1920,6 +1938,27 @@ int bitmap_load(struct mddev *mddev)
 }
 EXPORT_SYMBOL_GPL(bitmap_load);
 
+struct bitmap *get_bitmap_from_slot(struct mddev *mddev, int slot)
+{
+	int rv = 0;
+	struct bitmap *bitmap;
+
+	bitmap = bitmap_create(mddev, slot);
+	if (IS_ERR(bitmap)) {
+		rv = PTR_ERR(bitmap);
+		return ERR_PTR(rv);
+	}
+
+	rv = bitmap_init_from_disk(bitmap, 0);
+	if (rv) {
+		bitmap_free(bitmap);
+		return ERR_PTR(rv);
+	}
+
+	return bitmap;
+}
+EXPORT_SYMBOL(get_bitmap_from_slot);
+
 /* Loads the bitmap associated with slot and copies the resync information
  * to our bitmap
  */
@@ -1929,14 +1968,13 @@ int bitmap_copy_from_slot(struct mddev *mddev, int slot,
 	int rv = 0, i, j;
 	sector_t block, lo = 0, hi = 0;
 	struct bitmap_counts *counts;
-	struct bitmap *bitmap = bitmap_create(mddev, slot);
-
-	if (IS_ERR(bitmap))
-		return PTR_ERR(bitmap);
+	struct bitmap *bitmap;
 
-	rv = bitmap_init_from_disk(bitmap, 0);
-	if (rv)
-		goto err;
+	bitmap = get_bitmap_from_slot(mddev, slot);
+	if (IS_ERR(bitmap)) {
+		pr_err("%s can't get bitmap from slot %d\n", __func__, slot);
+		return -1;
+	}
 
 	counts = &bitmap->counts;
 	for (j = 0; j < counts->chunks; j++) {
@@ -1963,8 +2001,7 @@ int bitmap_copy_from_slot(struct mddev *mddev, int slot,
 	bitmap_unplug(mddev->bitmap);
 	*low = lo;
 	*high = hi;
-err:
-	bitmap_free(bitmap);
+
 	return rv;
 }
 EXPORT_SYMBOL_GPL(bitmap_copy_from_slot);

drivers/md/bitmap.h (+3)

@@ -267,8 +267,11 @@ void bitmap_daemon_work(struct mddev *mddev);
 
 int bitmap_resize(struct bitmap *bitmap, sector_t blocks,
 		  int chunksize, int init);
+struct bitmap *get_bitmap_from_slot(struct mddev *mddev, int slot);
 int bitmap_copy_from_slot(struct mddev *mddev, int slot,
 		sector_t *lo, sector_t *hi, bool clear_bits);
+void bitmap_free(struct bitmap *bitmap);
+void bitmap_wait_behind_writes(struct mddev *mddev);
 #endif
 
 #endif
