#1697 PR merged: Automatically add 'missing' devices to MD arrays with not enough physical devices upon restore

Labels: enhancement, fixed / solved / done

rmetrich opened issue at 2018-01-19 09:18:

If a software RAID device is in a degraded state at the time of the "rear mkrescue" run (e.g. a device has been removed from a RAID1 array), the disk layout restoration will fail with the following error in the log:

    mdadm --create /dev/md127 --force --metadata=1.2 --level=raid1 --raid-devices=2 --uuid=a0fc5637:fdf5dafa:a345e010:83293548 --name=boot /dev/sdb1
    +++ echo Y
    mdadm: You haven't given enough devices (real or missing) to create this array

The least intrusive fix is the following: upon restore, add "missing" special devices to the mdadm create command whenever raid-devices - #devices > 0.
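
A rough sketch of the idea in shell (the variable names here are illustrative, not the actual ReaR code):

    # Pad the component device list with "missing" placeholders whenever
    # fewer physical devices than --raid-devices are present.
    raid_devices=2            # the --raid-devices value from disklayout.conf
    devices=( /dev/sdb1 )     # physical component devices that still exist
    while (( ${#devices[@]} < raid_devices )) ; do
        devices+=( missing )
    done
    mdadm --create /dev/md127 --force --metadata=1.2 --level=raid1 \
          --raid-devices=$raid_devices --name=boot "${devices[@]}"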

jsmeix commented at 2018-01-19 10:10:

I am not at all a RAID expert, but I think when any kind of RAID
is in any kind of degraded state at the time of "rear mkrescue",
it looks like a sufficiently severe issue in the original system
that ReaR should do something about it at the time of "rear mkrescue".

My basic idea behind this is to let "rear mkrescue" ensure that the content
in disklayout.conf represents the state of a clean original system,
so that things can be expected to "just work" later during "rear recover".

Accordingly, I think in general that when the original system
is in any kind of degraded state at the time of "rear mkrescue",
then "rear mkrescue" should error out.

I do not much like the idea of fixing, at the time of "rear recover",
an original system that was already degraded at the time of "rear mkrescue".

Or is there a reasonable use case to let "rear mkrescue" finish successfully
regardless that the original system is in any kind of degraded state?

Regardless of my reasoning above, I also think "rear recover"
should work as robustly as possible (with reasonable effort)
against failing to recreate the system, so I appreciate your pull request
because it makes "rear recover" finish successfully more often than before.

But I think an additional test during "rear mkrescue" to detect
a degraded RAID state would be needed even more,
to preventively avoid possible issues during "rear recover".

jsmeix commented at 2018-01-19 10:31:

@rear/contributors
is there a RAID expert among us who could actually review it?

rmetrich commented at 2018-01-19 10:40:

@jsmeix I agree, something must also be done on the "mkrescue" side, but that should be another PR, because it is not that easy. It will likely require user interaction to decide whether creating a rescue image shall be permitted, depending on the RAID status.
Some admins would likely want to be able to create the backup anyway (typically when a failing device was removed but no spare device was available to put back yet).

rmetrich commented at 2018-01-19 10:45:

mdadm manpage:

   -n, --raid-devices=
          Specify the number of active devices in the array. This, plus the number of spare devices (see
          below) must equal the number of component-devices (including "missing" devices) that are listed
          on the command line for --create. Setting a value of 1 is probably a mistake and so requires
          that --force be specified first. A value of 1 will then be allowed for linear, multipath, RAID0
          and RAID1. It is never allowed for RAID4, RAID5 or RAID6.
          This number can only be changed using --grow for RAID1, RAID4, RAID5 and RAID6 arrays, and only
          on kernels which provide the necessary support.
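
For illustration (the device names are made up): 2 active devices plus 1 spare equal the 3 component devices listed on the command line, which satisfies that rule:

    mdadm --create /dev/md0 --level=raid1 --raid-devices=2 --spare-devices=1 \
          /dev/sda1 /dev/sdb1 /dev/sdc1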

jsmeix commented at 2018-01-19 10:46:

@rmetrich
according to my "man mdadm" on SLES11

    To create a "degraded" array in which some devices are missing,
    simply give the word "missing" in place of a device name.
    This will cause mdadm to leave the corresponding slot in the array empty.
    For a RAID4 or RAID5 array at most one slot can be "missing";
    for a RAID6 array at most two slots.
    For a RAID1 array, only one real device needs to be given.
    All of the others can be "missing".
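
For example (hypothetical devices), a degraded two-device RAID1 can be created from a single real device:

    mdadm --create /dev/md0 --level=raid1 --raid-devices=2 /dev/sda1 missing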

I think a test whether devices_found is zero should be added,
probably with an explicit Error exit message,
to avoid that "rear recover" dies later at a non-working
"mdadm create missing missing missing ..."
command in the diskrestore.sh script, cf.
"Try to care about possible errors" in
https://github.com/rear/rear/wiki/Coding-Style
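
A minimal sketch of such a guard (the devices array and $device variable names are assumptions; Error is the existing ReaR error-exit function):

    # Error out early instead of generating an mdadm call whose
    # component list would consist only of "missing" placeholders.
    if (( ${#devices[@]} == 0 )) ; then
        Error "No physical component device found for MD array $device - cannot recreate it"
    fi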

rmetrich commented at 2018-01-19 11:07:

That would indicate a totally broken array with no active device at all, which is unlikely to happen, since no data could have been read from such an array in the running system.

jsmeix commented at 2018-01-19 12:41:

@rmetrich
regarding your https://github.com/rear/rear/pull/1697#issuecomment-358929793

I agree that what needs to be done on the "mkrescue" side
can and should be implemented via a separate pull request.

Regarding "required user interaction":

Since ReaR 2.3 there is the UserInput function,
so user interaction no longer causes problems:
any user interaction via the UserInput function
supports a default response after a timeout, and
any specifically needed (non-default) user interaction
can be automated by the user (i.e. user interaction
no longer requires a real user to be present),
so that "rear mkrescue/mkbackup" can run unattended
even at e.g. 03:00 at night (the same holds for user interaction
via the UserInput function during "rear recover"), cf.
"First steps towards running Relax-and-Recover unattended in general" in
http://relax-and-recover.org/documentation/release-notes-2-3
and have a look at
https://github.com/gdha/rear-automated-testing/issues/36#issuecomment-342769055
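
For illustration, such a "rear mkrescue" check might look like this rough sketch
(the MKRESCUE_DEGRADED_RAID identifier, the prompt wording, and the $device
variable are made up here; UserInput, is_true, and Error are existing ReaR functions):

    # Ask once whether "rear mkrescue" may proceed with a degraded array.
    # Defaults to 'no' after 60 seconds; an unattended run can predefine
    # the answer, e.g. USER_INPUT_MKRESCUE_DEGRADED_RAID="yes" in local.conf.
    answer="$( UserInput -I MKRESCUE_DEGRADED_RAID -t 60 -D 'no' \
               -p "MD array $device is degraded. Continue with 'rear mkrescue' anyway?" )"
    is_true "$answer" || Error "Aborting because MD array $device is degraded"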

jsmeix commented at 2018-01-23 10:46:

If there are no objections I would like to merge it this afternoon.

jsmeix commented at 2018-01-23 15:16:

@rmetrich
many thanks for making "rear recover" run more robustly!


[Export of Github issue for rear/rear.]