#480 Issue closed: rear recover stopped by mdadm and udev

Labels: bug, fixed / solved / done

tbsky opened issue at 2014-10-17 15:10:

hi:
I am using scientific linux 6.5 x86_64 version with two sata harddisk.
rear version is 1.16.1 comes with EPEL.
when I use rear recover to restore systems with linux software raid, I met several issues. since these issues are related, I put them together.

if the disks already has foreign mdadm partition on it, rescue media will activate them. so the disk partition will fail since the disk is busy locked by mdadm.

even the disks are clean, when rear use parted to make a partition, udev will immediately call "/sbin/mdadm" to start md daemon, so the disks is busy locked by mdadm again and rear will fail
since it can not make another partition. to fix the problem, I need to "rm /lib/udev/rules.d/65-md-incremental.rules" before recover.

. mdadm is safe now. but rear will fail with every "partprobe -s" command. rear add "sleep 1" before every "parted" command, but it forgot to add "sleep 1" before every "partprobe". udev
will interrupt and cause busy if without "sleep 1".

these problems show at logs like this:

Print 'Creating partitions for disk /dev/sdb (msdos)'
test 1
echo -e 'Creating partitions for disk /dev/sdb (msdos)'
parted -s /dev/sdb mklabel msdos
sleep 1
parted -s /dev/sdb mkpart '"primary"' 1048576B 210763775B
sleep 1
parted -s /dev/sdb set 1 boot on
parted -s /dev/sdb set 1 raid on
parted -s /dev/sdb mkpart '"primary"' 210763776B 2000398843903B
sleep 1
parted -s /dev/sdb set 2 raid on
partprobe -s /dev/sdb
Warning: WARNING: the kernel failed to re-read the partition table on /dev/sdb (Device or resource busy). As a result, it may not reflect all of your changes until after reboot.
2014-10-16 08:38:17 An error occurred during layout recreation.

I try to fix these problems. the logic is stop mdadm (like stop lvm which rear will do) and pause udev when necessary. so rear don't need to "sleep 1".

patch below:

--- 10_include_partition_code.sh.orig   2014-05-12 14:37:21.000000000 +0800
+++ 10_include_partition_code.sh        2014-10-17 16:38:00.445570683 +0800
@@ -44,6 +44,12 @@
     StopIfError "Disk $disk has size $disk_size, unable to continue."

     cat >> "$LAYOUT_CODE" <<EOF
+Log "Stop mdadm and pause udev"
+if [ -d "/dev/md" ]; then
+    mdadm --stop /dev/md?* >&2
+fi
+udevadm control --stop-exec-queue
+
 Log "Erasing MBR of disk $disk"
 dd if=/dev/zero of=$disk bs=512 count=1
 sync
@@ -52,6 +58,8 @@
     create_partitions "$disk" "$label"

     cat >> "$LAYOUT_CODE" <<EOF
+Log "Resume udev"
+udevadm control --start-exec-queue
 # Wait some time before advancing
 sleep 10

gdha commented at 2014-10-17 15:12:

hey - I would be very grateful if you could make a pull request - life is
so much easier (for me at least:)

On Fri, Oct 17, 2014 at 5:10 PM, tbsky notifications@github.com wrote:

hi:
I am using scientific linux 6.5 x86_64 version with two sata harddisk.
rear version is 1.16.1 comes with EPEL.
when I use rear recover to restore systems with linux software raid, I met
several issues. since these issues are related, I put them together.

if the disks already has foreign mdadm partition on it, rescue media will
activate them. so the disk partition will fail since the disk is busy
locked by mdadm.

even the disks are clean, when rear use parted to make a partition, udev
will immediately call "/sbin/mdadm" to start md daemon, so the disks is
busy locked by mdadm again and rear will fail
since it can not make another partition. to fix the problem, I need to "rm
/lib/udev/rules.d/65-md-incremental.rules" before recover.

. mdadm is safe now. but rear will fail with every "partprobe -s" command.
rear add "sleep 1" before every "parted" command, but it forgot to add
"sleep 1" before every "partprobe". udev
will interrupt and cause busy if without "sleep 1".

these problems show at logs like this:

Print 'Creating partitions for disk /dev/sdb (msdos)'
test 1
echo -e 'Creating partitions for disk /dev/sdb (msdos)'
parted -s /dev/sdb mklabel msdos
sleep 1
parted -s /dev/sdb mkpart '"primary"' 1048576B 210763775B
sleep 1
parted -s /dev/sdb set 1 boot on
parted -s /dev/sdb set 1 raid on
parted -s /dev/sdb mkpart '"primary"' 210763776B 2000398843903B
sleep 1
parted -s /dev/sdb set 2 raid on
partprobe -s /dev/sdb
Warning: WARNING: the kernel failed to re-read the partition table on
/dev/sdb (Device or resource busy). As a result, it may not reflect all of
your changes until after reboot.
2014-10-16 08:38:17 An error occurred during layout recreation.

I try to fix these problems. the logic is stop mdadm (like stop lvm which
rear will do) and pause udev when necessary. so rear don't need to "sleep
1".

patch below:

--- 10_include_partition_code.sh.orig 2014-05-12 14:37:21.000000000 +0800
+++ 10_include_partition_code.sh 2014-10-17 16:38:00.445570683 +0800
@@ -44,6 +44,12 @@
StopIfError "Disk $disk has size $disk_size, unable to continue."

cat >> "$LAYOUT_CODE" <<EOF

+Log "Stop mdadm and pause udev"
+if [ -d "/dev/md" ]; then

mdadm --stop /dev/md?* >&2
+fi
+udevadm control --stop-exec-queue
+
Log "Erasing MBR of disk $disk"
dd if=/dev/zero of=$disk bs=512 count=1
sync
@@ -52,6 +58,8 @@
create_partitions "$disk" "$label"

cat >> "$LAYOUT_CODE" <<EOF
+Log "Resume udev"
+udevadm control --start-exec-queue
Wait some time before advancing

sleep 10


Reply to this email directly or view it on GitHub
https://github.com/rear/rear/issues/480.

tbsky commented at 2014-10-17 15:48:

hi:
i am sorry. this is my first time using GitHub. I will try to learn the concept of pull request.

Reiner030 commented at 2014-10-17 16:29:

I had also first some problems - I found no good howto for it...
The steps are mostly:

  1. fork the repo (rear/rear.git)
  2. make a branch from your repo (tbsky/rear.git) for each pull request you would made
  3. make your changes for a topic in the branch...
  4. create over github (or your gui tool ) an pull request to original repo
    4a) If there is discussion/modification needed you can add then to the branch till it's ready to merge
  5. destroy the branch after merge

(something forgotten?)

tbsky commented at 2014-10-17 17:02:

hi Reiner030:
thanks a lot for your hint! I am studying how to use rear. after I am familiar with rear, I will try to learn how to use pull request.

gdha commented at 2014-10-21 11:39:

@tbsky let me know if you need help with git?

gdha commented at 2014-10-30 15:12:

is it working with your patch on the master repo?

tbsky commented at 2014-10-30 16:09:

yes it is working!!

gdha commented at 2014-11-12 15:38:

As issue #508 states it does not work as expected I'm afraid. Will re-open this issue for further tracking.
Furthermore I believe the fix was especially for mdadm related devices. Perhaps we should re-arrange the if block.
Also, I noticed that udevadm control --stop-exec-queue always return an error (perhaps because return code is 2 on Fedora21 beta). On the shell it works, but return code is 2 as well.

tbsky commented at 2014-11-12 16:04:

hi:
under RHEL6 "udevadm control --stop-exec-queue" and "udevadm control --start-exec-queue" return 0. so there maybe something changed in fedora21 beta.
and udev is not only affected mdadm. there are many "sleep 1" code when rear make disk partition.
if we don't "sleep 1" then udev will break rear sometimes (please see my first post) . so I think pause udev is easier, just how to pause it..

gdha commented at 2014-12-24 13:18:

@tbsky I can confirm it works well under Fedora 21. Thank you for all the work you've put in this issue.

tbsky commented at 2014-12-24 15:36:

hi:
I am glad I can do some help for this great software..


[Export of Github issue for rear/rear.]