#533 Issue closed: rear git201501071534 "udevadm settle --timeout=20" hangs endlessly in recovery system for openSUSE 13.2

Labels: enhancement, support / question, fixed / solved / done

jsmeix opened issue at 2015-01-20 16:48:

I am testing the rear git201501071534 version on openSUSE 13.2.

It basically works but needs manual intervention because
"rear -d -D recover" hangs at 'Creating partitions for disk /dev/sda (msdos)' at every parted step.

I have to manually kill each "/sbin/udevadm settle --timeout=20" process to force it to proceed.

For each parted step it hangs with processes like these:

RESCUE f74:~ # ps aw | grep -A1 parted
 2130 tty1     S+     0:00 parted -s /dev/sda mklabel msdos
 2131 tty1     S+     0:00 /sbin/udevadm settle --timeout=20

RESCUE f74:~ # kill 2131

RESCUE f74:~ # ps aw | grep -A1 parted
 2144 tty1     S+     0:00 parted -s /dev/sda mkpart "primary" 1048576B 1570766847B
 2145 tty1     S+     0:00 /sbin/udevadm settle --timeout=20

RESCUE f74:~ # kill 2145

RESCUE f74:~ # ps aw | grep -A1 parted
 2150 tty1     S+     0:00 parted -s /dev/sda mkpart "primary" 1570766848B 12313427967B
 2151 tty1     S+     0:00 /sbin/udevadm settle --timeout=20

RESCUE f74:~ # kill 2151

RESCUE f74:~ # ps aw | grep -A1 parted
 2156 tty1     S+     0:00 parted -s /dev/sda mkpart "primary" 12313427968B 21474836479B
 2157 tty1     S+     0:00 /sbin/udevadm settle --timeout=20

RESCUE f74:~ # kill 2157

RESCUE f74:~ # ps aw | grep -A1 partprobe
 2161 tty1     S+     0:00 partprobe -s /dev/sda
 2162 tty1     S+     0:00 /sbin/udevadm settle --timeout=20

RESCUE f74:~ # kill 2162

gdha commented at 2015-01-21 09:18:

That is a bit weird, as in rear we never defined --timeout=20. Where is that coming from? Could you try to find this out? For reference, this is what a grep in the rear sources shows:

$ grep -r udevadm . | grep settle
./usr/share/rear/skel/Fedora/17/usr/lib/systemd/system/udev-settle.service:ExecStart=/usr/bin/udevadm settle
./usr/share/rear/skel/Fedora/16/lib/systemd/system/udev-settle.service:ExecStart=/sbin/udevadm settle
./usr/share/rear/skel/default/etc/scripts/system-setup.d/00-functions.sh:        type -p udevadm >/dev/null && udevadm settle --timeout=10 $@ || udevsettle $@
./usr/share/rear/skel/default/usr/lib/systemd/system/udev-settle.service:ExecStart=/usr/bin/udevadm settle
./usr/share/rear/lib/linux-functions.sh:                udevadm settle $@

jsmeix commented at 2015-01-21 10:08:

What is even stranger is that this '--timeout=20' does not work: the process seems to hang endlessly (at least for more than about 5 minutes).

jsmeix commented at 2015-01-21 10:21:

The udevadm is a child process of parted.

Currently I don't know how this child gets launched.

How it looks when it hangs at 'Creating partitions for disk /dev/sda (msdos)':

RESCUE f74:~ # pstree -plau
systemd,1
  |-agetty,1358 tty2 38400
  |-agetty,1359 tty3 38400
  |-agetty,1361 tty4 38400
  |-bash,1357
  |   `-rear,1373 /bin/rear -d -D recover
  |       `-rear,2201 /bin/rear -d -D recover
  |           `-parted,2217 -s /dev/sda mklabel msdos
  |               `-udevadm,2218 settle --timeout=20
  |-rpc.statd,1540
  |-rpcbind,1535,rpc
  |-sshd,41 -D
  |   `-sshd,1368
  |       `-bash,1370
  |           `-pstree,2220 -plau
  |-systemd-journal,44
  `-systemd-udevd,47
      |-systemd-udevd,1515
      `-systemd-udevd,1516
RESCUE f74:~ # 

FYI:

In my /etc/rear/local.conf I have

PROGS=( ${PROGS[@]} pstree )

jsmeix commented at 2015-01-21 10:34:

PID 2218 (udevadm settle --timeout=20) hangs endlessly in:

RESCUE f74:~ # strace -p2218
Process 2218 attached
restart_syscall(<... resuming interrupted call ...>) = 0
access("/run/udev/queue", F_OK)         = 0
poll([{fd=4, events=POLLIN}], 1, 1000)  = 0 (Timeout)
access("/run/udev/queue", F_OK)         = 0
poll([{fd=4, events=POLLIN}], 1, 1000)  = 0 (Timeout)
access("/run/udev/queue", F_OK)         = 0
poll([{fd=4, events=POLLIN}], 1, 1000^CProcess 2218 detached
 
RESCUE f74:~ #

Also a plain "udevadm settle" (with and without '--timeout=...') called on the command line hangs endlessly in the same way in the recovery system.

In the openSUSE 13.2 original system "udevadm settle" (with and without '--timeout=...') returns immediately.

It seems that udev-related stuff is somehow broken in the rear recovery system?

But I cannot debug udev related issues (udev stuff is way too strange for my brain - I know that from various bad experiences in the past).
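With hindsight from the analysis later in this thread (the udev exec queue turned out to be paused by rear's own partitioning code), a quick check of the udev queue state in the rescue system could look roughly like this - illustrative commands only:

# Check whether the udev daemon is running and whether its event queue is non-empty:
ps ax | grep '[s]ystemd-udevd'
ls -l /run/udev/queue
# "udevadm settle" polls for /run/udev/queue to go away (see the strace output above).
# If the exec queue was paused via "udevadm control --stop-exec-queue",
# resuming it lets a hanging "udevadm settle" return:
udevadm control --start-exec-queue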

jsmeix commented at 2015-01-21 10:39:

Found how the udevadm child process of parted gets launched:

RESCUE f74:~ # find / | xargs grep -- '--timeout=20' 2>/dev/null
Binary file /usr/lib64/libparted.so.2.0.0 matches
RESCUE f74:~ # strings /usr/lib64/libparted.so.2.0.0 | grep -- '--timeout=20'
/sbin/udevadm settle --timeout=20

jsmeix commented at 2015-01-21 14:42:

When I undo the udev related changes in 10_include_partition_code.sh it works again for me.

I mean these changes in 10_include_partition_code.sh that happened from rear-1.16.1-git201412031625.tar.gz (where it had worked for me) to rear-1.16.1-git201501071534.tar.gz (which no longer works for me):

-if [ -d "/dev/md" ] && ls /dev/md?* &>/dev/null; then
-    Log "Stop mdadm and pause udev"
-    mdadm --stop /dev/md?* >&2
-
-    type -p udevadm >/dev/null
-    if [ $? -eq 0 ]; then
-        udevadm control --stop-exec-queue
-    else
-        udevcontrol stop_exec_queue
-    fi
+Log "Stop mdadm and pause udev"
+if grep -q md /proc/mdstat &>/dev/null; then
+    mdadm --stop -s >&2 || echo "stop mdadm failed"
+fi
+
+type -p udevadm >/dev/null
+if [ $? -eq 0 ]; then
+    udevadm control --stop-exec-queue >&2 || echo "pause udev via udevadm failed"
+else
+    udevcontrol stop_exec_queue >&2 || echo "pause udev via udevcontrol failed"

and

-if [ -d "/dev/md" ] && ls /dev/md?* &>/dev/null; then
-    Log "Resume udev"
-    type -p udevadm >/dev/null
-        if [ $? -eq 0 ]; then
-        udevadm control --stop-exec-queue
-    else
-        udevcontrol stop_exec_queue
-    fi
+Log "Resume udev"
+type -p udevadm >/dev/null
+if [ $? -eq 0 ]; then
+    udevadm control --start-exec-queue >&2 || echo "resume udev via udevadm failed"
+else
+    udevcontrol start_exec_queue >&2 || echo "resume udev via udevcontrol failed"

Reverting these changes makes it work for me again.

jsmeix commented at 2015-01-21 14:57:

Regarding mdadm related stuff:
I never tested rear with MD devices (a.k.a. Linux Software RAID).

gdha commented at 2015-01-21 15:23:

My guess is that when we remove >&2 from the udevadm call it won't break anymore.

tbsky commented at 2015-01-22 05:00:

Hi,
I don't have openSUSE 13.2 to try, but following the discussion above this is my guess:
my patch tries to use "udevadm control --stop-exec-queue" to pause udev because without it, parted under RHEL6/RHEL7 sometimes fails because of udev events.

But now it seems the "parted" program noticed this problem and tries to fix it itself by calling "udevadm settle". Since udev was paused before parted is run, parted can never settle the udev queue.

I tried to grep libparted.so* for "udevadm" under RHEL6 and RHEL7 and found nothing, so maybe openSUSE 13.2 has a newer (or patched) version of parted.

To fit the new behavior of parted I need to modify the patch, maybe by settling the queue before every parted command; the old way was to pause the queue at the beginning and enable it again at the end (see the sketch below).

Or are there any other good suggestions?
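A minimal sketch of that alternative - settling the queue around each parted call instead of pausing the whole exec queue - which is roughly what the thread later converges on via the my_udevsettle wrapper (device name and partition sizes are just the example values from above):

# let pending udev events finish, run one parted command, then settle again before proceeding
udevadm settle
parted -s /dev/sda mklabel msdos
udevadm settle
parted -s /dev/sda mkpart primary 1048576B 1570766847B
udevadm settle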

tbsky commented at 2015-01-22 07:10:

@gdha

I downloaded the parted source code and didn't find any code about "udev settle". Then I downloaded the openSUSE 13.2 parted source RPM and found the "udevadm settle" code in "do-not-create-dm-nodes.patch" and "more-reliable-informing-the-kernel.patch". The patches were added in 2010 and 2011; I wish they had been upstreamed, then I wouldn't have been hit by this udev & parted issue in the first place...

There are many "parted" commands in 10_include_partition_code.sh. I don't know if it is proper to add a "udev settle" command before every one, but I notice there is already a "my_udevsettle" function used in 10_include_partition_code.sh for RHEL4. Since nobody complained about it, maybe it is ok to add "udev settle" for every parted?

I will try to modify the patch if you think it is ok.

jsmeix commented at 2015-01-22 09:32:

tbsky,
very many thanks for your analysis.

This issue shows perfectly how bad it is when there are Linux distribution-specific patches that are not integrated upstream.

I submitted an issue report for openSUSE about SUSE's different behaviour of parted regarding udev:

https://bugzilla.opensuse.org/show_bug.cgi?id=914245

jsmeix commented at 2015-01-22 14:05:

For me the following works in 10_include_partition_code.sh

create_disk() {
...
    cat >> "$LAYOUT_CODE" <<EOF
Log "Stop mdadm"
if grep -q md /proc/mdstat &>/dev/null; then
    mdadm --stop -s >&2 || echo "stop mdadm failed"
fi
Log "Erasing MBR of disk $disk"
dd if=/dev/zero of=$disk bs=512 count=1
sync
EOF
    create_partitions "$disk" "$label"
    cat >> "$LAYOUT_CODE" <<EOF
# Wait some time before advancing
sleep 10
# Make sure device nodes are visible (eg. in RHEL4)
my_udevtrigger
my_udevsettle
EOF
.
.
.
create_partitions() {
...
    cat >> "$LAYOUT_CODE" <<EOF
LogPrint "Creating partitions for disk $device ($label)"
# ensure the udev event queue is empty before parted is run
# because 'udevadm ... --timeout' does not exit udevadm in any case
# a simple traditional hardcoded 5 seconds timeout is implemented
# that is waited in any case that kills the udevadm sub-process afterwards
# the outermost subshell avoids job control messages like "[1] job_pid" and "[1]+ Done..." or "[1]+ Terminated..."
# in POSIX shells wait returns the exit code of the job even if it had already terminated when wait was started
( ( udevadm settle ) & UDEVADM_PID=\$! ; sleep 5s ; kill \$UDEVADM_PID &>/dev/null ; wait \$UDEVADM_PID ) || true
# run parted
parted -s $device mklabel $label >&2
# after parted was run, again ensure the udev event queue is empty before proceeding
( ( udevadm settle ) & UDEVADM_PID=\$! ; sleep 5s ; kill \$UDEVADM_PID &>/dev/null ; wait \$UDEVADM_PID ) || true
EOF
...
            cat >> "$LAYOUT_CODE" <<EOF
( ( udevadm settle ) & UDEVADM_PID=\$! ; sleep 5s ; kill \$UDEVADM_PID &>/dev/null ; wait \$UDEVADM_PID ) || true
parted -s $device mkpart '"$name"' ${start}B $end >&2
( ( udevadm settle ) & UDEVADM_PID=\$! ; sleep 5s ; kill \$UDEVADM_PID &>/dev/null ; wait \$UDEVADM_PID ) || true
EOF
...

and so forth - i.e. each parted call has such an

( ( udevadm settle ) & UDEVADM_PID=\$! ; sleep 5s ; kill \$UDEVADM_PID &>/dev/null ; wait \$UDEVADM_PID ) || true

call before and after it.
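For comparison, a simpler way to get the same "give up after 5 seconds" behaviour would be the coreutils timeout command, assuming it is available in the recovery system - just a sketch, not what the heredoc above uses:

# kill "udevadm settle" if it has not returned after 5 seconds, but never fail the script because of it
timeout 5s udevadm settle || true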

It is questionable whether "udevadm settle" is really needed before and after each parted call, because for me this results in /var/lib/rear/layout/diskrestore.sh being generated in the recovery system as follows:

if create_component "/dev/sda" "disk" ; then
# Create /dev/sda (disk)
Log "Stop mdadm"
if grep -q md /proc/mdstat &>/dev/null; then
    mdadm --stop -s >&2 || echo "stop mdadm failed"
fi
Log "Erasing MBR of disk /dev/sda"
dd if=/dev/zero of=/dev/sda bs=512 count=1
sync
LogPrint "Creating partitions for disk /dev/sda (msdos)"
# ensure the udev event queue is empty before parted is run
# because 'udevadm ... --timeout' does not exit udevadm in any case
# a simple traditional hardcoded 5 seconds timeout is implemented
# that is waited in any case that kills the udevadm sub-process afterwards
# the outermost subshell avoids job control messages like "[1] job_pid" and "[1]+ Done..." or "[1]+ Terminated..."
# in POSIX shells wait returns the exit code of the job even if it had already terminated when wait was started
( ( udevadm settle ) & UDEVADM_PID=$! ; sleep 5s ; kill $UDEVADM_PID &>/dev/null ; wait $UDEVADM_PID ) || true
# run parted
parted -s /dev/sda mklabel msdos >&2
# after parted was run, again ensure the udev event queue is empty before proceeding
( ( udevadm settle ) & UDEVADM_PID=$! ; sleep 5s ; kill $UDEVADM_PID &>/dev/null ; wait $UDEVADM_PID ) || true
( ( udevadm settle ) & UDEVADM_PID=$! ; sleep 5s ; kill $UDEVADM_PID &>/dev/null ; wait $UDEVADM_PID ) || true
parted -s /dev/sda mkpart '"primary"' 1048576B 1570766847B >&2
( ( udevadm settle ) & UDEVADM_PID=$! ; sleep 5s ; kill $UDEVADM_PID &>/dev/null ; wait $UDEVADM_PID ) || true
( ( udevadm settle ) & UDEVADM_PID=$! ; sleep 5s ; kill $UDEVADM_PID &>/dev/null ; wait $UDEVADM_PID ) || true
parted -s /dev/sda mkpart '"primary"' 1570766848B 12313427967B >&2
( ( udevadm settle ) & UDEVADM_PID=$! ; sleep 5s ; kill $UDEVADM_PID &>/dev/null ; wait $UDEVADM_PID ) || true
( ( udevadm settle ) & UDEVADM_PID=$! ; sleep 5s ; kill $UDEVADM_PID &>/dev/null ; wait $UDEVADM_PID ) || true
parted -s /dev/sda set 2 boot on >&2
( ( udevadm settle ) & UDEVADM_PID=$! ; sleep 5s ; kill $UDEVADM_PID &>/dev/null ; wait $UDEVADM_PID ) || true
( ( udevadm settle ) & UDEVADM_PID=$! ; sleep 5s ; kill $UDEVADM_PID &>/dev/null ; wait $UDEVADM_PID ) || true
parted -s /dev/sda mkpart '"primary"' 12313427968B 21474836479B >&2
( ( udevadm settle ) & UDEVADM_PID=$! ; sleep 5s ; kill $UDEVADM_PID &>/dev/null ; wait $UDEVADM_PID ) || true
( ( udevadm settle ) & UDEVADM_PID=$! ; sleep 5s ; kill $UDEVADM_PID &>/dev/null ; wait $UDEVADM_PID ) || true
partprobe -s /dev/sda >&2
( ( udevadm settle ) & UDEVADM_PID=$! ; sleep 5s ; kill $UDEVADM_PID &>/dev/null ; wait $UDEVADM_PID ) || true
# Wait some time before advancing
sleep 10
# Make sure device nodes are visible (eg. in RHEL4)
my_udevtrigger
my_udevsettle
component_created "/dev/sda" "disk"
else
    LogPrint "Skipping /dev/sda (disk) as it has already been created."
fi

The duplicated consecutive "udevadm settle" calls are certainly not needed here (but they also do no real harm - except wasting time, in particular on systems with many disks).

But currently I don't know whether under special circumstances diskrestore.sh could be generated differently so that a "udevadm settle" call would be missing, so for now this way it should (hopefully) at least be fail-safe.

tbsky,
I would very much appreciate it if you could try out whether my above approach also works for you on RHEL6/RHEL7 and provide feedback.

In the end I would like to have one and the same generic code that works both on Fedora/RHEL and on openSUSE/SLES.

jsmeix commented at 2015-01-22 14:11:

FWIW:

Only a side note regarding "mdadm":

In 10_include_partition_code.sh there is "mdadm --stop" but I cannot find where it is started again.

Is this missing or is starting it again not needed?

If starting is not needed I would like to know why (I really don't know about MD stuff).

jsmeix commented at 2015-01-22 14:44:

Only FYI, here is where my above timeout+kill implementation comes from, see

https://features.opensuse.org/312491

openSUSE 13.2 has this kind of timeout+kill implementation in
/usr/lib/YaST2/bin/test_remote_socket

Its code is at

https://build.opensuse.org/package/view_file/Printing/yast2-printer-SLE12/test_remote_socket.without_ping?expand=1

tbsky commented at 2015-01-22 14:58:

@jsmeix

the command "udevadm settle" will break RHEL4 since RHEL4 didn't have "udevadm" command. you can check the function "my_udevsettle". since no one complain about it, and it seems works fine under opensuse 13.2. (the conclusion is made by your testing). so I think maybe "my_udevesettle" before every parted and partprobe (yes, I forgot partprobe, it will fail by udev event also) is acceptable.

before I made the "pause udev" patch, the old way to prevent udev event in 10_include_partition_code.sh is "sleep 1" before "parted". I have test "rear recover" about 20 times with it. and every time the "sleep 1" is ok for "parted". but some "partprobe" or "parted" (sorry I forgot which one) didn't have "sleep 1" before it at generated diskrestore.sh. and these lines will sometimes fail.

so "sleep 1" or "udev settle" is both fine for me. although I don't know if "sleep 1" is enough under everyone's condition.

about "mdadm --stop". if you do "rear recover" under brand new hard disks you are safe. but if you do "rear recover" under disks which already have old software raid at it, the mdadm will startup before "rear recover" (I saw this under RHEL6/7 systems. mabye this behavior can be disabled somewhere, but I didn't try to figure it out). and if mdadm is running you can not "parted" the disks, since the disks are busy. that's why we need to stop mdadm if it is running. and we don't need to restart it. since the disks will be re-partition, and the software raid will re-made and re-assemble later, at that time mdadm will startup again.

jsmeix commented at 2015-01-22 16:09:

Meanwhile I think my special timeout+kill implementation for "udevadm settle" is oversophisticated because it is a workaround for a bug in "udevadm" that does not exit after the specified timeout.

Therefore I submitted this issue report at openSUSE:
https://bugzilla.opensuse.org/show_bug.cgi?id=914311

I think rear should not make its own code overcomplicated with oversophisticated workarounds for bugs in standard tools.

Therefore using the existing "my_udevsettle" function is perfectly right from my current point of view.

If really needed we could enhance the "my_udevsettle" function.

But currently this is not needed: when the "--stop-exec-queue" and "--start-exec-queue" commands are removed from 10_include_partition_code.sh, plain "udevadm settle" works - and if it does not, it correctly indicates an error during recovery.

jsmeix commented at 2015-01-22 16:23:

I find two "my_udevsettle" function implementations:

A simple one in usr/share/rear/skel/default/etc/scripts/system-setup.d/00-functions.sh and a more complicated one in usr/share/rear/lib/linux-functions.sh.

Gratien,

are the two different "my_udevsettle" implementations intentional?
If yes, it is at least confusing because it makes it complicated to find out which implementation is called when.
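For reference, a compatibility wrapper in the style of the 00-functions.sh one-liner from the grep output earlier in this thread would look roughly like this (illustrative sketch only, not necessarily the exact rear code):

my_udevsettle() {
    # use udevadm where it exists, fall back to the old udevsettle binary (e.g. RHEL4)
    if type -p udevadm >/dev/null ; then
        udevadm settle --timeout=10 "$@"
    else
        udevsettle "$@"
    fi
}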

gdha commented at 2015-01-22 16:53:

@schlomo do you know why there are 2 functions available (my_udevsettle)? I guess the one in 00-functions.sh is only meant for the startup scripts, while the one in lib/linux-functions.sh relies on other functions as well.
Oh well, rear itself will only use the one defined in lib/linux-functions.sh. Nothing to worry about.

gdha commented at 2015-01-22 16:55:

@tbsky are you willing to update the script 10_include_partition_code.sh with the new knowledge after all the experiments done by yourself and @jsmeix?

tbsky commented at 2015-01-23 02:43:

@gdha @jsmeix

OK, I will try to use my_udevsettle and see the effect under RHEL6/RHEL7. I hope this time things will really work out :) Please give me some time for testing - my environment includes mdadm, LVM and DRBD, and I hope to make them all happy...

jsmeix commented at 2015-01-23 11:21:

Below my current simplified changes for 10_include_partition_code.sh that make it work for me:

I removed all "sleep" commands because I think those are only weak band-aid workarounds that fail whenever something actually takes longer than the sleep time.

In particular removing the "sleep 10" (with its meaningless "Wait some time before advancing" comment without a reason why it has to wait and why 10 seconds are the right value here) makes "rear recover" work noticeably faster for me at this point.

I have my_udevsettle before and after each parted and partprobe call, which results in duplicated consecutive my_udevsettle calls in diskrestore.sh, but it is fail-safe this way; two consecutive "udevadm settle" calls do not cause any harm and at least the second one returns immediately.

I do not specify a non-default timeout for "udevadm settle" because the default timeout (120 seconds according to my udevadm man page) is certainly "the right thing" from the udev authors who know much better than I what the most reasonable default is.

If a special device needs 119 seconds until its udev events are processed then it is just what that device needs.

If "udevadm settle" fails, the whole diskrestore.sh fails because it has "set -e" which is also correct because "rear recover" must not blindly proceed in case of errors.

@@ -45,16 +45,8 @@

     cat >> "$LAYOUT_CODE" <<EOF
-Log "Stop mdadm and pause udev"
+Log "Stop mdadm"
 if grep -q md /proc/mdstat &>/dev/null; then
     mdadm --stop -s >&2 || echo "stop mdadm failed"
 fi
-
-type -p udevadm >/dev/null
-if [ $? -eq 0 ]; then
-    udevadm control --stop-exec-queue >&2 || echo "pause udev via udevadm failed"
-else
-    udevcontrol stop_exec_queue >&2 || echo "pause udev via udevcontrol failed"
-fi
-
 Log "Erasing MBR of disk $disk"
 dd if=/dev/zero of=$disk bs=512 count=1
@@ -65,15 +57,4 @@

     cat >> "$LAYOUT_CODE" <<EOF
-Log "Resume udev"
-type -p udevadm >/dev/null
-if [ $? -eq 0 ]; then
-    udevadm control --start-exec-queue >&2 || echo "resume udev via udevadm failed"
-else
-    udevcontrol start_exec_queue >&2 || echo "resume udev via udevcontrol failed"
-fi
-
-# Wait some time before advancing
-sleep 10
-
 # Make sure device nodes are visible (eg. in RHEL4)
 my_udevtrigger
@@ -116,6 +97,7 @@
     cat >> "$LAYOUT_CODE" <<EOF
 LogPrint "Creating partitions for disk $device ($label)"
+my_udevsettle
 parted -s $device mklabel $label >&2
-sleep 1
+my_udevsettle
 EOF

@@ -180,6 +162,7 @@
             fi
             cat >> "$LAYOUT_CODE" <<EOF
+my_udevsettle
 parted -s $device mkpart '"$name"' ${start}B $end >&2
-sleep 1
+my_udevsettle
 EOF
         else
@@ -192,6 +175,7 @@
             end_mb=$(( end/1024/1024 ))
             cat  >> "$LAYOUT_CODE" <<EOF
+my_udevsettle
 parted -s $device mkpart '"$name"' $start_mb $end_mb >&2
-sleep 1
+my_udevsettle
 EOF
         fi
@@ -220,14 +204,26 @@
                 continue
             fi
-            echo "parted -s $device set $number $flag on >&2" >> $LAYOUT_CODE
+            (
+            echo "my_udevsettle"
+            echo "parted -s $device set $number $flag on >&2"
+            echo "my_udevsettle"
+            ) >> $LAYOUT_CODE
         done

         # Explicitly name GPT partitions.
         if [[ "$label" = "gpt" ]] && [[ "$name" != "rear-noname" ]] ; then
-            echo "parted -s $device name $number '\"$name\"' >&2" >> $LAYOUT_CODE
+            (
+            echo "my_udevsettle"
+            echo "parted -s $device name $number '\"$name\"' >&2"
+            echo "my_udevsettle"
+            ) >> $LAYOUT_CODE
         fi
     done < <(grep "^part $device " $LAYOUT_FILE)

     # Ensure we have the new partitioning on the device.
-    echo "partprobe -s $device >&2" >> "$LAYOUT_CODE"
+    (
+    echo "my_udevsettle"
+    echo "partprobe -s $device >&2"
+    echo "my_udevsettle"
+    ) >> "$LAYOUT_CODE"
 }

jhoekx commented at 2015-01-23 11:41:

In particular removing the "sleep 10" (with its meaningless "Wait some time before advancing" comment without a reason why it has to wait and why 10 seconds are the right value here) makes "rear recover" work noticeably faster for me at this point.

The sleeps were there because it was the only way we could get things (in general :-)) to work on RHEL 4. The udev version there was not able to help us.

I believe we don't really have to care about that anymore, so removing the sleeps would be great.

gdha commented at 2015-01-23 12:36:

@jsmeix will you prepare the pull request when you are satisfied with the fix?

jsmeix commented at 2015-01-23 12:52:

I think first tbsky needs to verify if it still works on RHEL6/RHEL7 using only 'settle' instead of 'stop_exec_queue'/'start_exec_queue'.

jsmeix commented at 2015-01-23 13:17:

Gratien,
my current changes are in
https://github.com/jsmeix/rear
but I have not yet tested it on SLES12.

schlomo commented at 2015-01-23 13:22:

Hi,

maybe this is a good time to officially discontinue support for RHEL4 and SLES9?

We should also announce this on the mailing list etc. and stop publishing updated RPMs for those distros.

IIRC there are actually still some people out there who use those old distros; at least, some time ago somebody contacted me with a SLES9 question.

Regards,
Schlomo


jsmeix commented at 2015-01-23 13:28:

From my point of view even officially discontinue support for SLES10.

Reasoning:

Why should an admin of such an old distribution want to upgrade rear when he already has a working disaster recovery procedure with whatever older rear version?

If an admin of such an old distribution does not yet have a working disaster recovery set up and finds out right now (all of a sudden ;-) that he might need one, then he can try out the current rear - perhaps it works for him - but officially those old distributions are no longer supported by upstream rear.

schlomo commented at 2015-01-23 14:03:

Sure. I will support whatever decision, provided it is announced and discussed on the mailing list and that we do not build releases for platforms that are not supported by those releases (hint: the OpenSUSE Build Service needs to be adjusted).


gdha commented at 2015-01-23 14:10:

When we release rear-1.17 we could officially discontinue RHEL 4 and SLES 9. I would keep SLES 10 at least for rear-1.17 (there is still a huge user base around).
This means we should create a support matrix on our web site (and in the release notes).

jsmeix commented at 2015-01-23 14:59:

My current changes work for me on openSUSE 13.2, SLES 12, and Fedora 21: all with their default way of using btrfs, i.e. a single btrfs filesystem with the default subvolumes, and with their default other filesystems (/home on xfs on openSUSE 13.2 and /boot on ext4 on Fedora 21), on a single hard disk.

On Fedora 21 I need to do in the rear recovery system

ln -s /bin/true /sbin/udevd
ln -sf /bin/true /bin/rpc.statd

see https://github.com/rear/rear/issues/531
and https://github.com/rear/rear/issues/532

For my tests I used the RPM package rear1161btrfs-1.16.1.git201501071534-5.1.noarch.rpm available from
http://download.opensuse.org/repositories/home:/jsmeix/

This RPM is built in the openSUSE build service project home:jsmeix therein the source package rear1161git201501071534btrfsFedora that contains rear-1.16.1-git201501071534.tar.gz and my changes in rear-1.16.1-git201501071534.btrfs_generic_fedora.diff - see
https://build.opensuse.org/package/show/home:jsmeix/rear1161git201501071534btrfsFedora

I did not test my current changes on top of the current rear master in
https://github.com/jsmeix/rear

Currently I don't know a simple way to test directly from GitHub.

I think for testing I need it in an RPM package to be able to correctly play around with different versions (i.e. correctly remove old stuff and/or update to something newer).

tbsky commented at 2015-01-24 16:40:

Hi,
I am sorry for the delay. I was busy and could not find a suitable physical machine for testing, so now I tested it in a VM. Unfortunately the test failed.

I was testing on a VM which already had software RAID on it. The script disables mdadm at first, but when parted creates a RAID partition on the disk, udev wakes up mdadm which locks the disk, so further parted commands on that disk fail since the disk is now busy. I remember seeing this behavior before; that is why I chose to pause udev rather than settle the queue when I made the patch.

So the situation is harder for me now. Maybe I should find a way to stop mdadm when the OS boots, or I need to erase the mdadm labels from the disk and reboot again before doing "rear recover".

tbsky commented at 2015-01-24 17:03:

Hi,
I needed to refresh my memory: when I tried rear last year, I already noticed this situation and asked on the mailing list. udev wakes up mdadm via /lib/udev/rules.d/65-md-incremental.rules on RHEL6. I don't know whether we can minimize the udev rules when doing "rear recover"; I also don't know whether there are other udev rules which would prevent parted from working. Btw, RHEL7 does not seem to have this mdadm rule.

tbsky commented at 2015-01-25 08:00:

Hi,
after more testing: /lib/udev/rules.d/65-md-incremental.rules comes with the "mdadm" package, and both RHEL6 and RHEL7 have this rule. But only under RHEL6 does this rule wake up mdadm and prevent "parted" from working. I tested several times under RHEL7 and "rear recover" works fine.

tbsky commented at 2015-01-25 16:46:

@jsmeix

I think just removing "65-md-incremental.rules" when mdadm is detected is good enough: since rear assembles the software RAID manually, we don't need that rule. Could you please add the lines below to your patch and make the pull request? I have tested it together with your patch under RHEL6/7 and it seems fine.

@@ -47,6 +47,10 @@
 Log "Stop mdadm"
 if grep -q md /proc/mdstat &>/dev/null; then
     mdadm --stop -s >&2 || echo "stop mdadm failed"
+    # Prevent udev waking up mdadm later
+    if [ -e /lib/udev/rules.d/65-md-incremental.rules ] ; then
+        rm -f /lib/udev/rules.d/65-md-incremental.rules || echo "rm 65-md-incremental.rules failed"
+    fi
 fi
 Log "Erasing MBR of disk $disk"
 dd if=/dev/zero of=$disk bs=512 count=1

jsmeix commented at 2015-01-26 09:03:

tbsky,
again many thanks for your testing and your descriptive analysis of what goes on on Red Hat.

Regarding your patch "rm -f /lib/udev/rules.d/65-md-incremental.rules":

First and foremost: I know basically nothing about MD devices and mdadm.

Nevertheless I do not like in general how it is currently implemented.

Reasoning:

According to your analysis the "rm -f /lib/udev/rules.d/65-md-incremental.rules" is a RHEL6-specific hack to make it work there, but nothing in the code makes this clear; the comment "Prevent udev waking up mdadm later" even indicates that the "rm -f /lib/udev/rules.d/65-md-incremental.rules" is a generic solution for every Linux distribution.

I guess for SUSE the mdadm stuff is implemented differently (perhaps, as I learned above, even as built-in hacks in parted or elsewhere ;-), which requires different SUSE-specific hacks, and then come Debian, Ubuntu, and all the various other Linux distributions...

You described the real problem as follows: "when parted create a raid partition on disk, the udev will wake up mdadm and lock the disk, so the further parted command with the disk will fail since the disk is busy now".

Accordingly I would like to implement a generic solution that deals with the real problem.

As far as I understand it, the real problem is that while the various parted commands are run, stuff that deals with MD devices (i.e. "mdadm") must not run at the same time.

Also pausing udev as a whole is not an adequate solution for this problem because it is too big a hammer (in particular parted needs udev to do the device node stuff that is required after parted did its job).

Therefore - as far as I understand it - the real generic solution should be to find a way to reliably pause specifically only the stuff that deals with MD devices while the various parted commands are run.

Just an idea (from someone who doesn't know about MD devices):

If any MD devices stuff is done via /sbin/mdadm then linking it to be /bin/true could be used to fool anything that wants to do MD devices stuff e.g. something like:

# Pause any kind of stuff that deals with MD devices
# while the various parted commands are run.
# Reason: When parted creates a raid partition on disk
# and mdadm is active, then mdadm locks the disk
# so further parted commands with the disk will fail
# since the disk is busy now.
MDADM=$( type -P mdadm )
$MDADM --stop -s >&2 || echo "stop mdadm failed"
mv $MDADM $MDADM.away_while_parted_runs
ln -s $( type -P true ) $MDADM
...
parted ...
...
parted ...
...
partprobe ...
...
# Re-enable stuff that deals with MD devices:
mv -f $MDADM.away_while_parted_runs $MDADM

Again: This is only an idea from someone who doesn't know about MD devices.

jsmeix commented at 2015-01-26 09:12:

FYI:
On SLES12 we have
/lib/udev/rules.d/63-md-raid-arrays.rules
/lib/udev/rules.d/64-md-raid-assembly.rules
Both belong to our mdadm RPM package.
(I do not understand anything about what those rules do.)

jsmeix commented at 2015-01-26 10:30:

To all,

perhaps I found out the real root cause of the trouble here.

I asked a colleague who is an expert for the storage setup in the SUSE setup tools YaST and AutoYaST.

He told me that the root cause why MD stuff becomes active could be the following:

When recovery is done on a hard disk that was already used before (in particular when recovery is done on the same machine), then parted re-creates the same partitions that were already on the disk.

When a new partition is set up, there are udev rules that run whatever stuff inspects the newly created partition.

When the newly created partition is the same as it was before on the disk, the content of the data blocks in that partition is also the same as it was before on the disk.

MD devices and similar things like LVM and some others maintain information that is specific to their particular kind of usage in some kind of "super blocks", which are stored in a specific way for each kind of special usage.

When udev runs MD stuff that inspects the newly created partition, it may find its specific "MD super block" and read from it that this partition belongs e.g. to an MD RAID, and then the MD tools do "the usually right thing" in this case.

If my colleague is right with his assumption, we have a very generic problem to solve here, which is:

Before parted is run, we must make sure the partition which will be created by parted no longer contains any data that could mislead other tools which inspect the partition after it is created anew.

My colleague told me that we (i.e. SUSE) have the same problem in AutoYaST (a tool for automated installations via YaST) where users often do an automated re-installation onto the same hard disk.

I was told that there is the "wipefs" tool that is specifically made to wipe those special "super blocks".

Perhaps rear can also use wipefs?

The problem is that rear would have to wipe those special "super blocks" in a partition before parted creates the partition, because after parted has created the partition udev would immediately start the partition-analyzing tools.
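A minimal sketch of how rear might use wipefs, assuming the old partitions (here /dev/sda1 and /dev/sda2 as examples) still exist before the disk is repartitioned - not verified, only to illustrate the direction:

# remove filesystem / RAID / LVM signatures from the old partitions so that udev-triggered
# tools do not find stale "super blocks" after parted recreates the same partitions
for old_partition in /dev/sda1 /dev/sda2 ; do
    wipefs -a $old_partition || echo "wipefs $old_partition failed"
done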

All this is currently only FYI; it is not yet verified by me.

tbsky,
can you re-test on RHEL6 on a new hard disk (e.g. on a new virtual machine) whether or not MD stuff still interferes with parted then?

jsmeix commented at 2015-01-26 10:44:

Perhaps what causes the MD tools to run is this parted command in
10_include_partition_code.sh

        # Get the partition number from the name
        local number=$(get_partition_number "$partition")
        local flags="$(echo $flags | tr ',' ' ')"
        local flag
        for flag in $flags ; do
            if [[ "$flag" = "none" ]] ; then
                continue
            fi
            (
            echo "my_udevsettle"
            echo "parted -s $device set $number $flag on >&2"
            echo "my_udevsettle"
            ) >> $LAYOUT_CODE
        done

From "man parted" I read

  set partition flag state
    Change  the  state  of  the  flag  on  partition to state.
    Supported flags are: "boot", "root", "swap", "hidden",
    "raid", "lvm", "lba", "legacy_boot" and "palo".
    state should be  either  "on" or "off".

Perhaps 10_include_partition_code.sh runs something like

parted -s $device set $number raid on

and the "raid on" flag causes that udev runs MD stuff?

tbsky,
can you inspect the /var/lib/rear/layout/diskrestore.sh file in your recovery system to see what exact parted commands rear generates in your case on RHEL6 when it does not work?
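For example, something like this in the recovery system shows which parted, partprobe, and mdadm commands ended up in the generated script, in particular any "set <number> raid on" (illustrative command, any grep pattern will do):

grep -E 'parted|partprobe|mdadm' /var/lib/rear/layout/diskrestore.sh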

jsmeix commented at 2015-01-26 11:50:

tbsky,
perhaps the following might be helpful regarding how to find out what exactly is on your partitions in the original system, so that we could then better understand what goes on internally and why MD stuff may interfere with parted:
http://karelzak.blogspot.de/2009/11/wipefs8.html

tbsky commented at 2015-01-26 12:03:

@jsmeix

  1. As I said in the previous post, udev waking up mdadm only happens when you use disks that already have software RAID on them; you are safe if you use brand new disks.
  2. I was trying to find a general solution (pausing udev), not one specific to mdadm, but it failed under SUSE. I don't think parted needs udev while it is running, but maybe I am wrong.
  3. So I am focusing on mdadm now (since this is the problem I met), and I deal with the rule. If you check the rule, you will find that if the system wants to start mdadm manually then the rule should be skipped: the rule detects anaconda (the Red Hat installer), and if it finds itself running under anaconda, it does not run. Removing the rule has the same effect. I don't know whether other non-Red Hat systems have this kind of rule; since it detects anaconda, I think it may be specific to Red Hat.
  4. Linking mdadm to /bin/true may be a generic way; I would like to try it. But are there any other users with software RAID who could try it?

jsmeix commented at 2015-01-26 12:31:

tbsky,

your item 3 above shows that /lib/udev/rules.d/65-md-incremental.rules seems to be specific to Red Hat + anaconda, so for rear this rule should not be there, and other distros probably do not have this udev rules file - which makes your "rm -f /lib/udev/rules.d/65-md-incremental.rules" the right solution for rear.

All that is missing is a comment in your patch that describes (as you wrote above) why "rm -f /lib/udev/rules.d/65-md-incremental.rules" is the right solution for rear to make it work on Red Hat and elsewhere too (because others do not have that rule).
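Such a comment could for example read like this (the wording is only a suggestion, not the actual commit):

if grep -q md /proc/mdstat &>/dev/null; then
    mdadm --stop -s >&2 || echo "stop mdadm failed"
    # On RHEL 6 the udev rule 65-md-incremental.rules re-assembles MD arrays as soon as
    # parted recreates a partition that still carries an old RAID superblock, which makes
    # the disk busy and lets further parted commands fail. rear assembles the arrays
    # itself later, so the rule is not needed in the recovery system and is removed here.
    if [ -e /lib/udev/rules.d/65-md-incremental.rules ] ; then
        rm -f /lib/udev/rules.d/65-md-incremental.rules || echo "rm 65-md-incremental.rules failed"
    fi
fi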

FYI:
I am currently setting up a SLES12 system with software RAID0 on a virtual machine (using raw partitions /dev/sda5 and /dev/sda6 as /dev/md0 with ext4 on it) - my first usage of it - let's see if rear can recover it...

jsmeix commented at 2015-01-26 13:31:

For me it "just worked" to recreate a SLES12 system with software RAID0 on a virtual machine (using raw partitions /dev/sda5 and /dev/sda6 as /dev/md0 with ext4 on it).

I recreated it first on a new virtual KVM machine and then a second time onto that recreated system (i.e. on the already used disk).

I used rear with my latest changes as in https://github.com/rear/rear/pull/538 but I used my RPM rear1161btrfs-1.16.1.git201501071534-5.1.noarch.rpm from
http://download.opensuse.org/repositories/home:/jsmeix/SLE_12/noarch/

tbsky commented at 2015-01-26 13:37:

Hi,
my testing shows that linking mdadm to /bin/true prevents udev from waking up mdadm, but I don't know where to rename mdadm back. If you rename it back in 10_include_partition_code.sh, udev will soon wake up mdadm and parted may fail when doing the next disk (at least in my testing the timing was bad, so mdadm woke up while some other disks were being done). If you use sda5 and sda6 you may not see the problem, since these two partitions belong to the same disk.

tbsky commented at 2015-01-26 15:51:

Hi,
another method which I think may be generic is to pause udev, link udevadm to /bin/true, and rename it back afterwards. That may resolve unknown problems which behave like the mdadm one, but of course it may cause other unknown problems.

Btw, I tested "udevadm control --stop-exec-queue; udevadm settle --timeout=20". RHEL 6/7 is happy and it returns immediately. So Linux distributions are sometimes very different...

schlomo commented at 2015-01-26 16:02:

wipefs seems like a nice idea, and really simple to call as wipefs -a <blockdevice>.

However, if I understand this thread correctly we should call it on all previously existing partitions before wiping the hard disk? Or maybe just delete the first and last couple of MB of each previously existing partition?

Maybe another idea would be to just wipe the areas before and after the partition boundaries before calling parted to create those partitions (yes, I don't have an idea right now how to do that without extending parted).

Maybe we should have a feature that simply completely wipes a hard disk before we use it? Maybe in case of a real recovery (as opposed to a test run), the actual recovery speed is less important than having a truly reliable recovery?


jsmeix commented at 2015-01-27 08:35:

I agree that rear needs a generic "cleanupdisk" function that basically makes an already used harddisk behave as if it was a new harddisk.

I think such a function cannot be implemented for the next rear 1.17 but perhaps it is possible for rear 1.18 or even later depending on how complicated it becomes in practice (I have new harddisks in mind that might have ex factory a somewhat hidden special partition which must not be wiped or something like this - isn't there something like this for UEFI boot, see http://en.wikipedia.org/wiki/EFI_System_partition ).

Regardless of how complicated it is in practice, one dedicated function that cleans up the disk before anything else is done makes the way rear works much cleaner (instead of various workarounds here and there as needed).

For rear 1.17 I will make a submitrequest with tbsky's "rm -f /lib/udev/rules.d/65-md-incremental.rules" workaround for RHEL6 with an explanatory comment in the code.

Regarding speed versus reliable recovery:
Here one of Johannes Meixner's famous fundamentals ;-)
"Speed is less important than correctness."

Nevertheless I think "completely wipe a hard disk" is not feasible by default on today's hard disks of hundreds of GB or even several TB because it would take too long.

jsmeix commented at 2015-01-27 12:43:

I think the original issue here is now fixed:

Summary:

My first
commit b8ade0f98c50c19947f71208528d4c6e193dd0a6
(Removed all "sleep" and the "udevadm --stop-exec-queue ... --start-exec-queue" calls - instead call the my_udevsettle function) fixed the initial issue that "udevadm settle --timeout=20" hangs endlessly in the recovery system on SUSE, because udev is kept active now.

But that caused a regression on RHEL6 where the active udev (via /lib/udev/rules.d/65-md-incremental.rules) wakes up mdadm, which makes further parted commands fail; that is fixed by
commit ff1bb730d38eb42e2abf866736cf188bef0b8b9b
which removes /lib/udev/rules.d/65-md-incremental.rules.

tbsky commented at 2015-01-27 12:48:

@jsmeix:

thanks a lot for your help :)

jsmeix commented at 2015-01-27 12:56:

I submitted the new separated issue
https://github.com/rear/rear/issues/540
"Implement a generic 'cleanupdisk' function."

jsmeix commented at 2015-01-27 13:01:

tbsky,

I thank you a lot for all your testing efforts and your descriptive feedback that made me understand what goes on on RHEL6.

Regarding "your testing efforts":
I would very much appreciate it if you could verify with the current git master code that it now actually works also on RHEL6.

tbsky commented at 2015-01-28 04:12:

@jsmeix

Tried git master under RHEL6/7 - it is working fine :)

gdha commented at 2015-02-02 15:39:

@jsmeix @tbsky Can this issue be closed? It is getting too long to follow.

tbsky commented at 2015-02-03 02:58:

@gdha
my testing was ok. so it's fine for me to close it.

jsmeix commented at 2015-02-04 11:48:

@gdha
it is also fine for me to close it.

Cf. my above comment
https://github.com/rear/rear/issues/533#issuecomment-71642020

gdha commented at 2015-02-16 18:37:

Added this to the release notes, so we can close this issue.

gdha commented at 2015-05-26 12:16:

A last update from https://bugzilla.novell.com/show_bug.cgi?id=914245:
Factory/parted has been updated to 3.2 and the mentioned patch has been removed. Closing as fixed.


[Export of Github issue for rear/rear.]