#793 Issue closed: partprobe fails

Labels: bug, cleanup, support / question, fixed / solved / done

thefrenchone opened issue at 2016-03-08 18:27:

  • rear version : 1.17.2 and recent Git
  • OS version : RHEL 7
  • rear configuration files: N/A

udevadm settle doesn't appear to function correctly in RHEL 7. During the disklayout stage of recovery partprobe -s /dev/sda fails 100% of the time with resource is busy.

Error: Error informing the kernel about modifications
to partition /dev/sda1 - Device or resouce busy.
This means linux won't know about any changes
you have made to the /dev/sda1 until you reboot - so
you shuldn't mount it or use it in any way before rebooting.

There are a a few workaround I have tried but most of them don't work 100% of the time. Sometimes including a sleep before partprobe works and sometimes sleeping for a minute doesn't work. The only reliable workaround I've found is to just call partprobe multiple times:

from: /usr/share/rear/layout/prepate/GNU/Linux/10_include_partition_code.sh
echo "partprobe -s $device >&2 || true"
echo "partprobe -s $device >&2 || true""
echo "partprobe -s $device >&2"

The "|| true" is to bypass the error checking.

Is there a better workaround planned? The main issue appears to be udevadm settle not doing what it's supposed to. I have noticed issue #791 points out issues with udevadm settle when partitioning which I have also seen on a regular basis.

jsmeix commented at 2016-03-09 09:50:

I think hammering on the poor kernel/udev/systemd/whatever
with consecutive "partprobe" calls is not the very best thing.

But following the basic idea from https://github.com/rear/rear/issues/791

The right way is to wait for the actual "thingy" that is needed by the subsequent commands.

the following could be the right way in 10_include_partition_code.sh

    # Ensure we have the new partitioning on the device
    # by waiting for the actual "thingy" that is needed afterwards
    # (cf. https://github.com/rear/rear/issues/793)
    # which means we wait for a successful run of "partprobe"
    # to ensure the new partitioning can be used afterwards.
    # If there was no successful run of "partprobe"
    # within 120 seconds we give up with an error:
    (
    echo "for countdown in \$( seq 120 -1 0 ) ; do"
    echo "    test '0' = \$countdown && Error 'partprobe $device failed' "
    echo "    partprobe -s $device >&2 && break || true"
    echo "    LogPrint \"waiting for 'partprobe $device' (\$countdown)\""
    echo "    sleep 1"
    echo "done"
    ) >> "$LAYOUT_CODE"

Please test if it works reilably this way.

Because you mentioned
"sometimes sleeping for a minute doesn't work"
I use 120 seconds as maximum retry time.
Perhaps you need even more.

thefrenchone commented at 2016-03-09 19:41:

Your code worked for me. The countdown is a lot cleaner than what I had.

To clarify what I mean with sleep not working: it appears sleeping alone isn't enough for udev to get it's act together. Sometimes a sleep 1 works. Usually a sleep 10 works. Occasionally udev just needs to be kicked before it'll ever work. So sleep was never a reliable method for some reason.

One alternative that also worked reliably was using vgchange to put the root lvm group down then up before the partprobe. This depended on the system using lvm but lvm appears to just wait until the device wasn't busy instead of failing which was what I wanted.

The 120 second timeout may end up being far more than 120 seconds because partprobe takes time to run. Plus time was never the actual issue, it's more of udev needing a kick before partprobe is called. So a final fix may not need sleep, although it's probably not a bad idea.

jsmeix commented at 2016-03-10 09:31:

Regarding "udev just needs to be kicked":

As far as I was told "kicking udev" does not make sense.

Reason:

Again the crucial point is that the whole suff works based on events.

When "nothing happens" in udev it means there are no events.

When there are no events it does not change anything to "kick udev" because udev will do nothing when there are no events.

Again the right way is the same idea as in https://github.com/rear/rear/issues/791 which is here

"kick" the actual "thingy" that generates events for udev

The actual "thingy" that generates events for udev is the kernel.

When a partitioning tool has written whatever data on a harddisk
that is meant to be used by the kernel as partitioning information,
then the kernel does not "magically" know that those blocks
which were written right now are partitioning data. The kernel
blindly writes the blocks onto the harddisk. Afterwards the kernel
must be explicitly told to read the new partitioning information
from the harddisk. Traditionally this was done by calling "partprobe"
after using a partitioning tool and that is the reason why there is
that "partprobe" call at the end of 10_include_partition_code.sh

But I was told that nowadays partitioning tools have been enhanced
that when finishing they do automatically what "partprobe" does
which means nowadays the explicit "partprobe" call should be
unnecessary.

Actually nowadays the explicit "partprobe" call could even lead
to some kind of confusion or race conditioned failures as follows:

When a nowadays partitioning tool does automatically what
"partprobe" does and additionally an explicit "partprobe" call
is done then the "partprobe" call could fail because the harddisk
device is blocked/busy because the kernel is still busy with the
perceding task that was triggered automatically by the partitioning
tool. Probably this describes the reason why in your case
"partprobe -s /dev/sda fails 100% of the time with resource is busy".

Therefore when a nowadays partitioning tool (parted) is used,
the explicit "partprobe" call should be skipped.

The problem in rear is:

How to autodetect whether or not the particular parted program
that is installed on a particular user's system does automatically
what partprobe does so that the explicit partprobe call can be
skipped?

I am totally against testing opearting system verdor and version
because that is totally against the very basic idea behind https://github.com/rear/rear/issues/791 which is here:

The right way is to test for the actual "thingy".

Because I cannot imagine how I could test a particular parted program
whether or not it does automatically what partprobe does,
I think the best idea is a backward compatible and fail safe approach
as follows:

After the last parted call do a hardcoded sleep of 1 second
so that the kernel/udev get a bit of time to process possibly
automated triggers of parted (i.e. for nowadays parted),
afterwards call partprobe explicitly, and wait 10 seconds if it failed,
finally call partprobe again explicitly and ignore if that also failed:

    # Try to ensure the kernel uses the new partitioning
    # see https://github.com/rear/rear/issues/793
    # First do a hardcoded sleep of 1 second so that
    # the kernel and udev get a bit of time to process
    # automated "read partition table changes" triggers
    # of nowadays parted.
    # Then to be backward compatible with traditional parted
    # call partprobe explicitly to trigger the kernel
    # to "read partition table changes" and if that fails
    # wait 10 seconds before a first retry and if that fails
    # wait 60 seconds before a final retry and if that fails
    # ignore that failure and proceed "bona fide" because
    # nowadays it should "just work" regardless of partprobe.
    (
    echo "sleep 1"
    echo "if ! partprobe -s $device >&2 ; then"
    echo "    LogPrint 'retrying partprobe $device after 10 seconds' "
    echo "    sleep 10"
    echo "    if ! partprobe -s $device >&2 ; then"
    echo "        LogPrint 'retrying partprobe $device after 1 minute' "
    echo "        sleep 60"
    echo "        if ! partprobe -s $device >&2 ; then"
    echo "            LogPrint 'partprobe $device failed, proceeding bona fide' "
    echo "        fi"
    echo "    fi"
    echo "fi"
    ) >> "$LAYOUT_CODE"

jsmeix commented at 2016-03-10 09:40:

FYI regarding the "I was told" in my comment above:

I am not at all a kernel or udev or partitioning tools expert.

I only report here what "I was told".

gdha commented at 2016-03-10 15:28:

@jsmeix is this a bug? And, do you want to commit before we do a release?

jsmeix commented at 2016-03-10 15:41:

@gdha good question - I don't know.

For my tests all had "just worked" on my SLE12 test systems but as I explained in https://github.com/rear/rear/issues/791 I think the whole udev related waiting/triggering code in rear should be reviewed/reconsidered but I think this should happen without any time pressure (i.e. after the 1.18 release), cf. my "Milestone" setting in https://github.com/rear/rear/issues/791

On the other hand if udev-related things often fail in newer/newest Red Hat versions the issue might become a blocker bug for the 1.18 release - and then I could of course do a pull request.

jsmeix commented at 2016-03-11 14:55:

@gdha have a look at my https://github.com/rear/rear/pull/797

If you like it please merge it.

At least from my point of view it is backward compatible
so that there should be no regressions if you merge it.

@thefrenchone please test my https://github.com/rear/rear/pull/797 if that also makes it work for your case.

thefrenchone commented at 2016-03-11 17:10:

@jsmeix I've tested some variations of your code

    echo "for countdown in \$( seq 120 -1 0 ) ; do"
    echo "    test '0' = \$countdown && Error 'partprobe $device failed' "
    echo "    partprobe -s $device >&2 && break || true"
    echo "    LogPrint \"waiting for 'partprobe $device' (\$countdown)\""
    echo "    sleep 1"
    echo "done"
    ) >> "$LAYOUT_CODE"

There are a few modification I ended up making for my testing. At most this is 10 attempts of partprobe with sleep getting longer on each attempt.

    echo "for countdown in \$( seq 0 10 ) ; do"
    echo "    test '10' = \$countdown && Error 'partprobe $device failed' "
    echo "    partprobe -s $device >&2 && break || true"
    echo "    LogPrint \"waiting for 'partprobe $device' (\$countdown)\""
    echo "    sleep \$countdown"
    echo "done"
    ) >> "$LAYOUT_CODE"

Testing this I found that partprobe always succeeded on the second attempt.

Looking at your commit I don't see anything that will cause any issues. For my use case I expect the first partprobe will always fail and the second to always succeed. In my testing of your commit the second partprobe always succeeded. Anything less that 'sleep 3' and the second partprobe would fail. At least that's the case today.

The timing difference between the two methods shows how inconsistent this problem is. Using the loop method I only have a 'sleep 0' between the first and second partprobe. The if/else method required at least a sleep 3 between the first and second partprobe.

For my uses I'm probably going to keep the countdown with more partprobe attempts. Sleeping for 10 seconds before the second attempt is safe but I prefer to make more attempts in that time frame.

Thank you @jsmeix for #797

jsmeix commented at 2016-03-14 08:21:

@thefrenchone many thanks for your feedback!

jsmeix commented at 2016-05-23 08:59:

The error message in the initial comment
https://github.com/rear/rear/issues/793#issue-139355299
matches exactly what is described here:
https://github.com/cockpit-project/cockpit/issues/3177
I guess this one describes the root cause.

jsmeix commented at 2016-06-29 09:15:

Now it happened for the very first time to me,
see https://github.com/rear/rear/issues/897

jsmeix commented at 2016-06-29 09:37:

No https://github.com/rear/rear/issues/897 is false alarm.

I wished I could really reproduce such a parted failure.

pcahyna commented at 2017-10-13 17:41:

I think I might have a more elegant fix. Since udevd locks the device using flock(), what about avoiding such races by

diff --git a/usr/share/rear/conf/Linux-i386.conf b/usr/share/rear/conf/Linux-i386.conf
index b1a3002..7d1c364 100644
--- a/usr/share/rear/conf/Linux-i386.conf
+++ b/usr/share/rear/conf/Linux-i386.conf
@@ -8,6 +8,7 @@ PROGS=(
 "${PROGS[@]}"
 grub
 partprobe
+flock
 lilo
 fdisk
 cfdisk
diff --git a/usr/share/rear/layout/prepare/GNU/Linux/10_include_partition_code.sh b/usr/share/rear/layout/prepare/GNU/Linux/10_include_partition_code.sh
index 9dad6a0..b84354f 100644
--- a/usr/share/rear/layout/prepare/GNU/Linux/10_include_partition_code.sh
+++ b/usr/share/rear/layout/prepare/GNU/Linux/10_include_partition_code.sh
@@ -235,7 +235,7 @@ EOF
     # Ensure we have the new partitioning on the device.
     (
     echo "my_udevsettle"
-    echo "partprobe -s $device >&2"
+    echo "flock $device partprobe -s $device >&2"
     echo "my_udevsettle"
     ) >> "$LAYOUT_CODE"
 }

jsmeix commented at 2017-10-16 10:14:

@pcahyna
on first galnce I don't like to introduce one more tool "flock"
to work around the fundamentally inappropriate way
how that things are implemented in ReaR - at least
as far as I understand how that actually works, cf.
https://github.com/rear/rear/issues/791

We would need to ensure udev works same
for all (even relatively old) Linux distributions that are
supported by ReaR and that for something where
ReaR should not at all interfere with udev, cf.
https://github.com/rear/rear/issues/791

Something else might suddenly fail in whatever
unexpected ways when ReaR locks devices.

I wonder if partprobe needs to use the device exclusively,
why then partprobe itself does not lock the device
but the caller of partprobe must do the locking?

Finally - as far as I know - partprobe seems to be
only needed in older Linux distributions, cf.
https://github.com/rear/rear/issues/791#issuecomment-195323149

I think the whole "partitioning code" in ReaR would need
to be overhauled to make it work in compliance with
https://github.com/rear/rear/issues/791

But - as always - I find no time to do such major work
because - as always - this or that particular issue
of this or that particular user gets in the way.

pcahyna commented at 2017-10-16 17:12:

@jsmeix

on first galnce I don't like to introduce one more tool "flock"

Is it a problem? flock has been a part of the util-linux package for many years (it was there already in the initial import to Git in 2006) and util-linux is one of the most basic packages. ReaR most likely depends on it in many other places.

to work around the fundamentally inappropriate way how that things are implemented in ReaR

Any "appropriate way" to implement this will need some kind of synchronization to prevent races (especially, as you say, with udev on the relatively old Linux distributions) and so I don't see how using "flock" is a workaround rather than a step towards the right solution. Retrying in a loop (as done here and also in #1370 which is a very similar kind of issue) is what I would call a workaround. Or do you have a better suggestion how to do this synchronization? "udevadm settle" does not seem to work reliably (I think you pointed out as well that it is not the right solution).

Something else might suddenly fail in whatever
unexpected ways when ReaR locks devices.

One can never exclude unexpected problems, but I believe the chance is very slim. "Something else" would need to use locking as well to be affected by locking in ReaR and in this case this "something else" must be prepared to handle the case when the resource is locked because this is why locking is used in the first place.

I wonder if partprobe needs to use the device exclusively,
why then partprobe itself does not lock the device
but the caller of partprobe must do the locking?

I don't know, most likely it is because the authors haven't thought about interaction with udev. Anyway it is the case and we should handle it somehow.

Finally - as far as I know - partprobe seems to be
only needed in older Linux distributions

Indeed, parted nowadays does the same job as partprobe (it is the same code base, so the code is literally identical) and therefore partprobe is unneeded, but this means that parted may encounter exactly the same problem as partprobe does, so IMO the problem merely shifts from one place to another. (This means that parted invocation should be protected by the same lock as partprobe's.)

gozora commented at 2017-10-16 18:05:

I don't have broad experience with flock(1), but as far as I remember Linux does not have mandatory locking hence it make sense to use flock(1) only with cooperating processes.
Reading partprobe source, I can't find any evidence that it cares about file locking, but I might be wrong of course ;-).

V.

pcahyna commented at 2017-10-16 18:22:

@gozora

as far as I remember Linux does not have mandatory locking

Indeed.

hence it make sense to use flock(1) only with cooperating processes.

flock(1) exists to be used with processes which do not use locking themselves, as it is able to execute them with the lock taken, even if they do not take the lock themselves.

(Maybe I don't understand what you mean by "cooperating processes".)

Reading partprobe source, I can't find any evidence that it cares about file locking, but I might be wrong of course ;-).

Exactly. partprobe/parted won't lock the device node, that's why using flock(1) is needed to take the lock on their behalf. (If we want to synchronize with udev using a lock, of course.)

gozora commented at 2017-10-16 18:33:

Don't take me for an expert here, I'm only curious, but if you have system without mandatory locking how can one ensure that locking of whatever file will be followed?

I mean if you flock $device for partprobe I guess your intention is to disallow any other process to access $device, but any other process that wants to access $device is free to do so (e.g. vi $device).
That is where cooperating processes comes in play. I'd expect that flock have sense to use only with other flock call (e.g flock $device vi $device) or with any application that cares about locks (fcntl (..., F_GETFL, 0)).

V.

pcahyna commented at 2017-10-16 18:41:

@gozora the intention is not to disallow any other process to access $device (this is indeed impossible as you point out), but to disallow only udev to do so (or rather wait for udev to be finished with the device). And udev is a "cooperating process" in your sense (it uses flock(2) to lock the device nodes). So the proposal is that ReaR (or rather its subprocesses, more precisely) becomes a cooperating process as well.

pcahyna commented at 2017-10-16 18:45:

@gozora small nit: you should not use fcntl(..., F_GETFL, 0) but flock(...) (flock(2)) to cooperate with flock(1), because flock(1) calls flock(2) and the two locking mechanisms (fcntl(2) and flock(2)) are independent.

gozora commented at 2017-10-16 19:03:

@pcahyna thank you kind sir ;-)!
If udev really know flock then all you've told make perfect sense to me now!

Now the question remains how reliable/bulletproof this locking is.
How it behaves if things go wrong?
Can it be that that deadlocks occurs?
It would not be much of a fun rebooting server because my DR solution does not allow udev to operate with HW ;-)
As fat as I remember current solution, maybe not so elegant, but is pretty peaceful.

V.

gdha commented at 2017-10-17 08:03:

I've been using the flock trick for many years in my scripts for creating Business Copy pairs as our scheduler had the bad behaviour to launch many similar processes at the same second. We will change the scheduler software soon ;-)

jsmeix commented at 2017-10-17 09:39:

@pcahyna
many thanks for your detailed background explanation.
It helps at least me a lot to understand your idea behind.

In particular your
https://github.com/rear/rear/issues/793#issuecomment-336989952
to avoid races with udev while "we" (i.e. ReaR) are running parted
looks very useful in particular because of your
https://github.com/rear/rear/issues/793#issuecomment-336982358
that parted itself does not care about locking
so that "we" (i.e. ReaR) must care.

@gozora
regarding other processes that may also want to access the device:
While "rear recover" runs there are not much other processes.
In particular "rear recover" is the only user process.
So we only need to care about possible system processes
and - fortunately - udev is a "cooperating process" that
is prepared when ReaR uses "flock(1)".

@pcahyna
I wonder if the flock default "waiting indefinitely for the lock
to become available" is right.
I think after a reasonable timeout ReaR should report something
to the user e.g. like

until flock --timeout=3 $device partprobe -s $device ; do
    echo "Retrying 'flock $device partprobe -s $device'"
    sleep 1
done

The 'sleep 1' is crucial to avoid a thight runnning loop
that can happen e.g. when $device does not exist.
The '--timeout=3' is the only supported way for old flock
e.g. in SLES10 where "man flock" only shows those options:

SYNOPSIS
    flock [ --shared | --timeout=seconds ] lockfile command ..

and '--timeout=seconds' also works on SLES12.

pcahyna commented at 2017-10-18 16:34:

@gozora

Now the question remains how reliable/bulletproof this locking is.
How it behaves if things go wrong?
Can it be that that deadlocks occurs?
It would not be much of a fun rebooting server because my DR solution does not allow udev to operate with HW ;-)

Good question. Deadlocks can IMO not occur because udev uses a non-blocking lock. But one must be careful because udev tries to get a shared (read) lock when processing an event related to the device, and if it can not obtain the lock immediately, it skips processing of the event, which causes events to be effectively lost (not a good design IMO, but we can not do anything about it). See https://github.com/openSUSE/multipath-tools/commit/841977fc9c3432702c296d6239e4a54291a6007a. In this case, problems should not cause "rebooting server because my DR solution does not allow udev to operate with HW" as here we are in restore, not in normal server operation, as @jsmeix pointed out. But still, we should use a read (shared) lock for safety here (flock --shared), which will not interfere with udev processing of the events but still serializes partition table rereading (udev uses another lock - an exclusive one - when rereading the partition table.)

pcahyna commented at 2017-10-18 16:46:

@jsmeix

I wonder if the flock default "waiting indefinitely for the lock
to become available" is right.
I think after a reasonable timeout ReaR should report something
to the user ...

I am not sure about it. Similar issue was found and fixed in ceph, see https://github.com/ceph/ceph/pull/9330 and I think it would be the safest to just do the same as they are doing (call udevadm settle first and then retry partprobe under flock -s (see my previous comment about the use of a shared lock) several times if it fails). They sleep between retries just as in your solution.


[Export of Github issue for rear/rear.]