#791 Issue closed: Waiting for udev and "kicking udev" are wrong (both miss the point)

Labels: enhancement, cleanup, severe improvement

jsmeix opened issue at 2016-03-07 09:28:

Waiting for udev and "kicking udev" are wrong (both miss the point).

Currently rear waits for udev either via something like

udevadm settle

or by simply sleeping for a while.

Both miss the point and therefore neither will ever work reliably, so in the end both are wrong.

That sleeping for a while cannot work reliably is clear.

Below I explain why even testing via "udevadm settle" whether all current udev events have been handled cannot work reliably in every case (hint: the crucial word is "current").

The right way is to wait for the actual "thingy" that is needed by the subsequent commands.

Example:

Wrong:

# Wait for udev to have the harddisk available:
udevadm settle
# Create partitions on the harddisk:
parted ... /dev/sda
# Wait for udev to have the harddisk partitions available:
udevadm settle
# Make filesystem on harddisk partition:
mkfs ... /dev/sda1

Right:

# Wait until harddisk device node is available:
for countdown in $( seq 60 -1 0 ) ; do
    test -b /dev/sda && break
    LogPrint "Waiting for /dev/sda ($countdown)"
    sleep 1
done
test -b /dev/sda || Error "No /dev/sda available"
# Create partitions on the harddisk:
parted ... /dev/sda
# Wait until harddisk partition device node is available:
for countdown in $( seq 60 -1 0 ) ; do
    test -b /dev/sda1 && break
    LogPrint "Waiting for /dev/sda1 ($countdown)"
    sleep 1
done
test -b /dev/sda1 || Error "No /dev/sda1 available"
# Make filesystem on harddisk partition:
mkfs ... /dev/sda1

Reason:

udev is event driven.

This means udev events happen at arbitrary times.

For example the hardware plus the kernel may lead to "delayed" or "very late" udev events.

Therefore code like

parted ... /dev/sda
udevadm settle
mkfs ... /dev/sda1

cannot work reliably, because at the time when "udevadm settle" is called there might not yet be any udev event, so "udevadm settle" exits even though there is not yet a /dev/sda1 because its udev event is for whatever reason "too late".
Actually a udev event is never "too late" but just late (for whatever reason).
Of course it is unlikely that such a udev event is "too late", but that only means it is unlikely that such code fails.
But it also means that when such code fails for one particular user, it is unlikely that others can reproduce it (and understand what actually is wrong), which means it becomes unlikely that such issues will be fixed properly.

jsmeix commented at 2016-03-11 11:09:

Also "kicking udev" is wrong (at least usually), see https://github.com/rear/rear/issues/793#issuecomment-194753486 (excerpts):

As far as I was told "kicking udev" does not make sense.

Reason:

Again the crucial point is that the whole stuff works based on events.

When "nothing happens" in udev it means there are no events.

When there are no events it does not change anything to "kick udev" because udev will do nothing when there are no events.

Again the right way is the same idea as above which is in this case:

"Kick" the actual "thingy" that generates events for udev.

The actual "thingy" that generates events for udev is the kernel.

For example:

When a partitioning tool has written whatever data on a harddisk
that is meant to be used by the kernel as partitioning information,
then the kernel does not "magically" know that those blocks
which were written right now are partitioning data.
The kernel blindly writes the blocks onto the harddisk.
Afterwards the kernel must be explicitly told to read
the new partitioning information from the harddisk.
Traditionally this was done by explicitly calling "partprobe"
after using a partitioning tool, but I was told that nowadays
partitioning tools have been enhanced so that when finishing
they automatically tell the kernel to read the new partitioning
information from the harddisk.

FYI regarding the "I was told" above:
I am not at all a kernel or udev or partitioning tools expert.
I only report here what "I was told".
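
For illustration, a minimal sketch of explicitly telling the kernel to re-read the new partitioning information (the parted call is only an example; partprobe and blockdev are assumed to be available in the recovery system):

    # Create partitions on the harddisk (example parted call):
    parted -s /dev/sda mklabel msdos mkpart primary ext2 1MiB 100%
    # Explicitly tell the kernel to re-read the new partitioning information
    # (recent parted versions are said to do this automatically when finishing):
    partprobe /dev/sda
    # Alternatively "blockdev --rereadpt /dev/sda" asks the kernel
    # to re-read the partition table in a similar way.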

thefrenchone commented at 2016-03-11 17:55:

Just a note that I do observe the current code using my_udevsettle causing an issue on RHEL 7. It's rare for this process to fail, but in a few circumstances parted will fail with the device being busy.

While I was testing my partprobe issue #793 I often had to insert sleeps in front of the parted commands or clear the disks before restarting "rear recover".
The issue is reproducible if I exit rear at the point where partprobe would fail and then restart "rear recover". In general this was reproducible if I interrupted rear after disk layout creation and restarted recovery.

My workaround was to insert sleeps when parted would fail or I would just forcefully remove the root LVM volume group.

Under normal use, when rear doesn't need to be restarted, parted did not fail.

jsmeix commented at 2016-03-14 09:07:

Only an idea regarding "forcefully remove the root LVM volume group before restarting 'rear recover'", see https://github.com/rear/rear/issues/540 "Implement a generic 'cleanupdisk' function".

I guess when you have a higher stack of storage objects like filesystem on top of LVM on top of partitions on top of the harddisk (cf. https://github.com/rear/rear/issues/540#issuecomment-71773772) then it takes more time until kernel/udev/whatever_tools_run_by_udev have all been triggered/launched/finished. This could be the reason why in your case you need to wait at least 3 seconds until partprobe succeeds, while in my test case (plain filesystem on partition on harddisk) everything works "fast as lightning".

Furthermore when you have a higher stack of storage objects you may need to do an appropriately higher stack of actions to wipe the various kinds of metadata information blocks from the harddisk that belong to each layer of storage objects.
If you fail to wipe metadata information from the harddisk that belongs to old/outdated layers of storage objects, then when you re-use that harddisk the tools that belong to each layer of storage objects get confused when they read old/outdated metadata information from the harddisk.
As far as I know the tools that belong to each layer of storage objects could get launched automatically by udev.
For example if you re-use a harddisk that had LVM on it before, it may happen that after creating partitions from scratch on that harddisk, udev also triggers LVM tools to run (cf. https://github.com/rear/rear/issues/540#issuecomment-71814659 and https://github.com/rear/rear/issues/533#issuecomment-71441012 which are about MD tools but should describe the general udev behaviour). When those LVM tools detect remaining old/outdated LVM metadata information on the harddisk, there could be arbitrarily unexpected results (e.g. all of a sudden LVM issues may get in your way even though you only called parted to create partitions).
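
A minimal sketch of what wiping the metadata of each storage layer could look like (the device names are only examples and wipefs is assumed to be available in the recovery system):

    # Wipe old metadata layer by layer before re-using a disk, starting at the
    # topmost layer (filesystem / LVM / RAID signatures on the partition):
    wipefs --all /dev/sda1
    # Then wipe the signatures on the whole disk (partition table, old LVM PV headers):
    wipefs --all /dev/sda
    # As a last resort zero out the first MiB of the disk
    # where most metadata headers are stored:
    dd if=/dev/zero of=/dev/sda bs=1M count=1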

If you are interested you may also have a look at https://github.com/rear/rear/issues/533 if you would like some more information about the current mess in the udev related area where I somehow try to keep rear working nevertheless.

jsmeix commented at 2016-03-18 14:02:

@mattihautameki
I noticed your https://github.com/rear/rear/pull/790 and I think you could be interested about what I wrote here.

jsmeix commented at 2016-04-18 10:28:

Some interesting information that I found in
an openSUSE bug about parted behaviour:

In https://bugzilla.opensuse.org/show_bug.cgi?id=967375
there is (excerpt):

yast fails to find existing partitions
YaST is right, the entry for /dev/sda14
is missing in /proc/partitions.
Likely this is due to the fact that
calling parted for printing the partition table
triggers a udev rescan of the complete disk
and this can cause partitions to disappear
and reappear in /proc/partitions
(and /sys/block/...)
Maybe scattering more calls to "udevadm settle"
can solve the problem but changing parted to
not trigger a udev rescan looks like a better way,
esp. since the rescan with many disks is also slow.
... needless udev storm triggered by RO operation ...

That openSUSE bug links to
http://parted-devel.alioth.debian.narkive.com/BoSuz2Nz/patch-libparted-use-read-only-when-probing-devices-on-linux-1245144
which contains (excerpt):

libparted:
Use read only when probing devices on linux
When a device is opened for RW
closing it can trigger other actions,
like udev scanning it for partition changes.
Resolves: rhbz#1245144

When (older) parted may cause a "udev storm"
that may lead to partitions disappearing and reappearing,
it becomes questionable whether it is really correct
to only wait for the actual "thingy", because
it may happen that the test for the actual "thingy"
is successful but a bit later, when that "thingy" is
actually needed by the subsequent command,
it has disappeared (for a short while).

Perhaps what is needed to make it really work reliably is

  • "udevadm settle" plus
  • waiting for the actual "thingy" plus
  • retries of commands that need the actual "thingy"

Regarding "retries" see https://github.com/rear/rear/issues/793 (excerpt):

partprobe fails
The only reliable workaround I've found is to
just call partprobe multiple times
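
A minimal sketch of how those three pieces could be combined (LogPrint and Error are the usual rear functions; the device name, timeouts and retry counts are only illustrating examples):

    # 1. Let udev finish processing the events it already knows about:
    udevadm settle
    # 2. Additionally wait for the actual "thingy" (here the partition device node):
    for countdown in $( seq 60 -1 0 ) ; do
        test -b /dev/sda1 && break
        LogPrint "Waiting for /dev/sda1 ($countdown)"
        sleep 1
    done
    test -b /dev/sda1 || Error "No /dev/sda1 available"
    # 3. Retry the command that needs the "thingy" because the device node
    #    could disappear and reappear for a short while (e.g. during a udev rescan):
    mkfs_done=""
    for attempt in 1 2 3 ; do
        mkfs -t ext4 /dev/sda1 && mkfs_done="yes" && break
        LogPrint "mkfs on /dev/sda1 failed (attempt $attempt), retrying"
        sleep 2
    done
    test "$mkfs_done" || Error "mkfs on /dev/sda1 failed"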

jsmeix commented at 2016-05-23 08:54:

Right now I noticed another issue that is of interest here:
https://github.com/cockpit-project/cockpit/issues/3177
and
https://bugzilla.redhat.com/show_bug.cgi?id=1283112

In general one may Google for something like

udevd vs. parted

That further indicates that it is in practice almost
impossible to implement a generically reliable way
to wait for the actual "thingy" that is needed
by the subsequent commands,
cf. my last comment https://github.com/rear/rear/issues/791#issuecomment-211318620

That it is "Fixed in parted-3.1-23.el7" does not help rear
because rear must also work with older parted.
FYI: In SLE11 there is parted-2.3,
in SLE12 and openSUSE Leap 42.1 there is parted-3.1
but I don't know if that fix is also included in the
SLE12/Leap42.1 parted.

Because the more I learn about it
the more confused I get about how to make it work,
I postpone this issue to a later unspecified
"future rear version".

jsmeix commented at 2016-06-03 14:10:

I found out more and got additional information:

On my SLES11 system "man udev" reads:

udev provides a dynamic device directory containing
only the files for actually present devices. It creates
or removes device node files in the /dev directory

In contrast on my openSUSE Leap 42.1 system
(which is basically the same as a SLES12 system)
"man udev" reads:

udev supplies the system software with device events,
manages permissions of device nodes and may create additional
symlinks in the /dev directory, or renames network interfaces.
The kernel usually just assigns unpredictable device names
based on the order of discovery.

Accordingly udev changed: under SLE11 it created the
device node files in the /dev directory, while nowadays
it seems the kernel itself creates the
device node files in the /dev directory.

I asked one of the SUSE systemd/udev maintainers
and he confirmed it:

Since SLE12, devtmpfs pseudo filesystem is used.
Therefore the kernel creates the device nodes and
udev is simply running the rules to set,
amongst others, device permissions.

I told him that one same code in "rear" must work
both for SLES11 and for nowadays systems and
I asked him how to enhance plain code like

parted ... /dev/sda
mkfs ... /dev/sda1

so that it will always (i.e. on SLES11 and SLES12)
work reliably - in particular how to wait for /dev/sda
and /dev/sda1 in the right way so that it always works
and he replied:

The same code for SLE11 and SLE12 will be different.
SLE11 will require more code to wait for udev creating
the block device.
OTOH, I don't think you need extra code to make
this reliable.
As soon as parted will do an ioctl(BLKRRPART)
to the block device, the kernel will re-read the
partition table and recreate the block device node
atomically, so I _think_ this part is synchronous.
But since it's handled by the kernel I'm not sure.
You should ask to the kernel folks to be 100% sure.

Hmm...
"one same code in rear that works both for SLES11
and for nowadays systems"
versus
"The same code for SLE11 and SLE12 will be different."
proves that this issue cannot be solved.

I just give up here.

jsmeix commented at 2016-06-06 10:10:

I reopen it so that others can notice it.

I added the labels "documentation" and "discuss"
so that others could contribute here.

Perhaps there is a reasonable way to wait
for device nodes like /dev/sda and /dev/sda1
so that it works both on older systems where udev
creates the device nodes and on newer systems
where the kernel creates the device nodes.
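
A minimal sketch of how such code could at least detect which of the two cases it runs on (whether that distinction actually helps with the waiting problem is exactly the open question here):

    # Check the filesystem type that is mounted on /dev:
    dev_fs_type=$( awk '$2 == "/dev" { print $3 }' /proc/mounts )
    if test "$dev_fs_type" = "devtmpfs" ; then
        # The kernel creates the device nodes itself (e.g. SLE12 and later):
        LogPrint "/dev is devtmpfs (device nodes are created by the kernel)"
    else
        # udev creates the device nodes (e.g. SLE11), so more explicit waiting may be needed:
        LogPrint "/dev is $dev_fs_type (device nodes are created by udev)"
    fi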

Perhaps my https://github.com/rear/rear/issues/791#issuecomment-211318620 might point into the right direction:

"udevadm settle" plus
waiting for the actual "thingy" plus
retries of commands that need the actual "thingy"

Keep it simple and stupid and just do all of them together
to make it work on older and newer systems?

I also added "looking for sponsorship" as a request
that others may contribute code (i.e. pull requests)
that implement waiting for device nodes correctly.

jsmeix commented at 2016-06-07 09:42:

And now some more scary news:

On special hardware (e.g. IBM z Systems - formerly
System/390 or S/390) there could be thousands of devices
so that a "udevadm settle" for each of them results in delays
that sum up to several minutes.

Therefore any kind of generic "wait_for_device()" function
would need some parameter variables in default.conf
so that the admin can specify the exact waiting behaviour
for his particular use-case.
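
A minimal sketch of what such a configurable function could look like (the function name and the DEVICE_WAIT_TIMEOUT / DEVICE_WAIT_UDEV_SETTLE variable names are only hypothetical examples, not existing rear variables):

    # Hypothetical defaults that could go into default.conf
    # so that the admin can adapt the waiting behaviour:
    DEVICE_WAIT_TIMEOUT=${DEVICE_WAIT_TIMEOUT:-60}
    DEVICE_WAIT_UDEV_SETTLE=${DEVICE_WAIT_UDEV_SETTLE:-yes}

    # Wait until a block device node exists, within the configured timeout:
    function wait_for_device () {
        local device=$1
        local countdown
        test "$DEVICE_WAIT_UDEV_SETTLE" = "yes" && udevadm settle
        for countdown in $( seq $DEVICE_WAIT_TIMEOUT -1 0 ) ; do
            test -b "$device" && return 0
            LogPrint "Waiting for $device ($countdown)"
            sleep 1
        done
        Error "$device did not appear within $DEVICE_WAIT_TIMEOUT seconds"
    }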

gdha commented at 2018-07-19 16:25:

Perhaps it is better to remove all udev calls and if discovery of devices fails we can blame other tools like parted, partprobe, whatever ;-) ?? Or do we keep it like it is?

jsmeix commented at 2018-07-20 07:30:

@gdha
yes, to some degree it is about removing the udevadm settle calls
but actually it is not about removing them without any replacement.

Actually it is about replacing useless udevadm settle calls
with useful code that waits for the actually needed thing.

This is one of several fundamental issues on my
"vague private todo list" to overhaul our current layout code
at whatever time in the future as time permits, because
currently things work mostly o.k. (not always perfect but o.k.)
so that currently there is no hurry (therefore ReaR future).

FYI:

My "vague private todo list" to overhaul our current layout code
is about things like
https://github.com/rear/rear/issues/791 (this issue here)
https://github.com/rear/rear/issues/799
https://github.com/rear/rear/issues/987
https://github.com/rear/rear/issues/1750
https://github.com/rear/rear/issues/1771
and probably some more...

In general in relation to the whole disk layout code
cf.
https://github.com/rear/rear/issues/1195#issuecomment-368499304
and
https://github.com/rear/rear/pull/1603#issuecomment-347860869
in particular my currently latest reasoning
about a "clean design" (of the disk layout code) in
https://github.com/rear/rear/pull/1091#issuecomment-263819775

Meanwhile I even have a "secret vague idea"
how we could completely overhaul the layout code
without getting into conflict with our current layout code:

My idea is to re-implement it as a new separate stage,
for example the new stuff could be called "storage"
in its own usr/share/rear/storage sub-directory,
plus a config variable like STORAGE_CODE=storage
which defaults to STORAGE_CODE=layout as long
as the new storage code is under development,
so that the usr/share/rear/lib/*-workflow.sh scripts
could call the matching stage, for example
in lib/recover-workflow.sh something like (excerpt):

    case "$STORAGE_CODE" in
        (layout)
            SourceStage "layout/prepare"
            SourceStage "layout/recreate"
            ;;
        (storage)
            SourceStage "storage/recreate/cleanup_disks"
            SourceStage "storage/recreate/create_block_devices"
            SourceStage "storage/recreate/create_filesystems"
            SourceStage "storage/recreate/create_subvolumes"
            SourceStage "storage/recreate/mount_fstree"
            ;;
        (*)
            Error "Unsupported STORAGE_CODE=$STORAGE_CODE"
            ;;
    esac

This way we could even safely drop all the old stuff
in the new storage code and keep the layout code
as long as our users need it, cf.
https://github.com/rear/rear/issues/1390

github-actions commented at 2020-07-03 01:33:

Stale issue message

gdha commented at 2020-07-10 06:02:

@jsmeix Do you still have any plans with this issue? If not, then do nothing and it will be closed within a few days. Otherwise, we remove the 'no-issue-activity' label to keep it alive.

jsmeix commented at 2020-07-10 11:51:

All issues that are labeled as

Dedicated Priority Support
critical/security bug
severe improvement
blocker

should never be closed automatically, cf.
https://github.com/rear/rear/blob/master/.github/stale.yml and see in particular
https://github.com/rear/rear/commit/3cf3f5862185787dba622201d1445fcc00d8e29e

Issues that are labeled as

critical/security bug
blocker

should usually not be affected by automated close because we usually work on them.

But issues that are labeled as

Dedicated Priority Support

could be affected by automated close in particular when the "Dedicated Priority Support"
activity happens not within the issue but e.g. via personal communication with a customer
so the public GitHub issue may show no activity for a longer time.

And most of all issues that are labeled as

severe improvement

are affected by automated close because "severe improvement" tasks
usually take a lot of time and usually move forward very slowly,
cf. https://github.com/rear/rear/issues/799#issuecomment-531247109
so that such issues often have no activity for a longer time
but they are severe issues that should never be closed automatically
regardless of how long it takes to get them solved.

Veetaha commented at 2023-04-12 23:45:

Hey @jsmeix I found your issue via a Google search for my problem and it appears to describe exactly the case that I have, where I partition the disk and immediately mkfs and mount it.

I tried your approach with a loop and test -b $partition inside of it. But... it looks like it doesn't work. I am not sure why, but if you are using the same approach beware that it is still not reliable. From the cloud-init logs I can tell that even if lsblk says that the partition exists, it doesn't mean it's safe to mkfs. Here are the logs from my own cloud-init script:

Waiting for block device at /dev/nvme1n1p1
Block device is available at /dev/nvme1n1p1
> lsblk
NAME          MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
nvme0n1       259:0    0    30G  0 disk
├─nvme0n1p1   259:2    0    30G  0 part /
└─nvme0n1p128 259:3    0     1M  0 part
nvme1n1       259:1    0 441.5G  0 disk
└─nvme1n1p1   259:4    0 441.5G  0 part
> mkfs.xfs -f /dev/nvme1n1p1
Error accessing specified device /dev/nvme1n1p1: No such file or directory

The script is a variation of yours, shown below:

    function wait_for_block_device {
        path="$1"
        log "Waiting for block device at $path"
        for countdown in $( seq 600 -1 1 ) ; do
            if [ -b $path ]; then
                log "Block device is available at $path"
                return 0
            fi
            log "Waiting for block device $path ($${countdown}00ms left)"
            sleep 0.1s
        done
        error "Block device is not available at $path"
    }

That's just insane... but I am still seeking a workaround for the problem

Maybe just retrying mkfs while it returns "No such file or directory" could solve this, but that's a crazy workaround one would need to repeat every time such a common thing as partitioning + mkfs + mount needs to be scripted...

jsmeix commented at 2023-04-17 08:18:

@Veetaha
see my above
https://github.com/rear/rear/issues/791#issuecomment-211318620
how it could happen that device nodes disappear and reappear
so one must also retry commands that need the actual "thingy".

By the way:
For the (not so) fun of it
you may also have a look at
https://github.com/rear/rear/issues/2908
therein in particular
https://github.com/rear/rear/issues/2908#issuecomment-1378811748
so

# mount /dev/something /mountpoint
# cp /some/file /mountpoint
# umount /mountpoint

may no longer work reliably because
umount could fail with "target is busy"

Veetaha commented at 2023-04-24 00:26:

@jsmeix Thank you! Your issues are a great mine of information. I wish these common problems and solutions were described somewhere in the help messages of udev/parted and the other Linux device management commands.

I am now using the approach of retrying the command that needs the actual "thingy", and it should be the ultimate solution.
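
A minimal sketch of such a generic retry wrapper (the retry_command name, retry count, sleep time and the /mnt/data mount point are only illustrating examples):

    # Retry a command a few times because the needed device node
    # could disappear and reappear for a short while (see above):
    function retry_command {
        local attempt
        for attempt in 1 2 3 4 5 ; do
            "$@" && return 0
            echo "Command '$*' failed (attempt $attempt), retrying" >&2
            sleep 2
        done
        echo "Command '$*' still failed after $attempt attempts" >&2
        return 1
    }

    # Usage examples:
    retry_command mkfs.xfs -f /dev/nvme1n1p1
    retry_command mount /dev/nvme1n1p1 /mnt/data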


[Export of Github issue for rear/rear.]