#3329 PR merged: Fix partition naming on RHEL when migrating devices

Labels: bug, fixed / solved / done

rmetrich opened issue at 2024-10-10 12:26:

Relax-and-Recover (ReaR) Pull Request Template

Please fill in the following items before submitting a new pull request:

Pull Request Details:
  • Type: Bug Fix

  • Impact: Normal

  • Reference to related issue (URL):

  • How was this pull request tested? Code tested in the following cases.

With QEMU SCSI disk ID "0000a":

rear> lsblk
NAME                           MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINTS
sda                              8:0    0    20G  0 disk
|-sda1                           8:1    0     1G  0 part
|-sda2                           8:2    0    19G  0 part
`-0QEMU_QEMU_HARDDISK_0000a    253:0    0    20G  0 mpath
  |-0QEMU_QEMU_HARDDISK_0000a1 253:1    0     1G  0 part
  `-0QEMU_QEMU_HARDDISK_0000a2 253:2    0    19G  0 part
sdb                              8:16   0    20G  0 disk
|-sdb1                           8:17   0     1G  0 part
|-sdb2                           8:18   0    19G  0 part
`-0QEMU_QEMU_HARDDISK_0000a    253:0    0    20G  0 mpath
  |-0QEMU_QEMU_HARDDISK_0000a1 253:1    0     1G  0 part
  `-0QEMU_QEMU_HARDDISK_0000a2 253:2    0    19G  0 part
sr0                             11:0    1 614.7M  0 rom

With QEMU SCSI disk ID "0000":

rear> lsblk
NAME                           MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINTS
sda                              8:0    0    20G  0 disk
|-sda1                           8:1    0     1G  0 part
|-sda2                           8:2    0    19G  0 part
`-0QEMU_QEMU_HARDDISK_0000     253:0    0    20G  0 mpath
  |-0QEMU_QEMU_HARDDISK_0000p1 253:1    0     1G  0 part
  `-0QEMU_QEMU_HARDDISK_0000p2 253:2    0    19G  0 part
sdb                              8:16   0    20G  0 disk
|-sdb1                           8:17   0     1G  0 part
|-sdb2                           8:18   0    19G  0 part
`-0QEMU_QEMU_HARDDISK_0000     253:0    0    20G  0 mpath
  |-0QEMU_QEMU_HARDDISK_0000p1 253:1    0     1G  0 part
  `-0QEMU_QEMU_HARDDISK_0000p2 253:2    0    19G  0 part
sr0                             11:0    1 614.7M  0 rom

With default friendly name ("mpatha"):

rear> lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINTS
sda           8:0    0    20G  0 disk
|-sda1        8:1    0     1G  0 part
|-sda2        8:2    0    19G  0 part
`-mpatha    253:0    0    20G  0 mpath
  |-mpatha1 253:1    0     1G  0 part
  `-mpatha2 253:2    0    19G  0 part
sdb           8:16   0    20G  0 disk
|-sdb1        8:17   0     1G  0 part
|-sdb2        8:18   0    19G  0 part
`-mpatha    253:0    0    20G  0 mpath
  |-mpatha1 253:1    0     1G  0 part
  `-mpatha2 253:2    0    19G  0 part
sr0          11:0    1 614.7M  0 rom

With friendly name ending with a digit ("disk0"):

rear> lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINTS
sda           8:0    0    20G  0 disk
|-sda1        8:1    0     1G  0 part
|-sda2        8:2    0    19G  0 part
`-disk0     253:0    0    20G  0 mpath
  |-disk0p1 253:1    0     1G  0 part
  `-disk0p2 253:2    0    19G  0 part
sdb           8:16   0    20G  0 disk
|-sdb1        8:17   0     1G  0 part
|-sdb2        8:18   0    19G  0 part
`-disk0     253:0    0    20G  0 mpath
  |-disk0p1 253:1    0     1G  0 part
  `-disk0p2 253:2    0    19G  0 part
sr0          11:0    1 614.7M  0 rom
  • Description of the changes in this pull request:

The previous code was appending a "p" to the device name unconditionally, which ended up having partition names such as 'wwidp0' instead of 'wwid1', when the device name (e.g. 'wwid') ended with a letter and not a digit.
The new code applies the proper naming, which is 'wwid1' when device doesn't end with a digit (e.g. 'wwid'), and 'wwid0000p1' when the device (e.g. 'wwid0000') ends with a digit.

See also When WWID does not end in digit, no p is added to the partition name of a multipath device in RHEL7

jsmeix commented at 2024-10-11 07:29:

@pcahyna
please have a look here and merge it as you like

jsmeix commented at 2024-10-11 07:36:

@rmetrich
it seems you could simulate and test multipath things via QEMU.
If yes, is there an explanatory description how to do that?
I.e. how to setup a VM to simulate a multipath system
on a usual workstation that has no multipath hardware.
I do not setup KVM/QEMU VMs via command line
but I use only the graphical UI of virt-manager.
Is it also possible to setup a VM with virt-manager
that simulates a multipath system?

rmetrich commented at 2024-10-11 07:47:

Yes you can, it's easy, at least on RHEL :-)

When you create the VM, specify to use a RAW disk instead of QCOW2 and make sure SCSI controller is used instead of VirtIO. Also make sure to not use Cache mode, set a Serial number for the disk and select Shareable.
Finally add an additional disk, reuse the previously created disk and set the cache mode to None and set Serial number to same value, and you get your Multipath.

Example of XML:

    <disk type="file" device="disk">
      <driver name="qemu" type="raw" cache="none" discard="unmap"/>
      <source file="/var/lib/libvirt/images/multipath9-recovery.img"/>
      <target dev="sda" bus="scsi"/>
      <shareable/>
      <serial>0000</serial>
      <address type="drive" controller="0" bus="0" target="0" unit="0"/>
    </disk>
    <disk type="file" device="disk">
      <driver name="qemu" type="raw" cache="none" discard="unmap"/>
      <source file="/var/lib/libvirt/images/multipath9-recovery.img"/>
      <target dev="sdb" bus="scsi"/>
      <shareable/>
      <serial>0000</serial>
      <address type="drive" controller="0" bus="0" target="0" unit="1"/>
    </disk>

jsmeix commented at 2024-10-11 13:58:

@rmetrich
thank you so much for your description!

I will try it out as soon as possible
i.e. when time permits - which may take some time
because currently I am messed up with overhyped
security excitement about cups-browsed when some
fools run it accessible from untrusted networks
like the public Internet, cf.
https://en.opensuse.org/SDB:CUPS_and_SANE_Firewall_settings#Automated_print_queue_setup_via_cups-browsed

pcahyna commented at 2024-10-16 16:42:

I think it is ok as a quick fix, but in longer term I would very much prefer to get rid of fragile text manipulations like this (note how many code branches there are and how many conditions they depend on). I believe the format of the partition device name can be determined from the layout, e.g. the layout contains

part /dev/mapper/362cea7f0a24d79002c16131a09f376a1 2199022206976 1048576 primary none /dev/mapper/362cea7f0a24d79002c16131a09f376a1p1

so we know that the partition prefix is /dev/mapper/362cea7f0a24d79002c16131a09f376a1p , if we know what suffix to substract, and this can be done quite easily by adding the partition number (1) to the layout as an additional field.

jsmeix commented at 2024-10-17 08:26:

Regarding a future generic cleanup to get rid of fragile code:

@rmetrich @pcahyna
could you please attach your disklayout.conf files
(or excerpts of the relevant data in disklayout.conf)
for each of the different cases that you mentioned here
because I fail to imagine how your disklayout.conf
exactly looks like for all those different cases.

I like to understand to what extent we already know
in disklayout.conf for each child (partition) device
what its parent (disk/mpath) device is.

Because if we knew them like

parent (disk/mpath) device         child (partition) device
0QEMU_QEMU_HARDDISK_0000a          0QEMU_QEMU_HARDDISK_0000a1
0QEMU_QEMU_HARDDISK_0000a          0QEMU_QEMU_HARDDISK_0000a2
0QEMU_QEMU_HARDDISK_0000           0QEMU_QEMU_HARDDISK_0000p1
0QEMU_QEMU_HARDDISK_0000           0QEMU_QEMU_HARDDISK_0000p2
mpatha                             mpatha1
mpatha                             mpatha2
disk0                              disk0p1
disk0                              disk0p2
362cea7f0a24d79002c16131a09f376a1  362cea7f0a24d79002c16131a09f376a1p1

we could generically determine what the used
"partition separator string" is (which could be empty)
by subtracting from the child (partition) device
all trailing numbers (i.e. the partition number) and
also subtracting the leading parent (disk/mpath) device.

When we know what "partition separator string" is used
(the "partition separator string" could be empty)
we can append it in any case to parent (disk/mpath) devices.

rmetrich commented at 2024-10-17 09:03:

disklayout.txt

rmetrich commented at 2024-10-17 09:04:

All is based on this disklayout, then I boot a different the recovery system in migration mode.

jsmeix commented at 2024-10-17 10:33:

Ah!
So it seems the problem appears when the ReaR recovery system
with its disklayout.conf of the original system
is booted on different replacement hardware and then
it can happen that on different replacement hardware
a different "partition separator string" is used
than what was used on the original system
which is in this case here

multipath /dev/mapper/0QEMU_QEMU_HARDDISK_0000c 21474836480 msdos /dev/sda,/dev/sdb
part /dev/mapper/0QEMU_QEMU_HARDDISK_0000c 1073741824 1048576 primary boot /dev/mapper/0QEMU_QEMU_HARDDISK_0000c1
part /dev/mapper/0QEMU_QEMU_HARDDISK_0000c 20400046080 1074790400 primary lvm /dev/mapper/0QEMU_QEMU_HARDDISK_0000c2

so the "partition separator string" on the original system
was empty in this case but on the replacement hardware
a different "partition separator string" gets used
like 'p' or '-part' or '_part' or whatever.

Do I understand it correctly?

If I understand it correctly I wonder
what the reason is why on different replacement hardware
a different "partition separator string" may get used?

I.e. what conditions or settings or automatisms
determine what "partition separator string" gets used?

I understand that a "partition separator string"
needs to be inserted when the parent (disk/mpath) device
ends with a number to separate the partition number but
I do not yet understand how on different replacement hardware
the parent (disk/mpath) device naming scheme can be different
(i.e. with or without a number at its end)
compared to the original system?

jsmeix commented at 2024-10-21 07:17:

@pcahyna
is there a reason why you don't merge it, cf. my
https://github.com/rear/rear/pull/3329#issuecomment-2406723276
or should I "just merge" it soon?

@rear/contributors
unless there are objections
I would like to merge it tomorrow afternoon.

pcahyna commented at 2024-10-21 10:29:

@jsmeix

is there a reason why you don't merge it

It's just that I was on a short holiday - but after that I found some things that are unclear with the code.

rmetrich commented at 2024-10-21 11:20:

It's done on line 1050, line 1049 is a condition, it will match either OS < RHEL7 (unsupported now) and cases where device name ends with a digit.

pcahyna commented at 2024-10-21 11:54:

@rmetrich and what happens if neither condition matches? I don't see any else there.

rmetrich commented at 2024-10-21 12:11:

Then it uses the default initialization, which is on line 1002: no device name change

jsmeix commented at 2024-10-21 12:27:

A side note for the fun of it:

What a pleasure for me to see how unclear code can be
for others who did not make a specific piece of code
(even when they are experienced with a project's code)
without comprehensive and explanatory comments ;-)

So the basic reasoning in
https://github.com/rear/rear/wiki/Coding-Style#code-must-be-easy-to-understand-answer-the-why

Even if all is totally obvious for you, others ...
may understand nothing at all about your code ...

is (at least to some extent) true.

pcahyna commented at 2024-10-21 12:44:

Then it uses the default initialization, which is on line 1002: no device name change

Ah ok, thanks. Line 1085 (former 1071) is quite confusing then.

pcahyna commented at 2024-10-21 12:47:

@jsmeix

A side note for the fun of it:

What a pleasure for me to see how unclear code can be for others who did not make a specific piece of code (even when they are experienced with a project's code) without comprehensive and explanatory comments ;-)

It is not a question of comments in this case, the problem is that the structure of the code - the other branches and especially line
https://github.com/rear/rear/blob/ba835cd47f6abb97ff08814e24f6eb8fa016c1eb/usr/share/rear/lib/layout-functions.sh#L1071
indicates that all cases need to be covered (there is always an else or a default case) and the new code breaks this pattern.

rmetrich commented at 2024-10-21 12:48:

Then it uses the default initialization, which is on line 1002: no device name change

Ah ok, thanks. Line 1085 (former 1071) is quite confusing then.

It's debian code, I didn't change this code because i have no debian.

pcahyna commented at 2024-10-21 12:50:

Then it uses the default initialization, which is on line 1002: no device name change

Ah ok, thanks. Line 1085 (former 1071) is quite confusing then.

It's debian code, I didn't change this code because i have no debian.

Sorry, I meant line 1071 (former 1085) - this is not for Debian, it is the default case.

jsmeix commented at 2024-10-21 13:14:

Yes, I see,
it is not a question of comments in this case
but in general a question of the basic ideas in
https://github.com/rear/rear/wiki/Coding-Style

I think in this case it is (at least also) a question
of basically meaningless variable names like

part_name="$device_name"

where that line of code on its own tells
(at least to me) nothing what it is meant to do.
So one must reverse engineer major code parts
to build up some idea what it may actually do.

But this is not something @rmetrich introduced
so he should not be made responsible here
to clean up and overhaul old code.


[Export of Github issue for rear/rear.]