#2002 Issue closed: SLES12SP3 on Power - Getting disk layout recreation failed error and disk mapping to /dev/sda.

Labels: support / question, fixed / solved / done

ccjung opened issue at 2018-12-10 19:37:

Relax-and-Recover (ReaR) Issue Template

Fill in the following items before submitting a new issue
(quick response is not guaranteed with free support):

  • ReaR version ("/usr/sbin/rear -V"): Relax-and-Recover 2.4-git.0.6ec9075.unknown / 2018-12-05

  • OS version ("cat /etc/rear/os.conf" or "lsb_release -a" or "cat /etc/os-release"):

rear> cat /etc/rear/os.conf
OS_VENDOR=SUSE_LINUX
OS_VERSION=12.3
# The following information was added automatically by the mkbackup workflow:
ARCH='Linux-ppc64le'
OS='GNU/Linux'
OS_VERSION='12.3'
OS_VENDOR='SUSE_LINUX'
OS_VENDOR_VERSION='SUSE_LINUX/12.3'
OS_VENDOR_ARCH='SUSE_LINUX/ppc64le'
# End of what was added automatically by the mkbackup workflow.
  • ReaR configuration files ("cat /etc/rear/site.conf" and/or "cat /etc/rear/local.conf"):
rear> cat /etc/rear/local.conf
# Default is to create Relax-and-Recover rescue media as ISO image
# set OUTPUT to change that
# set BACKUP to activate an automated (backup and) restore of your data
# Possible configuration values can be found in /usr/share/rear/conf/default.conf
#
# This file (local.conf) is intended for manual configuration. For configuration
# through packages and other automated means we recommend creating a new
# file named site.conf next to this file and to leave the local.conf as it is.
# Our packages will never ship with a site.conf.
AUTOEXCLUDE_MULTIPATH=n
BOOT_OVER_SAN=y
REAR_INITRD_COMPRESSION=lzma
OUTPUT=ISO
ISO_MAX_SIZE=4000
BACKUP=NETFS
BACKUP_URL=iso:///iso_fs/REAR_BACKUP
ISO_DIR=/iso_fs/REAR_ISO
TMPDIR=/iso_fs/REAR_TEMP
OUTPUT_URL=null
BOOT_FROM_SAN=y
EXCLUDE_MOUNTPOINTS=( /iso_fs /opt/IBM/ITM /opt/commvault /opt/splunkforwarder /opt/teamquest /var/opt/BESClient /usr/sap /hana/data /hana/log /hana/shared /usr/sap/basis /usr/sap/srm /PA_backup )
EXCLUDE_COMPONENTS=( /dev/mapper/20017380033e700f3 /dev/mapper/20017380033e700f4 /dev/mapper/20017380033e700f5 /dev/mapper/20017380033e700f6 /dev/mapper/20017380033e700f7 /dev/mapper/20017380033e700f8 /dev/mapper/20017380033e700f9 /dev/mapper/20017380033e700fa /dev/mapper/20017380033e700ff /dev/mapper/20017380033e70100 /dev/mapper/20017380033e70101 /dev/mapper/20017380033e70102 )
  • Hardware (PC or PowerNV BareMetal or ARM) or virtual machine (KVM guest or PowerVM LPAR): PowerVM LPAR

  • System architecture (x86 compatible or PPC64/PPC64LE or what exact ARM device): PPC64LE

  • Firmware (BIOS or UEFI or Open Firmware) and bootloader (GRUB or ELILO or Petitboot): GRUB

  • Storage (local disk or SSD) and/or SAN (FC or iSCSI or FCoE) and/or multipath (DM or NVMe): SAN FC

  • Description of the issue (ideally so that others can reproduce it): I needed to boot multiple times to get the disk serial number for the disk mapping, and the disk layout recreation script failed

Comparing disks
Ambiguous possible target disks need manual configuration (more than one with same size found)
Switching to manual disk layout configuration
Using /dev/sda (same size) for recreating /dev/mapper/20017380033e700f2
Current disk mapping table (source => target):
  /dev/mapper/20017380033e700f2 => /dev/sda


Creating partitions for disk /dev/mapper/20017380033e7012f (msdos)
UserInput -I LAYOUT_CODE_RUN needed in /usr/share/rear/layout/recreate/default/200_run_layout_code.sh line 127
The disk layout recreation script failed
  • Workaround, if any:

  • Attachments, as applicable ("rear -D mkrescue/mkbackup/recover" debug log files):

20181210_RearRestore.log.log

jsmeix commented at 2018-12-11 09:04:

The
https://github.com/rear/rear/files/2664654/20181210_RearRestore.log.log
contains (excerpts):

++ source /var/lib/rear/layout/diskrestore.sh
+++ LogPrint 'Start system layout restoration.'
...
+++ lvm vgchange -a n
  /run/lvm/lvmetad.socket: connect failed: No such file or directory
  WARNING: Failed to connect to lvmetad. Falling back to internal scanning.
  Found duplicate PV emSwP2fCyavSeeBOCALUdOqYCw41Lc01: using /dev/mapper/SIBM_2810XIV_78033E7012F-part3 not /dev/mapper/20017380033e7012f-part3
  Using duplicate PV /dev/mapper/SIBM_2810XIV_78033E7012F-part3 which is last seen, replacing /dev/mapper/20017380033e7012f-part3
  Found duplicate PV emSwP2fCyavSeeBOCALUdOqYCw41Lc01: using /dev/mapper/SIBM_2810XIV_78033E7012F-part3 not /dev/sdd3
  Using duplicate PV /dev/mapper/SIBM_2810XIV_78033E7012F-part3 from subsystem DM, ignoring /dev/sdd3
...
+++ lvm vgchange -a n system_vg
  /run/lvm/lvmetad.socket: connect failed: No such file or directory
  WARNING: Failed to connect to lvmetad. Falling back to internal scanning.
  Found duplicate PV emSwP2fCyavSeeBOCALUdOqYCw41Lc01: using /dev/mapper/SIBM_2810XIV_78033E7012F-part3 not /dev/mapper/20017380033e7012f-part3
  Using duplicate PV /dev/mapper/SIBM_2810XIV_78033E7012F-part3 which is last seen, replacing /dev/mapper/20017380033e7012f-part3
  Found duplicate PV emSwP2fCyavSeeBOCALUdOqYCw41Lc01: using /dev/mapper/SIBM_2810XIV_78033E7012F-part3 not /dev/sdd3
  Using duplicate PV /dev/mapper/SIBM_2810XIV_78033E7012F-part3 from subsystem DM, ignoring /dev/sdd3
  0 logical volume(s) in volume group "system_vg" now active
+++ lvm pvcreate -ff --yes -v --uuid emSwP2-fCya-vSee-BOCA-LUdO-qYCw-41Lc01 --norestorefile /dev/mapper/20017380033e7012f-part3
  /run/lvm/lvmetad.socket: connect failed: No such file or directory
  WARNING: Failed to connect to lvmetad. Falling back to internal scanning.
  Found duplicate PV emSwP2fCyavSeeBOCALUdOqYCw41Lc01: using /dev/mapper/SIBM_2810XIV_78033E7012F-part3 not /dev/mapper/20017380033e7012f-part3
  Using duplicate PV /dev/mapper/SIBM_2810XIV_78033E7012F-part3 which is last seen, replacing /dev/mapper/20017380033e7012f-part3
  Found duplicate PV emSwP2fCyavSeeBOCALUdOqYCw41Lc01: using /dev/mapper/SIBM_2810XIV_78033E7012F-part3 not /dev/sdd3
  Using duplicate PV /dev/mapper/SIBM_2810XIV_78033E7012F-part3 from subsystem DM, ignoring /dev/sdd3
  uuid emSwP2-fCya-vSee-BOCA-LUdO-qYCw-41Lc01 already in use on "/dev/mapper/SIBM_2810XIV_78033E7012F-part3"
++ ((  1 == 0  ))
++ true
+++ UserInput -I LAYOUT_CODE_RUN -p 'The disk layout recreation script failed' ...

so it seems the command

lvm pvcreate -ff --yes -v --uuid emSwP2-fCya-vSee-BOCA-LUdO-qYCw-41Lc01 --norestorefile /dev/mapper/20017380033e7012f-part3

in diskrestore.sh exits with a non-zero exit code because of

uuid emSwP2-fCya-vSee-BOCA-LUdO-qYCw-41Lc01 already in use on "/dev/mapper/SIBM_2810XIV_78033E7012F-part3"

and that makes diskrestore.sh abort, because it is run with "set -e".
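
A minimal illustrative sketch (not the actual ReaR code, just the "set -e" behavior with the two commands from the log above):

#!/bin/bash
# Simplified sketch: a script run under 'set -e' stops at the first command
# that exits non-zero.
set -e
lvm vgchange -a n system_vg
# If this pvcreate fails ("uuid ... already in use"), execution stops here and
# the caller sees a non-zero exit code, which triggers the UserInput prompt.
lvm pvcreate -ff --yes --uuid emSwP2-fCya-vSee-BOCA-LUdO-qYCw-41Lc01 --norestorefile /dev/mapper/20017380033e7012f-part3
echo "never reached when pvcreate fails"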

@schabrolles
could you have a look at what is going on here, because I have no experience
with the possible complications caused by "duplicates" in a multipath setup.

jsmeix commented at 2018-12-11 09:07:

@suseusr168
can you also attach your disklayout.conf and your diskrestore.sh that fails,
cf. "Debugging issues with Relax-and-Recover" at
https://en.opensuse.org/SDB:Disaster_Recovery

schabrolles commented at 2018-12-11 09:24:

@suseusr168

It seems you have a problem with your multipath configuration.
You have one 112G device, but the multipath output shows two maps for it:

create: 20017380033e7012f undef IBM,2810XIV
size=112G features='0' hwhandler='0' wp=undef
`-+- policy='service-time 0' prio=50 status=undef
  |- 2:0:0:1 sda 8:0  undef ready running
  `- 2:0:1:1 sdb 8:16 undef ready running
create: SIBM_2810XIV_78033E7012F undef IBM,2810XIV
size=112G features='0' hwhandler='0' wp=undef
`-+- policy='service-time 0' prio=50 status=undef
  `- 3:0:0:1 sdc 8:32 undef ready running

I think 20017380033e7012f, detected via HBA 2, points to the same LUN as SIBM_2810XIV_78033E7012F, detected via HBA 3 (same 78033E7012F ID).

Could you check your SAN zoning configuration?
Please also send me the content of your multipath.conf.
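
For reference, a sketch of commands (assuming the standard multipath-tools are present in the rescue system) that collect the relevant multipath information:

multipath -ll                # current multipath topology and paths
multipathd show config       # effective configuration, even if no /etc/multipath.conf exists
cat /etc/multipath.conf      # the usual location of the configuration file, if one exists
ls -l /dev/disk/by-id/       # udev persistent names of the underlying SCSI paths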

ccjung commented at 2018-12-11 12:36:

I used the find command starting from / and cannot find a multipath.conf on the target system (eniesdbs107). I don't see many files in /mnt/local.

Where is the multipath.conf file located on the target system?

This is the multipath.conf file on the source system (the system it was cloned from):

eniesdbs101:/ # cat /usr/lib/modules-load.d/multipath.conf
# Load device-handler and multipath module at boot
scsi_dh_alua
scsi_dh_emc
scsi_dh_rdac
dm_multipath

Every time I boot up the LPAR, I get a different destination disk name, such as

/dev/sda
/dev/SIBM_2810XIV_78033E7012F
/dev/mapper/20017380033e7012f

I manually edited /var/lib/rear/layout/disk_mappings to use the disk with serial number
/dev/mapper/20017380033e7012f as the target, and it fails in the disk layout recreation script.
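
For context, a sketch of what such a mapping file typically contains (one "source target" pair per line; the entry below is only illustrative, derived from the mapping table shown above):

# /var/lib/rear/layout/disk_mappings (illustrative sketch)
# format: <original device> <target device>
/dev/mapper/20017380033e700f2 /dev/mapper/20017380033e7012f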

Attached are the diskrestore.sh and disklayout.conf. I added a .log extension in order to be able to attach the files to the issue ticket.

diskrestore.sh.log
disklayout.conf.log

schabrolles commented at 2018-12-11 13:52:

@suseusr168, as I said previously, I think you have a problem with your SAN configuration.

Are you trying to restore the backup on the same system with the same disks, or on a different one?

When a device named SIBM_XXXX is detected, it usually indicates a problem.
You should see something like this:

multipath -ll
create: 20017380033e7012f undef IBM,2810XIV
size=112G features='0' hwhandler='0' wp=undef
`-+- policy='service-time 0' prio=50 status=undef
  |- 2:0:0:1 sda 8:0  undef ready running
  |- 2:0:1:1 sdb 8:16 undef ready running
  `- 3:0:0:1 sdc 8:32 undef ready running

I suspect a SAN zoning configuration problem or something like that:

  • Are the 2 FC ports connected to the same SAN zone? (the host-side checks sketched below may help answer this)
  • How many XIV controller modules do you have?
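
As a host-side check (a sketch using standard Linux FC sysfs attributes, nothing ReaR-specific), the WWPNs of the local FC adapters can be read in the rescue system and compared against the zoning:

grep -H . /sys/class/fc_host/host*/port_name   # WWPN of each local FC adapter (host2, host3, ...)
lsscsi -t                                      # if available: each sdX with its FC transport address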

jsmeix commented at 2018-12-11 14:24:

@schabrolles
only out of curiosity I have a question regarding your above
https://github.com/rear/rear/issues/2002#issuecomment-446130883
where you wrote "same 78033E7012F ID":
I see ...7380033e7012f and ...78033E7012F where the trailing 033e7012f matches.
Is that trailing 033e7012f what you mean by "same 78033E7012F ID"?
Sorry if it is a stupid question - I am a total multipath noob.

ccjung commented at 2018-12-11 14:38:

Yes, it is the same trailing 033e7012f. I must have copied the uppercase variant of the string by mistake.

This same backup was restored to 4 out of 5 different LPARs with no problems. We haven't tried restoring it to the original system itself.

Lucky me, I got the one that doesn't work.

I'll ask the SAN team to check the zone configuration. I'll update the ticket with what I find.

Thanks everyone for your help.

schabrolles commented at 2018-12-11 15:21:

@jsmeix, you are right, the ending is the same (it represents the LUN ID part on the storage).
What I don't understand is the slight difference we can observe before it (7380033e7012f vs 78033E7012F).
It is as if the presented disk gets a different name depending on which FC card or storage controller is used to access the remote storage (which is not normal).

@suseusr168, could you also send us the output of
multipath -ll
udevadm info -n sda
udevadm info -n sdb
udevadm info -n sdc

ccjung commented at 2018-12-11 16:04:

Here is the requested output:

RESCUE eniesdbs101:~ # multipath -ll
SIBM_2810XIV_78033E7012F dm-0 IBM,2810XIV
size=112G features='2 queue_if_no_path retain_attached_hw_handler' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=50 status=active
  `- 3:0:1:1 sdd 8:48 active ready running

RESCUE eniesdbs101:~ # multipath -ll
SIBM_2810XIV_78033E7012F dm-0 IBM,2810XIV
size=112G features='2 queue_if_no_path retain_attached_hw_handler' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=50 status=active
  `- 3:0:1:1 sdd 8:48 active ready running

RESCUE eniesdbs101:~ # udevadm info -n sda
P: /devices/vio/3000000c/host2/rport-2:0-0/target2:0:0/2:0:0:1/block/sda
N: sda
S: disk/by-id/scsi-20017380033e7012f
S: disk/by-path/fc-0x5001738033e70140-lun-1
E: COMPAT_SYMLINK_GENERATION=1
E: DEVLINKS=/dev/disk/by-path/fc-0x5001738033e70140-lun-1 /dev/disk/by-id/scsi-20017380033e7012f
E: DEVNAME=/dev/sda
E: DEVPATH=/devices/vio/3000000c/host2/rport-2:0-0/target2:0:0/2:0:0:1/block/sda
E: DEVTYPE=disk
E: ID_BUS=scsi
E: ID_MODEL=2810XIV
E: ID_MODEL_ENC=2810XIV\x20\x20\x20\x20\x20\x20\x20\x20\x20
E: ID_PART_TABLE_TYPE=dos
E: ID_PART_TABLE_UUID=0005141b
E: ID_PATH=fc-0x5001738033e70140-lun-1
E: ID_PATH_TAG=fc-0x5001738033e70140-lun-1
E: ID_REVISION=0000
E: ID_SCSI=1
E: ID_SCSI_SERIAL=78033E7012F
E: ID_SERIAL=20017380033e7012f
E: ID_SERIAL_SHORT=0017380033e7012f
E: ID_TARGET_PORT=0
E: ID_TYPE=disk
E: ID_VENDOR=IBM
E: ID_VENDOR_ENC=IBM\x20\x20\x20\x20\x20
E: MAJOR=8
E: MINOR=0
E: MPATH_SBIN_PATH=/sbin
E: SUBSYSTEM=block
E: TAGS=:systemd:
E: USEC_INITIALIZED=14505103

RESCUE eniesdbs101:~ # udevadm info -n sdb
P: /devices/vio/3000000c/host2/rport-2:0-1/target2:0:1/2:0:1:1/block/sdb
N: sdb
S: disk/by-id/scsi-20017380033e7012f
S: disk/by-path/fc-0x5001738033e70171-lun-1
E: COMPAT_SYMLINK_GENERATION=1
E: DEVLINKS=/dev/disk/by-id/scsi-20017380033e7012f /dev/disk/by-path/fc-0x5001738033e70171-lun-1
E: DEVNAME=/dev/sdb
E: DEVPATH=/devices/vio/3000000c/host2/rport-2:0-1/target2:0:1/2:0:1:1/block/sdb
E: DEVTYPE=disk
E: ID_BUS=scsi
E: ID_MODEL=2810XIV
E: ID_MODEL_ENC=2810XIV\x20\x20\x20\x20\x20\x20\x20\x20\x20
E: ID_PART_TABLE_TYPE=dos
E: ID_PART_TABLE_UUID=0005141b
E: ID_PATH=fc-0x5001738033e70171-lun-1
E: ID_PATH_TAG=fc-0x5001738033e70171-lun-1
E: ID_REVISION=0000
E: ID_SCSI=1
E: ID_SCSI_SERIAL=78033E7012F
E: ID_SERIAL=20017380033e7012f
E: ID_SERIAL_SHORT=0017380033e7012f
E: ID_TARGET_PORT=0
E: ID_TYPE=disk
E: ID_VENDOR=IBM
E: ID_VENDOR_ENC=IBM\x20\x20\x20\x20\x20
E: MAJOR=8
E: MINOR=16
E: MPATH_SBIN_PATH=/sbin
E: SUBSYSTEM=block
E: TAGS=:systemd:
E: USEC_INITIALIZED=14503334

RESCUE eniesdbs101:~ # udevadm info -n sdc
P: /devices/vio/3000000d/host3/rport-3:0-0/target3:0:0/3:0:0:1/block/sdc
N: sdc
S: disk/by-path/fc-0x5001738033e70182-lun-1
E: COMPAT_SYMLINK_GENERATION=1
E: DEVLINKS=/dev/disk/by-path/fc-0x5001738033e70182-lun-1
E: DEVNAME=/dev/sdc
E: DEVPATH=/devices/vio/3000000d/host3/rport-3:0-0/target3:0:0/3:0:0:1/block/sdc
E: DEVTYPE=disk
E: ID_PART_TABLE_TYPE=dos
E: ID_PART_TABLE_UUID=0005141b
E: ID_PATH=fc-0x5001738033e70182-lun-1
E: ID_PATH_TAG=fc-0x5001738033e70182-lun-1
E: MAJOR=8
E: MINOR=32
E: MPATH_SBIN_PATH=/sbin
E: SUBSYSTEM=block
E: TAGS=:systemd:
E: USEC_INITIALIZED=14511717

ccjung commented at 2018-12-11 16:05:

What should I ask the SAN team to check from their side?
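
A sketch of host-side information that could be handed to the SAN team to cross-check the zoning (standard sysfs paths and multipath-tools, nothing ReaR-specific; note that ID_PATH in the udevadm output above already shows the storage target port of each path, e.g. fc-0x5001738033e70140-lun-1):

grep -H . /sys/class/fc_host/host*/port_name            # local FC adapter WWPNs (what the zones are built on)
grep -H . /sys/class/fc_remote_ports/rport-*/port_name  # storage target port WWPNs seen through each adapter
multipath -ll                                           # which paths end up in which multipath map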

ccjung commented at 2018-12-11 20:06:

We found the problem. Thank you for pointing us in the right direction to look at the SAN.

Here is the info from the SAN admin:
Looking at eniesdbs107.... for the SAN zoning, everything seemed proper.... at first glance.

The FC's (C12/C13) show the lone XIV disk (100GB rootgv) as a single FC device being delivered thru 4 distinct disk paths.

However - I see the same XIV I/O port - 5001738033e70140 - being used as one of the 2 zoned ports for both:

Fcs0. slot C12 - fab A - OSHost - WWPN = c0507609f1840030

Fcs3. slot C15 - fab A - Host - WWPN = c0507609f1840036

They were specifically requested to be unique, and this appears to be causing us an issue with our restore - we need this rectified, the sooner the better.

It appears all the aliases were using the same WWPN, so the SAN team will fix the zoning.

schabrolles commented at 2018-12-12 09:26:

@suseusr168, thanks for the feedback.
Could you please tell us when everything is OK so that we can close this issue.

ccjung commented at 2018-12-12 13:33:

@schabrolles ,
I was able to restore the LPAR now after the SAN team fixed the zoning.

One question: we noticed that sometimes the target disk mapping is /dev/sda when we first boot up the LPAR for a restore. If we reboot the LPAR and start the restore again, we see the /dev/mapper/ device instead.

Is there a way to fix this, or do we just need to reboot until we see the disk mapping to /dev/mapper/ (which is what we have been doing to work around the /dev/sda mapping)?

You can close out this issue.

Thanks again for your help.

schabrolles commented at 2018-12-12 15:49:

@suseusr168,

If it proposes /dev/sda instead of /dev/mapper (which should not happen), you can interrupt the "rear recover" process and run "multipath -r" to reload the multipath maps.
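
A sketch of that workaround in the rescue shell (assuming the recovery was interrupted before the layout recreation starts):

multipath -r     # reload/recreate the multipath maps
multipath -ll    # verify that the expected /dev/mapper/... device is present
rear recover     # then start the recovery again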

jsmeix commented at 2018-12-12 16:18:

@schabrolles
thank you for solving this issue!

jsmeix commented at 2018-12-13 08:43:

Once again this is an issue where using ReaR revealed
that something was wrong with the original system,
which shows an interesting "by the way" feature of ReaR:

Simply put:
Run "rear mkbackup" on your original system
and "rear recover" on your replacement hardware
to verify that your original system is basically o.k.
(and redo that after each change to your basic system).
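
In command form (a minimal sketch; the -D debug option is optional but produces the debug log files mentioned in this issue template):

rear -D mkbackup    # on the original system
rear -D recover     # on the replacement hardware, booted from the ReaR rescue medium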

Cf.
https://github.com/rear/rear/issues/1907#issuecomment-434218293
that actually belongs to https://github.com/rear/rear/issues/1927

See also
https://github.com/rear/rear/issues/1998#issuecomment-445188468

For another issue where using ReaR revealed two problems
on the original system see for the first problem
(the "Storix" software had installed a bad udev rule)
https://github.com/rear/rear/issues/1796#issuecomment-386996844
https://github.com/rear/rear/issues/1796#issuecomment-387461461
https://github.com/rear/rear/issues/1796#issuecomment-387695097
and for the second problem
(non-working btrfs structure for SUSE snapper usage)
https://github.com/rear/rear/issues/1796#issuecomment-388756113
https://github.com/rear/rear/issues/1796#issuecomment-391288035

I think in the end examples like the ones above prove that, in practice,
"Deployment via the recovery installer is a must", as described in
https://en.opensuse.org/SDB:Disaster_Recovery

I think I will document that "by the way" feature of ReaR more explicitly in
https://en.opensuse.org/SDB:Disaster_Recovery


[Export of Github issue for rear/rear.]