#2677 Issue closed: Recovery system has all needed kernel modules but something hinders loading them: tg3 and smartpqi on HPE ProLiant DL360 Gen10

Labels: bug, fixed / solved / done

hpannenb opened issue at 2021-09-14 07:12:

  • ReaR version ("/usr/sbin/rear -V"):
    Relax-and-Recover 2.6 / Git from the current git master branch.
  • OS version ("cat /etc/os-release" or "lsb_release -a" or "cat /etc/rear/os.conf"):
    OS_VENDOR=RedHatEnterpriseServer
    OS_VERSION=7.9
  • ReaR configuration files ("cat /etc/rear/site.conf" and/or "cat /etc/rear/local.conf"):
####################
##
# BACKUP=NSR (EMC Networker Client only; Legato)
#
# Note: NSR_CLIENT_MODE used
#
##
####################
export TMPDIR="/var/tmp"
export USER_INPUT_TIMEOUT=15

BACKUP=NSR
NSR_CLIENT_MODE=y
NSR_CLIENT_REQUESTRESTORE=n

OUTPUT=ISO
ISO_PREFIX="rear-git-nsr-$HOSTNAME"

OUTPUT_URL=nfs://ts2esesv802/var/backup/

# Static IP (no DHCP!)
USE_DHCLIENT=
USE_STATIC_NETWORKING=yes
USE_RESOLV_CONF=no

# Include only currently loaded modules
MODULES=( 'loaded_modules' )

# Set ROOT Password (optional)
SSH_ROOT_PASSWORD='root'

# Clone all users
CLONE_ALL_USERS=yes

# Execute recovery workaround after restore has been completed
# to set the STICKY bit for /tmp
#
POST_RECOVERY_SCRIPT="chmod +t $TARGET_FS_ROOT/tmp/"
  • Hardware (PC or PowerNV BareMetal or ARM) or virtual machine (KVM guest or PowerVM LPAR):
    BareMetal / HPE ProLiant DL360 Gen10
  • System architecture (x86 compatible or PPC64/PPC64LE or what exact ARM device):
    x86 compatible
  • Firmware (BIOS or UEFI or Open Firmware) and bootloader (GRUB or ELILO or Petitboot):
    BIOS / GRUB
  • Storage (local disk or SSD) and/or SAN (FC or iSCSI or FCoE) and/or multipath (DM or NVMe):
    Local disks
  • Storage layout ("lsblk -ipo NAME,KNAME,PKNAME,TRAN,TYPE,FSTYPE,SIZE,MOUNTPOINT" or "lsblk" as makeshift):
NAME                                   KNAME     PKNAME    TRAN TYPE FSTYPE        SIZE MOUNTPOINT
/dev/sda                               /dev/sda            sas  disk             279.4G 
|-/dev/sda1                            /dev/sda1 /dev/sda       part ext3          500M /boot
|-/dev/sda2                            /dev/sda2 /dev/sda       part swap         43.8G [SWAP]
`-/dev/sda3                            /dev/sda3 /dev/sda       part LVM2_member 235.1G 
  |-/dev/mapper/tproot-lv_root         /dev/dm-0 /dev/sda3      lvm  ext3         10.3G /
  |-/dev/mapper/tproot-lv_var          /dev/dm-1 /dev/sda3      lvm  ext3         33.7G /var
  `-/dev/mapper/tproot-lv_var_textpass /dev/dm-2 /dev/sda3      lvm  ext3        101.2G /var/TextPass
/dev/sdb                               /dev/sdb            sas  disk             279.4G 
`-/dev/sdb1                            /dev/sdb1 /dev/sdb       part ext3        279.4G /data
/dev/sdc                               /dev/sdc            sas  disk             279.4G 
`-/dev/sdc1                            /dev/sdc1 /dev/sdc       part ext3        279.4G /dbamsstore
/dev/sdd                               /dev/sdd            sas  disk             279.4G 
`-/dev/sdd1                            /dev/sdd1 /dev/sdd       part ext3        279.4G /dbamslog
  • Description of the issue (ideally so that others can reproduce it):
    With the config in use, booting the server into the recovery environment from the created ISO (via iLO) neither enables the tg3 interfaces nor loads the smartpqi module, so the disk drives cannot be accessed. The same config with ReaR 2.6 (release) works perfectly fine.
  • Workaround, if any:
    Comment out the config line MODULES=( 'loaded_modules' ) so ALL modules will be loaded by default.
  • Attachments, as applicable ("rear -D mkrescue/mkbackup/recover" debug log files):
    rear-ts9esesv801.log.gz

The tg3 and smartpqi modules are both included in the "rootfs". rear.NvDzgi9heEURFEY is from ReaR 2.6 / Git and rear.Ox3B9qbBmXQipSL is from ReaR 2.6 / 2020-06-17:

[root@ts9esesv801 tmp]# find ./rear.* -type f | egrep "smart|tg3"
./rear.NvDzgi9heEURFEY/rootfs/etc/dracut.conf.d/smartpqi.conf
./rear.NvDzgi9heEURFEY/rootfs/opt/smartstorageadmin/ssacli/bin/rmstr
./rear.NvDzgi9heEURFEY/rootfs/opt/smartstorageadmin/ssacli/bin/mklocks.sh
./rear.NvDzgi9heEURFEY/rootfs/opt/smartstorageadmin/ssacli/bin/ssacli
./rear.NvDzgi9heEURFEY/rootfs/opt/smartstorageadmin/ssacli/bin/ssacli.license
./rear.NvDzgi9heEURFEY/rootfs/opt/smartstorageadmin/ssacli/bin/ssascripting
./rear.NvDzgi9heEURFEY/rootfs/opt/smartstorageadmin/ssacli/bin/ssacli-5.10-44.0.x86_64.txt
./rear.NvDzgi9heEURFEY/rootfs/usr/lib/modules/3.10.0-1160.el7.x86_64/extra/smartpqi/smartpqi.ko
./rear.NvDzgi9heEURFEY/rootfs/usr/lib/modules/3.10.0-1160.el7.x86_64/extra/tg3/tg3.ko
./rear.NvDzgi9heEURFEY/rootfs/usr/lib/firmware/tigon/tg3_tso5.bin
./rear.NvDzgi9heEURFEY/rootfs/usr/lib/firmware/tigon/tg3_tso.bin
./rear.NvDzgi9heEURFEY/rootfs/usr/lib/firmware/tigon/tg357766.bin
./rear.NvDzgi9heEURFEY/rootfs/usr/lib/firmware/tigon/tg3.bin
./rear.Ox3B9qbBmXQipSL/rootfs/etc/dracut.conf.d/smartpqi.conf
./rear.Ox3B9qbBmXQipSL/rootfs/opt/smartstorageadmin/ssacli/bin/rmstr
./rear.Ox3B9qbBmXQipSL/rootfs/opt/smartstorageadmin/ssacli/bin/mklocks.sh
./rear.Ox3B9qbBmXQipSL/rootfs/opt/smartstorageadmin/ssacli/bin/ssacli
./rear.Ox3B9qbBmXQipSL/rootfs/opt/smartstorageadmin/ssacli/bin/ssacli.license
./rear.Ox3B9qbBmXQipSL/rootfs/opt/smartstorageadmin/ssacli/bin/ssascripting
./rear.Ox3B9qbBmXQipSL/rootfs/opt/smartstorageadmin/ssacli/bin/ssacli-5.10-44.0.x86_64.txt
./rear.Ox3B9qbBmXQipSL/rootfs/usr/lib/modules/3.10.0-1160.2.2.el7.x86_64/weak-updates/smartpqi/smartpqi.ko
./rear.Ox3B9qbBmXQipSL/rootfs/usr/lib/modules/3.10.0-1160.2.2.el7.x86_64/weak-updates/tg3/tg3.ko
./rear.Ox3B9qbBmXQipSL/rootfs/usr/lib/firmware/tigon/tg3_tso5.bin
./rear.Ox3B9qbBmXQipSL/rootfs/usr/lib/firmware/tigon/tg3_tso.bin
./rear.Ox3B9qbBmXQipSL/rootfs/usr/lib/firmware/tigon/tg357766.bin
./rear.Ox3B9qbBmXQipSL/rootfs/usr/lib/firmware/tigon/tg3.bin

jsmeix commented at 2021-09-14 07:49:

@hpannenb
I didn't check the details but from plain looking at your description
I wonder if there is perhaps a confusion between what
MODULES means versus what MODULES_LOAD means?

FYI: Via
https://github.com/rear/rear/commit/4480cc00ad0b9a46bbde6f56ca392051b134f949
I merged in particular that kernel modules that should be loaded
during recovery system startup (i.e. those in MODULES_LOAD)
get always copied into the recovery system, see the code in
https://github.com/rear/rear/blob/master/usr/share/rear/build/GNU/Linux/400_copy_modules.sh
so for debugging you should check your "rear -D mkrescue" log
what 400_copy_modules.sh actually does in your particular case
i.e. if 400_copy_modules.sh copies all what lsmod lists on your system.
The crucial part is the modinfo_filename() function which should return
the actual module file that needs to be copied for a module name.
I know that in some obscure cases the modinfo_filename() function
fails to determine the module file for a module name, see its code,
there are rather awkward things that are done to get a module file name.
I already had a lot of "not so much fun" with that.

jsmeix commented at 2021-09-14 08:06:

@hpannenb
does the same config work perfectly fine with ReaR 2.6 (release)
on exactly the same system?
I ask because if you run ReaR 2.6 (release) on a system with
a different kernel there may be differences in module dependencies
which may require more than what MODULES=( 'loaded_modules' ) copies,
cf. the description in default.conf why we have now MODULES=( 'all_modules' )
by default, see https://github.com/rear/rear/issues/1202
how missing kernel modules can result in any kind of awkward unexpected failures
and see https://github.com/rear/rear/issues/1355
why in practice it is impossible to automatically determine
all also needed other modules for a particular module.

hpannenb commented at 2021-09-14 08:08:

@hpannenb
does the same config work perfectly fine with ReaR 2.6 (release)
on exactly the same system?

@jsmeix Yes. It is the same system where I am currently testing ReaR 2.6 / Git versus ReaR 2.6 (release).

As mentioned, the "currently in use" modules tg3 and smartpqi are both copied into the "rootfs", but the booted recovery system does not load them, which prevents activating the NICs as well as accessing the hard disks (via the Smart Controller).

jsmeix commented at 2021-09-14 08:33:

This is why I mentioned MODULES_LOAD.
So does MODULES=( 'loaded_modules' )
plus MODULES_LOAD=( tg3 smartpqi )
make it work?

Currently I have no good idea how to debug what happens
during the very initial ReaR recovery system startup phase
when the kernel/udev automagic module loading happens
i.e. before any ReaR recovery system startup scripts
usr/share/rear/skel/default/etc/scripts/system-setup*
are run. In particular
etc/scripts/system-setup.d/40-start-udev-or-load-modules.sh
loads the modules in etc/modules which contains
those listed in MODULES_LOAD, cf. near the end of
usr/share/rear/build/GNU/Linux/400_copy_modules.sh

@hpannenb
could you compare what etc/modules contains in your case
with ReaR 2.6 (release) versus with our GitHub master code?

hpannenb commented at 2021-09-14 09:15:

[...]

@hpannenb
could you compare what etc/modules contains in your case
with ReaR 2.6 (release) versus with our GitHub master code?

First block is ReaR 2.6 / Git and second block shows the 2.6 / 2020-06-17 (release):

[root@ts9esesv801 tmp]# cat ./rear.NvDzgi9heEURFEY/rootfs/etc/modules
smartpqi
[root@ts9esesv801 tmp]# cat ./rear.Ox3B9qbBmXQipSL/rootfs/etc/modules
smartpqi

I am currently checking the ReaR git version with the MODULES_LOAD parameter set. It does not "work around" the situation: during boot I see FATAL errors that the modules cannot be found.

jsmeix commented at 2021-09-14 10:54:

I.e. with current GitHub master code

MODULES=( 'loaded_modules' )
MODULES_LOAD=( tg3 smartpqi )

the specified module files are present in the ReaR recovery system
but during boot there are FATAL errors that the modules cannot be found.

In contrast
with GitHub master code and its plain default

MODULES=( 'all_modules' )

or with the ReaR 2.6 release and only

MODULES=( 'loaded_modules' )

things "just work".

Currently I am clueless.

I think someone from Red Hat should have a look here
because I don't use RHEL so I cannot reproduce things.

@pcahyna @rmetrich
could you (as time permits) please have a look here?

hpannenb commented at 2021-09-14 11:52:

I tested with the parameter MODULES=( 'tg3' 'smartpqi' ) which is also another workaround (for my current setup) since NIC and SmartController are accessible in the rescue environment.

jsmeix commented at 2021-09-14 12:40:

For me this issue gets more and more inexplicable.
When MODULES=( 'loaded_modules' ) is insufficient
but MODULES=( 'tg3' 'smartpqi' ) is sufficient
it means - as far as I can imagine things here - there are some
special kernel modules that are needed to load tg3 and smartpqi
but those special kernel modules are not shown by lsmod
afterwards in the running system - or something like that ???

hpannenb commented at 2021-09-14 13:14:

So far I cannot determine any other "magic" behind the scenes; but I am certainly lacking knowledge in this area.

[root@ts9esesv801 rear.git.master]# modprobe --ignore-install --set-version 3.10.0-1160.2.2.el7.x86_64 --show-depends smartpqi
insmod /lib/modules/3.10.0-1160.2.2.el7.x86_64/kernel/drivers/scsi/scsi_transport_sas.ko.xz 
insmod /lib/modules/3.10.0-1160.2.2.el7.x86_64/weak-updates/smartpqi/smartpqi.ko

[root@ts9esesv801 rear.git.master]# modprobe --ignore-install --set-version 3.10.0-1160.2.2.el7.x86_64 --show-depends tg3
insmod /lib/modules/3.10.0-1160.2.2.el7.x86_64/kernel/drivers/pps/pps_core.ko.xz 
insmod /lib/modules/3.10.0-1160.2.2.el7.x86_64/kernel/drivers/ptp/ptp.ko.xz 
insmod /lib/modules/3.10.0-1160.2.2.el7.x86_64/weak-updates/tg3/tg3.ko

[root@ts9esesv801 rear.git.master]# cat /etc/depmod.d/smartpqi.conf
override smartpqi 3.10.0-* weak-updates/smartpqi

hpannenb commented at 2021-09-14 14:20:

I performed a last test with my config from above. Booting into the rescue environment and manually executing the following makes the drives and the NICs available:

For smartpqi:

insmod /lib/modules/3.10.0-1160.2.2.el7.x86_64/kernel/drivers/scsi/scsi_transport_sas.ko.xz 
insmod /usr/lib/modules/3.10.0-1160.el7.x86_64/extra/smartpqi/smartpqi.ko

For tg3:

insmod /lib/modules/3.10.0-1160.2.2.el7.x86_64/kernel/drivers/pps/pps_core.ko.xz 
insmod /lib/modules/3.10.0-1160.2.2.el7.x86_64/kernel/drivers/ptp/ptp.ko.xz 
insmod /usr/lib/modules/3.10.0-1160.el7.x86_64/extra/tg3/tg3.ko

So overall the rescue system includes all required data but something hinders the proper load of the required (kernel) modules.

hpannenb commented at 2021-09-15 08:10:

I tried to bisect the ReaR changes a little - most likely related to the mentioned part
https://github.com/rear/rear/blob/master/usr/share/rear/build/GNU/Linux/400_copy_modules.sh
only - and discovered the issue seems to be introduced with commit d2588e8c05aebfcdb3ac5cf1be7732aac2d78eb2 .

jsmeix commented at 2021-09-15 08:11:

@hpannenb
thank you for your testing.
Now the actual issue is more clear.

jsmeix commented at 2021-09-15 08:15:

I still have no idea what that "something" could be
i.e. what the root cause could be.

build/GNU/Linux/400_copy_modules.sh
is about copying kernel modules into the recovery system
so if build/GNU/Linux/400_copy_modules.sh would be the root cause
it would mean kernel modules are missing in the recovery system
but that is not the case so build/GNU/Linux/400_copy_modules.sh
cannot be the root cause - as far as I can understand things currently.

hpannenb commented at 2021-09-15 08:43:

@jsmeix Additional info: The basis for my "bisecting" was the current ReaR from Git Master. The only thing I changed from that basis was the script 400_copy_modules.sh starting from August 2020 onwards with my initial config:

  • Applying the changes of 5c11e507365f6719f3f579abae119b2d7d02740c the rescue environment could load the modules.
  • Applying the changes of d2588e8c05aebfcdb3ac5cf1be7732aac2d78eb2 broke the proper loading.

jsmeix commented at 2021-09-15 08:52:

@hpannenb
could you provide two "rear -D mkrescue" debug log files (on the same system)
one with the changes of 5c11e50 and
the other one with the changes of d2588e8
and your /etc/rear/local.conf file that you used for both.

Thanks in advance!

hpannenb commented at 2021-09-15 09:43:

@jsmeix Attached both the log files.
rear-d2588e8-ts9esesv801.log.gz
rear-5c11e50-ts9esesv801.log.gz

In all cases I used the /etc/rear/site.conf from above.

hpannenb commented at 2021-09-15 10:03:

To have a faster testing opportunity I executed the following command for all different build areas:

Original ReaR 2.6

depmod -n -b /var/tmp/rear.26/rootfs/ -v 3.10.0-1160.2.2.el7.x86_64 | egrep "tg3|smartpqi"
/var/tmp/rear.26/rootfs//lib/modules/3.10.0-1160.2.2.el7.x86_64/weak-updates/smartpqi/smartpqi.ko needs "sas_port_alloc_num": /var/tmp/rear.26/rootfs//lib/modules/3.10.0-1160.2.2.el7.x86_64/kernel/drivers/scsi/scsi_transport_sas.ko.xz
/var/tmp/rear.26/rootfs//lib/modules/3.10.0-1160.2.2.el7.x86_64/weak-updates/tg3/tg3.ko needs "ptp_clock_index": /var/tmp/rear.26/rootfs//lib/modules/3.10.0-1160.2.2.el7.x86_64/kernel/drivers/ptp/ptp.ko.xz
...

ReaR Git Master with commit 5c11e50

depmod -n -b /var/tmp/rear.yPIcv85F6TI0yiP/rootfs/ -v 3.10.0-1160.2.2.el7.x86_64 | egrep "tg3|smartpqi"
/var/tmp/rear.yPIcv85F6TI0yiP/rootfs//lib/modules/3.10.0-1160.2.2.el7.x86_64/weak-updates/smartpqi/smartpqi.ko needs "sas_port_alloc_num": /var/tmp/rear.yPIcv85F6TI0yiP/rootfs//lib/modules/3.10.0-1160.2.2.el7.x86_64/kernel/drivers/scsi/scsi_transport_sas.ko.xz
/var/tmp/rear.yPIcv85F6TI0yiP/rootfs//lib/modules/3.10.0-1160.2.2.el7.x86_64/weak-updates/tg3/tg3.ko needs "ptp_clock_index": /var/tmp/rear.yPIcv85F6TI0yiP/rootfs//lib/modules/3.10.0-1160.2.2.el7.x86_64/kernel/drivers/ptp/ptp.ko.xz
...

ReaR Git Master with commit d2588e8
depmod -n -b /var/tmp/rear.OqgFTxNJRXbrtFI/rootfs/ -v 3.10.0-1160.2.2.el7.x86_64 | egrep "tg3|smartpqi"

ReaR Git Master with commit d2588e8 (commented out MODULES in /etc/rear/site.conf)
...

depmod -n -b /var/tmp/rear.PJo5MBBiGj9PDB6/rootfs/ -v 3.10.0-1160.2.2.el7.x86_64 | egrep "tg3|smartpqi"
/var/tmp/rear.PJo5MBBiGj9PDB6/rootfs//lib/modules/3.10.0-1160.2.2.el7.x86_64/weak-updates/smartpqi/smartpqi.ko needs "sas_port_alloc_num": /var/tmp/rear.PJo5MBBiGj9PDB6/rootfs//lib/modules/3.10.0-1160.2.2.el7.x86_64/kernel/drivers/scsi/scsi_transport_sas.ko.xz
/var/tmp/rear.PJo5MBBiGj9PDB6/rootfs//lib/modules/3.10.0-1160.2.2.el7.x86_64/weak-updates/tg3/tg3.ko needs "ptp_clock_index": /var/tmp/rear.PJo5MBBiGj9PDB6/rootfs//lib/modules/3.10.0-1160.2.2.el7.x86_64/kernel/drivers/ptp/ptp.ko.xz
...

jsmeix commented at 2021-09-15 10:15:

@hpannenb
what exactly is your /etc/rear/site.conf from above?
You posted several things you had tried.
I need to know 100% exactly what config you used.

hpannenb commented at 2021-09-15 10:33:

@jsmeix The printout from the section "ReaR configuration files" when opening this issue.

jsmeix commented at 2021-09-15 12:37:

https://github.com/rear/rear/files/7168911/rear-5c11e50-ts9esesv801.log.gz
contains (excerpts):

++ loaded_modules='...
...
tg3
smartpqi
...'

...

++ loaded_modules_files=' ...
...
/lib/modules/3.10.0-1160.2.2.el7.x86_64/weak-updates/tg3/tg3.ko
/lib/modules/3.10.0-1160.2.2.el7.x86_64/weak-updates/smartpqi/smartpqi.ko
...'

where two times 85 module files are listed.

In contrast
https://github.com/rear/rear/files/7168910/rear-d2588e8-ts9esesv801.log.gz
contains (excerpt):

++ loaded_modules='...
...
tg3
smartpqi
...'

...

++ loaded_modules_files=' ...
...
/usr/lib/modules/3.10.0-1160.2.2.el7.x86_64/kernel/net/sunrpc/sunrpc.ko.xz
/usr/lib/modules/3.10.0-1160.2.2.el7.x86_64/kernel/virt/lib/irqbypass.ko.xz
/usr/lib/modules/3.10.0-1160.el7.x86_64/extra/smartpqi/smartpqi.ko
/usr/lib/modules/3.10.0-1160.el7.x86_64/extra/tg3/tg3.ko'

where two times only 80 module files are listed
i.e. 5 module files less than before and there are
different module files for smartpqi and tg3.

My blind guess is that the different module files
for smartpqi and tg3 are the actual root cause.
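
To pull the paths for one module out of such a debug log, something like the following can be used. This is only a minimal sketch: module_paths_in_log is a hypothetical helper (not part of ReaR), it assumes the module file paths appear in the log as in the excerpts above, and the log must be uncompressed first (e.g. with gunzip).

```shell
# Hypothetical helper (not part of ReaR): print the distinct module file
# paths that a "rear -D mkrescue" debug log mentions for a module name.
module_paths_in_log() {
    local log="$1"
    local module="$2"
    # Match from a slash up to ".../<module>.ko" plus an optional
    # suffix like ".xz"; spaces, quotes and colons end a path.
    grep -o "/[^ ':]*/$module\.ko[^ ':]*" "$log" | sort -u
}

# Example call (the log file name is a placeholder):
# module_paths_in_log rear-ts9esesv801.log tg3
```

Running it against both logs and comparing the output shows at a glance whether the module is taken from a different directory.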

On my openSUSE Leap 15.2 homeoffice laptop I have

/lib/modules/5.3.18-lp152.87-default/kernel/drivers/net/ethernet/broadcom/tg3.ko
/lib/modules/5.3.18-lp152.87-default/kernel/drivers/scsi/smartpqi/smartpqi.ko

both as real files (no links).

I guess that the different module files come from
readlink -e $module_filename in the modinfo_filename function that is currently at
https://github.com/rear/rear/blob/master/usr/share/rear/build/GNU/Linux/400_copy_modules.sh#L64

jsmeix commented at 2021-09-15 12:47:

@hpannenb
please check what kind of files

/lib/modules/3.10.0-1160.2.2.el7.x86_64/weak-updates/tg3/tg3.ko
/lib/modules/3.10.0-1160.2.2.el7.x86_64/weak-updates/smartpqi/smartpqi.ko

/usr/lib/modules/3.10.0-1160.el7.x86_64/extra/smartpqi/smartpqi.ko
/usr/lib/modules/3.10.0-1160.el7.x86_64/extra/tg3/tg3.ko

are on your system
and if - just a blind guess - perhaps replacing
readlink -e $module_filename in the modinfo_filename function
by something like plain echo $module_filename makes it work for you?
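
A minimal sketch of such a check, assuming GNU readlink. The function name describe_module_file is hypothetical and not part of ReaR; it only reports whether a path is a regular file, a symlink (with its fully resolved target), a dangling symlink, or missing.

```shell
# Hypothetical helper (not part of ReaR): classify a module file path.
describe_module_file() {
    local path="$1"
    local target
    if [ -L "$path" ] ; then
        # readlink -e prints the fully resolved target and
        # fails if the symlink is dangling
        if target="$( readlink -e "$path" )" ; then
            echo "symlink $path -> $target"
        else
            echo "dangling symlink $path"
        fi
    elif [ -f "$path" ] ; then
        echo "regular file $path"
    else
        echo "missing $path"
    fi
}

# Example call with one of the paths in question:
# describe_module_file /lib/modules/3.10.0-1160.2.2.el7.x86_64/weak-updates/tg3/tg3.ko
```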

hpannenb commented at 2021-09-15 12:49:

@jsmeix This is the current situation:

find /lib/modules/ -name "*smartpqi*" -type l | sort
/lib/modules/3.10.0-1127.el7.x86_64/weak-updates/smartpqi/smartpqi.ko
/lib/modules/3.10.0-1160.2.2.el7.x86_64/weak-updates/smartpqi/smartpqi.ko
/lib/modules/3.10.0-957.12.1.el7.x86_64/weak-updates/smartpqi/smartpqi.ko

find /lib/modules/ -name "*smartpqi*" -type f | sort
/lib/modules/3.10.0-1127.el7.x86_64/kernel/drivers/scsi/smartpqi/smartpqi.ko.xz
/lib/modules/3.10.0-1160.2.2.el7.x86_64/kernel/drivers/scsi/smartpqi/smartpqi.ko.xz
/lib/modules/3.10.0-1160.el7.x86_64/extra/smartpqi/smartpqi.ko
/lib/modules/3.10.0-957.12.1.el7.x86_64/kernel/drivers/scsi/smartpqi/smartpqi.ko.xz

readlink -e /lib/modules/3.10.0-1160.2.2.el7.x86_64/weak-updates/smartpqi/smartpqi.ko
/usr/lib/modules/3.10.0-1160.el7.x86_64/extra/smartpqi/smartpqi.ko


jsmeix commented at 2021-09-15 13:13:

I really hate symbolic links and what comes out when using them :-(

Here it seems things work when cp -L is used to deal with symbolic links
but things fail when readlink -e is used to deal with symbolic links.

Furthermore in the MODULES=( 'all_modules' ) case
we don't call cp -L but cp -a which preserves links
so I wonder why it worked with MODULES=( 'all_modules' )?

Additionally it looks as if there is some kernel module version inconsistency
because 3.10.0-1160.2.2.el7.x86_64 is not the same as 3.10.0-1160.el7.x86_64
which might be the reason why

depmod -b /var/tmp/rear.OqgFTxNJRXbrtFI/rootfs -v 3.10.0-1160.2.2.el7.x86_64

could not generate sufficient module dependencies because
tg3 and smartpqi are under /usr/lib/modules/3.10.0-1160.el7.x86_64/
and not under /usr/lib/modules/3.10.0-1160.2.2.el7.x86_64/
where all other kernel modules are.
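
A quick way to check this suspicion on a build area is to test whether a module file exists below the kernel version directory that depmod is asked about. A minimal sketch: module_in_rootfs is a hypothetical helper (not part of ReaR), and it only checks presence, not module dependencies.

```shell
# Hypothetical helper (not part of ReaR): succeed if a file for the
# module exists below lib/modules/<kernel_version> in the given rootfs.
module_in_rootfs() {
    local rootfs="$1"
    local kernel_version="$2"
    local module="$3"
    local moddir="$rootfs/lib/modules/$kernel_version"
    [ -d "$moddir" ] || return 1
    # -L follows symlinks, -type f then skips dangling ones;
    # "$module.ko*" also matches compressed files like tg3.ko.xz
    find -L "$moddir" -type f -name "$module.ko*" | grep -q .
}

# Example call (the build area path is a placeholder):
# module_in_rootfs /var/tmp/rear.XXXX/rootfs 3.10.0-1160.2.2.el7.x86_64 tg3 || echo "tg3 missing"
```

If the helper fails for the running kernel version but succeeds for another version directory, the module was copied to a place where depmod for the running kernel does not look.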

hpannenb commented at 2021-09-15 13:23:

@jsmeix I do not know the details but there is no known inconsistency. The links etc. might have been a side effect of the weak-updates (???) performed when upgrading the kernel from RHEL 7.x to 7.9.

Using an echo instead of a readlink -e in https://github.com/rear/rear/blob/master/usr/share/rear/build/GNU/Linux/400_copy_modules.sh#L64
shows the following in the build area which looks good to me:

depmod -n -b /var/tmp/rear.OP8tLd6LP49WAL8/rootfs/ -v 3.10.0-1160.2.2.el7.x86_64 | egrep "tg3|smartpqi" | head -10
/var/tmp/rear.OP8tLd6LP49WAL8/rootfs//lib/modules/3.10.0-1160.2.2.el7.x86_64/weak-updates/smartpqi/smartpqi.ko needs "sas_port_alloc_num": /var/tmp/rear.OP8tLd6LP49WAL8/rootfs//lib/modules/3.10.0-1160.2.2.el7.x86_64/kernel/drivers/scsi/scsi_transport_sas.ko.xz
/var/tmp/rear.OP8tLd6LP49WAL8/rootfs//lib/modules/3.10.0-1160.2.2.el7.x86_64/weak-updates/tg3/tg3.ko needs "ptp_clock_index": /var/tmp/rear.OP8tLd6LP49WAL8/rootfs//lib/modules/3.10.0-1160.2.2.el7.x86_64/kernel/drivers/ptp/ptp.ko.xz
weak-updates/smartpqi/smartpqi.ko: kernel/drivers/scsi/scsi_transport_sas.ko.xz
weak-updates/tg3/tg3.ko: kernel/drivers/ptp/ptp.ko.xz kernel/drivers/pps/pps_core.ko.xz
alias pci:v00009005d0000028Fsv*sd*bc*sc*i* smartpqi
alias pci:v00009005d0000028Fsv00001458sd00001000bc*sc*i* smartpqi
...

P.S.: @jsmeix After booting the created ISO into the rescue environment, the modules were loaded and the NICs and hard disks were shown/enabled. So the echo made it work in my use case.

jsmeix commented at 2021-09-15 13:28:

@pcahyna @rmetrich
could you please have a look here?

In particular whether or not the kernel modules files and directories
and their symlinks are correct here for RHEL 7.9.

On my openSUSE Leap 15.2 homeoffice laptop
I have neither symlinks in /lib/modules nor a /usr/lib/modules directory
so I cannot easily reproduce things.

jsmeix commented at 2021-09-15 13:34:

@hpannenb
I will further enhance the modinfo_filename function
to make things work fail-safe also in cases like this one.
I want to keep that the modinfo_filename function exits with non-zero exit code
if $module_filename is a dangling symlink or otherwise invalid.
I think I call readlink -e $module_filename only as a test
and if that succeeds I output with plain echo $module_filename
and leave it to cp -L to follow symbolic links.
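
A minimal sketch of that idea, assuming GNU readlink. The function name module_file_to_copy is a hypothetical stand-in for the last step of modinfo_filename; the code that actually ends up in 400_copy_modules.sh may look different.

```shell
# Hypothetical stand-in (not the actual ReaR code) for the last step
# of modinfo_filename: validate the path but output it unresolved.
module_file_to_copy() {
    local module_filename="$1"
    # readlink -e serves only as a test here: it fails for a dangling
    # symlink or a missing file, and its output is discarded.
    readlink -e "$module_filename" >/dev/null || return 1
    # Output the unresolved name so that a later 'cp -L' copies the
    # symlink target content under the symlink's own path.
    echo "$module_filename"
}
```

With this, a weak-updates symlink keeps its weak-updates path in the output, while a broken link still results in a non-zero exit code.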

pcahyna commented at 2021-09-15 13:45:

@hpannenb where do /usr/lib/modules/3.10.0-1160.el7.x86_64/extra/smartpqi/smartpqi.ko and /usr/lib/modules/3.10.0-1160.el7.x86_64/extra/tg3/tg3.ko come from? I am surprised that you have modules in the extra directory. Are those vendor-supplied drivers, overriding those that are packaged with the kernel RPM?

hpannenb commented at 2021-09-15 13:54:

@pcahyna I suppose they were installed by the HPE SPP RPMs that I had to apply. So it is not RHEL based but side loaded "officially" from HPE's packages:

rpm -qa | egrep "tg3|smartpqi"
kmod-tg3-3.139b-1.rhel7u9.x86_64
kmod-smartpqi-2.1.8-040.rhel7u9.x86_64

and

rpm -ql kmod-tg3-3.139b-1.rhel7u9.x86_64
/etc/depmod.d/tg3.conf
/lib/modules/3.10.0-1160.el7.x86_64
/lib/modules/3.10.0-1160.el7.x86_64/extra
/lib/modules/3.10.0-1160.el7.x86_64/extra/tg3
/lib/modules/3.10.0-1160.el7.x86_64/extra/tg3/tg3.ko
/usr/share/smartupdate/kmod-tg3/component.xml

rpm -ql kmod-smartpqi-2.1.8-040.rhel7u9.x86_64
/etc/depmod.d/smartpqi.conf
/etc/dracut.conf.d/smartpqi.conf
/lib/modules/3.10.0-1160.el7.x86_64
/lib/modules/3.10.0-1160.el7.x86_64/extra
/lib/modules/3.10.0-1160.el7.x86_64/extra/smartpqi
/lib/modules/3.10.0-1160.el7.x86_64/extra/smartpqi/smartpqi.ko
/usr/share/smartupdate/kmod-smartpqi/component.xml

pcahyna commented at 2021-09-15 13:58:

I suspect that there might lie the problem, but I am not that familiar with drivers in RHEL, in particular with third-party ones overriding the distributed ones with the same name (indeed, there are modules of the same name in the kernel package). @rmetrich do you know whether that's supported, and if so how that's supposed to work, please?

hpannenb commented at 2021-09-15 14:42:

@hpannenb
I will further enhance the modinfo_filename function
to make things work fail-safe also in cases like this one.
I want to keep that the modinfo_filename function exits with non-zero exit code
if $module_filename is a dangling symlink or otherwise invalid.
I think I call readlink -e $module_filename only as a test
and if that succeeds I output with plain echo $module_filename
and leave it to cp -L to follow symbolic links.

@jsmeix at least the readlink -e to echo change did the job for my use case.

github-actions commented at 2021-11-15 02:09:

Stale issue message

hpannenb commented at 2021-11-22 17:53:

Quite a nasty bot closing my beloved but not yet fixed issue... :-)

jsmeix commented at 2021-11-23 07:07:

Quite a friendly human reopening your beloved but not yet fixed issue... :-)

jsmeix commented at 2021-12-16 13:51:

FYI:
There are some more issues in the specific modules including code
see https://github.com/rear/rear/pull/2728#issuecomment-995799489

hpannenb commented at 2021-12-16 19:31:

...I think I call readlink -e $module_filename only as a test and if that succeeds I output with plain echo $module_filename and leave it to cp -L to follow symbolic links.

@jsmeix I would support Your approach since readlink -e would just test for a broken kernel module link instead of renaming the entire pathname in use for such kind of soft-linked kernel modules.

jsmeix commented at 2021-12-20 11:57:

@nolocimes reported
https://github.com/rear/rear/issues/2615#issuecomment-997058554
and that made me finally understand what is wrong with
using the readlink -e output:

# cd /tmp
# mkdir issue2677
# cd issue2677

# mkdir thisdir
# echo this >thisdir/thisfile
# chmod a+rwx thisdir/thisfile

# mkdir thatdir
# cd thatdir/
# ln -s ../thisdir/thisfile thisfilesymlink
# cd ..

# mkdir otherdir

# cp -vL thatdir/thisfilesymlink otherdir
'thatdir/thisfilesymlink' -> 'otherdir/thisfilesymlink'

# cp -v $( readlink -e thatdir/thisfilesymlink ) otherdir
'/tmp/issue2677/thisdir/thisfile' -> 'otherdir/thisfile'

# mkdir otherdir2

# cp -vaL thatdir/thisfilesymlink otherdir2
'thatdir/thisfilesymlink' -> 'otherdir2/thisfilesymlink'

# cp -vaL $( readlink -e thatdir/thisfilesymlink ) otherdir2
'/tmp/issue2677/thisdir/thisfile' -> 'otherdir2/thisfile'

# mkdir otherdir3

# cp -vaL --parents thatdir/thisfilesymlink otherdir3
thatdir -> otherdir3/thatdir
'thatdir/thisfilesymlink' -> 'otherdir3/thatdir/thisfilesymlink'

# cp -vaL --parents $( readlink -e thatdir/thisfilesymlink ) otherdir3
/tmp -> otherdir3/tmp
/tmp/issue2677 -> otherdir3/tmp/issue2677
/tmp/issue2677/thisdir -> otherdir3/tmp/issue2677/thisdir
'/tmp/issue2677/thisdir/thisfile' -> 'otherdir3/tmp/issue2677/thisdir/thisfile'

# find thisdir thatdir otherdir otherdir2 otherdir3 -ls
... drwxr-xr-x ... 4096 Dec 20 12:16 thisdir
... -rwxrwxrwx ...    5 Dec 20 12:16 thisdir/thisfile
... drwxr-xr-x ... 4096 Dec 20 12:17 thatdir
... lrwxrwxrwx ...   19 Dec 20 12:17 thatdir/thisfilesymlink -> ../thisdir/thisfile
... drwxr-xr-x ... 4096 Dec 20 12:51 otherdir
... -rwxr-xr-x ...    5 Dec 20 12:50 otherdir/thisfilesymlink
... -rwxr-xr-x ...    5 Dec 20 12:51 otherdir/thisfile
... drwxr-xr-x ... 4096 Dec 20 12:51 otherdir2
... -rwxrwxrwx ...    5 Dec 20 12:16 otherdir2/thisfilesymlink
... -rwxrwxrwx ...    5 Dec 20 12:16 otherdir2/thisfile
... drwxr-xr-x ... 4096 Dec 21 10:10 otherdir3
... drwxr-xr-x ... 4096 Dec 20 12:17 otherdir3/thatdir
... -rwxrwxrwx ...    5 Dec 20 12:16 otherdir3/thatdir/thisfilesymlink
... drwxrwxrwt ... 4096 Dec 21 10:00 otherdir3/tmp
... drwxr-xr-x ... 4096 Dec 21 10:09 otherdir3/tmp/issue2677
... drwxr-xr-x ... 4096 Dec 20 12:16 otherdir3/tmp/issue2677/thisdir
... -rwxrwxrwx ...    5 Dec 20 12:16 otherdir3/tmp/issue2677/thisdir/thisfile

Using cp -L copies the symlink target content
as a new regular file with file name as the name of the symlink.

Using the readlink -e output copies the symlink target content
as a new regular file with file name as the symlink target file name.

So the difference is the file name of the copied content
where cp -L preserves its original name
while using the readlink -e output changes the name
so the copied content can no longer be found under its original name.

Using additionally cp ... --parents makes the result even worse
when the readlink -e output is used because also the sub-directory
structure becomes different so the copied content can no longer
be found under its original sub-directory under its original name.
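
Applied to a layout like the one in this issue, the effect can be reproduced with a small script. This is only a sketch: the directory names "new" and "old" are made up and stand in for the two kernel version directories, the module file is a dummy, and GNU cp and readlink are assumed.

```shell
# Simulated layout in a scratch directory (all names are made up)
work="$( mktemp -d )"
cd "$work" || exit 1
mkdir -p lib/modules/new/weak-updates/tg3 lib/modules/old/extra/tg3
mkdir -p rootfs_cpL rootfs_readlink
echo "dummy module" > lib/modules/old/extra/tg3/tg3.ko
ln -s ../../../old/extra/tg3/tg3.ko lib/modules/new/weak-updates/tg3/tg3.ko

module_file=lib/modules/new/weak-updates/tg3/tg3.ko

# Variant 1: cp -L follows the symlink, the path below "new" is kept
cp -L --parents "$module_file" rootfs_cpL

# Variant 2: copy the readlink -e output, the file ends up below "old".
# The scratch directory prefix is stripped so that --parents
# recreates only the relative part of the resolved path.
resolved="$( readlink -e "$module_file" )"
cp --parents "${resolved#"$( pwd -P )"/}" rootfs_readlink

# Show where the module file landed in each case
find rootfs_cpL rootfs_readlink -type f | sort
```

In the first variant the file is found below lib/modules/new/weak-updates, where a depmod run for "new" would look for it. In the second variant it only exists below lib/modules/old/extra.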

jsmeix commented at 2021-12-20 12:30:

https://github.com/rear/rear/pull/2731
is a not yet tested attempt to fix this issue.

hpannenb commented at 2021-12-20 14:26:

@jsmeix I successfully tested Your branch jsmeix-fix-modinfo_filename for my usecase: I could boot the rescue image having access to both NIC and SmartController / Disks. So IMHO those changes solve my issue.

P.S.: Under /usr/lib/modules/ the weak-updates for the side-loaded modules are available in the image now.

jsmeix commented at 2021-12-21 09:29:

I updated https://github.com/rear/rear/issues/2677#issuecomment-997859219
how using cp ... --parents makes it worse when the 'readlink -e' output is used.

jsmeix commented at 2021-12-21 14:20:

With https://github.com/rear/rear/pull/2731 merged
this particular issue is fixed.

There are still other issues with build/GNU/Linux/400_copy_modules.sh
see https://github.com/rear/rear/issues/2727
and https://github.com/rear/rear/pull/2728


[Export of Github issue for rear/rear.]