#2739 Issue closed: Fails to copy all kernel modules if there are dangling symlinks in /lib/modules/...

Labels: enhancement, fixed / solved / done

DamaniN opened issue at 2022-01-13 19:41:

Relax-and-Recover (ReaR) Issue Template

Fill in the following items before submitting a new issue
(quick response is not guaranteed with free support):

  • ReaR version ("/usr/sbin/rear -V"):
# rear -V
Relax-and-Recover 2.6-git.4664.48ebb13.fixusrlib64forcdm.changed / 2021-12-23
  • OS version ("cat /etc/os-release" or "lsb_release -a" or "cat /etc/rear/os.conf"):
# more /etc/redhat-release
Red Hat Enterprise Linux Server release 7.9 (Maipo)
  • ReaR configuration files ("cat /etc/rear/site.conf" and/or "cat /etc/rear/local.conf"):
# cat /etc/rear/site.conf
cat: /etc/rear/site.conf: No such file or directory

# cat /etc/rear/local.conf

# This file etc/rear/local.conf is intended for the user's
# manual configuration of Relax-and-Recover (ReaR).
# For configuration through packages and other automated means
# we recommend a separated file named site.conf next to this file
# and leave local.conf as is (ReaR upstream will never ship a site.conf).
# The default OUTPUT=ISO creates the ReaR rescue medium as ISO image.
# You need to specify your particular backup and restore method for your data
# as the default BACKUP=REQUESTRESTORE does not really do that (see "man rear").
# Configuration variables are documented in /usr/share/rear/conf/default.conf
# and the examples in /usr/share/rear/conf/examples/ can be used as templates.
# ReaR reads the configuration files via the bash builtin command 'source'
# so bash syntax like VARIABLE="value" (no spaces at '=') is mandatory.
# Because 'source' executes the content as bash scripts you can run commands
# within your configuration files, in particular commands to set different
# configuration values depending on certain conditions as you need like
# CONDITION_COMMAND && VARIABLE="special_value" || VARIABLE="usual_value"
# but that means CONDITION_COMMAND gets always executed when 'rear' is run
# so ensure nothing can go wrong if you run commands in configuration files.
OUTPUT=ISO
BACKUP=REQUESTRESTORE
  • Hardware vendor/product (PC or PowerNV BareMetal or ARM) or VM (KVM guest or PowerVM LPAR):
vSphere 6.7 virtual machine
  • System architecture (x86 compatible or PPC64/PPC64LE or what exact ARM device):
x86_64
  • Firmware (BIOS or UEFI or Open Firmware) and bootloader (GRUB or ELILO or Petitboot):
UEFI
GRUB
  • Storage (local disk or SSD) and/or SAN (FC or iSCSI or FCoE) and/or multipath (DM or NVMe):
VMware datastore VMDK.
  • Storage layout ("lsblk -ipo NAME,KNAME,PKNAME,TRAN,TYPE,FSTYPE,LABEL,SIZE,MOUNTPOINT"):
# lsblk -ipo NAME,KNAME,PKNAME,TRAN,TYPE,FSTYPE,LABEL,SIZE,MOUNTPOINT
NAME                                        KNAME     PKNAME    TRAN   TYPE FSTYPE      LABEL  SIZE MOUNTPOINT
/dev/sda                                    /dev/sda                   disk                   24.4G
|-/dev/sda1                                 /dev/sda1 /dev/sda         part vfat               200M /boot/efi
|-/dev/sda2                                 /dev/sda2 /dev/sda         part xfs                  1G /boot
`-/dev/sda3                                 /dev/sda3 /dev/sda         part LVM2_member       23.2G
  |-/dev/mapper/rhel_rhel7--9--packer2-root /dev/dm-0 /dev/sda3        lvm  xfs               20.8G /
  `-/dev/mapper/rhel_rhel7--9--packer2-swap /dev/dm-1 /dev/sda3        lvm  swap               2.5G [SWAP]
/dev/sr0                                    /dev/sr0            sata   rom                    1024M
  • Description of the issue (ideally so that others can reproduce it):

When running the rear -dvV mkrescue command, it gives the error:

Copying all kernel modules in /lib/modules/3.10.0-1160.49.1.el7.x86_64 (MODULES contains 'all_modules')
ERROR: Failed to copy all kernel modules in /lib/modules/3.10.0-1160.49.1.el7.x86_64
Some latest log messages since the last called script 400_copy_modules.sh:
  2022-01-13 10:48:35.463675697 Entering debugscript mode via 'set -x'.
  2022-01-13 10:48:35.477823907 Copying all kernel modules in /lib/modules/3.10.0-1160.49.1.el7.x86_64 (MODULES contains 'all_modules')
Aborting due to an error, check /var/log/rear/rear-rhel7-9-packer2.log for details

From the debug output, it looks like ReaR is running the command:

cp --verbose -t /var/tmp/rear.wFxcnNuhnHxRVPZ/rootfs -a -L --parents /lib/modules/3.10.0-1160.49.1.el7.x86_64

When I run this command manually, it gives this message:

# cp --verbose -t /var/tmp/rear.wFxcnNuhnHxRVPZ/rootfs -a -L --parents /lib/modules/3.10.0-1160.49.1.el7.x86_64 | grep cp:
cp: cannot stat ‘/lib/modules/3.10.0-1160.49.1.el7.x86_64/build’: No such file or directory
cp: cannot stat ‘/lib/modules/3.10.0-1160.49.1.el7.x86_64/source’: No such file or directory

NOTE: I added the grep command to remove all the successful copy messages.

If I run ls -la on those two files, I see this:

# ls -la /lib/modules/3.10.0-1160.49.1.el7.x86_64/build /lib/modules/3.10.0-1160.49.1.el7.x86_64/source
lrwxrwxrwx. 1 root root 44 Jan  4 11:54 /lib/modules/3.10.0-1160.49.1.el7.x86_64/build -> /usr/src/kernels/3.10.0-1160.49.1.el7.x86_64
lrwxrwxrwx. 1 root root  5 Jan  4 11:55 /lib/modules/3.10.0-1160.49.1.el7.x86_64/source -> build

If I run ls -la on /usr/src/kernels/3.10.0-1160.49.1.el7.x86_64 we see that it does not exist:

# ls -la /usr/src/kernels/3.10.0-1160.49.1.el7.x86_64
ls: cannot access /usr/src/kernels/3.10.0-1160.49.1.el7.x86_64: No such file or directory

It seems that this invocation of the cp command cannot deal with links that go nowhere.

  • Workaround, if any:

If I touch the missing file, the rear mkrescue command will work:

touch /usr/src/kernels/3.10.0-1160.49.1.el7.x86_64

  • Attachments, as applicable ("rear -D mkrescue/mkbackup/recover" debug log files):

Debug log:
rear-rhel7-9-packer2.log

To paste verbatim text like command output or file content,
include it between a leading and a closing line of three backticks like

# rear -Ddv mkrescue
Relax-and-Recover 2.6-git.4664.48ebb13.fixusrlib64forcdm.changed / 2021-12-23
Running rear mkrescue (PID 21018 date 2022-01-13 11:37:15)
Command line options: /sbin/rear -Ddv mkrescue
Using log file: /var/log/rear/rear-rhel7-9-packer2.log
Using build area: /var/tmp/rear.dT1K0LFoFNo8LCF
Running 'init' stage ======================
Running workflow mkrescue on the normal/original system
Running 'prep' stage ======================
Found EFI system partition /dev/sda1 on /boot/efi type vfat
Using UEFI Boot Loader for Linux (USING_UEFI_BOOTLOADER=1)
Using '/bin/mkisofs' to create ISO filesystem images
Using autodetected kernel '/boot/vmlinuz-3.10.0-1160.49.1.el7.x86_64' as kernel in the recovery system
Running 'layout/save' stage ======================
Creating disk layout
Overwriting existing disk layout file /var/lib/rear/layout/disklayout.conf
Disabling excluded components in /var/lib/rear/layout/disklayout.conf
Using guessed bootloader 'EFI' for 'rear recover' (found in first bytes on /dev/sda)
Verifying that the entries in /var/lib/rear/layout/disklayout.conf are correct
Created disk layout (check the results in /var/lib/rear/layout/disklayout.conf)
Running 'rescue' stage ======================
Creating recovery system root filesystem skeleton layout
Handling network interface 'ens192'
ens192 is a physical device
Handled network interface 'ens192'
Included current keyboard mapping (via 'dumpkeys -f')
Included default US keyboard mapping /lib/kbd/keymaps/legacy/i386/qwerty/defkeymap.map.gz
Included other keyboard mappings in /lib/kbd/keymaps
To log into the recovery system via ssh set up /root/.ssh/authorized_keys or specify SSH_ROOT_PASSWORD
Trying to find what to use as UEFI bootloader...
Trying to find a 'well known file' to be used as UEFI bootloader...
Using '/boot/efi/EFI/redhat/grubx64.efi' as UEFI bootloader file
Copying logfile /var/log/rear/rear-rhel7-9-packer2.log into initramfs as '/tmp/rear-rhel7-9-packer2-partial-2022-01-13T11:37:20-0800.log'
Running 'build' stage ======================
Copying files and directories
Copying binaries and libraries
Copying all kernel modules in /lib/modules/3.10.0-1160.49.1.el7.x86_64 (MODULES contains 'all_modules')
ERROR: Failed to copy all kernel modules in /lib/modules/3.10.0-1160.49.1.el7.x86_64
Some latest log messages since the last called script 400_copy_modules.sh:
  2022-01-13 11:37:41.686293147 Entering debugscript mode via 'set -x'.
  2022-01-13 11:37:41.699629083 Copying all kernel modules in /lib/modules/3.10.0-1160.49.1.el7.x86_64 (MODULES contains 'all_modules')
Aborting due to an error, check /var/log/rear/rear-rhel7-9-packer2.log for details
Exiting rear mkrescue (PID 21018) and its descendant processes ...
Running exit tasks
To remove the build area you may use (with caution): rm -Rf --one-file-system /var/tmp/rear.dT1K0LFoFNo8LCF
Terminated

hpannenb commented at 2022-01-14 08:40:

@DamaniN I could reproduce this on an actual CentOS 7.9 VM with current github version of ReaR. Your current workaround would be to use ReaR 2.6 release / Relax-and-Recover 2.6 / 2020-06-17 instead.
It also shows the broken symlinks but continues creating the rescue image.

jsmeix commented at 2022-01-14 09:26:

Simple reproducer:

# mkdir cp-L_test

# cd cp-L_test

# echo foo > foo

# ln -s foo foosymlink

# ln -s bar barsymlink

# file *
barsymlink: broken symbolic link to bar
foo:        ASCII text
foosymlink: symbolic link to foo

# mkdir /tmp/somewhere

# cp -t /tmp/somewhere -L *
cp: cannot stat 'barsymlink': No such file or directory

# echo $?
1

# file /tmp/somewhere/*
/tmp/somewhere/foo:        ASCII text
/tmp/somewhere/foosymlink: ASCII text

Unfortunately there is no cp option to let it skip/ignore broken symlinks and
unfortunately the cp exit code cannot be used to distinguish this case from real errors.

jsmeix commented at 2022-01-14 10:12:

This happens since
https://github.com/rear/rear/commit/bd8e32c06bf49a8630ade196be54193aa129d146

Update 400_copy_modules.sh
...
Additionally in case of MODULES=( 'all_modules' ) also use 'cp -L' to
copy the actual content to avoid dangling symlinks in the recovery system.
...
     if IsInArray "all_modules" "${MODULES[@]}" ; then
...
-        if ! cp $verbose -t $ROOTFS_DIR -a --parents ...
+        if ! cp $verbose -t $ROOTFS_DIR -a -L --parents ...

so the simpler and better workaround is to remove that '-L' there
in usr/share/rear/build/GNU/Linux/400_copy_modules.sh
so you can still use our GitHub master code and test it on your systems
and report issues to us which we need from you and which is much appreciated
because for me it "just works" (of course - otherwise I would not have commited it).

jsmeix commented at 2022-01-14 10:47:

I want to keep the '-L' functionality because of the reasoning in
https://github.com/rear/rear/issues/2677 therein see in particular
https://github.com/rear/rear/issues/2677#issuecomment-997859219

So now I have to also care about dangling symlinks on the original system.

hpannenb commented at 2022-01-14 11:28:

@DamaniN A simple workaround is to add a

MODULES=( 'loaded_modules' )

to Your site.conf if You do not need to use the rescue image for migrations. In my VM Your issue just occurs for the all_modules (default) use case.

hpannenb commented at 2022-01-14 16:16:

I want to keep the '-L' functionality because of the reasoning in #2677 therein see in particular #2677 (comment)

So now I have to also care about dangling symlinks on the original system.

Maybe there is an easier approach:
I suppose You can keep the -L in the MODULES=( 'loaded_modules' ) case resuming the struggle in #2677 was related to the readlink -e part determining the real path of the loaded modules.

For the case MODULES = ( 'all_modules' ) You might revert to the previous 2.6's cp -a solution since this did not cause an issue.

DamaniN commented at 2022-01-14 18:50:

@jsmeix, thanks for the workarounds. I would suggest that since adding the -L option to cp seems to be a breaking change to existing systems, users should have to opt-in to it. Providing some flag to enable -L on cp may be preferred for the GA release of the next version of ReaR. Otherwise, many working systems may be broken by the update.

jsmeix commented at 2022-01-17 09:21:

When on the original system there is
/lib/modules/symlink -> /somewhere/else/kernel_module
the recovery system must contain the actual kernel module
and not only the dangling/broken symlink /lib/modules/symlink.

This is why '-L' is also needed for MODULES = ( 'all_modules' ).
If '-L' would be omitted for MODULES = ( 'all_modules' )
and kernel_module is really needed (e.g. when it is loaded by the kernel)
then the recovery system would work with MODULES=( 'loaded_modules' )
but it would fail with the default MODULES = ( 'all_modules' )
which means ReaR would result by default a broken recovery system.

The latter would be a much more severe bug compared to the current
error exit during "rear mkrescue" because of a bug in the original system
(dangling/broken symlinks can happen but they cannot be "right")
that can easily be fixed by the admin of the original system
(remove the broken symlink if is is not needed or install
the actual content where the dangling symlink points to).

When Linux distributions have /lib/modules/ with broken symlinks
it is a bug in what they provide.
E.g. on my one openSUSE leap 15.3 system there is no symlink in /lib/modules/
(find /lib/modules/ -type l shows nothing)
and on my other openSUSE leap 15.3 system with Nvidia drivers I have

# find /lib/modules/ -type l | xargs file
/lib/modules/5.3.18-59.37-default/source:                                 symbolic link to /usr/src/linux-5.3.18-59.37
/lib/modules/5.3.18-59.37-default/weak-updates/updates/nvidia-uvm.ko:     symbolic link to /lib/modules/5.3.18-57-default/updates/nvidia-uvm.ko
/lib/modules/5.3.18-59.37-default/weak-updates/updates/nvidia-peermem.ko: symbolic link to /lib/modules/5.3.18-57-default/updates/nvidia-peermem.ko
/lib/modules/5.3.18-59.37-default/weak-updates/updates/nvidia-modeset.ko: symbolic link to /lib/modules/5.3.18-57-default/updates/nvidia-modeset.ko
/lib/modules/5.3.18-59.37-default/weak-updates/updates/nvidia-drm.ko:     symbolic link to /lib/modules/5.3.18-57-default/updates/nvidia-drm.ko
/lib/modules/5.3.18-59.37-default/weak-updates/updates/nvidia.ko:         symbolic link to /lib/modules/5.3.18-57-default/updates/nvidia.ko
/lib/modules/5.3.18-59.37-default/build:                                  symbolic link to /usr/src/linux-5.3.18-59.37-obj/x86_64/default

A problem with those particular symlinks

/lib/modules/5.3.18-59.37-default/source -> /usr/src/linux-5.3.18-59.37
/lib/modules/5.3.18-59.37-default/build -> /usr/src/linux-5.3.18-59.37-obj/x86_64/default

is that cp -a -L copies all in /usr/src/linux-5.3.18-59.37
and all in /usr/src/linux-5.3.18-59.37-obj/x86_64/default
which contain in my case 17549 and 9944 files that are no kernel modules
that needlessly increase the recovery system size by 123M (uncompressed)

# du -hs /usr/src/linux-5.3.18-59.37 /usr/src/linux-5.3.18-59.37-obj/x86_64/default
107M    /usr/src/linux-5.3.18-59.37
16M     /usr/src/linux-5.3.18-59.37-obj/x86_64/default

which increases the recovery system initrd size by about 16M (compressed).
But this is a separated problem.

In general:
ReaR is not there to silently ignore bugs in the original system.
ReaR is also not there to search for bugs in the original system.
But when ReaR is hit by a bug it must not silently ignore it,
cf. the section "Try hard to care about possible errors" in
https://github.com/rear/rear/wiki/Coding-Style

The question is if ReaR really needs to error out when
cp -a -L fails for MODULES = ( 'all_modules' )
or if only a LogPrintError user info message
could be reasonably sufficient in this case?

jsmeix commented at 2022-01-17 10:18:

With
https://github.com/rear/rear/commit/e929ae63c65ef604c678322d92ca13ff99675b63
we do no longer error out if 'cp -a -L' failed to to copy all contents of /lib/modules/...
but we tell the user about the issue so he could inspect his system and decide.

At least for now this issue should be reasonably mitigated.
Perhaps a real solution could be developed later.

The actual root cause is a broken original system (i.e. broken symlink).

hpannenb commented at 2022-01-17 10:27:

BTW, does not 490_fix_broken_links.sh take care about those broken links anyway?

jsmeix commented at 2022-01-17 10:34:

build/default/490_fix_broken_links.sh cares about broken symlinks inside the recovery system
but 'cp -L path/to/broken_symlink' won't copy the broken symlink into the recovery system
and build/default/490_fix_broken_links.sh can only fix broken symlinks in the recovery system
where the matching symlink on the original system is not broken (i.e. has an actual target).
Only the user can fix a broken symlink on his original system:
Either remove the broken symlink if is is not needed
or provide the right content where the dangling symlink points to.

jsmeix commented at 2022-01-17 12:34:

@hpannenb
or do you mean that we should in general not specfically care about symlinks
in the individual scripts i.e. that in general we just copy symlinks as symlinks
and leave it completely to build/default/490_fix_broken_links.sh
to fix the broken symlinks inside the recovery system?

On first glance this looks wrong because each code part should do its task right
but on a second glance fix only actually broken symlinks in a generic script at the end
could have general advantages.

Example:
Assume in the original system there are

/path/to/huge_file
/path1/to/symlink1 -> /path/to/huge_file
/path2/to/symlink2 -> /path/to/huge_file

then 'cp -L' results that in the recovery system
both /path1/to/symlink1 and /path2/to/symlink2 are regular files
with identical contents as /path/to/huge_file
cf. https://github.com/rear/rear/issues/2739#issuecomment-1012949307
so 'cp -L' can result needlessly duplicated content
in particular when /path/to/huge_file gets included anyway.

Another reason is that in practice it gets rather complicated
to deal correctly with symlinks (e.g. this issue here) at each code place
so it could be better in practice to not care at each code place
and only fix what actually needs to be fixed at the end.

So it could be better in practice to copy symlinks as symlinks
and only add actually missing symlink targets at the end.

hpannenb commented at 2022-01-17 12:48:

@hpannenb or do you mean that we should in general not specfically care about symlinks in the individual scripts i.e. that in general we just copy symlinks as symlinks and leave it completely to build/default/490_fix_broken_links.sh to fix the broken symlinks inside the recovery system?

In principle that is what I meant.

So it could be better in practice to copy symlinks as symlinks
and only add actually missing symlink targets at the end.

Yes.

jsmeix commented at 2022-01-17 13:08:

Whether or not to
copy symlinks as symlinks and add missing symlink targets at the end
needs to be discussed via https://github.com/rear/rear/issues/2740


[Export of Github issue for rear/rear.]