#2615 Issue closed: Regression: Opal PBA shuts down because of incomplete kernel modules related to MODULES=( 'loaded_modules' ) on Ubuntu 20.04.2

Labels: support / question, fixed / solved / done

tolga9009 opened issue at 2021-05-12 05:25:

Relax-and-Recover (ReaR) Issue Template

Fill in the following items before submitting a new issue
(quick response is not guaranteed with free support):

  • ReaR version ("/usr/sbin/rear -V"):
    master branch c6500f9c42f1e68a5aaf36d556b144cbb8e69369

  • OS version ("cat /etc/os-release" or "lsb_release -a" or "cat /etc/rear/os.conf"):
    Ubuntu 20.04.2 LTS

  • ReaR configuration files ("cat /etc/rear/site.conf" and/or "cat /etc/rear/local.conf"):
    /etc/rear/local.conf:

OUTPUT=RAWDISK
OUTPUT_URL="file:///var/lib/rear/output"
SECURE_BOOT_BOOTLOADER="/boot/efi/EFI/ubuntu/shimx64.efi"
  • Hardware (PC or PowerNV BareMetal or ARM) or virtual machine (KVM guest or PoverVM LPAR):
    Dell M3800 (Intel Core i7-4702HQ, Nvidia Quadro K1100M)

  • System architecture (x86 compatible or PPC64/PPC64LE or what exact ARM device):
    x86

  • Firmware (BIOS or UEFI or Open Firmware) and bootloader (GRUB or ELILO or Petitboot):
    UEFI, GRUB

  • Storage (local disk or SSD) and/or SAN (FC or iSCSI or FCoE) and/or multipath (DM or NVMe):
    Samsung SSD 860 EVO (SATA)

  • Description of the issue (ideally so that others can reproduce it):

  • Fresh, up-to-date Ubuntu 20.04.2 LTS install

  • install build-essential, devscripts,...
  • git clone rear, make deb, install it
  • sudo rear mkopalpba, copy to USB stick
  • reboot PC
  • boot screen only shows Dell Logo (nothing else)
  • after approx. 15 seconds "Shutting down..." and Ubuntu Logo appears on the lower part of the screen.

  • Workaround, if any:
    Issue happens in c6500f9c42f1e68a5aaf36d556b144cbb8e69369, but git checkout rear-2.6 works fine. So, downgrading fixes it. Seems to be a regression, caused somewhere between 10e049b76a4e7a19c90d34c65bd9ab8e05dd3083 and c6500f9c42f1e68a5aaf36d556b144cbb8e69369. If you have any ideas which commits might have caused it, please let me know. I can test it.

Cheers,
Tolga

jsmeix commented at 2021-05-12 06:47:

@tolga9009
usually I use in a git clone https://github.com/rear/rear.git directory
a command like

git log --format="%ae %H %ad%n%s :%n%b%n" --graph | fmt -w 120 -u -t | less

to get an overview of the commits.

In this case looking for Opal PBA related commit messages
point to the following issues here:

https://github.com/rear/rear/pull/2538

https://github.com/rear/rear/pull/2507 and
https://github.com/rear/rear/issues/2474

https://github.com/rear/rear/pull/2488 and
https://github.com/rear/rear/issues/2475

https://github.com/rear/rear/pull/2455 and
https://github.com/rear/rear/pull/2426 and
https://github.com/rear/rear/issues/2425
I guess this one could be most related because it is about
OPALPBA: Reboot after unlocking self-encrpyting disks may hang on some UEFI systems

https://github.com/rear/rear/pull/2448 and
https://github.com/rear/rear/issues/2436

OliverO2 commented at 2021-05-12 10:58:

@tolga9009 Thank you for the detailed issue description and the research you have done.

It looks like something is going wrong in the PBA's startup script usr/share/rear/skel/default/etc/scripts/unlock-opal-disks. It seems to terminate before asking you for a password. If have already looked at the commits since the ReaR 2.6 release and found nothing obvious.

So I'd ask you to use the latest master branch version of ReaR and turn on the lights for the PBA. I have prepared a small patch file: PBA-shutdown-debug.patch.txt

Please try:

  1. Check out the ReaR master branch c6500f9
  2. Copy PBA-shutdown-debug.patch.txt into the rear project directory.
  3. If necessary, run apt-get install patch
  4. In the rear project directory, run patch -p1 <PBA-shutdown-debug.patch.txt
  5. Create the PBA.
  6. Boot with the PBA.

You should now see a system booting in text mode, giving us details about the boot process. Finally, the system should display Entering emergency shell... where it would previously say Shutting down.... I would also expect an error message indicating the cause of the problem.

tolga9009 commented at 2021-05-12 12:32:

Thank you for the fast replies and the patch!

I started with Oliver's suggestion, as I already had everything set up. For some reason, after booting I was greeted with a black screen. Boot process was like this:

  1. Push Power Button
  2. Dell Logo
  3. Press F12 for Boot Menu, pick PBA USB Boot device
  4. Plain black screen, no text, no boot log

I was able to reboot by pressing Ctrl+Alt+Del, so the system was still responsive. Came up with the idea, that maybe the fonts were invisible and blindly typed "poweroff". And bingo: 3 seconds later, my PC powered off. So, additionally to the original bug, I seem to have invisible fonts now. Didn't happen with rear-2.6, I had boot log there, so this is a new bug aswell. Maybe the missing / invisible font is causing my shutdown issue (e.g. password prompt silently fails due to not beeing able to find font)?

//Update: I was able to unlock disks and enter functional shell with commit 055e3a1074df63d8981d37cb4cd1cee1e4a3f62d, so no problem there. I will try to bisect.

OliverO2 commented at 2021-05-12 12:58:

Thanks for the quick reply. If the screen stays completely blank, it could be that the kernel could not initialize your Nvidia graphics card properly. Maybe some required kernel module or configuration file is missing in the PBA.

Could you add

nomodeset i915.modeset=0 nouveau.modeset=0

to the KERNEL_CMDLINE setting in usr/share/rear/prep/OPALPBA/Linux-i386/001_configure_workflow.sh, recreate the PBA and try again?

If that doesn't help, maybe you could configure your firmware a.k.a. BIOS to disable the Nvidia graphics card on boot?

tolga9009 commented at 2021-05-12 13:04:

I can try it out, but I don't think that's the case. When booting with "quiet splash", I can see UEFI BGRT, Ubuntu Logo and the "Shutting down..." message. I think if this was a GPU issue, I should have a black screen instead, right?

jsmeix commented at 2021-05-12 13:11:

@tolga9009
only a blind guess (I don't have anything like your hardware):
I think it is not the font (because I assume things happen on 'console' where no special font is used).
I think it is that no output at all appears on your screen - for whatever reason.
I.e. the usual 'stdout' of normal programs does not appear on your screen.
I mean programs output normally on their 'stdout' but that gets not shown on your screen.
So as also @OliverO2 thinks the root cause is likely some low level issue with screen output
in general like Linux kernel graphics drivers or UEFI firmware stuff or whatnot.
In contrast input (i.e. the usual 'stdin' of normal programs) comes from your keyboard
which has nothing to do whether or not something is shown on your screen.

OliverO2 commented at 2021-05-12 13:13:

@tolga9009 If the Ubuntu logo appears, your graphics should be operational. The black screen might be caused by Ubuntu's vt.handoff, which switches to a blank virtual console at boot and might now interfere with the other changes. To disable it, comment out the line

    OPAL_PBA_KERNEL_CMDLINE+=" $vt_handoff"

in usr/share/rear/conf/Ubuntu.conf, then recreate the PBA. Hope this helps.

tolga9009 commented at 2021-05-12 13:24:

Thank you! I had console output now. There was an error message:
Failed to find module 'autofs4'

I also bisected until e7338e54426493d48b626f297f4d301fb759d10f. That was the last OPAL related commit. No problems so far, I will bisect further. I will now look for commits related to autofs4.

jsmeix commented at 2021-05-12 13:37:

@OliverO2
thank you so much for your https://github.com/rear/rear/issues/2615#issuecomment-839762211

Incredible what Ubuntu does:
https://help.ubuntu.com/community/vt.handoff
reads (excerpt)

vt.handoff (vt = virtualterminal) is a kernel boot parameter unique to Ubuntu,
and is not an upstream kernel boot parameter.

why do they think it benefits their users with such crazy deviations from upstream?

tolga9009 commented at 2021-05-12 13:47:

@jsmeix I think the title isn't quite fitting. I have display output in the default state. It shows the Dell and Ubuntu Logo, but continues to shutdown, without asking me for a password. I'm unable to unlock my Opal drives.

The black screen was caused by disabling quiet / splash for debugging. But I prefer the PBA booting silent and with splash for deployment.

I bisected further. The regression happened somewhere between 3058973cc4b4ba204b0cf17cc48bb9721d9bc9e1 and c38e61db066196c90e6118cab8887b76df58b20a.

jsmeix commented at 2021-05-12 13:52:

@tolga9009
don't worry - I adapt the title as needed and as things move forward.
I am not a Ubuntu user so I don't know about their special things.

OliverO2 commented at 2021-05-12 13:54:

Maybe it is the changes in usr/share/rear/build/GNU/Linux/400_copy_modules.sh by d2588e8 and 6a0013a.

jsmeix commented at 2021-05-12 14:00:

@OliverO2
those MODULES related changes should have no effect
when the default MODULES=( 'all_modules' ) is used
for "rear mkrescue/mkbackup".
But perhaps things are different in case of Opal PBA?

OliverO2 commented at 2021-05-12 14:02:

The PBA uses MODULES=( 'loaded_modules' ).

jsmeix commented at 2021-05-12 14:03:

Yes, I see it right now
in prep/OPALPBA/Linux-i386/001_configure_workflow.sh

tolga9009 commented at 2021-05-12 14:04:

The regression starts to happen exactly at 54f6fea96f1f22e5c9506fe55e1cfc626e541d59. I've quickly read through the code and it doesn't make sense to me. I will double check.

//Edit: double checked. It's the commit above what causes the issue. cf0c39d9d3dd7c40b53f2a71fc9d9516022ab546 works as expected.

jsmeix commented at 2021-05-12 14:09:

The default behaviour of
https://github.com/rear/rear/commit/54f6fea96f1f22e5c9506fe55e1cfc626e541d59
is

DISKS_TO_BE_WIPED='false'
...
is_false "$DISKS_TO_BE_WIPED" && return 0
...
is_false "$DISKS_TO_BE_WIPED" && return 0

so nothing should happen.

OliverO2 commented at 2021-05-12 14:14:

cf0c39d9 is on a merged branch, which is based on a much older version of master.

jsmeix commented at 2021-05-12 14:25:

As far as I see cf0c39d9d3dd7c40b53f2a71fc9d9516022ab546
was branched after 8f09ede7d617290e948dd779539d4da385d454e3
so the root cause should be
between 8f09ede7d617290e948dd779539d4da385d454e3
and 54f6fea96f1f22e5c9506fe55e1cfc626e541d59

jsmeix commented at 2021-05-12 14:31:

The only Opal PBA related commit message in
git log --format="%ae %H %ad%n%s :%n%b%n" --graph
between 54f6fea96f1f22e5c9506fe55e1cfc626e541d59
and 8f09ede7d617290e948dd779539d4da385d454e3
belongs to https://github.com/rear/rear/pull/2538
i.e. 6466012947e933ca3cb821cba225517fff2d961e

tolga9009 commented at 2021-05-12 14:34:

Sorry for the confusion.

I have checked out at c6500f9c42f1e68a5aaf36d556b144cbb8e69369, applied Oliver's patch from https://github.com/rear/rear/issues/2615#issuecomment-839678939 and got a more verbose output now. I will try to find a way to get the full log, but for now:

Could not detect TCG Opal 2-compliant disks.

Entering emergency shell...
[...]
The following error occured when executing /etc/scripts/unlock-opal-disks:
modprobe: FATAL: Module mac_hid not found in directory /lib/modules/5.8.0-50-generic
modprobe: FATAL: Module usbhid not found in directory /lib/modules/5.8.0-50-generic
modprobe: FATAL: Module hid not found in directory /lib/modules/5.8.0-50-generic
modprobe: FATAL: Module usb_storage not found in directory /lib/modules/5.8.0-50-generic
Could not detect TCG Opal 2-compliant disks.

jsmeix commented at 2021-05-12 14:41:

By the way:
Today is logical Friday for me
because tomorrow is public holiday in Germany
https://en.wikipedia.org/wiki/Feast_of_the_Ascension
and the day after tomorrow is vacation for me
so I wish you all a relaxed and recovering weekend!

tolga9009 commented at 2021-05-12 14:47:

I'm from Germany aswell, wish you a great holiday and good weekend ;)!

//Edit: I checked out at 6466012947e933ca3cb821cba225517fff2d961e, it's working.

OliverO2 commented at 2021-05-12 15:18:

Have a great extended weekend, too! I'll be watching this space even then and I'm somewhat curious about what the cause might be. c38e61d (reported to work correctly here) is on a branch based on master 4b43f439. So the cause must be after that.

I'd still try reverting the changes from d2588e8 and 6a0013a first as these commits seem to be the ones most likely influencing the situation. The problem is diagnosed by Could not detect TCG Opal 2-compliant disks which means that sedutil-cli could not detect compatible devices – possibly due to a missing kernel module. From the emergency shell of the above PBA, you could try:

sedutil-cli --scan

The normal output should be something like this:

# sedutil-cli --scan
Scanning for Opal compliant disks
/dev/sda    2  Samsung SSD 860 PRO 256GB                RVM01B6Q
No more disks present ending scan

tolga9009 commented at 2021-05-12 15:28:

Alright, I was able to find the culprit. Seems to be 4480cc00ad0b9a46bbde6f56ca392051b134f949. Just to double check: 9a6b9a109aa77afc6c96cf05bbd7988cf0310d61 works fine, 4480cc00ad0b9a46bbde6f56ca392051b134f949 does not.

I will try to revert it now, as you suggested.

//Edit: I've checked the command above

# sedutil-cli --scan
Scanning for Opal compliant disks
No more disks present ending scan

OliverO2 commented at 2021-05-12 15:37:

4480cc0 is a merge commit, merging just d2588e8c into master. So d2588e8c must be the real cause.

tolga9009 commented at 2021-05-12 15:46:

Yes! Thank you very much! Checked out at c6500f9c42f1e68a5aaf36d556b144cbb8e69369, reverted d2588e8c05aebfcdb3ac5cf1be7732aac2d78eb2 and everything works again!

OliverO2 commented at 2021-05-12 15:51:

Thank you for reporting and all of your research. Enjoy your PBA, finally! The issue will affect more parts of ReaR than just the PBA. It seems like the configuration option MODULES=( 'loaded_modules' ) no longer works as expected.

@jsmeix Could you look at the changes from d2588e8 again as time permits?

OliverO2 commented at 2021-05-13 13:08:

@tolga9009 @jsmeix Could you please adjust the title again so that it matches what we know now?

Something like this would be more precise and help people find answers:

[Regression] MODULES=( 'loaded_modules' ) is broken, making Opal PBA shut down without password prompt

Take-aways:

  • The issue is not Dell- or Nvidia-specific.

  • Ubuntu's vt.handoff setting is working as expected. In ReaR, it is only used in the PBA to achieve a purely graphical boot without intermediate switching to text mode. It did only affect this issue as we needed to see early kernel messages before the immediate shutdown happens. Under normal circumstances, there would be enough time for pressing ESC to switch from the graphical (logo) screen to the text console with boot messages.

  • Note that PBA intentionally chooses to shutdown immediately in case of non-recoverable problems. This avoids unauthenticated root access to unattended systems. (Yes, the disk is locked, but you could still change firmware settings...)

jsmeix commented at 2021-05-18 12:40:

I cannot reproduce it on my openSUSE Leap 15.2 system
with "rear mkrescue" and MODULES=( 'loaded_modules' )
(I am not a Opal PBA user).

But it should not make a real difference compared to "rear mkopalpba"
because both "rear -s mkrescue" and "rear -s mkopalpba" show
that build/GNU/Linux/400_copy_modules.sh is sourced which is
the script that copies kernel modules into the ReaR recovery system.

I get exactly the <module_name>.ko files in the recovery system
(in /tmp/rear.XXXX/rootfs/lib/modules/5.3.18-lp152.75-default/kernel/...)
that match what lsmod shows.

The tricky part how to verify it is that lsmod shows module names only with _
while the kernel module files can contain both _ and -
e.g. lsmod shows aes_x86_64 but its module file is
lib/modules/5.3.18-lp152.75-default/kernel/arch/x86/crypto/aes-x86_64.ko

jsmeix commented at 2021-05-18 12:44:

@tolga9009
how you could check what you get in your recovery system with your use case:

See the KEEP_BUILD_DIR description in usr/share/rear/conf/default.conf
When you run it in debug mode KEEP_BUILD_DIR=1 is set in usr/sbin/rear
I use basically always debugscript mode via rear -D ...
to get meaningful info in the log for debugging.

With current ReaR GitHub master code
you get with MODULES=( 'loaded_modules' )
this line output on the terminal (when you run it at least in verbose mode):

Copying only currently loaded kernel modules (MODULES contains 'loaded_modules') and those in MODULES_LOAD

In the log check what happens after Copying only currently loaded kernel modules appears there
up to the lines starting with loaded_modules_files='...
that show which module files have been found.

jsmeix commented at 2021-05-19 10:59:

@tolga9009 @OliverO2
because I am not a Opal PBA user (so I can only test "rear mkrescue"):

I would like to ask if you could verify (as time permits)
that "rear mkopalpba" really does not make a difference
compared to "rear mkrescue" with MODULES=( 'loaded_modules' )
regarding what kernel modules get included in the ReaR recovery system.

@OliverO2
or could I myself somehow run "rear mkopalpba" as a test
as non Opal PBA user (I don't have a self-encrypting device).
If yes what etc/rear/local.conf would be right for such a test?

OliverO2 commented at 2021-05-19 12:13:

@jsmeix Well, you can test an Opal PBA on a machine without self-encrypting devices by setting OPAL_PBA_DEBUG_DEVICE_COUNT=1 to simulate one such device. However, the binary sedutil-cli must be available in the PATH to build and test the Opal-related code.

As the issue apparently did not reproduce on your test installation, I have this on my list and will try to reproduce it here. I should be able to do so, as I have a system configuration available roughly matching Tolga's one. I'll try to test it, hopefully no later than the end of next week.

jsmeix commented at 2021-05-19 12:40:

I could run rear -D mkopalpba on my openSUSE Leap 15.2 system
with only this etc/rear/locl.conf (only this single line):

OUTPUT=RAWDISK

plus

# ln -s /usr/bin/true /usr/bin/sedutil-cli

but without setting OPAL_PBA_DEBUG_DEVICE_COUNT=1
(I did it before https://github.com/rear/rear/issues/2615#issuecomment-844044274).

I got all loaded modules in the ReaR recovery system
except kvm and kvm_intel which is expected because
prep/OPALPBA/Linux-i386/001_configure_workflow.sh
contains

local exclude_modules='kvm.*|nvidia.*|vbox.*'

So the issue is something specific on Ubuntu.

jsmeix commented at 2021-05-19 12:44:

By the way:
I did some rear -D mkopalpba tests and for the first one
those modules were missing in the ReaR recovery system:

fat
kvm
kvm_intel
loop
nls_cp437
nls_iso8859_1
vfat

Apparently they were loaded while rear -D mkopalpba was running
but after build/GNU/Linux/400_copy_modules.sh was run because
I compared the lsmod output after rear -D mkopalpba was run
with what modules there are in the ReaR recovery system.
Subsequent tests of rear -D mkopalpba where those modules were already loaded
copied them also into the ReaR recovery system (except kvm and kvm_intel).

jsmeix commented at 2021-05-19 12:53:

Regarding the modules listed in
https://github.com/rear/rear/issues/2615#issuecomment-839823349

I have those loaded and in the ReaR recovery system:

hid_generic
usbcore
usbhid

OliverO2 commented at 2021-05-28 11:56:

I have tried to reproduce the issue on Ubuntu 20.04.2 LTS Desktop with rear c6500f9c (2021-05-11) vs. e7338e54 (2020-12-20). It did not happen in either configuration. rear mkopalpba generated root file systems with an identical set of modules for both versions, one of which including commit d2588e8, the other excluding it. I was able to successfully boot a fully working PBA created by ReaR c6500f9c (including commit d2588e8).

So the current state of affairs is:

  • We have the observation that under specific conditions, a ReaR image is created, which contains an incomplete kernel configuration, in this case affecting the PBA.
  • We do not know the specific conditions so far, in particular:
    • We cannot say that it is just d2588e8 causing the problem.
    • We cannot say that the issue is related to Ubuntu 20.04.
    • We cannot say that the issue affects the PBA (or RAWDISK output) only.

I have created a small test script which checks for modules in the same way commit d2588e8 does. A reference output for Ubuntu 20.04.2 is at https://pastebin.com/P6kGSqbz.

@tolga9009 maybe you could try this on your system to avoid the issue reappearing with a new ReaR release:

#!/bin/bash

KERNEL_VERSION="$( uname -r )"

function Error() {
    echo "$*" >&2
}

function modinfo_filename () {
    local module_name=$1
    local module_filename=""
    local alias_module_name=( $( modprobe -n -R $module_name 2>/dev/null ) )
    test $alias_module_name && module_name=$alias_module_name
    module_filename="$( modinfo -k $KERNEL_VERSION -F filename $module_name )"
    if ! test "$module_filename" ; then
        test "$KERNEL_VERSION" = "$( uname -r )" || Error "modinfo_filename failed because KERNEL_VERSION does not match 'uname -r'"
        module_filename="$( modinfo -F filename $module_name )"
    fi
    grep -q '(builtin)' <<<"$module_filename" && echo '' || readlink -e $module_filename
}

loaded_modules+=" $( lsmod | tail -n +2 | cut -d ' ' -f 1 )"
loaded_modules_files="$( for loaded_module in $loaded_modules ; do modinfo_filename $loaded_module || Error "$loaded_module loaded or to be loaded but no module file?" ; done | sort -u )"

printf "%s\n" "$loaded_modules_files"

Is the output relevantly different from that of https://pastebin.com/P6kGSqbz?

jsmeix commented at 2021-05-28 12:59:

@OliverO2
WOW - thank you for your thorough and exhaustive analysis!

jsmeix commented at 2021-05-28 13:19:

@OliverO2 @tolga9009
I wish you a relaxed and recovering weekend!

OliverO2 commented at 2021-05-28 13:59:

Thank you! Have an excellent weekend, too!

github-actions commented at 2021-07-28 02:14:

Stale issue message

nolocimes commented at 2021-12-17 22:06:

I ran into this same issue on a new ubuntu 20.04.3 build.

This issue is related to the following line in 400_copy_modules.sh:
grep -q '(builtin)' <<<"$module_filename" && echo '' || readlink -e $module_filename

On my system, readlink -e $module_filename resolves to /usr/lib/XXXX instead of /lib/XXXX
The modules are subsequently copied to: /var/tmp/rear.XXXX/rootfs/usr/lib/modules/XXXX/kernel
as opposed to: /var/tmp/rear.XXXX/rootfs/lib/modules/XXXX/kernel

My temporary workaround was to replace 'readlink -e' with 'echo'.

jsmeix commented at 2021-12-20 11:19:

@nolocimes
thank you for your report and what you did to fix it!
Therefore it seems this issue is same as
https://github.com/rear/rear/issues/2677
I will continue there.

jsmeix commented at 2021-12-22 13:43:

I assume also this one is fixed by https://github.com/rear/rear/pull/2731

752377941 commented at 2022-05-27 07:08:

Sorry for the confusion.

I have checked out at c6500f9, applied Oliver's patch from #2615 (comment) and got a more verbose output now. I will try to find a way to get the full log, but for now:

Could not detect TCG Opal 2-compliant disks.

Entering emergency shell...
[...]
The following error occured when executing /etc/scripts/unlock-opal-disks:
modprobe: FATAL: Module mac_hid not found in directory /lib/modules/5.8.0-50-generic
modprobe: FATAL: Module usbhid not found in directory /lib/modules/5.8.0-50-generic
modprobe: FATAL: Module hid not found in directory /lib/modules/5.8.0-50-generic
modprobe: FATAL: Module usb_storage not found in directory /lib/modules/5.8.0-50-generic
Could not detect TCG Opal 2-compliant disks.

Could you tell me how you solve this problem?


[Export of Github issue for rear/rear.]