#2615 Issue closed: Regression: Opal PBA shuts down because of incomplete kernel modules related to MODULES=( 'loaded_modules' ) on Ubuntu 20.04.2
Labels: support / question, fixed / solved / done
tolga9009 opened issue at 2021-05-12 05:25:
Relax-and-Recover (ReaR) Issue Template
Fill in the following items before submitting a new issue
(quick response is not guaranteed with free support):
- ReaR version ("/usr/sbin/rear -V"):
  master branch c6500f9c42f1e68a5aaf36d556b144cbb8e69369
- OS version ("cat /etc/os-release" or "lsb_release -a" or "cat /etc/rear/os.conf"):
  Ubuntu 20.04.2 LTS
- ReaR configuration files ("cat /etc/rear/site.conf" and/or "cat /etc/rear/local.conf"):
  /etc/rear/local.conf:
  OUTPUT=RAWDISK
  OUTPUT_URL="file:///var/lib/rear/output"
  SECURE_BOOT_BOOTLOADER="/boot/efi/EFI/ubuntu/shimx64.efi"
- Hardware (PC or PowerNV BareMetal or ARM) or virtual machine (KVM guest or PowerVM LPAR):
  Dell M3800 (Intel Core i7-4702HQ, Nvidia Quadro K1100M)
- System architecture (x86 compatible or PPC64/PPC64LE or what exact ARM device):
  x86
- Firmware (BIOS or UEFI or Open Firmware) and bootloader (GRUB or ELILO or Petitboot):
  UEFI, GRUB
- Storage (local disk or SSD) and/or SAN (FC or iSCSI or FCoE) and/or multipath (DM or NVMe):
  Samsung SSD 860 EVO (SATA)
- Description of the issue (ideally so that others can reproduce it):
  - Fresh, up-to-date Ubuntu 20.04.2 LTS install
  - install build-essential, devscripts,...
  - git clone rear, make deb, install it
  - sudo rear mkopalpba, copy to USB stick
  - reboot PC
  - boot screen only shows Dell Logo (nothing else)
  - after approx. 15 seconds "Shutting down..." and Ubuntu Logo appears on the lower part of the screen.
- Workaround, if any:
  Issue happens in c6500f9c42f1e68a5aaf36d556b144cbb8e69369, but git checkout rear-2.6 works fine. So, downgrading fixes it. Seems to be a regression, caused somewhere between 10e049b76a4e7a19c90d34c65bd9ab8e05dd3083 and c6500f9c42f1e68a5aaf36d556b144cbb8e69369. If you have any ideas which commits might have caused it, please let me know. I can test it.
Cheers,
Tolga
jsmeix commented at 2021-05-12 06:47:
@tolga9009
usually, in a git clone https://github.com/rear/rear.git directory, I use a command like
git log --format="%ae %H %ad%n%s :%n%b%n" --graph | fmt -w 120 -u -t | less
to get an overview of the commits.
In this case, looking for Opal PBA related commit messages
points to the following issues here:
https://github.com/rear/rear/pull/2538
https://github.com/rear/rear/pull/2507
and
https://github.com/rear/rear/issues/2474
https://github.com/rear/rear/pull/2488
and
https://github.com/rear/rear/issues/2475
https://github.com/rear/rear/pull/2455
and
https://github.com/rear/rear/pull/2426
and
https://github.com/rear/rear/issues/2425
I guess this one could be most related because it is about
OPALPBA: Reboot after unlocking self-encrypting disks may hang on some UEFI systems
https://github.com/rear/rear/pull/2448
and
https://github.com/rear/rear/issues/2436
OliverO2 commented at 2021-05-12 10:58:
@tolga9009 Thank you for the detailed issue description and the research you have done.
It looks like something is going wrong in the PBA's startup script
usr/share/rear/skel/default/etc/scripts/unlock-opal-disks. It seems to
terminate before asking you for a password. I have already looked at
the commits since the ReaR 2.6 release and found nothing obvious.
So I'd ask you to use the latest master branch version of ReaR and turn on the lights for the PBA. I have prepared a small patch file: PBA-shutdown-debug.patch.txt
Please try:
- Check out the ReaR master branch c6500f9
- Copy PBA-shutdown-debug.patch.txt into the rear project directory.
- If necessary, run apt-get install patch
- In the rear project directory, run patch -p1 <PBA-shutdown-debug.patch.txt
- Create the PBA.
- Boot with the PBA.
You should now see a system booting in text mode, giving us details
about the boot process. Finally, the system should display
Entering emergency shell...
where it would previously say
Shutting down...
I would also expect an error message indicating the
cause of the problem.
tolga9009 commented at 2021-05-12 12:32:
Thank you for the fast replies and the patch!
I started with Oliver's suggestion, as I already had everything set up. For some reason, after booting I was greeted with a black screen. Boot process was like this:
- Push Power Button
- Dell Logo
- Press F12 for Boot Menu, pick PBA USB Boot device
- Plain black screen, no text, no boot log
I was able to reboot by pressing Ctrl+Alt+Del, so the system was still
responsive. I came up with the idea that maybe the fonts were invisible
and blindly typed "poweroff". And bingo: 3 seconds later, my PC powered
off. So, in addition to the original bug, I seem to have invisible
fonts now. That didn't happen with rear-2.6, I had a boot log there, so this
is a new bug as well. Maybe the missing / invisible font is causing my
shutdown issue (e.g. password prompt silently fails due to not being
able to find the font)?
//Update: I was able to unlock disks and enter a functional shell with commit 055e3a1074df63d8981d37cb4cd1cee1e4a3f62d, so no problem there. I will try to bisect.
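The manual search for the first bad commit can also be driven by git bisect, which walks the commit graph (including merged side branches) instead of relying on commit dates. A minimal sketch, assuming it is run inside the rear working tree and that rear-2.6 (the ref from the workaround above) is known-good; the helper name start_rear_bisect is hypothetical:

```shell
#!/bin/bash
# Start a bisect session between a known-good and a known-bad revision.
# Must be run inside the rear git working tree.
function start_rear_bisect () {
    local good_rev="$1"
    local bad_rev="$2"
    # "git bisect start" takes the bad revision first, then the good one(s).
    git bisect start "$bad_rev" "$good_rev"
}

# Usage with the revisions named in this issue:
#   start_rear_bisect rear-2.6 c6500f9c42f1e68a5aaf36d556b144cbb8e69369
# For every revision git then checks out: build the PBA, boot it,
# and report the result with "git bisect good" or "git bisect bad".
# "git bisect reset" returns to the original checkout when done.
```

Each step still needs a manual PBA build and boot, so this only saves the bookkeeping of which commit to try next.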
OliverO2 commented at 2021-05-12 12:58:
Thanks for the quick reply. If the screen stays completely blank, it could be that the kernel could not initialize your Nvidia graphics card properly. Maybe some required kernel module or configuration file is missing in the PBA.
Could you add
nomodeset i915.modeset=0 nouveau.modeset=0
to the KERNEL_CMDLINE setting in
usr/share/rear/prep/OPALPBA/Linux-i386/001_configure_workflow.sh,
recreate the PBA and try again?
If that doesn't help, maybe you could configure your firmware a.k.a. BIOS to disable the Nvidia graphics card on boot?
tolga9009 commented at 2021-05-12 13:04:
I can try it out, but I don't think that's the case. When booting with "quiet splash", I can see UEFI BGRT, Ubuntu Logo and the "Shutting down..." message. I think if this was a GPU issue, I should have a black screen instead, right?
jsmeix commented at 2021-05-12 13:11:
@tolga9009
only a blind guess (I don't have anything like your hardware):
I think it is not the font (because I assume things happen on 'console'
where no special font is used).
I think it is that no output at all appears on your screen, for
whatever reason.
I.e. programs write normally to their 'stdout' but that output is not
shown on your screen.
So, as @OliverO2 also thinks, the root cause is likely some low-level
issue with screen output in general, like Linux kernel graphics drivers
or UEFI firmware stuff or whatnot.
In contrast, input (i.e. the usual 'stdin' of normal programs) comes from
your keyboard, which is independent of whether or not something is shown
on your screen.
OliverO2 commented at 2021-05-12 13:13:
@tolga9009 If the Ubuntu logo appears, your graphics should be
operational. The black screen might be caused by Ubuntu's vt.handoff,
which switches to a blank virtual console at boot and might now
interfere with the other changes. To disable it, comment out the line
OPAL_PBA_KERNEL_CMDLINE+=" $vt_handoff"
in usr/share/rear/conf/Ubuntu.conf, then recreate the PBA. Hope this
helps.
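A minimal sketch of the edit described above, assuming the line appears in Ubuntu.conf exactly as quoted (possibly indented). The helper name disable_vt_handoff is hypothetical, and trying it on a copy of the file first is the safer route:

```shell
#!/bin/bash
# Comment out the line that appends $vt_handoff to the PBA kernel command line.
# The file path is passed as an argument so the function can be tried on a copy.
function disable_vt_handoff () {
    local conf_file="$1"
    # In a basic regular expression '+' is literal; '\$' matches a literal dollar sign.
    sed -i -e '/^[[:space:]]*OPAL_PBA_KERNEL_CMDLINE+=" \$vt_handoff"/ s/^/#/' "$conf_file"
}

# Usage, from the rear project directory:
#   disable_vt_handoff usr/share/rear/conf/Ubuntu.conf
```

The PBA has to be recreated afterwards for the change to take effect.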
tolga9009 commented at 2021-05-12 13:24:
Thank you! I had console output now. There was an error message:
Failed to find module 'autofs4'
I also bisected until e7338e54426493d48b626f297f4d301fb759d10f. That was
the last OPAL related commit. No problems so far, I will bisect further.
I will now look for commits related to autofs4.
jsmeix commented at 2021-05-12 13:37:
@OliverO2
thank you so much for your
https://github.com/rear/rear/issues/2615#issuecomment-839762211
Incredible what Ubuntu does:
https://help.ubuntu.com/community/vt.handoff
reads (excerpt)
vt.handoff (vt = virtualterminal) is a kernel boot parameter unique to Ubuntu,
and is not an upstream kernel boot parameter.
why do they think it benefits their users with such crazy deviations from upstream?
tolga9009 commented at 2021-05-12 13:47:
@jsmeix I think the title isn't quite fitting. I have display output in the default state. It shows the Dell and Ubuntu Logo, but then proceeds to shut down without asking me for a password. I'm unable to unlock my Opal drives.
The black screen was caused by disabling quiet / splash for debugging. But I prefer the PBA booting silent and with splash for deployment.
I bisected further. The regression happened somewhere between 3058973cc4b4ba204b0cf17cc48bb9721d9bc9e1 and c38e61db066196c90e6118cab8887b76df58b20a.
jsmeix commented at 2021-05-12 13:52:
@tolga9009
don't worry - I adapt the title as needed and as things move forward.
I am not an Ubuntu user so I don't know about their special things.
OliverO2 commented at 2021-05-12 13:54:
Maybe it is the changes in
usr/share/rear/build/GNU/Linux/400_copy_modules.sh
by d2588e8 and 6a0013a.
jsmeix commented at 2021-05-12 14:00:
@OliverO2
those MODULES related changes should have no effect
when the default MODULES=( 'all_modules' ) is used
for "rear mkrescue/mkbackup".
But perhaps things are different in case of Opal PBA?
OliverO2 commented at 2021-05-12 14:02:
The PBA uses MODULES=( 'loaded_modules' ).
jsmeix commented at 2021-05-12 14:03:
Yes, I see it right now
in prep/OPALPBA/Linux-i386/001_configure_workflow.sh
tolga9009 commented at 2021-05-12 14:04:
The regression starts to happen exactly at 54f6fea96f1f22e5c9506fe55e1cfc626e541d59. I've quickly read through the code and it doesn't make sense to me. I will double check.
//Edit: double checked. It's the commit above that causes the issue. cf0c39d9d3dd7c40b53f2a71fc9d9516022ab546 works as expected.
jsmeix commented at 2021-05-12 14:09:
The default behaviour of
https://github.com/rear/rear/commit/54f6fea96f1f22e5c9506fe55e1cfc626e541d59
is
DISKS_TO_BE_WIPED='false'
...
is_false "$DISKS_TO_BE_WIPED" && return 0
...
is_false "$DISKS_TO_BE_WIPED" && return 0
so nothing should happen.
OliverO2 commented at 2021-05-12 14:14:
cf0c39d9 is on a merged branch, which is based on a much older version of master.
jsmeix commented at 2021-05-12 14:25:
As far as I see cf0c39d9d3dd7c40b53f2a71fc9d9516022ab546
was branched after 8f09ede7d617290e948dd779539d4da385d454e3
so the root cause should be
between 8f09ede7d617290e948dd779539d4da385d454e3
and 54f6fea96f1f22e5c9506fe55e1cfc626e541d59
jsmeix commented at 2021-05-12 14:31:
The only Opal PBA related commit message in
git log --format="%ae %H %ad%n%s :%n%b%n" --graph
between 54f6fea96f1f22e5c9506fe55e1cfc626e541d59
and 8f09ede7d617290e948dd779539d4da385d454e3
belongs to
https://github.com/rear/rear/pull/2538
i.e. 6466012947e933ca3cb821cba225517fff2d961e
tolga9009 commented at 2021-05-12 14:34:
Sorry for the confusion.
I have checked out at c6500f9c42f1e68a5aaf36d556b144cbb8e69369, applied Oliver's patch from https://github.com/rear/rear/issues/2615#issuecomment-839678939 and got a more verbose output now. I will try to find a way to get the full log, but for now:
Could not detect TCG Opal 2-compliant disks.
Entering emergency shell...
[...]
The following error occured when executing /etc/scripts/unlock-opal-disks:
modprobe: FATAL: Module mac_hid not found in directory /lib/modules/5.8.0-50-generic
modprobe: FATAL: Module usbhid not found in directory /lib/modules/5.8.0-50-generic
modprobe: FATAL: Module hid not found in directory /lib/modules/5.8.0-50-generic
modprobe: FATAL: Module usb_storage not found in directory /lib/modules/5.8.0-50-generic
Could not detect TCG Opal 2-compliant disks.
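To turn such a log into a list of module names that can be checked on the build host, the modprobe lines can be filtered. A minimal sketch, assuming the log lines have exactly the format shown above; the helper name missing_modules_from_log and the file name pba-boot.log are hypothetical:

```shell
#!/bin/bash
# Read log text on stdin and print the names of all modules that
# modprobe reported as "not found", one per line, without duplicates.
function missing_modules_from_log () {
    sed -n -e 's/^modprobe: FATAL: Module \([^ ]*\) not found.*$/\1/p' | sort -u
}

# Hypothetical usage on the build host: check whether each reported
# module has a module file for the running kernel.
#   missing_modules_from_log <pba-boot.log | while read -r module ; do
#       modinfo -F filename "$module"
#   done
```

If modinfo finds the files on the build host, the modules exist there and were lost while the PBA image was being built.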
jsmeix commented at 2021-05-12 14:41:
By the way:
Today is logical Friday for me
because tomorrow is public holiday in Germany
https://en.wikipedia.org/wiki/Feast_of_the_Ascension
and the day after tomorrow is vacation for me
so I wish you all a relaxed and recovering weekend!
tolga9009 commented at 2021-05-12 14:47:
I'm from Germany as well, wish you a great holiday and good weekend ;)!
//Edit: I checked out at 6466012947e933ca3cb821cba225517fff2d961e, it's working.
OliverO2 commented at 2021-05-12 15:18:
Have a great extended weekend, too! I'll be watching this space even then and I'm somewhat curious about what the cause might be. c38e61d (reported to work correctly here) is on a branch based on master 4b43f439. So the cause must be after that.
I'd still try reverting the changes from d2588e8 and 6a0013a first as
these commits seem to be the ones most likely influencing the situation.
The problem is diagnosed by
Could not detect TCG Opal 2-compliant disks
which means that sedutil-cli could not detect compatible devices – possibly due to a
missing kernel module. From the emergency shell of the above PBA, you
could try:
sedutil-cli --scan
The normal output should be something like this:
# sedutil-cli --scan
Scanning for Opal compliant disks
/dev/sda 2 Samsung SSD 860 PRO 256GB RVM01B6Q
No more disks present ending scan
tolga9009 commented at 2021-05-12 15:28:
Alright, I was able to find the culprit. Seems to be 4480cc00ad0b9a46bbde6f56ca392051b134f949. Just to double check: 9a6b9a109aa77afc6c96cf05bbd7988cf0310d61 works fine, 4480cc00ad0b9a46bbde6f56ca392051b134f949 does not.
I will try to revert it now, as you suggested.
//Edit: I've checked the command above
# sedutil-cli --scan
Scanning for Opal compliant disks
No more disks present ending scan
OliverO2 commented at 2021-05-12 15:37:
4480cc0 is a merge commit, merging just d2588e8c into master. So d2588e8c must be the real cause.
tolga9009 commented at 2021-05-12 15:46:
Yes! Thank you very much! Checked out at c6500f9c42f1e68a5aaf36d556b144cbb8e69369, reverted d2588e8c05aebfcdb3ac5cf1be7732aac2d78eb2 and everything works again!
OliverO2 commented at 2021-05-12 15:51:
Thank you for reporting and all of your research. Enjoy your PBA,
finally! The issue will affect more parts of ReaR than just the PBA. It
seems like the configuration option MODULES=( 'loaded_modules' ) no
longer works as expected.
@jsmeix Could you look at the changes from d2588e8 again as time permits?
OliverO2 commented at 2021-05-13 13:08:
@tolga9009 @jsmeix Could you please adjust the title again so that it matches what we know now?
Something like this would be more precise and help people find answers:
[Regression] MODULES=( 'loaded_modules' ) is broken, making Opal PBA shut down without password prompt
Take-aways:
- The issue is not Dell- or Nvidia-specific.
- Ubuntu's vt.handoff setting is working as expected. In ReaR, it is only used in the PBA to achieve a purely graphical boot without intermediate switching to text mode. It only affected this issue because we needed to see early kernel messages before the immediate shutdown happened. Under normal circumstances, there would be enough time for pressing ESC to switch from the graphical (logo) screen to the text console with boot messages.
- Note that the PBA intentionally chooses to shut down immediately in case of non-recoverable problems. This avoids unauthenticated root access to unattended systems. (Yes, the disk is locked, but you could still change firmware settings...)
jsmeix commented at 2021-05-18 12:40:
I cannot reproduce it on my openSUSE Leap 15.2 system
with "rear mkrescue" and MODULES=( 'loaded_modules' )
(I am not an Opal PBA user).
But it should not make a real difference compared to "rear mkopalpba"
because both "rear -s mkrescue" and "rear -s mkopalpba" show
that build/GNU/Linux/400_copy_modules.sh is sourced, which is
the script that copies kernel modules into the ReaR recovery system.
I get exactly the <module_name>.ko files in the recovery system
(in /tmp/rear.XXXX/rootfs/lib/modules/5.3.18-lp152.75-default/kernel/...)
that match what lsmod shows.
The tricky part of verifying this is that lsmod shows module names only
with _ while the kernel module files can contain both _ and -
e.g. lsmod shows aes_x86_64 but its module file is
lib/modules/5.3.18-lp152.75-default/kernel/arch/x86/crypto/aes-x86_64.ko
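The comparison described above can be scripted by mapping - to _ in the module file names before matching them against lsmod names. A minimal sketch, assuming GNU find and module files named <module_name>.ko, optionally with a compression suffix such as .ko.xz; the function name missing_modules_in_tree is hypothetical:

```shell
#!/bin/bash
# Read module names (as shown by lsmod) on stdin and print every name
# that has no matching module file below the given directory.
function missing_modules_in_tree () {
    local modules_dir="$1"
    local present_modules
    local module_name
    # Collect file base names, strip ".ko" or ".ko.<suffix>", and map '-' to '_'
    # so that e.g. aes-x86_64.ko is compared as aes_x86_64.
    present_modules="$( find "$modules_dir" -type f -name '*.ko*' -printf '%f\n' \
                        | sed -e 's/\.ko$//' -e 's/\.ko\.[a-z0-9]*$//' | tr '-' '_' | sort -u )"
    while read -r module_name ; do
        test "$module_name" || continue
        grep -qxF -- "$module_name" <<<"$present_modules" || echo "$module_name"
    done
}

# Hypothetical usage (the rear.XXXX build directory name differs per run):
#   lsmod | tail -n +2 | cut -d ' ' -f 1 \
#       | missing_modules_in_tree /tmp/rear.XXXX/rootfs/lib/modules/$( uname -r )
```

Modules that are deliberately excluded from the image, or that were loaded only after the copy step ran, will also show up in the output, so a non-empty result is a starting point rather than proof of a bug.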
jsmeix commented at 2021-05-18 12:44:
@tolga9009
how you could check what you get in your recovery system with your use case:
See the KEEP_BUILD_DIR description in usr/share/rear/conf/default.conf
When you run it in debug mode, KEEP_BUILD_DIR=1 is set in usr/sbin/rear
I basically always use debugscript mode via rear -D ...
to get meaningful info in the log for debugging.
With current ReaR GitHub master code and MODULES=( 'loaded_modules' )
you get this line of output on the terminal (when you run it at least in verbose mode):
Copying only currently loaded kernel modules (MODULES contains 'loaded_modules') and those in MODULES_LOAD
In the log, check what happens after
Copying only currently loaded kernel modules
appears there, up to the lines starting with loaded_modules_files='...
that show which module files have been found.
jsmeix commented at 2021-05-19 10:59:
@tolga9009 @OliverO2
because I am not an Opal PBA user (so I can only test "rear mkrescue"):
I would like to ask if you could verify (as time permits)
that "rear mkopalpba" really does not make a difference
compared to "rear mkrescue" with MODULES=( 'loaded_modules' )
regarding what kernel modules get included in the ReaR recovery system.
@OliverO2
or could I somehow run "rear mkopalpba" myself as a test
as a non Opal PBA user (I don't have a self-encrypting device)?
If yes, what etc/rear/local.conf would be right for such a test?
OliverO2 commented at 2021-05-19 12:13:
@jsmeix Well, you can test an Opal PBA on a machine without
self-encrypting devices by setting OPAL_PBA_DEBUG_DEVICE_COUNT=1 to
simulate one such device. However, the binary sedutil-cli must be
available in the PATH to build and test the Opal-related code.
As the issue apparently did not reproduce on your test installation, I have this on my list and will try to reproduce it here. I should be able to do so, as I have a system configuration available roughly matching Tolga's one. I'll try to test it, hopefully no later than the end of next week.
jsmeix commented at 2021-05-19 12:40:
I could run rear -D mkopalpba on my openSUSE Leap 15.2 system
with only this etc/rear/local.conf (only this single line):
OUTPUT=RAWDISK
plus
# ln -s /usr/bin/true /usr/bin/sedutil-cli
but without setting OPAL_PBA_DEBUG_DEVICE_COUNT=1
(I did it before
https://github.com/rear/rear/issues/2615#issuecomment-844044274).
I got all loaded modules in the ReaR recovery system
except kvm and kvm_intel, which is expected because
prep/OPALPBA/Linux-i386/001_configure_workflow.sh contains
local exclude_modules='kvm.*|nvidia.*|vbox.*'
So the issue is something specific to Ubuntu.
jsmeix commented at 2021-05-19 12:44:
By the way:
I did some rear -D mkopalpba tests and for the first one
those modules were missing in the ReaR recovery system:
fat
kvm
kvm_intel
loop
nls_cp437
nls_iso8859_1
vfat
Apparently they were loaded while rear -D mkopalpba was running
but after build/GNU/Linux/400_copy_modules.sh was run, because
I compared the lsmod output after rear -D mkopalpba was run
with what modules there are in the ReaR recovery system.
Subsequent tests of rear -D mkopalpba, where those modules were already
loaded, copied them also into the ReaR recovery system (except kvm and kvm_intel).
jsmeix commented at 2021-05-19 12:53:
Regarding the modules listed in
https://github.com/rear/rear/issues/2615#issuecomment-839823349
I have those loaded and in the ReaR recovery system:
hid_generic
usbcore
usbhid
OliverO2 commented at 2021-05-28 11:56:
I have tried to reproduce the issue on Ubuntu 20.04.2 LTS Desktop with
rear c6500f9c (2021-05-11) vs. e7338e54 (2020-12-20). It did not happen
in either configuration. rear mkopalpba generated root file systems
with an identical set of modules for both versions, one
including commit d2588e8, the other excluding it. I was able to
successfully boot a fully working PBA created by ReaR c6500f9c
(including commit d2588e8).
So the current state of affairs is:
- We have the observation that under specific conditions, a ReaR image is created which contains an incomplete kernel configuration, in this case affecting the PBA.
- We do not know the specific conditions so far, in particular:
  - We cannot say that it is just d2588e8 causing the problem.
  - We cannot say that the issue is related to Ubuntu 20.04.
  - We cannot say that the issue affects the PBA (or RAWDISK output) only.
I have created a small test script which checks for modules in the same way commit d2588e8 does. A reference output for Ubuntu 20.04.2 is at https://pastebin.com/P6kGSqbz.
@tolga9009 maybe you could try this on your system to avoid the issue reappearing with a new ReaR release:
#!/bin/bash
KERNEL_VERSION="$( uname -r )"
function Error() {
    echo "$*" >&2
}
function modinfo_filename () {
    local module_name=$1
    local module_filename=""
    local alias_module_name=( $( modprobe -n -R $module_name 2>/dev/null ) )
    test $alias_module_name && module_name=$alias_module_name
    module_filename="$( modinfo -k $KERNEL_VERSION -F filename $module_name )"
    if ! test "$module_filename" ; then
        test "$KERNEL_VERSION" = "$( uname -r )" || Error "modinfo_filename failed because KERNEL_VERSION does not match 'uname -r'"
        module_filename="$( modinfo -F filename $module_name )"
    fi
    grep -q '(builtin)' <<<"$module_filename" && echo '' || readlink -e $module_filename
}
loaded_modules+=" $( lsmod | tail -n +2 | cut -d ' ' -f 1 )"
loaded_modules_files="$( for loaded_module in $loaded_modules ; do modinfo_filename $loaded_module || Error "$loaded_module loaded or to be loaded but no module file?" ; done | sort -u )"
printf "%s\n" "$loaded_modules_files"
Is the output relevantly different from that of https://pastebin.com/P6kGSqbz?
jsmeix commented at 2021-05-28 12:59:
@OliverO2
WOW - thank you for your thorough and exhaustive analysis!
jsmeix commented at 2021-05-28 13:19:
@OliverO2 @tolga9009
I wish you a relaxed and recovering weekend!
OliverO2 commented at 2021-05-28 13:59:
Thank you! Have an excellent weekend, too!
github-actions commented at 2021-07-28 02:14:
Stale issue message
nolocimes commented at 2021-12-17 22:06:
I ran into this same issue on a new Ubuntu 20.04.3 build.
This issue is related to the following line in 400_copy_modules.sh:
grep -q '(builtin)' <<<"$module_filename" && echo '' || readlink -e $module_filename
On my system, readlink -e $module_filename resolves to /usr/lib/XXXX
instead of /lib/XXXX
The modules are subsequently copied to:
/var/tmp/rear.XXXX/rootfs/usr/lib/modules/XXXX/kernel
as opposed to:
/var/tmp/rear.XXXX/rootfs/lib/modules/XXXX/kernel
My temporary workaround was to replace 'readlink -e' with 'echo'.
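The path change described above is what readlink -e does when /lib is a symbolic link to usr/lib. That layout is an assumption about the reporting system, not something stated in the report. A minimal sketch that reproduces the effect in a throwaway directory, with all paths and names made up for the demonstration:

```shell
#!/bin/bash
# Build a sandbox that mimics a layout where "lib" is a symlink to "usr/lib".
sandbox="$( readlink -e "$( mktemp -d )" )"
mkdir -p "$sandbox/usr/lib/modules/demo/kernel"
touch "$sandbox/usr/lib/modules/demo/kernel/example.ko"
ln -s usr/lib "$sandbox/lib"

# The path as a tool like modinfo might report it (through the symlink) ...
module_filename="$sandbox/lib/modules/demo/kernel/example.ko"
# ... and the canonical path after all symlinks have been resolved.
resolved_filename="$( readlink -e "$module_filename" )"

echo "as reported: ${module_filename#"$sandbox"}"
echo "resolved:    ${resolved_filename#"$sandbox"}"
rm -rf "$sandbox"
```

With such a layout the resolved path starts with /usr/lib, which fits the observation that the modules ended up below rootfs/usr/lib/modules, while the modprobe errors earlier in this thread show lookups under /lib/modules.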
jsmeix commented at 2021-12-20 11:19:
@nolocimes
thank you for your report and what you did to fix it!
Therefore it seems this issue is the same as
https://github.com/rear/rear/issues/2677
I will continue there.
jsmeix commented at 2021-12-22 13:43:
I assume also this one is fixed by https://github.com/rear/rear/pull/2731
752377941 commented at 2022-05-27 07:08:
Sorry for the confusion.
I have checked out at c6500f9, applied Oliver's patch from #2615 (comment) and got a more verbose output now. I will try to find a way to get the full log, but for now:
Could not detect TCG Opal 2-compliant disks.
Entering emergency shell...
[...]
The following error occured when executing /etc/scripts/unlock-opal-disks:
modprobe: FATAL: Module mac_hid not found in directory /lib/modules/5.8.0-50-generic
modprobe: FATAL: Module usbhid not found in directory /lib/modules/5.8.0-50-generic
modprobe: FATAL: Module hid not found in directory /lib/modules/5.8.0-50-generic
modprobe: FATAL: Module usb_storage not found in directory /lib/modules/5.8.0-50-generic
Could not detect TCG Opal 2-compliant disks.
Could you tell me how you solved this problem?
[Export of Github issue for rear/rear.]