#2475 Issue closed: Opal PBA does not support NVMe drives

Labels: enhancement, fixed / solved / done

bmastenbrook opened issue at 2020-08-08 21:49:

Relax-and-Recover (ReaR) Issue Template

Fill in the following items before submitting a new issue
(quick response is not guaranteed with free support):

  • ReaR version ("/usr/sbin/rear -V"): Relax-and-Recover 2.6-git.4108.0f2f4a02.master / 2020-08-07

  • OS version ("cat /etc/os-release" or "lsb_release -a" or "cat /etc/rear/os.conf"): Ubuntu 20.04.1 LTS (Focal Fossa)

  • ReaR configuration files ("cat /etc/rear/site.conf" and/or "cat /etc/rear/local.conf"): N/A

  • Hardware (PC or PowerNV BareMetal or ARM) or virtual machine (KVM guest or PoverVM LPAR): PC

  • System architecture (x86 compatible or PPC64/PPC64LE or what exact ARM device): x86

  • Firmware (BIOS or UEFI or Open Firmware) and bootloader (GRUB or ELILO or Petitboot): UEFI with GRUB

  • Storage (local disk or SSD) and/or SAN (FC or iSCSI or FCoE) and/or multipath (DM or NVMe): NVMe WDC PC SN730 SDBQNTY-256G-1001

  • Storage layout ("lsblk -ipo NAME,KNAME,PKNAME,TRAN,TYPE,FSTYPE,SIZE,MOUNTPOINT" or "lsblk" as makeshift):

NAME           MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
nvme0n1        259:0    0 238.5G  0 disk  
├─nvme0n1p1    259:1    0    94M  0 part  /boot/efi
├─nvme0n1p2    259:2    0  15.3G  0 part  
│ └─swap_crypt 253:0    0  15.3G  0 crypt [SWAP]
└─nvme0n1p3    259:3    0 223.1G  0 part  /home
  • Description of the issue (ideally so that others can reproduce it):

NVMe devices are represented by the kernel as a root device (/dev/nvme0 in this example) and a sub-device for each namespace of the drive (/dev/nvme0n1 in this example). On most client SSDs, there is only one namespace, and it contains the entirety of the user-visible LBAs on the drive. sedutil-cli expects to be passed the device representing the entire drive, but partprobe needs to be passed the device representing the namespace.

All of the functions in opal-functions.sh attempt to pass the same device to both sedutil-cli and partprobe, and partprobe returns an error status when handed the root device for an NVMe drive instead of the namespace device. As such, the unlocking function fails.

  • Workaround, if any:

None.

  • Attachments, as applicable ("rear -D mkrescue/mkbackup/recover" debug log files):

jsmeix commented at 2020-08-10 10:17:

@OliverO2
could you have a look at this issue here?

OliverO2 commented at 2020-08-10 10:30:

I'll look into this one too. I'd have to familiarize myself with NVME stuff first but the issue seems to contain sufficient details to figure out a solution.

jsmeix commented at 2020-08-10 10:41:

@bmastenbrook
I do not have a NVME device so I cannot reproduce things myself.

I am puzzled by your lsblk output versus your description

NVMe devices are represented by the kernel as a root device
(/dev/nvme0 in this example) and a sub-device for each namespace
of the drive (/dev/nvme0n1 in this example)

because your lsblk output does not contain a nvme0 kernel device
so it seems there is no such device on your system.
But when there is no nvme0 kernel device I wonder how sedutil-cli can
expect to be passed the device representing the entire drive as nvme0?

I assume for sedutil-cli ... <device> the device needs to be a disk type device
which is usually e.g. something like /dev/sda.

In your lsblk output nvme0n1 is show as a disk type device
so I would expect that sedutil-cli ... /dev/nvme0n1 works.

But it seems in case of NVMe devices there are two layers of disk like devices:
A hidden / non-visible low-level nvme0 kind of whole-disk type device plus
one or more normal visible nvme0n1 kind of normal-disk type devices?

Only FYI regarding "fun with device naming" for mmcblk/eMMC devices
cf. https://github.com/rear/rear/issues/2087#issuecomment-475650821
and https://github.com/rear/rear/issues/2087#issuecomment-477551079

OliverO2 commented at 2020-08-10 12:12:

I assume for sedutil-cli ... <device> the device needs to be a disk type device
which is usually e.g. something like /dev/sda.

Actually it could be a toaster as long as its driver provides the necessary ioctls. sedutil-cli uses different ioctl commands for NVME and SATA devices.

jsmeix commented at 2020-08-10 13:12:

With disk type device I meant what the TYPE column of lsblk shows
so of course also whole toasters when shown as disk by lsblk
in contrast to toaster slots that could be shown as part by lsblk ;-)

OliverO2 commented at 2020-08-10 13:42:

@jsmeix
Actually, it seems like sedutil-cli uses the nvme character device nvme0 in the above example, which is just a control device. This control device is used to create the actual disk-type devices called namespaces (= "amount of NVM storage formatted for block access") like nvme0n1. Each namespace device can then be partitioned.

sedutil-cli --scan is used to determine available Opal devices and this seems to spit out the control devices in case of NVME. These character devices are not considered by lsblk.

Maybe we should campaign for toasters being supported as NVME devices, too, with the slots being the namespaces. ;-)

@bmastenbrook
Could you change usr/share/rear/lib/opal-functions.sh like this and try again?

  1. Insert this function just above the definion of function opal_device_regenerate_dek_ERASING_ALL_DATA():

    function opal_device_partprobe() {
        local device="${1:?}"
        # uses partprobe to update partition tables for all block devices belonging to the given Opal device.
    
        case "$device" in
            (*/nvme*)
                partprobe "$device"n[0-9]  # consider all namespace block devices (NOTE: relies on nullglob)
                ;;
            (*)
                partprobe "$device"
                ;;
        esac
    }
    
  2. Change all partprobe invocations to opal_device_partprobe

jsmeix commented at 2020-08-10 14:07:

@OliverO2
I noticed device="${1:?}" and found the ${VAR:?} form many times in Opal related code
but I this leads to an unfriendly exit without any error message when VAR is null or unset.

This is described in "man bash" (GNU bash, version 4.4.23 in my case):

${parameter:?word}
Display Error if Null or Unset.
If parameter is null or unset, the expansion of word
(or a message to that effect if word is not present)
is written to the standard error and
the shell, if it is not interactive, exits.
Otherwise, the value of parameter is substituted.

For example when I put

LogPrint TEST begin
foo=""
bar="${foo:?}" || BugError "foo null or unset"
LogPrint TEST end

in a script ReaR exits as follows

# usr/sbin/rear -D mkrescue
Relax-and-Recover 2.6 / Git
Running rear mkrescue (PID 8066 date 2020-08-10 16:04:49)
Using log file: /root/rear.github.master/var/log/rear/rear-linux-h9wr.log
TEST begin
Exiting rear mkrescue (PID 8066) and its descendant processes ...
Running exit tasks
rear mkrescue failed, check /root/rear.github.master/var/log/rear/rear-linux-h9wr.log for details

OliverO2 commented at 2020-08-10 15:33:

@jsmeix
These checks are just a compact attempt to catch internal bugs early without putting too much error-checking distractions in the code. The log file in your example helps to spot the cause:

2020-08-10 17:20:16.233077604 TEST begin
/home/oliver/Repositories/open-source/rear/usr/share/rear/prep/default/009_test.sh: line 3: foo: parameter null or not set
2020-08-10 17:20:16.234854487 Exiting rear mkrescue (PID 20879) and its descendant processes ...

Is there a better alternative?

  • Not checking at all (like most code in ReaR) would let ReaR either continue with undesirable results or trigger a follow-up error later on which is probably harder to diagnose.
  • Putting in explicit error-checking code everywhere (like [[ -z "$foo" ]] || BugError "foo: parameter null or not set" ]]) distracts from the primary functionality and makes the code harder to read.

jsmeix commented at 2020-08-11 10:42:

@OliverO2
I think explicit testing is the best way for bash scripts.
I try to explain my reasoning why below.

Regarding bar="${foo:?}":

The problem for the user is that he does not get the error message on his terminal.
Let him search the actual error message from the log is not user friendly.
He needs at least some message about what went wrong on his terminal.
Then he can search the log for more details related to that message.
To show the bash error foo: parameter null or not set on his terminal
and have that message also in the log so the user can find the matching part in the log

{ bar="${foo:?}" ; } 2>> >( tee -a "$RUNTIME_LOGFILE" 1>&8 )

is the only way that works for me.
Why a compound group command { ... ; } 2> must be used see
https://github.com/rear/rear/blob/master/usr/share/rear/lib/_input-output-functions.sh#L840
and for the complicated redirection via tee process substitution see
https://github.com/rear/rear/blob/master/usr/share/rear/backup/BORG/default/500_make_backup.sh#L71
and see the explanatory comment for that code.
and for background information see the discussion in
https://github.com/rear/rear/pull/2382
in particular
https://github.com/rear/rear/pull/2382#discussion_r417852393
and subsequent comments there like
https://github.com/rear/rear/pull/2382#discussion_r418018571
and
https://github.com/rear/rear/pull/2382#discussion_r419378558

But what would code like

function opal_device_partprobe() {
    { local device="${1:?}" ; } 2>> >( tee -a "$RUNTIME_LOGFILE" 1>&8 )

actually help?
E.g. it would not catch if opal_device_partprobe /etc/fstab was called.
Others who read that code would get a longer exercise in reverse engineering
what that complicated code intends to do or a longer comment is needed
that explains them what the idea behind is.

In contrast explicit testing like

function opal_device_partprobe() {
    local device="$1"
    test -b "$device" -o -c "$device" || BugError "opal_device_partprobe called for '$device' that is neither a block nor a character device"

makes it obvious for a code reader what that code is about and
the user gets a helpful error message via ReaR standard functionality.

OliverO2 commented at 2020-08-11 13:40:

@jsmeix Since this is about ReaR coding conventions in general, could we continue in a separate issue? This could also prompt some feedback from other contributors. I don't know if is is possible to move the above issue comments to a new issue. If so, I'd appreciate it.

jsmeix commented at 2020-08-11 13:42:

@OliverO2
yes, it does not belong to this issue.

jsmeix commented at 2020-08-12 08:42:

@OliverO2
would you like to open that separated issue about ReaR coding conventions?

I ask you because I think when you do the initial description
the issue would show things from your point of view which is the point of view
we need primarily as initial input to get something new and interesting.

In contrast if I did the initial description the issue would
show things from my point of view which is what I described in
https://github.com/rear/rear/issues/2475#issuecomment-671871423
which is no new and interesting input but "just old known stuff" as in
https://github.com/rear/rear/wiki/Coding-Style

And there is no time pressure, in particular no need to change things
in your code right now, so take your time as long as you need.

OliverO2 commented at 2020-08-12 09:53:

@jsmeix Sure, I can open the issue. The only downside is that I don't know how to move comments from this issue to the new one. If this were possible at all, it would probably be restricted to maintainers. But let's do it this way, I'll start the new issue sometime this week, so stay tuned. :-)

jsmeix commented at 2020-08-12 11:09:

@OliverO2
thank you in advance!

I don't know if it is possible to move comments to another issue.
I see a "Reference in new issue" functionality but that seems to be something different.
I can hide comments in issues.
So to "move" a comment I would manually copy&paste it into the new issue
and then I could hide the old comment in the old issue.

OliverO2 commented at 2020-08-23 20:04:

@bmastenbrook Thanks again for opening this issue. Just a friendly reminder: Could you please respond to the suggested solution (https://github.com/rear/rear/issues/2475#issuecomment-671361418) or state if you have lost interest, so that we have an idea on how to proceed?

OliverO2 commented at 2020-09-04 08:50:

@bmastenbrook I have prepared the above pull request to ease testing. You could use the patch https://github.com/rear/rear/pull/2488.patch or the entire code at https://github.com/OliverO2/rear/tree/feature-opal-nvme-support. As time permits, of course. 😊

github-actions commented at 2020-11-04 01:34:

Stale issue message

jsmeix commented at 2020-11-05 13:51:

With https://github.com/rear/rear/pull/2488 merged
this issue should be fixed.

gitthangbaby commented at 2020-11-15 20:56:

i got the git version, flashed Secure Boot PBA and a nice plymouth dialog is displayed after 11 seconds (not very fast, exactly like LUKS delay, but better than DTA PBA of 5 seconds + forced reboot 15 sec). After setting password, the message says "drive couldn't be unlocked" and next attempts fail to progress booting. In fact, the NVME is indeed unlocked with first attempt, as I can see when i fallback to OPAL PBA> command line (and by a negative check of falling back before password entry). I can unlock with sed-opal-unlocker also as only PBA changed. I have two NVME so the fastest approach is sed-opal-unlocker service with 1 sec delay but not suitable for single drive nor boot files being on the SED drive which i'd like.

unlockers summary:
BIOS: password prompt can be triggered with MBREnable=false but it doesn't work
DTA PBA: works with 5 + 15sec delay, except Ryzen systems, unreadable GUI with no pwd repeat, no Secure Boot?
DTA forks: some bring different hashing, some bring support for Ryzen, added password save for suspend, none are compatible with DTA. unreadable GUI with no pwd repeat, no Secure Boot?
sed-opal-unlocker: almost no delay, password can be in hashed file, added password save for suspend, runs as a service too late to move any boot files to SED drive, compatible with DTA, Secure boot ok
rear: 11sec delay, nice GUI with pwd repeat, added password save for suspend?, compatible with DTA, Secure boot ok

OliverO2 commented at 2020-11-16 17:47:

@gitthangbaby Thanks for your report.

If the PBA fails to unlock your device, could you please use the rescue system to provide a detailed log?

This is what you could do to help find the cause:

  1. Prepare a rescue system on a USB stick according to Step 3 in Quick Start: Setting up Self-Encrypting Disks.
  2. Shut down your system so that SEDs become locked.
  3. Boot the rescue system.
  4. Log in as root (locally, or from another system via ssh)
  5. Run these commands:
    • lsblk -o +MODEL
    • rear opaladmin info
  6. Provide the output of the above commands.
  7. Run this command to attempt to unlock SEDs.
    • rear -D opaladmin unlock
  8. Provide the complete contents of the log file /var/log/rear/rear-*.log
    • NOTE: Capture the log contents right away as it will be overwritten by the next rear invocation.
  9. Run once more:
    • rear opaladmin info
  10. Provide the output of the above command.

gitthangbaby commented at 2020-11-24 18:14:

Hi. Sorry for no reply, I had a misfortune of flashing the Rear USB image (thinking it's PBA) to unlocked NVME drive (thinking it's locked), so i lost all distros. The Rear USB started with grub> only.
Since i have advantage of two drives, i decided to put effort to initram-ize the boot process and it was extremely difficult but successful.

OliverO2 commented at 2020-11-24 22:04:

@gitthangbaby No problem. It would still be valuable if you could try the above steps and provide output and log file. I'm pretty sure whatever is is, it can be fixed easily once we have the necessary information.


[Export of Github issue for rear/rear.]