#2507 PR merged: Fix #2474: add AMD firmware to OPAL PBA if necessary

Labels: enhancement, fixed / solved / done

OliverO2 opened issue at 2020-11-03 14:47:

Pull Request Details:
  • Type: Bug Fix

  • Impact: Normal

  • Reference to related issue (URL): #2474

  • How was this pull request tested?

    As AMD graphics hardware was not available on the test system, tests were run by mocking as follows:

    • Add these lines in usr/share/rear/lib/mkopalpba-workflow.sh after SourceStage "prep":

          LogPrintError "FIRMWARE_FILES=<${FIRMWARE_FILES[*]}>"
          exit
      
    • Run usr/sbin/rear mkopalpba with default configuration (Intel graphics), expect result: FIRMWARE_FILES=<no>.

    • In default.conf use these single-line configuration changes:

      • FIRMWARE_FILES=( yes ), expect result: FIRMWARE_FILES=<no>.
      • FIRMWARE_FILES=( a b c ), expect result: FIRMWARE_FILES=<a b c>.
      • OPAL_PBA_FIRMWARE_FILES=( one two ), expect result: FIRMWARE_FILES=<one two>.
    • In usr/share/rear/conf/GNU/Linux.conf, change grep expression AMD to Intel, expect result: FIRMWARE_FILES=<amdgpu>.

  • Brief description of the changes in this pull request:

    OPAL PBA images only:

    • Add a required firmware file when AMD graphics hardware is present.
    • Allow setting a PBA-specific firmware configuration via the OPAL_PBA_FIRMWARE_FILES configuration variable.

    The issue report #2474 indicates that a single firmware file is all that is required to make a PBA boot with AMD graphics hardware. If field testing shows that more changes are necessary, this PR's solution should be easy to extend.

jsmeix commented at 2020-11-04 09:44:

@OliverO2
as always thank you so much for your continuous valuable contributions to ReaR
that help so much to make ReaR step by step better and better.

I think it is not right that a single firmware file is all that is required
to make a system boot with AMD graphics hardware because on my
openSUSE Leap 15.1 system find /lib*/firmware -ipath '*amdgpu*'
lists more than 300 firmware files and their names indicate that
each kind of AMD GPU chips has and needs its own matching firmware
which is what I would expect for all those many different AMD GPU chips,
cf. https://en.wikipedia.org/wiki/Radeon

On my openSUSE Leap 15.1 system there are 24MB of AMD GPU firmware files:

# du -hs /lib/firmware/amdgpu/
24M     /lib/firmware/amdgpu/

I hope this is not too much for a PBA?

OliverO2 commented at 2020-11-04 16:04:

@jsmeix
Thank you for the thorough review, it seems to be a bit more tricky than I had hoped for.

Note that all of the following only applies to PBA images. Standard ReaR images will contain the amdgpu module and all firmware by default.

The PBA image on my Ubuntu 20.04 desktop is about 46 MB (du -hs). AMD firmware looks like this:

$ du -hs /lib/firmware/amdgpu
33M     /lib/firmware/amdgpu

On Ubuntu 20.04 the amdgpu firmware directory contains 336 files for 30 different GPU models. While we do not know which of those might trigger the boot-time freeze, at least some AMD GPUs seem to be affected:

Without this package installed, poor 2D/3D performance is commonly experienced. Some GPUs may require firmware to function properly at all.

The spec. allows a PBA size up to 128 MB. So we could just throw in that entire firmware directory here to avoid boot-time freezes. Downside: Slows down PBA uploading to the device (can take minutes).

Additional notes:

  1. We should not change FIRMWARE_FILES: By default (non-PBA), it already includes all available files. If the user chose to restrict firmware files, we should not override this choice by auto-configuration.

  2. With respect to modules, I don't think we need to change anything (PBA or non-PBA):

    • default.conf contains MODULES=( 'all_modules' ), which should include the amdgpu module automatically.
    • usr/share/rear/prep/OPALPBA/Linux-i386/001_configure_workflow.sh contains MODULES=( 'loaded_modules' ) which apparently also includes the amdgpu module, otherwise the issue #2474 would not have occured as described (the kernel would have never tried to load AMD firmware).

OliverO2 commented at 2020-11-04 17:21:

With 0a1ba2f3c8fa08de8d5528dc1bb583d94a5eddcc on Ubuntu 20.04 desktop, the PBA size is

  • 48 MB for systems without AMD GPU,
  • 59 MB for systems with AMD GPU.

Looks like we get a good compression ratio on these files in initrd.cgz.

jsmeix commented at 2020-11-05 09:00:

@OliverO2

thank you for your explanatory reply.
It helped me a lot to avoid a mistake.

I fully agree with your reasoning in
https://github.com/rear/rear/pull/2507#issuecomment-721820442
therein in particular

We should not change FIRMWARE_FILES:
By default (non-PBA), it already includes all available files.
If the user chose to restrict firmware files, we should not override this choice by auto-configuration.

Yesterday I was too fast with my (false) conclusion that because of

/lib/firmware/amdgpu/* firmware files plus the amdgpu kernel module
are needed in any case to boot the ReaR recovery system
on systems with AMD-GPU graphics hardware

we should have in ReaR some hardcoded

if type -p lspci >/dev/null && lspci | grep --quiet ' VGA .*AMD'; then
    MODULES+=( amdgpu )
    FIRMWARE_FILES+=( '*amdgpu*' )

My conclusion is wrong because it would be against what I always demand:
"Final power to the user!"

In contrast with such hardcoded things it would be ReaR that has the final power
what gets included in the recovery system regardless what the user may have
specified in his etc/rear/local.conf.

jsmeix commented at 2020-11-05 09:01:

@rear/contributors
if there are no objections I would like to merge it today afternoon.

jsmeix commented at 2020-11-05 10:56:

@OliverO2
I think my argument in
https://github.com/rear/rear/pull/2507#issuecomment-722241196
also applies to

if type -p lspci >/dev/null && lspci | grep --quiet ' VGA .*AMD'; then
    OPAL_PBA_FIRMWARE_FILES+=( '*/amdgpu/*' )
fi

which hardcodedly adds amdgpu firmware files regardless what
the user has specified for OPAL_PBA_FIRMWARE_FILES in his local.conf.

So I wonder if not something like

if type -p lspci >/dev/null && lspci | grep --quiet ' VGA .*AMD'; then
    # Add amdgup firmware files unless the user has specified to not add any firmware files:
    is_false "$OPAL_PBA_FIRMWARE_FILES" || OPAL_PBA_FIRMWARE_FILES+=( '*/amdgpu/*' )
fi

would be better?

I know this is probably somewhat nitpicking or even over the top
but I like to understand how OPAL_PBA_FIRMWARE_FILES is meant to be used.

OliverO2 commented at 2020-11-05 12:29:

@jsmeix
Yes, in general you are right that OPAL_PBA_FIRMWARE_FILES+=... always extends the user's configuration. In this respect it works more like COPY_AS_IS and others in usr/share/rear/conf/GNU/Linux.conf.

If we would like to give the user more control, supporting just a "no firmware" decision seems insufficient. More likely an expert user would want to include some firmware as in

OPAL_PBA_FIRMWARE_FILES=( '*/amdgpu/vega20*' )

To support this, we could choose to enable AMD firmware auto-detection only if OPAL_PBA_FIRMWARE_FILES is empty. The downside would be that if a user wanted to include some unrelated firmware like */intel-ucode/*, the user would also have to remember dealing with AMD GPU quirks, probably without knowing.

The question is, given that

  • the added weight of AMD firmware files is not that significant and
  • those files are only present when an AMD GPU is detected, and
  • the probable outcome of not correctly including AMD firmware is a non-booting system,

is it worth it?

jsmeix commented at 2020-11-05 13:45:

@OliverO2
I will merge it as is because in its current state it is at least "good enough".
If further enhancements are needed we can add them as actually needed.


[Export of Github issue for rear/rear.]