#3239 PR merged: Fix version test in udev start by desupporting systemd < 190

Labels: bug, cleanup, fixed / solved / done

pcahyna opened issue at 2024-06-06 20:03:

Pull Request Details:
  • Type: Bug Fix

  • Impact: Normal

  • Reference to related issue (URL): #3175

  • How was this pull request tested?
    Together with the changes in #3175, the backup & recovery test make-backup-and-restore-iso passes when executed locally on Fedora Rawhide.

  • Description of the changes in this pull request:
    /etc/scripts/system-setup.d/40-start-udev-or-load-modules.sh in the rescue system examines the systemd version in order to decide whether it should use udev or load modules manually.

Unfortunately, at least on Fedora, the systemd version can contain a tilde (the systemd version equals the systemd package version, which conforms to the Fedora conventions: https://docs.fedoraproject.org/en-US/packaging-guidelines/Versioning/#_handling_non_sorting_versions_with_tilde_dot_and_caret ), and the bash code that compares versions chokes on it on Fedora Rawhide, where the current systemd version is 256~rc3:
/etc/scripts/system-setup.d/40-start-udev-or-load-modules.sh: line 24: [[: 256~rc3: syntax error in expression (error token is "~rc3")
and then the wrong branch is taken.
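
For illustration, the failure mode can be reproduced in a plain bash shell (a minimal sketch; the exact test in 40-start-udev-or-load-modules.sh may differ, and the 190 threshold is the one named in this PR's title):

    systemd_version='256~rc3'
    # [[ ... -ge ... ]] evaluates both operands as arithmetic expressions,
    # so the '~rc3' suffix makes the left-hand operand invalid:
    [[ "$systemd_version" -ge 190 ]] && echo "start udev"
    # bash: [[: 256~rc3: syntax error in expression (error token is "~rc3")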

At minimum this breaks the make-backup-and-restore-iso CI test, because the ramdisk (brd) module gets loaded earlier and thus does not respect the desired parameters. (A ramdisk of a pre-determined size is used to emulate the backup ISO in the test.) It probably has other unwanted consequences as well.

As reasonably modern distros should have a newer systemd (even RHEL 7 has systemd 219 and Ubuntu 16.04 has 229), delete the boot-time check and instead error out in a similar check during mkrescue if a too-old systemd is encountered.

The check in build/GNU/Linux/600_verify_and_adjust_udev.sh is more robust, so it could be used as an alternative if desupporting old systemd is not desired. The version_newer function would have to be moved to a library available during boot, though, and the function does not really handle a tilde in the version, so it appears to work more by accident than by design.
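
For illustration only (a hypothetical sketch, neither ReaR's version_newer nor the code added by this PR): a tilde-tolerant boot-time test could strip the non-numeric suffix before the arithmetic comparison, for example:

    # Keep only the leading integer part of the version string
    # ('256~rc3' -> '256', '228' -> '228', empty stays empty):
    systemd_version='256~rc3'
    systemd_major="${systemd_version%%[!0-9]*}"
    if [[ "${systemd_major:-0}" -ge 190 ]] ; then
        echo "systemd is recent enough, start udev"
    else
        echo "systemd missing or too old, load kernel modules manually"
    fi

The same numeric-prefix stripping could equally feed a version check at mkrescue time.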

pcahyna commented at 2024-06-06 20:14:

@jsmeix and other @rear/contributors, can you please check whether this is not a problem on distribution versions that you care about?

jsmeix commented at 2024-06-07 05:57:

SLES12-SP5 has systemd version 228

# systemd_version=$( systemd-notify --version 2>/dev/null | grep systemd | awk '{ print $2; }' )

# echo "'$systemd_version'"
'228'

# systemd-notify --version
systemd 228
...

SLES11 does not have systemd.

jsmeix commented at 2024-06-07 06:04:

Regarding
"if ReaR actually still works on non-systemd distros"
see
https://github.com/rear/rear/wiki/Test-Matrix-ReaR-2.7#sles11-sp4-with-lvm-that-is-luks-encrypted
and
https://github.com/rear/rear/wiki/Test-Matrix-rear-2.6#sles-11-sp-4
which proves
"The simpler the system, the simpler and easier the recovery"
in
https://en.opensuse.org/SDB:Disaster_Recovery#Inappropriate_expectations

See also in
https://github.com/rear/rear/blob/master/doc/rear-release-notes.txt
in the section "Support" the parts about
"ReaR 2.7 may still work for SLES 11 ..." and
"ReaR 2.7 and earlier versions are known to no longer work ..."

pcahyna commented at 2024-06-07 10:37:

I see the following CI failure on the commit (not on the whole PR):
Build Packages / build (push) Failing after 15m

Unfortunately, it is difficult to see from the run ( https://github.com/pcahyna/rear/actions/runs/9406696235/job/25910725301 ) what exactly has failed. @rear/contributors can you please have a look if anyone understands this?

jsmeix commented at 2024-06-07 11:08:

@pcahyna regarding your question in your
https://github.com/rear/rear/pull/3239#issuecomment-2154567904

I'm afraid - I don't understand how that CI stuff works.

I wrongly guessed its sources are in
https://github.com/rear/rear-integration-tests

But I cannot find anything of what
https://github.com/pcahyna/rear/actions/runs/9406696235/job/25910725301
shows in the files of
https://github.com/rear/rear-integration-tests

In particular neither dist-all nor run-in-docker
nor make package nor ls -lR are in the files in
https://github.com/rear/rear-integration-tests

Is what
https://github.com/pcahyna/rear/actions/runs/9406696235/job/25910725301
shows all one can get from the CI stuff?
Or are more verbose logs elsewhere?

Are the CI sources elsewhere?
If yes, where?
If not:
For me that whole CI stuff results (at least for now)
in a mental "too many indirections stack overflow exit"
(cf. RFC 1925 item 6a).

I am sorry to be unhelpful here (at least for now)
but I have certain mental hard limits with
nowadays chichi kind of computing, cf.
https://www.youtube.com/watch?v=j6dXmsR4_VQ

pcahyna commented at 2024-06-07 11:11:

Is what https://github.com/pcahyna/rear/actions/runs/9406696235/job/25910725301 shows all one can get from the CI stuff? Or are more verbose logs elsewhere?

My problem is that these logs are too verbose already.

Regarding https://github.com/rear/rear-integration-tests , don't worry - I know the tests there, I added those that we use in CI. The "Build Packages" stuff is triggered by some GitHub action that lives elsewhere....

jsmeix commented at 2024-06-07 11:13:

Just right after posting
https://github.com/rear/rear/pull/3239#issuecomment-2154613917
I think I found the CI sources:
https://github.com/rear/rear/blob/master/.github/workflows/build-packages.yml

But unfortunately this also doesn't tell me
what actually goes on - still (at least for now)
a mental "too many indirections stack overflow exit".

pcahyna commented at 2024-06-07 11:13:

it tells me that it was added by @schlomo

jsmeix commented at 2024-06-07 11:22:

Some wild blind guess:

I guess that the

Run ls -lR dist-all
  ls -lR dist-all
  shell: /usr/bin/bash -e {0}
ls: cannot access 'dist-all': No such file or directory
Error: Process completed with exit code 2.

is only a consequence of the error right before it in

Run tools/run-in-docker -- --patch
...
[zillion lines later]
...
ERROR: failed to solve: process
"/bin/bash -xeuo pipefail -c type -p yum &>/dev/null
 || exit 0 ;
    grep -E '(CentOS.*Final|CentOS Linux release 8)' /etc/redhat-release
 &&         sed -i -e 's|#baseurl=http://mirror.centos.org/|baseurl=http://vault.centos.org/|' -e '/^mirror/d' /etc/yum.repos.d/*.repo ;
     yum -q --nogpgcheck install -y         kbd cpio binutils ethtool gzip iputils parted tar openssl gawk attr bc syslinux rpcbind iproute nfs-utils xorriso genisoimage util-linux psmisc procps-ng util-linux         make binutils git rpm-build ;
     yum -q --nogpgcheck install -y sysvinit-tools mkisofs asciidoc xmlto
 ||         yum -q --nogpgcheck install -y asciidoctor
 ||             yum -q --nogpgcheck install -y mkisofs asciidoc xmlto"
did not complete successfully: exit code: 1
ERROR: Failed building quay.io/centos/centos:stream8
** SCRIPT RUN TIME 937 SECONDS **
Error: Process completed with exit code 1.

I added line breaks to put the very long command line string
"..." on separate lines and I split that long string
at ; and && and ||

Currently I fail to see why it failed, i.e. what
specific command in that long command line failed
and why - i.e. with what specific error message.

schlomo commented at 2024-06-07 11:52:

@jsmeix @pcahyna I think that you can safely ignore that error as of now. It also happens on upstream.

https://github.com/rear/rear/actions/runs/9386213011/job/25846226128#step:4:20744 shows the exact error:

0.289 + yum -q --nogpgcheck install -y kbd cpio binutils ethtool gzip iputils parted tar openssl gawk attr bc syslinux rpcbind iproute nfs-utils xorriso genisoimage util-linux psmisc procps-ng util-linux make binutils git rpm-build
0.556 Error: Failed to download metadata for repo 'appstream': Cannot prepare internal mirrorlist: No URLs in mirrorlist

I'd guess that this is related to the mirrorlist / YUM repos for CentOS and/or a configuration error in the quay.io/centos/centos:stream8 docker image.
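
One way to confirm that guess, independent of ReaR (a hedged sketch, assuming docker and network access are available locally):

    # Refresh the repo metadata in the unmodified image; if this already fails
    # with the same mirrorlist error, the ReaR Dockerfile steps are not the cause.
    docker run --rm quay.io/centos/centos:stream8 dnf -q makecache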

The inline shell is indeed a bit confusing, the source is at
https://github.com/rear/rear/blob/1ec0559b3fc43e69a0a52feeb3f5fe615e7c7241/tools/run-in-docker#L85-L94

I'll take what you said as inspiration to extract those RUN lines into a script for better development and better error reporting... When I have time for this.

jsmeix commented at 2024-06-07 12:05:

@schlomo
thank you for your comment!
Now I got some initial basic idea
how all those various code pieces work together.

schlomo commented at 2024-06-07 12:09:

@jsmeix sorry for the somewhat hackish approach in run-in-docker. I put it together to solve my problems (check something on all distros and build packages for our master branch) and was hoping to see more ReaR developers make use of it.

About CentOS8, the rpm-build check of this PR also fails in https://github.com/rear/rear/pull/3239/checks?check_run_id=25937084878 with this error:

Error: Failed to download metadata for repo 'baseos': Cannot prepare internal mirrorlist: No URLs in mirrorlist

Which is why I believe there is a general CentOS8 problem that we just experience here.

pcahyna commented at 2024-06-07 12:18:

yes, I guess it is because CentOS 8 is EOL. @schlomo can you please disable it in this test? and @jsmeix how were you able to locate the relevant error so quickly? I was not able to, the whole output is so long.

pcahyna commented at 2024-06-07 12:19:

I will take care of disabling the rpm-build CentOS 8 test.

jsmeix commented at 2024-06-07 12:39:

@schlomo
yes, since your explanation in
https://github.com/rear/rear/pull/3239#issuecomment-2154680294
I also think the root cause in this case
is "something outside of ReaR"
so we can - and should - just ignore it.

I think this particular case shows rather well
what I had in mind when I wrote
"too many indirections stack overflow exit".

We have our "own ReaR code" plus our "own CI code"
(which is the only code we have under control).

But (at least some) CI tests depend on tons of
other stuff outside of what is under our control
(docker, OS images at third-party locations, whatnot...).

So when some CI test fails it is almost impossible
to see directly if that failure is caused by a change
in our own code - in particular whether or not a
CI failure is caused by a change in our own ReaR code.

I think it is almost impossible because of a too high
stack of indirections where most of that stack is
outside of our control and - even worse - much of that
stack is located at various third-party locations.

So in practice it means CI failures are mostly
meaningless to decide if a change in our own ReaR code
is good or bad because in most cases CI failures
are not caused by a change in our own ReaR code
but instead by something outside of our control
at an arbitrary third-party location.

This makes it rather questionable whether or not
the CI stuff is actually helpful for us in practice.

In the end it is the same basic question as it
had been with those automated ShellCheck stuff.

When automated tests result in too many false positives
those automated tests do more harm than good.

I think a false positive rate of more than one or two percent
already does more harm than good.

When for each pull request 10 automated tests are run
and each one has 98% reliability, then the overall
reliability is only 82% (0.98 ^ 10 = 0.817),
so roughly every 5th CI failure is a false positive.

To have 98% overall reliability (for 10 CI tests)
each one must have 99.8% reliability
(0.998 ^ 10 = 0.980).

With "reliability" I mean that a failing CI test
correctly identifies an issue in our own code.
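
For reference, the compounding arithmetic above can be reproduced with a quick shell check (assuming GNU bc is installed):

    echo '0.98 ^ 10' | bc -l     # prints .81707..., i.e. roughly 82% overall
    echo '0.998 ^ 10' | bc -l    # prints .98017..., i.e. roughly 98% overall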

schlomo commented at 2024-06-07 12:47:

@jsmeix BTW, when using the run-in-docker tool locally it looks much nicer than on CI:

https://github.com/rear/rear/assets/101384/3fb4bab5-d50a-494c-ac17-ecd3276f4eaa

(The upper part shows a completed section and the lower part shows an in-progress section; "section" meaning patching the Docker image for a distro to use.)

And yes, I agree with your thoughts on CI and 3rd party dependencies. OTOH I think it is useful to have, as we can then see if a PR breaks something that works on master, or - as in this case - we see that it is broken in the same way as master and that we can therefore safely ignore it.

pcahyna commented at 2024-06-07 17:08:

I took the opportunity to disable CentOS CI Copr builds and tests in this PR (even though it is unrelated to the change).

pcahyna commented at 2024-06-07 17:10:

So in practice it means CI failures are mostly
meaningless to decide if a change in our own ReaR code
is good or bad because in most cases CI failures
are not caused by a change in our own ReaR code
but instead by something outside of our control
at an arbitrary third-party location.

This makes it rather questionable whether or not
the CI stuff is actually helpful for us in practice.

Depends. Some of the failures are due only to external circumstances breaking the test. But some are due to the OS evolving in a way that breaks ReaR itself. (And this was the case for the CI failure that triggered this PR.) In the latter case, we want to know about it, because we need to fix it anyway.

pcahyna commented at 2024-06-07 17:29:

@rear/contributors I would like to merge the PR today to help unblock other stuff, given the approvals and the CI change is not a significant change.

pcahyna commented at 2024-06-07 18:22:

thank you all for the reviews!

pcahyna commented at 2024-06-17 14:03:

@jsmeix

@schlomo yes, since your explanation in #3239 (comment) I also think the root cause in this case is "something outside of ReaR" so we can - and should - just ignore it.

I think this particular case shows rather well what I had in mind when I wrote "too many indirections stack overflow exit".

We have our "own ReaR code" plus our "own CI code" (which is the only code we have under control).

But (at least some) CI tests depend on tons of other stuff outside of what is under our control (docker, OS images at third-party locations, whatnot...).

So when some CI test fails it is almost impossible to see directly if that failure is caused by a change in our own code - in particular whether or not a CI failure is caused by a change in our own ReaR code.

I think it is almost impossible because of a too high stack of indirections where most of that stack is outside of our control and - even worse - much of that stack is located at various third-party locations.

So in practice it means CI failures are mostly meaningless to decide if a change in our own ReaR code is good or bad because in most cases CI failures are not caused by a change in our own ReaR code but instead by something outside of our control at an arbitrary third-party location.

This makes it rather questionable whether or not the CI stuff is actually helpful for us in practice.

In the end it is the same basic question as it had been with that automated ShellCheck stuff.

When automated tests result in too many false positives those automated tests do more harm than good.

I think a false positive rate of more than one or two percent already does more harm than good.

When for each pull request 10 automated tests are run and each one has 98% reliability, then the overall reliability is only 82% (0.98 ^ 10 = 0.817), so roughly every 5th CI failure is a false positive.

To have 98% overall reliability (for 10 CI tests) each one must have 99.8% reliability (0.998 ^ 10 = 0.980).

With "reliability" I mean that a failing CI test correctly indentifies an issue in our own code.

I would like to discuss the future of the CI stuff (if you want, we can move it to a separate issue).

I agree that false positives are bad, but I do not want to get back to a state where even the basic backup and recovery test for each PR has to be run manually by the submitter. I would thus like to put some effort into improving the reliability of the CI so that it is "green" most of the time and any "red" result most likely represents an actual bug (although the bug can be due to unrelated changes in the distro that we test on, sure, but even that we need to know about).

First of all, we have the rpm-build:opensuse-tumbleweed-x86_64 build test. Would you agree to removing it? I thought that it might be useful for you, but among the builds it seems to be the one that fails the most often. And for software that is not compiled, like ReaR, a build test is of limited use anyway, since there will be few build failures (no syntax errors to be caught by the compiler).

Second, I will disable the koji-build checks, I found that I only rarely use them (they produce RPMs that can be easily installed on a system, but I build my own if I need to anyway).

Third, the fedora-rawhide tests are prone to fail because of entirely unrelated breakage in Fedora that occurs from time to time, like recently. One can expect it to happen many times a year, so this will not pass your "2%" criterion. Let's disable the Fedora Rawhide test then.

Last, there is now a test failure when recovering F40 and Rawhide, discussed in more detail in #3175 . This may well be due to a real problem in ReaR or in the distro, as it is 100% reproducible in the CI environment. But since Testing Farm does not provide any good way to debug it (console logs are incomplete), I will disable also the Fedora 40 tests until I debug the problem.

There will always be test failures due to changes outside our control, but we should first focus on eliminating flakiness/temporary breakage and then we can focus on tests where the failures are reliable/persistent. Even if they are due to changes outside of our control, we still need to adapt to them or report bugs at their source, so such tests are still valuable for us and we should act on the results.

In the end, there will be some cases where the test fails reliably because of a change that breaks only the test script and not ReaR, and the test will need to be fixed. This is extra maintenance work that is the price for having a working CI, and IMO this price is acceptable.

WDYT?

jsmeix commented at 2024-06-17 14:36:

@pcahyna
I fully agree to disable at least for now all CI stuff
that currently does not reasonably reliably produce
actually helpful results for us.

In general regarding openSUSE Tumbleweed
see what I wrote about it starting at
https://github.com/rear/rear/blob/master/doc/rear-release-notes.txt#L3862

In theory ReaR 2.7 should work on openSUSE Tumbleweed but in practice
arbitrary failures could happen at any time because the Tumbleweed
distribution is a pure rolling release version of openSUSE containing the
latest stable versions of all software
(cf. https://en.opensuse.org/Portal:Tumbleweed) so arbitrary changes of any
software are possible at any time that could arbitrarily break how ReaR works.

The "arbitrary changes of any software are possible at any time"
also means that building RPMs on openSUSE Tumbleweed cannot
reasonably reliably produce actually helpful results for us.
So yes, please disable all openSUSE Tumbleweed related stuff.

I cannot comment on the koji-build checks
because I don't know what they are good for
so from my personal (in this case ignorant) point of view
feel free to disable them as you think is best.

Regarding Fedora related things, feel free to do
what fits best your needs regarding Red Hat.

jsmeix commented at 2024-06-17 14:50:

@pcahyna
what I would prefer - if possible - is that our CI tests
run in stable environments that are under our control.

In particular I would prefer - if possible - using
fixed OS versions for our CI tests.
For example - if possible - by using fixed OS images
of stable OS versions that we keep in our 'rear' space
at GitHub under our control: First and foremost
the stable enterprise Linux distributions RHEL and SLES
plus optionally also stable Debian or Ubuntu "LTS" versions
provided someone could maintain ReaR for Debian or Ubuntu.

pcahyna commented at 2024-06-17 15:17:

@jsmeix Koji is the Fedora build system, and this check builds scratch packages the same way the official Fedora packages would then be built. In practice, I think there are not enough differences from the rpm-build tests, which use Copr (another build system), to expect that one would uncover problems that the other does not. I wanted to use the RPMs built this way in test automation that expects Koji builds as input, but I have found myself very rarely, if ever, doing that. The builds are in fact very reliable, but since they are not very useful, I will eliminate them to reduce visual clutter in the result list.

Regarding openSUSE, I see that there is an opensuse-leap-15.3-x86_64 build configured now, which does not appear in the results, probably because Copr does not know this target anymore (it knows opensuse-leap-15.5-x86_64 and opensuse-leap-15.4-x86_64). If you want, I can update it. Beware though, it will need updating in a similar way in the future, and I also saw the opensuse builds failing several times for reasons entirely unrelated to ReaR or even openSUSE itself, so false positives are more likely than for other builds.

schlomo commented at 2024-06-17 15:32:

As we discuss CI builds here... my thoughts:

  1. I really would love to have build automation that can run both locally on my machine and in the Cloud. That gives me significantly faster cycle times for changes and allows me to debug problems happening in the Cloud in my local setup. tools/run-in-docker.sh was born out of that thought and it creates our snapshot builds. That is also the reason why I find OBS, Koji or other Cloud-only build environments less useful.
  2. I see value in continuously building and testing against unknown Linux distro versions, because that gives us an early warning for problems that our users might be facing tomorrow.
  3. Yes, there is - and will be - a permanent risk of such builds failing for external, e.g. non-ReaR, reasons. IMHO this is an accepted risk and in the end we still benefit more from the continued validation than what we suffer from the flaky builds. For me this is a trade-off decision where I choose the information provided or confirmation supplied by these builds.
  4. Finally, yes we should keep investing into stabilizing our builds and reducing the external influence on the build results - but not at the expense of staying back.

HTH, and maybe we move this to a dedicated issue with specific suggestions?

jsmeix commented at 2024-06-17 15:33:

@pcahyna
don't worry about openSUSE CI build tests.
I cannot care about them (no time for that).
I.e. feel free to disable all of them.

In particular for SUSE (i.e. SLES) users they are meaningless
(SLES users get their RPMs directly from SUSE).

When needed I try to fix build issues
e.g. the last one was
https://github.com/rear/rear/issues/3235
in the openSUSE Build Service (OBS)
project "Archiving:Backup:Rear"
https://build.opensuse.org/project/show/Archiving:Backup:Rear
which is the traditional place
where we (i.e. ReaR upstream) build packages
which is the "officially documented place" at
http://relax-and-recover.org/download/

schlomo commented at 2024-06-17 15:41:

@jsmeix the problem I see with OBS, e.g. https://build.opensuse.org/package/show/Archiving:Backup:Rear/rear-2.7 is that it seems like we don't really maintain that any more. At least the Debian 12 and xUbuntu builds have failed and therefore users would have a problem if they need those packages.

For the next ReaR release I would therefore like to rely on automation that runs in GitHub and where it is simpler for all of us maintainers to see what is going on or to make changes. And where all the configuration for the builds is in our ReaR git repo and a change & push to it will effect the change.

For distro supported packages targeting SLES and RHEL and Fedora I of course expect the vendors to keep their official build infra up and running, but I wouldn't expect any other packages to come out of that.

For snapshot builds of master I would in turn expect the distros to not care and to supply all of the snapshot packages from our GitHub Releases page.

For a release I would expect to also provide release packages via our GitHub Releases (and GH actions automation), similar to the snapshots.

And, I'm happy to keep an eye on that stuff and to keep working on the build automation that runs on GH actions.

jsmeix commented at 2024-06-17 15:43:

I see negative value in continuously building and testing
against unknown Linux distro versions
because of the "unknown".
Who should continuously care about continuously failing
building and testing against unknown Linux distro versions?
@schlomo
when you continuously care about what you ask for
all is OK for me.
I cannot - no time for that.

schlomo commented at 2024-06-17 16:03:

@jsmeix yes, I also want to solve the "no time" and "rotting systems" problem. So my suggestion is that

  1. the distro / vendor build systems cover "their" distros and you (@jsmeix and @pcahyna) take care of that and the fact that it exists means that somebody cares about it.
  2. here on the GitHub page we build everything and care less because it is all based on best effort and if it is broken then eventually somebody will fix it.

pcahyna commented at 2024-06-17 16:20:

Moving to https://github.com/rear/rear/issues/3251

jsmeix commented at 2024-06-18 05:36:

@schlomo
regarding build failures in OBS
https://build.opensuse.org/project/show/Archiving:Backup:Rear
for non-(open)SUSE distributions:

As I wrote above
https://github.com/rear/rear/pull/3239#issuecomment-2173727926

When needed I try to fix build issues
e.g. the last one was
https://github.com/rear/rear/issues/3235
in the openSUSE Build Service (OBS)

I try to fix build issues as far as I can justify
also for non-(open)SUSE distributions but there
my efforts are very limited because I am not paid
by non-SUSE distributions.

In particular regarding Ubuntu I am very reluctant
because Ubuntu has a commercial distributor but
it seems Canonical is not interested in supporting
us at ReaR upstream so they get what they pay for.

openSUSE and in particular the openSUSE Build Service is
a public and free (both as in freedom and in free beer)
project where Ubuntu ReaR users could (and should)
contribute. They get what they contribute.

Because any Ubuntu user could contribute and fix the
Ubuntu build failures in the openSUSE Build Service,
I feel I have no right to "simply remove" their
currently failing Ubuntu ReaR source packages.


[Export of Github issue for rear/rear.]