#2018 PR merged: Improved setup of etc/resolv.conf in the recovery system (issue 2015)

Labels: enhancement, fixed / solved / done

jsmeix opened issue at 2019-01-16 15:48:

  • Type: Enhancement

  • Impact: Low in general, High on Ubuntu

  • Reference to related issue (URL):
    https://github.com/rear/rear/issues/2015

  • How was this pull request tested?
    By me on my openSUSE Leap 15.0 system.
    Testing on Ubuntu systems is needed.

  • Brief description of the changes in this pull request:

Improved validation of etc/resolv.conf in the recovery system, and the user
gets final control over the DNS setup, if needed, via USE_RESOLV_CONF.

Now "rear mkrescue" errors out if the recovery system's etc/resolv.conf
contains no real nameserver (e.g. only loopback addresses 127.*).

On Ubuntu 18.x versions /etc/resolv.conf is linked to /lib/systemd/resolv.conf
where its actual content is only this single line:

nameserver 127.0.0.53

cf. https://github.com/rear/rear/issues/2015#issuecomment-454082087
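
For illustration, checking that symlink chain on such an Ubuntu system could look like this (a hypothetical transcript, using the symlink target named above):

$ readlink -f /etc/resolv.conf
/lib/systemd/resolv.conf
$ cat /etc/resolv.conf
nameserver 127.0.0.53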

But a loopback IP address for the DNS nameserver cannot work within
the recovery system: no DNS server is listening at 127.0.0.53 there,
because systemd-resolved is not running within the recovery system.

Therefore, when /etc/resolv.conf in the recovery system contains nameserver values,
it means DNS should be used within the recovery system. That cannot work
when only loopback nameservers are specified, so we error out in this case.

It is o.k. to have an empty /etc/resolv.conf in the recovery system
(perhaps no DNS should be used within the recovery system),
and that can even be enforced via USE_RESOLV_CONF="no".
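
For example, possible etc/rear/local.conf settings under this PR could look like this (a sketch; the nameserver address is an illustrative placeholder, and the array form is the one discussed later in this thread):

# Provide real DNS content for the recovery system's etc/resolv.conf:
USE_RESOLV_CONF=( "nameserver 192.168.100.1" )
# Or enforce an empty etc/resolv.conf (no DNS within the recovery system):
USE_RESOLV_CONF="no"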

jsmeix commented at 2019-01-16 15:55:

Currently USE_RESOLV_CONF is not yet documented in default.conf.
I will document it if we agree to use USE_RESOLV_CONF.

jsmeix commented at 2019-01-17 16:38:

I will wait until next Monday for feedback from
https://github.com/rear/rear/issues/2015#issuecomment-454831893
and if there is none I think I will "just merge" it Monday afternoon.
If issues appear later, we can fix them later.

OliverO2 commented at 2019-03-08 12:05:

While testing ReaR on a Ubuntu 18.04.2 LTS desktop system which had been upgraded from 16.04, I encountered these situations:

  • with ReaR version 2.4, DNS was working correctly on the recovery system without any intervention,
  • with ReaR's current master version, rear -v mkrescue aborted with the following message:
ERROR: No nameserver or only loopback addresses in /tmp/rear.RK1T691WDugerN3/rootfs/etc/resolv.conf, specify a real nameserver via USE_RESOLV_CONF
Some latest log messages since the last called script 630_verify_resolv_conf_file.sh:
  2019-03-08 12:49:00.428966084 Including build/GNU/Linux/630_verify_resolv_conf_file.sh
  '/etc/resolv.conf' -> '/tmp/rear.RK1T691WDugerN3/rootfs/etc/resolv.conf'
  2019-03-08 12:49:00.433446296 Useless loopback nameserver '127.0.0.53' in /tmp/rear.RK1T691WDugerN3/rootfs/etc/resolv.conf

So at least after a Ubuntu upgrade from 16.04 and with my (probably pretty standard) desktop configuration (nameserver assignment via DHCP), this PR makes ReaR fail where it did work before.

I'll test everything on a clean Ubuntu 18.04.2 installation, but this will probably happen in April.

jsmeix commented at 2019-03-08 13:21:

@OliverO2
your https://github.com/rear/rear/pull/2018#issuecomment-470906450
is the intended and expected behaviour in current ReaR master code, cf.
https://github.com/rear/rear/issues/2015#issuecomment-456058659
and see the USE_RESOLV_CONF description in default.conf.

The reason is not that ReaR no longer works; the reason is
that /etc/resolv.conf (or its symlink target) on the original system
is useless within the ReaR recovery system, and it is better
to error out during "rear mkbackup/mkrescue" when things cannot work
later during "rear recover", cf. "Try to care about possible errors" in
https://github.com/rear/rear/wiki/Coding-Style

So I did what I could do, which is to detect when /etc/resolv.conf
(or its symlink target) on the original system is useless within
the ReaR recovery system, and to error out early in this case.

I would appreciate an enhancement that makes things work automatically
provided that automatism works reliably and with reasonable effort, cf.
https://github.com/rear/rear/issues/2015#issuecomment-454749972

Simply put:
I would be against having much systemd bloatware in the ReaR recovery system.
I would prefer to keep the recovery system simple and straightforward (KISS).

jsmeix commented at 2019-03-08 14:28:

My immediate idea for an automatism that makes name resolution work
in the recovery system when the /etc/resolv.conf content on the original
system is useless is to "just set the currently used actual nameservers"
in the /etc/resolv.conf for the recovery system.

Unfortunately, some googling for how to show the currently used nameserver
turns up various different ways to do it, depending on which
latest-and-greatest networking stack one uses, which seems to indicate
that nowadays there is no such thing as "the one generically right way"
to do it.

E.g.

# Query DNS for the local FQDN and extract the responding server's address
# from dig's "SERVER:" status line:
dig $( hostname -f ) | grep 'SERVER:' | cut -s -d '(' -f2 | cut -s -d ')' -f1

may look sufficiently generic at first glance, but
https://unix.stackexchange.com/questions/28941/what-dns-servers-am-i-using
says (excerpts):

dig yourserver.somedomain.xyz
...
Ubuntu 18.04 just shows the local dns cache:
SERVER: 127.0.0.53#53(127.0.0.53)

so we are back at square one...

schlomo commented at 2019-03-08 15:16:

@jsmeix I think your wish to keep systemd "pollution" out of the ReaR rescue system is not realistic. What about udev? This is now a systemd service. Also PID 1 is provided by systemd.

I would suggest another strategy: Embrace systemd and faithfully replicate the source system as it was. I think systemd-resolved is a huge improvement over the previously widespread use of dnsmasq or other local DNS servers. systemd-resolved(8) nicely explains the reasons for preferring it. One which I like most is that it finally ensures that there won't be a discrepancy between programs querying DNS directly and those using NSS functions (back in the day I used this problem to test young admins for their Linux networking skills).

While thinking about this topic I realized that so far we only ship a standard nsswitch.conf via SKEL and don't ever look at the real /etc/nsswitch.conf of the system. I think that this is not correct any more and that we should fix that and correctly process that file as well. It will lead us to properly support the future development of Linux with systemd.

Regarding this PR, I think that the possibility to manually set the DNS configuration via ReaR config is nice, with a few adjustments:

  • ReaR config is a shell script, so there is no problem having a multi-line string, e.g. via a <<EOF style definition instead of an array of configuration file lines (see the sketch after this list).
  • Users must also be able to set the NSS configuration accordingly, as using/configuring resolv.conf has a dependency on nsswitch.conf.
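
A minimal sketch of the two configuration styles (assuming the array form from this PR; the domain and address are placeholders):

# Current style per this PR: an array of resolv.conf lines
USE_RESOLV_CONF=( "search example.com" "nameserver 192.168.100.1" )

# Suggested alternative: one multi-line string via a here-document
USE_RESOLV_CONF="$( cat <<EOF
search example.com
nameserver 192.168.100.1
EOF
)"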

Actually, I think it would be easier for users if, instead of the current implementation, they could set the DNS servers and the DNS search list via variables, and not have to think in terms of lines of the resolv.conf configuration file. That way we could also configure nsswitch.conf correctly when DNS is set manually, and otherwise leave it as it was on the source system.

And finally, the "best" way to get the DNS servers on a systemd system is either asking via networkctl status or looking at /run/systemd/resolve/resolv.conf. But beware of systems that use NetworkManager, like my Ubuntu 18.04 desktop: there, only the generated resolv.conf showed true results.
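
Both approaches could be tried like this (a sketch; the grep filters are only illustrative):

# The generated file lists the real upstream nameservers,
# unlike the 127.0.0.53 stub in /etc/resolv.conf:
grep '^nameserver' /run/systemd/resolve/resolv.conf

# Or ask systemd-networkd (requires systemd-networkd to be running):
networkctl status | grep -i dns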

OliverO2 commented at 2019-03-08 15:21:

@jsmeix https://github.com/rear/rear/pull/2018#issuecomment-470926305

it is better to error out during "rear mkbackup/mkrescue" when things cannot work
later during "rear recover"

While I agree with that strategy, what I have observed so far is that ReaR actually worked with that seemingly useless /etc/resolv.conf (in conjunction with whatever other files it copied). So in my case, the assumption that recovery would fail does not hold true. Maybe we need some more analysis before implementing a viable solution here.

schlomo commented at 2019-03-08 15:22:

@OliverO2 I would guess that you used DHCP in the recovery system and that some DHCP script actually wrote your /etc/resolv.conf

OliverO2 commented at 2019-03-08 15:32:

@schlomo Right on the spot. '/etc/resolv.conf' reads:

; generated by /bin/dhclient-script
search <internal domain name here>
nameserver <internal dns ipv4 address here>

jsmeix commented at 2019-03-08 15:41:

@schlomo
I did what I could do with what I have on my systems, cf.
https://github.com/rear/rear/issues/2015#issuecomment-453981364
and
https://github.com/rear/rear/issues/2015#issuecomment-454313695
so that further enhancements in this area would need to be
provided to ReaR, via new separate pull requests,
by people who use those other ways of name resolution.

@OliverO2
if you cannot avoid that "rear mkrescue" errors out for you,
even with a USE_RESOLV_CONF setting in your local.conf,
then I would have to fix my USE_RESOLV_CONF stuff.

I cannot reproduce your particular case because
I do not use your particular system here at work
(and at home I do not use ReaR) but I guess one of

USE_RESOLV_CONF="no"
USE_RESOLV_CONF="# dummy"

should avoid that "rear mkrescue" errors out.

Use KEEP_BUILD_DIR="yes" so that you can directly inspect
the recovery system content in $TMPDIR/rear.XXX/rootfs/,
in particular $TMPDIR/rear.XXX/rootfs/etc/resolv.conf
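
A possible inspection run could look like this (a sketch; /tmp is the usual TMPDIR default, and the rear.XXX directory name varies per run):

# Keep the build area so the recovery system tree survives "rear mkrescue":
echo 'KEEP_BUILD_DIR="yes"' >> /etc/rear/local.conf
rear -v mkrescue
# Then inspect what ended up in the recovery system:
cat /tmp/rear.*/rootfs/etc/resolv.conf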

OliverO2 commented at 2019-03-08 15:47:

@jsmeix No problem for me here, I can easily avoid the situation. My concern is that given a configuration like mine, ReaR would cease to work out of the box for others.

schlomo commented at 2019-03-08 15:49:

@jsmeix maybe you can adjust the default value of the new variable in a way that avoids breaking upgrades?

jsmeix commented at 2019-03-08 16:19:

@OliverO2
ReaR pretends much too often that it "just works out of the box"
by "just blindly proceeding" in case of possible errors, cf.
"Try to care about possible errors" in
https://github.com/rear/rear/wiki/Coding-Style

Then sometimes the user finds out the hard way
when it is too late (i.e. by failures during "rear recover")
that actually it does not work out of the box.

In particular https://github.com/rear/rear/issues/2015
describes a (repairable) failure for "rear recover":

When the rescue media is created this fails to resolve DNS
and therefore also fails to resolve hostnames for remote backups

but it is still a major annoyance when the first "rear recover" run fails and
one must (in case of emergency and time pressure) analyze what is wrong
and what must be done in the running recovery system to repair it.

Therefore I am adding verification steps during "rear mkrescue"
and of course that will sometimes error out with "false alarm",
cf. https://github.com/rear/rear/pull/2060#issuecomment-469705996

But a "false alarm" during "rear mkrescue" can be easily fixed
when you have time for it and when you can do it step by step
in a 'relaxed' way.

In contrast a failure during "rear recover" is often a dead end, cf.

A note on the meaning of 'Relax' in 'Relax-and-Recover'

in https://en.opensuse.org/SDB:Disaster_Recovery

Very very seriously:
I prefer SLES customer bug reports about issues during "rear mkrescue"
very much over customer bug reports about failures during "rear recover",
cf. Help and Support in
https://en.opensuse.org/SDB:Disaster_Recovery
I get every issue about ReaR in SLES so I do care about that most of all.

OliverO2 commented at 2019-03-09 11:36:

@jsmeix I understand your motivation: you want support inquiries to come in at the earliest possible time and not under pressure. This always makes sense. I'm just wondering about the gains in this particular case: when there is a non-working DNS on the recovery system, could those users have tested their recovery process at all?

jsmeix commented at 2019-03-11 10:37:

@OliverO2 regarding
https://github.com/rear/rear/pull/2018#issuecomment-470968951
I assume you have USE_DHCLIENT="yes" in your etc/rear/local.conf,
but you do not also have USE_STATIC_NETWORKING="yes".

With https://github.com/rear/rear/pull/2076,
build/GNU/Linux/630_verify_resolv_conf_file.sh
no longer errors out when etc/resolv.conf has no nameserver
or only loopback addresses and USE_DHCLIENT has a true value
(and USE_STATIC_NETWORKING does not have a true value),
because then etc/resolv.conf in the recovery system
is generated anew by /bin/dhclient-script,
so that its previous content does not matter.
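
A minimal sketch of that check (assuming ROOTFS_DIR, is_true and Error behave like the usual ReaR variables and helpers; the real code lives in build/GNU/Linux/630_verify_resolv_conf_file.sh):

resolv_conf="$ROOTFS_DIR/etc/resolv.conf"
# Nameserver values that are not loopback (127.*) addresses:
real_nameservers="$( grep '^nameserver' "$resolv_conf" | awk '{ print $2 }' | grep -v '^127\.' )"
if test -z "$real_nameservers" ; then
    # With DHCP in use (and no enforced static networking),
    # dhclient-script regenerates etc/resolv.conf in the recovery system,
    # so its current content does not matter:
    if is_true "$USE_DHCLIENT" && ! is_true "$USE_STATIC_NETWORKING" ; then
        return 0
    fi
    Error "No nameserver or only loopback addresses in $resolv_conf, specify a real nameserver via USE_RESOLV_CONF"
fi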

jsmeix commented at 2019-03-11 10:59:

@OliverO2 regarding your
https://github.com/rear/rear/pull/2018#issuecomment-471169618

In my reasoning it does not really matter whether or not
users have tested their recovery process at all.

For the good users who test their recovery process,
it is easier to see the root cause where things start to go wrong
when ReaR errors out early with a clear error message,
which helps them get a really working recovery process faster.

I think a good recent example is
https://github.com/rear/rear/issues/2006
where things failed late at "rear recover" and we needed
a long time to find out what goes wrong (sometimes)
which was the final reason for me to do
https://github.com/rear/rear/pull/2060

Also there were those weird "More than 128 partitions is not supported" issues
where the root cause was some brokenness related to disklayout.conf, where
"rear recover" blindly proceeded until it got lost in get_partition_number()
with an error message that is totally useless at that (much too late) point.

For the bad users who do not test their recovery process,
it is easier for us when ReaR errors out early for them,
because this avoids getting weird issues from those users
when "rear recover" fails, which are much harder to deal with
than issues where ReaR errors out early during
"rear mkbackup/mkrescue" with a clear error message
(just RTEM "Read The Error Message" ;-)

OliverO2 commented at 2019-03-11 11:18:

@jsmeix https://github.com/rear/rear/pull/2018#issuecomment-471487864

I assume you have in your etc/rear/local.conf USE_DHCLIENT="yes"
but you do not also have USE_STATIC_NETWORKING="yes"

Actually, before this issue came up, I had no network-related settings in the local configuration at all (just backup- and output-related stuff).

I have only had a cursory look at the issue, but to me it looks like this:

  • Ubuntu 18.04 uses systemd-resolved to manage DNS resolution, supporting dynamic networking changes induced, for example, by WiFi roaming.
  • For backward compatibility, systemd-resolved(8) listens on 127.0.0.53 and directs old clients there via static content in /etc/resolv.conf (so it is neither broken nor useless).
  • It seems like ReaR does not copy the complete networking software required in this case (which would include a correctly configured systemd-resolved). Nonetheless, in this case the old scripts seem to kick in, updating /etc/resolv.conf as shown above.
  • For ReaR, it doesn't really seem to matter whether DNS is configured via netplan or something else as netplan is just a front-end and will update the necessary backend configuration files (in this case, systemd.network(5)).

So this is just my idea of how things interact currently.

jsmeix commented at 2019-03-11 12:49:

@OliverO2
your idea of how things interact currently is right and that is
basically described at USE_RESOLV_CONF in default.conf.

I never said /etc/resolv.conf is broken or useless on the original system.

I said that when etc/resolv.conf in the recovery system has no nameserver entry,
or when it contains only loopback addresses as nameservers,
then such an etc/resolv.conf is useless in the recovery system,
because we do not have support for systemd-resolved there.

The current test in
build/GNU/Linux/630_verify_resolv_conf_file.sh
matches this state - as far as currently known to me.

When things worked for you with a /etc/resolv.conf on the original system
that is useless in the recovery system, even without USE_DHCLIENT="yes",
it seems, from my current point of view, that things worked for you by luck.

If you can show me that I am wrong, and how things worked for you
by intention via this or that ReaR script, I will further enhance my test in
build/GNU/Linux/630_verify_resolv_conf_file.sh
to better match the current state in ReaR (i.e. no longer error out
when things are made to work by intention).

Above in
https://github.com/rear/rear/pull/2018#issuecomment-470972084
I had written that further enhancements regarding systemd-resolved
would need to be provided to ReaR, via separate pull requests,
by people who use it.

It seems @schlomo also uses systemd-resolved, and according to his
https://github.com/rear/rear/pull/2018#issuecomment-470963117
he would like to have support in ReaR for systemd-resolved,
so I look forward to a pull request.

jsmeix commented at 2019-03-11 15:25:

With https://github.com/rear/rear/pull/2076 merged,
build/GNU/Linux/630_verify_resolv_conf_file.sh no longer errors out
when etc/resolv.conf has no nameserver or only loopback addresses
and USE_DHCLIENT is true (and USE_STATIC_NETWORKING is not true),
because then etc/resolv.conf in the recovery system is generated anew
by /bin/dhclient-script, so that its previous content does not matter.

This way it should in particular no longer falsely error out on systems
that use systemd-resolved (like Ubuntu 18.04) and do their networking
setup via DHCP (probably pretty standard on usual desktop systems),
cf. https://github.com/rear/rear/pull/2018#issuecomment-470906450


[Export of Github issue for rear/rear.]