#2018 PR merged: Improved setup of etc/resolv.conf in the recovery system (issue 2015)¶
Labels: enhancement, fixed / solved / done
jsmeix opened issue at 2019-01-16 15:48:¶
- Type: Enhancement
- Impact: Low (but High on Ubuntu)
- Reference to related issue (URL): https://github.com/rear/rear/issues/2015
- How was this pull request tested? By me on my openSUSE Leap 15.0 system. Testing on Ubuntu systems is needed.
- Brief description of the changes in this pull request:
Improved validation of etc/resolv.conf in the recovery system, giving the user final control over the DNS setup if needed via USE_RESOLV_CONF.
Now "rear mkrescue" errors out if the recovery system's etc/resolv.conf contains no real nameserver (e.g. only loopback addresses 127.*).
On Ubuntu 18.x versions /etc/resolv.conf is linked to /lib/systemd/resolv.conf, whose actual content is only the following single line:
nameserver 127.0.0.53
cf. https://github.com/rear/rear/issues/2015#issuecomment-454082087
But a loopback IP address for the DNS nameserver cannot work within the recovery system: there is no DNS server listening at 127.0.0.53 because systemd-resolved is not running within the recovery system.
Therefore when /etc/resolv.conf in the recovery system contains nameserver values, it means DNS should be used within the recovery system, and that cannot work when only loopback nameservers are specified, so we error out in this case.
It is o.k. to have an empty /etc/resolv.conf in the recovery system (perhaps no DNS should be used within the recovery system), and that can even be enforced via USE_RESOLV_CONF="no".
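To illustrate the described behaviour, here is a minimal sketch of such a loopback-only nameserver check (not the actual 630_verify_resolv_conf_file.sh code; the $ROOTFS_DIR variable and the messages are illustrative):

```bash
# Check the recovery system's resolv.conf for a usable nameserver,
# assuming $ROOTFS_DIR points at the recovery system tree being built:
resolv_conf="$ROOTFS_DIR/etc/resolv.conf"
has_real_nameserver="no"
while read -r keyword value junk ; do
    # Only 'nameserver' entries matter for this check:
    test "$keyword" = "nameserver" || continue
    case "$value" in
        # Loopback addresses cannot work in the recovery system
        # because no DNS server is listening there:
        (127.*|::1)
            echo "Useless loopback nameserver '$value' in $resolv_conf" ;;
        (*)
            has_real_nameserver="yes" ;;
    esac
done < "$resolv_conf"
test "$has_real_nameserver" = "yes" \
    || { echo "ERROR: No nameserver or only loopback addresses in $resolv_conf" >&2 ; exit 1 ; }
```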
jsmeix commented at 2019-01-16 15:55:¶
Currently USE_RESOLV_CONF is not yet documented in default.conf.
I will document it if we agree to use USE_RESOLV_CONF.
jsmeix commented at 2019-01-17 16:38:¶
I will wait until next Monday for feedback from
https://github.com/rear/rear/issues/2015#issuecomment-454831893
and if there is none I think I will "just merge" it Monday afternoon.
If issues appear later, we can fix them later.
OliverO2 commented at 2019-03-08 12:05:¶
While testing ReaR on an Ubuntu 18.04.2 LTS desktop system which had been upgraded from 16.04, I encountered these situations:
- with ReaR version 2.4, DNS was working correctly on the recovery system without any intervention,
- with ReaR's current master version, rear -v mkrescue aborted with the following message:
ERROR: No nameserver or only loopback addresses in /tmp/rear.RK1T691WDugerN3/rootfs/etc/resolv.conf, specify a real nameserver via USE_RESOLV_CONF
Some latest log messages since the last called script 630_verify_resolv_conf_file.sh:
2019-03-08 12:49:00.428966084 Including build/GNU/Linux/630_verify_resolv_conf_file.sh
'/etc/resolv.conf' -> '/tmp/rear.RK1T691WDugerN3/rootfs/etc/resolv.conf'
2019-03-08 12:49:00.433446296 Useless loopback nameserver '127.0.0.53' in /tmp/rear.RK1T691WDugerN3/rootfs/etc/resolv.conf
So at least after an Ubuntu upgrade from 16.04 and with my (probably pretty standard) desktop configuration (nameserver assignment via DHCP), this PR makes ReaR fail where it worked before.
I'll test everything on a clean Ubuntu 18.04.2 installation, but this will probably happen in April.
jsmeix commented at 2019-03-08 13:21:¶
@OliverO2
your https://github.com/rear/rear/pull/2018#issuecomment-470906450
is the intended and expected behaviour in current ReaR master code, cf.
https://github.com/rear/rear/issues/2015#issuecomment-456058659
and see the USE_RESOLV_CONF description in default.conf.
The reason is not that ReaR no longer works; the reason is that /etc/resolv.conf (or its symlink target) on the original system is useless in the ReaR recovery system, and it is better to error out during "rear mkbackup/mkrescue" when things cannot work later during "rear recover", cf. "Try to care about possible errors" in
https://github.com/rear/rear/wiki/Coding-Style
So I did what I could do, which is to detect when /etc/resolv.conf (or its symlink target) on the original system is useless in the ReaR recovery system, and error out early in this case.
I would appreciate an enhancement that makes things work automatically, provided that automatism works reliably and with reasonable effort, cf.
https://github.com/rear/rear/issues/2015#issuecomment-454749972
Simply put: I would be against having much systemd bloatware in the ReaR recovery system. I would prefer to keep the recovery system simple and straightforward (KISS).
jsmeix commented at 2019-03-08 14:28:¶
My immediate idea for an automatism that makes name resolution work in the recovery system when the /etc/resolv.conf content on the original system is useless is to "just set the currently used actual nameservers" in the /etc/resolv.conf for the recovery system.
Unfortunately some Googling for how to show the currently used nameserver turns up various different ways to do that, depending on what latest greatest networking stuff one uses, which seems to indicate that nowadays there is no such thing as "the one generically right way" to do it.
E.g.
dig $( hostname -f ) | grep 'SERVER:' | cut -s -d '(' -f2 | cut -s -d ')' -f1
may look sufficiently generic at first glance, but
https://unix.stackexchange.com/questions/28941/what-dns-servers-am-i-using
tells (excerpts):
dig yourserver.somedomain.xyz
...
Ubuntu 18.04 just shows the local dns cache:
SERVER: 127.0.0.53#53(127.0.0.53)
so we are back at square one...
schlomo commented at 2019-03-08 15:16:¶
@jsmeix I think your wish to keep systemd "pollution" out of the ReaR rescue system is not realistic. What about udev? This is now a systemd service. Also PID 1 is provided by systemd.
I would suggest another strategy: Embrace systemd and faithfully replicate the source system as it was. I think systemd-resolved is a huge improvement over the previously widespread use of dnsmasq or other local DNS servers. systemd-resolved(8) nicely explains the reasons for preferring it. The one I like most is that it finally ensures that there won't be a discrepancy between programs querying DNS directly and those using NSS functions (back in the day I used this problem to test young admins for their Linux networking skills).
While thinking about this topic I realized that so far we only ship a standard nsswitch.conf via SKEL and don't ever look at the real /etc/nsswitch.conf of the system. I think that this is not correct any more, that we should fix it and correctly process that file as well. It will lead us to properly support the future development of Linux with systemd.
Regarding this PR, I think that the possibility to manually set the DNS configuration via the ReaR config is nice, with a few adjustments:
- ReaR config is a shell script, so there is no problem having a multi-line string, e.g. via a <<EOF style definition instead of an array of configuration file lines (see the sketch after this list).
- Users must also be able to set the NSS configuration accordingly, as using/configuring resolv.conf has a dependency on nsswitch.conf.
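A hypothetical illustration of the heredoc idea (neither this variable name nor this setting exists in ReaR; the current implementation uses the USE_RESOLV_CONF array of lines):

```bash
# etc/rear/local.conf is sourced as a shell script, so a multi-line
# value via a heredoc works; RESOLV_CONF_CONTENT is a made-up name:
RESOLV_CONF_CONTENT="$( cat <<EOF
search example.com
nameserver 192.168.1.1
nameserver 192.168.1.2
EOF
)"
```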
I think actually that it would be easier for users if, instead of the current implementation, they could set the DNS servers and DNS search list via variables and not have to think about lines of the resolv.conf configuration file. That way we could also configure nsswitch.conf correctly if DNS is set manually and otherwise leave it as it was on the source system.
And finally, the "best" way to get the DNS servers on a systemd system is either asking it via networkctl status or looking at /run/systemd/resolve/resolv.conf. But beware of systems that use NetworkManager, like my Ubuntu 18.04 desktop. There only the generated resolv.conf showed true results.
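A rough sketch of those two approaches, assuming a system running systemd-networkd/systemd-resolved (output formats vary between versions):

```bash
# Ask systemd-networkd for the per-link status, which lists DNS servers:
networkctl status | grep 'DNS'
# Or read the resolv.conf that systemd-resolved generates with the
# real (non-loopback) nameservers:
grep '^nameserver' /run/systemd/resolve/resolv.conf
```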
OliverO2 commented at 2019-03-08 15:21:¶
@jsmeix https://github.com/rear/rear/pull/2018#issuecomment-470926305
it is better to error out during "rear mkbackup/mkrescue" when things cannot work
later during "rear recover"
While I agree with that strategy, what I have observed so far is that ReaR actually worked with that seemingly useless /etc/resolv.conf (in conjunction with whatever other files it copied). So in my case, the assumption that recovery would fail does not hold true. Maybe we need some more analysis before implementing a viable solution here.
schlomo commented at 2019-03-08 15:22:¶
@OliverO2 I would guess that you used DHCP in the recovery system and that some DHCP script actually wrote your /etc/resolv.conf.
OliverO2 commented at 2019-03-08 15:32:¶
@schlomo Right on the spot. '/etc/resolv.conf' reads:
; generated by /bin/dhclient-script
search <internal domain name here>
nameserver <internal dns ipv4 address here>
jsmeix commented at 2019-03-08 15:41:¶
@schlomo
I did what I could do with what I have on my systems, cf.
https://github.com/rear/rear/issues/2015#issuecomment-453981364
and
https://github.com/rear/rear/issues/2015#issuecomment-454313695
so that further enhancements in this area would need to be provided to ReaR by people who use those other ways of name resolution, via new separate pull requests.
@OliverO2
if you cannot avoid that "rear mkrescue" errors out for you with a USE_RESOLV_CONF setting in your local.conf, then I would have to fix my USE_RESOLV_CONF stuff.
I cannot reproduce your particular case because I do not use your particular system here at work (and at home I do not use ReaR), but I guess one of
USE_RESOLV_CONF="no"
USE_RESOLV_CONF="# dummy"
should avoid that "rear mkrescue" errors out.
Use KEEP_BUILD_DIR="yes" so that you can directly inspect the recovery system content in $TMPDIR/rear.XXX/rootfs/, in particular $TMPDIR/rear.XXX/rootfs/etc/resolv.conf
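For example, a possible etc/rear/local.conf snippet combining those suggestions (one of the two USE_RESOLV_CONF values suffices):

```bash
# Enforce an empty etc/resolv.conf, i.e. no DNS in the recovery system:
USE_RESOLV_CONF="no"
# Keep the build area so rootfs/etc/resolv.conf can be inspected:
KEEP_BUILD_DIR="yes"
```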
OliverO2 commented at 2019-03-08 15:47:¶
@jsmeix No problem for me here, I can easily avoid the situation. My concern is that given a configuration like mine, ReaR would cease to work out of the box for others.
schlomo commented at 2019-03-08 15:49:¶
@jsmeix maybe you can adjust the default value of the new variable in a way that avoids breaking upgrades?
jsmeix commented at 2019-03-08 16:19:¶
@OliverO2
ReaR pretends much too often that it "just works out of the box" by "just blindly proceeding" in case of possible errors, cf. "Try to care about possible errors" in
https://github.com/rear/rear/wiki/Coding-Style
Then sometimes the user finds out the hard way, when it is too late (i.e. by failures during "rear recover"), that actually it does not work out of the box.
In particular
https://github.com/rear/rear/issues/2015
describes a (repairable) failure for "rear recover":
"When the rescue media is created this fails to resolve DNS and therefore also fails to resolve hostnames for remote backups"
but it is still a major annoyance when the first "rear recover" run fails and one must (in case of emergency and time pressure) analyze what is wrong and what must be done in the running recovery system to repair it.
Therefore I am adding verification steps during "rear mkrescue", and of course that will sometimes error out with a "false alarm", cf.
https://github.com/rear/rear/pull/2060#issuecomment-469705996
But a "false alarm" during "rear mkrescue" can be easily fixed when you have time for it and when you can do it step by step in a 'relaxed' way. In contrast, a failure during "rear recover" is often a dead end, cf. "A note on the meaning of 'Relax' in 'Relax-and-Recover'" in
https://en.opensuse.org/SDB:Disaster_Recovery
Very seriously: I much prefer SLES customer bug reports about issues during "rear mkrescue" over customer bug reports about failures during "rear recover", cf. "Help and Support" in
https://en.opensuse.org/SDB:Disaster_Recovery
I get every issue about ReaR in SLES, so I care about that most of all.
OliverO2 commented at 2019-03-09 11:36:¶
@jsmeix I understand your motivation: you want your support inquiries to come in at the earliest possible time and not under pressure. This always makes sense. I'm just wondering about the gains in this particular case: When there's a non-working DNS on the recovery system, could those users have tested their recovery process at all?
jsmeix commented at 2019-03-11 10:37:¶
@OliverO2 regarding
https://github.com/rear/rear/pull/2018#issuecomment-470968951
I assume you have USE_DHCLIENT="yes" in your etc/rear/local.conf, but you do not also have USE_STATIC_NETWORKING="yes".
With
https://github.com/rear/rear/pull/2076
build/GNU/Linux/630_verify_resolv_conf_file.sh no longer errors out when etc/resolv.conf has no nameserver or only loopback addresses and USE_DHCLIENT has a true value (and USE_STATIC_NETWORKING does not have a true value), because then etc/resolv.conf in the recovery system is generated anew by /bin/dhclient-script, so that its previous content does not matter.
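In shell terms, a simplified sketch of that condition (assuming ReaR's is_true helper; the actual script may differ in detail):

```bash
# Skip the resolv.conf verification when DHCP will rewrite it anyway
# (this runs in a sourced ReaR script, hence the plain 'return'):
if is_true "$USE_DHCLIENT" && ! is_true "$USE_STATIC_NETWORKING" ; then
    # etc/resolv.conf gets generated anew by /bin/dhclient-script in
    # the recovery system, so its current content does not matter:
    return
fi
```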
jsmeix commented at 2019-03-11 10:59:¶
@OliverO2 regarding your
https://github.com/rear/rear/pull/2018#issuecomment-471169618
In my reasoning it does not really matter whether or not users have tested their recovery process at all.
For the good users who test their recovery process, it is easier to see the root cause where things start to go wrong when ReaR errors out early with a clear error message, so it helps them get a really working recovery process faster.
I think a good recent example is
https://github.com/rear/rear/issues/2006
where things failed late at "rear recover" and we needed a long time to find out what goes wrong (sometimes), which was the final reason for me to do
https://github.com/rear/rear/pull/2060
Also there were those weird "More than 128 partitions is not supported" issues where the root cause was some brokenness related to disklayout.conf, where "rear recover" blindly proceeded until it got lost in get_partition_number() with its error message that is totally useless at that (much too late) point.
For the bad users who do not test their recovery process, it is easier for us when ReaR errors out early for them, because this avoids weird issues from those users when "rear recover" fails, which are much harder to deal with than issues when ReaR errors out early during "rear mkbackup/mkrescue" with a clear error message (just RTEM: "Read The Error Message" ;-)
OliverO2 commented at 2019-03-11 11:18:¶
@jsmeix https://github.com/rear/rear/pull/2018#issuecomment-471487864
I assume you have in your etc/rear/local.conf USE_DHCLIENT="yes"
but you do not also have USE_STATIC_NETWORKING="yes"
Actually, before this issue came up, I had no network-related settings in the local configuration at all (just backup- and output-related stuff).
I have only had a cursory look at the issue, but to me it looks like this:
- Ubuntu 18.04 uses systemd-resolved to manage DNS resolution, supporting dynamic networking changes induced, for example, by WiFi roaming.
- For backward compatibility, systemd-resolved(8) listens on 127.0.0.53 and directs old clients there via static content in /etc/resolv.conf (so it is neither broken nor useless).
- It seems like ReaR does not copy the complete networking software required in this case (which would include a correctly configured systemd-resolved). Nonetheless, in this case the old scripts seem to kick in, updating /etc/resolv.conf as shown above.
- For ReaR, it doesn't really seem to matter whether DNS is configured via netplan or something else, as netplan is just a front-end and will update the necessary backend configuration files (in this case, systemd.network(5)).
So this is just my idea of how things interact currently.
jsmeix commented at 2019-03-11 12:49:¶
@OliverO2
your idea of how things interact currently is right, and that is basically described at USE_RESOLV_CONF in default.conf.
I never said /etc/resolv.conf is broken or useless on the original system.
I said when etc/resolv.conf in the recovery system has no nameserver entry, or when it contains only loopback addresses as nameservers, then such an etc/resolv.conf is useless in the recovery system, because we do not have support for systemd-resolved.
The current test in build/GNU/Linux/630_verify_resolv_conf_file.sh matches this state, as far as currently known to me.
When things worked for you with a /etc/resolv.conf on the original system that is useless in the recovery system, even without USE_DHCLIENT="yes", it seems (from my current point of view) things had worked for you by luck.
If you can show me that I am wrong and how things had worked for you by intention via this or that ReaR script, I will further enhance my test in build/GNU/Linux/630_verify_resolv_conf_file.sh to better match the current state in ReaR (i.e. no longer error out when things are made working by intention).
Above in
https://github.com/rear/rear/pull/2018#issuecomment-470972084
I had written that further enhancements regarding systemd-resolved would need to be provided to ReaR by people who use it, via separate pull requests.
It seems @schlomo also uses systemd-resolved, and according to his
https://github.com/rear/rear/pull/2018#issuecomment-470963117
he would like ReaR to support systemd-resolved, so I look forward to a pull request.
jsmeix commented at 2019-03-11 15:25:¶
With
https://github.com/rear/rear/pull/2076
merged, build/GNU/Linux/630_verify_resolv_conf_file.sh no longer errors out when etc/resolv.conf has no nameserver or only loopback addresses and USE_DHCLIENT is true (and USE_STATIC_NETWORKING is not true), because then etc/resolv.conf in the recovery system is generated anew by /bin/dhclient-script, so its previous content does not matter.
This way it should, in particular, no longer falsely error out on systems that use systemd-resolved (like Ubuntu 18.04) and do their networking setup via DHCP (probably pretty standard on usual desktop systems), cf.
https://github.com/rear/rear/pull/2018#issuecomment-470906450
[Export of Github issue for rear/rear.]