#2952 Issue closed: Confusing error about checksum mismatch after recovery

Labels: enhancement, fixed / solved / done

schlomo opened issue at 2023-03-05 02:55:

Probably I missed the context, but I find it consuing to get a checksum mismatch error during recovery:
image

While I understand the mechanics, I don't understand the purpose and I don't understand which problem ReaR tries to solve here. Also, I find this error not actionable.

I read through #2795 and I'm wondering if checking the CHECK_CONFIG_FILES array via MD5 is maybe too much... Or maybe we started to use that variable for too many purposes. For example, it containt not only the disk layout related configs but also ReaR configuration and even backup software configuration (for TSM, FDRUPSTREAM and soon also GALAXY11).

As a result, I'd expect the majority of recoveries to show that error even though there is no need to do anything because the disk layout related files (that we wanted to check for missing ID changes, no?) .

Maybe we need to split CHECK_CONFIG_FILES into two arrays, or limit the files that we check MD5 sums for? Also, maybe "checking if we should run mkrescue again" is a completely different problem from "is there a problem with the recovered system" so that we shouldn't use the same set of checksums for that?

Some more examples where the MD5 checksums are sure to fail:

  • mounting by hardware IDs, e.g. serial numbers, disk IDs or hardware-related paths
  • different network adapters
  • updates that happened between creating the backup and taking the mkrescue

All of that is because we use the MD5 checksums stored in the rescue image to check content of the restored backup, and they are in my experience much less tightly coupled.

The purpose of this issue is to talk a bit about how to narrow down the feature introduced by #2795 to reduce confusing error messages.

jsmeix commented at 2023-03-06 12:05:

The history behind:

It started with
https://github.com/rear/rear/issues/2785
that lead to
https://github.com/rear/rear/issues/2787
which is still open,
i.e. we do not have a solution for that.

jsmeix commented at 2023-03-07 09:22:

In general the reason for this error message is that
the user made a ReaR recovery system and a backup
that do not match.

As far as I can imagine such a mismatch can only happen
when the ReaR recovery system is made separated from the backup
i.e. not via "rear mkbackup" but via "rear mkrescue" and
"rear mkbackuponly" or by a third-party backup tool
plus when something was changed between "rear mkrescue"
and making the backup.

Here "do not match" means that for md5sums stored during
"rear mkrescue" in var/lib/rear/layout/config/files.md5sum
then during "rear recover" after the backup was restored
a md5sum does not match for restored files that are
listed in var/lib/rear/layout/config/files.md5sum
which means the files where the md5sum does not match
changed between "rear mkrescue" and making the backup.

The purpose of this error message is that
the user is informed about a possible problem.

ReaR does not try to solve any problem here
because ReaR cannot solve it during "rear recover"
when the user made a ReaR recovery system
and a backup that do not match.

Accordingly it is up to the user to evaluate
what such reported inconsistencies mean for him
and take some appropriate action to fix things
if needed or ignore reported inconsistencies.

jsmeix commented at 2023-03-07 09:27:

@pcahyna
I hope I described things sufficiently right in my above
https://github.com/rear/rear/issues/2952#issuecomment-1457828809
if not please correct me.

schlomo commented at 2023-03-07 09:50:

@jsmeix that is exactly my point: All users of commercial backup software will every only use rear mkrescue and the frequency of running mkrescue is typically much more seldom (e.g. after software and configuration rollouts) compared to the backup software taking a backup (nightly).

A mismatch is therefore to be expected and that is why I'd like to improve ReaR in a way that it won't show errors for expected situations.

jsmeix commented at 2023-03-07 10:46:

@schlomo
in your particular case

/etc/rear/local.conf
/etc/rear/os.conf

changed.

I don't know details when etc/rear/os.conf changes.
I think this could be a false positive in your case
but when really the OS or its version changed then
it likely indicates severe problems with a recovered system
when the ReaR recovery system was made on a different OS
than what is in the backup.

When etc/rear/local.conf doesn't match it always indicates
that there could be severe problems with a recovered system
because the user may have changed something critical
but only the user knows if this is really the case
(perhaps he only added a comment).

Do you always get this errors or other errors of that kind?

In particular do you also get them when you did not change
your etc/rear/local.conf between your "rear mkrescue" and
making your backup (or vice versa)?

jsmeix commented at 2023-03-07 11:03:

@schlomo
can you elaborate why you think

... the MD5 checksums are sure to fail:

*  mounting by hardware IDs, e.g. serial numbers,
   disk IDs or hardware-related paths

*  different network adapters

*  updates that happened between creating the backup
   and taking the mkrescue

I don't see why failing MD5 checksums
in those cases will always be false alarm.

See
https://github.com/rear/rear/issues/2785#issuecomment-1091782697
and
https://github.com/rear/rear/issues/2787#issue-1196066263

Summary (excerpts) from both:

change the UUID of the ... partition
...
The rescue is built everytime the layout changes
...
But the /etc/fstab file that will be used
during the recover ... is still the one in the backup
thus the old one with old UUID.
And the patch will fail ... (without a warning :D).
And the system will not boot.

The crucial part therein is

without a warning

https://github.com/rear/rear/pull/2795
implements such a "warning" but in compliance with
https://schlomo.schapiro.org/2015/04/warning-is-waste-of-my-time.html

jsmeix commented at 2023-03-07 12:33:

I think the main reason why this error message lines
look confusing is that context is missing what it means
in particular because it looks related to the
"Recreating directories" step before.

So instead of

Recreating directories ...
/etc/rear/local.conf: FAILED
/etc/rear/os.conf: FAILED
md5sum: WARNING: 2 computed checksums did NOT match
Error: Restored files do not match the recreated system in /mnt/local
Migrating disk-by-id mappings in certain restored files ...

a more comprehensible output could be like

Recreating directories ...
Checking if restored files may not match the recreated system
/etc/rear/local.conf: FAILED
/etc/rear/os.conf: FAILED
md5sum: WARNING: 2 computed checksums did NOT match
Error: Restored files do not match the recreated system in /mnt/local
Migrating disk-by-id mappings in certain restored files ...

jsmeix commented at 2023-03-07 12:48:

See my initial proposal to make it
at least look more comprehensible:
https://github.com/rear/rear/pull/2954
therein see
https://github.com/rear/rear/pull/2954/commits/50885c1963d0787f1e45f14ad983141e61d12654

jsmeix commented at 2023-03-07 13:27:

I made another proposal via
https://github.com/rear/rear/pull/2954/commits/2507822e9841bae97fd4408f9dfc0087fe495fca
for better user messages in finalize/default/060_compare_files.sh

With this it should look like

Recreating directories ...
Error: Restored files do not match the recreated system in /mnt/local
  /etc/rear/local.conf: FAILED
  /etc/rear/os.conf: FAILED
  md5sum: WARNING: 2 computed checksums did NOT match
Migrating disk-by-id mappings in certain restored files ...

This is not yet tested.

pcahyna commented at 2023-03-08 15:54:

Let me add more info about the purpose of the check (I believe I still remember it quite well). I think @jsmeix is right here but perhaps not clear enough.

@schlomo

Also, maybe "checking if we should run mkrescue again" is a completely different problem from "is there a problem with the recovered system" so that we shouldn't use the same set of checksums for that?

The idea behind this check is to check for problems that may arise because we should have run mkrescue again, but we have not. So it is not a completely different problem - it is the same problem detected at different times. That's why it uses the same set of checksums - it is deliberate.

Some more examples where the MD5 checksums are sure to fail:

  • mounting by hardware IDs, e.g. serial numbers, disk IDs or hardware-related paths
  • different network adapters
  • updates that happened between creating the backup and taking the mkrescue

They should not fail in this case, because the check happens before patching the files to change the hardware IDs to the new values (the restored backup should contain the original values and the checksums have also been calculated with the original values). Indeed, if the check fails, it indicates that this subsequent patching of hardware IDs in the files will fail (potentially serious problem! - as @jsmeix said) and this was originally the main reason for introducing the check, IIRC.
Not sure about network adapters, but I think the same reasoning applies (maybe the problem is less serious if detected, but should not normally occur either).
Concerning updates, I don't know, do they often lead to a checksum mismatch in the files concerned?

All of that is because we use the MD5 checksums stored in the rescue image to check content of the restored backup, and they are in my experience much less tightly coupled.

Unfortunately they are coupled, because the restored files (from backup) are to be patched using rules stored in the rescue image. That's the point.

All users of commercial backup software will every only use rear mkrescue and the frequency of running mkrescue is typically much more seldom (e.g. after software and configuration rollouts) compared to the backup software taking a backup (nightly).

A mismatch is therefore to be expected and that is why I'd like to improve ReaR in a way that it won't show errors for expected situations.

It may happen often in this situation, but it is not "expected" (in the sense of being harmless) IMO. It indicated that one should have executed "mkrescue" more often.

pcahyna commented at 2023-03-08 16:02:

For example, it containt not only the disk layout related configs but also ReaR configuration

I believe this is because ReaR configuration may refer to disks (for example, in the include/exclude rules), so it is also a "disk layout related config".

and even backup software configuration (for TSM, FDRUPSTREAM and soon also GALAXY11).

Not sure here, but isn't it a potential problem if the backup was restored using a different backup software configuration than the one it was made with?

jsmeix commented at 2023-03-09 12:24:

In particular regarding changed /etc/rear/* files:

I noticed my offhanded gut feeling reasoning at that time in
https://github.com/rear/rear/pull/2795#issuecomment-1113124577
which mentiones in particular the section
"Relax-and-Recover versus backup and restore" in
https://en.opensuse.org/SDB:Disaster_Recovery
(excerpt)

It is your task to ensure your backup is consistent.
First and foremost the files in your backup
must be consistent with the data that is stored
in your ReaR recovery system (in particular what
is stored to recreate the storage in files like
var/lib/rear/layout/disklayout.conf) because
ReaR will recreate the storage (disk partitions
with filesystems and mount points) with the data
that is stored in the ReaR recovery system and
then it restores the files from the backup into
the recreated storage which means in particular
restored config files must match the actually
recreated system (e.g. the contents of the
restored etc/fstab must match the actually
recreated disk layout).
Therefore after each change of the basic system
(in particular after a change of the disk layout)
"rear mkbackup" needs to be run to create
a new ReaR recovery system together with a
matching new backup of the files.

jsmeix commented at 2023-03-16 12:22:

With https://github.com/rear/rear/pull/2954 merged
this issue should be solved at least to some reasonable extent
which does not mean there cannot be further improvements but
those should then happen as separated issues and pull requests.


[Export of Github issue for rear/rear.]