#2029 Issue closed: No /verify/BORG/ script to verify a BORG backup can be restored

Labels: enhancement, fixed / solved / done

ChojinDSL opened issue at 2019-02-01 17:10:

Today I set up a new server and took that opportunity to test a recovery scenario using ReaR. This exposed a couple of issues, which I was thankfully able to work around.

But I was wondering: is there a way to test the recovery procedure without actually changing anything?
A sort of "dry-run" option, if you will.

While I was testing, I ran into several issues.

  1. I use borg as a backup utility. I installed borg via the distro's repo. It wasn't until I actually tried to perform a recovery that I realized that a bunch of dependency libraries were missing. I managed to fix that by uploading the standalone binary and copying it to the correct place, and afterwards I made sure the standalone binary was installed on my systems instead of the distro version.

  2. Another issue was missing SSH keys for the Recovery ISO to be able to connect via SSH to wherever the backups are hosted.
    This was an option I wasn't even aware of and only found by digging through the documentation.

Unfortunately, if you perform a recovery, connecting to the Backup Repository is not performed until AFTER the disks have been written to (partition layout, filesystem creation, etc.).

This whole situation creates a bit of a chicken-and-egg scenario. You won't be aware of the issues until you actually try to perform the whole procedure for real, risking a potentially borked system and having to pursue other recovery options.

Here's where a test run would be beneficial. Basically, run through all (or almost all) the steps without actually writing anything to the disks, so you can make sure that all the bits and pieces work together.
This is especially important considering how tricky (if not impossible) it would be to create a new recovery ISO with the issues fixed.

gozora commented at 2019-02-05 11:03:

Hello @ChojinDSL,

> Today I set up a new server and took that opportunity to test a recovery scenario using ReaR. This exposed a couple of issues, which I was thankfully able to work around.

Yes, as you might have noticed, ReaR is designed not to leave you hanging in the middle of a restore; it gives you breathing space and possibilities to finish your operation (most of the time ;-))

> 1. I use borg as a backup utility. I installed borg via the distro's repo. It wasn't until I actually tried to perform a recovery that I realized that a bunch of dependency libraries were missing. I managed to fix that by uploading the standalone binary and copying it to the correct place, and afterwards I made sure the standalone binary was installed on my systems instead of the distro version.

There is a notice about this in the documentation, right at the beginning, marked as Important ;-)

> 2. Another issue was missing SSH keys for the Recovery ISO to be able to connect via SSH to wherever the backups are hosted. This was an option I wasn't even aware of and only found by digging through the documentation.

Just FYI, this was implemented because of https://github.com/rear/rear/issues/1512. I'll not comment on this any further ...

> Unfortunately, if you perform a recovery, connecting to the Backup Repository is not performed until AFTER the disks have been written to (partition layout, filesystem creation, etc.).

I currently don't see any easy way to implement a global --dry-run in ReaR, because there are way too many backup solutions implemented. However, I can think about a pre-check for Borg restore that does a test connection to the Borg server before anything is written to disk and, on failure, prompts the user whether he wants to continue; it should not be such a big deal ...
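
A rough sketch of what such a pre-check could look like (hypothetical,
not the eventual implementation; BORG_REPO_URL stands in for however the
repository URL gets assembled from the Borg configuration variables, and
a real ReaR script would prompt via the UserInput framework rather than
a plain read):

# Hypothetical verify-stage sketch: test Borg repository access
# before any disk is touched. BORG_REPO_URL is a placeholder.
if ! borg list "$BORG_REPO_URL" >/dev/null 2>&1 ; then
    LogPrint "Failed to list Borg repository $BORG_REPO_URL"
    LogPrint "If you continue, ReaR will partition your disks but will most likely NOT be able to restore your data!"
    # A plain read keeps the sketch short; real scripts use UserInput.
    read -r -p "Continue anyway? [y/N] " reply
    test "$reply" = "y" || Error "Borg repository $BORG_REPO_URL is not accessible"
fi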

@jsmeix thanks for assigning! It somehow escaped my attention ;-).

V.

jsmeix commented at 2019-02-05 11:08:

Testing the recovery procedure without actually changing anything,
i.e. via something like rear --dry-run recover, is not supported,
and as far as I can imagine it cannot be supported in general.

General reason:
I think it is not possible to test something without actually doing it.
Really testing something requires actually doing it,
in contrast to only checking this or that precondition,
and even to check a precondition something must actually be done.
One may think that, to only test something, it could be done just temporarily,
but "do something temporarily" does not make sense, because
what exactly should it mean to "do something temporarily"?
Either it is actually done, or something else is done that is only a fake
to make it look as if it was actually done while in fact nothing was done,
and of course such a "nothing" cannot provide any real result.
Cf. the "Reasoning" part in the section about
Transaction Semantics For Each "One Setup" in
https://en.opensuse.org/Archive:YaST_Printer_redesign

@ChojinDSL
regardless of the general reason, in your particular case you wrote that

> if you perform a recovery, connecting to the Backup Repository
> is not performed until AFTER the disks have been written

This should not happen.

The scripts that are usually run during rear recover are
(excerpts; in my case I use BACKUP=NETFS):

# usr/sbin/rear -s recover
Relax-and-Recover 2.4 / Git
...
Source conf/...
...
Source init/...
...
Source setup/...
...
Source verify/default/020_cciss_scsi_engage.sh
Source verify/default/020_translate_url.sh
Source verify/default/030_translate_tape.sh
Source verify/default/040_validate_variables.sh
Source verify/NETFS/default/050_check_NETFS_requirements.sh
Source verify/default/050_create_mappings_dir.sh
Source verify/GNU/Linux/050_sane_recovery_check.sh
Source verify/NETFS/default/050_start_required_nfs_daemons.sh
Source verify/NETFS/default/060_mount_NETFS_path.sh
Source verify/NETFS/default/070_set_backup_archive.sh
Source verify/NETFS/default/090_set_readonly_options.sh
Source verify/GNU/Linux/230_storage_and_network_modules.sh
Source verify/GNU/Linux/260_recovery_storage_drivers.sh
Source verify/NETFS/default/550_check_backup_archive.sh
Source verify/NETFS/default/600_check_encryption_key.sh
Source verify/NETFS/default/980_umount_NETFS_dir.sh
Source layout/prepare/...
...
Source layout/recreate/...
...
Source restore/...
...
Source finalize/...
...
Source wrapup/...
...
Exiting rear recover ...

In the verify stage, in case of BACKUP=NETFS, in particular
usr/share/rear/verify/NETFS/default/550_check_backup_archive.sh
is run, which checks whether the backup archive is actually there.
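
For illustration, the decisive part of that NETFS check amounts to
something like this (a simplified sketch, not the verbatim script;
'backuparchive' is the variable ReaR's NETFS scripts use for the
path of the backup archive on the mounted share):

# Simplified sketch of the archive check in 550_check_backup_archive.sh:
# abort early when the backup archive is missing or not readable.
test -r "$backuparchive" || Error "Backup archive '$backuparchive' is missing or not readable"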

I do not use BORG backup, but I can set BACKUP=BORG
and run usr/sbin/rear -s recover, and then it seems there is
no check script to verify that the backup can be restored:

# usr/sbin/rear -s recover
...
Source verify/default/020_cciss_scsi_engage.sh
Source verify/default/020_translate_url.sh
Source verify/default/030_translate_tape.sh
Source verify/default/040_validate_variables.sh
Source verify/default/050_create_mappings_dir.sh
Source verify/GNU/Linux/050_sane_recovery_check.sh
Source verify/GNU/Linux/230_storage_and_network_modules.sh
Source verify/GNU/Linux/260_recovery_storage_drivers.sh
...

I.e. it seems a /verify/BORG/ script is missing that would
verify that the BORG backup can actually be restored.

@gozora
could you have a look here?
I guess some excerpts of the scripts that are run during
BORG backup restore

Source restore/BORG/default/100_set_vars.sh
Source restore/BORG/default/250_mount_usb.sh
Source restore/BORG/default/300_load_archives.sh
Source restore/BORG/default/400_restore_backup.sh

should also be run already during the verify stage, to verify that
the BORG backup can be restored later during the restore stage.

@ChojinDSL
in general, once you have booted the ReaR recovery system
and logged in there as 'root', you can manually run any
commands you like to check whatever you need before you
call rear recover.

You can automate that by using PRE_RECOVERY_SCRIPT
(see its description in usr/share/rear/conf/default.conf),
which is run via usr/share/rear/setup/default/010_pre_recovery_script.sh,
to automatically run e.g. a Borg check command as a workaround
that you could use for now until this issue gets fixed in ReaR,
for instance along the lines of the sketch below.
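
A minimal sketch of such a workaround in /etc/rear/local.conf
(the repository URL below is a placeholder to adapt to your setup):

# /etc/rear/local.conf excerpt: check Borg repository access right when
# "rear recover" starts; ssh://borg@backup/mnt/rear/borg is a placeholder.
PRE_RECOVERY_SCRIPT="borg list ssh://borg@backup/mnt/rear/borg >/dev/null || echo 'WARNING: cannot list the Borg repository - restore will most likely fail'"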

@ChojinDSL
regarding what you wrote

> missing ssh keys for the Recovery ISO to be able to connect via SSH

see the documentation about SSH_FILES and SSH_ROOT_PASSWORD
and SSH_UNPROTECTED_PRIVATE_KEYS in
usr/share/rear/conf/default.conf, e.g. online at
https://raw.githubusercontent.com/rear/rear/master/usr/share/rear/conf/default.conf
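
For example (illustrative values only; see the comments in
usr/share/rear/conf/default.conf for the exact meaning and
accepted values of each variable):

# /etc/rear/local.conf excerpt: make SSH client files and keys
# available in the recovery system so it can reach the backup host.
SSH_FILES="yes"
SSH_UNPROTECTED_PRIVATE_KEYS="yes"
SSH_ROOT_PASSWORD="only_for_testing"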

For the reason why the recovery system ISO image should not contain
private or confidential data (at least not unless the user explicitly
configured something) see
https://github.com/rear/rear/pull/1472#issuecomment-328459748

gozora commented at 2019-02-05 11:17:

@jsmeix

> could you have a look here?

Yes, I'll work on this as time permits ...

V.

schlomo commented at 2019-02-05 11:46:

@ChojinDSL I think that we expect users to also use VMs for initial testing if they are not sure about their setup. You can also recover a physical system or another VM into a new VM; ReaR will do the necessary migration for you.

jsmeix commented at 2019-02-05 11:50:

@ChojinDSL
in general you may also have a look at
https://en.opensuse.org/SDB:Disaster_Recovery

gozora commented at 2019-02-10 19:18:

The code in https://github.com/rear/rear/pull/2037 does some basic Borg repository listing and behaves as follows:

RESCUE fedora:~ # rear recover
Relax-and-Recover 2.4 / Git
Running rear recover (PID 1751)
Using log file: /var/log/rear/rear-fedora.log
Running workflow recover within the ReaR rescue/recovery system
Failed to list Borg archive.
If you decide to continue, ReaR will partition your disks, but most probably will NOT be able to restore your data!
Do you want to continue?
1) View 'rear recover' log file (/var/log/rear/rear-fedora.log)
2) Use Relax-and-Recover shell and return back to here
3) Continue 'rear recover'
4) Abort 'rear recover'
(default '4' timeout 300 seconds)

and logs the reason why borg list failed:

2019-02-10 13:51:37.599616287 Including verify/BORG/default/400_check_archive_access.sh
2019-02-10 13:51:38.833303269 Failed to list Borg archive.
2019-02-10 13:51:38.834574449 If you decide to continue, ReaR will partition your disks, but most probably will NOT be able to restore your data!
2019-02-10 13:51:38.835747610 Command "borg list  ssh://borg@backup:22/mnt/rear/borg/fedora" returned: 
No key file for repository ssh://borg@backup:22/mnt/rear/borg/fedora found in /root/.config/borg/keys.

I've tested three scenarios (IMHO the most common):

  1. Borg server is down
  2. passphrase is incorrect
  3. keys are missing

V.


[Export of GitHub issue for rear/rear.]