#2611 Issue closed: ReaR configured with netfs backup can delete other/old folders on NFS at failure

Labels: cleanup, fixed / solved / done, critical / security / legal

pcahyna opened issue at 2021-05-05 19:50:

Relax-and-Recover (ReaR) Issue Template

Fill in the following items before submitting a new issue
(quick response is not guaranteed with free support):

  • ReaR version ("/usr/sbin/rear -V"): Relax-and-Recover 2.4 / Git; according to the code, also current master and older versions since 1.18

  • OS version ("cat /etc/os-release" or "lsb_release -a" or "cat /etc/rear/os.conf"): CentOS Linux 7 (Core)

  • ReaR configuration files ("cat /etc/rear/site.conf" and/or "cat /etc/rear/local.conf"):

BACKUP_URL=nfs://192.168.122.145/export/backup
BACKUP=NETFS
ISO_RECOVER_MODE=unattended
ISO_DEFAULT=automatic
NETFS_KEEP_OLD_BACKUP_COPY=yes
  • Hardware (PC or PowerNV BareMetal or ARM) or virtual machine (KVM guest or PowerVM LPAR): KVM guest

  • System architecture (x86 compatible or PPC64/PPC64LE or what exact ARM device): x86

  • Firmware (BIOS or UEFI or Open Firmware) and bootloader (GRUB or ELILO or Petitboot): BIOS/GRUB

  • Storage (local disk or SSD) and/or SAN (FC or iSCSI or FCoE) and/or multipath (DM or NVMe): local disk

  • Storage layout ("lsblk -ipo NAME,KNAME,PKNAME,TRAN,TYPE,FSTYPE,SIZE,MOUNTPOINT" or "lsblk" as makeshift): trivial

  • Description of the issue (ideally so that others can reproduce it):
    In case of NETFS_KEEP_OLD_BACKUP_COPY, ReaR can remove the old backup (*.old) if interrupted during the output stage. If for any reason unmounting the NFS directory mounted at outputfs fails (for example, there is a process using this filesystem), ReaR proceeds to rm -Rf -v /.../outputfs as part of the exit tasks. This removes the old backup directory.
    It will also destroy the backup directories of other machines if multiple machines are configured to use the same NFS directory. A similar problem was reported as #465 and, after being fixed, was reintroduced by a misguided "fix" (PR #782, 4e9c2a1b05f87762fb06355cf959b24eacc21f62) in ReaR 1.18.
    There are multiple copies of the exit task that cleans up the output directory in case of failure: those in the prep and backup stages look mostly safe (although it would still be safer to use rmdir instead of rm -Rf), and the only problematic one is in the output stage (usr/share/rear/output/default/100_mount_output_path.sh).

  • Workaround, if any:
    Probably configuring an OUTPUT_URL different from BACKUP_URL and not sharing the NFS directory among multiple machines would limit the damage (one would not lose backup data if using NETFS, only the output ISO).

  • Attachments, as applicable ("rear -D mkrescue/mkbackup/recover" debug log files):

+++ Print 'Running exit tasks'
+++ test 1
+++ echo -e 'Running exit tasks'
+++ local exit_task=
+++ for exit_task in '"${EXIT_TASKS[@]}"'
+++ Debug 'Exit task '\''umount -f -v '\''/tmp/rear.uUNE07AQyw2ey0S/outputfs'\'' >&2'\'''
+++ test 1
+++ Log 'Exit task '\''umount -f -v '\''/tmp/rear.uUNE07AQyw2ey0S/outputfs'\'' >&2'\'''
++++ date '+%Y-%m-%d %H:%M:%S.%N '
+++ local 'timestamp=2021-05-05 13:01:00.404599344 '
+++ test 1 -gt 0
+++ echo '2021-05-05 13:01:00.404599344 Exit task '\''umount -f -v '\''/tmp/rear.uUNE07AQyw2ey0S/outputfs'\'' >&2'\'''
2021-05-05 13:01:00.404599344 Exit task 'umount -f -v '/tmp/rear.uUNE07AQyw2ey0S/outputfs' >&2'
+++ eval 'umount -f -v '\''/tmp/rear.uUNE07AQyw2ey0S/outputfs'\'' >&2'
++++ umount -f -v /tmp/rear.uUNE07AQyw2ey0S/outputfs
umount.nfs4: /tmp/rear.uUNE07AQyw2ey0S/outputfs: device is busy
/tmp/rear.uUNE07AQyw2ey0S/outputfs: nfs4 mount point detected
/tmp/rear.uUNE07AQyw2ey0S/outputfs: umount failed
+++ for exit_task in '"${EXIT_TASKS[@]}"'
+++ Debug 'Exit task '\''rm -Rf -v /tmp/rear.uUNE07AQyw2ey0S/outputfs >&2'\'''
+++ test 1
+++ Log 'Exit task '\''rm -Rf -v /tmp/rear.uUNE07AQyw2ey0S/outputfs >&2'\'''
++++ date '+%Y-%m-%d %H:%M:%S.%N '
+++ local 'timestamp=2021-05-05 13:01:00.413623168 '
+++ test 1 -gt 0
+++ echo '2021-05-05 13:01:00.413623168 Exit task '\''rm -Rf -v /tmp/rear.uUNE07AQyw2ey0S/outputfs >&2'\'''
2021-05-05 13:01:00.413623168 Exit task 'rm -Rf -v /tmp/rear.uUNE07AQyw2ey0S/outputfs >&2'
+++ eval 'rm -Rf -v /tmp/rear.uUNE07AQyw2ey0S/outputfs >&2'
++++ rm -Rf -v /tmp/rear.uUNE07AQyw2ey0S/outputfs
removed '/tmp/rear.uUNE07AQyw2ey0S/outputfs/localhost.old/rear-localhost.iso'
removed '/tmp/rear.uUNE07AQyw2ey0S/outputfs/localhost.old/VERSION'
removed '/tmp/rear.uUNE07AQyw2ey0S/outputfs/localhost.old/README'
removed '/tmp/rear.uUNE07AQyw2ey0S/outputfs/localhost.old/rear-localhost.log'
removed '/tmp/rear.uUNE07AQyw2ey0S/outputfs/localhost.old/backup.tar.gz'
removed directory: '/tmp/rear.uUNE07AQyw2ey0S/outputfs/localhost.old'
removed '/tmp/rear.uUNE07AQyw2ey0S/outputfs/localhost/.lockfile'
removed directory: '/tmp/rear.uUNE07AQyw2ey0S/outputfs/localhost'
rm: cannot remove '/tmp/rear.uUNE07AQyw2ey0S/outputfs': Device or resource busy
+++ for exit_task in '"${EXIT_TASKS[@]}"'
+++ Debug 'Exit task '\''cleanup_build_area_and_end_program'\'''
+++ test 1
+++ Log 'Exit task '\''cleanup_build_area_and_end_program'\'''
++++ date '+%Y-%m-%d %H:%M:%S.%N '
+++ local 'timestamp=2021-05-05 13:01:00.439437963 '
+++ test 1 -gt 0
+++ echo '2021-05-05 13:01:00.439437963 Exit task '\''cleanup_build_area_and_end_program'\'''
2021-05-05 13:01:00.439437963 Exit task 'cleanup_build_area_and_end_program'
+++ eval cleanup_build_area_and_end_program
++++ cleanup_build_area_and_end_program
++++ Log 'Finished in 72 seconds'
+++++ date '+%Y-%m-%d %H:%M:%S.%N '
++++ local 'timestamp=2021-05-05 13:01:00.442303240 '
++++ test 1 -gt 0
++++ echo '2021-05-05 13:01:00.442303240 Finished in 72 seconds'
2021-05-05 13:01:00.442303240 Finished in 72 seconds
++++ test 1
++++ LogPrint 'You should also rm -Rf /tmp/rear.uUNE07AQyw2ey0S'
++++ Log 'You should also rm -Rf /tmp/rear.uUNE07AQyw2ey0S'

(The log message above also needs updating. If we are telling the user what to do, we should not instruct them to use a command that will destroy data. rm -Rf --one-file-system could help.)
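
To make both points concrete, a small sketch (with a placeholder build directory path, not the actual ReaR cleanup code): rmdir refuses to remove a busy mountpoint, and --one-file-system keeps rm from descending into a filesystem that is still mounted below the build area.

    # the NFS share is still mounted at the outputfs subdirectory of the build area
    rmdir /tmp/rear.XXXXXXXX/outputfs
    # rmdir refuses with "Device or resource busy" - nothing inside the share is touched
    rm -Rf --one-file-system -v /tmp/rear.XXXXXXXX
    # GNU rm removes the local build area files but skips the subtree that is on the
    # other (NFS) filesystem instead of recursing into it and deleting the backups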

jsmeix commented at 2021-05-06 07:48:

Sigh!
In general: We have rather many rm -Rf or rm -rf in the code.
I would be surprised if I don't find some more places where it is used carelessly...

pcahyna commented at 2021-05-06 08:00:

@jsmeix would it work to omit the exit tasks that unmount and remove the outputfs and leave only one - cleanup_build_area_and_end_program? You can see in the log above that Exit task 'umount -f -v '/tmp/rear.uUNE07AQyw2ey0S/outputfs' gets called, then the problematic Exit task 'rm -Rf -v /tmp/rear.uUNE07AQyw2ey0S/outputfs >&2' gets called, and finally Exit task 'cleanup_build_area_and_end_program' gets called. The last one would be enough to do the needed job. Of course, relying on it would make the code less local (usr/share/rear/output/default/100_mount_output_path.sh would need to rely on someone else cleaning up after it), but as it is now the code is duplicated, or rather multiplied.

pcahyna commented at 2021-05-06 08:11:

Or, if it is desirable to keep the code structured as it is (each stage of the code taking care of cleaning up after itself in the case of failure, with the cleanup possibly running multiple times), it would be helpful to factor out the cleanup into a function that could then be called multiple times.
(The same function could be called both in exit tasks and in regular cleanups at the end of stages that need it.)
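
A minimal sketch of what such a factored-out cleanup could look like (the function name cleanup_outputfs is made up here; AddExitTask, umount_mountpoint, BUILD_DIR and $v already exist in ReaR):

    # Unmount the output filesystem and remove only the (then empty) mountpoint directory,
    # never its contents. Safe to call more than once.
    function cleanup_outputfs () {
        mountpoint -q "$BUILD_DIR/outputfs" && umount_mountpoint "$BUILD_DIR/outputfs"
        test -d "$BUILD_DIR/outputfs" && rmdir $v "$BUILD_DIR/outputfs" >&2
        return 0
    }
    # registered once as an exit task, and also callable from the regular end-of-stage scripts:
    AddExitTask cleanup_outputfs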

jsmeix commented at 2021-05-06 08:16:

@pcahyna
I have no experience with the code parts that are involved here
so from my point of view feel free to do it as you think it is best, cf.
https://github.com/rear/rear/pull/782#issuecomment-806585351

Perhaps @gdha has some experience with the code parts that are involved here
so he could suggest what the best solution could be here?

pcahyna commented at 2021-05-12 10:15:

@jsmeix I think you can assign the issue to me, if nobody else has started working on it.

@gdha would it be acceptable to entirely remove the problematic ExitTask and keep just the global one cleanup_build_area_and_end_program ?

jsmeix commented at 2021-05-12 10:21:

@pcahyna
thank you in advance for working on yet another "messy" part of ReaR!

jsmeix commented at 2021-05-12 10:30:

Only FYI my general opinion:
I think if some rm -rf is missing, it is a normal bug or even just an enhancement request.
But when an rm -rf happens that should not have happened, it is a critical bug.
So in case of doubt, better too little rm -rf than too much.

pcahyna commented at 2021-05-12 10:41:

One thing I want to do is to add --one-file-system to all rm -rf instances that remove the build area, or those parts of it that remain after the intended cleanup.

jsmeix commented at 2021-05-12 11:17:

@pcahyna
the rm option --one-file-system is not available in older versions of rm,
so check how far back it is supported on older RHEL versions.

For example, it is not supported by rm in coreutils-5.93
that I have on my SLES10-SP4 test system.
ReaR no longer supports SLES10 - I use it here only as an example
because SLES10 is the oldest test system that I currently have.
I will check SLES11 soon...

When --one-file-system is specified rm (on SLES10) shows

rm: unrecognized option '--one-file-system'

and does not remove anything.

jsmeix commented at 2021-05-12 11:44:

rm in coreutils-8.12 that I have on SLES11 supports the --one-file-system option.

pcahyna commented at 2021-05-12 12:15:

I did an investigation and reported here: https://github.com/rear/rear/pull/782#issuecomment-806577614

pcahyna commented at 2021-05-12 12:16:

Thanks for checking SLES, I think it was the only supported distribution missing from my investigation.

jsmeix commented at 2021-05-12 12:33:

We may even stay backward compatible with the old unsupported distributions
by conditionally setting a rm_one_file_system variable, e.g. with code like

# rm --one-file-system --help &>/dev/null && rm_one_file_system='--one-file-system' || rm_one_file_system=''

# rm $rm_one_file_system ...

which is perhaps better than letting things "just fail" on old distributions?

pcahyna commented at 2021-05-12 12:42:

@jsmeix I would prefer to not complicate the code only for the benefit of distributions that are unsupported anyway according to https://github.com/rear/rear.github.com/blob/master/documentation/release-notes-2-6.md#supported-and-unsupported-operating-systems

jsmeix commented at 2021-05-12 12:55:

@pcahyna
when you do the actual work it is you who decides how to do it
and in https://github.com/rear/rear/pull/782#issuecomment-806585351
I wrote (excerpt)

feel free and be brave and merciless and try to clean up the mess

so feel free to mercilessly clean up things!

By the way:
Today is logical Friday for me
because tomorrow is public holiday in Germany
https://en.wikipedia.org/wiki/Feast_of_the_Ascension
and the day after tomorrow is vacation for me
so I wish you a relaxed and recovering weekend!

pcahyna commented at 2021-05-12 16:15:

@jsmeix may you ascend successfully!

jsmeix commented at 2021-05-17 11:53:

It didn't work for me - I am still on earth ;-)

pcahyna commented at 2021-05-28 09:22:

Hi @jsmeix , here is my branch to fix these problems together with many cleanups in related areas. Please have a look: https://github.com/pcahyna/rear/tree/fix-backup-removal-in-exit-task

It is not tested yet, so I am not yet opening a PR. Please take a quick look if you can to check whether you see something obviously wrong or changes that you don't like for style reasons. I will try to test it in the meantime (it touches many output methods so testing will take some time).

It also addresses the problem found in https://github.com/rear/rear/commit/20359a987662cc0c3fcfa52d62d1feac7cd55850#r51319634 .

jsmeix commented at 2021-05-28 11:27:

@pcahyna
while reading through your changes in
https://github.com/pcahyna/rear/tree/fix-backup-removal-in-exit-task
I noticed another issue that I like to mention here FYI:

Summary:
Code like

umount_url ... || Error "..."

can be problematic in scripts that are run during "rear recover"
after the backup was restored.

Reason:
We do not want to error out after the backup was restored
unless the error is so serious that it is impossible to continue.

But some failed umount in the ReaR recovery system
that will be rebooted anyway is likely not a really serious error.

For comparison what stages are sourced during "rear mkbackup" versus "rear recover":

# usr/sbin/rear -s mkbackup | grep -o 'Source [^/]*' | uniq
Source conf
Source init
Source prep
Source layout
Source rescue
Source build
Source pack
Source output
Source backup

# usr/sbin/rear -s recover | grep -o 'Source [^/]*' | uniq
Source conf
Source init
Source setup
Source verify
Source layout
Source restore
Source finalize
Source wrapup

What scripts (including symlinks to scripts) call umount_url
in the current GitHub master code:

# find -L usr/share/rear -type f | xargs grep -l 'umount_url' | grep -v global-functions.sh
usr/share/rear/output/default/980_umount_output_dir.sh
usr/share/rear/output/PXE/default/810_create_pxelinux_cfg.sh
usr/share/rear/output/PXE/default/800_copy_to_tftp.sh
usr/share/rear/restore/YUM/default/980_umount_YUM_dir.sh
usr/share/rear/restore/BLOCKCLONE/default/980_umount_NETFS_dir.sh
usr/share/rear/restore/NETFS/default/980_umount_NETFS_dir.sh
usr/share/rear/restore/DUPLICITY/default/980_unmount_duplicity_path.sh
usr/share/rear/restore/RBME/default/980_umount_NETFS_dir.sh
usr/share/rear/restore/BORG/default/900_umount_usb.sh
usr/share/rear/verify/YUM/default/980_umount_YUM_dir.sh
usr/share/rear/verify/BLOCKCLONE/default/980_umount_NETFS_dir.sh
usr/share/rear/verify/NETFS/default/980_umount_NETFS_dir.sh
usr/share/rear/verify/DUPLICITY/default/980_unmount_duplicity_path.sh
usr/share/rear/backup/BLOCKCLONE/default/980_umount_NETFS_dir.sh
usr/share/rear/backup/NETFS/default/980_umount_NETFS_dir.sh
usr/share/rear/backup/DUPLICITY/default/980_unmount_duplicity_path.sh
usr/share/rear/backup/BORG/default/900_umount_usb.sh
usr/share/rear/prep/BLOCKCLONE/default/980_umount_NETFS_dir.sh
usr/share/rear/prep/NETFS/default/980_umount_NETFS_dir.sh

So the scripts or symlinks to scripts that call umount_url
during "rear recover" after the backup was restored are

usr/share/rear/restore/YUM/default/980_umount_YUM_dir.sh
usr/share/rear/restore/BLOCKCLONE/default/980_umount_NETFS_dir.sh
usr/share/rear/restore/NETFS/default/980_umount_NETFS_dir.sh
usr/share/rear/restore/DUPLICITY/default/980_unmount_duplicity_path.sh
usr/share/rear/restore/RBME/default/980_umount_NETFS_dir.sh
usr/share/rear/restore/BORG/default/900_umount_usb.sh

Furthermore there are usr/share/rear/restore scripts (or symlinks)
that do other umount calls in the current GitHub master code:

usr/share/rear/restore/YUM/default/410_restore_backup.sh:
                umount "$BUILD_DIR/outputfs"

usr/share/rear/restore/BLOCKCLONE/default/400_restore_clone.sh:
        umount_mountpoint $mp

usr/share/rear/restore/NETFS/default/400_restore_backup.sh:
                umount "$BUILD_DIR/outputfs"

As far as I see, those plain umount "$BUILD_DIR/outputfs" calls are also present in your
https://github.com/pcahyna/rear/tree/fix-backup-removal-in-exit-task

pcahyna commented at 2021-05-28 11:36:

Code like

umount_url ... || Error "..."

can be problematic in scripts that are run during "rear recover"
after the backup was restored.

Sure, but I don't see any umount_url ... || Error "..." in my changes. There is only

perform_umount_url $url $mountpoint || Error "Unmounting '$mountpoint' failed."

in umount_url(), i.e. the check is performed inside umount_url(), not in the callers. The original code was

           Log "Unmounting with '$umount_cmd'"
           $umount_cmd
           StopIfError "Unmounting failed."

which is essentially equivalent. So you have a point, but my change does not seem to introduce anything substantially new; the problem existed before.

pcahyna commented at 2021-05-28 11:41:

@jsmeix if you think that those changes present a good opportunity to fix this problem that you have noticed, I can do that.

jsmeix commented at 2021-05-28 11:43:

In your https://github.com/pcahyna/rear/tree/fix-backup-removal-in-exit-task
in usr/share/rear/lib/global-functions.sh there is (excerpts):

        (var)
            ### The mount command is given by variable in the url host
            local var=$(url_host $url)
            mount_cmd="${!var} $mountpoint"
            ;;
...
        (var)
            local var
            var=$(url_host $url)
            Log "Unmounting with '${!var} $mountpoint'"
            # lazy unmount not supported with custom umount command
            ${!var} $mountpoint
            ;;

The usage of the scheme name var also for the variable name drives me nuts.
I just fail to understand it.
From reading the code var=$(url_host $url),
the var variable contains a URL host part, so it should be named host,
but the comment says the var variable actually contains a "mount command",
so it should be named mount_command - or my understanding is totally confused
and the var variable actually contains the name of another variable,
because it is then used as ${!var} - or does the var variable mean
the scheme name var?
Therefore I would like to ask you if you could clean up that naming mess
so that it is more clear what that code actually is about.
I also do not understand what the scheme name var actually means.

jsmeix commented at 2021-05-28 11:49:

@pcahyna
yes the umount_url ... || Error "..." issue existed before.
I only spotted it now and I reported it here primarily so that it is not forgotten.

I don't have a simple straightforward idea how to get that new issue solved,
i.e. how to distinguish whether we are "after the backup was restored",
so I think you should not try to solve that additional issue now.
Things have worked as is for years, so there is no pressure to enhance it now.

pcahyna commented at 2021-05-28 11:51:

From reading the code var=$(url_host $url),
the var variable contains a URL host part, so it should be named host,
but the comment says the var variable actually contains a "mount command",
so it should be named mount_command - or my understanding is totally confused
and the var variable actually contains the name of another variable,
because it is then used as ${!var} - or does the var variable mean
the scheme name var?

There is an indirection: the value of var is not a mount command, but the name of another variable which contains the mount command. It is used in the var:// scheme and, confusingly, it is named identically to that scheme.

pcahyna commented at 2021-05-28 11:53:

var:// is an internal scheme to implement the *_{U,}MOUNTCMD functionality. It needs to be changed a bit, because as it currently stands it will not handle errors very well. (mount_url() needs to know the _UMOUNTCMD in order to use it in the exit task, but it is not known at the point where mount_url() gets called.)
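
For illustration, a standalone sketch of that indirection (the URL parsing is simplified here and is not ReaR's url_host(); the values are made up):

    BACKUP_MOUNTCMD="mount -t nfs 192.168.122.145:/export/backup"
    url="var://BACKUP_MOUNTCMD"
    mountpoint="/tmp/rear.XXXXXXXX/outputfs"
    var="${url#var://}"              # var holds the *name* of another variable: BACKUP_MOUNTCMD
    mount_cmd="${!var} $mountpoint"  # indirect expansion: the value of BACKUP_MOUNTCMD plus the mountpoint
    echo "$mount_cmd"                # mount -t nfs 192.168.122.145:/export/backup /tmp/rear.XXXXXXXX/outputfs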

pcahyna commented at 2021-05-28 11:55:

@pcahyna
yes the umount_url ... || Error "..." issue existed before.
I only spotted it now and I reported it here primarily so that it is not forgotten.

I don't have a simple straightforward idea how to get that new issue solved,
i.e. how to distinguish whether we are "after the backup was restored",
so I think you should not try to solve that additional issue now.
Things have worked as is for years, so there is no pressure to enhance it now.

Wouldn't it be enough to make the Error conditional on "recover" != "$WORKFLOW" ?

jsmeix commented at 2021-05-28 12:02:

Backup restore also happens in "restoreonly" workflow that runs those stages:

# usr/sbin/rear -s restoreonly | grep -o 'Source [^/]*' | uniq
Source conf
Source init
Source setup
Source verify
Source restore
Source wrapup

And during "recover" workflow we should error our as usual
before the backup was restored i.e. during the usr/share/rear/verify stage.

E.g. one and the same 980_umount_NETFS_dir.sh exists as a script or symlinks,
where during "rear mkbackup" it is called as these

usr/share/rear/prep/NETFS/default/980_umount_NETFS_dir.sh
usr/share/rear/backup/NETFS/default/980_umount_NETFS_dir.sh

versus during "rear recover" it is called as those

usr/share/rear/verify/NETFS/default/980_umount_NETFS_dir.sh
usr/share/rear/restore/NETFS/default/980_umount_NETFS_dir.sh

jsmeix commented at 2021-05-28 12:15:

When var is an internal scheme to implement the *_{U,}MOUNTCMD functionality
then that scheme should be renamed accordingly to mountcmd.

The variable named var (a variable named var, how explanatory! - shudder! ;-)
should be renamed to something that explains what that thingy is about,
e.g. something like mount_command_variable_name or mnt_cmd_var_name
or mountcmd_var_name - the latter contains the matching scheme name
so grep -i mountcmd over the whole code "just finds" all related places,
cf. "Code must be easy to read" in
https://github.com/rear/rear/wiki/Coding-Style

pcahyna commented at 2021-05-28 12:21:

It is not named var because it is a variable, but because its value is a variable :-) which is in some (EDIT: perverse) sense explanatory.

pcahyna commented at 2021-05-28 12:23:

Backup restore also happens in "restoreonly" workflow that runs those stages:

What about making it conditional on the stage instead of the workflow - would this work?

pcahyna commented at 2021-05-28 12:27:

usr/share/rear/restore scripts (or symlinks)
that do other umount calls in the current GitHub master code:

usr/share/rear/restore/YUM/default/410_restore_backup.sh:
                umount "$BUILD_DIR/outputfs"

usr/share/rear/restore/BLOCKCLONE/default/400_restore_clone.sh:
        umount_mountpoint $mp

usr/share/rear/restore/NETFS/default/400_restore_backup.sh:
                umount "$BUILD_DIR/outputfs"

Those umounts are in the code that implements split archive support. This code unmounts the CD-ROM, lets the user insert another CD-ROM, and mounts it, possibly several times. That does not sound like a problem. The umount_url() at the end will see the last CD-ROM and unmount it, although it is not the same CD-ROM that mount_url() started with.

jsmeix commented at 2021-05-28 12:28:

In your https://github.com/pcahyna/rear/tree/fix-backup-removal-in-exit-task
in usr/share/rear/lib/global-functions.sh there is

function remove_temporary_mountpoint() {
    rmdir $v "$1" >&2
}

Usually you should no longer need to do something special with stdout and stderr,
see https://github.com/rear/rear/issues/2416

pcahyna commented at 2021-05-28 12:30:

In restore/BLOCKCLONE/default/400_restore_clone.sh the unmount is there for the target device, which is an independent case from mount/umount_url(), so this should not be a problem.

pcahyna commented at 2021-05-28 12:35:

In your https://github.com/pcahyna/rear/tree/fix-backup-removal-in-exit-task
in usr/share/rear/lib/global-functions.sh there is

function remove_temporary_mountpoint() {
    rmdir $v "$1" >&2
}

Usually you should no longer need to do something special with stdout and stderr,

This function replaces the numerous uses of rmdir and rm -Rf that had been there before. All of them use this redirection, so I preserved it. If it is not needed anymore, I can drop it in a separate commit. (This is the purpose of having it in a function, even if tiny - one can make such changes in one place instead of many.)

jsmeix commented at 2021-05-28 12:36:

In your https://github.com/pcahyna/rear/tree/fix-backup-removal-in-exit-task
in usr/share/rear/output/default/950_copy_result_files.sh there is

# XXX should we check for empty $OUTPUT_URL here? Most likely plenty of things got broken before reaching this point if it is empty

Usually we use TODO and FIXME in comments for such cases.

pcahyna commented at 2021-05-28 13:07:

Usually we use TODO and FIXME in comments for such cases.

Changed now

jsmeix commented at 2021-05-28 13:11:

So I had now a rather quick look over the changes in
https://github.com/pcahyna/rear/tree/fix-backup-removal-in-exit-task
and all looks really good to me from plain looking at the code.

In particular I do appreciate the various major cleanups therein.
Even if those cleanups might break some unexpected special cases
I prefer to have them and fix those unexpected special cases later as needed.

@pcahyna
thank you so much for all your work here!

pcahyna commented at 2021-05-28 13:13:

@jsmeix thanks a lot for the review. I have found during the cleanups that many unexpected special cases are already broken in one way or another. I will add a list of these findings to the PR when I open it.

jsmeix commented at 2021-05-28 13:19:

@pcahyna
I wish you a relaxed and recovering weekend!

pcahyna commented at 2021-05-31 10:05:

@jsmeix thanks, hope you had a relaxing and recovering weekend as well.

Concerning

Backup restore also happens in "restoreonly" workflow that runs those stages:

What about making it conditional on the stage instead of the workflow - would this work?

I realized that there is apparently no way to test what stage we are in, unlike for the workflow. Would it work to introduce a global variable STAGE, set it in SourceStage to the name of the current stage, and then test whether $STAGE != "restore" before erroring out in umount_url(), to address your initial concern that we should not abort the recovery just because the umount of the backup directory has failed?
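
A rough sketch of that idea (illustrative only; STAGE would be the new global, the other names are existing ReaR functions and variables):

    # in SourceStage(), assuming the stage name is its first argument:
    STAGE="$1"

    # in umount_url(), do not abort the whole recovery over a failed umount after the restore:
    perform_umount_url $url $mountpoint || {
        if test "restore" = "$STAGE" ; then
            LogPrint "Unmounting '$mountpoint' failed - continuing because the backup was already restored"
        else
            Error "Unmounting '$mountpoint' failed."
        fi
    }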

jsmeix commented at 2021-05-31 12:26:

I also have the "use a global variable" idea in my mind
but currently my mind is not yet finished with thinking about it.

My current idea is a more direct variable like BACKUP_RESTORED
that could have is_true/is_false values for 'success' versus 'failure'
(neither is_true nor is_false means an 'unknown' backup restore state),
and the Error function would check it and not error out
if is_true $BACKUP_RESTORED,
to get something which works in general in all scripts.
But for a generic solution a variable like BACKUP_RESTORED
is too specific, because the actual intent is to not error out,
so a more explicit variable name could be NO_ERROR_EXIT,
or positively named ERROR_PROCEED, or something like that.
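
A rough sketch of that variant (hypothetical, not actual ReaR code; is_true and Log are existing ReaR helpers, the rest of the Error function is omitted):

    # set somewhere in the restore stage once the backup restore has succeeded:
    BACKUP_RESTORED="yes"

    # at the top of the Error function:
    function Error () {
        if is_true "$BACKUP_RESTORED" ; then
            Log "Ignoring Error '$*' because the backup was already restored"
            return 0
        fi
        # ... the usual abort behaviour would follow here ...
    }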

pcahyna commented at 2021-06-04 18:17:

I found that the issue is actually more serious than what I wrote initially. I wrote:

if interrupted during the output stage. If for any reason unmounting the NFS directory mounted at outputfs fails (for example, there is a process using this filesystem), ReaR proceeds to rm -Rf -v /.../outputfs as part of the exit tasks.

but actually the "if interrupted during the output stage" part is not necessary. If anything blocks the umount, or it fails for any other reason, the umount_url() function checks the error from umount and aborts ReaR itself, thus invoking the problematic rm -Rf exit task and leading to certain data loss.

I also found that similar problems exist for the filesystem mounted at $BUILD_DIR/tftpbootfs in the case of OUTPUT=PXE: https://github.com/rear/rear/blob/2433c72f614af5b22547d6bd873ff4d224901a2c/usr/share/rear/output/PXE/default/800_copy_to_tftp.sh#L14 and https://github.com/rear/rear/blob/2433c72f614af5b22547d6bd873ff4d224901a2c/usr/share/rear/output/PXE/default/810_create_pxelinux_cfg.sh#L17. The directory could contain data for other machines that would thus be lost.

jsmeix commented at 2021-06-18 13:04:

With https://github.com/rear/rear/pull/2625 merged
this issue should be avoided.


[Export of Github issue for rear/rear.]