#2611 Issue closed: ReaR configured with netfs backup can delete other/old folders on NFS at failure¶
Labels: cleanup, fixed / solved / done, critical / security / legal
pcahyna opened issue at 2021-05-05 19:50:¶
Relax-and-Recover (ReaR) Issue Template¶
Fill in the following items before submitting a new issue
(quick response is not guaranteed with free support):
- ReaR version ("/usr/sbin/rear -V"): Relax-and-Recover 2.4 / Git; according to the code, also current master and older versions since 1.18
- OS version ("cat /etc/os-release" or "lsb_release -a" or "cat /etc/rear/os.conf"): CentOS Linux 7 (Core)
- ReaR configuration files ("cat /etc/rear/site.conf" and/or "cat /etc/rear/local.conf"):
BACKUP_URL=nfs://192.168.122.145/export/backup
BACKUP=NETFS
ISO_RECOVER_MODE=unattended
ISO_DEFAULT=automatic
NETFS_KEEP_OLD_BACKUP_COPY=yes
- Hardware (PC or PowerNV BareMetal or ARM) or virtual machine (KVM guest or PowerVM LPAR): KVM guest
- System architecture (x86 compatible or PPC64/PPC64LE or what exact ARM device): x86
- Firmware (BIOS or UEFI or Open Firmware) and bootloader (GRUB or ELILO or Petitboot): BIOS/GRUB
- Storage (local disk or SSD) and/or SAN (FC or iSCSI or FCoE) and/or multipath (DM or NVMe): local disk
- Storage layout ("lsblk -ipo NAME,KNAME,PKNAME,TRAN,TYPE,FSTYPE,SIZE,MOUNTPOINT" or "lsblk" as makeshift): trivial
- Description of the issue (ideally so that others can reproduce it):
In case of NETFS_KEEP_OLD_BACKUP_COPY, ReaR can remove the old backup (*.old) if interrupted during the output stage. If for any reason unmounting the NFS directory mounted at outputfs fails (for example, there is a process using this filesystem), ReaR proceeds to rm -Rf -v /.../outputfs as part of the exit tasks. This removes the old backup directory.
It will also destroy the backup directories for other machines, if multiple machines are configured to use the same NFS directory. A similar problem was reported as #465 and after being fixed it was reintroduced in a misguided "fix" (PR #782, 4e9c2a1b05f87762fb06355cf959b24eacc21f62) in ReaR 1.18.
There are multiple copies of the exit task that cleans up the output directory in case of failure: those in the prep and backup stages look mostly safe (it would still be safer to use rmdir instead of rm -Rf, though), and the only problematic one is in the output stage (usr/share/rear/output/default/100_mount_output_path.sh); a sketch of a safer cleanup pattern follows the attached debug log below.
- Workaround, if any:
Probably configuring an OUTPUT_URL different from BACKUP_URL and not sharing the NFS directory among multiple machines would limit the damage (one would not lose backup data if using NETFS, only the output ISO).
- Attachments, as applicable ("rear -D mkrescue/mkbackup/recover" debug log files):
+++ Print 'Running exit tasks'
+++ test 1
+++ echo -e 'Running exit tasks'
+++ local exit_task=
+++ for exit_task in '"${EXIT_TASKS[@]}"'
+++ Debug 'Exit task '\''umount -f -v '\''/tmp/rear.uUNE07AQyw2ey0S/outputfs'\'' >&2'\'''
+++ test 1
+++ Log 'Exit task '\''umount -f -v '\''/tmp/rear.uUNE07AQyw2ey0S/outputfs'\'' >&2'\'''
++++ date '+%Y-%m-%d %H:%M:%S.%N '
+++ local 'timestamp=2021-05-05 13:01:00.404599344 '
+++ test 1 -gt 0
+++ echo '2021-05-05 13:01:00.404599344 Exit task '\''umount -f -v '\''/tmp/rear.uUNE07AQyw2ey0S/outputfs'\'' >&2'\'''
2021-05-05 13:01:00.404599344 Exit task 'umount -f -v '/tmp/rear.uUNE07AQyw2ey0S/outputfs' >&2'
+++ eval 'umount -f -v '\''/tmp/rear.uUNE07AQyw2ey0S/outputfs'\'' >&2'
++++ umount -f -v /tmp/rear.uUNE07AQyw2ey0S/outputfs
umount.nfs4: /tmp/rear.uUNE07AQyw2ey0S/outputfs: device is busy
/tmp/rear.uUNE07AQyw2ey0S/outputfs: nfs4 mount point detected
/tmp/rear.uUNE07AQyw2ey0S/outputfs: umount failed
+++ for exit_task in '"${EXIT_TASKS[@]}"'
+++ Debug 'Exit task '\''rm -Rf -v /tmp/rear.uUNE07AQyw2ey0S/outputfs >&2'\'''
+++ test 1
+++ Log 'Exit task '\''rm -Rf -v /tmp/rear.uUNE07AQyw2ey0S/outputfs >&2'\'''
++++ date '+%Y-%m-%d %H:%M:%S.%N '
+++ local 'timestamp=2021-05-05 13:01:00.413623168 '
+++ test 1 -gt 0
+++ echo '2021-05-05 13:01:00.413623168 Exit task '\''rm -Rf -v /tmp/rear.uUNE07AQyw2ey0S/outputfs >&2'\'''
2021-05-05 13:01:00.413623168 Exit task 'rm -Rf -v /tmp/rear.uUNE07AQyw2ey0S/outputfs >&2'
+++ eval 'rm -Rf -v /tmp/rear.uUNE07AQyw2ey0S/outputfs >&2'
++++ rm -Rf -v /tmp/rear.uUNE07AQyw2ey0S/outputfs
removed '/tmp/rear.uUNE07AQyw2ey0S/outputfs/localhost.old/rear-localhost.iso'
removed '/tmp/rear.uUNE07AQyw2ey0S/outputfs/localhost.old/VERSION'
removed '/tmp/rear.uUNE07AQyw2ey0S/outputfs/localhost.old/README'
removed '/tmp/rear.uUNE07AQyw2ey0S/outputfs/localhost.old/rear-localhost.log'
removed '/tmp/rear.uUNE07AQyw2ey0S/outputfs/localhost.old/backup.tar.gz'
removed directory: '/tmp/rear.uUNE07AQyw2ey0S/outputfs/localhost.old'
removed '/tmp/rear.uUNE07AQyw2ey0S/outputfs/localhost/.lockfile'
removed directory: '/tmp/rear.uUNE07AQyw2ey0S/outputfs/localhost'
rm: cannot remove '/tmp/rear.uUNE07AQyw2ey0S/outputfs': Device or resource busy
+++ for exit_task in '"${EXIT_TASKS[@]}"'
+++ Debug 'Exit task '\''cleanup_build_area_and_end_program'\'''
+++ test 1
+++ Log 'Exit task '\''cleanup_build_area_and_end_program'\'''
++++ date '+%Y-%m-%d %H:%M:%S.%N '
+++ local 'timestamp=2021-05-05 13:01:00.439437963 '
+++ test 1 -gt 0
+++ echo '2021-05-05 13:01:00.439437963 Exit task '\''cleanup_build_area_and_end_program'\'''
2021-05-05 13:01:00.439437963 Exit task 'cleanup_build_area_and_end_program'
+++ eval cleanup_build_area_and_end_program
++++ cleanup_build_area_and_end_program
++++ Log 'Finished in 72 seconds'
+++++ date '+%Y-%m-%d %H:%M:%S.%N '
++++ local 'timestamp=2021-05-05 13:01:00.442303240 '
++++ test 1 -gt 0
++++ echo '2021-05-05 13:01:00.442303240 Finished in 72 seconds'
2021-05-05 13:01:00.442303240 Finished in 72 seconds
++++ test 1
++++ LogPrint 'You should also rm -Rf /tmp/rear.uUNE07AQyw2ey0S'
++++ Log 'You should also rm -Rf /tmp/rear.uUNE07AQyw2ey0S'
(The log message above also needs updating. If we are telling the user what to do, we should not instruct them to use a command that will destroy data. rm -Rf --one-file-system could help.)
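For illustration only, a minimal sketch (not the actual ReaR code) of a cleanup pattern that cannot recurse into a still-mounted NFS export: remove only the mountpoint directory itself instead of its contents.
# If the umount failed, the directory is still a mountpoint, so rmdir refuses to remove it
# (device or resource busy); if the umount succeeded, only the now-empty mountpoint
# directory is removed. Either way nothing stored on the NFS server can be deleted.
umount -f -v /tmp/rear.XXXXXXXXXX/outputfs
rmdir -v /tmp/rear.XXXXXXXXXX/outputfs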
jsmeix commented at 2021-05-06 07:48:¶
Sigh!
In general: we have rather many rm -Rf or rm -rf calls in the code.
I would be surprised if I don't find some more places where it is used carelessly...
pcahyna commented at 2021-05-06 08:00:¶
@jsmeix would it work to omit those exit tasks that unmount and remove the outputfs and leave only one, cleanup_build_area_and_end_program?
You can see in the log above that Exit task 'umount -f -v '/tmp/rear.uUNE07AQyw2ey0S/outputfs' gets called, then the problematic Exit task 'rm -Rf -v /tmp/rear.uUNE07AQyw2ey0S/outputfs >&2' gets called, and finally Exit task 'cleanup_build_area_and_end_program' gets called. The last one would be enough to do the needed job.
Of course, relying on it would make the code less local (usr/share/rear/output/default/100_mount_output_path.sh would need to rely on someone else cleaning up after it), but now the code is duplicated, or rather multiplicated.
pcahyna commented at 2021-05-06 08:11:¶
Or, if it is desirable to keep the code structured as it is (each stage
of the code taking care of cleaning up after itself in the case of
failure, with the cleanup possibly running multiple times), it would be
helpful to factor out the cleanup into a function that could then be
called multiple times.
(The same function could be called both in exit tasks and in regular
cleanups at the end of stages that need it.)
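A rough sketch of that idea with hypothetical names (none of this is existing ReaR code; AddExitTask and RemoveExitTask are existing framework functions):
# One shared helper that unmounts the output filesystem and removes only the
# (then empty) mountpoint directory, so it can never delete data on the NFS share:
function umount_and_remove_outputfs () {
    local mountpoint="$1"
    umount -f -v "$mountpoint" || return 1
    rmdir -v "$mountpoint"
}
# The script that mounts the output filesystem would register it once as an exit task:
AddExitTask "umount_and_remove_outputfs $BUILD_DIR/outputfs"
# The script that normally unmounts at the end of a stage would call the same helper
# and then drop the exit task again:
umount_and_remove_outputfs "$BUILD_DIR/outputfs" && RemoveExitTask "umount_and_remove_outputfs $BUILD_DIR/outputfs"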
jsmeix commented at 2021-05-06 08:16:¶
@pcahyna
I have no experience with the code parts that are involved here
so from my point of view feel free to do it as you think it is best,
cf.
https://github.com/rear/rear/pull/782#issuecomment-806585351
Perhaps @gdha has some experience with the code parts that are involved
here
so he could suggest what the best solution could be here?
pcahyna commented at 2021-05-12 10:15:¶
@jsmeix I think you can assign the issue to me, if nobody else has started working on it.
@gdha would it be acceptable to entirely remove the problematic ExitTask and keep just the global one, cleanup_build_area_and_end_program?
jsmeix commented at 2021-05-12 10:21:¶
@pcahyna
thank you in advance for working on yet another "messy" part of ReaR!
jsmeix commented at 2021-05-12 10:30:¶
Only FYI my general opinion:
I think if some rm -rf is missing, it is a normal bug or even just an enhancement.
But when an rm -rf happens that should not have been done, it is a critical bug.
So in case of doubt, better less rm -rf than too much.
pcahyna commented at 2021-05-12 10:41:¶
One thing I want to do is to add --one-file-system to all rm -rf instances that remove the build area or parts of it (those that will remain after the intended cleanup).
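For illustration, a minimal example of what that option changes (the path is a placeholder): where rm supports --one-file-system, it refuses to descend into a mountpoint of a different filesystem, so a still-mounted NFS export under the build area is skipped instead of being emptied.
# Without the option, rm would recurse into the still-mounted NFS share:
#   rm -Rf /tmp/rear.XXXXXXXXXX
# With the option, the mounted subtree is skipped (rm reports that it skips
# .../outputfs because it is on a different device) and only local files are removed:
rm -Rf --one-file-system /tmp/rear.XXXXXXXXXX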
jsmeix commented at 2021-05-12 11:17:¶
@pcahyna
the rm option --one-file-system is not available in older versions of rm,
so check how far back it is supported on older RHEL versions.
For example, it is not supported by rm in coreutils-5.93 that I have on my SLES10-SP4 test system.
ReaR no longer supports SLES10 - I use it here only as an example
because SLES10 is the oldest test system that I currently have.
I will check SLES11 soon...
When --one-file-system is specified, rm (on SLES10) shows
rm: unrecognized option '--one-file-system'
and does not remove anything.
jsmeix commented at 2021-05-12 11:44:¶
rm in coreutils-8.12 that I have on SLES11 supports the --one-file-system option.
pcahyna commented at 2021-05-12 12:15:¶
I did an investigation and reported here: https://github.com/rear/rear/pull/782#issuecomment-806577614
pcahyna commented at 2021-05-12 12:16:¶
Thanks for checking SLES, I think it was the only supported distribution missing from my investigation.
jsmeix commented at 2021-05-12 12:33:¶
We may even stay backward compatible with the old unsupported distributions
by conditionally setting a rm_one_file_system variable, e.g. with code like
# rm --one-file-system --help &>/dev/null && rm_one_file_system='--one-file-system' || rm_one_file_system=''
# rm $rm_one_file_system ...
which is perhaps better than letting things "just fail" on old distributions?
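For clarity, the probe works because an old rm exits non-zero on the unrecognized option, while a newer rm accepts it and exits zero from --help; a caller would then just splice in the (possibly empty) variable, roughly like this sketch (not existing ReaR code):
rm --one-file-system --help &>/dev/null && rm_one_file_system='--one-file-system' || rm_one_file_system=''
# On newer coreutils this expands to 'rm -Rf --one-file-system ...',
# on old coreutils the variable stays empty and it degrades to plain 'rm -Rf ...':
rm -Rf $rm_one_file_system "$BUILD_DIR/outputfs"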
pcahyna commented at 2021-05-12 12:42:¶
@jsmeix I would prefer to not complicate the code only for the benefit of distributions that are unsupported anyway according to https://github.com/rear/rear.github.com/blob/master/documentation/release-notes-2-6.md#supported-and-unsupported-operating-systems
jsmeix commented at 2021-05-12 12:55:¶
@pcahyna
when you do the actual work it is you who decides how to do it
and in
https://github.com/rear/rear/pull/782#issuecomment-806585351
I wrote (excerpt)
feel free and be brave and merciless and try to clean up the mess
so feel free to mercilessly clean up things!
By the way:
today is logical Friday for me because tomorrow is a public holiday in Germany
https://en.wikipedia.org/wiki/Feast_of_the_Ascension
and the day after tomorrow is vacation for me,
so I wish you a relaxed and recovering weekend!
pcahyna commented at 2021-05-12 16:15:¶
@jsmeix may you ascend successfully!
jsmeix commented at 2021-05-17 11:53:¶
It didn't work for me - I am still on earth ;-)
pcahyna commented at 2021-05-28 09:22:¶
Hi @jsmeix , here is my branch to fix these problems together with many cleanups in related areas. Please have a look: https://github.com/pcahyna/rear/tree/fix-backup-removal-in-exit-task
It is not tested yet, so I am not yet opening a PR. Please take a quick look if you can to check whether you see something obviously wrong or changes that you don't like for style reasons. I will try to test it in the meantime (it touches many output methods so testing will take some time).
It also addresses the problem found in https://github.com/rear/rear/commit/20359a987662cc0c3fcfa52d62d1feac7cd55850#r51319634 .
jsmeix commented at 2021-05-28 11:27:¶
@pcahyna
while reading through your changes in
https://github.com/pcahyna/rear/tree/fix-backup-removal-in-exit-task
I noticed another issue that I would like to mention here FYI:
Summary:
Code like
umount_url ... || Error "..."
can be problematic in scripts that are run during "rear recover"
after the backup was restored.
Reason:
We do not want to error out after the backup was restored
unless the error is so serious that it is impossible to continue.
But some failed umount in the ReaR recovery system,
which will be rebooted anyway, is likely not a really serious error.
For comparison what stages are sourced during "rear mkbackup" versus "rear recover":
# usr/sbin/rear -s mkbackup | grep -o 'Source [^/]*' | uniq
Source conf
Source init
Source prep
Source layout
Source rescue
Source build
Source pack
Source output
Source backup
# usr/sbin/rear -s recover | grep -o 'Source [^/]*' | uniq
Source conf
Source init
Source setup
Source verify
Source layout
Source restore
Source finalize
Source wrapup
What scripts (including symlinks to scripts) call umount_url
in the current GitHub master code:
# find -L usr/share/rear -type f | xargs grep -l 'umount_url' | grep -v global-functions.sh
usr/share/rear/output/default/980_umount_output_dir.sh
usr/share/rear/output/PXE/default/810_create_pxelinux_cfg.sh
usr/share/rear/output/PXE/default/800_copy_to_tftp.sh
usr/share/rear/restore/YUM/default/980_umount_YUM_dir.sh
usr/share/rear/restore/BLOCKCLONE/default/980_umount_NETFS_dir.sh
usr/share/rear/restore/NETFS/default/980_umount_NETFS_dir.sh
usr/share/rear/restore/DUPLICITY/default/980_unmount_duplicity_path.sh
usr/share/rear/restore/RBME/default/980_umount_NETFS_dir.sh
usr/share/rear/restore/BORG/default/900_umount_usb.sh
usr/share/rear/verify/YUM/default/980_umount_YUM_dir.sh
usr/share/rear/verify/BLOCKCLONE/default/980_umount_NETFS_dir.sh
usr/share/rear/verify/NETFS/default/980_umount_NETFS_dir.sh
usr/share/rear/verify/DUPLICITY/default/980_unmount_duplicity_path.sh
usr/share/rear/backup/BLOCKCLONE/default/980_umount_NETFS_dir.sh
usr/share/rear/backup/NETFS/default/980_umount_NETFS_dir.sh
usr/share/rear/backup/DUPLICITY/default/980_unmount_duplicity_path.sh
usr/share/rear/backup/BORG/default/900_umount_usb.sh
usr/share/rear/prep/BLOCKCLONE/default/980_umount_NETFS_dir.sh
usr/share/rear/prep/NETFS/default/980_umount_NETFS_dir.sh
So the scripts or symlinks to scripts that call umount_url
during "rear recover" after the backup was restored are
usr/share/rear/restore/YUM/default/980_umount_YUM_dir.sh
usr/share/rear/restore/BLOCKCLONE/default/980_umount_NETFS_dir.sh
usr/share/rear/restore/NETFS/default/980_umount_NETFS_dir.sh
usr/share/rear/restore/DUPLICITY/default/980_unmount_duplicity_path.sh
usr/share/rear/restore/RBME/default/980_umount_NETFS_dir.sh
usr/share/rear/restore/BORG/default/900_umount_usb.sh
Furthermore there are usr/share/rear/restore scripts (or symlinks)
that do other umount
calls in the current GitHub master code:
usr/share/rear/restore/YUM/default/410_restore_backup.sh:
umount "$BUILD_DIR/outputfs"
usr/share/rear/restore/BLOCKCLONE/default/400_restore_clone.sh:
umount_mountpoint $mp
usr/share/rear/restore/NETFS/default/400_restore_backup.sh:
umount "$BUILD_DIR/outputfs"
As far as I see those plain umount "$BUILD_DIR/outputfs"
are also in
your
https://github.com/pcahyna/rear/tree/fix-backup-removal-in-exit-task
pcahyna commented at 2021-05-28 11:36:¶
Code like
umount_url ... || Error "..."
can be problematic in scripts that are run during "rear recover"
after the backup was restored.
Sure, but I don't see any umount_url ... || Error "..." in my changes.
There is only
perform_umount_url $url $mountpoint || Error "Unmounting '$mountpoint' failed."
in umount_url(), i.e. the check is performed inside umount_url(), not in the callers.
The original code was
Log "Unmounting with '$umount_cmd'"
$umount_cmd
StopIfError "Unmounting failed."
which is essentially equivalent. So you have a point, but my change does not seem to introduce any substantial change - the problem existed before.
pcahyna commented at 2021-05-28 11:41:¶
@jsmeix if you think that those changes present a good opportunity to fix this problem that you have noticed, I can do that.
jsmeix commented at 2021-05-28 11:43:¶
In your
https://github.com/pcahyna/rear/tree/fix-backup-removal-in-exit-task
in usr/share/rear/lib/global-functions.sh there is (excerpts):
(var)
### The mount command is given by variable in the url host
local var=$(url_host $url)
mount_cmd="${!var} $mountpoint"
;;
...
(var)
local var
var=$(url_host $url)
Log "Unmounting with '${!var} $mountpoint'"
# lazy unmount not supported with custom umount command
${!var} $mountpoint
;;
The usage of the scheme name var also for the variable name drives me nuts.
I just fail to understand it.
From reading the code var=$(url_host $url) the var variable contains a URL host part, so it should be named host,
but the comment says the var variable actually contains a "mount command", so it should be named mount_command,
or my understanding is totally confused and actually the var variable contains the name of another variable, because it is then used as ${!var},
or does the var variable actually mean the scheme name var?
Therefore I would like to ask you if you could clean up that naming mess
so that it is more clear what that code actually is about.
I also do not understand what the scheme name var actually means.
jsmeix commented at 2021-05-28 11:49:¶
@pcahyna
yes, the umount_url ... || Error "..." issue existed before.
I only spotted it now and I reported it here primarily so that it is not forgotten.
I don't have a simple straightforward idea how to get that new issue solved,
i.e. how to distinguish whether we are "after the backup was restored",
so I think you should not try to solve that additional issue now.
Things have worked as is for years, so there is no pressure to enhance it now.
pcahyna commented at 2021-05-28 11:51:¶
From reading the code var=$(url_host $url) the var variable contains a URL host part, so it should be named host,
but the comment says the var variable actually contains a "mount command", so it should be named mount_command,
or my understanding is totally confused and actually the var variable contains the name of another variable, because it is then used as ${!var},
or does the var variable actually mean the scheme name var?
There is an indirection: ${var} is not a mount command, but the name of another variable which contains the mount command. It is used in the var: scheme and, confusingly, it is named identically to that scheme.
pcahyna commented at 2021-05-28 11:53:¶
var:// is an internal scheme to implement the *_{U,}MOUNTCMD functionality.
It needs to be changed a bit, because as it currently is it will not handle errors very well.
(mount_url() needs to know the _UMOUNTCMD in order to use it in the exit task,
but it is not known at the point where mount_url() gets called.)
pcahyna commented at 2021-05-28 11:55:¶
@pcahyna
yes, the umount_url ... || Error "..." issue existed before.
I only spotted it now and I reported it here primarily so that it is not forgotten. I don't have a simple straightforward idea how to get that new issue solved,
i.e. how to distinguish whether we are "after the backup was restored",
so I think you should not try to solve that additional issue now.
Things have worked as is for years, so there is no pressure to enhance it now.
Wouldn't it be enough to make the Error conditional on "recover" != "$WORKFLOW"?
jsmeix commented at 2021-05-28 12:02:¶
Backup restore also happens in "restoreonly" workflow that runs those stages:
# usr/sbin/rear -s restoreonly | grep -o 'Source [^/]*' | uniq
Source conf
Source init
Source setup
Source verify
Source restore
Source wrapup
And during the "recover" workflow we should error out as usual
before the backup was restored, i.e. during the usr/share/rear/verify stage.
E.g. one and the same 980_umount_NETFS_dir.sh exists as script or symlinks,
where during "rear mkbackup" it is called as
usr/share/rear/prep/NETFS/default/980_umount_NETFS_dir.sh
usr/share/rear/backup/NETFS/default/980_umount_NETFS_dir.sh
versus during "rear recover" it is called as
usr/share/rear/verify/NETFS/default/980_umount_NETFS_dir.sh
usr/share/rear/restore/NETFS/default/980_umount_NETFS_dir.sh
jsmeix commented at 2021-05-28 12:15:¶
When var is an internal scheme to implement the *_{U,}MOUNTCMD functionality,
then that scheme should be renamed accordingly to mountcmd.
The variable named var (a variable named var - how explanatory! - shudder! ;-)
should be renamed to something that explains what that thingy is about,
e.g. something like mount_command_variable_name or mnt_cmd_var_name or mountcmd_var_name
- the latter contains the matching scheme name,
so grep -i mountcmd over the whole code "just finds" all related places,
cf. "Code must be easy to read" in
https://github.com/rear/rear/wiki/Coding-Style
pcahyna commented at 2021-05-28 12:21:¶
It is not named var
because it is a variable, but because its value is
a variable :-) which is in some (EDIT: perverse) sense explanatory.
pcahyna commented at 2021-05-28 12:23:¶
Backup restore also happens in "restoreonly" workflow that runs those stages:
What about making it conditional on stage instead of workflow, would this work?
pcahyna commented at 2021-05-28 12:27:¶
usr/share/rear/restore scripts (or symlinks) that do other umount calls in the current GitHub master code:
usr/share/rear/restore/YUM/default/410_restore_backup.sh: umount "$BUILD_DIR/outputfs"
usr/share/rear/restore/BLOCKCLONE/default/400_restore_clone.sh: umount_mountpoint $mp
usr/share/rear/restore/NETFS/default/400_restore_backup.sh: umount "$BUILD_DIR/outputfs"
Those umounts are in the code that implements split archive support.
This code umounts the CD-ROM, lets the user insert another CD-ROM, and mounts it, possibly several times. That does not sound like a problem.
The umount_url() at the end will see the last CD-ROM and unmount it, although it is not the same CD-ROM that mount_url() started with.
jsmeix commented at 2021-05-28 12:28:¶
In your
https://github.com/pcahyna/rear/tree/fix-backup-removal-in-exit-task
in usr/share/rear/lib/global-functions.sh there is
function remove_temporary_mountpoint() {
rmdir $v "$1" >&2
}
Usually you should no longer need to do something special with stdout
and stderr,
see
https://github.com/rear/rear/issues/2416
pcahyna commented at 2021-05-28 12:30:¶
In restore/BLOCKCLONE/default/400_restore_clone.sh
the unmount is
there for the target device, which is an independent case from
mount/umount_url(), so this should not be a problem.
pcahyna commented at 2021-05-28 12:35:¶
In your https://github.com/pcahyna/rear/tree/fix-backup-removal-in-exit-task
in usr/share/rear/lib/global-functions.sh there is
function remove_temporary_mountpoint() { rmdir $v "$1" >&2 }
Usually you should no longer need to do something special with stdout and stderr
This function replaces the numerous uses of rmdir and rm -Rf that had been there before.
All of them use this redirection, so I preserved it. If it is not needed anymore, I can drop it in a separate commit.
(This is the purpose of having it in a function, even if tiny - one can make such changes in one place instead of many.)
jsmeix commented at 2021-05-28 12:36:¶
In your
https://github.com/pcahyna/rear/tree/fix-backup-removal-in-exit-task
in usr/share/rear/output/default/950_copy_result_files.sh there is
# XXX should we check for empty $OUTPUT_URL here? Most likely plenty of things got broken before reaching this point if it is empty
Usually we use TODO and FIXME in comments for such cases.
pcahyna commented at 2021-05-28 13:07:¶
Usually we use TODO and FIXME in comments for such cases.
Changed now
jsmeix commented at 2021-05-28 13:11:¶
So I have now had a rather quick look over the changes in
https://github.com/pcahyna/rear/tree/fix-backup-removal-in-exit-task
and it all looks really good to me just from looking at the code.
In particular I appreciate the various major cleanups therein.
Even if those cleanups might break some unexpected special cases,
I prefer to have them and fix those unexpected special cases later as needed.
@pcahyna
thank you so much for all your work here!
pcahyna commented at 2021-05-28 13:13:¶
@jsmeix thanks a lot for the review. I have found during the cleanups that many unexpected special cases are already broken in one way or another. I will add a list of these findings to the PR when I open it.
jsmeix commented at 2021-05-28 13:19:¶
@pcahyna
I wish you a relaxed and recovering weekend!
pcahyna commented at 2021-05-31 10:05:¶
@jsmeix thanks, hope you had a relaxing and recovering weekend as well.
Concerning
Backup restore also happens in "restoreonly" workflow that runs those stages:
What about making it conditional on stage instead of workflow, would this work?
I realized that there is apparently no way to test what stage we are in, unlike for the workflow.
Would it work to introduce a global variable STAGE and set it in SourceStage to the name of the current stage,
then test whether "$STAGE" != "restore" before erroring out in umount_url(),
to address your initial concern that we should not abort the recovery
just because the umount of the backup directory has failed?
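A rough sketch of that proposal (not existing code; SourceStage is the real framework function, the STAGE variable and the test are what would be added):
function SourceStage () {
    local stage="$1"
    STAGE="$stage"    # remember which stage is currently being run
    # ... then source the scripts of that stage as before ...
}
# and inside umount_url() the abort could be skipped while restoring:
perform_umount_url $url $mountpoint || {
    test "$STAGE" == "restore" || Error "Unmounting '$mountpoint' failed."
}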
jsmeix commented at 2021-05-31 12:26:¶
I also have the "use a global variable" idea in my mind,
but currently my mind is not yet finished with thinking about it.
My current idea is a more direct variable like BACKUP_RESTORED
that could have is_true/is_false values for 'success' versus 'failure'
(neither is_true nor is_false means an 'unknown' backup restore state),
and the Error function checks that and does not error out if is_true $BACKUP_RESTORED,
to get something which works in general in all scripts.
But for a generic solution a variable like BACKUP_RESTORED is too specific,
because the actual intent is to not error out,
so a more explicit variable name could be NO_ERROR_EXIT
or, positively named, ERROR_PROCEED or something like that.
pcahyna commented at 2021-06-04 18:17:¶
I found that the issue is actually more serious than what I wrote initially. I wrote:
if interrupted during the output stage. If for any reason unmounting the NFS directory mounted at outputfs fails (for example, there is a process using this filesystem), ReaR proceeds to
rm -Rf -v /.../outputfs
as part of the exit tasks.
but actually the "if interrupted during the output stage" part is not necessary.
If anything blocks the umount, or it fails for any other reason, the umount_url() function checks the error from umount
and aborts ReaR itself, thus invoking the problematic rm -Rf exit task and leading to certain data loss.
I also found that similar problems exist for the filesystem mounted at
$BUILD_DIR/tftpbootfs
in the case of OUTPUT=PXE
:
https://github.com/rear/rear/blob/2433c72f614af5b22547d6bd873ff4d224901a2c/usr/share/rear/output/PXE/default/800_copy_to_tftp.sh#L14
and
https://github.com/rear/rear/blob/2433c72f614af5b22547d6bd873ff4d224901a2c/usr/share/rear/output/PXE/default/810_create_pxelinux_cfg.sh#L17.
The directory could contain data for other machines that would thus be
lost.
jsmeix commented at 2021-06-18 13:04:¶
With https://github.com/rear/rear/pull/2625 merged,
this issue should be avoided.
[Export of Github issue for rear/rear.]