#3447 PR open: Extend the umount_mountpoint function to try normal, forced and lazy umounts

Labels: enhancement, cleanup

gdha opened issue at 2025-04-05 06:45:

Pull Request Details:
  • Type: Enhancement

  • Impact: Low

  • Reference to related issue (URL): #3429

  • How was this pull request tested? Basic test performed with normal umount so far.

  • Description of the changes in this pull request:
    Issue #3429 is about "The new umount_mountpoint_retry_lazy function https://github.com/rear/rear/pull/3408 versus the already existing umount_mountpoint_lazy function - try to make a single generic ReaR umount function for all cases"
    However, after carefully looking at the code we feel it would be better to merge all kind if umounts in one function and that is within the main function "umount_mountpoint".

In this first version of the PR we did not (yet) remove the 'lazy' argument with the function umount_mountpoint, nor the two remaining lazy functions "umount_mountpoint_lazy " and "umount_mountpoint_retry_lazy" to protect the existing calls to these functions in ReaR.

When the ReaR maintainers agree with the new code (or after updates) we can start cleaning up the remaining lazy functions.

jsmeix commented at 2025-04-09 07:52:

I did not yet find time for a thorough review
so I made only comments what I noticed at first glance.

jsmeix commented at 2025-04-09 13:12:

My main worries with "just killing some process"
comes from what I learned from @pcahyna during his
https://github.com/rear/rear/issues/2611
and
https://github.com/rear/rear/pull/2625

Excepts from his initial descriptions
in https://github.com/rear/rear/issues/2611

If for any reason unmounting the NFS directory
mounted at outputfs fails (for example,
there is a process using this filesystem),
ReaR proceeds to
rm -Rf -v /.../outputfs
as part of the exit tasks.
This removes the old backup directory.
It will also destroy the backup directories
for other machines, if multiple machines are configured
to use the same NFS directory.
A similar problem was reported as
https://github.com/rear/rear/issues/465
and after being fixed it was reintroduced
in a misguided "fix"
(PR https://github.com/rear/rear/pull/782,
https://github.com/rear/rear/commit/4e9c2a1b05f87762fb06355cf959b24eacc21f62)
in ReaR 1.18.

and in https://github.com/rear/rear/pull/2625

Replaced the various
rm -rf
of the mountpoint by rmdir
(this fixes https://github.com/rear/rear/issues/2611)
and added lazy umount in case normal umount does not succeed.
If build dir is kept, propose a safe way to remove it to the user
(rm -Rf --one-file-system instead of just rm -Rf
where the user would risk to remove everything
in their mounted filesystem if still mounted.)

In those cases it was "just 'rm -Rf' some directory"
instead of a careful way how to properly deal with
when the correctly intended task "rmdir directory" failed.

In this case here it could be "just kill some process"
instead of a careful way how to properly deal with
when the correctly intended task "umount directory" failed.

In both cases when the correctly intended task failed,
not the reason behind the failure is carefully handled,
instead the whole problem is disregarded and ReaR even
somewhat "brutally" proceeds in case of those failures
which could lead to disastrous outcomes, e.g. as in
https://github.com/rear/rear/issues/2611

The worst outcome when ReaR itself causes a disaster is
that then ReaR may no longer be perceived by its users
as a reasonably trustworthy tool for disaster recovery
after they found out how certain ReaR code behaves.
No software is free of issues, but one can relatively
easily see if code tries to behave careful or not.

In general this would be against
"Try hard to care about possible errors"
https://github.com/rear/rear/wiki/Coding-Style#try-hard-to-care-about-possible-errors

gdha commented at 2025-04-10 10:04:

@jsmeix
Personally, I’ve never had an issue with the umount_mountpoint function itself.
That said, I have experienced problems with NFS hangs - especially when dealing with cross-WAN NFS mounts.

In general, WAN-based NFSv4 over TCP tends to be more reliable than plain NFSv3. However, WAN NFS is often subject to QoS policies that throttle the bitrate to a crawl over time. This typically happens when backing up large amounts of data over NFS. If the NFS server doesn’t receive input for a while, it may send a RST (reset) signal to the NFS client. As a result, the ReaR process on the client - usually via tar - keeps waiting for input that never arrives, causing the NFS mount to hang.

This situation is made worse in enterprise environments, where ReaR is often run in unattended mode, and where the outcome of the backup isn't closely monitored. If backups are scheduled weekly, this can lead to subsequent jobs also hanging.

While you can kill the tar, cpio, and rear processes, there’s usually one process that remains unkillable and becomes a zombie.
A lazy unmount of the temporary ReaR directory does work, but needing to do so is a clear sign that ReaR has already failed.

As for killing the process - it might be a heavy-handed approach, and I wouldn't object to removing that logic and simply reporting the issue. However, allowing ReaR to continue in that state isn’t much better, since its behavior becomes unreliable from that point forward.

Is this ReaR’s fault? I’d argue no. In my opinion, NFS is simply not a great transport mechanism for backups.
As Didac pointed out in his talk, rsync offers much more robust recovery mechanisms compared to NFS.

jsmeix commented at 2025-04-10 11:55:

@gdha
thank you so much for your background information,
in particular what happens "out there in the wild".

I have nothing against killing a process in principle.
What worries me is only when such things happen
unconditioned and/or in a hardcoded / automated way
where the user has no influence.

For example when ReaR is running in unattended mode
this would be a condition to do an automated killing
of processes which use a mountpoint.

We also have '--non-interactive' mode where I think
this mode is also meant for such cases - regardless
that currently "rear help" tells

 -n --non-interactive   non-interactive mode; aborts when any
                        user input is required (experimental)

where "user input is required" is not exactly this case
but I think the meaning is same, i.e. when ReaR comes
to a point where ReaR would need user input how to proceed
then in non-interactive mode ReaR would abort.

What seems to be missung is that when "rear recover"
is run in 'unattended_recovery' mode that then
not also NON_INTERACTIVE=1 is set, i.e.
'unattended_recovery' mode does not automatically
also mean 'non-interactive' mode.

I think 'non-interactive' mode is the more generic mode
while 'unattended_recovery' mode is a special kind of
the 'non-interactive' mode so 'unattended_recovery' mode
should implicate 'non-interactive' mode.


[Export of Github issue for rear/rear.]