#3447 PR `merged`: Extend the umount_mountpoint function to try normal, forced and lazy umounts¶

Labels: enhancement, cleanup, ready-to-merge?

gdha opened issue at 2025-04-05 06:45:¶

Pull Request Details:¶

Type: Enhancement
Impact: Low
Reference to related issue (URL): #3429
How was this pull request tested? Basic test performed with normal umount so far.
Description of the changes in this pull request:
Issue #3429 is about "The new umount_mountpoint_retry_lazy function https://github.com/rear/rear/pull/3408 versus the already existing umount_mountpoint_lazy function - try to make a single generic ReaR umount function for all cases"
However, after carefully looking at the code we feel it would be better to merge all kind if umounts in one function and that is within the main function "umount_mountpoint".

In this first version of the PR we did not (yet) remove the 'lazy' argument with the function umount_mountpoint, nor the two remaining lazy functions "umount_mountpoint_lazy " and "umount_mountpoint_retry_lazy" to protect the existing calls to these functions in ReaR.

When the ReaR maintainers agree with the new code (or after updates) we can start cleaning up the remaining lazy functions.

jsmeix commented at 2025-04-09 07:52:¶

I did not yet find time for a thorough review
so I made only comments what I noticed at first glance.

jsmeix commented at 2025-04-09 13:12:¶

My main worries with "just killing some process"
comes from what I learned from @pcahyna during his
https://github.com/rear/rear/issues/2611
and
https://github.com/rear/rear/pull/2625

Excepts from his initial descriptions
in https://github.com/rear/rear/issues/2611

If for any reason unmounting the NFS directory
mounted at outputfs fails (for example,
there is a process using this filesystem),
ReaR proceeds to
rm -Rf -v /.../outputfs
as part of the exit tasks.
This removes the old backup directory.
It will also destroy the backup directories
for other machines, if multiple machines are configured
to use the same NFS directory.
A similar problem was reported as
https://github.com/rear/rear/issues/465
and after being fixed it was reintroduced
in a misguided "fix"
(PR https://github.com/rear/rear/pull/782,
https://github.com/rear/rear/commit/4e9c2a1b05f87762fb06355cf959b24eacc21f62)
in ReaR 1.18.

and in https://github.com/rear/rear/pull/2625

Replaced the various
rm -rf
of the mountpoint by rmdir
(this fixes https://github.com/rear/rear/issues/2611)
and added lazy umount in case normal umount does not succeed.
If build dir is kept, propose a safe way to remove it to the user
(rm -Rf --one-file-system instead of just rm -Rf
where the user would risk to remove everything
in their mounted filesystem if still mounted.)

In those cases it was "just 'rm -Rf' some directory"
instead of a careful way how to properly deal with
when the correctly intended task "rmdir directory" failed.

In this case here it could be "just kill some process"
instead of a careful way how to properly deal with
when the correctly intended task "umount directory" failed.

In both cases when the correctly intended task failed,
not the reason behind the failure is carefully handled,
instead the whole problem is disregarded and ReaR even
somewhat "brutally" proceeds in case of those failures
which could lead to disastrous outcomes, e.g. as in
https://github.com/rear/rear/issues/2611

The worst outcome when ReaR itself causes a disaster is
that then ReaR may no longer be perceived by its users
as a reasonably trustworthy tool for disaster recovery
after they found out how certain ReaR code behaves.
No software is free of issues, but one can relatively
easily see if code tries to behave careful or not.

In general this would be against
"Try hard to care about possible errors"
https://github.com/rear/rear/wiki/Coding-Style#try-hard-to-care-about-possible-errors

gdha commented at 2025-04-10 10:04:¶

@jsmeix
Personally, I’ve never had an issue with the umount_mountpoint function itself.
That said, I have experienced problems with NFS hangs - especially when dealing with cross-WAN NFS mounts.

In general, WAN-based NFSv4 over TCP tends to be more reliable than plain NFSv3. However, WAN NFS is often subject to QoS policies that throttle the bitrate to a crawl over time. This typically happens when backing up large amounts of data over NFS. If the NFS server doesn’t receive input for a while, it may send a RST (reset) signal to the NFS client. As a result, the ReaR process on the client - usually via tar - keeps waiting for input that never arrives, causing the NFS mount to hang.

This situation is made worse in enterprise environments, where ReaR is often run in unattended mode, and where the outcome of the backup isn't closely monitored. If backups are scheduled weekly, this can lead to subsequent jobs also hanging.

While you can kill the tar, cpio, and rear processes, there’s usually one process that remains unkillable and becomes a zombie.
A lazy unmount of the temporary ReaR directory does work, but needing to do so is a clear sign that ReaR has already failed.

As for killing the process - it might be a heavy-handed approach, and I wouldn't object to removing that logic and simply reporting the issue. However, allowing ReaR to continue in that state isn’t much better, since its behavior becomes unreliable from that point forward.

Is this ReaR’s fault? I’d argue no. In my opinion, NFS is simply not a great transport mechanism for backups.
As Didac pointed out in his talk, rsync offers much more robust recovery mechanisms compared to NFS.

jsmeix commented at 2025-04-10 11:55:¶

@gdha
thank you so much for your background information,
in particular what happens "out there in the wild".

I have nothing against killing a process in principle.
What worries me is only when such things happen
unconditioned and/or in a hardcoded / automated way
where the user has no influence.

For example when ReaR is running in unattended mode
this would be a condition to do an automated killing
of processes which use a mountpoint.

We also have '--non-interactive' mode where I think
this mode is also meant for such cases - regardless
that currently "rear help" tells

 -n --non-interactive   non-interactive mode; aborts when any
                        user input is required (experimental)

where "user input is required" is not exactly this case
but I think the meaning is same, i.e. when ReaR comes
to a point where ReaR would need user input how to proceed
then in non-interactive mode ReaR would abort according to
what "rear help" currently tells.

But in this specific case here I think it could be better
to not "just abort" but to kill what uses the mountpoint.

What seems to be missung is that when "rear recover"
is run in 'unattended_recovery' mode that then
not also NON_INTERACTIVE=1 is set, i.e.
'unattended_recovery' mode does not automatically
also mean 'non-interactive' mode.

I think 'non-interactive' mode is the more generic mode
while 'unattended_recovery' mode is a special kind of
the 'non-interactive' mode so 'unattended_recovery' mode
should implicate 'non-interactive' mode.

gdha commented at 2025-05-07 08:51:¶

@jsmeix could you have a look at the latest updates and give feedback on it - thanks.

gdha commented at 2025-05-20 13:48:¶

@rear/contributors If there are no further modifications required I would like to merge this PR by the end of this week.

jsmeix commented at 2025-05-20 15:20:¶

@pcahyna
please have a look here.

In particular I would like to know your opinion
about enforced umount because according to what I read
"just doing an enforced umount" causes more harm than good, cf.
https://github.com/rear/rear/pull/3447#discussion_r2077472659
and
https://github.com/rear/rear/pull/3469

According to what

# git log -p --follow usr/share/rear/lib/global-functions.sh

shows, the "just doing an enforced umount" in the
umount_mountpoint() function plus the idea to

    ### otherwise, try to kill all processes that opened files on the mount.
    # TODO: actually implement this

originated in
https://github.com/rear/rear/commit/7de049a45bf8e8b17b88ed4008de0989de1c981e

gdha commented at 2025-05-22 08:06:¶

@jsmeix @pcahyna Let discuss this topic on our next meeting to get an agreement on how to proceed.

    ### otherwise, try to kill all processes that opened files on the mount.
    # TODO: actually implement this

originated in 7de049a

FYI - latest updates do not kill any processes.

gdha commented at 2025-05-22 08:49:¶

Idea - change function remove_temporary_mountpoint as follows:

function remove_temporary_mountpoint() {
    if test -d "$1" ; then
        # only remove directory when it is NOT mounted
        mountpoint --quiet --nofollow -- "$1" || rmdir $v "$1"
    fi
}

jsmeix commented at 2025-05-22 09:07:¶

@gdha
I know that it does not SIGKILL processes
and it never did before.
It was just a comment which shows
that the idea to "just SIGKILL processes" is old
(cf. RFC 1925 item 11).

What I am mostly missing is the unintrusive behaviour
from my umount_mountpoint_retry_lazy() function.

Currently umount_mountpoint() does

timeout $timeout_secs umount $v "$mountpoint" && return 0
timeout $timeout_secs umount $v --force "$mountpoint" && return 0
timeout $timeout_secs umount $v --lazy "$mountpoint" && return 0
return 1

without any attempt to let things finish on its own
by sleeping a bit ('timeout' does not sleep)
as umount_mountpoint_retry_lazy() does

umount $v "$mountpoint" && return 0
sleep 1
umount $v "$mountpoint" && return 0
umount $v --lazy "$mountpoint" && return 0
return 1

So umount_mountpoint() should at least partially implement
the unintrusive behaviour of umount_mountpoint_retry_lazy()
to be an acceptable replacement for the use case
why I made umount_mountpoint_retry_lazy() as I did,
for example like

timeout $timeout_secs umount $v "$mountpoint" && return 0
sleep 1
timeout $timeout_secs umount $v "$mountpoint" && return 0
timeout $timeout_secs umount $v --force "$mountpoint" && return 0
timeout $timeout_secs umount $v --lazy "$mountpoint" && return 0
return 1

or with sleep $timeout_secs instead of hardcoded 'sleep 1'

timeout $timeout_secs umount $v "$mountpoint" && return 0
sleep $timeout_secs
timeout $timeout_secs umount $v "$mountpoint" && return 0
timeout $timeout_secs umount $v --force "$mountpoint" && return 0
timeout $timeout_secs umount $v --lazy "$mountpoint" && return 0
return 1

What I like to discuss on our next meeting is
how far forced umount in umount_mountpoint()

timeout $timeout_secs umount $v --force "$mountpoint" && return 0

is actually helpful for the user and his system as a whole
and in particular helpful for ReaR's task as a whole.
E.g. think about the cases why I implemented
umount_mountpoint_retry_lazy() where some task in ReaR
needed a bit more time to finish so any forced umount
may make that task in ReaR fail so the forced umount
could be unhelpful or even destructive for the success
of ReaR's task as a whole.
Of course '--force' could help to make 'umount' exit
with zero exit code but I question if that really
helps to make things succeed as a whole.

Same for forced lazy umount in umount_mountpoint_lazy()

umount $v -f -l "$mountpoint" >&2

(I think the stdout redirection to stderr therein
should be no longer needed.)

By the way:

(1)
I wonder why timeout_secs is hardcoded and cannot
be specified by the caller of umount_mountpoint() ?

(2)
My optional string in umount_mountpoint_retry_lazy()
to show the user if needed a more meaningful message
about what is mounted was also ignored.

So in the end the current umount_mountpoint()
completely ignores the use cases why I had to
implement umount_mountpoint_retry_lazy() as I did.

So the current umount_mountpoint() is unacceptable
as a replacement for use cases in ReaR which need an
unintrusive behaviour of umount_mountpoint_retry_lazy()
(plus optional more meaningful messages to the user).

gdha commented at 2025-05-28 05:24:¶

@jsmeix Hopefully, the latest changes are the good one?

jsmeix commented at 2025-06-05 07:09:¶

Regarding enforced umount or killing non-ReaR processes see
https://github.com/rear/rear/issues/3478#issuecomment-2943009618

jsmeix commented at 2025-06-05 13:20:¶

You may have a look at
https://github.com/rear/rear/issues/3478#issuecomment-2943942969
what a user thinks how to appropriately deal with issues
when ReaR cannot umount something - basically:
Retry umount several times and then give up
and leave things to the admin.

jpbuecken commented at 2025-06-13 12:10:¶

You may have a look at #3478 (comment) what a user thinks how to appropriately deal with issues when ReaR cannot umount something - basically: Retry umount several times and then give up and leave things to the admin.

Hello, as a rear user and a long term linux admin I would support a umount retry loop as well as the best solution.
A admin can react to a error message and aborted processes.

A umount -f is a bad idea for exactly the reason you don't know what you kill. In this examle an on access virus protection solution.

Even umount -l is not the best idea. It is dangling in the background. You still don't know what caused the busy filesystem. A valid rear process? If you copy the efi img file in that state, is it really ok or corrupted? And it is almost impossible to get information about the root cause, why it was busy. You cannot run fuser or lsof. You fully lost control what will happen to the filesystem.
Just try with a usual filesystem: cd into it. Umount -l it from another terminal. Now let try to find out if and why the filesystem is still busy without knowing some bash is open there. And try to find out if it got cleaned up after you cd out of it.

Long story short, a analysis of e.g. defect rear efi img will get very difficult if not impossible with umount -l or umount -f.

jpbuecken commented at 2025-06-13 12:27:¶

Running fuser -m is a good idea to show what caused the busy filesystem.
I see it was suggested here https://github.com/rear/rear/issues/2908#issuecomment-1380425752 . That approach of @pcahyna would not even need the modern "-M" parameter, because it runs only if the umount fails, not everytime, so it is still a mountpoint and does not list filesystems like / and other misleading information as the comment in the code of this pull request describes.

jsmeix commented at 2025-06-13 13:55:¶

Regarding
https://github.com/rear/rear/pull/3447#issuecomment-2970237657

That approach of @pcahyna would not even need the
modern "-M" parameter, because it runs only if the umount fails

Perhaps

umount $mountpoint || fuser -m -v  $mountpoint

works only "mostly"?
I don't know if non-zero exit code of 'umount $mountpoint'
guarantees that something is mounted at $mountpoint.
If for whatever reason nothing is mounted at $mountpoint
'fuser -m -v $mountpoint' may output hundreds of lines, see
https://github.com/rear/rear/issues/2908#issuecomment-1405070332

jpbuecken commented at 2025-06-13 14:17:¶

Perhaps
umount $mountpoint || fuser -m -v  $mountpoint
works only "mostly"? I don't know if non-zero exit code of 'umount $mountpoint' guarantees that something is mounted at $mountpoint. If for whatever reason nothing is mounted at $mountpoint 'fuser -m -v $mountpoint' may output hundreds of lines, see #2908 (comment)

Yeah I see, e.g. umount on a already "umounted" filesystem will give you exit code 1.

Or the mount command could have failed before.

I see it is needed to add more logic if one want to really omit "-M"... So nevermind. To log open processes with fuser as a best afford approach is still a good idea.

EDIT: Something like:

if ! umount "$mountpoint"; then
   if is_mounted "$mountpoint" "$timeout_secs"; then
       # so it is still a mountpoint
       fuser -m -v  "$mountpoint"
   else
      Log "'umount' failed, but not mounted anyway "
   fi
fi

gdha commented at 2025-06-17 08:50:¶

@rear/contributors If there are no further comments would like to merge this PR by upcoming Friday.

[Export of Github issue for rear/rear.]

#3447 PR merged: Extend the umount_mountpoint function to try normal, forced and lazy umounts¶