#2610 Issue closed: Automounted NFS filesystems will cause rear to hang if NFS server fails

Labels: enhancement, fixed / solved / done

hicksdc opened issue at 2021-05-05 18:50:

Relax-and-Recover (ReaR) Issue Template

Fill in the following items before submitting a new issue
(quick response is not guaranteed with free support):

  • ReaR version ("/usr/sbin/rear -V"): 2.6-git.4317.be8b6ed.master / 2021-04-21

  • OS version ("cat /etc/os-release" or "lsb_release -a" or "cat /etc/rear/os.conf"):

NAME="Red Hat Enterprise Linux Server"
VERSION="7.9 (Maipo)"
...
  • ReaR configuration files ("cat /etc/rear/site.conf" and/or "cat /etc/rear/local.conf"): (unchanged)

  • Hardware (PC or PowerNV BareMetal or ARM) or virtual machine (KVM guest or PowerVM LPAR): PC

  • System architecture (x86 compatible or PPC64/PPC64LE or what exact ARM device): x86_64

  • Firmware (BIOS or UEFI or Open Firmware) and bootloader (GRUB or ELILO or Petitboot): BIOS

  • Storage (local disk or SSD) and/or SAN (FC or iSCSI or FCoE) and/or multipath (DM or NVMe): local SSD

  • Storage layout ("lsblk -ipo NAME,KNAME,PKNAME,TRAN,TYPE,FSTYPE,SIZE,MOUNTPOINT" or "lsblk" as makeshift):

sda             8:0    0  20G  0 disk
├─sda1          8:1    0   1G  0 part /boot
└─sda2          8:2    0  19G  0 part
  ├─rhel-root 253:0    0  18G  0 lvm  /
  └─rhel-swap 253:1    0   1G  0 lvm  [SWAP]
  • Description of the issue (ideally so that others can reproduce it):
    Automounted NFS filesystems cause the rear process to hang indefinitely if the NFS server becomes unavailable while the mount point still exists. The problem appears to be that rear/prep/default/400_save_directories.sh collects information about mount points that are owned by the automounter, which in my opinion is unnecessary (the automount daemon creates these mount points, so why include them in a backup?).

For example:

# mount | grep /apps/unixengdev
uxengnas1:/export/unixengdev on /apps/unixengdev type nfs4 (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.0.2.15,local_lock=none,addr=28.97.76.59)

Walking through the 400_save_directories.sh code manually:

# USB_DEVICE_FILESYSTEM_LABEL='REAR-000'
# BUILD_DIR=/tmp/foo
# excluded_fs_types="anon_inodefs|autofs|bdev|cgroup|cgroup2|configfs|cpuset|debugfs|devfs|devpts|devtmpfs|dlmfs|efivarfs|fuse.gvfs-fuse-daemon|fusectl|hugetlbfs|mqueue|nfsd|none|nsfs|overlay|pipefs|proc|pstore|ramfs|rootfs|rpc_pipefs|securityfs|sockfs|spufs|sysfs|tmpfs"
# excluded_other_stuff="/sys/|$BUILD_DIR|$USB_DEVICE_FILESYSTEM_LABEL"
# mountpoints="$( mount | grep -vE "type ($excluded_fs_types) |$excluded_other_stuff" | awk '{print $3}' )"
# echo $mountpoints
/ /boot /vagrant /vagrant /home/vagrant/workspace /apps/unixengdev

If the uxengnas1 NFS server has actually failed at this point, the subsequent stat ... loop waits indefinitely for the NFS server to respond.

  • Workaround, if any:
    In my opinion, better behaviour would be for the script to exclude mount points managed by the automounter. That isn't difficult to achieve, because automount creates an autofs mount table entry for the parent of all of its potential mount points, e.g. for the /apps/unixengdev example above there is an entry for /apps:
# mount | grep /apps
/etc/auto.apps on /apps type autofs (rw,relatime,fd=17,pgrp=1082,timeout=300,minproto=5,maxproto=5,indirect,pipe_ino=29508)
uxengnas1:/export/unixengdev on /apps/unixengdev type nfs4 (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.0.2.15,local_lock=none,addr=28.97.76.59)

So we can exclude every autofs mount point together with everything below it, e.g. like this:

local excluded_automounts=$(mount |awk '
  $5 == "autofs" { i++; mp[i]=$3 }
  END {
    if (i > 0) {
      line="^("mp[1]
      for (j=2;j<=i;j++) {line=line"|"mp[j]}
      print line")(/|$)"
    } else {
      print " "
    }
  }')
local mountpoints=$( mount |awk -v x="$excluded_automounts" '$3 !~ x' | grep -vE "type ($excluded_fs_types) |$excluded_other_stuff" | awk '{print $3}' )
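
To try the exclusion logic without a live automounter, the same idea can be run against captured mount output. The following is only a minimal sketch under stated assumptions: the helper names autofs_exclude_regex and filter_mountpoints and the sample data are made up for illustration and are not part of ReaR. It assumes mount points contain no spaces or regex metacharacters.

```shell
# Made-up sample in 'mount' output format (options shortened).
sample_mounts='/dev/mapper/rhel-root on / type xfs (rw,relatime)
/dev/sda1 on /boot type xfs (rw,relatime)
/etc/auto.apps on /apps type autofs (rw,relatime,indirect)
uxengnas1:/export/unixengdev on /apps/unixengdev type nfs4 (rw,relatime,hard)'

# Hypothetical helper: read 'mount'-style lines on stdin and print an
# anchored regex that matches every autofs mount point and anything below it.
autofs_exclude_regex() {
    awk '
        $5 == "autofs" { mp[++n] = $3 }
        END {
            if (n == 0) { print "^$"; exit }   # matches no mount point
            re = "^(" mp[1]
            for (j = 2; j <= n; j++) re = re "|" mp[j]
            print re ")(/|$)"
        }'
}

# Hypothetical helper: print the mount points that do not match the regex.
filter_mountpoints() {
    awk -v re="$1" '$3 !~ re { print $3 }'
}

regex="$(printf '%s\n' "$sample_mounts" | autofs_exclude_regex)"
# Prints / and /boot, one per line; /apps and /apps/unixengdev are dropped.
printf '%s\n' "$sample_mounts" | filter_mountpoints "$regex"
```

The trailing (/|$) in the regex is what keeps an unrelated path such as /appsdata from being excluded just because it shares the /apps prefix.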

Repeating the above example:

# mount | grep /apps/unixengdev
uxengnas1:/export/unixengdev on /apps/unixengdev type nfs4 (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.0.2.15,local_lock=none,addr=28.97.76.59)
# mount | grep autofs
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=34,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=13393)
/etc/auto.misc on /misc type autofs (rw,relatime,fd=5,pgrp=1082,timeout=300,minproto=5,maxproto=5,indirect,pipe_ino=29495)
-hosts on /net type autofs (rw,relatime,fd=11,pgrp=1082,timeout=300,minproto=5,maxproto=5,indirect,pipe_ino=29504)
/etc/auto.apps on /apps type autofs (rw,relatime,fd=17,pgrp=1082,timeout=300,minproto=5,maxproto=5,indirect,pipe_ino=29508)

# excluded_automounts=$(mount |awk '
<snip>
# echo $excluded_automounts
^(/proc/sys/fs/binfmt_misc|/misc|/net|/apps)(/|$)

# mountpoints=$( mount |awk -v x="$excluded_automounts" '$3 !~ x' | grep -vE "type ($excluded_fs_types) |$excluded_other_stuff" | awk '{print $3}' )
# echo $mountpoints
/ /boot /vagrant /vagrant /home/vagrant/workspace
  • Attachments, as applicable ("rear -D mkrescue/mkbackup/recover" debug log files):

jsmeix commented at 2021-05-06 09:11:

@hicksdc
thank you for your issue report and your proposed solution!

I am not at all an autofs and/or NFS expert so I may misunderstand things.

As far as I understand it the issue is that what mount shows as type autofs
is already automatically excluded but not its descendants,
in particular not descendants that are mounted via NFS,
as in your example the NFS share
uxengnas1:/export/unixengdev on /apps/unixengdev type nfs4

I wonder if it isn't simpler and more straightforward to add
what mount shows as type nfs* to the excluded_fs_types variable
instead of some rather complicated awk scripting?

According to what man 8 mount shows on my openSUSE Leap 15.2 system
it seems the only type nfs* values are type nfs and type nfs4.

By the way:
I wonder if also other network file system types like cifs smbfs ncpfs
that are mentioned in man 8 mount should get excluded?

hicksdc commented at 2021-05-06 09:37:

Thanks for the quick response @jsmeix !

I did also think about excluding all nfs* mounts as a simpler approach, but assumed that would be unpopular because it would also exclude those mounts that are not being managed by automount, e.g. an explicit /etc/fstab entry for a permanent NFS filesystem, for which I assume you would want to include the mount point stat in the backup? (Perhaps this comes down to a misunderstanding on my part of the purpose of this particular script, which is that it is simply attempting to capture details of all mount points that would need to be recreated when restoring the system...?)
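
To make that trade-off concrete, here is a small sketch that applies a type-based filter to made-up sample data. The helper name mountpoints_excluding_types, the nas2 server and the /data mount are hypothetical; /data stands in for a permanent NFS mount from /etc/fstab.

```shell
# Made-up sample in 'mount' output format (options shortened).
sample_nfs_mounts='/dev/sda1 on /boot type xfs (rw,relatime)
/etc/auto.apps on /apps type autofs (rw,relatime,indirect)
uxengnas1:/export/unixengdev on /apps/unixengdev type nfs4 (rw,relatime,hard)
nas2:/export/data on /data type nfs4 (rw,relatime,hard)'

# Hypothetical helper: drop lines whose filesystem type is in the given
# '|'-separated list, then print the remaining mount points.
mountpoints_excluding_types() {
    grep -vE "type ($1) " | awk '{ print $3 }'
}

# Excluding only autofs keeps both NFS mount points ...
by_autofs_type="$(printf '%s\n' "$sample_nfs_mounts" | mountpoints_excluding_types 'autofs')"
# ... while adding nfs and nfs4 drops both, including the permanent /data.
by_nfs_type="$(printf '%s\n' "$sample_nfs_mounts" | mountpoints_excluding_types 'autofs|nfs|nfs4')"

printf 'autofs only: %s\n' "$(echo $by_autofs_type)"
printf 'with nfs*:   %s\n' "$(echo $by_nfs_type)"
```

Neither variant separates the automounted /apps/unixengdev from the permanent /data, which is why the filesystem type alone does not seem to be enough here.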

The awk solution was just meant to show a working example really, I agree that it's not pretty and am certain there is a far cleaner way (I'm no developer! ;) ). But yes, as you say, the key point is that the descendants of the type autofs entries should be excluded regardless of their type, which leads on to your final point...

Other network filesystems could presumably cause a similar problem, but I think they should simply be subject to the same logic as I've just discussed - if their mount points are permanent and not being managed separately by a service, e.g. automount, then details of those mount points should be included in the backup.

jsmeix commented at 2021-05-07 10:50:

In general I prefer to let others who already did the work
do it again for me instead of implementing it again from scratch
so I think I would prefer to let findmnt show parent with its descendants like

# findmnt -R -M $parent_mountpoint -n -o TARGET --raw

instead of a selfmade awk script.

@hicksdc
does

# parent_mountpoint=/apps
# findmnt -R -M $parent_mountpoint -n -o TARGET --raw | tr -s '[:space:]' ' '

show the right parent with its descendant mountpoints in your case?

jsmeix commented at 2021-05-07 12:29:

I made a preliminary proposal that is not yet at all tested by me via
https://github.com/rear/rear/pull/2613

@hicksdc
could you test if my proposed code there works in your case?

FWIW:
My implementation got more complicated than I had expected
so it does not look much simpler than your awk scripting.

jsmeix commented at 2021-05-07 13:40:

I wonder if all that additional rather complicated code
is worth the effort only to avoid that "rear mkrescue/mkbackup"
hangs up when an NFS server becomes unavailable
while its NFS share mountpoint still exists?

I think any other program that tries to access that NFS share
would also hang up - so I wonder why we should add exceptional code
to ReaR to avoid NFS issues that originate outside of ReaR?

I think only when it makes in general no sense to save and recreate
any mountpoints below mountpoints of "type autofs"
then such mountpoints should be excluded.

So the crucial question for me (I am not an autofs expert) is:
Are any mountpoints below mountpoints of "type autofs"
always owned by the automounter?
Or in other words:
Are all mountpoints below mountpoints of "type autofs"
always created by the automount daemon?

hicksdc commented at 2021-05-07 18:03:

Are any mountpoints below mountpoints of "type autofs" always owned by the automounter?

Yes, or at least they exist on an ancestor mount point that definitely is owned/created by the automounter, i.e. it is possible to create a submountpoint on an automounted mountpoint, but the fact that the submountpoint is not local means it should be excluded from the backup. For example:

# mount /dev/mapper/rootvg-crashvol /apps/unixengdev/path/to/mnt

# mount | grep /apps
auto.apps on /apps type autofs (rw,relatime,fd=5,pgrp=1909,timeout=300,minproto=5,maxproto=5,indirect,pipe_ino=28889)
uxengnas1:/export/unixengdev on /apps/unixengdev type nfs4 (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=28.97.72.19,local_lock=none,addr=28.97.76.59)
/dev/mapper/rootvg-crashvol on /apps/unixengdev/path/to/mnt type xfs (rw,relatime,seclabel,attr2,inode64,noquota)

# findmnt -RT /apps
TARGET                           SOURCE                       FSTYPE OPTIONS
/apps                            auto.apps                    autofs rw,relatime,fd=5,pgrp=1909,timeout=300,minproto=5,maxproto=5,indirect,pipe_ino=28889
└─/apps/unixengdev               uxengnas1:/export/unixengdev nfs4   rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2
  └─/apps/unixengdev/path/to/mnt /dev/mapper/rootvg-crashvol  xfs    rw,relatime,seclabel,attr2,inode64,noquota

.../apps/unixengdev/path/to/mnt was not created or mounted by automount despite being below /apps, but from a backup point of view we're not interested in it because the mountpoint is not a local file that would need to be reproduced.

I have tested your PR 2613 code (with my findmnt -T syntax suggestion) against this scenario and it correctly excludes /apps/unixengdev/path/to/mnt.
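
For anyone who wants to check the nested case without an automounter, a rough sketch follows. It is not the code from PR 2613 and it does not call findmnt -R per parent; instead it does a plain path-prefix comparison on "TARGET FSTYPE" pairs. The helper name targets_below_autofs and the sample data (including the unrelated /appsdata entry) are made up for illustration.

```shell
# Made-up sample in the format printed by 'findmnt -n -o TARGET,FSTYPE --raw'.
sample_findmnt='/ xfs
/boot xfs
/apps autofs
/apps/unixengdev nfs4
/apps/unixengdev/path/to/mnt xfs
/appsdata xfs'

# Hypothetical helper: read "TARGET FSTYPE" pairs on stdin and print every
# target that is an autofs mount point or lies below one.
targets_below_autofs() {
    awk '
        { target[NR] = $1; fstype[NR] = $2 }
        END {
            for (i = 1; i <= NR; i++)
                for (j = 1; j <= NR; j++)
                    if (fstype[j] == "autofs" &&
                        (target[i] == target[j] ||
                         index(target[i], target[j] "/") == 1)) {
                        print target[i]
                        break
                    }
        }'
}

# Prints /apps, /apps/unixengdev and /apps/unixengdev/path/to/mnt.
printf '%s\n' "$sample_findmnt" | targets_below_autofs
```

On a live system the input could come from findmnt -n -o TARGET,FSTYPE --raw. Comparing path prefixes as plain strings avoids having to escape regex metacharacters in mount point names, and the xfs submount is caught purely because of where it sits in the tree.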

So, going back to this discussion:

I wonder if all that additional rather complicated code is worth the effort only to avoid that "rear mkrescue/mkbackup" hangs up when an NFS server becomes unavailable while its NFS share mountpoint still exists?

The NFS failure scenario was just an example intended to add some gravitas to the argument for excluding all autofs managed mountpoints, which is what I hope we are now agreeing should be implemented by your PR?

jsmeix commented at 2021-05-11 09:54:

@hicksdc
thank you for your review, your test, your fix, and your explanation
why it is right to exclude all mountpoints below mountpoints of "type autofs".

hicksdc commented at 2021-05-11 12:27:

No problem at all. Thank you for the rapid fix!

jsmeix commented at 2021-05-12 10:11:

With https://github.com/rear/rear/pull/2613 merged
this issue should be fixed.

@hicksdc
thank you in particular for your comprehensive descriptions
and explanations that helped me so much to properly
implement this improvement for ReaR.


[Export of Github issue for rear/rear.]