#1283 Issue closed: With NETFS_RESTORE_CAPABILITIES=yes 'getcap -r /' runs too long on big systems

Labels: enhancement, documentation, fixed / solved / done, external tool

tcerna opened issue at 2017-04-10 16:48:

Rear doesn't create backup when big file systems are mounted. It tries to copy files from network filesystems which are often very big and slow. Backup creation of mounted NFS is not task which should clients do. This should be backed up from server side.

Problem appears when variable NETFS_RESTORE_CAPABILITIES is set to yes. It goes through whole system by command 'getcap -r /' including NFS.

Current result:
Rear does a copy of large mounted NFS and it can last indefinitely. I waited for 8 hours and it was still working.

Expected result:
It will not do a copy of mounted NFS.

# rpm -q rear
rear-2.00-1.el7.x86_64

Reproducer:

# cat /etc/rear/local.conf 
OUTPUT=ISO
BACKUP=NETFS
BACKUP_URL=nfs://$IPADDR/mnt/rear/
GRUB_RESCUE=1

# cat /usr/share/rear/conf/default.conf | grep CAPABILITIES
NETFS_RESTORE_CAPABILITIES=y

# rear -v mkbackup
Relax-and-Recover 2.00 / Git
Using log file: /var/log/rear/rear-sweetpig-5.log
Using backup archive '/tmp/rear.gGE7FlLGrEl5541/tmp/isofs/backup/backup.tar.gz'
Creating disk layout
Creating root filesystem layout
^C

See in rear log:

...
2017-03-24 07:46:18 Including rescue/GNU/Linux/550_copy_ldconfig.sh
2017-03-24 07:46:18 Including rescue/default/550_vagrant.sh
2017-03-24 07:46:18 Including rescue/NETFS/default/600_store_NETFS_variables.sh
2017-03-24 07:46:18 Including rescue/GNU/Linux/600_unset_TMPDIR_in_rescue_conf.sh
2017-03-24 07:46:19 Including rescue/NETFS/default/610_save_capabilities.sh
2017-03-24 08:02:23 Running exit tasks.
2017-03-24 08:02:23 Finished in 973 seconds
2017-03-24 08:02:23 Removing build area /tmp/rear.YHBqaWCbzOjGzt0
removed directory: '/tmp/rear.YHBqaWCbzOjGzt0'
2017-03-24 08:02:23 End of program reached

During running backup creation:

# pstree -p
    ├─sshd(1043)─┬─sshd(4117)───bash(4700)───rear(16856)─┬─getcap(17941)
    │            │                                       └─grep(17942)

# cat /proc/17941/cmdline 
getcap-r/

gozora commented at 2017-04-11 06:10:

Indeed.

getcap -r / 2>/dev/null | grep -v $ISO_DIR > $VAR_DIR/recovery/capabilities || Log "Error while saving capabilities to file."

Is certainly not way to go, as root (/) can contain lots of files ...

@gdha, @jsmeix
Maybe

getcap -r $(cat $TMP_DIR/backup-include.txt) 2>/dev/null | grep -v $ISO_DIR > $VAR_DIR/recovery/capabilities || Log "Error while saving capabilities to file."

Would be better?

V.

tcerna commented at 2017-04-11 07:23:

One good news: step Creating root filesystem layout and whole process of rear -v mkbackup ended.
But there is problem with capabilities again. It was not saved.

# cat /var/lib/rear/recovery/capabilities 
#

See in /var/log/rear/rear-seetpig-1.log:

...
2017-04-11 03:13:39 Including rescue/NETFS/default/610_save_capabilities.sh
cat: /tmp/rear.mdzBvGEIXrxaihO/tmp/backup-include.txt: No such file or directory
2017-04-11 03:13:39 Error while saving capabilities to file.
...

gdha commented at 2017-04-11 07:25:

@mattihautameki Your commit https://github.com/rear/rear/commit/463a2651af982723f843beccdf33f71a0352fa53 need some modifications - see above. Any comments?
Also, this issue is related to #1175 - if only tar could handle correctly --acls and --xattr then this hack would not be necessary.

gdha commented at 2017-04-11 07:28:

@tcerna @gozora The error /tmp/rear.mdzBvGEIXrxaihO/tmp/backup-include.txt: No such file or directory is normal as ./rescue/NETFS/default/610_save_capabilities.sh is executed before ./backup/NETFS/default/500_make_backup.sh. To be precise script ./backup/NETFS/default/400_create_include_exclude_files.sh creates the backup-include.txt file.
Try to use the command rear -s mkbackup to get an overview of all scripts (and in the order these are executed)

tcerna commented at 2017-04-11 07:40:

I see that 610_save_capabilities.sh are run before other scripts. File backup-include.txt can't exist, so it is not possible to use it in this script.

# rear -s mkbackup
...
Source rescue/NETFS/default/610_save_capabilities.sh
...
Source backup/NETFS/default/500_make_backup.sh
Source backup/NETFS/GNU/Linux/600_start_selinux.sh
...

gozora commented at 2017-04-11 07:47:

@tcerna, @gdha
Sorry it was just a guess ;-)

gozora commented at 2017-04-11 07:51:

I'll try to take a closer look later today (unless someone else will solve this ;-)).

jsmeix commented at 2017-04-11 07:54:

First and foremost:
When the root cause is that 'tar' has a bug that is does no longer
handle --acls and --xattr corectly then this hack is just a band aid
and not meant to be a solution. But band aid hacks must be
propoerly documented as described at "Dirty hacks welcome"
in https://github.com/rear/rear/wiki/Coding-Style because
otherwise we get bug reports about ReaR like this one
that are actually no bugs in ReaR.

According to
https://github.com/rear/rear/issues/1175#issuecomment-282334872
the bug is in 'tar' and must be fixed there.

Regarding the dirty NETFS_RESTORE_CAPABILITIES hack:
I would only properly document it but leave it as is.
ReaR is not responsible to fix broken backup tools.

If someone likes to enhance NETFS_RESTORE_CAPABILITIES
to make it better usable I would recommend to make
NETFS_RESTORE_CAPABILITIES an array of directories
that is used for 'getcap -r directory' calls.
By default NETFS_RESTORE_CAPABILITIES could be
empty which means the hack is not done.

When 'tar' does not work as documented it is up to the user
to either get a fixed 'tar' or to manually specify the right
directories in his particular case to work around his broken 'tar'.

jsmeix commented at 2017-04-11 07:58:

Currently I have other work to do - but later as time permits
I may try to enhance NETFS_RESTORE_CAPABILITIES.

@tcerna
what about a fix for 'tar' in Red Hat?

gozora commented at 2017-04-11 08:02:

@jsmeix

NETFS_RESTORE_CAPABILITIES an array of directories
that is used for 'getcap -r directory' calls.
By default NETFS_RESTORE_CAPABILITIES could be
empty which menas the hack is not done.

This would lead back to problem we have currently, considering that user would not search manually for files/file systems containing capabilities and fallback to "/" (root) which would again search all net shares even if excluded from backup ...

V.

gozora commented at 2017-04-11 08:09:

@tcerna can you please provide version of RHEL you are using?

jsmeix commented at 2017-04-11 08:17:

See what I also wrote:

When 'tar' does not work as documented
it is up to the user to either get a fixed 'tar'
or to manually ... work around his broken 'tar'.

ReaR upstream is not there to fix bugs in other tools.

Any user can adapt and enhance our ReaR usptream scripts
according to his particular needs when he must work around
issues in his particular environment.

ReaR may even provide some kind of "template functionality"
to work around issues that is disabled by default
(like NETFS_RESTORE_CAPABILITIES).

But ReaR should not try to automatically work around issues
because when ReaR does something automatically then users
also "automatically" expect that ReaR does everything right.
But that fails more often than not for workarounds.
Any workaround has unwanted side effects.

Strictly speaking nothing at all should be done in ReaR
when a bug is not in ReaR because regardless what
sophisticated workarounds we try to implement in ReaR
we will then get subsequent reports that demand
"ReaR has a bug that needs to be fixed in ReaR".

jsmeix commented at 2017-04-11 08:28:

FYI regarding
https://github.com/rear/rear/issues/1283#issuecomment-293177186
that 'rescue' stage is run before 'backup' stage:

It is a general problem in lib/mkbackup-workflow.sh
that 'backup' stage is usually run last
but sometimes not (i.e. for backup in the ISO image), cf.
https://github.com/rear/rear/issues/1281#issuecomment-292534810

I am already thinking about if another ordering might be better
where 'backup' is not last so that one could do things
even after 'backup', for example things like
https://github.com/rear/rear/issues/1283#issuecomment-293160403

At least there should be one same ordering in any case
to avoid unexpected behaviour is some cases as in
https://github.com/rear/rear/issues/1281#issuecomment-292534810

gozora commented at 2017-04-11 08:29:

I'm fully share your opinion @jsmeix.
But we already have code for save/restore capabilities 610_save_capabilities.sh in ReaR present, and once it was merged it should work.
I personally think that scanning capabilities on "/" is simply not OK.
I'll try to fix this somehow and we can discuss my approach of solving this under PR, like always ;-).

V.

tcerna commented at 2017-04-11 08:31:

@gozora I'm using Red Hat Enterprise Linux Server release 7.3 (Maipo)

gozora commented at 2017-04-11 08:33:

@tcerna I guess Centos 7 is close enough for testing ;-)

jsmeix commented at 2017-04-11 08:34:

@gozora
because I fully agree with you that code that there is in ReaR
should work sufficiently well I wrote that I will try to enhance
the NETFS_RESTORE_CAPABILITIES implementation.
But currently I have no good idea how to make that code
"just work" out of the box - therefore I used my usual fallback
to let the user manually specify what ReaR should do
so that the user knows what he has set up and
ReaR upstream is not "guilty by default".

jsmeix commented at 2017-04-11 09:33:

@tcerna
have a look at my proposal in
https://github.com/rear/rear/pull/1284
whether or not this way you can better work around
the issue in 'tar'?

jsmeix commented at 2017-04-11 09:33:

Now I really need to do my other work...

gdha commented at 2017-04-11 09:37:

@gozora @jsmeix with the sentence "scanning capabilities on "/" is simply not OK" I full agree. I think we could fix this by moving the rescue script to the backup workflow, no? We only need the capabilities file during the restore phase anyway. So, why bother during the rescue phase?
@jsmeix there is a reason why the output workflow is before the backup workflow and that is OBDR (however, is anyone still using that?) - I guess HPE ;-)

jsmeix commented at 2017-04-11 09:41:

@gdha
many thanks for the reason why the 'output' stage
is usually before the 'backup' stage!

jsmeix commented at 2017-04-11 09:46:

@gdha
I think we cannot move rescue/NETFS/default/610_save_capabilities.sh
to the backup stage because the 'output' stage (and the 'build' stage)
is usually before the 'backup' stage which means
/var/lib/rear/recovery/capabilities will not get included in the
output (i.e. it will be missing in the ReaR recovery system).

jsmeix commented at 2017-04-11 09:56:

@gozora
we cannot use your proposal

getcap -r $(cat $TMP_DIR/backup-include.txt) ...

because on my SLES12 test system I get

# cat /tmp/rear.Kg1fzf8DAe2o37x/tmp/backup-include.txt
/

i.e. $( cat $TMP_DIR/backup-include.txt ) evaluates to '/'.
I have nothing specified in local.conf what exactly should
be in the backup - i.e. I use the ReaR default.

gozora commented at 2017-04-11 10:45:

@jsmeix this could work if we use getcap instead of getcap -r
-r should use recursion and collect all files inside "/" (this obviously means everything).
But if we remove -r getcap will stay inside specified file system boundaries so every other mount point should be skipped.

V.

gozora commented at 2017-04-11 11:01:

Something like this worked fine for me, to search only "/" (without crossing mount points)

# getcap /* 2>/dev/null  
/ping = cap_net_raw+ep

V.

jsmeix commented at 2017-04-11 12:18:

For me that does not work:

# getcap -r / 2>/dev/null
/usr/lib/gstreamer-1.0/gst-ptp-helper = cap_net_bind_service+ep
/usr/bin/ping6 = cap_net_raw+ep
/usr/bin/ping = cap_net_raw+ep

# getcap /* 2>/dev/null
[no output] 

gozora commented at 2017-04-11 12:20:

@jsmeix
Don't you have /usr as separate mount point?
If so

getcap /usr/* 2>/dev/null

Should be right thing to do ...

V.

jsmeix commented at 2017-04-11 14:50:

No, I have a single ext4 file system.

Even 'getcap /usr/* 2>/dev/null' does not work for me:

# getcap /usr/* 2>/dev/null
[no output]

in contrast to 'getcap -r /usr' which works

# getcap -r /usr 2>/dev/null
/usr/lib/gstreamer-1.0/gst-ptp-helper = cap_net_bind_service+ep
/usr/bin/ping6 = cap_net_raw+ep
/usr/bin/ping = cap_net_raw+ep

jsmeix commented at 2017-04-11 14:52:

With https://github.com/rear/rear/pull/1284 merged
NETFS_RESTORE_CAPABILITIES is more flexible to be used.
See default.conf how it can be used.
In particular on systems with zillions of files the user can
now explicitly specify on what directories 'getcap -r'
should be run to avoid that 'getcap -r /' runs too long.

jsmeix commented at 2017-04-12 14:11:

@tcerna
out of curiosity I am interested to learn why
on your particular system with NFS mounts
'getcap -r /' needs more than 8 hours.

On my test sytem without NFS mounts
I have all regular files in one ext4 filesystem

# find / -xdev | wc -l
125546

# time getcap -r / &>/dev/null
real    0m4.427s
user    0m0.126s
sys     0m2.246s

Now I mounted for testing a whole /usr/ from
another machine (107780 files) via NFS

# mount -v -t nfs -o nfsvers=3,nolock 10.160.4.244:/usr /tmp/mp

# time getcap -r / &>/dev/null
real    0m13.357s
user    0m0.285s
sys     0m8.564s

# find /tmp/mp -xdev | wc -l
107780

For me 107780 additional files mounted via NFS
make 'getcap -r /' somewhat slower but still
it completes within seconds.

Now I mount the same NFS share two additional times

# mount -v -t nfs -o nfsvers=3,nolock 10.160.4.244:/usr /tmp/mp2

# mount -v -t nfs -o nfsvers=3,nolock 10.160.4.244:/usr /tmp/mp3

# time getcap -r / &>/dev/null
real    0m18.393s
user    0m0.440s
sys     0m10.964s

I wonder what goes on on your system with NFS mounts
that could make 'getcap -r /' running for hours?

From my above tests it seems the current default
to run 'getcap -r /' is usually sufficiently fast.

For exceptional cases the user can now
do appropriate exceptional setup in ReaR.


[Export of Github issue for rear/rear.]