#3027 PR merged: Make sure rescue contains all COPY_AS_IS files

Labels: bug, fixed / solved / done

rmetrich opened issue at 2023-07-19 08:18:

Relax-and-Recover (ReaR) Pull Request Template

Please fill in the following items before submitting a new pull request:

Pull Request Details:
  • Type: Bug Fix

  • Impact: Normal

  • Reference to related issue (URL): N/A

  • How was this pull request tested?

Using an automated reproducer implemented using systemtap

  1. Create a file in /dev which will be embedded in the rescue environment

    # dd if=/dev/urandom of=/dev/shrinking bs=10K count=3
    3+0 records in
    3+0 records out
    30720 bytes (31 kB, 30 KiB) copied, 0.000689566 s, 44.5 MB/s
    
  2. Execute the systemtap script in charge of shrinking the file while being copied

    # stap -v -g ./shrinking.stp
    [...]
    Pass 5: starting run.
    
  3. Execute rear mkrescue from another terminal

  4. Brief description of the changes in this pull request:

The files copied as-is in the rescue are copied using a tar -c | tar -x command.
It may happen that inodes are shrinking during the copy (e.g. /dev/mqueue/nnsc (HPE device)), which can cause the tar -x command to stop extracting the next files, due to the padding zeros inserted in the shrank inode.
This leads to an error when checking the integrity of the rescue environment.

The solution is to add -i option when extracting, to continue the processing.

rmetrich commented at 2023-07-19 08:21:

Without the fix:

ERROR: ReaR recovery system in '/var/tmp/rear.tpfZyNy6ayS53wP/rootfs' not usable (required libraries are missing)
Some latest log messages since the last called script 990_verify_rootfs.sh:
  wipefs is /bin/wipefs
  mkfs is /bin/mkfs
  mkfs.xfs is /bin/mkfs.xfs
  xfs_admin is /bin/xfs_admin
  mkswap is /bin/mkswap
  cryptsetup is /bin/cryptsetup
  dmsetup is /bin/dmsetup
  ldconfig is /bin/ldconfig

The reason is everything after /dev/shrinking is not copied, because tar -x stopped silently (without error, believing the end of the archive was reached):

# grep -A 2 shrinking /var/tmp/rear.tpfZyNy6ayS53wP/tmp/copy-as-is-filelist 
/dev/shrinking
tar: /dev/shrinking: File shrank by 5120 bytes; padding with zeros
/dev/vcsa6
/dev/vcs6

# ls -l /var/tmp/rear.tpfZyNy6ayS53wP/rootfs/dev/shrinking /var/tmp/rear.tpfZyNy6ayS53wP/rootfs/dev/vcsa6
ls: cannot access '/var/tmp/rear.tpfZyNy6ayS53wP/rootfs/dev/vcsa6': No such file or directory
-rw-r--r--. 1 root root 30720 Jul 19 10:12 /var/tmp/rear.tpfZyNy6ayS53wP/rootfs/dev/shrinking

rmetrich commented at 2023-07-19 08:22:

systemtap shrinking.stp script:

global openat
global to_catch
global shrinking_fd
global nread

probe syscall.openat {
    if (execname() != "tar") next
    if (filename_unquoted != "shrinking") next
    to_catch[tid()] = 1
    openat[tid()] = 1
}

probe syscall.openat.return {
    if (! openat[tid()]) next
    delete openat[tid()]
    shrinking_fd = retval
    nread = 0
    printf("FD is %ld\n", retval)
}

probe syscall.close {
    if (! to_catch[tid()]) next
    if (fd != shrinking_fd) next
    delete to_catch[tid()]
}

probe syscall.read {
    if (! to_catch[tid()]) next
    if (fd != shrinking_fd) next
    nread++
    if (nread < 3) next
    printf("Truncating the file\n")
    system("truncate -s 25K /dev/shrinking")
    mdelay(2000)
    exit()
}

gdha commented at 2023-07-19 08:33:

@rmetrich Hi - is this related to Message Queue brokers? Could you say something about the background of this issue? I also saw from time to time that dd seems to hang, but this could be the same issue as I saw it on kubernetes nodes with RabbitMQ pods...

rmetrich commented at 2023-07-19 08:47:

@rmetrich Hi - is this related to Message Queue brokers? Could you say something about the background of this issue? I also saw from time to time that dd seems to hang, but this could be the same issue as I saw it on kubernetes nodes with RabbitMQ pods...

Yes this was seen with /dev/mqueue/nnsc device node, something shipped by HPE apparently. One of my customer constantly hits the issue when building the ISO.

pcahyna commented at 2023-07-19 09:00:

Is /dev/mqueue/nnsc really a device node, or a regular file? I think the content of device special files should not be copied - only the inode (metadata) ...

rmetrich commented at 2023-07-19 09:44:

Is /dev/mqueue/nnsc really a device node, or a regular file? I think the content of device special files should not be copied - only the inode (metadata) ...

It seems to be a regular file:

/dev/mqueue:
total 0
drwxrwxrwt.  2 0 0   60 Jun 17 20:04 .
drwxr-xr-x. 23 0 0 5740 Jun 18 19:30 ..
---x--x--T.  1 0 0   80 Jun 21 13:30 nnsc

But none of this really matters, all we must take care of is archives with zero padding can be extracted properly.

jsmeix commented at 2023-07-19 11:38:

@rmetrich
thank you for your prompt explanatory comment!

jsmeix commented at 2023-07-19 11:38:

@rear/contributors
I would like to merge it tomorrow afternoon
unless there are objections

jsmeix commented at 2023-07-19 11:50:

Only for the log (for completeness):

I will never fully understand how 'tar' works.
Neither the old GNU tar version 1.15.1 man page (see above)
nor the newer GNU tar version 1.34 man page (in openSUSE Leap 15.4)

-i, --ignore-zeros
Ignore zeroed blocks in archive.
Normally two consecutive 512-blocks filled with zeroes mean EOF
and tar stops reading after encountering them.
This option instructs it to read further and
is useful when reading archives created with the -A option.

explain why "blocks filled with zeroes mean EOF" in 'tar'.
At least for me this is unexpected behaviour of a program
when reading a sequence of zero bytes means EOF
so "sequence of zero bytes" == "end of file" ?

rmetrich commented at 2023-07-19 11:51:

@rmetrich nice find and thanks for making ReaR more robust in general.

Beyond that, does the rescue system actually need those mqueue files? If not then I'd suggest to also exclude them from the rescue image, as it will debloat or unclutter the rescue image.

If this is a common thing then we can also add it to the default exclude list, IMHO.

It's probably not useful, as most of the devices in /dev are (knowing that all is supposed to be generated, on RHEL at least).

jsmeix commented at 2023-07-19 12:00:

Related to this pull request
but not actually belonging to this pull request:

We copy by default all in /dev/ into the recovery system

# find /dev | wc -l
732

# usr/sbin/rear -D mkrescue
...

# find /var/tmp/rear.nSzlPYGQ1Nk9v9w/rootfs/dev | wc -l
728

because of

COPY_AS_IS+=( /dev ...

in usr/share/rear/conf/GNU/Linux.conf
https://github.com/rear/rear/blob/master/usr/share/rear/conf/GNU/Linux.conf#L231
which is there since the beginning according to

# git log --follow -p usr/share/rear/conf/GNU/Linux.conf

I made a separated issue
https://github.com/rear/rear/issues/3028

jsmeix commented at 2023-07-20 13:16:

@rmetrich
thank you for your problem analysis and your fix!

It is much appreciated to get ReaR working
more fail-safe even in corner cases like this one.


[Export of Github issue for rear/rear.]