#2808 PR merged: Exclude dev/watchdog* from recovery system

Labels: enhancement, fixed / solved / done

jsmeix opened issue at 2022-05-17 12:52:

In default.conf add dev/watchdog* to COPY_AS_IS_EXCLUDE
because /dev/watchdog* functionality is not wanted
in the ReaR rescue/recovery system: we do not want
any automated reboot while disaster recovery
happens via "rear recover".
Furthermore, having dev/watchdog* during "rear mkrescue"
may even trigger a system "crash" caused by the
TrendMicro ds_am module touching dev/watchdog
in ReaR's build area (/var/tmp/rear.XXX/rootfs).
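
For reference, the resulting line in default.conf would then look roughly like this (a sketch based on the current default value quoted further below; the exact formatting in default.conf may differ):

COPY_AS_IS_EXCLUDE=( $VAR_DIR/output/\* dev/.udev dev/shm dev/shm/\* dev/oracleasm dev/mapper dev/watchdog\* )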

gdha commented at 2022-05-18 07:58:

@jsmeix Be careful, by adding this device to COPY_AS_IS_EXCLUDE it is also excluded from the tar archive (with BACKUP=NETFS). We need to recreate this device after recovery. Perhaps, with an extra small script?
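
A minimal sketch of what such an extra finalize script could do (hypothetical, not part of this PR; it assumes the usual watchdog device numbers shown later in this thread, major 10 / minor 130, and ReaR's TARGET_FS_ROOT):

# recreate /dev/watchdog in the recreated system if the restored backup lacks it
test -c $TARGET_FS_ROOT/dev/watchdog || mknod -m 0600 $TARGET_FS_ROOT/dev/watchdog c 10 130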

gdha commented at 2022-05-18 08:00:

FYI - TrendMicro is working on a hotfix to prevent crashes with their product due to /dev/watchdog.

jsmeix commented at 2022-05-18 11:10:

@gdha
I don't find any code that would also exclude things
in COPY_AS_IS_EXCLUDE from the tar archive (with BACKUP=NETFS):

# find usr/sbin/rear usr/share/rear -type f | xargs grep 'COPY_AS_IS_EXCLUDE' | egrep -v 'COPY_AS_IS_EXCLUDE_|COPY_AS_IS_EXCLUDE\+=| *#'

usr/share/rear/build/GNU/Linux/100_copy_as_is.sh:Log "Files being excluded: ${COPY_AS_IS_EXCLUDE[@]}"
usr/share/rear/build/GNU/Linux/100_copy_as_is.sh:for excluded_file in "${COPY_AS_IS_EXCLUDE[@]}" ; do
usr/share/rear/build/GNU/Linux/100_copy_as_is.sh:    Error "Failed to copy files and directories in COPY_AS_IS minus COPY_AS_IS_EXCLUDE"
usr/share/rear/build/GNU/Linux/100_copy_as_is.sh:Log "Finished copying files and directories in COPY_AS_IS minus COPY_AS_IS_EXCLUDE"

usr/share/rear/conf/default.conf:COPY_AS_IS_EXCLUDE=( $VAR_DIR/output/\* dev/.udev dev/shm dev/shm/\* dev/oracleasm dev/mapper )

As far as I see, build/GNU/Linux/100_copy_as_is.sh
is the only place where COPY_AS_IS_EXCLUDE is evaluated,
namely in this code (excerpts):

local copy_as_is_exclude_file="$TMP_DIR/copy-as-is-exclude"
...
for excluded_file in "${COPY_AS_IS_EXCLUDE[@]}" ; do
    echo "$excluded_file"
done >$copy_as_is_exclude_file
...
if ! tar -v -X $copy_as_is_exclude_file -P -C / -c ${COPY_AS_IS[*]} 2>$copy_as_is_filelist_file | tar $v -C $ROOTFS_DIR/ -x 1>/dev/null ; then
    Error "Failed to copy files and directories in COPY_AS_IS minus COPY_AS_IS_EXCLUDE"
fi

where copy_as_is_exclude_file only appears in
build/GNU/Linux/100_copy_as_is.sh
and the file name copy-as-is-exclude also only appears in
build/GNU/Linux/100_copy_as_is.sh

So I see no relation between COPY_AS_IS_EXCLUDE
and what is in the backup.

I think if there was a relation between COPY_AS_IS_EXCLUDE
and what is in the backup it would be a severe bug
(files unexpectedly suppressed from the backup)
because COPY_AS_IS_EXCLUDE should only pertain to
what gets excluded from the ReaR recovery system.
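
(For contrast: what gets excluded from the backup with BACKUP=NETFS is controlled by a separate variable, typically set by users in etc/rear/local.conf along these lines; shown here only as an illustration, not as part of this PR:)

BACKUP_PROG_EXCLUDE+=( '/media/*' '/mnt/*' )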

jsmeix commented at 2022-05-18 11:26:

@gdha
do you perhaps remember our since-removed
finalize/default/100_populate_dev.sh
cf. https://github.com/rear/rear/issues/2045#issuecomment-464737610
which we now have in a modified form in
finalize/default/110_bind_mount_proc_sys_dev_run.sh
https://github.com/rear/rear/blob/master/usr/share/rear/finalize/default/110_bind_mount_proc_sys_dev_run.sh
as a workaround for older systems where bind-mounting
/dev on TARGET_FS_ROOT/dev does not work,
cf. https://github.com/rear/rear/pull/2113
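
For context, the core of that bind-mount approach is roughly the following (a simplified sketch, not the actual script, which per the above also has to handle older systems where the bind mount of /dev does not work):

for mount_point in proc sys dev run ; do
    mkdir -p $TARGET_FS_ROOT/$mount_point
    mount --bind /$mount_point $TARGET_FS_ROOT/$mount_point
done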

pcahyna commented at 2022-05-18 11:38:

Sorry, I still don't understand how the issue arises and how this PR fixes it. What is the process triggering the crash? From the description of the issue it looks like there is a process opening the device file and not closing it, but how is ReaR doing it? usr/share/rear/build/GNU/Linux/100_copy_as_is.sh seems to copy files using tar -c ... | tar -x, and tar does not open device files (I checked). If there is a process opening device special files, it can lead to other problems (IIRC tapes can get rewound if you open and close the corresponding device, for example).

(The issue and the comments mention the TrendMicro ds_am module; sorry for my ignorance, I don't know what it is or why it is related.)
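
(One way to double-check the "tar does not open device files" observation, assuming strace is available; when tar archives a character device it records only the stat() metadata, so no open of dev/watchdog is expected to show up:)

strace -e trace=open,openat tar -cf /dev/null -C / dev/watchdog 2>&1 | grep 'dev/watchdog'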

gdha commented at 2022-05-18 11:56:

@pcahyna

> Sorry, I still don't understand how the issue arises and how this PR fixes it. What is the process triggering the crash? From the description of the issue it looks like there is a process opening the device file and not closing it, but how is ReaR doing it? usr/share/rear/build/GNU/Linux/100_copy_as_is.sh seems to copy files using tar -c ... | tar -x, and tar does not open device files (I checked). If there is a process opening device special files, it can lead to other problems (IIRC tapes can get rewound if you open and close the corresponding device, for example).
>
> (The issue and the comments mention the TrendMicro ds_am module; sorry for my ignorance, I don't know what it is or why it is related.)

The tar command archives the complete /dev into the rescue image, which is sufficient for the TrendMicro ds_am module to trigger the crash. TrendMicro acknowledged that this is a bug and they are working on a fix.

jsmeix commented at 2022-05-18 11:59:

I made this PR only based on what I read on
https://github.com/rear/rear/issues/2798

In particular I do not understand what "crash" means therein.
As far as I understand what I read about /dev/watchdog
an automated reboot happens when there is no write
to /dev/watchdog for a certain time.
So I guess there was no actual "crash" in
https://github.com/rear/rear/issues/2798
but only that the automated reboot happened all of a sudden,
which looks to the user as if the system had crashed.
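
(A rough illustration of the general Linux watchdog contract, not ReaR code: once something opens the device it has to keep writing to it, otherwise the timer expires and the machine reboots.)

exec 3>/dev/watchdog      # opening the device arms the watchdog timer
while true ; do
    echo . >&3            # each write resets ("feeds") the timer
    sleep 10
done
# with many drivers, writing 'V' and closing the fd disarms the watchdog again
# (unless the kernel/driver is configured with "nowayout")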

Regarding "TrendMicro ds_am module touch dev/watchdog
under the /tmp/rear.XXXX directory":
On my homeoffice laptop after a "rear mkrescue" (yesterday)
without the change in this PR I have

# ls -l /var/tmp/rear.IuB8N948ZgRegGv/rootfs/dev/watchdog*
crw------- 1 root root  10, 130 May 17 08:45 /var/tmp/rear.IuB8N948ZgRegGv/rootfs/dev/watchdog
crw------- 1 root root 248,   0 May 17 08:45 /var/tmp/rear.IuB8N948ZgRegGv/rootfs/dev/watchdog0

which are the same device major and minor numbers
as on my original system

# ls -l /dev/watchdog*
crw------- 1 root root  10, 130 May 18 09:55 /dev/watchdog
crw------- 1 root root 248,   0 May 18 09:55 /dev/watchdog0

so the /dev/watchdog* nodes in /var/tmp/rear.XXX/rootfs
are device nodes that behave the same as the original ones.
Accordingly, when some strange software writes to
a /dev/watchdog* in /var/tmp/rear.XXX/rootfs
and then leaves it as is (i.e. no further writes),
the automated watchdog reboot happens.
At least that is how I imagine what actually happens.
But I am not at all a watchdog expert - I learned about
its existence via https://github.com/rear/rear/issues/2798

gdha commented at 2022-05-18 13:45:

@jsmeix A "crash" means a system panic - see https://github.com/rear/rear/issues/2798#issuecomment-1119659939

jsmeix commented at 2022-05-19 10:58:

A side note only for the "fun" of it:
It seems the TrendMicro cyber security software
sometimes messes around too much with kernel stuff
so that the only way out for the kernel is to give up, as in
https://access.redhat.com/solutions/3100321
and it seems things don't always improve when several
security software products are running concurrently, as in
https://www.ibm.com/support/pages/ibm-security-guardium-potential-linux-kernel-reboot-when-running-trendmicro-deep-security-agent-and-guardium-stap-same-linux-server

jsmeix commented at 2022-05-19 11:01:

I think the crucial question here is:
Are /dev/watchdog* of any use inside the recovery system or not?

gdha commented at 2022-05-19 11:59:

> I think the crucial question here is: Are /dev/watchdog* of any use inside the recovery system or not?

The watchdog devices are not required during recovery. However, we have to make sure that these devices are recovered (by the backup) or recreated somehow.

jsmeix commented at 2022-05-19 13:59:

I think nowadays /dev/ is only a mountpoint
(i.e. /dev/ itself is initially an empty directory)
and it is the special filesystem mounted at /dev/
which "somehow magically" lets device nodes appear therein
(e.g. devtmpfs mounted at /dev/ together with udev).

So there should be no need to specially care about
the device nodes in the rebooted recreated system
(neither in the backup nor by having ReaR recreate them).

All that is needed is that the recreated system mounts
"the right thing" at /dev/, the same as on the original system.
I think there is nothing that ReaR has to actively do
during "rear recover" to make that happen.
Only restoring all basic system files from the backup
(in particular kernel, udev, init/systemd, and so on)
should be sufficient so that after reboot the recreated
system will again mount "the right thing" at /dev/.

But I am not an expert regarding populating /dev/
so I may misunderstand things.

Cf.
https://unix.stackexchange.com/questions/619589/is-it-necessary-to-mount-devtmpfs-with-etc-fstab
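
(A quick way to see what is mounted at /dev/ on a given system; the output shown is just an example and differs per system:)

# findmnt /dev
TARGET SOURCE   FSTYPE   OPTIONS
/dev   devtmpfs devtmpfs rw,nosuid,...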

gdha commented at 2022-05-19 14:06:

@jsmeix I don't know if the device will automatically be recreated or not, not without testing it out. However, these were production systems which cannot be used for such testing. On the other hand, a missing device can easily be recreated by hand.

pcahyna commented at 2022-05-19 17:10:

For me another question is: do we want to make changes that do not fix any issue in ReaR (nor in the OS or backup tool that ReaR is intended to integrate with), but rather work around an issue in a 3rd-party tool that has nothing to do with ReaR? I can see the value for users, but on the other hand such workarounds clutter and complicate the code. (Fortunately this change is simple and also well documented, so I am inclined to say yes, but I am not entirely sure it is a good idea in general.)

gdha commented at 2022-05-20 06:36:

Perhaps we had better write some proper documentation around it, as it is indeed not a flaw in ReaR. ReaR only suffers from it.

jsmeix commented at 2022-05-20 07:00:

@pcahyna
yes, we do want to make changes in ReaR
that do not fix any issue in ReaR itself
but that avoid bugs in whatever non-ReaR stuff,
see the section "Dirty hacks welcome" in
https://github.com/rear/rear/wiki/Coding-Style

jsmeix commented at 2022-05-20 07:12:

Now the question is whether or not
the change in this pull request is a dirty hack.

My current point of view is:

In general /dev/watchdog* inside the recovery system
does not make sense, so it should be excluded by default.

The issue https://github.com/rear/rear/issues/2798
was only the reason to think about whether or not
/dev/watchdog* should be in the recovery system.

So the change in this pull request is not a dirty hack
and I made my comments in the code accordingly
(i.e. first the generic reason, then the special case
where it avoids a crash under special circumstances).

I would not have made this pull request if I thought that
in general /dev/watchdog* should be in the recovery system,
because then I would not exclude it from the recovery system
in default.conf but would leave it to the specific users
of the buggy TrendMicro cyber security software
to deal with it specifically on their own via

COPY_AS_IS_EXCLUDE+=( dev/watchdog\* )

in their etc/rear/local.conf

jsmeix commented at 2022-05-20 07:14:

@gdha @pcahyna
if you regard the change in this pull request as a dirty hack
I think this pull request should not be merged.

jsmeix commented at 2022-05-20 07:23:

The only case I could currently imagine where /dev/watchdog*
could even be useful in the recovery system
is some kind of unattended "rear recover"
where a user has also set up some additional special things
in his recovery system to let it automatically reboot
when his recovery system hangs or becomes unresponsive.
Because ReaR does not provide any watchdog functionality,
such a use case would again be some "third-party issue"
(here some additional third-party watchdog functionality).
I think even then excluding /dev/watchdog* in default.conf
should still be OK because we cannot make default.conf
right for every possible additional special setup.

jsmeix commented at 2022-05-23 08:29:

I overestimated this pull request by treating it as a "blocker"
(i.e. something we won't release ReaR 2.7 without a solution for).

Actually it cannot be a blocker when an issue is caused
by some special (and not often used) third-party software,
in particular not when a solution exists for how affected users
can avoid having the bug in their third-party software affect ReaR.

In general it is not ReaR's job to avoid (or work around)
the "bad things" that a bug in third-party software may cause.

jsmeix commented at 2022-05-25 11:53:

I tested it myself:
The only difference in /var/tmp/rear.XXX/rootfs/dev is
that dev/watchdog and dev/watchdog0 are no longer there.
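
(A hypothetical way to make such a comparison; the two rear.* build directory names below are placeholders, one from a run without and one from a run with the change, and differ per run:)

# diff <(ls /var/tmp/rear.BEFORE/rootfs/dev) <(ls /var/tmp/rear.AFTER/rootfs/dev)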


[Export of Github issue for rear/rear.]