#2798 Issue closed
: TrendMicro ds_am module cause system crash by touching dev/watchdog in ReaR's build area¶
Labels: support / question
, fixed / solved / done
,
external tool
, not ReaR / invalid
lecce17 opened issue at 2022-05-05 12:20:¶
- ReaR version ("/usr/sbin/rear -V"):
# rear -V
Relax-and-Recover 2.4 / Git
- OS version ("cat /etc/os-release" or "lsb_release -a" or "cat /etc/rear/os.conf"):
# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.9 (Maipo)
- ReaR configuration files ("cat /etc/rear/site.conf" and/or "cat /etc/rear/local.conf"):
OUTPUT=ISO
BACKUP_URL=nfs://<hostname>.<domain>/rear
BACKUP=NETFS
ONLY_INCLUDE_VG=sys
BACKUP_PROG_EXCLUDE=( "${BACKUP_PROG_EXCLUDE[@]}" '/dev' )
-
Hardware vendor/product (PC or PowerNV BareMetal or ARM) or VM (KVM guest or PowerVM LPAR):
HP DL380 Gen9 -
System architecture (x86 compatible or PPC64/PPC64LE or what exact ARM device):
Intel x86_64 -
Firmware (BIOS or UEFI or Open Firmware) and bootloader (GRUB or ELILO or Petitboot):
-
Storage (local disk or SSD) and/or SAN (FC or iSCSI or FCoE) and/or multipath (DM or NVMe):
local SAS Disks -
Storage layout ("lsblk -ipo NAME,KNAME,PKNAME,TRAN,TYPE,FSTYPE,LABEL,SIZE,MOUNTPOINT"):
# lsblk -ipo NAME,KNAME,PKNAME,TRAN,TYPE,FSTYPE,LABEL,SIZE,MOUNTPOINT
NAME KNAME PKNAME TRAN TYPE FSTYPE LABEL SIZE MOUNTPOINT
/dev/sda /dev/sda sas disk 279.4G
|-/dev/sda1 /dev/sda1 /dev/sda part xfs 500M /boot
`-/dev/sda2 /dev/sda2 /dev/sda part LVM2_member 278.9G
|-/dev/mapper/sys-root /dev/dm-0 /dev/sda2 lvm xfs 9.8G /
|-/dev/mapper/sys-swap /dev/dm-1 /dev/sda2 lvm swap 17.9G [SWAP]
|-/dev/mapper/sys-home /dev/dm-9 /dev/sda2 lvm xfs 8G /home
|-/dev/mapper/sys-tmp /dev/dm-10 /dev/sda2 lvm xfs 3.9G /tmp
|-/dev/mapper/sys-var /dev/dm-11 /dev/sda2 lvm xfs 40G /var
|-/dev/mapper/sys-opt /dev/dm-12 /dev/sda2 lvm xfs 7G /opt
|-/dev/mapper/sys-opt_oracle /dev/dm-13 /dev/sda2 lvm xfs 30G /opt/oracle
`-/dev/mapper/sys-crash /dev/dm-14 /dev/sda2 lvm xfs 100G /var/crash
/dev/sdb /dev/sdb sas disk LVM2_member 279.4G
|-/dev/mapper/vg01-database /dev/dm-4 /dev/sdb lvm xfs 200G /database
`-/dev/mapper/vg01-oractlvg01 /dev/dm-5 /dev/sdb lvm xfs 10G /oractlvg01
/dev/sdc /dev/sdc sas disk LVM2_member 279.4G
|-/dev/mapper/vg02-redo /dev/dm-2 /dev/sdc lvm xfs 50G /redo
`-/dev/mapper/vg02-oractlvg02 /dev/dm-3 /dev/sdc lvm xfs 10G /oractlvg02
/dev/sdd /dev/sdd sas disk LVM2_member 279.4G
|-/dev/mapper/vg03-traces /dev/dm-6 /dev/sdd lvm xfs 200G /traces
|-/dev/mapper/vg03-oractlvg03 /dev/dm-7 /dev/sdd lvm xfs 10G /oractlvg03
`-/dev/mapper/vg03-import /dev/dm-8 /dev/sdd lvm xfs 16G /import
- Description of the issue (ideally so that others can reproduce it):
After starting a backup with "rear mkbackup" the server is crashing
because rear does not close the /etc/watchdog correctly
May 5 11:29:02 saco3 kernel: watchdog: watchdog0: watchdog did not stop!
May 5 11:29:02 saco3 kernel: watchdog: watchdog0: watchdog did not stop!
We already opened a RedHat Case and the analyzed the vmcore.
They suggested to blacklist the /dev directory
what i did but unfortunately it didn't help.
-
Workaround, if any:
-
Attachments, as applicable ("rear -D mkrescue/mkbackup/recover" debug log files):
To paste verbatim text like command output or file content,
include it between a leading and a closing line of three backticks like
```
verbatim content
```
jsmeix commented at 2022-05-05 13:22:¶
Because
# find usr/sbin/rear usr/share/rear -type f | xargs grep -i 'watchdog'
shows nothing (i.e. watchdog
does not appear in any of ReaR's
scripts)
I assume ReaR does not do anything directly with /dev/watchdog
but instead something that is called by ReaR does something
with /dev/watchdog - but that is only a blind guess.
@lecce17
to proceed here at ReaR upstream have a look at the section
"Debugging issues with Relax-and-Recover" in
https://en.opensuse.org/SDB:Disaster_Recovery
Alternatively proceed at Red Hat with your case there.
gdha commented at 2022-05-05 14:17:¶
@lecce17 I know of an open case at HPE on the same issue, but so far no update. Did you update to the latest firmware already? I would advise to open a software case at HPE as it is primary HPE software resulting in a (panic) reboot.
pcahyna commented at 2022-05-05 14:21:¶
as ReaR itself apparently does not open /dev/watchdog
, I believe it
would be useful to do a lsof
or fuser
just before the machine
crashes to see which process has the device open.
gdha commented at 2022-05-06 14:04:¶
Please find the following findings from HPE Engineering team:
The good news here is that the vmcore provides definite confirmation that this is happening during ReaR backup. We can see the rear job running. This is the process ancestry:
PID: 0 TASK: ffffffff81a8d020 CPU: 0 COMMAND: "swapper"
PID: 1 TASK: ffff881029867520 CPU: 3 COMMAND: "init"
PID: 18695 TASK: ffff880fbca1aab0 CPU: 26 COMMAND: "anacron"
PID: 4108 TASK: ffff88201b0f1520 CPU: 11 COMMAND: "run-parts"
PID: 4114 TASK: ffff88201b2ad520 CPU: 19 COMMAND: "run-parts"
PID: 4117 TASK: ffff880fbd4f3520 CPU: 1 COMMAND: "rear"
PID: 25151 TASK: ffff882006b40ab0 CPU: 0 COMMAND: "gzip"
Looking at the open files for a few of these processes shows that rear is being run from /etc/cron.weekly, and it is writing to a log file named /var/log/rear/rear-ch01erp7003.log:
crash64> files 4114 4117
PID: 4114 TASK: ffff88201b2ad520 CPU: 19 COMMAND: "run-parts"
ROOT: / CWD: /
FD FILE DENTRY INODE TYPE PATH
0 ffff881026cb22c0 ffff881028dc2c80 ffff8820293dfa38 CHR /dev/null
1 ffff8820287dbd80 ffff881c52af5ec0 ffff882011969358 FIFO
2 ffff8820287dbd80 ffff881c52af5ec0 ffff882011969358 FIFO
255 ffff880ff4782c00 ffff88075f696500 ffff88075f669d98 REG /etc/cron.weekly/rear
PID: 4117 TASK: ffff880fbd4f3520 CPU: 1 COMMAND: "rear"
ROOT: / CWD: /tmp/rear.hdfOL5g3ErJwMBN/rootfs
FD FILE DENTRY INODE TYPE PATH
0 ffff881026cb22c0 ffff881028dc2c80 ffff8820293dfa38 CHR /dev/null
1 ffff880fb75cf6c0 ffff881028dc2c80 ffff8820293dfa38 CHR /dev/null
2 ffff880fb9b132c0 ffff8802668e3d80 ffff881018b908d0 REG /var/log/rear/rear-ch01erp7003.log
7 ffff880fb75cf6c0 ffff881028dc2c80 ffff8820293dfa38 CHR /dev/null
8 ffff880fbbd59240 ffff881028dc2c80 ffff8820293dfa38 CHR /dev/null
255 ffff8820254905c0 ffff880fb174b6c0 ffff880fb174e8d0 REG /usr/sbin/rear
The hpwdt
modules are not the issue here – it’s /dev/watchdog
that
they need to prevent rear from accessing.
lecce17 commented at 2022-05-06 14:24:¶
as ReaR itself apparently does not open
/dev/watchdog
, I believe it would be useful to do alsof
orfuser
just before the machine crashes to see which process has the device open.
Redhat recommended to use SystemTap to monitor who access /dev/watchdog. I've tried to do it with lsof and fuser but i wasn't lucky to capture the pid.
lecce17 commented at 2022-05-06 14:29:¶
Please find the following findings from HPE Engineering team:
The good news here is that the vmcore provides definite confirmation that this is happening during ReaR backup. We can see the rear job running. This is the process ancestry:
PID: 0 TASK: ffffffff81a8d020 CPU: 0 COMMAND: "swapper" PID: 1 TASK: ffff881029867520 CPU: 3 COMMAND: "init" PID: 18695 TASK: ffff880fbca1aab0 CPU: 26 COMMAND: "anacron" PID: 4108 TASK: ffff88201b0f1520 CPU: 11 COMMAND: "run-parts" PID: 4114 TASK: ffff88201b2ad520 CPU: 19 COMMAND: "run-parts" PID: 4117 TASK: ffff880fbd4f3520 CPU: 1 COMMAND: "rear" PID: 25151 TASK: ffff882006b40ab0 CPU: 0 COMMAND: "gzip"
Looking at the open files for a few of these processes shows that rear is being run from /etc/cron.weekly, and it is writing to a log file named /var/log/rear/rear-ch01erp7003.log:
crash64> files 4114 4117 PID: 4114 TASK: ffff88201b2ad520 CPU: 19 COMMAND: "run-parts" ROOT: / CWD: / FD FILE DENTRY INODE TYPE PATH 0 ffff881026cb22c0 ffff881028dc2c80 ffff8820293dfa38 CHR /dev/null 1 ffff8820287dbd80 ffff881c52af5ec0 ffff882011969358 FIFO 2 ffff8820287dbd80 ffff881c52af5ec0 ffff882011969358 FIFO 255 ffff880ff4782c00 ffff88075f696500 ffff88075f669d98 REG /etc/cron.weekly/rear PID: 4117 TASK: ffff880fbd4f3520 CPU: 1 COMMAND: "rear" ROOT: / CWD: /tmp/rear.hdfOL5g3ErJwMBN/rootfs FD FILE DENTRY INODE TYPE PATH 0 ffff881026cb22c0 ffff881028dc2c80 ffff8820293dfa38 CHR /dev/null 1 ffff880fb75cf6c0 ffff881028dc2c80 ffff8820293dfa38 CHR /dev/null 2 ffff880fb9b132c0 ffff8802668e3d80 ffff881018b908d0 REG /var/log/rear/rear-ch01erp7003.log 7 ffff880fb75cf6c0 ffff881028dc2c80 ffff8820293dfa38 CHR /dev/null 8 ffff880fbbd59240 ffff881028dc2c80 ffff8820293dfa38 CHR /dev/null 255 ffff8820254905c0 ffff880fb174b6c0 ffff880fb174e8d0 REG /usr/sbin/rear
The
hpwdt
modules are not the issue here – it’s/dev/watchdog
that they need to prevent rear from accessing.
Similar answer from RedHat. The “watchdog: watchdog0: watchdog did not stop!” comes from watchdog_release(). This happens when the the /dev/watchdog file was closed unexpectedly. And to close the file correctyl it is needed to send a specific magic character 'V'.
gdha commented at 2022-05-13 14:53:¶
The crash was caused by TrendMicro ds_am module touch dev/watchdog
under the /tmp/rear.XXXX directory (the inital ramdisk creation
temporary location).
Test were performed with
COPY_AS_IS_EXCLUDE=( ${COPY_AS_IS_EXCLUDE[@]} dev/watchdog\* )
with
success - meaning no crash!
lecce17 commented at 2022-05-13 15:16:¶
The crash was caused by TrendMicro ds_am module touch dev/watchdog under the /tmp/rear.XXXX directory (the inital ramdisk creation temporary location). Test were performed with
COPY_AS_IS_EXCLUDE=( ${COPY_AS_IS_EXCLUDE[@]} dev/watchdog\* )
with success - meaning no crash!
Very interesting, since we also use Trendmicro. I will test the new setting next week and give you an update.
jsmeix commented at 2022-05-16 07:23:¶
@gdha @pcahyna
should we add dev/watchdog\*
by default to COPY_AS_IS_EXCLUDE
because I assume a copy of dev/watchdog can never be useful
in the ReaR recovery system?
If yes
I would "just add it" in default.conf in current master code.
lecce17 commented at 2022-05-16 14:06:¶
@gdha No crashes any more after adding the suggested line to the local.conf. Thank you very much!!!
gdha commented at 2022-05-20 06:34:¶
Please find the following findings from HPE Engineering team.
OK, the vmcore-dmesg without /dev/watchdog not excluded once again shows the improper watchdog close prior to the crash:
<2>hpwdt: Unexpected close, not stopping watchdog!
<0>Kernel panic - not syncing: An NMI occurred.
And in the log with it excluded, we see it definitely being excluded:
2022-05-17 16:16:35 Files being excluded: dev/shm dev/.udev /var/lib/rear/output/rear-itsusabsdnp011.iso dev/shm/* dev/watchdog*
:
2022-05-17 16:17:27 Exclude list:
:
2022-05-17 16:17:28 /dev/watchdog
These entries do not appear in the log where /dev/watchdog was not excluded. Excluding it is clearly working to prevent the panic (unless there's a vmcore they didn't share with us). If that's the case, we should be done here.
jsmeix commented at 2022-05-25 11:56:¶
With
https://github.com/rear/rear/pull/2808
merged
this issue is avoided by default by ReaR
so I close this issue
(further comments can be added nevertheless).
Let's wait and see when something appears
that does need /dev/watchdog functionality
in the ReaR rescue/recovery system, cf.
https://github.com/rear/rear/pull/2808#issuecomment-1132569411
[Export of Github issue for rear/rear.]