#2798 Issue closed: TrendMicro ds_am module cause system crash by touching dev/watchdog in ReaR's build area¶
Labels: support / question, fixed / solved / done,
external tool, not ReaR / invalid
lecce17 opened issue at 2022-05-05 12:20:¶
- ReaR version ("/usr/sbin/rear -V"):
# rear -V
Relax-and-Recover 2.4 / Git
- OS version ("cat /etc/os-release" or "lsb_release -a" or "cat /etc/rear/os.conf"):
# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.9 (Maipo)
- ReaR configuration files ("cat /etc/rear/site.conf" and/or "cat /etc/rear/local.conf"):
OUTPUT=ISO
BACKUP_URL=nfs://<hostname>.<domain>/rear
BACKUP=NETFS
ONLY_INCLUDE_VG=sys
BACKUP_PROG_EXCLUDE=( "${BACKUP_PROG_EXCLUDE[@]}" '/dev' )
-
Hardware vendor/product (PC or PowerNV BareMetal or ARM) or VM (KVM guest or PowerVM LPAR):
HP DL380 Gen9 -
System architecture (x86 compatible or PPC64/PPC64LE or what exact ARM device):
Intel x86_64 -
Firmware (BIOS or UEFI or Open Firmware) and bootloader (GRUB or ELILO or Petitboot):
-
Storage (local disk or SSD) and/or SAN (FC or iSCSI or FCoE) and/or multipath (DM or NVMe):
local SAS Disks -
Storage layout ("lsblk -ipo NAME,KNAME,PKNAME,TRAN,TYPE,FSTYPE,LABEL,SIZE,MOUNTPOINT"):
# lsblk -ipo NAME,KNAME,PKNAME,TRAN,TYPE,FSTYPE,LABEL,SIZE,MOUNTPOINT
NAME KNAME PKNAME TRAN TYPE FSTYPE LABEL SIZE MOUNTPOINT
/dev/sda /dev/sda sas disk 279.4G
|-/dev/sda1 /dev/sda1 /dev/sda part xfs 500M /boot
`-/dev/sda2 /dev/sda2 /dev/sda part LVM2_member 278.9G
|-/dev/mapper/sys-root /dev/dm-0 /dev/sda2 lvm xfs 9.8G /
|-/dev/mapper/sys-swap /dev/dm-1 /dev/sda2 lvm swap 17.9G [SWAP]
|-/dev/mapper/sys-home /dev/dm-9 /dev/sda2 lvm xfs 8G /home
|-/dev/mapper/sys-tmp /dev/dm-10 /dev/sda2 lvm xfs 3.9G /tmp
|-/dev/mapper/sys-var /dev/dm-11 /dev/sda2 lvm xfs 40G /var
|-/dev/mapper/sys-opt /dev/dm-12 /dev/sda2 lvm xfs 7G /opt
|-/dev/mapper/sys-opt_oracle /dev/dm-13 /dev/sda2 lvm xfs 30G /opt/oracle
`-/dev/mapper/sys-crash /dev/dm-14 /dev/sda2 lvm xfs 100G /var/crash
/dev/sdb /dev/sdb sas disk LVM2_member 279.4G
|-/dev/mapper/vg01-database /dev/dm-4 /dev/sdb lvm xfs 200G /database
`-/dev/mapper/vg01-oractlvg01 /dev/dm-5 /dev/sdb lvm xfs 10G /oractlvg01
/dev/sdc /dev/sdc sas disk LVM2_member 279.4G
|-/dev/mapper/vg02-redo /dev/dm-2 /dev/sdc lvm xfs 50G /redo
`-/dev/mapper/vg02-oractlvg02 /dev/dm-3 /dev/sdc lvm xfs 10G /oractlvg02
/dev/sdd /dev/sdd sas disk LVM2_member 279.4G
|-/dev/mapper/vg03-traces /dev/dm-6 /dev/sdd lvm xfs 200G /traces
|-/dev/mapper/vg03-oractlvg03 /dev/dm-7 /dev/sdd lvm xfs 10G /oractlvg03
`-/dev/mapper/vg03-import /dev/dm-8 /dev/sdd lvm xfs 16G /import
- Description of the issue (ideally so that others can reproduce it):
After starting a backup with "rear mkbackup" the server is crashing
because rear does not close the /etc/watchdog correctly
May 5 11:29:02 saco3 kernel: watchdog: watchdog0: watchdog did not stop!
May 5 11:29:02 saco3 kernel: watchdog: watchdog0: watchdog did not stop!
We already opened a RedHat Case and the analyzed the vmcore.
They suggested to blacklist the /dev directory
what i did but unfortunately it didn't help.
-
Workaround, if any:
-
Attachments, as applicable ("rear -D mkrescue/mkbackup/recover" debug log files):
To paste verbatim text like command output or file content,
include it between a leading and a closing line of three backticks like
```
verbatim content
```
jsmeix commented at 2022-05-05 13:22:¶
Because
# find usr/sbin/rear usr/share/rear -type f | xargs grep -i 'watchdog'
shows nothing (i.e. watchdog does not appear in any of ReaR's
scripts)
I assume ReaR does not do anything directly with /dev/watchdog
but instead something that is called by ReaR does something
with /dev/watchdog - but that is only a blind guess.
@lecce17
to proceed here at ReaR upstream have a look at the section
"Debugging issues with Relax-and-Recover" in
https://en.opensuse.org/SDB:Disaster_Recovery
Alternatively proceed at Red Hat with your case there.
gdha commented at 2022-05-05 14:17:¶
@lecce17 I know of an open case at HPE on the same issue, but so far no update. Did you update to the latest firmware already? I would advise to open a software case at HPE as it is primary HPE software resulting in a (panic) reboot.
pcahyna commented at 2022-05-05 14:21:¶
as ReaR itself apparently does not open /dev/watchdog, I believe it
would be useful to do a lsof or fuser just before the machine
crashes to see which process has the device open.
gdha commented at 2022-05-06 14:04:¶
Please find the following findings from HPE Engineering team:
The good news here is that the vmcore provides definite confirmation that this is happening during ReaR backup. We can see the rear job running. This is the process ancestry:
PID: 0 TASK: ffffffff81a8d020 CPU: 0 COMMAND: "swapper"
PID: 1 TASK: ffff881029867520 CPU: 3 COMMAND: "init"
PID: 18695 TASK: ffff880fbca1aab0 CPU: 26 COMMAND: "anacron"
PID: 4108 TASK: ffff88201b0f1520 CPU: 11 COMMAND: "run-parts"
PID: 4114 TASK: ffff88201b2ad520 CPU: 19 COMMAND: "run-parts"
PID: 4117 TASK: ffff880fbd4f3520 CPU: 1 COMMAND: "rear"
PID: 25151 TASK: ffff882006b40ab0 CPU: 0 COMMAND: "gzip"
Looking at the open files for a few of these processes shows that rear is being run from /etc/cron.weekly, and it is writing to a log file named /var/log/rear/rear-ch01erp7003.log:
crash64> files 4114 4117
PID: 4114 TASK: ffff88201b2ad520 CPU: 19 COMMAND: "run-parts"
ROOT: / CWD: /
FD FILE DENTRY INODE TYPE PATH
0 ffff881026cb22c0 ffff881028dc2c80 ffff8820293dfa38 CHR /dev/null
1 ffff8820287dbd80 ffff881c52af5ec0 ffff882011969358 FIFO
2 ffff8820287dbd80 ffff881c52af5ec0 ffff882011969358 FIFO
255 ffff880ff4782c00 ffff88075f696500 ffff88075f669d98 REG /etc/cron.weekly/rear
PID: 4117 TASK: ffff880fbd4f3520 CPU: 1 COMMAND: "rear"
ROOT: / CWD: /tmp/rear.hdfOL5g3ErJwMBN/rootfs
FD FILE DENTRY INODE TYPE PATH
0 ffff881026cb22c0 ffff881028dc2c80 ffff8820293dfa38 CHR /dev/null
1 ffff880fb75cf6c0 ffff881028dc2c80 ffff8820293dfa38 CHR /dev/null
2 ffff880fb9b132c0 ffff8802668e3d80 ffff881018b908d0 REG /var/log/rear/rear-ch01erp7003.log
7 ffff880fb75cf6c0 ffff881028dc2c80 ffff8820293dfa38 CHR /dev/null
8 ffff880fbbd59240 ffff881028dc2c80 ffff8820293dfa38 CHR /dev/null
255 ffff8820254905c0 ffff880fb174b6c0 ffff880fb174e8d0 REG /usr/sbin/rear
The hpwdt modules are not the issue here – it’s /dev/watchdog that
they need to prevent rear from accessing.
lecce17 commented at 2022-05-06 14:24:¶
as ReaR itself apparently does not open
/dev/watchdog, I believe it would be useful to do alsoforfuserjust before the machine crashes to see which process has the device open.
Redhat recommended to use SystemTap to monitor who access /dev/watchdog. I've tried to do it with lsof and fuser but i wasn't lucky to capture the pid.
lecce17 commented at 2022-05-06 14:29:¶
Please find the following findings from HPE Engineering team:
The good news here is that the vmcore provides definite confirmation that this is happening during ReaR backup. We can see the rear job running. This is the process ancestry:
PID: 0 TASK: ffffffff81a8d020 CPU: 0 COMMAND: "swapper" PID: 1 TASK: ffff881029867520 CPU: 3 COMMAND: "init" PID: 18695 TASK: ffff880fbca1aab0 CPU: 26 COMMAND: "anacron" PID: 4108 TASK: ffff88201b0f1520 CPU: 11 COMMAND: "run-parts" PID: 4114 TASK: ffff88201b2ad520 CPU: 19 COMMAND: "run-parts" PID: 4117 TASK: ffff880fbd4f3520 CPU: 1 COMMAND: "rear" PID: 25151 TASK: ffff882006b40ab0 CPU: 0 COMMAND: "gzip"Looking at the open files for a few of these processes shows that rear is being run from /etc/cron.weekly, and it is writing to a log file named /var/log/rear/rear-ch01erp7003.log:
crash64> files 4114 4117 PID: 4114 TASK: ffff88201b2ad520 CPU: 19 COMMAND: "run-parts" ROOT: / CWD: / FD FILE DENTRY INODE TYPE PATH 0 ffff881026cb22c0 ffff881028dc2c80 ffff8820293dfa38 CHR /dev/null 1 ffff8820287dbd80 ffff881c52af5ec0 ffff882011969358 FIFO 2 ffff8820287dbd80 ffff881c52af5ec0 ffff882011969358 FIFO 255 ffff880ff4782c00 ffff88075f696500 ffff88075f669d98 REG /etc/cron.weekly/rear PID: 4117 TASK: ffff880fbd4f3520 CPU: 1 COMMAND: "rear" ROOT: / CWD: /tmp/rear.hdfOL5g3ErJwMBN/rootfs FD FILE DENTRY INODE TYPE PATH 0 ffff881026cb22c0 ffff881028dc2c80 ffff8820293dfa38 CHR /dev/null 1 ffff880fb75cf6c0 ffff881028dc2c80 ffff8820293dfa38 CHR /dev/null 2 ffff880fb9b132c0 ffff8802668e3d80 ffff881018b908d0 REG /var/log/rear/rear-ch01erp7003.log 7 ffff880fb75cf6c0 ffff881028dc2c80 ffff8820293dfa38 CHR /dev/null 8 ffff880fbbd59240 ffff881028dc2c80 ffff8820293dfa38 CHR /dev/null 255 ffff8820254905c0 ffff880fb174b6c0 ffff880fb174e8d0 REG /usr/sbin/rearThe
hpwdtmodules are not the issue here – it’s/dev/watchdogthat they need to prevent rear from accessing.
Similar answer from RedHat. The “watchdog: watchdog0: watchdog did not stop!” comes from watchdog_release(). This happens when the the /dev/watchdog file was closed unexpectedly. And to close the file correctyl it is needed to send a specific magic character 'V'.
gdha commented at 2022-05-13 14:53:¶
The crash was caused by TrendMicro ds_am module touch dev/watchdog
under the /tmp/rear.XXXX directory (the inital ramdisk creation
temporary location).
Test were performed with
COPY_AS_IS_EXCLUDE=( ${COPY_AS_IS_EXCLUDE[@]} dev/watchdog\* ) with
success - meaning no crash!
lecce17 commented at 2022-05-13 15:16:¶
The crash was caused by TrendMicro ds_am module touch dev/watchdog under the /tmp/rear.XXXX directory (the inital ramdisk creation temporary location). Test were performed with
COPY_AS_IS_EXCLUDE=( ${COPY_AS_IS_EXCLUDE[@]} dev/watchdog\* )with success - meaning no crash!
Very interesting, since we also use Trendmicro. I will test the new setting next week and give you an update.
jsmeix commented at 2022-05-16 07:23:¶
@gdha @pcahyna
should we add dev/watchdog\* by default to COPY_AS_IS_EXCLUDE
because I assume a copy of dev/watchdog can never be useful
in the ReaR recovery system?
If yes
I would "just add it" in default.conf in current master code.
lecce17 commented at 2022-05-16 14:06:¶
@gdha No crashes any more after adding the suggested line to the local.conf. Thank you very much!!!
gdha commented at 2022-05-20 06:34:¶
Please find the following findings from HPE Engineering team.
OK, the vmcore-dmesg without /dev/watchdog not excluded once again shows the improper watchdog close prior to the crash:
<2>hpwdt: Unexpected close, not stopping watchdog!
<0>Kernel panic - not syncing: An NMI occurred.
And in the log with it excluded, we see it definitely being excluded:
2022-05-17 16:16:35 Files being excluded: dev/shm dev/.udev /var/lib/rear/output/rear-itsusabsdnp011.iso dev/shm/* dev/watchdog*
:
2022-05-17 16:17:27 Exclude list:
:
2022-05-17 16:17:28 /dev/watchdog
These entries do not appear in the log where /dev/watchdog was not excluded. Excluding it is clearly working to prevent the panic (unless there's a vmcore they didn't share with us). If that's the case, we should be done here.
jsmeix commented at 2022-05-25 11:56:¶
With
https://github.com/rear/rear/pull/2808
merged
this issue is avoided by default by ReaR
so I close this issue
(further comments can be added nevertheless).
Let's wait and see when something appears
that does need /dev/watchdog functionality
in the ReaR rescue/recovery system, cf.
https://github.com/rear/rear/pull/2808#issuecomment-1132569411
[Export of Github issue for rear/rear.]