#2798 Issue `closed`: TrendMicro ds_am module cause system crash by touching dev/watchdog in ReaR's build area

Labels: support / question, fixed / solved / done, external tool, not ReaR / invalid

lecce17 opened issue at 2022-05-05 12:20:

ReaR version ("/usr/sbin/rear -V"):

# rear -V
Relax-and-Recover 2.4 / Git

OS version ("cat /etc/os-release" or "lsb_release -a" or "cat /etc/rear/os.conf"):

# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.9 (Maipo)

ReaR configuration files ("cat /etc/rear/site.conf" and/or "cat /etc/rear/local.conf"):

OUTPUT=ISO
BACKUP_URL=nfs://<hostname>.<domain>/rear
BACKUP=NETFS
ONLY_INCLUDE_VG=sys
BACKUP_PROG_EXCLUDE=( "${BACKUP_PROG_EXCLUDE[@]}" '/dev'  )

Hardware vendor/product (PC or PowerNV BareMetal or ARM) or VM (KVM guest or PowerVM LPAR):
HP DL380 Gen9
System architecture (x86 compatible or PPC64/PPC64LE or what exact ARM device):
Intel x86_64
Firmware (BIOS or UEFI or Open Firmware) and bootloader (GRUB or ELILO or Petitboot):
Storage (local disk or SSD) and/or SAN (FC or iSCSI or FCoE) and/or multipath (DM or NVMe):
local SAS Disks
Storage layout ("lsblk -ipo NAME,KNAME,PKNAME,TRAN,TYPE,FSTYPE,LABEL,SIZE,MOUNTPOINT"):

# lsblk -ipo NAME,KNAME,PKNAME,TRAN,TYPE,FSTYPE,LABEL,SIZE,MOUNTPOINT
NAME                           KNAME      PKNAME    TRAN TYPE FSTYPE      LABEL   SIZE MOUNTPOINT
/dev/sda                       /dev/sda             sas  disk                   279.4G
|-/dev/sda1                    /dev/sda1  /dev/sda       part xfs                 500M /boot
`-/dev/sda2                    /dev/sda2  /dev/sda       part LVM2_member       278.9G
  |-/dev/mapper/sys-root       /dev/dm-0  /dev/sda2      lvm  xfs                 9.8G /
  |-/dev/mapper/sys-swap       /dev/dm-1  /dev/sda2      lvm  swap               17.9G [SWAP]
  |-/dev/mapper/sys-home       /dev/dm-9  /dev/sda2      lvm  xfs                   8G /home
  |-/dev/mapper/sys-tmp        /dev/dm-10 /dev/sda2      lvm  xfs                 3.9G /tmp
  |-/dev/mapper/sys-var        /dev/dm-11 /dev/sda2      lvm  xfs                  40G /var
  |-/dev/mapper/sys-opt        /dev/dm-12 /dev/sda2      lvm  xfs                   7G /opt
  |-/dev/mapper/sys-opt_oracle /dev/dm-13 /dev/sda2      lvm  xfs                  30G /opt/oracle
  `-/dev/mapper/sys-crash      /dev/dm-14 /dev/sda2      lvm  xfs                 100G /var/crash
/dev/sdb                       /dev/sdb             sas  disk LVM2_member       279.4G
|-/dev/mapper/vg01-database    /dev/dm-4  /dev/sdb       lvm  xfs                 200G /database
`-/dev/mapper/vg01-oractlvg01  /dev/dm-5  /dev/sdb       lvm  xfs                  10G /oractlvg01
/dev/sdc                       /dev/sdc             sas  disk LVM2_member       279.4G
|-/dev/mapper/vg02-redo        /dev/dm-2  /dev/sdc       lvm  xfs                  50G /redo
`-/dev/mapper/vg02-oractlvg02  /dev/dm-3  /dev/sdc       lvm  xfs                  10G /oractlvg02
/dev/sdd                       /dev/sdd             sas  disk LVM2_member       279.4G
|-/dev/mapper/vg03-traces      /dev/dm-6  /dev/sdd       lvm  xfs                 200G /traces
|-/dev/mapper/vg03-oractlvg03  /dev/dm-7  /dev/sdd       lvm  xfs                  10G /oractlvg03
`-/dev/mapper/vg03-import      /dev/dm-8  /dev/sdd       lvm  xfs                  16G /import

Description of the issue (ideally so that others can reproduce it):

After starting a backup with "rear mkbackup" the server crashes,
apparently because /dev/watchdog is not closed correctly:

May  5 11:29:02 saco3 kernel: watchdog: watchdog0: watchdog did not stop!
May  5 11:29:02 saco3 kernel: watchdog: watchdog0: watchdog did not stop!

We already opened a Red Hat case and they analyzed the vmcore.
They suggested blacklisting the /dev directory,
which I did, but unfortunately it didn't help.

Workaround, if any:
Attachments, as applicable ("rear -D mkrescue/mkbackup/recover" debug log files):

jsmeix commented at 2022-05-05 13:22:

Because

# find usr/sbin/rear usr/share/rear -type f | xargs grep -i 'watchdog'

shows nothing (i.e. watchdog does not appear in any of ReaR's scripts)
I assume ReaR does not do anything directly with /dev/watchdog
but instead something that is called by ReaR does something
with /dev/watchdog - but that is only a blind guess.

@lecce17
to proceed here at ReaR upstream have a look at the section
"Debugging issues with Relax-and-Recover" in
https://en.opensuse.org/SDB:Disaster_Recovery

Alternatively proceed at Red Hat with your case there.

gdha commented at 2022-05-05 14:17:

@lecce17 I know of an open case at HPE on the same issue, but so far there has been no update. Have you already updated to the latest firmware? I would advise opening a software case with HPE, as it is primarily HPE software that results in the (panic) reboot.

pcahyna commented at 2022-05-05 14:21:

As ReaR itself apparently does not open /dev/watchdog, I believe it would be useful to run lsof or fuser just before the machine crashes to see which process has the device open.
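
A minimal sketch of such a check, for illustration only: instead of calling lsof or fuser it scans /proc directly, which reports roughly the same information without opening the device itself. The function name `who_has_open`, the one-second interval, and the paths in the commented loop are assumptions, not anything ReaR provides. Polling can easily miss a very short-lived open, so treat an empty result as inconclusive.

```shell
#!/bin/bash
# Sketch: list processes that hold a given file open by scanning /proc,
# similar to what `lsof FILE` or `fuser FILE` report. Linux only; run as
# root to see other users' processes. The function name is illustrative.
who_has_open() {
    local target fd pid
    # Resolve the path without opening it (opening a watchdog device arms it).
    target=$(readlink -f -- "$1") || return 1
    for fd in /proc/[0-9]*/fd/*; do
        # Each fd entry is a symlink to the file the descriptor refers to.
        if [ "$(readlink -- "$fd" 2>/dev/null)" = "$target" ]; then
            pid=${fd#/proc/}
            pid=${pid%%/*}
            echo "$(date '+%F %T') $target open in PID $pid ($(cat "/proc/$pid/comm" 2>/dev/null))"
        fi
    done
}

# Hypothetical polling loop (paths are assumptions; adjust to your system):
# while sleep 1; do
#     for f in /dev/watchdog* /tmp/rear.*/rootfs/dev/watchdog*; do
#         who_has_open "$f"
#     done
# done >> /var/tmp/watchdog-users.log
```

Each scan forks one `readlink` per open descriptor on the system, so on a busy server the effective polling interval will be longer than one second.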

gdha commented at 2022-05-06 14:04:

Please find the following findings from HPE Engineering team:

The good news here is that the vmcore provides definite confirmation that this is happening during ReaR backup. We can see the rear job running. This is the process ancestry:

PID: 0     TASK: ffffffff81a8d020 CPU: 0  COMMAND: "swapper"
 PID: 1     TASK: ffff881029867520 CPU: 3  COMMAND: "init"
 PID: 18695 TASK: ffff880fbca1aab0 CPU: 26 COMMAND: "anacron"
  PID: 4108  TASK: ffff88201b0f1520 CPU: 11 COMMAND: "run-parts"
   PID: 4114  TASK: ffff88201b2ad520 CPU: 19 COMMAND: "run-parts"
    PID: 4117  TASK: ffff880fbd4f3520 CPU: 1  COMMAND: "rear"
     PID: 25151 TASK: ffff882006b40ab0 CPU: 0  COMMAND: "gzip"

Looking at the open files for a few of these processes shows that rear is being run from /etc/cron.weekly, and it is writing to a log file named /var/log/rear/rear-ch01erp7003.log:

crash64> files 4114 4117
PID: 4114  TASK: ffff88201b2ad520 CPU: 19 COMMAND: "run-parts"
ROOT: /   CWD: /
 FD      FILE           DENTRY          INODE      TYPE PATH
 0 ffff881026cb22c0 ffff881028dc2c80 ffff8820293dfa38 CHR /dev/null
 1 ffff8820287dbd80 ffff881c52af5ec0 ffff882011969358 FIFO
 2 ffff8820287dbd80 ffff881c52af5ec0 ffff882011969358 FIFO
255 ffff880ff4782c00 ffff88075f696500 ffff88075f669d98 REG /etc/cron.weekly/rear


PID: 4117  TASK: ffff880fbd4f3520 CPU: 1  COMMAND: "rear"
ROOT: /   CWD: /tmp/rear.hdfOL5g3ErJwMBN/rootfs
 FD      FILE           DENTRY          INODE      TYPE PATH
 0 ffff881026cb22c0 ffff881028dc2c80 ffff8820293dfa38 CHR /dev/null
 1 ffff880fb75cf6c0 ffff881028dc2c80 ffff8820293dfa38 CHR /dev/null
 2 ffff880fb9b132c0 ffff8802668e3d80 ffff881018b908d0 REG /var/log/rear/rear-ch01erp7003.log
 7 ffff880fb75cf6c0 ffff881028dc2c80 ffff8820293dfa38 CHR /dev/null
 8 ffff880fbbd59240 ffff881028dc2c80 ffff8820293dfa38 CHR /dev/null
255 ffff8820254905c0 ffff880fb174b6c0 ffff880fb174e8d0 REG /usr/sbin/rear

The hpwdt modules are not the issue here – it’s /dev/watchdog that they need to prevent rear from accessing.

lecce17 commented at 2022-05-06 14:24:

Red Hat recommended using SystemTap to monitor what accesses /dev/watchdog. I tried to do it with lsof and fuser, but I wasn't lucky enough to capture the PID.

lecce17 commented at 2022-05-06 14:29:

Similar answer from Red Hat. The message “watchdog: watchdog0: watchdog did not stop!” comes from watchdog_release(). It appears when the /dev/watchdog file is closed unexpectedly. To close the file correctly, the specific magic character 'V' must be written to it before closing.
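
To illustrate what a correct close looks like, here is a small sketch of the open / ping / magic-close sequence as described in the Linux watchdog API. The function name `watchdog_ping_and_close` is made up for this example. Note that opening a real watchdog device arms the hardware timer, and on drivers loaded with the "nowayout" option the magic character does not stop it, so try this only against a test machine or a plain file.

```shell
#!/bin/bash
# Sketch of the "magic close" sequence on a watchdog device.
# WARNING: opening a real /dev/watchdog starts the hardware timer.
# The function name is illustrative, not part of ReaR or the kernel.
watchdog_ping_and_close() {
    local dev=$1
    # The redirection opens the device once for the whole group and closes
    # it when the group ends.
    {
        # Any write to the device counts as a keepalive ping.
        printf '.'
        # 'V' in the last write before close marks the close as intentional.
        printf 'V'
    } > "$dev"
}
```

A process that merely opens the device node and closes it again, without the final 'V', produces exactly the "watchdog did not stop!" message quoted at the top of this issue and leaves the timer running.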

gdha commented at 2022-05-13 14:53:

The crash was caused by the TrendMicro ds_am module touching dev/watchdog under the /tmp/rear.XXXX directory (the temporary location where the initial ramdisk is created).
Tests were performed with COPY_AS_IS_EXCLUDE=( "${COPY_AS_IS_EXCLUDE[@]}" dev/watchdog\* ) with success, meaning no crash!
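
Written out as a configuration fragment, the workaround could look like the sketch below. It assumes the setting goes into /etc/rear/local.conf (a bash file that ReaR sources) and uses the array append form, which is equivalent to the assignment above.

```shell
# /etc/rear/local.conf (sketch)
# Keep the defaults from default.conf and additionally skip any watchdog
# device nodes when ReaR copies files into its build area.
# The pattern is quoted so the shell does not expand the glob here.
COPY_AS_IS_EXCLUDE+=( 'dev/watchdog*' )
```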

lecce17 commented at 2022-05-13 15:16:

Very interesting, since we also use TrendMicro. I will test the new setting next week and give you an update.

jsmeix commented at 2022-05-16 07:23:

@gdha @pcahyna
should we add dev/watchdog\* by default to COPY_AS_IS_EXCLUDE
because I assume a copy of dev/watchdog can never be useful
in the ReaR recovery system?

If yes
I would "just add it" in default.conf in current master code.

lecce17 commented at 2022-05-16 14:06:

@gdha No crashes any more after adding the suggested line to the local.conf. Thank you very much!!!

gdha commented at 2022-05-20 06:34:

Please find the following findings from HPE Engineering team.

OK, the vmcore-dmesg from the run where /dev/watchdog was not excluded once again shows the improper watchdog close prior to the crash:

<2>hpwdt: Unexpected close, not stopping watchdog!
<0>Kernel panic - not syncing: An NMI occurred.

And in the log with it excluded, we see it definitely being excluded:

2022-05-17 16:16:35 Files being excluded: dev/shm dev/.udev /var/lib/rear/output/rear-itsusabsdnp011.iso dev/shm/* dev/watchdog*
   :
2022-05-17 16:17:27 Exclude list:
    :
2022-05-17 16:17:28  /dev/watchdog

These entries do not appear in the log where /dev/watchdog was not excluded. Excluding it is clearly working to prevent the panic (unless there's a vmcore they didn't share with us). If that's the case, we should be done here.
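
The same check can be scripted. The sketch below greps a ReaR log for the "Files being excluded:" line shown in the excerpt above and reports whether it mentions dev/watchdog. The function name is hypothetical, and the log line format is taken only from that excerpt, so it may differ between ReaR versions or verbosity levels.

```shell
#!/bin/bash
# Sketch: confirm from a ReaR log that dev/watchdog* was on the exclude list.
# The function name is illustrative; the log line format is an assumption
# based on the excerpt in this issue.
watchdog_excluded_in_log() {
    local logfile=$1
    # Succeeds only if an exclude line exists and it mentions dev/watchdog.
    grep 'Files being excluded:' "$logfile" | grep -q 'dev/watchdog'
}

# Example call (the log path is a placeholder):
# watchdog_excluded_in_log /var/log/rear/rear-HOSTNAME.log && echo "excluded"
```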

jsmeix commented at 2022-05-25 11:56:

With https://github.com/rear/rear/pull/2808 merged
this issue is avoided by default by ReaR
so I close this issue
(further comments can be added nevertheless).

Let's wait and see when something appears
that does need /dev/watchdog functionality
in the ReaR rescue/recovery system, cf.
https://github.com/rear/rear/pull/2808#issuecomment-1132569411

[Export of Github issue for rear/rear.]
