#2439 Issue closed: Boot from rescue ISO freezes at "pid_max: default: 32768 minimum: 301" on "ESXi 6.7 Update 2 and later" virtual machine

Labels: support / question, fixed / solved / done, external tool, not ReaR / invalid, special hardware or VM

will-code-for-pizza opened issue at 2020-06-25 13:32:

Relax-and-Recover (ReaR) Issue Template

Fill in the following items before submitting a new issue
(quick response is not guaranteed with free support):

  • ReaR version ("/usr/sbin/rear -V"):
    2.4 to 2.6

  • OS version ("cat /etc/os-release" or "lsb_release -a" or "cat /etc/rear/os.conf"):
    SuSE SLES 11.3

  • ReaR configuration files ("cat /etc/rear/site.conf" and/or "cat /etc/rear/local.conf"):

site.conf

BACKUP=NETFS
OUTPUT=ISO
RETENTION_TIME="Week"
USE_CFG2HTML=y
BACKUP_PROG=tar
BACKUP_PROG_COMPRESS_OPTIONS="--gzip"
LOGFILE="/var/log/rear/$HOSTNAME.log"
BACKUP_PROG_EXCLUDE=( '/mnt/*' '/cache/*' '/tmp/*' '/dev/shm/*' $VAR_DIR/output/\* )
BACKUP_PROG_INCLUDE=( '/*' '/boot/' '/home/*' '/opt/*' '/usr/*' '/var/*' '/oracleoem/*' )

local.conf

NETFS_KEEP_OLD_BACKUP_COPY=yes
BACKUP_URL=nfs://139.1.xxx.xxx/srv/IMAGES/rear/
MIGRATION_MODE='true'
BOOTLOADER="GRUB"
USE_STATIC_NETWORKING="yes"
NETWORKING_PREPARATION_COMMANDS=( 'ip addr add 139.1.xx.xx/25 dev eth0' 'ip link set dev eth0 up' 'ip route add default via 139.1.xx.1' 'return' )
KERNEL_CMDLINE="cgroup_disable=memory" # otherwise it throws some cgroup errors while booting
  • Hardware (PC or PowerNV BareMetal or ARM) or virtual machine (KVM guest or PoverVM LPAR):

ESX virtual guest

  • System architecture (x86 compatible or PPC64/PPC64LE or what exact ARM device):

amd64

  • Firmware (BIOS or UEFI or Open Firmware) and bootloader (GRUB or ELILO or Petitboot):

BIOS / Grub

  • Storage (local disk or SSD) and/or SAN (FC or iSCSI or FCoE) and/or multipath (DM or NVMe):

vmdk-virtual disks on SAN Storage Backend

  • Storage layout ("lsblk -ipo NAME,KNAME,PKNAME,TRAN,TYPE,FSTYPE,SIZE,MOUNTPOINT" or "lsblk" as makeshift):

Could not be determined cause of hanging boot

  • Description of the issue (ideally so that others can reproduce it):

We want to P2V migrate a physical server to ESX virtual machine.
While booting the rescue ISO into virtual machine,
the boot from rescue ISO freezes at

pid_max: default: 32768 minimum: 301

on ESX virtual machine

  • Workaround, if any:

NONE

jsmeix commented at 2020-06-25 14:03:

@will-code-for-pizza
on my current openSUSE Leap 15.1 x86_64 home-office laptop I get

# dmesg | grep -3 pid_max

[    0.000000] tsc: Fast TSC calibration using PIT
[    0.004000] tsc: Detected 2394.502 MHz processor
[    0.004000] Calibrating delay loop (skipped), value calculated using timer frequency.. 4789.00 BogoMIPS (lpj=9578008)
[    0.004000] pid_max: default: 32768 minimum: 301
[    0.004000] ACPI: Core revision 20170303
[    0.004000] Security Framework initialized
[    0.004000] AppArmor: AppArmor initialized
ault: 32768 minimum: 301

so the pid_max: default: 32768 minimum: 301 kernel message
sems to be a normal info message that does not indicate an error.

I assume it freezes at something that happens after the
pid_max: default: 32768 minimum: 301 kernel message
but I have not idea what that "something" is in your case
(I am not at all a kernel expert).

In particular I cannot help with "VMware ESXi (formerly ESX)" issues
because I neither have that nor did I ever use it.
https://en.wikipedia.org/wiki/VMware_ESXi
is all I know about it.
I use only QEMU/KVM virtual machines
(but I am also not at all a QEMU/KVM expert).

gdha commented at 2020-06-25 14:16:

@will-code-for-pizza Try to add MODULES=( 'all_modules' ) in your local.conf file

gozora commented at 2020-06-25 14:24:

@will-code-for-pizza can you please execute following command on your ORIGINAL system find /usr/share/rear -type l | wc -l and report back the output (should be single number) ?

Happened to me some time ago that copying ReaR over winscp discarded symbolic links which made ReaR recovery system un-bootable ...

V.

jsmeix commented at 2020-06-25 14:26:

Since ReaR 2.5 MODULES=( 'all_modules' ) is the default, cf.
https://github.com/rear/rear/blob/master/doc/rear-release-notes.txt#L924

@will-code-for-pizza
an offtopic question:
What is RETENTION_TIME="Week" intended to do?
I can only find NSR_RETENTION_TIME in the ReaR scripts
but that only applies if you use BACKUP=NSR.

will-code-for-pizza commented at 2020-06-25 14:31:

What is RETENTION_TIME="Week" intended to do?

You are right @jsmeix

I can only remember, this was an option in an older version of ReaR ( ~ 1.7 )
So I will remove this option.
Thanks.

jsmeix commented at 2020-06-25 14:43:

@gozora
but how could something missing below usr/share/rear
let booting freeze apparently while the kernel is in its startup phase
(as far as I imagine what freezes at "pid_max: default: 32768 minimum: 301" means)
i.e. while only the kernel seems to be running?

Or actually it freezes during ReaR recovery system startup scripts?
I have those symlinks of the ReaR recovery system startup scripts:

# find usr/share/rear -type l | grep skel
usr/share/rear/skel/default/etc/bash.bashrc
usr/share/rear/skel/default/usr/lib/systemd/system/multi-user.target.wants/sysinit.service
usr/share/rear/skel/default/usr/lib/systemd/system/multi-user.target.wants/getty.target
usr/share/rear/skel/default/usr/lib/systemd/system/multi-user.target.wants/systemd-journald.service
usr/share/rear/skel/default/usr/lib/systemd/system/multi-user.target.wants/sshd.service
usr/share/rear/skel/default/usr/lib/systemd/system/multi-user.target.wants/systemd-udev-trigger.service
usr/share/rear/skel/default/usr/lib/systemd/system/multi-user.target.wants/rear-boot-helper.service
usr/share/rear/skel/default/usr/lib/systemd/system/dbus.target.wants/dbus.service
usr/share/rear/skel/default/usr/lib/systemd/system/ctrl-alt-del.target
usr/share/rear/skel/default/usr/lib/systemd/system/sockets.target.wants/dbus.socket
usr/share/rear/skel/default/usr/lib/systemd/system/sockets.target.wants/systemd-logger.socket
usr/share/rear/skel/default/usr/lib/systemd/system/sockets.target.wants/systemd-shutdownd.socket
usr/share/rear/skel/default/usr/lib/systemd/system/sockets.target.wants/systemd-journald.socket
usr/share/rear/skel/default/usr/lib/systemd/system/sockets.target.wants/systemd-udevd-kernel.socket
usr/share/rear/skel/default/usr/lib/systemd/system/sockets.target.wants/syslog.socket
usr/share/rear/skel/default/usr/lib/systemd/system/sockets.target.wants/udev-kernel.socket
usr/share/rear/skel/default/usr/lib/systemd/system/sockets.target.wants/udev-control.socket
usr/share/rear/skel/default/usr/lib/systemd/system/sockets.target.wants/systemd-udevd-control.socket
usr/share/rear/skel/default/usr/lib/systemd/system/getty.target.wants/getty@tty0.service
usr/share/rear/skel/default/usr/lib/systemd/system/getty.target.wants/serial-getty@ttyS0.service
usr/share/rear/skel/default/usr/lib/systemd/system/default.target

Perhaps what actually freezes is PID1 a.k.a. systemd?

gozora commented at 2020-06-25 14:49:

@jsmeix discarded isn't the right word. The links were not missing. What happened in my particular case was that symbolic links were present but they were regular files with symbolic link target as their content.

Something like original symbolic link /bin/ls -> /usr/bin/ls after copied over winscp become:

# cat /bin/ls
/usr/bin/ls

gozora commented at 2020-06-25 14:51:

Guess our Systemd setup didn't like one of /usr/share/rear/skel/default/usr/lib/systemd/system/ symbolic links modified this way.

V.

will-code-for-pizza commented at 2020-06-25 14:54:

Thanks a lot for all your support.

I use ReaR as an extracted TAR archive in /root/bin/rear-2.6/, so I guess, @gozora 's issue is not my case today.
The creation of the rescue ISO and the backup archive on the NFS target runs fine.

Starting a libvirt/kvm virtual machine or another physical server with this rescue ISO went fine.

Only this ESX vm freezes. :-(

Regards

will-code-for-pizza commented at 2020-06-25 14:55:

@will-code-for-pizza Try to add MODULES=( 'all_modules' ) in your local.conf file

Unfortunatelly, this was NOT the solution.

will-code-for-pizza commented at 2020-06-25 14:56:

@will-code-for-pizza can you please execute following command on your ORIGINAL system find /usr/share/rear -type l | wc -l and report back the output (should be single number) ?

Happened to me some time ago that copying ReaR over winscp discarded symbolic links which made ReaR recovery system un-bootable ...

V.

149

will-code-for-pizza commented at 2020-06-25 14:58:

Thanks a lot for all your support.

I use ReaR as an extracted TAR archive in /root/bin/rear-2.6/, so I guess, @gozora 's issue is not my case today.
The creation of the rescue ISO and the backup archive on the NFS target runs fine.

Starting a libvirt/kvm virtual machine or another physical server with this rescue ISO went fine.

Only this ESX vm freezes. :-(

Regards

By the way... The ESX vm has 16 GB RAM and 8 vCPUs.

jsmeix commented at 2020-06-25 15:00:

For comparison:

# find usr/share/rear -type l | wc -l

149

I use latest ReaR upstream code.

By the way:
My QEMU/KVM test systems have 2 GiB RAM and one virtual CPU.
With FIRMWARE_FILES=( 'no' ) it also works with only 1 GiB RAM.

will-code-for-pizza commented at 2020-06-25 15:13:

By the way:
My QEMU/KVM test systems have 2 GiB RAM and one virtual CPU.
With FIRMWARE_FILES=( 'no' ) it also works with only 1 GiB RAM.

Well, but this is not the problem solver :-(

jsmeix commented at 2020-06-25 15:20:

@will-code-for-pizza
I did not mean FIRMWARE_FILES=( 'no' ) would help here
I only meant that the ReaR recovery system does not need much RAM
i.e. that 16 GB RAM should be more than enough.

By the way - also not meant as a problem solver:
On my SLES11 SP4 system I have those kernel messages

# dmesg | grep -3 pid_max

[    0.000000] please try 'cgroup_disable=memory' option if you don't want memory cgroups
[    0.000000] Detected 2394.456 MHz processor.
[    0.008000] Calibrating delay loop (skipped) preset value.. 4788.91 BogoMIPS (lpj=9577824)
[    0.008000] pid_max: default: 32768 minimum: 301
[    0.008000] kdb version 4.4 by Keith Owens, Scott Lurndal. Copyright SGI, All Rights Reserved
[    0.008000] Security Framework initialized
[    0.008000] AppArmor: AppArmor initialized

so directly after pid_max: default: 32768 minimum: 301
it seems there is something keyboard related happening.
Is there perhaps something special with the keyboard on your ESX VM?
I.e. is there perhaps not a "normal" (virtual) keyboard?
This is plain blind guesswork from me.

will-code-for-pizza commented at 2020-06-25 15:48:

Short info: I have booted another VM on the ESX cluster successfully.
The settings for the VM differs in

WORKING:
Guest OS: Ubuntu Linux (64 Bit)
Compatibility: ESX/ESXi 4.0 and later

vs.

NOT WORKING
Guest OS: SuSE Linux Enterprise 11 (64-Bit)
Compatibility: ESXi 6.7 Update 2 and later

I will now re-create a new virtual machine template and try again.

will-code-for-pizza commented at 2020-06-25 15:52:

SOLVED.

VM Option --> Compatibility: ESXi 6.7 Update 2 and later --> BAD SETTING

jsmeix commented at 2020-06-26 08:39:

@will-code-for-pizza
thank you for your feedback what the reason was and what the solution is.

jsmeix commented at 2020-06-26 10:02:

@will-code-for-pizza
I have a question out of curiosity:

I do not understand why the SLES11 Linux kernel freezes
when it runs on a particular newer version of a virtual machine
but works on an older version of that kind of virtual machine.

I could understand when some latest version of a guest program
that requires some latest newest hardware stuff fails when it is
run in an older version of a virtual machine because the older
virtual machine does not provide virtualization of the required
newest hardware.

But I do not understand the opposite because I would expect
that an old-style guest program would work on latest versions
of a virtual machine (i.e. I would expect newer virtual machine
versions to behave backward compatible).

Could you perhaps explain what the reason behind is why
it seems an "ESXi 6.7 Update 2" virtual machine does not behave
backward compatible with an "ESX/ESXi 4.0" virtual machine?

I found
https://docs.vmware.com/en/VMware-vSphere/7.0/com.vmware.vsphere.vm_admin.doc/GUID-64D4B1C9-CD5D-4C68-8B50-585F6A87EBA0.html
but that does not answer my question (or I fail to see the answer).


[Export of Github issue for rear/rear.]