#1644 Issue closed: Serial ttyS0 respawns continuously on a VM without a serial console

Labels: bug, fixed / solved / done, blocker

gdha opened issue at 2017-12-12 15:35:

  • rear version (/usr/sbin/rear -V): 2.3pre (frozen)
  • OS version (cat /etc/rear/os.conf or lsb_release -a): ubuntu 16 / fedora26
  • Are you using legacy BIOS or UEFI boot? BIOS
  • Brief description of the issue: after applying PR https://github.com/rear/rear/pull/1615 we see issues with serial ttyS0 on ubuntu16 and with fedora26

On fedora26, a recovery VM that has a serial console (libvirt) showed the following issue: a timeout lasting a couple of minutes, after which everything seems normal:
[Screenshot: rear-recover-kvm, 2017-12-12 16:02:53]

fedora26 without a serial console has the following problem:

Dec 12 15:23:37 fedora agetty[525]: /dev/ttyS0: not a tty
Dec 12 15:23:40 fedora kernel: random: crng init done
Dec 12 15:23:47 fedora systemd[1]: Received SIGCHLD from PID 525 (agetty).
Dec 12 15:23:47 fedora systemd[1]: Child 525 (agetty) died (code=exited, status=1/FAILURE)
Dec 12 15:23:47 fedora systemd[1]: serial-getty@ttyS0.service: Child 525 belongs to serial-getty@ttyS0.service
Dec 12 15:23:47 fedora systemd[1]: serial-getty@ttyS0.service: Main process exited, code=exited, status=1/FAILURE
Dec 12 15:23:47 fedora systemd[1]: serial-getty@ttyS0.service: Changed running -> dead
Dec 12 15:23:47 fedora systemd[1]: serial-getty@ttyS0.service: Changed dead -> auto-restart
Dec 12 15:23:47 fedora systemd[1]: serial-getty@ttyS0.service: Service has no hold-off time, scheduling restart.
Dec 12 15:23:47 fedora systemd[1]: serial-getty@ttyS0.service: Trying to enqueue job serial-getty@ttyS0.service/restart/fail
Dec 12 15:23:47 fedora systemd[1]: serial-getty@ttyS0.service: Installed new job serial-getty@ttyS0.service/restart as 39
Dec 12 15:23:47 fedora systemd[1]: serial-getty@ttyS0.service: Enqueued job serial-getty@ttyS0.service/restart as 39
Dec 12 15:23:47 fedora systemd[528]: serial-getty@ttyS0.service: Executing: /sbin/agetty -s ttyS0 115200,38400,9600
Dec 12 15:23:47 fedora agetty[528]: /dev/ttyS0: not a tty

ubuntu 16.04: the serial getty on ttyS0 is respawned continuously. The test VM (VirtualBox) has no serial port ttyS0 defined. We see the same messages as on fedora26.
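The "not a tty" condition can also be checked by hand before a getty is started. The following is only a minimal sketch: the helper name is_usable_tty is hypothetical and not part of ReaR, and it assumes GNU stty (the -F option). The idea is that stty needs the same terminal attributes that agetty needs, so it should fail on a ttyS0 device node that has no real serial port behind it.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of ReaR): succeed only if the given device
# node exists as a character device and answers terminal attribute queries.
# Assumes GNU coreutils stty, which accepts -F DEVICE.
is_usable_tty() {
    local dev="$1"
    # The node must exist as a character device at all.
    [ -c "$dev" ] || return 1
    # stty fails when the device does not behave like a terminal.
    stty -F "$dev" >/dev/null 2>&1
}

if is_usable_tty /dev/ttyS0; then
    echo "/dev/ttyS0 looks like a usable serial tty"
else
    echo "/dev/ttyS0 is missing or not a usable tty"
fi
```

On a VM without a serial port the second message would be expected, which matches the agetty errors in the journal above.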

This is related to issue #878 and PR #1615. @didacog is aware of the issue and is looking into it.

gdha commented at 2017-12-13 08:58:

IMHO https://serverfault.com/questions/736624/systemd-service-automatic-restart-after-startlimitinterval contains some useful information for improving the situation.

didacog commented at 2017-12-13 10:57:

@gdha

I've already tested with StartLimitBurst, but it did not work in my test. I'll put more effort into it to see whether it can work.
Nevertheless, I tested with Restart=no and it worked fine: with a serial port the getty started OK, and without a serial port there are no errors in the journal and the service is not present in the systemctl status output (Ubuntu 16.04 in VirtualBox).

Tomorrow I will test it on real HW with RHEL7.2

On the other hand, I noticed that during the recovery system's boot process the serial console shows no output in the last steps. With Before=getty.service configured, you get the serial console login prompt before the whole boot process has finished. With After=getty.service instead, if you are connected only via the serial console, the wait is long and the system appears to hang.

Which option seems better in your opinion?
IMHO I prefer getting a fast login prompt. Maybe it is possible to add a warning telling the user to check that the boot has finished before running rear recover or, better, if possible, to mirror the output on both the serial console and the regular console :P.

Regards,

didacog commented at 2017-12-13 18:25:

Hi again, I found a cleaner solution than Restart=no.

/usr/share/rear/skel/default/usr/lib/systemd/system/serial-getty@.service

#  This file is part of systemd.
#
[Unit]
Description=Serial Getty on %I
BindTo=dev-%i.device
After=dev-%i.device

# We must wait for ReaR boot script to finish.
# this prevents the login serial prompt to
# be ready before the whole recovery boot process is done
After=getty.target


[Service]
Environment=TERM=vt100
ExecStart=-/sbin/agetty -s %I 115200,38400,9600
Restart=on-failure
RestartSec=0
StartLimitAction=none
StartLimitBurst=3
StartLimitInterval=60
UtmpIdentifier=%I
KillMode=process


# Some login implementations ignore SIGTERM, so we send SIGHUP
# instead, to ensure that login terminates cleanly.
KillSignal=SIGHUP

[Install]
WantedBy=getty.target

The key is the Restart= setting, which is now on-failure (see https://www.freedesktop.org/software/systemd/man/systemd.service.html); in case of failure the StartLimit* settings apply (this finally worked after some attempts :P). This prevents endless restarts of the serial getty when no serial port is present, and it also avoids weird console messages like "... Failed to start getty on ttyS0 ..." when the start limit (StartLimitBurst starts within StartLimitInterval) is reached and no more auto-restart attempts are made.
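To make the effect of StartLimitBurst=3 and StartLimitInterval=60 easier to picture, here is a small simulation. It is a simplified fixed-window model written for illustration, not systemd's actual code, and the function name simulate_start_limit is hypothetical. It assumes each failed start takes about 10 seconds, as the timestamps in the journal excerpt at the top of this issue suggest.

```shell
#!/usr/bin/env bash
# Simplified model (an assumption, not systemd source code): allow at most
# BURST starts within a window of INTERVAL seconds, then refuse further starts.
# Usage: simulate_start_limit BURST INTERVAL START_TIME...
simulate_start_limit() {
    local burst="$1" interval="$2"
    shift 2
    local window_begin=-1 count=0 t
    for t in "$@"; do
        if [ "$window_begin" -lt 0 ] || [ $((t - window_begin)) -gt "$interval" ]; then
            # First start ever, or the previous window has expired: new window.
            window_begin=$t
            count=1
            echo "t=${t}s start allowed"
        elif [ "$count" -lt "$burst" ]; then
            count=$((count + 1))
            echo "t=${t}s start allowed"
        else
            # Limit hit: with StartLimitAction=none nothing else happens.
            echo "t=${t}s start refused"
            return 0
        fi
    done
}

# Restart attempts roughly every 10 seconds, limits as in the unit file above.
simulate_start_limit 3 60 0 10 20 30
```

With these numbers the model allows the starts at 0, 10 and 20 seconds and refuses the one at 30 seconds, so the respawn loop ends after three attempts instead of running forever.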

On the other hand, this improvement also reduced some of the waiting time for the login prompt on the serial console with the After=getty.target clause set. I now think it is better to keep that clause, to avoid running rear recover by mistake from a serial connection before the ReaR boot script has finished.
A possible future improvement would be to show the ReaR boot script output on the console, to give the user better feedback.

While debugging the issue I found that the getty.service setting DefautInstance=tty0 was not recognized because of a typo (my fault). I corrected it to DefaultInstance=%I (see https://www.freedesktop.org/software/systemd/man/systemd.unit.html#DefaultInstance=) and it now works OK.
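A misspelled directive like this is easy to miss by eye. As a rough sketch of how it could be caught, the hypothetical helper below (unknown_install_keys, not part of ReaR or systemd) prints every key in the [Install] section of a unit file that is not in a short list of known [Install] directives. The list is an assumption and may be incomplete for newer systemd versions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print [Install] keys that are not in a short list of
# known directives. The list is an assumption and may be incomplete.
unknown_install_keys() {
    awk -F= '
        /^\[/ { in_install = ($0 == "[Install]"); next }
        in_install && /^[A-Za-z]+=/ {
            if ($1 !~ /^(Alias|WantedBy|RequiredBy|Also|DefaultInstance)$/)
                print $1
        }
    ' "$1"
}

# Demo on a throwaway sample file that contains the typo from this issue.
sample="$(mktemp)"
printf '%s\n' '[Unit]' 'Description=Getty on %I' \
    '[Install]' 'WantedBy=getty.target' 'DefautInstance=tty0' > "$sample"
unknown_install_keys "$sample"   # prints: DefautInstance
rm -f "$sample"
```

An empty result means no unexpected key was found in [Install]; it does not prove that the unit file is otherwise correct.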

All of this was tested on VirtualBox with Ubuntu 16.04 VMs, with and without a serial port connected. All tests worked well, and there are no more infinite serial-getty respawns in the journal.

Tomorrow I will test these changes on physical HW with RHEL7.2 and a serial console. If there are no issues, a PR should be ready by tomorrow evening, I guess.

Correct me if I'm wrong, but a new PR is needed, isn't it?

Regards,
Didac

gdha commented at 2017-12-14 07:23:

@didacog great news indeed! I think you have found the fix 👍 And, yes please prepare a fresh PR which will be accepted with great pleasure :-)

didacog commented at 2017-12-14 11:25:

@gdha, the latest changes also worked on SuperdomeX servers running RHEL7.2 with a serial console.

Later today I will send the PR with the updated changes. It should solve all the problems from issue #878 and PR #1615 without the infinite respawn problem. :-)

Regards,
Didac

schabrolles commented at 2017-12-14 11:39:

@didacog did you have a look at /usr/lib/systemd/system-generators/systemd-getty-generator?
It should be responsible for serial console detection and automatically instantiate serial-getty@ttyS0.service or serial-getty@hvc0.service. Have a look at #1442 and #1448.
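As a rough illustration of that detection, the sketch below maps the kernel's active consoles to the serial getty instances one would expect. It is a simplified approximation, not the generator's real logic, and the function name expected_serial_gettys is hypothetical. It assumes the active consoles are listed in /sys/class/tty/console/active; the optional file argument exists only so the function can be tried on a sample file.

```shell
#!/usr/bin/env bash
# Simplified sketch, NOT the real generator logic: list the serial-getty
# instances one would expect for the kernel's active consoles.
expected_serial_gettys() {
    local active_file="${1:-/sys/class/tty/console/active}"
    local tty
    # Nothing to report if the file is not available (e.g. in a container).
    [ -r "$active_file" ] || return 0
    for tty in $(cat "$active_file"); do
        case "$tty" in
            tty[0-9]*) ;;   # virtual consoles such as tty0 are not serial lines
            *) echo "serial-getty@${tty}.service" ;;
        esac
    done
}

expected_serial_gettys
```

On a system booted with console=ttyS0 this should print serial-getty@ttyS0.service, and on a VM without any serial console it should print nothing.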

didacog commented at 2017-12-14 15:32:

@schabrolles yes, I tested it as it is, with "/usr/lib/systemd/system-generators/systemd-getty-generator" in COPY_AS_IS without changes, but issue #878 still happens.

Maybe, with some bigger changes or cleanups of the getty services in systemd, it could eventually work with the generator. For now, I prefer to solve these issues first; once they are solved, we can start a discussion about how to improve things and reach a better implementation.

didacog commented at 2017-12-14 15:54:

PR is ready! :-P

jsmeix commented at 2017-12-15 10:34:

With https://github.com/rear/rear/pull/1649 merged
I consider this issue to be fixed (if not it can be reopened).


[Export of Github issue for rear/rear.]