#1712 Issue closed
: during "rear mkbackup" [Ctrl]+[C] does not terminate 'tar' background process¶
Labels: enhancement
, bug
, cleanup
, fixed / solved / done
jsmeix opened issue at 2018-01-26 13:53:¶
Current master code after ReaR 2.3.
When during "rear mkbackup" the 'tar' background process is already
running
pressing [Ctrl]+[C] (i.e. sending SIGINT) only terminates the
foreground
process (i.e. the bash script) but neither the 'tar' background
process
nor the 'dd' background process:
# usr/sbin/rear -D mkbackup ... Creating tar archive '/tmp/rear.lswe2CuMGSACoyi/outputfs/g241/backup.tar.gz' Preparing archive operation Archived 39 MiB [avg 4528 KiB/sec] Archived 57 MiB [avg 3948 KiB/sec] Archived 73 MiB [avg 4168 KiB/sec] Archived 143 MiB [avg 5074 KiB/sec] ^CYou should also rm -Rf /tmp/rear.lswe2CuMGSACoyi rear mkbackup failed, check /root/rear.master/var/log/rear/rear-g241.log for details # ps auxw | grep tar root 14524 19.4 0.5 12808 4628 pts/0 S 14:40 0:08 tar --warning=no-xdev --sparse --block-number --totals --verbose --no-wildcards-match-slash --one-file-system --ignore-failed-read --anchored --gzip -X /tmp/rear.lswe2CuMGSACoyi/tmp/backup-exclude.txt -C / -c -f - / /root/rear.master/var/log/rear/rear-g241.log root 14526 0.8 0.0 8296 612 pts/0 S 14:40 0:00 dd of=/tmp/rear.lswe2CuMGSACoyi/outputfs/g241/backup.tar.gz root 14583 0.0 0.0 5732 812 pts/0 R+ 14:40 0:00 grep tar
The background 'tar' and 'dd' processes run until they finish
and at the very end the mounted NFS share is still mounted:
# mount | grep rear 10.160.4.244:/nfs on /tmp/rear.lswe2CuMGSACoyi/outputfs type nfs (rw,nfsvers=3,nolock,addr=10.160.4.244)
I think pressing [Ctrl]+[C] during any "rear WORKFLOW"
should also terminate all background processes
and clean up e.g. umount ReaR specific things
(i.e. ensure the EXIT_TASKS can run successfully).
I think the current behaviour is at most a "minor bug".
Or is there a reason why the current behaviour
is perhaps even intentional?
jsmeix commented at 2018-01-26 14:37:¶
As a first idea I tried in lib/_input-output-functions.sh adding
... pkill -P $MASTER_PID ; wait ...
before finally kill $MASTER_PID
to the USR1 trap like
builtin trap " ... pkill -P $MASTER_PID ; wait ; kill $MASTER_PID" USR1
but that did not help.
I also tried in lib/_input-output-functions.sh
adding the same to the DoExitTasks() function like
function DoExitTasks () { LogPrint "Running exit tasks" LogPrint "Terminating child processes" LogPrint "$( pgrep -a -P $MASTER_PID )" pkill -P $MASTER_PID LogPrint "Waiting for child processes to finish..." wait LogPrint "Child processes finished" ...
but that also did not help.
Somehow that "kill all running jobs" in the DoExitTasks() function
does not do what it should do...
gozora commented at 2018-01-27 11:29:¶
Oddly enough, I'm not suffering by this. Recently I had to interrupt
ReaR with ^C multiple times as well, but I did not had dd
command
left running.
Running ReaR:
sp2:/ # rear -D mkbackup
...
Making ISO image
Wrote ISO image: /var/lib/rear/output/rear-sp2.iso (183M)
Copying resulting files to nfs location
Saving /var/log/rear/rear-sp2.log as rear-sp2.log to nfs location
Creating tar archive '/tmp/rear.2HQ5NB3tdmpSCID/outputfs/sp2/backup.tar.gz'
Archived 169 MiB [avg 9614 KiB/sec]
sp2:/ # ps auxw | grep tar
root 17251 4.4 0.6 19196 3448 pts/6 S+ 12:16 0:00 tar --warning=no-xdev --sparse --block-number --totals --verbose --no-wildcards-match-slash --one-file-system --ignore-failed-read --anchored --xattrs --xattrs-include=security.capability --xattrs-include=security.selinux --acls --gzip -X /tmp/rear.2HQ5NB3tdmpSCID/tmp/backup-exclude.txt -C / -c -f - /var/log /var/lib/pgsql /var/tmp /srv /var/spool /tmp /boot/grub2/x86_64-efi /var/lib/named /home /usr/local /boot/grub2/i386-pc /var/lib/mariadb /var/lib/machines /var/lib/mailman /var/opt /var/lib/mysql /var/lib/libvirt/images /var/cache /opt / /boot/efi /var/log/rear/rear-sp2.log
root 17253 1.0 0.1 4248 712 pts/6 S+ 12:16 0:00 dd of=/tmp/rear.2HQ5NB3tdmpSCID/outputfs/sp2/backup.tar.gz
root 17285 0.0 0.3 10508 1592 pts/5 R+ 12:16 0:00 grep --color=auto tar
After ^C
Creating tar archive '/tmp/rear.2HQ5NB3tdmpSCID/outputfs/sp2/backup.tar.gz'
Archived 169 MiB [avg 9614 KiB/sec]
^C
You should also rm -Rf /tmp/rear.2HQ5NB3tdmpSCID
rear mkbackup failed, check /var/log/rear/rear-sp2.log for details
sp2:/ # ps auxw | grep tar
root 17359 0.0 0.3 10508 1592 pts/5 R+ 12:16 0:00 grep --color=auto tar
V.
jsmeix commented at 2018-01-30 13:11:¶
Phew! The analysis took much longer than expected:
Also for me Ctrl+C usually works - but sometimes not.
Summary:
At least sometimes that "kill all running jobs" in the DoExitTasks()
function
does not do what it should do because it does not terminate all child
processes
because it fails to determine all child processses ('jobs -p' is
insufficient for that).
Usually 'jobs -p' reports only one child and that gets killed with a
brutal SIGKILL
that leaves the poor child (and all its own grandchildren) no chance
to do any own cleanup.
Therefore the "kill all running jobs" in the DoExitTasks() function
needs to be enhanced to determine all child processes
and to recursively terminate each of them more gracefully
first by sending SIGTERM and waiting a bit for them to terminate
and finally sending SIGKILL only to remaining child processes.
Details on my SLES12-SP2 system:
For analysis I enhanced lib/_input-output-functions.sh
function DoExitTasks () { Log "Running exit tasks." # terminate all running jobs JOBS=( $( jobs -p ) ) # when "jobs -p" results nothing then JOBS is still an unbound variable so that # an empty default value is used to avoid 'set -eu' error exit if $JOBS is unset: if test -n ${JOBS:-""} ; then LogPrint "Terminating..." LogPrint "The following ReaR (background) jobs are active and get terminated:" LogPrint "Output of: jobs -p" LogPrint "$( jobs -p )" LogPrint "Output of: pstree -plau $MASTER_PID" LogPrint "$( pstree -plau $MASTER_PID )" LogPrint "Output of: pgrep -a -P $MASTER_PID" LogPrint "$( pgrep -a -P $MASTER_PID )" LogPrint "Output of: ps auxw | egrep 'rear|tar'" LogPrint "$( ps auxw | egrep 'rear|tar' )" LogPrint "Terminating jobs (PIDs ${JOBS[*]})" for job in "${JOBS[@]}" ; do LogPrint "Terminating job (PID $job)" kill $job 1>&2 done sleep 1 ...
and backup/NETFS/default/500_make_backup.sh
sleep 1 # Give the backup software a good chance to start working echo "ps auxw | egrep 'rear|tar'" 1>&7 ps auxw | egrep -i1 'rear|tar' 1>&7 echo "pstree -plau $MASTER_PID" 1>&7 pstree -plau $MASTER_PID 1>&7
then I did "rear mkbackup" with Ctrl+C during "archive operation"
# usr/sbin/rear -D mkbackup ... Preparing archive operation ps auxw | egrep 'rear|tar' USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND root 9056 4.6 0.6 16080 6260 pts/1 S+ 13:43 0:01 /bin/bash usr/sbin/rear -D mkbackup root 27352 0.0 0.4 16080 4996 pts/1 S+ 13:44 0:00 /bin/bash usr/sbin/rear -D mkbackup root 27366 8.1 0.2 18256 2480 pts/1 S+ 13:44 0:00 tar --warning=no-xdev --sparse ... root 27368 1.0 0.0 4248 804 pts/1 S+ 13:44 0:00 dd of=/tmp/rear.czvXpnbkbMPDjq1/outputfs/f48/back.gz root 27377 0.0 0.3 16080 3328 pts/1 R+ 13:44 0:00 /bin/bash usr/sbin/rear -D mkbackup pstree -plau 9056 rear,9056 usr/sbin/rear -D mkbackup |-pstree,27378 -plau 9056 `-rear,27352 usr/sbin/rear -D mkbackup |-cat,27367 |-dd,27368 of=/tmp/rear.czvXpnbkbMPDjq1/outputfs/f48/backup.tar.gz `-tar,27366 --warning=no-xdev --sparse ... `-gzip,27371 Archived 46 MiB [avg 11960 KiB/sec] Archived 64 MiB [avg 8220 KiB/sec] ^CTerminating... The following ReaR (background) jobs are active and get terminated: Output of: jobs -p 27352 Output of: pstree -plau 9056 rear,9056 usr/sbin/rear -D mkbackup |-rear,27352 usr/sbin/rear -D mkbackup | `-dd,27368 of=/tmp/rear.czvXpnbkbMPDjq1/outputfs/f48/backup.tar.gz `-rear,27434 usr/sbin/rear -D mkbackup `-pstree,27435 -plau 9056 Output of: pgrep -a -P 9056 27352 /bin/bash usr/sbin/rear -D mkbackup 27440 /bin/bash usr/sbin/rear -D mkbackup Output of: ps auxw | egrep 'rear|tar' root 9056 3.7 0.5 16080 5932 pts/1 S+ 13:43 0:01 /bin/bash usr/sbin/rear -D mkbackup root 27352 0.0 0.4 16080 4996 pts/1 S+ 13:44 0:00 /bin/bash usr/sbin/rear -D mkbackup root 27368 1.2 0.0 4248 804 pts/1 D+ 13:44 0:00 dd of=/tmp/rear.czvXpnbkbMPDjq1/outputfs/f48/back.gz root 27446 0.0 0.4 16080 4612 pts/1 S+ 13:44 0:00 /bin/bash usr/sbin/rear -D mkbackup root 27448 0.0 0.0 4320 780 pts/1 S+ 13:44 0:00 egrep rear|tar Terminating jobs (PIDs 27352) Terminating job (PID 27352) You should also rm -Rf /tmp/rear.czvXpnbkbMPDjq1 rear mkbackup failed, check /root/rear.master/var/log/rear/rear-f48.log for details
Note how directly after Ctrl+C the following sub-sub-processes are already gone
`-tar,27366 --warning=no-xdev --sparse ... `-gzip,27371
only the 'dd' is left (shown by ps auxw | egrep 'rear|tar'
).
In some cases even the 'dd' was already gone directly after Ctrl+C for me.
This seems to indicate that under normal circumstances
the whole "kill all running jobs" in the DoExitTasks() function
is probably not needed - plain Ctrl+C should automatically
terminate all child processes.
In the end perhaps the root of all evil is only the SIGKILL
in the "kill all running jobs" in the DoExitTasks() function
that might disturb the normal Ctrl+C termination of all children.
jsmeix commented at 2018-01-31 16:03:¶
According to my above analysis this issue is no longer a minor bug
(it only happens not often in practice because users don't Ctrl+C
ReaR)
because I think how the DoExitTasks() function currently tries
to terminate descendant processes is descructive (i.e. just wrong).
jsmeix commented at 2018-01-31 16:06:¶
Via
https://github.com/rear/rear/pull/1720
I tried to implement clean termination of descendant processes
in the DoExitTasks() function via a new descendants_pids() function.
I found the basic idea (determine recursively children of children) in
https://bbs.archlinux.org/viewtopic.php?id=88081
schlomo commented at 2018-01-31 16:40:¶
What is the expected behavior when a user presses Ctrl-C?
For me it would be to cleanly abort ReaR which means:
- abort all background processes
- umount everything we mounted
- remove temporary files unless running in debug mode
- in general, leave the system in a clean state as it was before running ReaR.
So yes, we should kill everything. Maybe process groups can help us to catch all subprocesses by sending the TERM or KILL signal to the entire group.
jsmeix commented at 2018-02-01 10:50:¶
We should never "just kill" any process (i.e. send SIGKILL to it)
unless SIGTERM was sent a bit of time (e.g. a second) before
because SIGKILL is always descructive and no "rear WORKFLOW"
should do anything that is possibly descructive in particular
never ever when running in the living original system.
Process groups do not help, at least not with bash default behaviour.
It was my main finding that with bash default behaviour
one gets sub-processes that are not in the same process group
as their parent process.
On my SLES12-SP2 system:
# type -a descendants_pids descendants_pids is a function descendants_pids () { local parent_pid=$1; kill -0 $parent_pid 2> /dev/null || return 0; local child_pid=""; local children_pids=$( ps --ppid $parent_pid -o pid= ); if test "$children_pids"; then for child_pid in $children_pids; do descendants_pids $child_pid; done; fi; kill -0 $parent_pid 2> /dev/null && echo $parent_pid || return 0 } # { sleep 5 | grep foo | grep $( sleep 2 ; echo bar ) ; } & { sleep 7 | grep this | grep $( sleep 2 ; echo that ) ; } & echo ; sleep 1 ; pstree -plaug $$ ; descendants_pids $$ ; echo ; sleep 2 ; pstree -plaug $$ ; descendants_pids $$ [1] 25527 [2] 25528 bash,15851,15851 ├─bash,25527,25527 │ ├─bash,25535,25527 │ │ └─bash,25537,25527 │ │ └─sleep,25539,25527 2 │ ├─grep,25534,25527 --color=auto foo │ └─sleep,25533,25527 5 ├─bash,25528,25528 │ ├─bash,25532,25528 │ │ └─bash,25536,25528 │ │ └─sleep,25538,25528 2 │ ├─grep,25531,25528 --color=auto this │ └─sleep,25530,25528 7 └─pstree,25544,25544 -plaug 15851 25533 25534 25539 25537 25535 25527 25530 25531 25538 25536 25532 25528 15851 bash,15851,15851 ├─bash,25527,25527 │ ├─grep,25534,25527 --color=auto foo │ ├─grep,25535,25527 --color=auto bar │ └─sleep,25533,25527 5 ├─bash,25528,25528 │ ├─grep,25531,25528 --color=auto this │ ├─grep,25532,25528 --color=auto that │ └─sleep,25530,25528 7 └─pstree,25784,25784 -plaug 15851 25533 25534 25535 25527 25530 25531 25532 25528 15851
FYI: The '-g' option is not supported by older 'pstree' like the 'pstree' on SLE11.
Sub-processes that get run via command &
can get a different
process group, cf. "man bash" (on my SLES12-SP2 system):
To facilitate the implementation of the user interface to job control, the operating system maintains the notion of a current terminal process group ID. Members of this process group (processes whose process group ID is equal to the current terminal process group ID) receive keyboard-generated signals such as SIGINT. These processes are said to be in the foreground. Background processes are those whose process group ID differs from the terminal's; such processes are immune to keyboard-generated signals.
To cleanly terminate all descendant processes in any case
regardless how those processes have been launched
one must recursively determine all child processes PIDs
and send SIGTERM to each of them.
jsmeix commented at 2018-02-06 15:59:¶
With
https://github.com/rear/rear/pull/1720
merged
I consider this issue to be fixed.
[Export of Github issue for rear/rear.]