#1712 Issue closed: during "rear mkbackup" [Ctrl]+[C] does not terminate 'tar' background process

Labels: enhancement, bug, cleanup, fixed / solved / done

jsmeix opened issue at 2018-01-26 13:53:

Current master code after ReaR 2.3.

When during "rear mkbackup" the 'tar' background process is already running
pressing [Ctrl]+[C] (i.e. sending SIGINT) only terminates the foreground
process (i.e. the bash script) but neither the 'tar' background process
nor the 'dd' background process:

# usr/sbin/rear -D mkbackup
...
Creating tar archive '/tmp/rear.lswe2CuMGSACoyi/outputfs/g241/backup.tar.gz'
Preparing archive operation
Archived 39 MiB [avg 4528 KiB/sec] 
Archived 57 MiB [avg 3948 KiB/sec] 
Archived 73 MiB [avg 4168 KiB/sec] 
Archived 143 MiB [avg 5074 KiB/sec] 
^CYou should also rm -Rf /tmp/rear.lswe2CuMGSACoyi
rear mkbackup failed, check /root/rear.master/var/log/rear/rear-g241.log for details

# ps auxw | grep tar
root     14524 19.4  0.5  12808  4628 pts/0    S    14:40   0:08 tar --warning=no-xdev --sparse --block-number --totals --verbose --no-wildcards-match-slash --one-file-system --ignore-failed-read --anchored --gzip -X /tmp/rear.lswe2CuMGSACoyi/tmp/backup-exclude.txt -C / -c -f - / /root/rear.master/var/log/rear/rear-g241.log
root     14526  0.8  0.0   8296   612 pts/0    S    14:40   0:00 dd of=/tmp/rear.lswe2CuMGSACoyi/outputfs/g241/backup.tar.gz
root     14583  0.0  0.0   5732   812 pts/0    R+   14:40   0:00 grep tar

The background 'tar' and 'dd' processes run until they finish
and at the very end the mounted NFS share is still mounted:

# mount | grep rear
10.160.4.244:/nfs on /tmp/rear.lswe2CuMGSACoyi/outputfs type nfs (rw,nfsvers=3,nolock,addr=10.160.4.244)

I think pressing [Ctrl]+[C] during any "rear WORKFLOW"
should also terminate all background processes
and clean up e.g. umount ReaR specific things
(i.e. ensure the EXIT_TASKS can run successfully).

I think the current behaviour is at most a "minor bug".

Or is there a reason why the current behaviour
is perhaps even intentional?

jsmeix commented at 2018-01-26 14:37:

As a first idea I tried in lib/_input-output-functions.sh adding

 ... pkill -P $MASTER_PID ; wait ...

before finally kill $MASTER_PID to the USR1 trap like

builtin trap " ... pkill -P $MASTER_PID ; wait ; kill $MASTER_PID" USR1

but that did not help.

I also tried in lib/_input-output-functions.sh
adding the same to the DoExitTasks() function like

function DoExitTasks () {
    LogPrint "Running exit tasks"
    LogPrint "Terminating child processes"
    LogPrint "$( pgrep -a -P $MASTER_PID )"
    pkill -P $MASTER_PID
    LogPrint "Waiting for child processes to finish..."
    wait
    LogPrint "Child processes finished"
...

but that also did not help.

Somehow that "kill all running jobs" in the DoExitTasks() function
does not do what it should do...

gozora commented at 2018-01-27 11:29:

Oddly enough, I'm not suffering by this. Recently I had to interrupt ReaR with ^C multiple times as well, but I did not had dd command left running.
Running ReaR:

sp2:/ # rear -D mkbackup
...

Making ISO image
Wrote ISO image: /var/lib/rear/output/rear-sp2.iso (183M)
Copying resulting files to nfs location
Saving /var/log/rear/rear-sp2.log as rear-sp2.log to nfs location
Creating tar archive '/tmp/rear.2HQ5NB3tdmpSCID/outputfs/sp2/backup.tar.gz'
Archived 169 MiB [avg 9614 KiB/sec]

sp2:/ # ps auxw | grep tar
root     17251  4.4  0.6  19196  3448 pts/6    S+   12:16   0:00 tar --warning=no-xdev --sparse --block-number --totals --verbose --no-wildcards-match-slash --one-file-system --ignore-failed-read --anchored --xattrs --xattrs-include=security.capability --xattrs-include=security.selinux --acls --gzip -X /tmp/rear.2HQ5NB3tdmpSCID/tmp/backup-exclude.txt -C / -c -f - /var/log /var/lib/pgsql /var/tmp /srv /var/spool /tmp /boot/grub2/x86_64-efi /var/lib/named /home /usr/local /boot/grub2/i386-pc /var/lib/mariadb /var/lib/machines /var/lib/mailman /var/opt /var/lib/mysql /var/lib/libvirt/images /var/cache /opt / /boot/efi /var/log/rear/rear-sp2.log
root     17253  1.0  0.1   4248   712 pts/6    S+   12:16   0:00 dd of=/tmp/rear.2HQ5NB3tdmpSCID/outputfs/sp2/backup.tar.gz
root     17285  0.0  0.3  10508  1592 pts/5    R+   12:16   0:00 grep --color=auto tar

After ^C

Creating tar archive '/tmp/rear.2HQ5NB3tdmpSCID/outputfs/sp2/backup.tar.gz'
Archived 169 MiB [avg 9614 KiB/sec] 
^C
You should also rm -Rf /tmp/rear.2HQ5NB3tdmpSCID
rear mkbackup failed, check /var/log/rear/rear-sp2.log for details

sp2:/ # ps auxw | grep tar
root     17359  0.0  0.3  10508  1592 pts/5    R+   12:16   0:00 grep --color=auto tar

V.

jsmeix commented at 2018-01-30 13:11:

Phew! The analysis took much longer than expected:

Also for me Ctrl+C usually works - but sometimes not.

Summary:

At least sometimes that "kill all running jobs" in the DoExitTasks() function
does not do what it should do because it does not terminate all child processes
because it fails to determine all child processses ('jobs -p' is insufficient for that).

Usually 'jobs -p' reports only one child and that gets killed with a brutal SIGKILL
that leaves the poor child (and all its own grandchildren) no chance
to do any own cleanup.

Therefore the "kill all running jobs" in the DoExitTasks() function
needs to be enhanced to determine all child processes
and to recursively terminate each of them more gracefully
first by sending SIGTERM and waiting a bit for them to terminate
and finally sending SIGKILL only to remaining child processes.

Details on my SLES12-SP2 system:

For analysis I enhanced lib/_input-output-functions.sh

function DoExitTasks () {
    Log "Running exit tasks."
    # terminate all running jobs
    JOBS=( $( jobs -p ) )
    # when "jobs -p" results nothing then JOBS is still an unbound variable so that
    # an empty default value is used to avoid 'set -eu' error exit if $JOBS is unset:
    if test -n ${JOBS:-""} ; then
        LogPrint "Terminating..."
        LogPrint "The following ReaR (background) jobs are active and get terminated:"
        LogPrint "Output of: jobs -p"
        LogPrint "$( jobs -p )"
        LogPrint "Output of: pstree -plau $MASTER_PID"
        LogPrint "$( pstree -plau $MASTER_PID )"
        LogPrint "Output of: pgrep -a -P $MASTER_PID"
        LogPrint "$( pgrep -a -P $MASTER_PID )"
        LogPrint "Output of: ps auxw | egrep 'rear|tar'"
        LogPrint "$( ps auxw | egrep 'rear|tar' )"
        LogPrint "Terminating jobs (PIDs ${JOBS[*]})"
        for job in "${JOBS[@]}" ; do
            LogPrint "Terminating job (PID $job)"
            kill $job 1>&2
        done
        sleep 1
    ...

and backup/NETFS/default/500_make_backup.sh

sleep 1 # Give the backup software a good chance to start working

echo "ps auxw | egrep 'rear|tar'" 1>&7
ps auxw | egrep -i1 'rear|tar' 1>&7
echo "pstree -plau $MASTER_PID" 1>&7
pstree -plau $MASTER_PID 1>&7

then I did "rear mkbackup" with Ctrl+C during "archive operation"

# usr/sbin/rear -D mkbackup
...
Preparing archive operation
ps auxw | egrep 'rear|tar'
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      9056  4.6  0.6  16080  6260 pts/1    S+   13:43   0:01 /bin/bash usr/sbin/rear -D mkbackup
root     27352  0.0  0.4  16080  4996 pts/1    S+   13:44   0:00 /bin/bash usr/sbin/rear -D mkbackup
root     27366  8.1  0.2  18256  2480 pts/1    S+   13:44   0:00 tar --warning=no-xdev --sparse ...
root     27368  1.0  0.0   4248   804 pts/1    S+   13:44   0:00 dd of=/tmp/rear.czvXpnbkbMPDjq1/outputfs/f48/back.gz
root     27377  0.0  0.3  16080  3328 pts/1    R+   13:44   0:00 /bin/bash usr/sbin/rear -D mkbackup
pstree -plau 9056
rear,9056 usr/sbin/rear -D mkbackup
  |-pstree,27378 -plau 9056
  `-rear,27352 usr/sbin/rear -D mkbackup
      |-cat,27367
      |-dd,27368 of=/tmp/rear.czvXpnbkbMPDjq1/outputfs/f48/backup.tar.gz
      `-tar,27366 --warning=no-xdev --sparse ...
          `-gzip,27371
Archived 46 MiB [avg 11960 KiB/sec] 
Archived 64 MiB [avg 8220 KiB/sec] 
^CTerminating...
The following ReaR (background) jobs are active and get terminated:
Output of: jobs -p
27352
Output of: pstree -plau 9056
rear,9056 usr/sbin/rear -D mkbackup
  |-rear,27352 usr/sbin/rear -D mkbackup
  |   `-dd,27368 of=/tmp/rear.czvXpnbkbMPDjq1/outputfs/f48/backup.tar.gz
  `-rear,27434 usr/sbin/rear -D mkbackup
      `-pstree,27435 -plau 9056
Output of: pgrep -a -P 9056
27352 /bin/bash usr/sbin/rear -D mkbackup
27440 /bin/bash usr/sbin/rear -D mkbackup
Output of: ps auxw | egrep 'rear|tar'
root      9056  3.7  0.5  16080  5932 pts/1    S+   13:43   0:01 /bin/bash usr/sbin/rear -D mkbackup
root     27352  0.0  0.4  16080  4996 pts/1    S+   13:44   0:00 /bin/bash usr/sbin/rear -D mkbackup
root     27368  1.2  0.0   4248   804 pts/1    D+   13:44   0:00 dd of=/tmp/rear.czvXpnbkbMPDjq1/outputfs/f48/back.gz
root     27446  0.0  0.4  16080  4612 pts/1    S+   13:44   0:00 /bin/bash usr/sbin/rear -D mkbackup
root     27448  0.0  0.0   4320   780 pts/1    S+   13:44   0:00 egrep rear|tar
Terminating jobs (PIDs 27352)
Terminating job (PID 27352)
You should also rm -Rf /tmp/rear.czvXpnbkbMPDjq1
rear mkbackup failed, check /root/rear.master/var/log/rear/rear-f48.log for details

Note how directly after Ctrl+C the following sub-sub-processes are already gone

      `-tar,27366 --warning=no-xdev --sparse ...
          `-gzip,27371

only the 'dd' is left (shown by ps auxw | egrep 'rear|tar').

In some cases even the 'dd' was already gone directly after Ctrl+C for me.

This seems to indicate that under normal circumstances
the whole "kill all running jobs" in the DoExitTasks() function
is probably not needed - plain Ctrl+C should automatically
terminate all child processes.

In the end perhaps the root of all evil is only the SIGKILL
in the "kill all running jobs" in the DoExitTasks() function
that might disturb the normal Ctrl+C termination of all children.

jsmeix commented at 2018-01-31 16:03:

According to my above analysis this issue is no longer a minor bug
(it only happens not often in practice because users don't Ctrl+C ReaR)
because I think how the DoExitTasks() function currently tries
to terminate descendant processes is descructive (i.e. just wrong).

jsmeix commented at 2018-01-31 16:06:

Via https://github.com/rear/rear/pull/1720
I tried to implement clean termination of descendant processes
in the DoExitTasks() function via a new descendants_pids() function.

I found the basic idea (determine recursively children of children) in
https://bbs.archlinux.org/viewtopic.php?id=88081

schlomo commented at 2018-01-31 16:40:

What is the expected behavior when a user presses Ctrl-C?

For me it would be to cleanly abort ReaR which means:

  • abort all background processes
  • umount everything we mounted
  • remove temporary files unless running in debug mode
  • in general, leave the system in a clean state as it was before running ReaR.

So yes, we should kill everything. Maybe process groups can help us to catch all subprocesses by sending the TERM or KILL signal to the entire group.

jsmeix commented at 2018-02-01 10:50:

We should never "just kill" any process (i.e. send SIGKILL to it)
unless SIGTERM was sent a bit of time (e.g. a second) before
because SIGKILL is always descructive and no "rear WORKFLOW"
should do anything that is possibly descructive in particular
never ever when running in the living original system.

Process groups do not help, at least not with bash default behaviour.

It was my main finding that with bash default behaviour
one gets sub-processes that are not in the same process group
as their parent process.

On my SLES12-SP2 system:

# type -a descendants_pids
descendants_pids is a function
descendants_pids () 
{ 
    local parent_pid=$1;
    kill -0 $parent_pid 2> /dev/null || return 0;
    local child_pid="";
    local children_pids=$( ps --ppid $parent_pid -o pid= );
    if test "$children_pids"; then
        for child_pid in $children_pids;
        do
            descendants_pids $child_pid;
        done;
    fi;
    kill -0 $parent_pid 2> /dev/null && echo $parent_pid || return 0
}

# { sleep 5 | grep foo | grep $( sleep 2 ; echo bar ) ; } & { sleep 7 | grep this | grep $( sleep 2 ; echo that ) ; } & echo ; sleep 1 ; pstree -plaug $$ ; descendants_pids $$ ; echo ; sleep 2 ; pstree -plaug $$ ; descendants_pids $$
[1] 25527
[2] 25528

bash,15851,15851
  ├─bash,25527,25527
  │   ├─bash,25535,25527
  │   │   └─bash,25537,25527
  │   │       └─sleep,25539,25527 2
  │   ├─grep,25534,25527 --color=auto foo
  │   └─sleep,25533,25527 5
  ├─bash,25528,25528
  │   ├─bash,25532,25528
  │   │   └─bash,25536,25528
  │   │       └─sleep,25538,25528 2
  │   ├─grep,25531,25528 --color=auto this
  │   └─sleep,25530,25528 7
  └─pstree,25544,25544 -plaug 15851
25533
25534
25539
25537
25535
25527
25530
25531
25538
25536
25532
25528
15851

bash,15851,15851
  ├─bash,25527,25527
  │   ├─grep,25534,25527 --color=auto foo
  │   ├─grep,25535,25527 --color=auto bar
  │   └─sleep,25533,25527 5
  ├─bash,25528,25528
  │   ├─grep,25531,25528 --color=auto this
  │   ├─grep,25532,25528 --color=auto that
  │   └─sleep,25530,25528 7
  └─pstree,25784,25784 -plaug 15851
25533
25534
25535
25527
25530
25531
25532
25528
15851

FYI: The '-g' option is not supported by older 'pstree' like the 'pstree' on SLE11.

Sub-processes that get run via command & can get a different
process group, cf. "man bash" (on my SLES12-SP2 system):

To facilitate the implementation of the user interface to job control,
the operating system maintains the notion of a current terminal
process group ID. Members of this process group (processes
whose process group ID is equal to the current terminal process
group ID) receive keyboard-generated signals such as SIGINT.
These processes are said to be in the foreground.
Background processes are those whose process group ID
differs from the terminal's; such processes are immune to
keyboard-generated signals.

To cleanly terminate all descendant processes in any case
regardless how those processes have been launched
one must recursively determine all child processes PIDs
and send SIGTERM to each of them.

jsmeix commented at 2018-02-06 15:59:

With https://github.com/rear/rear/pull/1720 merged
I consider this issue to be fixed.


[Export of Github issue for rear/rear.]