I wrote a network daemon that forks off children to handle TCP connections. On SIGINT the main process triggers a kill for each child in order to clean up and to collect some final statistics.

In almost all cases that works fine, and the child processes terminate very quickly. Occasionally, however, a child process refuses to die within a short timeout (5 seconds in my case).

I had no idea what was happening, so I added some verbose output to diagnose that case. I found that opening a connection with netcat and then suspending the netcat process sometimes triggers the effect.

When I was able to reproduce the effect the debug output was:

REST-server(cleanup_queue): deleting children
REST-server(cleanup_queue): deleting PID 23344 handling localhost:48114
child_delete: Killing child 23344
child_delete: killed child with PID 23344
(r1, r2) = (1, Interrupted system call)
_limited_wait(PID 23344 terminated): waiting up to 5 seconds for condition
_limited_wait(PID 23344 terminated): waiting 0.02 (of 5 remaining) seconds
(r1, r2) = (1, Interrupted system call)
_limited_wait(PID 23344 terminated): waiting 0.04 (of 4.98 remaining) seconds
(r1, r2) = (1, Interrupted system call)
_limited_wait(PID 23344 terminated): waiting 0.08 (of 4.94 remaining) seconds
(r1, r2) = (1, Interrupted system call)
_limited_wait(PID 23344 terminated): waiting 0.16 (of 4.86 remaining) seconds
(r1, r2) = (1, Interrupted system call)
_limited_wait(PID 23344 terminated): waiting 0.32 (of 4.7 remaining) seconds
(r1, r2) = (1, Interrupted system call)
_limited_wait(PID 23344 terminated): waiting 0.64 (of 4.38 remaining) seconds
(r1, r2) = (1, Interrupted system call)
_limited_wait(PID 23344 terminated): waiting 1.28 (of 3.74 remaining) seconds
(r1, r2) = (1, Interrupted system call)
_limited_wait(PID 23344 terminated): waiting 2.46 (of 2.46 remaining) seconds
(r1, r2) = (1, Interrupted system call)
child_delete: PID 23344 refused to terminate within 5s
failed to delete child PID 23344

The "condition" to wait for in that case was the result of this closure:

sub {
    my $r1 = kill(0, $child_pid);
    my $r2 = $!;
    print "(r1, r2) = ($r1, $r2)\n";
    $r1 != 1 && $r2 == Errno::ESRCH;
}

So the expected outcome is that the main process is unable to "kill" the PID because it no longer exists (and not because of a "permission denied" error).
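For comparison, here is a minimal sketch of a stricter version of that check. It rests on two facts: kill returns the number of processes it signaled, so r1 = 1 means the call succeeded, and $! is only meaningful immediately after a failed call, so the text printed next to a successful kill may be left over from an earlier interrupted call. The helper name child_is_gone is hypothetical and not part of the daemon. The sketch also assumes that a child which has exited but has not yet been reaped should count as terminated, which is why it tries waitpid first.

```perl
use strict;
use warnings;
use POSIX qw(WNOHANG);
use Errno qw(ESRCH);

# Hypothetical helper (not part of the daemon): true once $child_pid is gone.
sub child_is_gone {
    my ($child_pid) = @_;

    # Reap the child if it has already exited. waitpid returns the PID when
    # it reaped the child, 0 while it is still running, and -1 when there is
    # no such child (for example because a SIGCHLD handler reaped it first).
    my $reaped = waitpid($child_pid, WNOHANG);
    return 1 if $reaped == $child_pid;

    $! = 0;                             # discard any stale errno
    my $r1 = kill(0, $child_pid);       # signal 0: existence check only

    # Look at $! only when kill reported failure.
    return ($r1 == 0 && $! == ESRCH) ? 1 : 0;
}
```

With a check of this shape, an "Interrupted system call" left in $! by an earlier sleep or select cannot influence the result, because $! is only read after kill has actually failed.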

However, for some reason I repeatedly get "Interrupted system call" instead.

The main process uses signal handlers like this:

$SIG{'INT'} = $SIG{'TERM'} = sub ($) {
    my $signal = 'SIG' . $_[0];
    my $me = "signal handler[$$, $signal]";

    print "$me: cleaning up\n"
        if ($verbose > 0);
    cleanup();
    print "$me: executing default action\n"
        if ($verbose > 1);
    $SIG{$_[0]} = 'DEFAULT';
    kill($_[0], $$);                    # execute default action
};

And when forking a child process, I reset the signal handlers like this:

sub child_create($)
{
    my ($child) = @_;
    my $pid;

    reaper(0);                          # disable for the child
    if ($pid = fork()) {                # parent
        reaper(1);                      # enable for the parent
    } elsif (defined($pid)) {           # child
        my ($child_fun, @child_param) = @$child;
        my $ret;

        # prevent double-cleanup
        $SIG{'INT'} = $SIG{'TERM'} = $SIG{'__DIE__'} = 'DEFAULT';
        $ret = $child_fun->(@child_param);
        exit($ret);                     # avoid returning from function call
    } else {                            # error
        print STDERR "child_create: fork(): $!\n";
    }
    return $pid;
}

The reaper() just handles SIGCHLD.
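The real reaper() is not shown here. As an illustration only, a toggleable SIGCHLD handler of that kind could look like the sketch below; the %exit_status hash and the exact behavior of the enable/disable argument are assumptions made for the example, not the daemon's actual code.

```perl
use strict;
use warnings;
use POSIX qw(WNOHANG);

my %exit_status;    # hypothetical: PID => wait status of reaped children

# Assumed shape of reaper(): a true argument installs the SIGCHLD handler,
# a false argument restores the default disposition.
sub reaper {
    my ($enable) = @_;

    if ($enable) {
        $SIG{CHLD} = sub {
            # Keep the interrupted code's view of $! and $? intact.
            local ($!, $?);
            # Several children may exit before the handler runs once,
            # so collect every child that is ready.
            while ((my $pid = waitpid(-1, WNOHANG)) > 0) {
                $exit_status{$pid} = $?;
            }
        };
    } else {
        $SIG{CHLD} = 'DEFAULT';
    }
    return;
}
```

A handler like this reaps children asynchronously, so any later waitpid on the same PID in the main code would return -1.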

What could cause the effect seen? The child processes basically do a while (defined(my $req = $conn->get_request)) {...} (using HTTP::Daemon), so they should be waiting for input in the netcat case.

Additional info

Just in case it might matter: OS is SLES12 SP5 (using Perl 5.18.2) running on VMware.

The code in the main server loop looks like this:

while (defined(my $conn = $daemon->accept) || $! == Errno::EINTR) {
    my $errno = $!;

    if ($quit_flag != 0) {
        last;
    }
    if ($errno == Errno::EINTR) {
        next;
    }
    #... handle $req->uri->path()
}
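A variant of this loop that consults $! only after accept has actually failed might look like the following sketch. The helper accept_retrying and its two code-reference parameters are hypothetical names introduced for the example; in the daemon they would correspond to a wrapper around $daemon->accept and a test of $quit_flag.

```perl
use strict;
use warnings;
use Errno qw(EINTR);

# Hypothetical helper: call $accept->() until it yields a connection.
# Retries when the call was interrupted by a signal; returns undef when
# $quit->() is true or when accept fails for any other reason.
sub accept_retrying {
    my ($accept, $quit) = @_;

    while (1) {
        return if $quit->();
        $! = 0;                         # discard any stale errno
        my $conn = $accept->();
        return $conn if defined $conn;  # success: $! is not inspected
        next if $! == EINTR;            # interrupted by a signal: retry
        return;                         # real error
    }
}
```

The difference from the loop above is that a successful accept is never discarded because of an EINTR value left behind by some earlier call.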
