
Use-case

I have backup software that is able to read the data to be backed up from named pipes. I want to use that feature to back up database dumps of e.g. MySQL and PostgreSQL hosted within multiple different VMs, using the VM host only. The approach is to connect into the VMs using SSH, start the corresponding dump tool so that it writes its output to STDOUT, forward that through SSH and pipe the STDOUT of SSH into a named pipe created using mkfifo.

The important thing to note is that reading from a pipe blocks until something is written into it, so the SSH processes and the backup software need to run at the same time: SSH writes, the backup software reads all available named pipes one after another. I've already tested this manually using SSH and cat or vi and things work in general: the dump within a VM only starts the moment a reader is attached to the pipe, and all data from the dump is available in the end. That is easy to verify with mysqldump in particular, as it outputs easy-to-debug plain text by default.
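For reference, the manual test described above looks roughly like the following minimal sketch; the host, database and paths are placeholders, not my actual setup:

    # Hypothetical names/paths for illustration only.
    mkfifo /tmp/mydb.sql.fifo

    # Writer: run the dump remotely and feed SSH's STDOUT into the named pipe.
    # The dump inside the VM only starts once a reader opens the pipe.
    ssh db-vm 'mysqldump --single-transaction mydb' > /tmp/mydb.sql.fifo &

    # Reader: normally the backup software; cat stands in for it here.
    cat /tmp/mydb.sql.fifo > /tmp/mydb.sql

    wait                       # reap the background ssh
    rm /tmp/mydb.sql.fifo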

Problem

The backup software supports calling hook scripts before the actual backup is processed. So I implemented such a script starting the SSH processes in the background, with the assumption that when the hook script itself returns, the backup software will start reading the pipes. The hook mechanism itself works; I'm using the same approach successfully to create/destroy file system snapshots within the VMs before/after the backup.

The important thing to note here is that the backup software needs to wait for the first-level hook script to finish, because only afterwards is it assured that all SSH processes are running in the background. That is exactly how the other hooks creating snapshots within the VMs already work.

However, the backup software MUST NOT wait for the children of the first-level hook script, because those need to write the data the backup software is going to read. Writing and reading need to happen in parallel, one named pipe after another. But it currently seems that the backup software does wait for all child processes for some reason.

I've debugged things and am fairly sure that the main hook script really finishes: tracing its function calls with bash suggests as much, and its PID is simply gone at some point as well. Only the child processes stay around, as expected. After killing ALL of those using pkill, the backup software continues its processing, so it really does seem to wait for all children. Of course, with no writers attached to the pipes anymore, the backup then blocks on those forever, but that is expected.

Research

The backup software is implemented in Python and waits for the hook script's output by default. That is correct and expected, but some people claim that under some circumstances Python waits for child processes as well. However, the workaround suggested there, using shell=True, seems to be used by the backup software already.

So there are only two choices: either the backup software really is waiting for all child processes to retrieve their output and its implementation needs to be changed, or I'm simply doing something wrong when executing SSH, those processes are not properly detached from the main hook script and it doesn't really finish entirely, or something like that, which makes the backup software keep waiting.
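One detail that might connect those two choices: a reader on a pipe only sees EOF once every process holding the write end has closed it. So if the hook script's STDOUT is a pipe captured by the backup software and my backgrounded children inherit that descriptor, the reader keeps waiting even after the hook script itself has exited. I can't tell from the excerpts below whether that is what happens here, but a plain bash demonstration of the effect looks like this:

    # The subshell exits immediately, but the backgrounded sleep inherits
    # its STDOUT (the write end of the pipe). cat therefore only sees EOF
    # after roughly 30 seconds, when the last writer goes away.
    time ( echo started; sleep 30 & ) | cat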

Backup software code

The following is how the hook script gets executed:

                execute.execute_command(
                    [command],
                    output_log_level=logging.ERROR
                    if description == 'on-error'
                    else logging.WARNING,
                    shell=True,
                )

The following is how the process gets started:

    process = subprocess.Popen(
        command,
        stdin=input_file,
        stdout=None if do_not_capture else (output_file or subprocess.PIPE),
        stderr=None if do_not_capture else (subprocess.PIPE if output_file else subprocess.STDOUT),
        shell=shell,
        env=environment,
        cwd=working_directory,
    )

    if not run_to_completion:
        return process

    log_outputs(
        (process,), (input_file, output_file), output_log_level, borg_local_path=borg_local_path
    )

The following is an excerpt of reading output of processes:

    buffer_last_lines = collections.defaultdict(list)
    process_for_output_buffer = {
        output_buffer_for_process(process, exclude_stdouts): process
        for process in processes
        if process.stdout or process.stderr
    }
    output_buffers = list(process_for_output_buffer.keys())
    # Log output for each process until they all exit.
    while True:
        if output_buffers:
            (ready_buffers, _, _) = select.select(output_buffers, [], [])
            for ready_buffer in ready_buffers:
                ready_process = process_for_output_buffer.get(ready_buffer)
                # The "ready" process has exited, but it might be a pipe destination with other
                # processes (pipe sources) waiting to be read from. So as a measure to prevent
                # hangs, vent all processes when one exits.
                if ready_process and ready_process.poll() is not None:
                    for other_process in processes:
                        if (
                            other_process.poll() is None
                            and other_process.stdout
                            and other_process.stdout not in output_buffers
                        ):
                            # Add the process's output to output_buffers to ensure it'll get read.
                            output_buffers.append(other_process.stdout)
                line = ready_buffer.readline().rstrip().decode()
                if not line or not ready_process:
                    continue
[...]
        still_running = False
        for process in processes:
            exit_code = process.poll() if output_buffers else process.wait()
[...]
        if not still_running:
            break
    # Consume any remaining output that we missed (if any).
    for process in processes:
        output_buffer = output_buffer_for_process(process, exclude_stdouts)
        if not output_buffer:
            continue
[...]

Running processes

The following are the running processes when the backup software doesn't move forward. The bash processes at the bottom most likely contain my started SSH instances, though I'm wondering why they are still associated with their parent bash. OTOH, all of those bash instances seem properly released from my hook shell script, which should be the one zombie process shown below. I guess that zombie simply needs to go away to get this fixed...

   1641 ?        Ss     0:02 sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups
1835356 ?        Ss     0:00  \_ sshd: [USR1] [priv]
1835380 ?        S      0:00  |   \_ sshd: [USR1]@pts/1
1835381 pts/1    Ss     0:00  |       \_ -bash
1835418 pts/1    S      0:00  |           \_ sudo -i
1835612 pts/1    S      0:00  |               \_ -bash
1835621 pts/1    S      0:00  |                   \_ sudo -u [USR2] -i
1835622 pts/1    S      0:00  |                       \_ -bash
1840864 pts/1    S+     0:00  |                           \_ sudo borgmatic create --config /[...]/[HOST]_piped.yaml
1840865 pts/1    S+     0:00  |                               \_ /usr/bin/python3 /usr/local/bin/borgmatic create --config /[...]
1840874 pts/1    Z+     0:00  |                                   \_ [sh] <defunct>
1840918 pts/1    S+     0:00 /bin/bash /[...]/other_mysql_dumps.sh [HOST] [TOPIC]
1840920 pts/1    S+     0:00 /bin/bash /[...]/other_mysql_dumps.sh [HOST] [TOPIC]
1840922 pts/1    S+     0:00 /bin/bash /[...]/other_mysql_dumps.sh [HOST] [TOPIC]
1840924 pts/1    S+     0:00 /bin/bash /[...]/other_mysql_dumps.sh [HOST] [TOPIC]
1840926 pts/1    S+     0:00 /bin/bash /[...]/other_mysql_dumps.sh [HOST] [TOPIC]
1840928 pts/1    S+     0:00 /bin/bash /[...]/other_mysql_dumps.sh [HOST] [TOPIC]
1840930 pts/1    S+     0:00 /bin/bash /[...]/other_mysql_dumps.sh [HOST] [TOPIC]
1840932 pts/1    S+     0:00 /bin/bash /[...]/other_mysql_dumps.sh [HOST] [TOPIC]
1840934 pts/1    S+     0:00 /bin/bash /[...]/other_mysql_dumps.sh [HOST] [TOPIC]
1840936 pts/1    S+     0:00 /bin/bash /[...]/other_mysql_dumps.sh [HOST] [TOPIC]
1840938 pts/1    S+     0:00 /bin/bash /[...]/other_mysql_dumps.sh [HOST] [TOPIC]
1840940 pts/1    S+     0:00 /bin/bash /[...]/other_mysql_dumps.sh [HOST] [TOPIC]
1840942 pts/1    S+     0:00 /bin/bash /[...]/other_mysql_dumps.sh [HOST] [TOPIC]
1840944 pts/1    S+     0:00 /bin/bash /[...]/other_mysql_dumps.sh [HOST] [TOPIC]
1840946 pts/1    S+     0:00 /bin/bash /[...]/other_mysql_dumps.sh [HOST] [TOPIC]

SSH-calls

This is what I'm doing in a loop. The individual arguments shouldn't be too important; the focus is on the use of nohup, the handling of input/output, putting things into the background etc.

nohup ssh -F ${PATH_LOCAL_SSH_CFG} -f -n ${HOST_OTHER} ${cmd} < '/dev/null' > "${PATH_LOCAL_MNT2}/${db_name}" 2> '/dev/null' &

So, is this a correct approach to make a locally asynchronous SSH call, or am I doing something wrong already?

I need to know whether I should focus my debugging on my shell calls or on the backup software.
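One way to narrow that down without touching the backup software would be to check which file descriptors the lingering children still hold open; an inherited pipe to the Python process would show up there. The PID below is a placeholder:

    # List the lingering hook-script children and inspect their open fds;
    # look for pipe:[...] entries on fd 1 or 2.
    pgrep -af other_mysql_dumps.sh
    ls -l /proc/<PID>/fd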

Thanks!

  • "all of the child processes stay around, like expected" – Adopted by PID 1 or by the backup software? I'm not a programmer, I cannot tell by reading code. If you check this with pstree then maybe you will advance the research. Commented Apr 25, 2022 at 4:20
  • I don't see my process with pstree at all, but ps axf clearly shows them. Have a look at the updated question. Looks to me like they are properly detached from my hook script. Commented Apr 29, 2022 at 16:33
  • Further debugging found the problem: I'm redirecting the output of SSH into a file, and that either keeps the backup software waiting forever or keeps the SSH process from detaching from the parent bash, not sure yet... When redirecting the SSH output to /dev/null, I can see many individual SSH processes and the backup software no longer waits forever. So I need a way to start SSH with the redirection in place while still detaching it from the parent shell. Commented Apr 29, 2022 at 17:00
  • I know now that an open descriptor is the issue; the point is that I want the output of SSH redirected to a file. Though, the parent shell starting SSH shouldn't wait on SSH. I need a way for someone else to take care of the redirection; sadly SSH can't simply be configured to write into a file on its own. :-/ Commented Apr 29, 2022 at 17:49
  • I find this question very interesting (I'm the one who voted up). Unfortunately IMO your answer does not really explain what happens, I think it's partially voodoo. Don't get me wrong, I believe you it works, but not necessarily in the described way. I have a hypothesis that may explain everything, I guess, but there's an aspect I'm not sure because I have no experience with Python. If you don't mind helping me solve the puzzle then there's one simple test: if in your original nohup ssh … line you change the order of redirections from < … > … 2> … to < … 2> … > …, will it work? Commented Apr 29, 2022 at 21:40

1 Answer

nohup ssh -F ${PATH_LOCAL_SSH_CFG} -f -n ${HOST_OTHER} ${cmd} < '/dev/null' > "${PATH_LOCAL_MNT2}/${db_name}" 2> '/dev/null' &

The above is wrong if one really wants a totally independent background process, because of the redirection of SSH's output into a file. Who performs that redirection in the end? It's the bash process starting the actual command, which in my case is the hook script of the backup software, and that is what makes the hook script a zombie in the end. That zombie tries to take care of the redirection, and that is the reason why all of those bash instances are still visible in my output of ps axf. When the redirection to the file is changed to /dev/null, the hook script really terminates entirely, the backup software continues, and the SSH instances show up as such in ps axf.

Of course, in my concrete use-case I need the output redirected into the file; that's what this is all about. Though, I "simply" need it done differently: the shell executing my hook script needs to start a process that takes care of the redirection on its own, while the starting shell is fully detached from that started process. That would be really easy if SSH were able to write to a file on its own, which doesn't seem to be the case; instead, it seems to rely entirely on the shell to take care of redirection. This means I need an additional shell instance: one executing SSH and taking care of the redirection into the file, while at the same time that shell instance is executed in the background of its parent and the parent ignores all of its input/output.

In fact, I was pretty close to the solution all the time already, but simply didn't fully understand what I was doing. The examples with nohup only made things more difficult, because I need an additional shell instance in the end. The following is what I already had before, with the help of some other SO questions:

(
  trap '' HUP INT
  ${cmd_exec} <<< "${cmd}"
) < '/dev/null' > "${PATH_LOCAL_MNT2}/${db_name}" 2> '/dev/null' &

https://gist.github.com/bluekezza/e511f3f4429939a0f9ecb6447099b3dc https://stackoverflow.com/a/54688673/2055163

The above executes my SSH command as a compound command, which allows SIGHUP etc. to be ignored. Depending on the configuration of the current shell, this might be necessary to keep the executed command running in the background when the parent ends.

The important point to understand is that this already creates an additional subshell, which is exactly what I need! The problem with the above was that the parent shell waits for the output of the subshell, because the redirection of STDOUT was defined at the level of the parent shell, pretty much like in my nohup examples. And this is the easy part now: as we have a subshell with its own STDOUT, that can simply be redirected into the file from within the subshell! This way the parent shell can be fully detached from all channels, the compound command is executed in the background, the backup software continues to run because the hook script doesn't become a zombie, and it is finally able to read from the pipes as the database dumpers write into them through SSH. :-)

(
  trap '' HUP INT
  ${cmd_exec} <<< "${cmd}" > "${PATH_LOCAL_MNT2}/${db_name}"
) < '/dev/null' > '/dev/null' 2> '/dev/null' &
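For completeness, this is roughly how the working variant slots into the per-database loop mentioned above; the loop variable, the creation of the pipes and the way ${cmd} is built per database are only sketched here and may differ from my real script:

    for db_name in "${DB_NAMES[@]}"; do
        mkfifo "${PATH_LOCAL_MNT2}/${db_name}"     # if the pipe doesn't exist yet
        cmd="... per-database dump command ..."    # built individually per database
        (
          trap '' HUP INT
          ${cmd_exec} <<< "${cmd}" > "${PATH_LOCAL_MNT2}/${db_name}"
        ) < '/dev/null' > '/dev/null' 2> '/dev/null' &
    done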

I've looked at the files the backup software created/restored and they look OK. I can see lots of SQL commands and data and stuff. What DOESN'T work is the following:

(
  trap '' HUP INT
  ${cmd_exec} <<< "${cmd}" < '/dev/null' > "${PATH_LOCAL_MNT2}/${db_name}" 2> '/dev/null'
) < '/dev/null' > '/dev/null' 2> '/dev/null' &

I thought I'd give that a try simply to be safe, because I don't need STDIN and STDERR anyway. But with the above I don't get any content in the pipes at all, really only EOF, because the backup software creates empty files in the end. Additionally, I can see that the database dumpers are not even executed on their hosts, whereas with the working solution I can easily see CPU and I/O load on the system. It might have something to do with how I execute SSH through some function or other; I don't care too much anymore.

