
I am working with a cluster of 20 hosts, all running CentOS 7.3.

I am attempting to create an automated test to check that:

  1. Our expected network connectivity is in place
  2. Our SSH Single Sign On (SSO) solution is working

The expected network connectivity is very simple: It's a single, flat subnet. Every host should be able to reach every other host.

Our SSH SSO solution (FreeIPA) uses Kerberos to authenticate users, and it uses SSH public keys to authenticate hosts. A user's Kerberos Ticket Granting Ticket (TGT) is set to forward to any host that user connects to using SSH.

The test is very simple:

Have every host try to use SSH to execute hostname as a remote command on every other host.

To do this, I use a utility named pdsh.

In a nutshell, this utility uses SSH to execute a remote command on a set of hosts. It does so in parallel by spawning a thread for each host; each thread runs ssh against its host with the given command.

My use of this command is as follows. On a machine that is not one of the 20 cluster hosts, I execute this command:

pdsh -g all 'pdsh -g all "hostname"'

-g all specifies that the remote command should be run on all of the cluster hosts. As stated, I have 20 cluster hosts.

The command to be executed on every remote host is:

pdsh -g all "hostname"

So, as stated above, every host tries to execute the command "hostname" on every other host as a remote command via SSH.

So, this results in 20 invocations (one per cluster host) of:

ssh <hostname> 'pdsh -g all "hostname"'

In turn, this results in 20 * 20 = 400 invocations of:

ssh <hostname> hostname

So, I've got a total of 20 + 400 = 420 SSH authentications occurring within a very short period of time.
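The arithmetic above can be reproduced with a few lines of shell. This is only a sketch of the counting, assuming (as the 20 * 20 figure implies) that -g all on each host includes the host itself; the variable names are made up for illustration.

```shell
# Count the SSH authentications triggered by the nested pdsh command.
hosts=20                      # number of cluster hosts in group "all"

outer=$hosts                  # control machine -> each cluster host
inner=$((hosts * hosts))      # each cluster host -> every host in "all"
total=$((outer + inner))

# Inbound connections seen by a single host:
# one from the control machine plus one from each cluster host.
per_host=$((1 + hosts))

echo "outer=$outer inner=$inner total=$total per_host=$per_host"
# prints: outer=20 inner=400 total=420 per_host=21
```

So although 420 authentications happen cluster-wide, any one sshd only has to handle about 21 of them.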

The problem I'm seeing is a small handful of authentication failures. The hosts on which the failures occur are arbitrary; there's no rhyme or reason. A failure looks like this:

host-5: host-3: Permission denied, please try again.
host-5: host-3: Permission denied, please try again.
host-5: host-3: Received disconnect from UNKNOWN: 2: Too many authentication failures for myuser
host-5: pdsh@host-5: host-3: ssh exited with exit code 255

I have the following configured in /etc/ssh/sshd_config to allow for many to-be-authenticated sessions to exist simultaneously:

MaxStartups 500:30:600

Note that this is way overkill: it accounts for the number of authentications going on across the entire cluster, when it really only needs to account for the number occurring on a single host. So I think the problem lies elsewhere.
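To put a number on "overkill", here is a rough sketch of how sshd_config(5) describes the start:rate:full form of MaxStartups: below start nothing is refused, at start connections are refused with probability rate/100, and the probability rises linearly to 100% at full. The function name drop_pct is hypothetical, and the figure of 21 pending connections assumes each host sees at most one connection from the control machine plus one from each of the 20 cluster hosts.

```shell
# drop_pct START RATE FULL PENDING
# Prints the percentage chance that sshd refuses a new connection when
# PENDING unauthenticated connections already exist (integer arithmetic).
drop_pct() {
    start=$1 rate=$2 full=$3 pending=$4
    if [ "$pending" -lt "$start" ]; then
        echo 0                # below "start": never refused
    elif [ "$pending" -ge "$full" ]; then
        echo 100              # at or above "full": always refused
    else
        # linear ramp from RATE% at START up to 100% at FULL
        echo $(( rate + (100 - rate) * (pending - start) / (full - start) ))
    fi
}

drop_pct 500 30 600 21        # prints 0
```

With 500:30:600 and roughly 21 simultaneous unauthenticated connections per host, the refusal probability under this model is zero, which is consistent with MaxStartups not being the cause.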

So, in summary, I've got a large number of SSH user authentications occurring via Kerberos across a cluster of 20 hosts in a very short period of time. Random failures of user authentication are occurring.

Why might such user authentication failures occur?

  • Did you find a solution to this problem?
    – user34930
    Commented Mar 14, 2019 at 17:25
  • Never did, and have moved off of the project.
    – Dave
    Commented Mar 14, 2019 at 17:34

1 Answer


If you're re-opening SSH sessions to the same hosts, I'd recommend using control master connections. Basically, an authenticated session is kept open between client and server, so later sessions reuse it instead of authenticating again.

See also: ssh_config(5) -- ControlMaster
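A minimal sketch of what that could look like from the shell, using the ControlMaster, ControlPath and ControlPersist options from ssh_config(5). The wrapper name mux_ssh, the socket path under ~/.ssh, and the 60-second persist time are illustrative assumptions, not values from the question.

```shell
# mux_ssh: hypothetical wrapper that runs ssh with connection sharing on.
# The first call to a host authenticates and leaves a master connection
# open; later calls from the same client to the same user@host:port reuse it.
mux_ssh() {
    "${SSH:-ssh}" \
        -o ControlMaster=auto \
        -o ControlPath="$HOME/.ssh/cm-%r@%h:%p" \
        -o ControlPersist=60s \
        "$@"
}

# Example (host-3 is the host name from the failure output above):
#   mux_ssh host-3 hostname
```

The same three options can go in a Host block in ~/.ssh/config instead. Note that sharing only helps when the same client reconnects to the same host while the master is still alive. For use under pdsh, its man page describes a PDSH_SSH_ARGS_APPEND environment variable for passing extra ssh options; I haven't tested that with this setup.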
