I work for a small company as a Linux and DB sysadmin. Since this is mostly commodity hardware, I thought I'd try Super User first. We've been having odd, intermittent networking problems, and every time I try to pinpoint the cause, something rules out my theory.
Mostly we see timeouts of over 10 seconds in Nagios, with no obvious pattern. They are intermittent and appear random, though the same machines and internet connections keep showing up. I set up a second copy of Nagios on a development server. Both machines were installed by me and are not heavily used: CentOS minimal installs, with most packages added via yum as needed.
The issues sometimes show up on both Nagios instances and other times only on the main one. Some checks run every 5 minutes and a few every minute, so I'm aware of check overlap, but it doesn't add up.
We've rebooted the machines and the network equipment. We have 4 different internet connections that we check from Nagios, all going out over business FiOS. One problem is the connection from Nagios to our machines in a rack at a real data center: the machines there never report or show any network issues during the timeout; only our Nagios has trouble reaching them. Luckily those machines currently hit our other location, but we'd like them to fail over to the location Nagios is at for DR.
Another location has FiOS and Comcast cable, and Nagios sometimes times out on those too. Lastly, it monitors my own server at my house on plain consumer Comcast cable, which also occasionally times out, even though I can verify connectivity from the server itself during the "outage".
So, starting over from the beginning: I have 2 Nagios instances at location A on FiOS, running the same checks against servers at 3 other locations. Sometimes timeouts show up on both instances, so those appear to be "real" issues; other times only one Nagios starts flapping, which makes no sense since both servers sit right next to each other, physically and on the network. I just rebooted the main Nagios again.
I'm going to wait for the next issue and report what both Nagios instances detect. What should I be looking for to troubleshoot? I've gone through all the logs, set up logging on our router, and checked the network while the issue was happening, and I can't figure out what the problem is.
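In the meantime, I'm thinking of leaving a small watchdog running from cron on both Nagios boxes, so every failure gets a timestamp and a path capture I can line up against the alerts. This is only a sketch: the host list and log path are placeholders, and it assumes plain ping and tracepath are available.

```shell
#!/bin/bash
# Sketch of a once-a-minute cron job: ping each monitored host and, on
# failure, record a timestamp plus the network path at that moment.
HOSTS="127.0.0.1"            # placeholder -- e.g. "datacenterServer hostbw"
LOG=/tmp/net-watch.log

for h in $HOSTS; do
    if ping -c 3 -W 2 "$h" > /dev/null 2>&1; then
        echo "$(date '+%F %T') OK $h" >> "$LOG"
    else
        echo "$(date '+%F %T') FAIL $h" >> "$LOG"
        tracepath "$h" >> "$LOG" 2>&1   # capture where the path dies
    fi
done
```

Comparing these logs between the two boxes should at least show whether a failure is machine-specific or network-wide.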
Thank you for your help!
Latest Update
I caught one of the problems, and it is not related to SSH attacks or any other server-side issue. We had a 5-minute outage during which I could not reach certain locations from the Nagios location. Ping stopped dead, yet I could still ping those locations/servers from our other locations. I ran tracepath during and after the outage (which then recurred sporadically), and found that it stops at
G0-3-4-4.PHLAPA-LCR-22.verizon-gni.net (130.81.180.248)
Any ideas or help? Here is the tracepath during the problem, with our hostnames changed:
[root@nagiosServer ~]# tracepath datacenterServer
1: ourdomain.com (192.168.1.55) 0.076ms pmtu 1500
1: ourRouter (192.168.1.1) 0.297ms
1: ourRouter (192.168.1.1) 0.258ms
2: L300.PHLAPA-VFTTP-164.verizon-gni.net (72.94.203.1) 4.817ms asymm 3
3: G0-3-4-4.PHLAPA-LCR-22.verizon-gni.net (130.81.180.248) 5.696ms
4: no reply
5: no reply
6: no reply
7: no reply
8: no reply
9: no reply
10: no reply
11: no reply
12: no reply
13: no reply
14: no reply
15: no reply
16: no reply
17: no reply
18: no reply
19: no reply
20: no reply
21: no reply
22: no reply
23: no reply
24: no reply
25: no reply
26: no reply
27: no reply
28: no reply
29: no reply
30: no reply
31: no reply
Too many hops: pmtu 1500
Resume: pmtu 1500
[root@nagiosServer ~]# date
Fri Apr 4 12:04:30 EDT 2014
[root@nagiosServer ~]#
And after the problem resolved:
[root@nagiosServer ~]# date
Fri Apr 4 12:04:51 EDT 2014
[root@nagiosServer ~]# tracepath datacenterServer
1: ourdomain.com (192.168.1.55) 0.081ms pmtu 1500
1: ourRouter (192.168.1.1) 0.253ms
1: ourRouter (192.168.1.1) 0.295ms
2: L300.PHLAPA-VFTTP-164.verizon-gni.net (72.94.203.1) 2.631ms asymm 3
3: G0-3-4-4.PHLAPA-LCR-22.verizon-gni.net (130.81.180.248) 6.390ms
4: so-3-1-0-0.PHIL-BB-RTR2.verizon-gni.net (130.81.22.60) 20.953ms asymm 5
5: 0.xe-2-1-0.BR2.IAD8.ALTER.NET (152.63.5.245) 13.855ms asymm 7
6: ae-20.r04.asbnva02.us.bb.gin.ntt.net (129.250.8.33) 13.123ms asymm 5
7: ge-100-0-0-20.r04.asbnva02.us.ce.gin.ntt.net (168.143.97.190) 14.057ms
8: core1-ten-2-1.nwrk1.hostmysite.net (67.59.145.33) 12.873ms asymm 15
9: ae5-dist1.nwk01.hosting.com (67.59.145.89) 12.912ms asymm 15
10: no reply
11: no reply
12: no reply
13: no reply
14: no reply
15: no reply
16: no reply
17: no reply
18: no reply
19: no reply
20: no reply
21: no reply
22: no reply
23: no reply
24: no reply
25: no reply
26: no reply
27: no reply
28: no reply
29: no reply
30: no reply
31: no reply
Too many hops: pmtu 1500
Resume: pmtu 1500
[root@nagiosServer ~]#
Update: The server Nagios1 runs on went up and down (its SSH check failing) for 3 hours, which is a very weird problem since it is Nagios checking the very machine it lives on. I checked the logs and found a cracking attempt in the secure log during that exact window, from an IP in China. Obviously we need to lock down SSH and stop exposing root, but my guess is that since sshd caps concurrent unauthenticated connections at 10 by default, the brute-force flood exceeded that and caused Nagios's own SSH checks to fail sporadically during that time. Hopefully locking down SSH, firewalling, and/or switching to keys will eliminate most of these timeouts.
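The lockdown I'm planning looks roughly like this; a sketch only, with the MaxStartups numbers as suggested values, so verify the directives against your sshd version's man page before applying:

```shell
# /etc/ssh/sshd_config -- hardening sketch
PermitRootLogin no            # stop the root brute-force outright
PasswordAuthentication no     # keys only
MaxStartups 10:30:60          # past 10 unauthenticated connections, drop
                              # 30% of new ones; past 60, drop them all
```

MaxStartups is the cap I mentioned: with the default of 10 concurrent unauthenticated connections, a brute-force flood can crowd out the connection attempt made by the Nagios check itself.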
...
[03-22-2014 11:44:43] HOST FLAPPING ALERT: Nagios1;STARTED; Host appears to have started flapping (23.7% change > 20.0% threshold)
[03-22-2014 11:44:43] HOST ALERT: Nagios1;UP;HARD;1;SSH OK - OpenSSH_5.3 (protocol 2.0)
[03-22-2014 11:44:23] HOST ALERT: Nagios1;DOWN;HARD;2;Server answer:
[03-22-2014 11:43:33] HOST ALERT: Nagios1;DOWN;SOFT;1;Server answer:
[03-22-2014 11:43:23] HOST ALERT: Nagios1;UP;HARD;1;SSH OK - OpenSSH_5.3 (protocol 2.0)
[03-22-2014 11:42:53] HOST ALERT: Nagios1;DOWN;HARD;2;Server answer:
[03-22-2014 11:42:23] HOST ALERT: Nagios1;DOWN;SOFT;1;Server answer:
secure log:
...
Mar 22 11:43:02 Nagios1 sshd[12004]: Failed password for root from 59.63.167.224 port 60767 ssh2
Mar 22 11:43:02 Nagios1 sshd[11941]: Failed password for root from 59.63.167.224 port 58905 ssh2
Mar 22 11:43:02 Nagios1 sshd[11942]: Disconnecting: Too many authentication failures for root
Mar 22 11:43:02 Nagios1 sshd[11941]: PAM 5 more authentication failures; logname= uid=0 euid=0 tty=ssh ruser= rhost=59.63.167.224 user=root
Mar 22 11:43:02 Nagios1 sshd[11941]: PAM service(sshd) ignoring max retries; 6 > 3
Mar 22 11:43:02 Nagios1 sshd[11995]: Failed password for root from 59.63.167.224 port 60545 ssh2
Mar 22 11:43:02 Nagios1 sshd[12009]: Failed password for root from 59.63.167.224 port 60919 ssh2
Mar 22 11:43:02 Nagios1 sshd[11997]: Failed password for root from 59.63.167.224 port 60632 ssh2
Mar 22 11:43:02 Nagios1 sshd[11952]: Failed password for root from 59.63.167.224 port 59362 ssh2
Mar 22 11:43:02 Nagios1 sshd[11960]: Failed password for root from 59.63.167.224 port 59716 ssh2
Mar 22 11:43:03 Nagios1 sshd[11943]: Failed password for root from 59.63.167.224 port 59237 ssh2
Mar 22 11:43:03 Nagios1 sshd[11988]: Failed password for root from 59.63.167.224 port 60277 ssh2
Mar 22 11:43:04 Nagios1 sshd[12004]: Failed password for root from 59.63.167.224 port 60767 ssh2
Mar 22 11:43:04 Nagios1 sshd[12001]: Failed password for root from 59.63.167.224 port 60672 ssh2
Mar 22 11:43:04 Nagios1 sshd[11995]: Failed password for root from 59.63.167.224 port 60545 ssh2
Mar 22 11:43:04 Nagios1 sshd[12009]: Failed password for root from 59.63.167.224 port 60919 ssh2
Mar 22 11:43:04 Nagios1 sshd[11997]: Failed password for root from 59.63.167.224 port 60632 ssh2
Mar 22 11:43:04 Nagios1 sshd[11952]: Failed password for root from 59.63.167.224 port 59362 ssh2
Update: Only one new timeout, but this time to the DB host over the FiOS connection, not the cable connection. Nagios1 detected it but Nagios2 did not; it was brief, so Nagios2 could have missed it.
Nagios1:
[03-20-2014 21:34:53] SERVICE ALERT: hostDBFios;PG BACKENDS;OK;SOFT;2;POSTGRES_BACKENDS OK: DB "postgres" 9 of 100 connections (9%)
[03-20-2014 21:34:03] SERVICE ALERT: hostDBFios;PG BACKENDS;CRITICAL;SOFT;1;CHECK_NRPE: Socket timeout after 10 seconds.
On the DB host this is in /var/log/messages:
Mar 20 21:34:01 hostdb nrpe[28248]: Could not read request from client, bailing out...
Mar 20 21:34:01 hostdb nrpe[28248]: INFO: SSL Socket Shutdown.
I haven't figured out what this could be. I've searched on these errors, but most results describe steady problems, not intermittent ones; perhaps it is something to do with SSL/SSH?
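One thing I may try: check_nrpe's default timeout is 10 seconds, so even a brief stall trips CRITICAL. A command definition with a longer timeout would at least separate short stalls from real outages. This is a sketch; the command name is my own invention, and the macros should be adapted to the existing commands.cfg:

```shell
# commands.cfg sketch -- hypothetical "long timeout" NRPE command
define command{
    command_name    check_nrpe_long
    command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$ -t 30
    }
```

If a check passes at 30 seconds but fails at 10, the problem is latency or packet loss rather than a hard outage.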
UPDATE: New timeout below for Issue 2.
Issue 2
Before Issue 1, posted below, there was an actual outage at [03-17-2014 19:37:50] on one of our internet connections, a business Comcast line. This happens occasionally and was already happening before I started over a year ago; it shows up as an alert from DNS Made Easy (set up by the owner) as well as from Nagios1 and Nagios2. We could write this outage off (we've never gotten an answer from Comcast about it), but at least it is pinpointed to one connection and is a total loss that recovers quickly. It would be nice to troubleshoot this as well, though it is less important. Both Nagios instances report the same for this outage. We have one server on this connection with a number of alerts, plus another server, hostDBCable, that only has a host alert here because it is dual-WAN with FiOS and I monitor the rest of its services over that connection:
Nagios 2:
[03-17-2014 19:45:40] SERVICE ALERT: host81;DISK SPACE;OK;HARD;1;DISK OK - free space: / 410086 MB (94% inode=99%): /boot 81 MB (87% inode=99%): /dev/shm 3994 MB (100% inode=99%):
[03-17-2014 19:45:10] SERVICE ALERT: host81;PGB BACKENDS;OK;HARD;1;POSTGRES_PGBOUNCER_BACKENDS OK: DB "pgbouncer" 1 of 1000 connections (1%)
[03-17-2014 19:45:10] SERVICE ALERT: host81;PGB MAXWAIT;OK;HARD;1;POSTGRES_PGB_POOL_MAXWAIT OK: DB "pgbouncer" pgbouncer=0 * phoneworks=0 * phoneworksNew=0
[03-17-2014 19:44:40] SERVICE ALERT: host81;MEMORY;OK;HARD;1;OK - 89.0% (7284288 kB) free.
[03-17-2014 19:44:30] SERVICE ALERT: host81;MEMORY SWAP;OK;HARD;1;SWAP OK - 100% free (1983 MB out of 1983 MB)
[03-17-2014 19:43:30] SERVICE ALERT: host81;CPU LOAD;OK;HARD;1;OK - load average: 0.14, 0.19, 0.18
[03-17-2014 19:41:50] HOST ALERT: host81;UP;HARD;1;SSH OK - OpenSSH_4.3 (protocol 2.0)
[03-17-2014 19:41:40] SERVICE ALERT: host81;PGB MAX BACKEND;OK;HARD;2;OK - 2 connections
[03-17-2014 19:41:40] SERVICE ALERT: host81;PGB MAXMAXWAIT;OK;HARD;2;OK - queries waiting 0.00 seconds
[03-17-2014 19:41:40] SERVICE ALERT: host81;IVR LONG QUERY;OK;HARD;2;OK - No flag file
[03-17-2014 19:41:40] SERVICE ALERT: host81;ERROR ASTERISK REBOOT;OK;HARD;2;OK - No ERROR found
[03-17-2014 19:41:10] HOST ALERT: hostDBCable;UP;HARD;1;SSH OK - OpenSSH_5.3 (protocol 2.0)
[03-17-2014 19:40:40] SERVICE ALERT: host81;DISK SPACE;CRITICAL;HARD;1;Connection refused or timed out
[03-17-2014 19:40:20] SERVICE ALERT: host81;PGB MAXWAIT;CRITICAL;HARD;1;CHECK_NRPE: Socket timeout after 10 seconds.
[03-17-2014 19:40:10] SERVICE ALERT: host81;PGB BACKENDS;CRITICAL;HARD;1;Connection refused or timed out
[03-17-2014 19:40:10] HOST ALERT: hostDBCable;DOWN;HARD;2;CRITICAL - Socket timeout after 10 seconds
[03-17-2014 19:39:50] SERVICE ALERT: host81;MEMORY;CRITICAL;HARD;1;CHECK_NRPE: Socket timeout after 10 seconds.
[03-17-2014 19:39:30] SERVICE ALERT: host81;MEMORY SWAP;CRITICAL;HARD;1;Connection refused or timed out
[03-17-2014 19:39:00] HOST ALERT: host81;DOWN;HARD;2;CRITICAL - Socket timeout after 10 seconds
[03-17-2014 19:38:50] HOST ALERT: hostDBCable;DOWN;SOFT;1;No route to host
[03-17-2014 19:38:50] SERVICE ALERT: host81;PGB MAXMAXWAIT;CRITICAL;HARD;2;CHECK_NRPE: Socket timeout after 10 seconds.
[03-17-2014 19:38:50] SERVICE ALERT: host81;ERROR ASTERISK REBOOT;CRITICAL;HARD;2;CHECK_NRPE: Socket timeout after 10 seconds.
[03-17-2014 19:38:40] SERVICE ALERT: host81;PGB MAX BACKEND;CRITICAL;HARD;2;Connection refused or timed out
[03-17-2014 19:38:40] SERVICE ALERT: host81;IVR LONG QUERY;CRITICAL;HARD;2;Connection refused or timed out
[03-17-2014 19:38:30] SERVICE ALERT: host81;CPU LOAD;CRITICAL;HARD;1;Connection refused or timed out
[03-17-2014 19:38:00] HOST ALERT: host81;DOWN;SOFT;1;CRITICAL - Socket timeout after 10 seconds
[03-17-2014 19:37:50] SERVICE ALERT: host81;PGB MAXMAXWAIT;CRITICAL;SOFT;1;CHECK_NRPE: Socket timeout after 10 seconds.
[03-17-2014 19:37:50] SERVICE ALERT: host81;PGB MAX BACKEND;CRITICAL;SOFT;1;CHECK_NRPE: Socket timeout after 10 seconds.
[03-17-2014 19:37:50] SERVICE ALERT: host81;IVR LONG QUERY;CRITICAL;SOFT;1;CHECK_NRPE: Socket timeout after 10 seconds.
[03-17-2014 19:37:50] SERVICE ALERT: host81;ERROR ASTERISK REBOOT;CRITICAL;SOFT;1;CHECK_NRPE: Socket timeout after 10 seconds.
Nagios 2:
[03-19-2014 11:18:10] HOST ALERT: hostDBCable;UP;SOFT;2;SSH OK - OpenSSH_5.3 (protocol 2.0)
[03-19-2014 11:17:00] HOST ALERT: hostDBCable;DOWN;SOFT;1;CRITICAL - Socket timeout after 10 seconds
Issue 1
Replying to comments:
It's hard to catch some of these, and when they occur while I'm troubleshooting I can't find a problem. The most recent timeout was yesterday, and it was actually a host-down event, not service timeouts like the previous issues. The host check is check_ssh.
During the problem I was working from the location hostbw is at. I had no internet problems and could reach all locations from hostbw. Nagios1 could not SSH to hostbw; Nagios2 could. It looked like a DNS resolution problem: the Nagios1 server could nslookup hostbw and get the correct IP (it's a DynDNS hostname), but ssh failed, I believe with a "could not resolve hostname" error, while Nagios2 could SSH the whole time. I checked both servers and they are set up the same: nearly identical /etc/hosts and /etc/resolv.conf, both using the router as their DNS server. Any ideas?
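One detail that may matter here: nslookup queries the DNS server directly, but ssh resolves names through the libc resolver, which follows /etc/nsswitch.conf (normally /etc/hosts first, then DNS, plus nscd if it is running). So "nslookup works but ssh can't resolve" can point at a local difference (hosts file, nsswitch order, stale nscd cache) rather than at the router's DNS. A quick way to compare the exact path ssh uses on both machines; in this sketch, localhost stands in for the real DynDNS name:

```shell
#!/bin/bash
# Compare what the libc resolver (the lookup path ssh actually uses)
# returns, as opposed to nslookup, which queries the DNS server directly.
NAME=localhost       # placeholder for the real DynDNS name, e.g. hostbw

echo "== resolver config =="
cat /etc/resolv.conf 2>/dev/null
echo "== libc lookup of $NAME =="
getent hosts "$NAME" || echo "libc resolution FAILED for $NAME"
```

Running this on Nagios1 and Nagios2 at the same moment during the next event should show whether the failure is in local resolution on one box while DNS itself is still answering.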
Nagios1:
[03-17-2014 19:23:50] HOST FLAPPING ALERT: hostbw;STOPPED; Host appears to have stopped flapping (3.9% change < 5.0% threshold)
[03-17-2014 15:39:00] HOST ALERT: hostbw;UP;HARD;1;SSH OK - OpenSSH_5.3 (protocol 2.0)
[03-17-2014 15:36:20] HOST ALERT: hostbw;DOWN;HARD;2;Usage:
[03-17-2014 15:35:10] HOST ALERT: hostbw;DOWN;SOFT;1;Usage:
[03-17-2014 15:29:00] HOST ALERT: hostbw;UP;HARD;1;SSH OK - OpenSSH_5.3 (protocol 2.0)
[03-17-2014 15:19:00] HOST ALERT: hostbw;DOWN;HARD;2;Usage:
[03-17-2014 15:18:30] HOST ALERT: hostbw;DOWN;SOFT;1;Usage:
[03-17-2014 14:59:00] HOST ALERT: hostbw;UP;HARD;1;SSH OK - OpenSSH_5.3 (protocol 2.0)
[03-17-2014 14:57:50] HOST ALERT: hostbw;DOWN;HARD;2;Usage:
[03-17-2014 14:56:40] HOST ALERT: hostbw;DOWN;SOFT;1;Usage:
[03-17-2014 14:54:00] HOST ALERT: hostbw;UP;HARD;1;SSH OK - OpenSSH_5.3 (protocol 2.0)
[03-17-2014 14:51:30] HOST ALERT: hostbw;DOWN;HARD;2;Usage:
[03-17-2014 14:50:20] HOST FLAPPING ALERT: hostbw;STARTED; Host appears to have started flapping (23.9% change > 20.0% threshold)
[03-17-2014 14:50:20] HOST ALERT: hostbw;DOWN;SOFT;1;Usage:
[03-17-2014 14:47:00] HOST ALERT: hostbw;UP;HARD;1;SSH OK - OpenSSH_5.3 (protocol 2.0)
[03-17-2014 14:45:10] HOST ALERT: hostbw;DOWN;HARD;2;Usage:
[03-17-2014 14:44:00] HOST ALERT: hostbw;DOWN;SOFT;1;Usage:
[03-17-2014 14:40:20] HOST ALERT: hostbw;UP;HARD;1;SSH OK - OpenSSH_5.3 (protocol 2.0)
[03-17-2014 11:03:30] HOST ALERT: hostbw;DOWN;HARD;2;Usage:
[03-17-2014 11:02:20] HOST ALERT: hostbw;DOWN;SOFT;1;Usage:
Nagios2:
[03-17-2014 16:36:40] HOST FLAPPING ALERT: hostbw;STOPPED; Host appears to have stopped flapping (4.7% change < 5.0% threshold)
[03-17-2014 15:39:00] HOST FLAPPING ALERT: hostbw;STARTED; Host appears to have started flapping (20.1% change > 20.0% threshold)
[03-17-2014 15:39:00] HOST ALERT: hostbw;UP;HARD;1;SSH OK - OpenSSH_5.3 (protocol 2.0)
[03-17-2014 15:34:00] HOST ALERT: hostbw;DOWN;HARD;2;Usage:
[03-17-2014 15:33:30] HOST ALERT: hostbw;DOWN;SOFT;1;Usage:
[03-17-2014 15:27:40] HOST ALERT: hostbw;UP;HARD;1;SSH OK - OpenSSH_5.3 (protocol 2.0)
[03-17-2014 15:22:40] HOST ALERT: hostbw;DOWN;HARD;2;Usage:
[03-17-2014 15:22:00] HOST ALERT: hostbw;DOWN;SOFT;1;Usage:
[03-17-2014 12:46:00] HOST ALERT: hostbw;UP;HARD;1;SSH OK - OpenSSH_5.3 (protocol 2.0)
[03-17-2014 12:16:00] HOST ALERT: hostbw;DOWN;HARD;2;Usage:
[03-17-2014 12:15:40] HOST ALERT: hostbw;DOWN;SOFT;1;Usage:
[03-17-2014 11:22:00] HOST ALERT: hostbw;UP;HARD;1;SSH OK - OpenSSH_5.3 (protocol 2.0)
[03-17-2014 11:02:40] HOST ALERT: hostbw;DOWN;HARD;2;Usage:
[03-17-2014 11:02:10] HOST ALERT: hostbw;DOWN;SOFT;1;Usage: