I work for a small company as a Linux and DB sysadmin. Since this is mostly commodity hardware, I thought I'd try Super User first. We've been having odd, intermittent networking problems, and every time I try to pinpoint the cause, something rules out my theory.
Mostly we see timeouts of over 10 seconds in Nagios, with no obvious pattern. They are intermittent and appear random, though the same machines and internet connections keep showing up. I set up a second copy of Nagios on a development server. Both machines were installed by me and are not heavily used: CentOS minimal installs, with most packages added via yum as needed.
The issues sometimes show up on both Nagios instances and other times only on the main one. Some checks run every 5 minutes and a few every minute, so I'm aware of check overlap, but it doesn't add up.
We've rebooted the machines and the network equipment. We have 4 different internet connections that we check from Nagios, all going out over business FiOS. One problem is the connection from Nagios to our machines in a rack at a real data center: the machines there never report or show any network issues during the timeout; only our Nagios has trouble reaching them. Luckily those machines currently hit our other location, but we'd like them to fail over to the location Nagios is at for DR.
Another location has FiOS and Comcast cable, and Nagios sometimes times out on those too. Lastly, it monitors my own server at my house on plain consumer Comcast cable, which also occasionally times out, even though I can verify connectivity from the server itself during the "outage".
So, starting over from the beginning: I have 2 Nagios instances at location A on FiOS, running the same checks against servers at 3 other locations. Sometimes timeouts show up on both instances, so those appear to be "real" issues; other times only one Nagios starts flapping, which makes no sense since both servers sit right next to each other, physically and on the network. I just rebooted the main Nagios again.
I'm going to wait for the next issue and report what both Nagios instances detect. What should I be looking for to troubleshoot? I've gone through all the logs, set up logging on our router, and checked the network while the issue was happening, and I can't figure out what the problem is.
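In the meantime, I'm thinking of leaving a small watchdog running from cron on both Nagios boxes, so every failure gets a timestamp and a path capture I can line up against the alerts. This is only a sketch: the host list and log path are placeholders, and it assumes plain ping and tracepath are available.

```shell
#!/bin/bash
# Sketch of a once-a-minute cron job: ping each monitored host and, on
# failure, record a timestamp plus the network path at that moment.
HOSTS="127.0.0.1"            # placeholder -- e.g. "datacenterServer hostbw"
LOG=/tmp/net-watch.log

for h in $HOSTS; do
    if ping -c 3 -W 2 "$h" > /dev/null 2>&1; then
        echo "$(date '+%F %T') OK $h" >> "$LOG"
    else
        echo "$(date '+%F %T') FAIL $h" >> "$LOG"
        tracepath "$h" >> "$LOG" 2>&1   # capture where the path dies
    fi
done
```

Comparing these logs between the two boxes should at least show whether a failure is machine-specific or network-wide.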
Thank you for your help!
Latest Update
I caught one of the problems, and it is not related to SSH attacks or any other server-side issue. We had a 5-minute outage during which I could not reach certain locations from the Nagios location. Ping stopped dead, yet I could still ping those locations/servers from our other locations. I ran tracepath during and after the outage (which then recurred sporadically), and found that it stops at
G0-3-4-4.PHLAPA-LCR-22.verizon-gni.net (130.81.180.248)
Any ideas or help? Here is the tracepath during the problem, with our hostnames changed:
[root@nagiosServer ~]# tracepath datacenterServer
1: ourdomain.com (192.168.1.55) 0.076ms pmtu 1500
1: ourRouter (192.168.1.1) 0.297ms
1: ourRouter (192.168.1.1) 0.258ms
2: L300.PHLAPA-VFTTP-164.verizon-gni.net (72.94.203.1) 4.817ms asymm 3
3: G0-3-4-4.PHLAPA-LCR-22.verizon-gni.net (130.81.180.248) 5.696ms
4: no reply
5: no reply
6: no reply
7: no reply
8: no reply
9: no reply
10: no reply
11: no reply
12: no reply
13: no reply
14: no reply
15: no reply
16: no reply
17: no reply
18: no reply
19: no reply
20: no reply
21: no reply
22: no reply
23: no reply
24: no reply
25: no reply
26: no reply
27: no reply
28: no reply
29: no reply
30: no reply
31: no reply
Too many hops: pmtu 1500
Resume: pmtu 1500
[root@nagiosServer ~]# date
Fri Apr 4 12:04:30 EDT 2014
[root@nagiosServer ~]#
And after the problem resolved:
[root@nagiosServer ~]# date
Fri Apr 4 12:04:51 EDT 2014
[root@nagiosServer ~]# tracepath datacenterServer
1: ourdomain.com (192.168.1.55) 0.081ms pmtu 1500
1: ourRouter (192.168.1.1) 0.253ms
1: ourRouter (192.168.1.1) 0.295ms
2: L300.PHLAPA-VFTTP-164.verizon-gni.net (72.94.203.1) 2.631ms asymm 3
3: G0-3-4-4.PHLAPA-LCR-22.verizon-gni.net (130.81.180.248) 6.390ms
4: so-3-1-0-0.PHIL-BB-RTR2.verizon-gni.net (130.81.22.60) 20.953ms asymm 5
5: 0.xe-2-1-0.BR2.IAD8.ALTER.NET (152.63.5.245) 13.855ms asymm 7
6: ae-20.r04.asbnva02.us.bb.gin.ntt.net (129.250.8.33) 13.123ms asymm 5
7: ge-100-0-0-20.r04.asbnva02.us.ce.gin.ntt.net (168.143.97.190) 14.057ms
8: core1-ten-2-1.nwrk1.hostmysite.net (67.59.145.33) 12.873ms asymm 15
9: ae5-dist1.nwk01.hosting.com (67.59.145.89) 12.912ms asymm 15
10: no reply
11: no reply
12: no reply
13: no reply
14: no reply
15: no reply
16: no reply
17: no reply
18: no reply
19: no reply
20: no reply
21: no reply
22: no reply
23: no reply
24: no reply
25: no reply
26: no reply
27: no reply
28: no reply
29: no reply
30: no reply
31: no reply
Too many hops: pmtu 1500
Resume: pmtu 1500
[root@nagiosServer ~]#
Update: The server Nagios1 runs on went up and down (its SSH check failing) for 3 hours, which is a very weird problem since it is Nagios checking the very machine it lives on. I checked the logs and found a cracking attempt in the secure log during that exact window, from an IP in China. Obviously we need to lock down SSH and stop exposing root, but my guess is that since sshd caps concurrent unauthenticated connections at 10 by default, the brute-force flood exceeded that and caused Nagios's own SSH checks to fail sporadically during that time. Hopefully locking down SSH, firewalling, and/or switching to keys will eliminate most of these timeouts.
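The lockdown I'm planning looks roughly like this; a sketch only, with the MaxStartups numbers as suggested values, so verify the directives against your sshd version's man page before applying:

```shell
# /etc/ssh/sshd_config -- hardening sketch
PermitRootLogin no            # stop the root brute-force outright
PasswordAuthentication no     # keys only
MaxStartups 10:30:60          # past 10 unauthenticated connections, drop
                              # 30% of new ones; past 60, drop them all
```

MaxStartups is the cap I mentioned: with the default of 10 concurrent unauthenticated connections, a brute-force flood can crowd out the connection attempt made by the Nagios check itself.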
...
[03-22-2014 11:44:43] HOST FLAPPING ALERT: Nagios1;STARTED; Host appears to have started flapping (23.7% change > 20.0% threshold)
[03-22-2014 11:44:43] HOST ALERT: Nagios1;UP;HARD;1;SSH OK - OpenSSH_5.3 (protocol 2.0)
[03-22-2014 11:44:23] HOST ALERT: Nagios1;DOWN;HARD;2;Server answer:
[03-22-2014 11:43:33] HOST ALERT: Nagios1;DOWN;SOFT;1;Server answer:
[03-22-2014 11:43:23] HOST ALERT: Nagios1;UP;HARD;1;SSH OK - OpenSSH_5.3 (protocol 2.0)
[03-22-2014 11:42:53] HOST ALERT: Nagios1;DOWN;HARD;2;Server answer:
[03-22-2014 11:42:23] HOST ALERT: Nagios1;DOWN;SOFT;1;Server answer:
secure log:
...
Mar 22 11:43:02 Nagios1 sshd[12004]: Failed password for root from 59.63.167.224 port 60767 ssh2
Mar 22 11:43:02 Nagios1 sshd[11941]: Failed password for root from 59.63.167.224 port 58905 ssh2
Mar 22 11:43:02 Nagios1 sshd[11942]: Disconnecting: Too many authentication failures for root
Mar 22 11:43:02 Nagios1 sshd[11941]: PAM 5 more authentication failures; logname= uid=0 euid=0 tty=ssh ruser= rhost=59.63.167.224 user=root
Mar 22 11:43:02 Nagios1 sshd[11941]: PAM service(sshd) ignoring max retries; 6 > 3
Mar 22 11:43:02 Nagios1 sshd[11995]: Failed password for root from 59.63.167.224 port 60545 ssh2
Mar 22 11:43:02 Nagios1 sshd[12009]: Failed password for root from 59.63.167.224 port 60919 ssh2
Mar 22 11:43:02 Nagios1 sshd[11997]: Failed password for root from 59.63.167.224 port 60632 ssh2
Mar 22 11:43:02 Nagios1 sshd[11952]: Failed password for root from 59.63.167.224 port 59362 ssh2
Mar 22 11:43:02 Nagios1 sshd[11960]: Failed password for root from 59.63.167.224 port 59716 ssh2
Mar 22 11:43:03 Nagios1 sshd[11943]: Failed password for root from 59.63.167.224 port 59237 ssh2
Mar 22 11:43:03 Nagios1 sshd[11988]: Failed password for root from 59.63.167.224 port 60277 ssh2
Mar 22 11:43:04 Nagios1 sshd[12004]: Failed password for root from 59.63.167.224 port 60767 ssh2
Mar 22 11:43:04 Nagios1 sshd[12001]: Failed password for root from 59.63.167.224 port 60672 ssh2
Mar 22 11:43:04 Nagios1 sshd[11995]: Failed password for root from 59.63.167.224 port 60545 ssh2
Mar 22 11:43:04 Nagios1 sshd[12009]: Failed password for root from 59.63.167.224 port 60919 ssh2
Mar 22 11:43:04 Nagios1 sshd[11997]: Failed password for root from 59.63.167.224 port 60632 ssh2
Mar 22 11:43:04 Nagios1 sshd[11952]: Failed password for root from 59.63.167.224 port 59362 ssh2
Update: Only one new timeout, but this time to the DB host over the FiOS connection, not the cable connection. Nagios1 detected it but Nagios2 did not; it was brief, so Nagios2 could have missed it.
Nagios1:
[03-20-2014 21:34:53] SERVICE ALERT: hostDBFios;PG BACKENDS;OK;SOFT;2;POSTGRES_BACKENDS OK: DB "postgres" 9 of 100 connections (9%)
[03-20-2014 21:34:03] SERVICE ALERT: hostDBFios;PG BACKENDS;CRITICAL;SOFT;1;CHECK_NRPE: Socket timeout after 10 seconds.
On the DB host this is in /var/log/messages:
Mar 20 21:34:01 hostdb nrpe[28248]: Could not read request from client, bailing out...
Mar 20 21:34:01 hostdb nrpe[28248]: INFO: SSL Socket Shutdown.
I haven't figured out what this could be. I've searched on these errors, but most results describe steady problems, not intermittent ones; perhaps it is something to do with SSL/SSH?
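One thing I may try: check_nrpe's default timeout is 10 seconds, so even a brief stall trips CRITICAL. A command definition with a longer timeout would at least separate short stalls from real outages. This is a sketch; the command name is my own invention, and the macros should be adapted to the existing commands.cfg:

```shell
# commands.cfg sketch -- hypothetical "long timeout" NRPE command
define command{
    command_name    check_nrpe_long
    command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$ -t 30
    }
```

If a check passes at 30 seconds but fails at 10, the problem is latency or packet loss rather than a hard outage.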
UPDATE: New timeout below for Issue 2.
Issue 2
Before Issue 1, posted below, there was an actual outage at [03-17-2014 19:37:50] on one of our internet connections, a business Comcast line. This happens occasionally and was already happening before I started over a year ago; it shows up as an alert from DNS Made Easy (set up by the owner) as well as from Nagios1 and Nagios2. We could write this outage off (we've never gotten an answer from Comcast about it), but at least it is pinpointed to one connection and is a total loss that recovers quickly. It would be nice to troubleshoot this as well, though it is less important. Both Nagios instances report the same for this outage. We have one server on this connection with a number of alerts, plus another server, hostDBCable, that only has a host alert here because it is dual-WAN with FiOS and I monitor the rest of its services over that connection:
Nagios 2:
[03-17-2014 19:45:40] SERVICE ALERT: host81;DISK SPACE;OK;HARD;1;DISK OK - free space: / 410086 MB (94% inode=99%): /boot 81 MB (87% inode=99%): /dev/shm 3994 MB (100% inode=99%):
[03-17-2014 19:45:10] SERVICE ALERT: host81;PGB BACKENDS;OK;HARD;1;POSTGRES_PGBOUNCER_BACKENDS OK: DB "pgbouncer" 1 of 1000 connections (1%)
[03-17-2014 19:45:10] SERVICE ALERT: host81;PGB MAXWAIT;OK;HARD;1;POSTGRES_PGB_POOL_MAXWAIT OK: DB "pgbouncer" pgbouncer=0 * phoneworks=0 * phoneworksNew=0
[03-17-2014 19:44:40] SERVICE ALERT: host81;MEMORY;OK;HARD;1;OK - 89.0% (7284288 kB) free.
[03-17-2014 19:44:30] SERVICE ALERT: host81;MEMORY SWAP;OK;HARD;1;SWAP OK - 100% free (1983 MB out of 1983 MB)
[03-17-2014 19:43:30] SERVICE ALERT: host81;CPU LOAD;OK;HARD;1;OK - load average: 0.14, 0.19, 0.18
[03-17-2014 19:41:50] HOST ALERT: host81;UP;HARD;1;SSH OK - OpenSSH_4.3 (protocol 2.0)
[03-17-2014 19:41:40] SERVICE ALERT: host81;PGB MAX BACKEND;OK;HARD;2;OK - 2 connections
[03-17-2014 19:41:40] SERVICE ALERT: host81;PGB MAXMAXWAIT;OK;HARD;2;OK - queries waiting 0.00 seconds
[03-17-2014 19:41:40] SERVICE ALERT: host81;IVR LONG QUERY;OK;HARD;2;OK - No flag file
[03-17-2014 19:41:40] SERVICE ALERT: host81;ERROR ASTERISK REBOOT;OK;HARD;2;OK - No ERROR found
[03-17-2014 19:41:10] HOST ALERT: hostDBCable;UP;HARD;1;SSH OK - OpenSSH_5.3 (protocol 2.0)
[03-17-2014 19:40:40] SERVICE ALERT: host81;DISK SPACE;CRITICAL;HARD;1;Connection refused or timed out
[03-17-2014 19:40:20] SERVICE ALERT: host81;PGB MAXWAIT;CRITICAL;HARD;1;CHECK_NRPE: Socket timeout after 10 seconds.
[03-17-2014 19:40:10] SERVICE ALERT: host81;PGB BACKENDS;CRITICAL;HARD;1;Connection refused or timed out
[03-17-2014 19:40:10] HOST ALERT: hostDBCable;DOWN;HARD;2;CRITICAL - Socket timeout after 10 seconds
[03-17-2014 19:39:50] SERVICE ALERT: host81;MEMORY;CRITICAL;HARD;1;CHECK_NRPE: Socket timeout after 10 seconds.
[03-17-2014 19:39:30] SERVICE ALERT: host81;MEMORY SWAP;CRITICAL;HARD;1;Connection refused or timed out
[03-17-2014 19:39:00] HOST ALERT: host81;DOWN;HARD;2;CRITICAL - Socket timeout after 10 seconds
[03-17-2014 19:38:50] HOST ALERT: hostDBCable;DOWN;SOFT;1;No route to host
[03-17-2014 19:38:50] SERVICE ALERT: host81;PGB MAXMAXWAIT;CRITICAL;HARD;2;CHECK_NRPE: Socket timeout after 10 seconds.
[03-17-2014 19:38:50] SERVICE ALERT: host81;ERROR ASTERISK REBOOT;CRITICAL;HARD;2;CHECK_NRPE: Socket timeout after 10 seconds.
[03-17-2014 19:38:40] SERVICE ALERT: host81;PGB MAX BACKEND;CRITICAL;HARD;2;Connection refused or timed out
[03-17-2014 19:38:40] SERVICE ALERT: host81;IVR LONG QUERY;CRITICAL;HARD;2;Connection refused or timed out
[03-17-2014 19:38:30] SERVICE ALERT: host81;CPU LOAD;CRITICAL;HARD;1;Connection refused or timed out
[03-17-2014 19:38:00] HOST ALERT: host81;DOWN;SOFT;1;CRITICAL - Socket timeout after 10 seconds
[03-17-2014 19:37:50] SERVICE ALERT: host81;PGB MAXMAXWAIT;CRITICAL;SOFT;1;CHECK_NRPE: Socket timeout after 10 seconds.
[03-17-2014 19:37:50] SERVICE ALERT: host81;PGB MAX BACKEND;CRITICAL;SOFT;1;CHECK_NRPE: Socket timeout after 10 seconds.
[03-17-2014 19:37:50] SERVICE ALERT: host81;IVR LONG QUERY;CRITICAL;SOFT;1;CHECK_NRPE: Socket timeout after 10 seconds.
[03-17-2014 19:37:50] SERVICE ALERT: host81;ERROR ASTERISK REBOOT;CRITICAL;SOFT;1;CHECK_NRPE: Socket timeout after 10 seconds.
Nagios 2:
[03-19-2014 11:18:10] HOST ALERT: hostDBCable;UP;SOFT;2;SSH OK - OpenSSH_5.3 (protocol 2.0)
[03-19-2014 11:17:00] HOST ALERT: hostDBCable;DOWN;SOFT;1;CRITICAL - Socket timeout after 10 seconds
Issue 1
Replying to comments:
It's hard to catch some of these, and when they occur while I'm troubleshooting I can't find a problem. The most recent timeout was yesterday, and it was actually a host-down event, not service timeouts like the previous issues. The host check is check_ssh.
During the problem I was working from the location hostbw is at. I had no internet problems and could reach all locations from hostbw. Nagios1 could not SSH to hostbw; Nagios2 could. It looked like a DNS resolution problem: the Nagios1 server could nslookup hostbw and get the correct IP (it's a DynDNS hostname), but ssh failed, I believe with a "could not resolve hostname" error, while Nagios2 could SSH the whole time. I checked both servers and they are set up the same: nearly identical /etc/hosts and /etc/resolv.conf, both using the router as their DNS server. Any ideas?
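One detail that may matter here: nslookup queries the DNS server directly, but ssh resolves names through the libc resolver, which follows /etc/nsswitch.conf (normally /etc/hosts first, then DNS, plus nscd if it is running). So "nslookup works but ssh can't resolve" can point at a local difference (hosts file, nsswitch order, stale nscd cache) rather than at the router's DNS. A quick way to compare the exact path ssh uses on both machines; in this sketch, localhost stands in for the real DynDNS name:

```shell
#!/bin/bash
# Compare what the libc resolver (the lookup path ssh actually uses)
# returns, as opposed to nslookup, which queries the DNS server directly.
NAME=localhost       # placeholder for the real DynDNS name, e.g. hostbw

echo "== resolver config =="
cat /etc/resolv.conf 2>/dev/null
echo "== libc lookup of $NAME =="
getent hosts "$NAME" || echo "libc resolution FAILED for $NAME"
```

Running this on Nagios1 and Nagios2 at the same moment during the next event should show whether the failure is in local resolution on one box while DNS itself is still answering.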
Nagios1:
[03-17-2014 19:23:50] HOST FLAPPING ALERT: hostbw;STOPPED; Host appears to have stopped flapping (3.9% change < 5.0% threshold)
[03-17-2014 15:39:00] HOST ALERT: hostbw;UP;HARD;1;SSH OK - OpenSSH_5.3 (protocol 2.0)
[03-17-2014 15:36:20] HOST ALERT: hostbw;DOWN;HARD;2;Usage:
[03-17-2014 15:35:10] HOST ALERT: hostbw;DOWN;SOFT;1;Usage:
[03-17-2014 15:29:00] HOST ALERT: hostbw;UP;HARD;1;SSH OK - OpenSSH_5.3 (protocol 2.0)
[03-17-2014 15:19:00] HOST ALERT: hostbw;DOWN;HARD;2;Usage:
[03-17-2014 15:18:30] HOST ALERT: hostbw;DOWN;SOFT;1;Usage:
[03-17-2014 14:59:00] HOST ALERT: hostbw;UP;HARD;1;SSH OK - OpenSSH_5.3 (protocol 2.0)
[03-17-2014 14:57:50] HOST ALERT: hostbw;DOWN;HARD;2;Usage:
[03-17-2014 14:56:40] HOST ALERT: hostbw;DOWN;SOFT;1;Usage:
[03-17-2014 14:54:00] HOST ALERT: hostbw;UP;HARD;1;SSH OK - OpenSSH_5.3 (protocol 2.0)
[03-17-2014 14:51:30] HOST ALERT: hostbw;DOWN;HARD;2;Usage:
[03-17-2014 14:50:20] HOST FLAPPING ALERT: hostbw;STARTED; Host appears to have started flapping (23.9% change > 20.0% threshold)
[03-17-2014 14:50:20] HOST ALERT: hostbw;DOWN;SOFT;1;Usage:
[03-17-2014 14:47:00] HOST ALERT: hostbw;UP;HARD;1;SSH OK - OpenSSH_5.3 (protocol 2.0)
[03-17-2014 14:45:10] HOST ALERT: hostbw;DOWN;HARD;2;Usage:
[03-17-2014 14:44:00] HOST ALERT: hostbw;DOWN;SOFT;1;Usage:
[03-17-2014 14:40:20] HOST ALERT: hostbw;UP;HARD;1;SSH OK - OpenSSH_5.3 (protocol 2.0)
[03-17-2014 11:03:30] HOST ALERT: hostbw;DOWN;HARD;2;Usage:
[03-17-2014 11:02:20] HOST ALERT: hostbw;DOWN;SOFT;1;Usage:
Nagios2:
[03-17-2014 16:36:40] HOST FLAPPING ALERT: hostbw;STOPPED; Host appears to have stopped flapping (4.7% change < 5.0% threshold)
[03-17-2014 15:39:00] HOST FLAPPING ALERT: hostbw;STARTED; Host appears to have started flapping (20.1% change > 20.0% threshold)
[03-17-2014 15:39:00] HOST ALERT: hostbw;UP;HARD;1;SSH OK - OpenSSH_5.3 (protocol 2.0)
[03-17-2014 15:34:00] HOST ALERT: hostbw;DOWN;HARD;2;Usage:
[03-17-2014 15:33:30] HOST ALERT: hostbw;DOWN;SOFT;1;Usage:
[03-17-2014 15:27:40] HOST ALERT: hostbw;UP;HARD;1;SSH OK - OpenSSH_5.3 (protocol 2.0)
[03-17-2014 15:22:40] HOST ALERT: hostbw;DOWN;HARD;2;Usage:
[03-17-2014 15:22:00] HOST ALERT: hostbw;DOWN;SOFT;1;Usage:
[03-17-2014 12:46:00] HOST ALERT: hostbw;UP;HARD;1;SSH OK - OpenSSH_5.3 (protocol 2.0)
[03-17-2014 12:16:00] HOST ALERT: hostbw;DOWN;HARD;2;Usage:
[03-17-2014 12:15:40] HOST ALERT: hostbw;DOWN;SOFT;1;Usage:
[03-17-2014 11:22:00] HOST ALERT: hostbw;UP;HARD;1;SSH OK - OpenSSH_5.3 (protocol 2.0)
[03-17-2014 11:02:40] HOST ALERT: hostbw;DOWN;HARD;2;Usage:
[03-17-2014 11:02:10] HOST ALERT: hostbw;DOWN;SOFT;1;Usage: