On all of my machines I use sar (from the sysstat package) to read the current network bandwidth: I run sar -n DEV 1 1 and parse the output afterwards. On one machine, however, the command no longer returns after about 1 second as it does on the others; it now takes 20-30 seconds. How do I debug what is happening here?

  • Could you please tell us which Debian version you are using and where the "sar" utility comes from? I have just looked on my Debian Lenny and Debian Jessie boxes, and there seems to be no such utility. There is also no package "sysstats" and no program "sysstats" there. I have never used EC2, though, so if it is some proprietary Amazon utility, I can't help.
    – Binarus
    Commented Feb 22, 2017 at 8:30
  • Oh, my bad: it's named "sysstat" (without the extra "s" I had at the end before my edit just now). I installed it with the usual apt-get install sysstat, so I don't think it came from any special repository. The Debian version is testing (stretch).
    Commented Feb 22, 2017 at 14:46
  • You are right, I have found it now. I had trouble finding it at first because its original name is sar.sysstat (not sar), and sar is linked to sar.sysstat only when you install the sysstat package. In other words, sar is not a file shipped in the package; the link is created during installation of sysstat.
    – Binarus
    Commented Feb 23, 2017 at 8:07

1 Answer

I haven't used sar yet, but I have just read the manual and some articles, and I don't think that you are doing anything wrong or that sar itself is causing the problem. Unfortunately, you haven't told us anything else about the affected machine, so I'll try to give some general guidelines.

  • I have seen cases where a defective disk slowed down just one application or one specific part of an OS to an extreme degree. This can happen if the affected application tries to read the same defective sector(s) again and again, waiting for the timeout every time, or if it tries to write to defective sectors. (Note: for some reason, disks sometimes fail to recognize defective sectors properly or cannot remap them in a timely manner.)

    I have seen this in real life on production machines that were otherwise healthy, on various operating systems. So the first thing I would do is look through the log files for signs of disk I/O errors and timeouts.

    If dmesg, last and friends don't show anything, perhaps run a S.M.A.R.T. test.

  • Of course, another application could be taking up all the CPU time. But I assume that you have already used top and friends, and if that were the case, the other applications (not only sar) would suffer as well. I think you would have noticed such behavior.

  • Possibly there is a problem with the NIC. For example, there could be I/O errors on PCI/PCI-E, meaning that the NIC or the mainboard is damaged or flawed. But in that case, the other networked applications would also slow down dramatically, and again, I think you would have noticed such behavior.
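
The first two checks above can be sketched in Python. This is only a rough illustration, assuming Python 3 on a Unix-like system: `timed_run` and `find_disk_errors` are hypothetical helper names, and the regular expression is an assumed, non-exhaustive guess at what kernel disk-error messages typically look like. Comparing wall-clock time with the child's CPU time helps tell "blocked and waiting on something" apart from "busy computing".

```python
import re
import resource
import subprocess
import time

# Assumed patterns for typical kernel disk-error messages; not exhaustive.
DISK_ERROR_RE = re.compile(
    r"I/O error|blk_update_request|ata\d+.*(?:timeout|failed command)|medium error",
    re.IGNORECASE,
)


def timed_run(argv):
    """Run argv, discard its output, and return (wall_seconds, child_cpu_seconds)."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    subprocess.run(argv, stdout=subprocess.DEVNULL,
                   stderr=subprocess.DEVNULL, check=False)
    wall = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    # User plus system CPU time consumed by the child we just waited for.
    cpu = (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)
    return wall, cpu


def find_disk_errors(lines):
    """Return the log lines that match one of the assumed disk-error patterns."""
    return [line.rstrip("\n") for line in lines if DISK_ERROR_RE.search(line)]
```

For example, `timed_run(["sar", "-n", "DEV", "1", "1"])` returning a wall time of 20-30 seconds with almost no CPU time would suggest that sar is waiting on something (disk, a lock, a lookup) rather than computing. `find_disk_errors` can be fed the lines of `dmesg` output or of a log file; reading the kernel log may require root on some systems.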

You have tagged your question "amazon-ec2", so I don't know whether you can replace the system or parts of it. If the system were mine and I had access to it, I would first clone and replace the disk(s). Could you have Amazon do that? If not, I would make a full backup, discard that system and move to another one (I don't know if and how this works with Amazon, though).