I am working on a SUSE Linux based system that runs 24/7 and has done so for around 3 years, although I cannot say that it has never been rebooted in that time.
On 9 June there was apparently some sort of site shutdown, but I don't know how this shutdown was handled.
Since then there have been countless instantaneous reboots.

The fault has occurred under virtually every conceivable test situation, e.g. user applications running or not, archiving data or not, recording new data or not, running fsck after 20 or more crashes or just simply doing nothing.

  • RAM has been replaced.

  • A cooling fan on the CPU heatsink was replaced as it was quite noisy (although there is a bigger fan blowing across it only about 20 mm away).

  • The power supply has been replaced and the green wire hard grounded to prevent it shutting down.

This appears to have made the reboots a little less frequent.

Apparently fsck was probably run on it, although I did not do this myself (the 250 GB drive has a data partition of some 220 GB), but I don't know whether all partitions were checked. However, it has apparently been running continuously for 2 days now.

Can anyone suggest what sorts of problems can cause Linux to instantly die and reboot?

  • As others have said, it sounds like a hardware issue. But for the future, you should consider having regular scheduled reboots, just as you should have regularly scheduled downtime for patching. As it is now, you don't know if any of the config changes done during the past 3 years may be responsible for the problem. Reducing that time window is worth a lot.
    – Jenny D
    Commented Jun 19, 2013 at 7:58

1 Answer

This sounds like a hardware issue to me. It could be temperature, the PSU or the motherboard.

You could check the logs under /var/log/*, or the output of the dmesg command, for clues.
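If the logs are long, a small script can pull out the lines worth reading first. The following is only a rough sketch: the keyword list and the /var/log/messages path are my assumptions, not something taken from your system, and a truly instantaneous reboot may leave nothing in the logs at all.

```python
import re

# Assumed keyword list: kernel messages that often accompany hardware trouble.
# It is illustrative, not exhaustive; adjust it for your system.
SUSPECT_PATTERNS = [
    r"machine check",
    r"\bmce\b",
    r"thermal",
    r"temperature",
    r"i/o error",
    r"kernel panic",
    r"\boops\b",
]


def find_suspect_lines(lines, patterns=SUSPECT_PATTERNS):
    """Return (line_number, line) pairs matching any pattern, ignoring case."""
    combined = re.compile("|".join(patterns), re.IGNORECASE)
    hits = []
    for number, line in enumerate(lines, start=1):
        if combined.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_file(path):
    """Scan one log file; return an empty list if it cannot be read."""
    try:
        with open(path, errors="replace") as handle:
            return find_suspect_lines(handle)
    except OSError:
        return []


if __name__ == "__main__":
    # Assumed location; reading it may require root.
    for number, line in scan_file("/var/log/messages"):
        print(f"{number}: {line}")
```

The lines just before each reboot are usually the most interesting ones, so it is worth comparing the line numbers of any hits with where the boot messages start again.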

RAM has been replaced. A tired cooling fan on CPU heatsink replaced as quite noisy, although there is a bigger fan blowing across it only about 20 mm away. Power supply replaced and the green wire hard grounded to prevent it shutting down, but this just meant the reboots are a little less delayed.

I'd check with lm-sensors or within "/proc/acpi/thermal_zone" (if applicable to you) for any sign of overheating.
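To keep an eye on the temperature over time, something like the sketch below could be run periodically. It assumes each zone exposes a file named temperature containing a line such as "temperature: 45 C", which is how the old ACPI procfs interface usually looks; newer kernels use /sys/class/thermal instead. The 70 C limit is an arbitrary placeholder, not a recommended value for your CPU.

```python
import glob
import re


def parse_acpi_temperature(text):
    """Extract degrees Celsius from text like 'temperature:   45 C'; None if absent."""
    match = re.search(r"temperature:\s*(-?\d+)\s*C", text)
    return int(match.group(1)) if match else None


def read_thermal_zones(pattern="/proc/acpi/thermal_zone/*/temperature"):
    """Return {path: degrees_c} for every readable zone file matching the pattern."""
    readings = {}
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as handle:
                value = parse_acpi_temperature(handle.read())
        except OSError:
            continue  # Skip zones that cannot be read.
        if value is not None:
            readings[path] = value
    return readings


def overheated(readings, limit_c=70):
    """Return only the zones at or above limit_c (a hypothetical threshold)."""
    return {zone: temp for zone, temp in readings.items() if temp >= limit_c}


if __name__ == "__main__":
    zones = read_thermal_zones()
    for zone, temp in zones.items():
        print(f"{zone}: {temp} C")
    for zone, temp in overheated(zones).items():
        print(f"WARNING: {zone} is at {temp} C")
```

If nothing is printed, the machine probably does not expose that directory, in which case lm-sensors is the better route.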
