I run a server on Debian Squeeze with several OpenVZ containers. Most containers run Squeeze, some run Lenny, and some have already been upgraded to Wheezy. The host does little beyond iptables and DHCP; file servers, proxies, mail servers, Kerberos, LDAP, and so on all live in containers. The system has run stably for many years, and apart from a few firewall rules nothing major has changed in over a year.
Two days ago the system suddenly crashed, and I had a lot of trouble bringing it back up. At first it would not let me log in via ssh: root login was denied with 'You do not exists. Go away!', while local login was fine. Some time later ssh worked again. By coincidence I did not reuse the line from the bash history but typed the command anew. I triple-checked that it was identical to the history line, which had worked before the crash but failed after it.
After that the system ran, but network traffic on most protocols stalled after the SYN-ACK. DNS, Telnet, and SSH were fine; the rest was a mess. After a couple of hours of fishing in the dark and reloading the firewall several times, everything suddenly worked again. I couldn't find anything suspicious in the logs, but I'm not a forensics expert.
Today the nscd on the file server ran out of sockets for contacting LDAP because of the container quota, something that had never happened before. I also saw a lot of sockets (> 30) held by smbd.
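To see whether a container really hit a socket limit, the fail counters in /proc/user_beancounters on the host can be scanned (numtcpsock and numothersock are the socket-related rows). A minimal Python sketch, assuming the usual OpenVZ column layout (uid, resource, held, maxheld, barrier, limit, failcnt); the function name and the sample values are made up for illustration and are not taken from my system:

```python
def failed_counters(text):
    """Return (ctid, resource, held, limit, failcnt) for every row with failcnt > 0."""
    hits = []
    ctid = None
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, the "Version:" line and the column header.
        if not fields or fields[0] in ("Version:", "uid"):
            continue
        # The first row of each container starts with "<ctid>:".
        if fields[0].endswith(":"):
            ctid = int(fields[0][:-1])
            fields = fields[1:]
        if len(fields) != 6:
            continue
        resource = fields[0]
        held, maxheld, barrier, limit, failcnt = map(int, fields[1:])
        if failcnt > 0:
            hits.append((ctid, resource, held, limit, failcnt))
    return hits


# Made-up sample in the assumed layout; real input would be the
# contents of /proc/user_beancounters, read as root on the host.
SAMPLE = """\
Version: 2.5
       uid  resource           held    maxheld    barrier      limit    failcnt
      101:  kmemsize        2670652    2973696   14372700   14790164          0
            numtcpsock          360        360        360        360         17
            numothersock        120        140        360        360          0
"""

for row in failed_counters(SAMPLE):
    print(row)
```

A non-zero failcnt only says that a limit was hit at some point since the container started, not when, so comparing two readings taken some time apart is more telling than a single one.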
/var/log/messages looked much the same as syslog. /var/log/kern.log had this additional information about the crash:
/var/log/kern.log:2950:Sep 19 10:46:57 asgard kernel: [6529441.320086] INFO: task sendmail:32181 blocked for more than 120 seconds.
/var/log/kern.log:2982:Sep 19 10:48:57 asgard kernel: [6529561.324525] INFO: task kdmflush:1932 blocked for more than 120 seconds.
/var/log/kern.log:3005:Sep 19 10:48:57 asgard kernel: [6529561.324694] INFO: task xfssyncd:10162 blocked for more than 120 seconds.
/var/log/kern.log:3027:Sep 19 10:48:57 asgard kernel: [6529561.324934] INFO: task postgres:16827 blocked for more than 120 seconds.
/var/log/kern.log:3060:Sep 19 10:49:51 asgard kernel: [6529561.325129] INFO: task imapd:31749 blocked for more than 120 seconds.
/var/log/kern.log:3084:Sep 19 10:49:51 asgard kernel: [6529561.325248] INFO: task cleanup:32194 blocked for more than 120 seconds.
/var/log/kern.log:3106:Sep 19 10:50:57 asgard kernel: [6529681.324028] INFO: task flush-253:3:3216 blocked for more than 120 seconds.
/var/log/kern.log:3142:Sep 19 10:50:57 asgard kernel: [6529681.324224] INFO: task kjournald:6859 blocked for more than 120 seconds.
/var/log/kern.log:3166:Sep 19 10:50:57 asgard kernel: [6529681.324366] INFO: task syslogd:11720 blocked for more than 120 seconds.
/var/log/kern.log:3198:Sep 19 10:50:57 asgard kernel: [6529681.324574] INFO: task postgres:16827 blocked for more than 120 seconds.
/var/log/kern.log:7152:Sep 19 19:29:41 asgard kernel: [ 1440.617090] INFO: task sendmail:11892 blocked for more than 120 seconds.
The final 'sendmail' event occurred after rebooting the machine; no such events have occurred since. 'imapd' and 'postgres' definitely run in different containers.
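For going through a longer kern.log, the hung-task lines above can be tallied per task and split at reboots. A rough Python sketch, assuming the line format shown above; the bracketed number is the kernel timestamp in seconds since boot, so a drop in it marks a reboot. The function names are my own, and the sample is three of the lines quoted above:

```python
import re
from collections import Counter

# Matches e.g. "[6529441.320086] INFO: task sendmail:32181 blocked for more than 120 seconds."
HUNG = re.compile(
    r"\[\s*(\d+\.\d+)\] INFO: task (.+):(\d+) blocked for more than (\d+) seconds"
)


def blocked_tasks(lines):
    """Return (seconds_since_boot, task_name, pid) for each hung-task line."""
    events = []
    for line in lines:
        m = HUNG.search(line)
        if m:
            events.append((float(m.group(1)), m.group(2), int(m.group(3))))
    return events


def summarize(events):
    """Count hung-task reports per task name."""
    return Counter(name for _, name, _ in events)


def split_by_boot(events):
    """Start a new group whenever the kernel timestamp goes backwards (a reboot)."""
    boots, last = [], None
    for ev in events:
        if last is None or ev[0] < last:
            boots.append([])
        boots[-1].append(ev)
        last = ev[0]
    return boots


SAMPLE_LINES = [
    "Sep 19 10:46:57 asgard kernel: [6529441.320086] INFO: task sendmail:32181 blocked for more than 120 seconds.",
    "Sep 19 10:50:57 asgard kernel: [6529681.324028] INFO: task flush-253:3:3216 blocked for more than 120 seconds.",
    "Sep 19 19:29:41 asgard kernel: [ 1440.617090] INFO: task sendmail:11892 blocked for more than 120 seconds.",
]

events = blocked_tasks(SAMPLE_LINES)
print(summarize(events))
print(len(split_by_boot(events)), "boot(s) seen")
```

The greedy task-name group is deliberate: names such as flush-253:3 contain a colon themselves, and only the last colon separates the name from the PID.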
Well, I don't see any smoking gun, but I'm probably just blind. Rebuilding the system from known (or presumed) good backups would cost me too much to attempt without a very good reason.
I'd appreciate any advice on what to check next.
Thanks for your help.
Update: Putting more effort into searching for a precursor of the crash, I found the following in syslog:
Sep 19 10:09:56 asgard ntop[7965]: **WARNING** packet truncated (8754->8232)
Sep 19 10:09:56 asgard ntop[7965]: **WARNING** packet truncated (8754->8232)
Sep 19 10:09:56 asgard ntop[7965]: **WARNING** packet truncated (10490->8232)
Sep 19 10:09:56 asgard ntop[7965]: **WARNING** packet truncated (8754->8232)
Sep 19 10:09:56 asgard ntop[7965]: **WARNING** packet truncated (8754->8232)
Sep 19 10:09:56 asgard ntop[7965]: **WARNING** packet truncated (17442->8232)
Sep 19 10:11:02 asgard ntop[7965]: **WARNING** packet truncated (11650->8232)
Sep 19 10:11:02 asgard ntop[7965]: **WARNING** packet truncated (10202->8232)
Sep 19 10:11:29 asgard ntop[7965]: **WARNING** packet truncated (8754->8232)
Sep 19 10:13:27 asgard ntop[7965]: **WARNING** packet truncated (8754->8232)
Sep 19 10:20:33 asgard ntop[7965]: **WARNING** packet truncated (8754->8232)
I know this is considered uncritical, but it seems to be a rare event: packet truncation shows up only on the day of the second crash and nowhere else in the available log files.
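The "nowhere else" check can be repeated with a few lines of Python that group the ntop warnings by day. This is only a sketch, assuming the syslog line format shown above; the function name is my own, and the sample is two of the lines quoted above:

```python
import re
from collections import defaultdict

# Matches e.g. "Sep 19 10:09:56 asgard ntop[7965]: **WARNING** packet truncated (8754->8232)"
TRUNC = re.compile(
    r"^(\w{3})\s+(\d+) \S+ \S+ ntop\[\d+\]: "
    r"\*\*WARNING\*\* packet truncated \((\d+)->(\d+)\)"
)


def truncations_per_day(lines):
    """Map 'Mon D' to a list of (original_size, truncated_size) pairs."""
    per_day = defaultdict(list)
    for line in lines:
        m = TRUNC.match(line)
        if m:
            month, day, original, kept = m.groups()
            per_day[f"{month} {int(day)}"].append((int(original), int(kept)))
    return dict(per_day)


SAMPLE_LINES = [
    "Sep 19 10:09:56 asgard ntop[7965]: **WARNING** packet truncated (10490->8232)",
    "Sep 19 10:11:02 asgard ntop[7965]: **WARNING** packet truncated (11650->8232)",
]

for day, sizes in truncations_per_day(SAMPLE_LINES).items():
    print(day, len(sizes), "warnings, largest packet", max(s[0] for s in sizes))
```

Syslog timestamps carry no year, so the keys are only unique within one year of logs; rotated, compressed files would have to be opened with gzip.open instead of open.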