Server instability issues, another update Wed, May 18. 2016
I've checked for bad blocks on all disks that are part of the RAID array connected to my SAS controller: none found, which in itself is a good thing, but it still doesn't explain the server freezes. As it's a RAID 5 array, I could only take out one disk at a time: check it for bad blocks, re-add it to the array, and wait for it to re-sync with the array before I could move on to the next disk. That all takes quite some time.
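For reference, the cycle per disk looks roughly like this. It's only a sketch, assuming a Linux md software RAID array at /dev/md0, a member disk /dev/sdb and the usual badblocks tool (device names are placeholders; with a hardware RAID controller you'd go through the controller's own tools instead):

    # Mark the disk as failed and take it out of the array (as root)
    mdadm /dev/md0 --fail /dev/sdb --remove /dev/sdb

    # Read-only bad block scan; -s shows progress, -v reports any bad blocks found
    badblocks -sv /dev/sdb

    # Put the disk back and wait for the re-sync to finish
    mdadm /dev/md0 --add /dev/sdb
    watch cat /proc/mdstat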
It's also possible to check for bad blocks with a read/write test instead of a read-only one. That test takes a whole lot longer to complete, but should I have done it anyway? What's the advantage over the read-only test? Does it find bad blocks that would otherwise go unnoticed?
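For completeness, assuming the usual badblocks tool, the difference is just the mode it runs in (the device name is again a placeholder):

    badblocks -sv  /dev/sdb    # read-only test
    badblocks -nsv /dev/sdb    # non-destructive read/write test: writes patterns, then restores the original data
    badblocks -wsv /dev/sdb    # destructive write test: wipes the disk, only an option for an empty disk!

From what I understand, the read/write modes also exercise each sector's ability to hold new data, which a pure read test can't check, at the cost of being far slower.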
I've been running with the Samba daemon stopped for a few weeks now. No freezes anymore, but making do without Samba is a real pain, as it means there's no NAS on the LAN anymore. It's also why checking for bad blocks took over two weeks to complete: in the meantime I was copying a folder of more than 2 TiB over SFTP to my desktop PC, which made the re-syncs of the RAID array very, very slow. Samba does seem to be the culprit, but I still refuse to believe the freezes are caused by software rather than hardware, so in my opinion there are only two possibilities left: the SAS controller or the CPU, with the SAS controller being the most likely candidate. Sigh, decisions, decisions…
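As an aside, "running with the Samba daemon stopped" is nothing more exotic than this; a sketch for a systemd-based system (the service names differ per distribution):

    # Stop the Samba daemons and keep them from starting again at boot
    systemctl stop smbd nmbd
    systemctl disable smbd nmbd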
And now for some good news... Sun, May 1. 2016
I've switched to using Let's Encrypt for my certificates and am now also using SNI, so I don't have to use several ports for the different websites over SSL (sorry, Internet Exploder on Windows XP users...)! It looks like it's working just fine.
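A quick way to verify that SNI hands out the right certificate per site is to ask for each name explicitly; a sketch with hypothetical hostnames:

    # Request the certificate for a specific virtual host name and print its subject
    openssl s_client -connect myserver.example.net:443 -servername www.example.net </dev/null 2>/dev/null | openssl x509 -noout -subject
    openssl s_client -connect myserver.example.net:443 -servername blog.example.net </dev/null 2>/dev/null | openssl x509 -noout -subject

Each invocation should print the subject of a different certificate, all served from the same IP address and port.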
Thanks to StartSSL for their free certificates over the years, but Let's Encrypt issues free certificates too and takes the pain out of renewing, which is why I've switched.
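The "takes the pain out of renewing" part boils down to one scheduled command; a sketch assuming the certbot client and a webroot setup (paths and domain are placeholders, and other ACME clients work just as well):

    # Issue a certificate for one site (run once per site)
    certbot certonly --webroot -w /var/www/example -d www.example.net

    # Renew anything close to expiry; run this daily from cron or a systemd timer
    certbot renew --quiet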
Server instability issues, update Sun, May 1. 2016
I've run some CPU stress tests for a few hours, but no errors or freezes...
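For the curious, by "CPU stress tests" I mean something along these lines; a sketch using stress-ng (the run time and the choice of tool are arbitrary, any CPU burner will do):

    # Load all CPU cores for four hours and print a short summary afterwards
    stress-ng --cpu 0 --timeout 4h --metrics-brief

    # Meanwhile, follow the kernel log for MCE, thermal or other hardware warnings
    dmesg -w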
Stopping Samba didn't work, as yesterday evening my server froze again, but at least I didn't see any kernel errors about stuck processes. I've been talking about the COM port stopping, but I have to rephrase that: the UPS software reported that it had lost communications with the UPS. Maybe just stopping and starting the UPS software would solve that; I'll have to find out when the issue pops up again. Anyway, it's still weird of course, because the UPS software never reported losing communications in the year or so this server has been running.
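If it does happen again, the plan is simply to bounce the monitoring daemon and see whether communication comes back; a sketch assuming apcupsd (whatever UPS software is actually in use will have its own equivalent):

    # Restart the UPS monitoring daemon and check that it talks to the UPS again
    systemctl restart apcupsd
    apcaccess status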
Maybe the SAS controller makes my server hang by slurping up all the I/O or something?
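One way to gather evidence for that theory is to watch the I/O load and the kernel log around the time of a freeze; a sketch assuming the sysstat package is installed:

    # Per-device utilisation and wait times, refreshed every 5 seconds
    iostat -x 5

    # Any complaints from the SAS driver, the SCSI layer or the block layer?
    dmesg | grep -iE 'sas|scsi|reset|timeout'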
I really have to think about how to proceed further. Until the next update!