[Oct 28 2017] - Unexpected Server Outage - [restored]

Red Squirrel

Post by Red Squirrel »

I got called in to work but as it happened the power went out. For some reason the shard is now offline and I can't connect to the network.

Unfortunately I won't be able to get to it till after 7pm today.

I will keep watching on the off chance it comes back up on its own, since last time it was the ONT that was acting up. But I have a feeling a server took a power hit from the initial brownout despite there being a UPS. The UPS is not dual conversion. I do have plans to upgrade to one eventually.


UPDATE 13:40ET: I went home real quick just to get an idea of the damage and whether it's something that just needs rebooting. At a quick glance it looks like we lost the VM server that hosts pretty much everything. Also, for some reason, despite having a backup DNS on a physical server, DNS is not working either.

Good news is the file server is ok. There may still be OS-level corruption from the VMs having been improperly shut down, but worst case scenario I need to re-image each one. There should be no actual loss of data. If there is, there are backups.

Shard will remain down for the rest of the day till I get home and can figure out what is going on in more detail.


UPDATE 20:29ET: I am now home and investigating this issue. I am hoping it's nothing bad, as this affects all my own stuff too, not just the shard. Pretty much dead in the water here.


UPDATE 21:01ET: This is going to require more coffee and possibly an all-nighter. I don't even have heat because I can't access the environmental control server. This is very bad, and I'm still trying to wrap my head around this outage. I just hope there's no corruption once I get to the point where I can see the VMs. Right now the storage is not linking properly and there are all sorts of weird DNS issues. I set up a backup DNS server a while back, but apparently everything is still trying to connect to the main one, which is a VM. I think I should just make the physical box the primary, though part of the issue is that even that box is acting weird.
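For anyone chasing a similar "backup DNS exists but nothing uses it" problem, here is a minimal sketch of asking each resolver directly to see which one actually answers. It is only an illustration, not what was run during this outage. The IP addresses and hostname in the usage note are made-up placeholders, and the function names are hypothetical.

```python
import socket
import struct


def build_query(hostname, query_id=0x1234):
    """Build a minimal DNS query packet for an A record."""
    # Header: id, flags (recursion desired), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Name is encoded as length-prefixed labels ending with a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A record), QCLASS=1 (IN).
    return header + qname + struct.pack("!HH", 1, 1)


def server_answers(server_ip, hostname, timeout=2.0):
    """Return True if the server sends back any DNS response to our query."""
    query = build_query(hostname)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server_ip, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # covers timeouts and unreachable hosts
            return False
    # A valid reply echoes our query id and has the response bit set.
    return len(reply) >= 12 and reply[:2] == query[:2] and bool(reply[2] & 0x80)


def first_working(servers, hostname, probe=server_answers):
    """Return the first server in the list that answers, or None."""
    for ip in servers:
        if probe(ip, hostname):
            return ip
    return None
```

Usage would look something like `first_working(["192.0.2.10", "192.0.2.11"], "example.com")`, with the physical box listed first if it is meant to be the primary. This only tells you whether a server responds at all, not whether clients are configured to fall back to it.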


UPDATE 22:17ET: I may have some progress. Was able to get the storage subsystem back online. I am powering up my VMs one by one to make sure it's stable then will get to the shard VM shortly.
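The "one by one" approach can be sketched roughly as below: start a VM, watch it for a while, and only move on if it stays healthy. This is an illustration only. `start_vm` and `is_healthy` are hypothetical callbacks you would wire up to whatever your hypervisor provides, and the timings are arbitrary.

```python
import time


def power_up_in_order(vm_names, start_vm, is_healthy,
                      settle_seconds=30, checks=3, sleep=time.sleep):
    """Start VMs one at a time, stopping at the first one that looks unstable.

    Returns (started, failed): the list of VMs that came up cleanly and the
    name of the VM that failed a health check (None if all came up).
    """
    started = []
    for name in vm_names:
        start_vm(name)
        # Re-check a few times so a VM that boots and then falls over is caught.
        for _ in range(checks):
            sleep(settle_seconds)
            if not is_healthy(name):
                return started, name
        started.append(name)
    return started, None
```

Putting infrastructure VMs such as DNS ahead of the ones that depend on them in `vm_names` keeps a failure early in the list from being masked by later ones.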


UPDATE 22:33ET: Looks like the database is corrupted. I will work on restoring it. If this is successful and there are no other issues, the shard should be back up very shortly. TC1 was corrupted too; oddly, dev was ok. I fear I have a lot more data corruption to deal with, though, and I'll probably find it over the next few weeks, months, even years as I need it.


FINAL UPDATE 23:36ET: Restored a backup from Sat Oct 28 06:00:08 EDT 2017 and tested server ok.

I still have a lot of other stuff to restore, such as TC1 and my own personal stuff. I lost my VPN server completely (the VM is destroyed), and other stuff like that. But as far as the shard goes, it is back online and should be stable. I need to start looking into upgrading my UPS to 48V dual conversion. I was putting that off, but this is the second time something like this has happened, so I will have to get on it. A simple power bump should not be this disastrous. I'm also looking into simply adding a large capacitor bank that hooks into each server's PSU, but that is more invasive and complex, as I need to implement inrush current limiting etc.
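To give a feel for the capacitor bank idea, here is a back-of-the-envelope sketch of how long a bank could carry a load and what series resistance would cap the charging inrush. It assumes the bank sits on a DC rail and uses the standard stored-energy formula E = 1/2 * C * (V_start^2 - V_min^2). All numbers in the usage note are made up for illustration, not measurements from this setup.

```python
def holdup_seconds(capacitance_f, v_start, v_min, load_watts, efficiency=1.0):
    """Time a capacitor bank can feed a constant-power load.

    Only the energy between the starting voltage and the minimum voltage the
    load tolerates is usable: E = 0.5 * C * (v_start^2 - v_min^2).
    """
    usable_joules = 0.5 * capacitance_f * (v_start ** 2 - v_min ** 2)
    return usable_joules * efficiency / load_watts


def inrush_resistor_ohms(v_supply, max_inrush_amps):
    """Series resistance that limits the peak charging current.

    An empty capacitor looks like a short at switch-on, so the peak current
    is roughly v_supply / R.
    """
    return v_supply / max_inrush_amps
```

For example, `holdup_seconds(1.0, 12.0, 10.0, 22.0)` works out to 1 second: a 1 F bank dropping from 12 V to 10 V releases 22 J, which feeds a 22 W load for one second. A real server draws far more than that, which is why the bank would need to be large, and a fixed resistor is only the simplest form of inrush limiting.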

Archived topic from AOV, old topic ID:6730, old post ID:39167
Honk if you love Jesus, text if you want to meet Him!