I think it is safe to say that server outages are the most common real world disaster. Just thinking about it, I can name a number of recent server issues that fall into this category, for example:
- Server with a single power supply had that power supply fail after a storm. Server completely unavailable for 4 working days plus a weekend while the part was shipped and installed.
- Main board replacement in a server caused an issue with a Hyper-V server and the permissions that were used to power on guest VMs. Resolution was to rebuild the host operating system and recreate the VMs using the old disks. This was a large company that has branch offices across the eastern side of Australia. Time to resolution 6 hours.
- Issue with a cloned Hyper-V server where the hypervisor wouldn’t auto start. Outage was 1 business day and a weekend.
These types of issues occur all the time, mostly under perfectly ordinary circumstances. Parts that carry electricity and parts that move do fail from time to time, and the risk grows significantly as servers reach the end of their warranties and age past the 4 year mark.
So let’s get into it. Real world costs.
Server outage – 1 hour
You come in Monday morning and, after a power outage over the weekend, your server has shut itself down. You call the IT company, who come out, reboot the server and check everything is working. Total outage 1 hour. The costs for this issue are fairly small.
- 15 Staff x $25 per hour x 1 Hour = $375
- IT Company costs = $200 (1 x call out fee + 1 Hour @ $150 per hour)
- I will say that sales were not affected by this outage as it was caught early and the sales guys caught up within a few hours.
- Total is about $575.00 for the 1 hour outage ($375 staff + $200 IT).
Now this hasn’t taken into account how long the server was actually down (has email started to bounce back due to timeouts?) or whether anything needed to happen while the server was down (e.g. a backup). Also, if this happened in the middle of the day the numbers would be a little different, as many people don’t save their work, so additional work may be lost.
Server outage – 6 hours
So you had a fault with a part, the server needs to be taken offline for the repair, and afterwards some of the key services don’t restart. The IT company comes onsite and the issue takes 6 hours to fix. Now the costs start to add up a little quicker.
- 15 Staff x $25 per hour x 8 Hours = $3000 (I have said that the outage lasts all day as many staff will be severely limited for the balance of the day as they catch up on their work.)
- IT Company costs = $950 (1 x call out fee + 6 Hours @ $150 per hour)
- Sales reduction of 70% for that day = ($10,000 x 70% = $7,000 loss)
- Total is about $10,950 for the 6 hour server outage.
Now yes, while the server is down staff may do other tasks such as filing. Realistically though, this would probably only last an hour or two before they catch up, with the balance of the time spent doing little productive work.
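To make the arithmetic for the 6 hour outage easy to rerun with your own numbers, here is a rough sketch in Python. The function name and parameters are my own for illustration. The $50 call out fee is an assumption, inferred from the $200 charge for a call out plus 1 hour at $150, and the $10,000 daily sales figure is the one used in the sales line above.

```python
def outage_cost(staff, hourly_rate, lost_hours,
                it_hours, it_rate, call_out_fee,
                daily_sales, sales_reduction_pct):
    """Rough cost of a single-day outage, split into its three parts."""
    staff_cost = staff * hourly_rate * lost_hours
    it_cost = call_out_fee + it_hours * it_rate
    lost_sales = daily_sales * sales_reduction_pct / 100
    return {
        "staff": staff_cost,
        "it": it_cost,
        "sales": lost_sales,
        "total": staff_cost + it_cost + lost_sales,
    }


# The 6 hour outage: staff lose the whole 8 hour day, IT bills 6 hours.
six_hour = outage_cost(
    staff=15, hourly_rate=25, lost_hours=8,
    it_hours=6, it_rate=150, call_out_fee=50,  # call out fee is assumed
    daily_sales=10_000, sales_reduction_pct=70,
)
print(six_hour)
```

Swap in your own headcount, hourly rate and daily sales to see where your business would land.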
Now let’s get into the serious end of town.
Server outage – 3 Days
Your server has thrown two hard drives in a RAID 5 set, taking the server and its data with them. Day 1, the issue is identified and options considered, and parts are ordered which take 48 hours to arrive. Day 3, the parts arrive and are installed, the server is now able to boot, and server recovery starts and runs into the night. Day 4 you are back up and running with only minor issues remaining. Now we are talking serious money.
- 15 Staff x $25 per hour x 3.5 Days @ 8 hours per day = $10,500 (Day 4 is severely restricted as the minor issues are resolved and tasks are caught up, in real life this could be well over 8 hours to fully catch up)
- IT Company Costs = $6600 (Day 1, 4 hours + Call out fee – Day 3, 4 hours + 2 after hours rate + Call out fee – Day 4, 6 hours + Call out fee)
- Sales Reduction of 100% for 3 days + 30% reduction for 4th day = $33,000 in lost revenue.
- Total cost is about $50,100 for a 3 day outage.
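For a multi-day outage the sales loss changes from day to day, so it helps to work it out per day. The sketch below is only an illustration: `lost_sales` is a name I have made up, the $10,000 daily sales figure is carried over from the 6 hour example, and the $6,600 IT bill is taken as given rather than rebuilt from hourly rates.

```python
def lost_sales(daily_sales, reduction_pct_by_day):
    """Sum the sales lost on each day, given a percentage reduction per day."""
    return sum(daily_sales * pct / 100 for pct in reduction_pct_by_day)


# 3 days with no sales at all, then a 4th day at 30% down.
sales_loss = lost_sales(10_000, [100, 100, 100, 30])

# 15 staff at $25 per hour for 3.5 days of 8 hours each.
staff_cost = 15 * 25 * 3.5 * 8

# IT bill taken directly from the list above.
it_cost = 6600

total = staff_cost + it_cost + sales_loss
print(f"Staff ${staff_cost:,.0f} + IT ${it_cost:,.0f} "
      f"+ sales ${sales_loss:,.0f} = ${total:,.0f}")
```

Changing the list of daily percentages is a quick way to see what a slower or faster recovery would cost.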
Now a 3 day outage for a server is a fairly rare event, but outages that span days do happen and you can see how quickly the costs add up. If your backups are bad, the costs can climb even faster. So what about cloud? How does cloud help in this situation?
Server Outage – The cloud component
This really comes down to what services are placed in the cloud. Even when people move to cloud solutions such as hosted email, they still maintain a server onsite for services such as file services, printing and network control (DNS, DHCP, Active Directory), so the numbers above would not be moved much by some cloud services. The other thing that needs to be taken into account for cloud services is key applications (or Tier 1 apps as I like to call them).
Tier 1 applications are applications that are critical to the business in earning revenue. So you may have access to files, email and phones via cloud services, but if the key application that manages your sales is out of action then the costs are going to be fairly similar. So in the scenarios above I have assumed that most services are hosted onsite on the server.
So what can be done?
There is a whole lot of bad news up there, so what can be done to reduce the risk? Some of the options depend on economies of scale and may be too expensive for smaller companies, but each needs to be considered.
Replace old hardware.
The best thing I can suggest to reduce the risk of hardware failing is to keep hardware up to date. This sounds expensive, as a server can cost anywhere from $20,000 to $50,000 every 4 years, but compared with the numbers above that is no more than the cost of a single 3 day outage. You can also put this into perspective as a per employee per month cost to help better identify costs:
- $40,000 server / 15 staff = $2667 per staff member over 4 years
- $2667 / 4 years = $667 per staff member per year
- $667 / 12 months = $56 per month per staff member.
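The same breakdown can be sketched as a small Python function so you can try other server prices. The function name is hypothetical, and the three prices are simply the low end, the worked example and the high end of the range mentioned above.

```python
def cost_per_staff_per_month(server_cost, staff, years):
    """Spread a server purchase across staff and its working life in months."""
    return server_cost / staff / years / 12


monthly = cost_per_staff_per_month(40_000, 15, 4)
print(f"$40,000 server: about ${monthly:.0f} per staff member per month")

# The low and high ends of the price range, for comparison.
for price in (20_000, 50_000):
    per_month = cost_per_staff_per_month(price, 15, 4)
    print(f"${price:,} server: about ${per_month:.0f} per staff member per month")
```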
Plus this will lead into the next idea to reduce the risk of this issue.
Recycle old hardware.
Old hardware, once replaced, doesn’t usually stop working. Many sites that regularly upgrade their primary hardware recycle the old hardware into a backup or disaster recovery solution. In the event of an issue that may last a few days, you could restore backups into virtual machines on the recycled hardware and run the servers at reduced speed and capacity. While this may be an inconvenience, it allows revenue to continue to flow, even if at a reduced rate. In this situation IT costs may increase a little as restores are performed and data is later migrated back, but the increase in productivity from 0% to even 50% should offset such a cost.
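To put a rough number on that trade off, here is a small sketch comparing a 3 day outage with nothing running against the same 3 days at 50% capacity on recycled hardware. It is a simplification under stated assumptions: sales are taken to scale directly with capacity, the $10,000 daily sales figure is reused from the earlier examples, and `revenue_lost` is a name invented for this illustration.

```python
def revenue_lost(daily_sales, days, capacity_pct):
    """Revenue lost over an outage while running at capacity_pct of normal.

    Assumes sales scale linearly with capacity, which is a simplification.
    """
    return daily_sales * days * (100 - capacity_pct) / 100


full_outage = revenue_lost(10_000, 3, 0)   # nothing running for 3 days
degraded = revenue_lost(10_000, 3, 50)     # limping along on old hardware
revenue_recovered = full_outage - degraded

print(f"Lost with no server:      ${full_outage:,.0f}")
print(f"Lost at 50% capacity:     ${degraded:,.0f}")
print(f"Revenue kept by recycling: ${revenue_recovered:,.0f}")
```

Any extra IT hours spent on restores and migration would need to be taken off the recovered figure before calling it a saving.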
Go Virtual with High Availability
If your servers are not already virtual then it is well worth looking into. Some workloads are better suited to bare metal, but most can be virtualised with little impact on performance.
One of the features of going virtual is the ability to run Highly available systems. This means that machines (and in some cases services) are monitored by a network ‘controller’ and if a machine goes missing it is automatically restarted on a server that is working correctly. This takes outages due to hardware issues down to under 10 minutes.
Virtual machines are also usually fairly hardware independent so if the machines were intact they could be transferred to recycled or loan hardware and your site brought back online quicker if High Availability is unavailable.
Move some services to the cloud.
Depending upon the requirements of the application, the costs and so on, some services can be moved into the cloud to reduce the risk of downtime. This isn’t suitable for every service, and the pros and cons need to be considered for each possibility. I know that cloud is the current buzzword, but there are just some services that DO NOT belong in the cloud.