When will my hard drive fail?

…or, why should I get RAID?

Like I said before, if you’re getting a dedicated server for web hosting, you should get RAID. Just get it. But if you really want to know why, keep reading.

How likely is my hard drive to fail?

There are a lot of factors that contribute to hard drive failures.

Usage

If you run your hard drive 24×7 with applications that read from and write to it heavily, it wears out faster than a drive that is used infrequently.

Environment

If your hard drive sits inside a hot computer case or in a hot room, it’s more likely to fail.

Expected lifetime: MTBF and Service Life

The mean time between failures (MTBF) for modern hard drives is 50 years or more. MTBF is measured by running a bunch of hard drives for days or weeks and seeing how many fail during that time. That data is then extrapolated out to an MTBF number. You can search on Google or read this tutorial for more info.

However, an MTBF of 50 years does not mean your hard drive will last for 50 years. It’s an average: some hard drives will last longer, while others will fail sooner. Sometimes much sooner.

That MTBF number also assumes you only use your hard drive within its specified service life (usually 3-5 years). So a 50 year MTBF for a hard drive with a 5 year service life means that, if you replaced the hard drive with a new one every 5 years, you would go 50 years on average before one of them failed.
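To put that MTBF number in yearly terms, here is a rough sketch in Python. It assumes a constant failure rate during the service life (the usual simplification behind MTBF figures, not something a drive vendor promises), and the function name is just for illustration.

```python
import math

def annual_failure_probability(mtbf_years: float) -> float:
    """Chance of a failure within one year, assuming a constant failure rate."""
    return 1.0 - math.exp(-1.0 / mtbf_years)

# A 50 year MTBF works out to roughly a 2% chance of failure per year.
print(f"{annual_failure_probability(50):.2%}")  # 1.98%
```

Under that assumption, a 50 year MTBF is about a 1-in-50 chance of failure in any given year of the service life.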

A rule of thumb

OK, so we really don’t know exactly when your hard drive will fail. We do know that you shouldn’t use it past the end of its service life. So the rule of thumb I give clients for a hard drive with a 5 year service life is this: a 2% chance that the hard drive will fail in the first year, 4% in the second, 8% in the third, 16% in the fourth, and 32% in the fifth. Past that, just keep doubling it.
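If you want to play with the rule of thumb, it can be sketched in a few lines of Python. The function names are made up for this example, and the cumulative figure rests on an extra assumption of mine: that each year's percentage applies only to drives that survived the previous years.

```python
def yearly_failure_chance(year: int, first_year_chance: float = 0.02) -> float:
    """Rule-of-thumb chance of failure during a given year (1 = first year).

    Doubles every year and is capped at 100%.
    """
    return min(1.0, first_year_chance * 2 ** (year - 1))

def chance_failed_by(year: int) -> float:
    """Chance the drive has failed by the end of the given year.

    Assumption: each yearly figure applies only if the drive is still alive.
    """
    surviving = 1.0
    for y in range(1, year + 1):
        surviving *= 1.0 - yearly_failure_chance(y)
    return 1.0 - surviving

for y in range(1, 6):
    print(f"year {y}: {yearly_failure_chance(y):.0%} this year, "
          f"{chance_failed_by(y):.1%} by end of year")
```

Read this way, the rule of thumb says a drive has roughly even odds of failing at some point during a 5 year service life.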

What happens if I lose a hard drive?

Hard drives are near the top of the list of things to fail, and a server with a single failed hard drive can be down for 4, 8, even 24 hours as you and/or the hosting company support staff:

  • notice the site is down
  • figure out the hard drive is dead
  • find a replacement hard drive & install it
  • restore your machine by restoring from backup and/or reinstalling & reconfiguring things
  • test to make sure that everything is working correctly and fix any issues

So if you lose a hard drive and don’t have a backup web server, your site could be down for 8-24 hours, depending on your backup strategy & the quality of your support team.

Compare that to the loss of a fan or power supply, which can be swapped out in under an hour. Some fans you can even swap out while the server is still running.

You need to think about what downtime means to your company. Will you lose money if your site is down for a day, either through lost sales, departing customers, or uptime fees paid to your clients? Will it be an embarrassment? Remember that sites can often fail at the worst time. Think about what 8-24 hours of downtime will “cost” your company. It could be a dollar amount or something non-monetary.

Then you can multiply this cost by the chance that your hard drive will fail (see the rule of thumb above) to come up with the expected cost to your company due to non-RAID hard drive failure.
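As a quick worked example of that multiplication, here is a small Python sketch. The $10,000 outage cost is a made-up number, and the 8% is the rule-of-thumb chance for a drive in its third year.

```python
def expected_downtime_cost(downtime_cost: float, failure_chance: float) -> float:
    """Expected cost of a drive failure: what an outage costs times its chance."""
    return downtime_cost * failure_chance

# Hypothetical: an outage would cost $10,000 and the drive is in its third year.
cost = expected_downtime_cost(downtime_cost=10_000, failure_chance=0.08)
print(f"Expected cost this year: ${cost:,.0f}")
```

If the cost of downtime is non-monetary, the same idea applies. You just have to weigh it by feel rather than by arithmetic.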

So how do I decide whether to buy RAID or not?

So based on the above rule of thumb, plus the potential downtime, you can estimate whether it’s worth paying for RAID or not.

You might decide that up to 24 hours down isn’t acceptable at all, in which case you should go for RAID. Otherwise, if it’s a potentially acceptable loss, you can think of RAID as insurance. Is the extra cost of RAID more or less than the expected downtime cost?
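The insurance comparison can be sketched like this in Python. All the numbers are hypothetical, and the function makes a big simplifying assumption: that RAID avoids the outage entirely. In real life RAID only protects you from a single drive failure, not from every cause of downtime.

```python
def raid_pays_off(raid_cost: float, downtime_cost: float,
                  yearly_chances: list[float]) -> bool:
    """True if the expected downtime cost over the period exceeds the RAID cost.

    Simplification: assumes RAID prevents the outage completely.
    """
    expected_loss = sum(chance * downtime_cost for chance in yearly_chances)
    return expected_loss > raid_cost

# Hypothetical: RAID adds $500 over three years, an outage would cost $5,000,
# and the yearly chances are the rule-of-thumb figures for years 1-3.
print(raid_pays_off(raid_cost=500, downtime_cost=5_000,
                    yearly_chances=[0.02, 0.04, 0.08]))  # True
```

Here the expected loss is about $700 against $500 for RAID, so the insurance is worth buying. With a cheaper outage or a shorter period, the answer can flip.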

But what if I have two web servers?

If you’re running multiple web servers and load-balancing across them, you might not need RAID, since you can simply direct all traffic to the “good” web server if one fails. Remember, though, that to be able to fail over successfully, you need to be capable of supporting all your traffic on a single web server. Since servers seem to enjoy failing at the worst times (which is often when your traffic is highest), you should expect that it’ll be a good amount of traffic on that single server.

If you have two load-balanced web servers and both are at 80% capacity, what happens if one server fails and you redirect all traffic onto the good server? That’s right: your good server will get overloaded, and some percentage of your users will get errors and long load times. Even worse, your good server could redline and crash, and then you’re totally down.
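The arithmetic behind that example is simple enough to check in Python. This sketch assumes identical servers, with each load given as a fraction of one server's capacity, and traffic spread evenly over whatever servers are left. The function name is just for illustration.

```python
def load_after_failover(loads: list[float], failed_servers: int = 1) -> float:
    """Load on each surviving server once the failed servers' traffic is moved.

    Assumes identical servers and an even spread of traffic.
    """
    survivors = len(loads) - failed_servers
    if survivors < 1:
        raise ValueError("no servers left to take the traffic")
    return sum(loads) / survivors

load = load_after_failover([0.80, 0.80])
print(f"Surviving server load: {load:.0%} of capacity")  # 160% of capacity
```

With two servers, each one has to stay under 50% of capacity at peak for a clean failover. Add a third server and that limit rises to about 67% each.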

So…that’s all for now.

 
