Hi,
It seems that a lot of folks are worried about their disk failure rates and how these are impacted by temperature. I have some experience in reliability assessment and modelling in datacentres so I thought I'd join in.
Thermal impact on Reliability;
Disk reliability is significantly affected by operating temperature. See the following link from Hitachi Data Storage for a graphic example from a manufacturer:
http://www.hitachigst.com/hdd/technolo/drivetemp/drivetemp.htm
As can be seen, the rule of doubling for every 10 degrees C rise is not applicable, as the hard drive is electromechanical, not purely electronic. More interesting (to me anyway) is that the stated MTBF can be improved upon by running the disk colder.
From the chart, running this disk at 15 degrees above design temperature will increase the probability of failure by a factor of 1.4 for every day you run the disk hot. Conversely, if you chill it down to 15 degrees below design temperature, the probability of failure is divided by 1.5.
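As a rough sketch of how those chart factors could be applied, the Python below scales a constant failure rate by a temperature factor and converts it to a yearly failure probability. The function name, the 1,000,000 hour baseline MTBF and the assumption that the factor multiplies the failure rate directly are illustrative only and are not taken from the Hitachi data.

```python
import math

HOURS_PER_YEAR = 8760  # assumes the disk is powered on continuously

def annual_failure_probability(mtbf_hours, temp_factor=1.0):
    """Probability of failing within one year of continuous running.

    Assumes a constant failure rate of 1/MTBF, multiplied by temp_factor
    (an assumption about how the chart factors should be applied).
    """
    rate = temp_factor / mtbf_hours  # failures per hour
    return 1.0 - math.exp(-rate * HOURS_PER_YEAR)

baseline_mtbf = 1_000_000  # hypothetical figure, not from the chart

for label, factor in [("15 C below design", 1 / 1.5),
                      ("at design temp", 1.0),
                      ("15 C above design", 1.4)]:
    p = annual_failure_probability(baseline_mtbf, factor)
    print(f"{label:>18}: {p:.2%} chance of failure per year")
```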
MTBF & Reliability;
There has been discussion of what MTBF means and how you will be impacted by it in this post so I will offer some views on this.
MTBF normally stands for Mean Time Between Failures, but that applies to repairable systems. Hard drives are typically replaced, not repaired, so in this case it would mean Mean Time Before Failure.
As has already been pointed out the reliability will be of the characteristic "bathtub curve" where the following three effects combine:
1) Initial high infant mortality due to manufacturing defects. This will typically show up during the first few days; format the disk and then benchmark it for a day to get through this bit
2) Normal low level random failure following an exponential reliability model (thus the log e in the equation already given)
3) End of design life high failure due to component wearout
So if your disk survives its first few days, it is then going to be subject to a continuous probability of random failure. The cumulative probability of failure (e.g. the probability that a disk will fail within its first 100,000 hours) is given by the exponential model. This means that, given the disk is still working today, the probability of it failing tomorrow is the same all the way through the design lifetime.
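The "same chance tomorrow" property of the exponential model can be checked numerically. This is a minimal sketch: the function names and the 500,000 hour MTBF are made up for illustration, and infant mortality and wearout are ignored.

```python
import math

def reliability(mtbf_hours, hours):
    """Probability of surviving `hours` under the exponential model."""
    return math.exp(-hours / mtbf_hours)

def prob_fail_next_day(mtbf_hours, age_hours):
    """Chance of failing in the next 24 hours, given survival to age_hours."""
    survived_so_far = reliability(mtbf_hours, age_hours)
    survives_one_more_day = reliability(mtbf_hours, age_hours + 24)
    return 1.0 - survives_one_more_day / survived_so_far

mtbf = 500_000  # hypothetical MTBF in hours

# The conditional probability is the same whatever the age of the disk.
for age in (0, 1_000, 20_000):
    print(f"age {age:>6} h: {prob_fail_next_day(mtbf, age):.6%}")
```

All three lines should print the same value (up to floating point rounding), which is what a constant random failure rate means in practice.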
MTBF and Warranty;
The MTBF given and the design life of the disk will be substantially larger than the warranty given, for the simple reason that warranty returns cost the manufacturer a lot of money. This is why you see disks with 5 year warranties with huge MTBFs. The manufacturer will want to limit the returns within warranty to a small percentage for cost reasons.
e.g.
Disks with MTBF of 83500 Hours are sold by manufacturer with a 1 year warranty.
Ignoring infant mortality and assuming no end of life failures;
probability of each disk surviving the first year is 90%
(in Excel use the formula =EXP(-(1/MTBF)*Hours) to give the reliability, subtract this from 1 to get the probability of failure)
So this manufacturer, giving a 1 year warranty, will have to replace 10% of all the disks they sell, which will cost them more than the profit for the whole batch.
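For anyone without Excel to hand, the same calculation can be sketched in Python. It mirrors the formula above and assumes the disk is powered on for the whole year (8,760 hours); the function name is just for illustration.

```python
import math

def reliability(mtbf_hours, hours):
    """Exponential model: same as =EXP(-(1/MTBF)*Hours) in Excel."""
    return math.exp(-(1 / mtbf_hours) * hours)

mtbf = 83_500               # hours, from the example above
warranty_hours = 365 * 24   # one year, assuming continuous running

survive = reliability(mtbf, warranty_hours)
fail = 1 - survive
print(f"survive first year: {survive:.1%}, fail within warranty: {fail:.1%}")
```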
This is why the MTBF for disks is so high: manufacturers who offer useful warranties (5 years) have to make disks of which only a very small percentage will fail within warranty in order to make money.
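Turning the exponential model around gives a feel for how big the MTBF has to be. The sketch below works out the MTBF needed to keep warranty returns under a target; the 2% target and the function name are hypothetical, and continuous running with no infant mortality or wearout is assumed.

```python
import math

def required_mtbf(warranty_hours, max_return_fraction):
    """MTBF needed so that at most max_return_fraction fail in warranty.

    Solves 1 - exp(-hours / MTBF) = max_return_fraction for MTBF.
    """
    return -warranty_hours / math.log(1.0 - max_return_fraction)

five_years = 5 * 365 * 24  # hours, assuming the disk never powers down
target = 0.02              # hypothetical: at most 2% returned in warranty

print(f"MTBF needed: {required_mtbf(five_years, target):,.0f} hours")
```

Under those assumptions the answer comes out at over two million hours, which is the same order of magnitude as the figures quoted on datasheets.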
This MTBF is, however, an artificial value and should not be read as "my disk will last for 1,000,000 operating hours" because this is not true. Because of how the number is used and the testing methods used to get it, what it really means is closer to:
"with 10,000 of my disks all running together for 70 hours, there is roughly a 50% chance that at least one of them will fail during the test"
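A quick way to sanity-check what a 1,000,000 hour MTBF implies for a test like that is to pool the disk-hours. This is only a sketch under the exponential model, with independent disks and no infant mortality; the function name is made up for illustration.

```python
import math

def prob_at_least_one_failure(n_disks, mtbf_hours, test_hours):
    """Chance that any disk in the batch fails during the test."""
    total_disk_hours = n_disks * test_hours
    return 1.0 - math.exp(-total_disk_hours / mtbf_hours)

n_disks, mtbf, test_hours = 10_000, 1_000_000, 70

expected_failures = n_disks * test_hours / mtbf
p_any = prob_at_least_one_failure(n_disks, mtbf, test_hours)
print(f"expected failures: {expected_failures:.2f}")
print(f"chance of at least one failure: {p_any:.1%}")
```

With these numbers the expected number of failures is below one, and the chance of seeing any failure at all in the 70 hours is about even.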
End of Design Life;
Your disks will probably die shortly after the end of the design life due to mechanical wearout. The design life will be a safe (for the manufacturer) margin beyond the end of the warranty.
I have no data on how this is affected by temperature but it is reasonable to make the following assumptions:
1) Percentage of time powered up will impact EOL
2) Temperature whilst running will impact EOL (see the discussion above about the various seals etc in the disk)
3) Extent of use will impact EOL; if the disk is seeking continuously then the head motors and bearings are going to fail sooner.
I hope this helps. If anyone wants me to clarify anything or wishes to know more then please let me know.
Thx
Liam