Le_Gritche wrote:
dhanson865 wrote:
As far as I'm concerned, the failures below 30°C were just noise in the data, or in other words, bad data to begin with.
Care to explain a bit why failures below 30°C would be noise?
Did you read ANY of the threads I linked to?
whiic wrote:
Some manufacturers have accurate temperature sensors, others have not. Some Samsungs run at temperatures between "10" and "15" in a room temperature of 25.
Apparently they considered temperature values from Hitachi drives spurious and thus disregarded them: "For example, some drives have reported temperatures that were hotter than the surface of the sun."
I have commented several times on the unique way Hitachi drives report temperature. All others report temperature in the last byte of the raw data field and the rest are just zeroes. Because of this, it's easy to convert the whole raw data field from binary to decimal and assume the result is the current temperature.
Hitachi drives report temperature as [zero byte][max temp byte][zero byte][min temp byte][zero byte][current temp byte]. For example: 0x003200120025 is the raw data of the Hitachi Travelstar in this laptop I'm using. 0x32 is 50 deg C, 0x12 is 18 deg C and 0x25 is 37 deg C.
SpeedFan knows how to convert Hitachi temperature raw data to real temperature. Software like HDD Health does not. And whatever software Google uses also fits into the latter category.
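To make the byte layout described above concrete, here is a minimal Python sketch. The function names (decode_hitachi_raw, naive_decode) are made up for illustration and are not taken from SpeedFan, HDD Health, or any other tool; the layout is simply the six-byte one described in this post.

```python
def decode_hitachi_raw(raw: int) -> dict:
    """Split a 48-bit raw value laid out as
    [zero][max temp][zero][min temp][zero][current temp].
    The layout is assumed from the description in this post."""
    data = raw.to_bytes(6, "big")  # six bytes, most significant first
    return {"max": data[1], "min": data[3], "current": data[5]}


def naive_decode(raw: int) -> int:
    """What a tool does if it assumes only the last byte is ever non-zero:
    it reads the whole raw field as one decimal number."""
    return raw


raw = 0x003200120025  # the Travelstar example from the post
print(decode_hitachi_raw(raw))  # max 50, min 18, current 37
print(naive_decode(raw))        # a number far beyond any plausible temperature
```

Reading the whole field as one integer is harmless on drives that leave the upper bytes at zero, which is why the naive approach works everywhere except on drives using this layout.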
Rusty075 wrote:
I think some of you are overlooking what is the most likely potential explanation for the "cool drives fail faster" phenomenon in the Google report:
The temperature results list average temperature readings for the drives.
Take two identical drives. Place one under continuous Medium utilization where its drive temperature stays at a near constant 45°. Place the other drive under Low utilization, where it spends say 2/3rds of its time idling at 25°, and 1/3rd of its time at 100% use where its temp peaks at over 50°. The Low drive will have an average temp that is much less than the Medium usage drive (33° in my hypothetical). But I could almost guarantee you that the repetitive thermal cycling that comes from alternating periods of high usage and low usage will be harder on that drive than the 12° hotter temp is on the Medium, thus making the Low drive statistically more likely to fail. For many high-precision mechanical parts thermal cycling is more damaging than conventional wear, so it seems reasonable that HDDs would have a similar reaction.
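The arithmetic in that hypothetical can be checked with a short Python sketch. The profile format (a list of time-fraction and temperature pairs) and both function names are assumptions made up for this illustration, and 50° is used as a stand-in for "peaks at over 50°".

```python
def time_weighted_mean(profile):
    """Average temperature, weighting each temperature by the
    fraction of time the drive spends at it."""
    total = sum(fraction for fraction, _ in profile)
    return sum(fraction * temp for fraction, temp in profile) / total


def temperature_swing(profile):
    """Difference between the hottest and coolest state in the profile."""
    temps = [temp for _, temp in profile]
    return max(temps) - min(temps)


# (fraction of time, temperature in deg C), taken from the hypothetical above
medium = [(1.0, 45.0)]
low = [(2 / 3, 25.0), (1 / 3, 50.0)]

print(round(time_weighted_mean(medium), 1), temperature_swing(medium))
print(round(time_weighted_mean(low), 1), temperature_swing(low))
```

The Low drive averages roughly 33° against 45° for the Medium drive, yet it is the only one of the two with any swing at all (25°). An average alone cannot tell those two histories apart.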
The "low temperature" drives were more likely drives that were deployed in low demand servers, where they spent large portions of their time at idle, rather than just happening to be in extra-cold rooms or in racks that inexplicably had a bunch of extra fans in them.
A correlating bit of data shows up in the Utilization failure chart, where in more than half the time periods the drives with Low utilization are more likely to fail than the drives with Medium usage. While their "Survival of the Fittest" is one theory, the effects of thermal cycling could also be playing a part.
whiic wrote:
Still, do you disagree with some of the following:
- 5400rpm drives usually run cooler than 7200rpm
- 5400rpm drives are a dying breed and usually use older technology and ball-bearings
- ball-bearings compromise HDD reliability over a longer period of use as ball-bearings have a tendency to wear out (thus increasing non-repeatable run-out (NRRO) and causing errors during I/O).
If you agree with those three propositions, and remember Google's study did have both 5400rpm and 7200rpm drives, don't you see the possibility of this affecting the statistics?
Some ... claim things like cooling a HDD "too much" will cause reliability to drop. I find it more likely that bearing type is the cause. Sure, it wouldn't be a statistical problem to have BB drives among other drives IF there were no correlation between BBs and lower rpms. But there obviously is a correlation, thus it affects the outcome and causes extra failures at low temperatures. Reduce cooling on cool-running BB drives and it certainly won't do any good... except reduce noise produced by fans, but since it's a server that doesn't matter.
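The confounding argument above can be illustrated with a toy Python sketch. Every number in it is invented purely for illustration and has nothing to do with Google's actual data; the fleet table, the "cool"/"warm" buckets, and the function names are all hypothetical. Within each bearing type the failure rate is deliberately identical at both temperatures, and the only difference is that the ball-bearing drives sit mostly in the cool bucket.

```python
# (bearing type, temperature bucket) -> (drives, failures)
# All counts are invented for illustration only.
fleet = {
    ("ball", "cool"): (800, 80),
    ("ball", "warm"): (200, 20),
    ("fluid", "cool"): (200, 4),
    ("fluid", "warm"): (800, 16),
}


def failure_rate(rows):
    """Pooled failure rate over a list of (drives, failures) pairs."""
    drives = sum(d for d, _ in rows)
    failures = sum(f for _, f in rows)
    return failures / drives


def rate_by_temp(fleet, bucket):
    """Failure rate for a temperature bucket, ignoring bearing type."""
    return failure_rate([v for (_, t), v in fleet.items() if t == bucket])


def rate_by_bearing_and_temp(fleet, bearing, bucket):
    """Failure rate for one bearing type in one temperature bucket."""
    return failure_rate([fleet[(bearing, bucket)]])


for bucket in ("cool", "warm"):
    print(bucket, rate_by_temp(fleet, bucket))
for bearing in ("ball", "fluid"):
    for bucket in ("cool", "warm"):
        print(bearing, bucket, rate_by_bearing_and_temp(fleet, bearing, bucket))
```

Pooled together, the cool drives in this made-up fleet fail more than twice as often as the warm ones, even though temperature has no effect at all once the drives are split by bearing type.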
Some drives report temps lower than ambient. This is not logical, so let's call it a reporting error for those drives. Some drives erroneously report high temps when they are at low temps. Why is it this way? I don't know, but I see it all the time. Motherboard sensors that report 70+C case temps at startup when the temp is really in the teens, but roll over to 20C when the case starts to warm up. Why design a thermal sensor that will report wildly inaccurate readings? All I know is it happens in this industry and it happens often. Are we saying that the Google data centers are kept below 60F/15C? Do you trust all the temp data your hardware gives you without question?
The drives in the study varied widely. If you wanted to test specifically for the relationship between drive temp and failure rate, you would need to hold every variable other than drive temp the same.
I'm not saying the Google study is worthless. I'm just saying I'm taking it with a grain of salt.