MoJo, the bit of analysis Pierre fetched very much contradicts a naive reading of what you said! Exponential usually refers to something like failure rates of 2% the first year, 4% the second, 8% the third, 16% the fourth and so on.
That's exactly what I was commenting on...thank you
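To make the quoted definition concrete, here is a minimal sketch of what "exponential" would mean for annual failure rates. The function name and the doubling factor are my own illustration of the 2%, 4%, 8%, 16% example above; none of these figures come from the Google report.

```python
def exponential_failure_rates(first_year_rate, years, factor=2.0):
    """Return annual failure rates that grow by a constant factor each year.

    Illustrative only: first_year_rate and factor are assumptions taken
    from the 2%, 4%, 8%, 16% example in the post, not measured data.
    """
    return [first_year_rate * factor ** year for year in range(years)]


rates = exponential_failure_rates(0.02, 4)
for year, rate in enumerate(rates, start=1):
    # Prints one line per year, e.g. the first-year rate as a percentage.
    print(f"year {year}: {rate:.0%}")
```

A curve that rises and then flattens or dips does not fit this pattern, which is the distinction being made above.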
I don't know what you call relevant but I went back to the paper and I see that I remembered right: they've found strong correlations between SMART data and failures. The only SMART data which is unambiguously about the surface as far as I know is the reallocation count, which had a fairly strong correlation with failures. But other parameters had as good (if not better) correlations.
The variables that the report mostly expands on - scan errors*, various reallocation types - seem to be related more to the drive's disk surface than to the reading/writing, rotating mechanism...
Unless I read the report too hastily, these seemed to me more clearly related to disk failure probability than the other variables (seek errors, CRC errors, spin retries, power cycles).
* I am including scan errors in this category (of variables more related to disk surface issues) based on what the report says:
"Drives typically scan the disk surface in the background
and report errors as they discover them. Large scan error
counts can be indicative of surface defects"
On the basis of these findings, I said that surface defects - which have nothing to do with the moving mechanism - could be more related to failures than the wearing of the moving parts. Unless I am misunderstanding something, this should also answer, in part, the following:
When the disk is on it is rotating all the time. The motor is running constantly. It doesn't matter if the head is moving or reading/writing data, the drive still spins. In Google's case the drives never stop so start/stop cycles were not included in the data they used. Equally surface defects are not affected by activity, beyond being accessed from time to time which for a drive that is nearly 100% full of randomly accessed data (like Google's) definitely fits that category.
I didn't say surface defects are affected by activity, but that they might have more to do with disk failures.
It is true that Google's study has its limitations...I don't argue with that...but you brought up their report, and I pointed to some differences between what you said and what it said.
It is also quite weird that you referred me to the Google study to base your argument on hdd servos/motors, when this constant motor movement is not displayed as a prominent failure variable by that report. And then you criticise it for not taking into account the variable of start-stop counts.
(Yet the fact that the drives are in a server environment does not mean that starts/stops do not occur, only that power cycles do not.)
It does address the issue of power cycles - that MikeC touched upon - somewhat by saying:
"Power Cycles. The power cycles indicator counts the
number of times a drive is powered up and down. In
a server-class deployment, in which drives are powered
continuously, we do not expect to reach high enough
power cycle counts to see any effects on failure rates.
Our results find that for drives aged up to two years, this
is true, there is no significant correlation between failures
and high power cycles count. But for drives 3 years
and older, higher power cycle counts can increase the
absolute failure rate by over 2%. We believe this is due
more to our population mix than to aging effects. Moreover,
this correlation could be the effect (not the cause)
of troubled machines that require many repair iterations
and thus many power cycles to be fixed."
SMART keeps track of how many surface defects it sees. What constitutes a SMART failure is defined by the vendor and tends to be somewhat conservative, but if you look at the raw numbers for things like remapped blocks the pattern is fairly clear. The reason SMART typically fails to warn you before a drive fails is simply that its counters do not reach the limits set by the manufacturer. At work we use PC Check, which is a bit more realistic with the numbers and which also runs the SMART long test. Most manufacturers' tools can run the long test, and it often fails because scanning the entire surface of the drive picks up new defects, which then push the remapped block count over the limit.
I have HDD Sentinel running all the time, and I always check my hdds for SMART events not reported by the manufacturers' tools and test them rigorously from time to time...
Moreover, when I first connect a new drive, I always use Sentinel's tools for testing for hardware issues and within the first two weeks I run a full "hdd regeneration" test (reading, writing, re-reading, rewriting) to make sure there are no weak sectors...
...so I'm not arguing - nor did I argue - with you there...
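For anyone following along, the point in the quoted passage about raw counters versus vendor limits can be sketched roughly like this. Everything here is hypothetical: the `Counter` class, the `assess` function, and all the numbers are made up for illustration, and real SMART tools expose normalized values and thresholds in their own formats.

```python
from dataclasses import dataclass


@dataclass
class Counter:
    name: str
    raw: int           # raw count reported by the drive (made-up value)
    vendor_limit: int  # hypothetical count at which the vendor flags failure


def assess(counter, strict_limit):
    """Classify a counter against the vendor limit and a stricter user limit."""
    if counter.raw >= counter.vendor_limit:
        return "vendor-failed"
    if counter.raw >= strict_limit:
        return "warning"  # the vendor tool stays silent here
    return "ok"


# A drive with 40 remapped blocks: under the (assumed) vendor limit of 100,
# but well over a stricter, user-chosen limit of 10.
remapped = Counter("remapped_blocks", raw=40, vendor_limit=100)
before_scan = assess(remapped, strict_limit=10)

# A full surface scan that finds 70 new defects pushes the count over the limit.
remapped.raw += 70
after_scan = assess(remapped, strict_limit=10)

print(before_scan, after_scan)
```

The idea is only that a stricter reading of the same raw numbers raises a flag earlier than the vendor's limit does, and that a long test can move a drive from one category to the other.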