The DAM Forum
Author Topic: Migration from failing drive  (Read 5130 times)
peterkrogh
Administrator
Hero Member
Posts: 5682
« on: November 21, 2009, 11:38:35 AM »

I moved this post out from a different thread. It's a description of the process I'm using to move data off a failing drive and then validate it.

Background:
I did a spot check of one of my archive drives, and it showed a growing Reallocated Sectors count in the SMART data. This is almost certainly an indication that the drive is on its way out, even though it's a six-month-old drive (2TB WD Green). I'm moving the data onto a new drive (2TB Hitachi).
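For anyone who wants to script that kind of spot check, here's a minimal sketch of the idea in Python. It assumes the free smartmontools package (the smartctl command) is installed and that the drive shows up at a device path like /dev/disk2 - both assumptions about your system, not a description of mine.

[code]
import subprocess

# Minimal SMART spot check: pull the attribute table and report the raw
# reallocated-sector count. Assumes smartmontools is installed and the
# drive is at /dev/disk2 (adjust for your own system).
DEVICE = "/dev/disk2"

result = subprocess.run(["smartctl", "-A", DEVICE],
                        capture_output=True, text=True)

for line in result.stdout.splitlines():
    if "Reallocated_Sector_Ct" in line:
        raw_count = int(line.split()[-1])  # last column is the raw value
        print(f"{DEVICE}: {raw_count} reallocated sectors")
        if raw_count > 0:
            print("Reallocated sectors present - start watching this drive.")
[/code]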


Joe,
Sure. I have my system set up for maximum protection in a disaster-recovery scenario, rather than for preserving the current work (at least in the files themselves). I could have done it either way, but decided to restore from the existing primary copy rather than the backup copy - the fact that these were DNG files is what tipped the balance. (You'll see where this comes in below.)

Let's review the setup:

The primary copy is my hard drive array in a JBOD device. This gets updated from time to time with new PIE settings and other metadata. It's the current copy. The drive we're replacing is the Original_02 drive that started showing a growing count of reallocated bad sectors. All files on this drive are DNG files with the validation hash, JPEG original files, and a few TIFFs that were put there in my pre-DAM days in 2002 and 2003.

I have a DVD/Blu-ray backup as well - these were made when the files were first sent to the archive. This is the last-resort disaster recovery copy. (Loading all the files from Blu-ray would take quite a while.)

I also have a hard drive backup that lives offsite. I have this configured as an additive backup - the files are generally written and then not updated. (They have been updated a few times, depending on how old they are. In each case, an update is preceded by a data validation of the primary files.) All the work that has been done to these files should live either in the Expression Media catalog or in my "reworked images" Lightroom catalog for images that have been readjusted. (These files have *not* been updated, in anticipation of just this scenario, where some portion of the primary files has been silently corrupted. In this case, it looks like I caught it before much damage was done.)
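The actual additive backup is handled by my backup software, but the additive idea itself is simple enough to sketch. The snippet below is only an illustration, not my real setup: the Backup_02 volume name is invented, and only Original_02 comes from the description above.

[code]
import shutil
from pathlib import Path

# Sketch of an "additive" backup pass: copy files that are missing on the
# backup volume, but never overwrite anything already there. Paths are
# examples only; "Backup_02" is a made-up volume name.
PRIMARY = Path("/Volumes/Original_02")
BACKUP = Path("/Volumes/Backup_02")

for src in PRIMARY.rglob("*"):
    if not src.is_file():
        continue
    dst = BACKUP / src.relative_to(PRIMARY)
    if dst.exists():
        continue  # additive: existing backup copies are left untouched
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dst)  # copy2 preserves modification times
    print(f"added {dst}")
[/code]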

Possible courses of action:
Load from DVD or Blu-ray. Takes too much time, and the files are the oldest versions. Not desirable.
Load from the backup disk. Slightly longer than reloading from the primary, since there are multiple disks and several long processes as part of the restoration. These are also older versions of the files, and I would need to update the embedded metadata, which makes it an even longer process.
Load from the primary disk that is having the problem. This is what I have chosen to do. My expectation is that there are relatively few problem files, so most of what's on this disk will be just fine. I can also use this as a teachable moment.

Process:
1. Once I saw that there was a growing number of bad sectors on the primary drive (but still under 100, so it's not dying *immediately*), I pulled the drive from service and went out and bought a new drive.

2. Format the new drive and write zeros to it. Check the SMART data after writing zeros to confirm that it's a good drive. (The first one I bought was bad, so I actually had to do this twice.) This took 6 hours per drive.

3. Validated transfer of all files to archive. Only two files could not be read and transferred. Chronosync logs these filenames - I'll restore them from backup drive or DVD later. This took 3 days to run.

4. Run the DNG converter on the whole drive (215,271 image files).  This is really the key for me to restore from the suspect primary drive.  I can know with nearly absolute certainty that the restored files are intact, because the DNG converter will throw an error if there is a hash mismatch. (I'm at this point in the process now - 120,000 left to go.)  This will take 2 days total.

5. Once the DNG converter runs, replace any additional corrupted files from backup.

6. Check against the Expression Media catalog to see that all files that should be there are present.

(At this point, I'm basically done, but this is also a good trigger to take a look at the backups for validation. I have not done that in over 6 months, so I'll do that now.)

7. Run DiskWarrior on the backup drives.

8. Check SMART data on the backup drives.

9. If I suspect anything is amiss, I'll run further validation - which would typically be done with the DNG converter. For derivative backups, spot-check a few buckets with ImageVerifier.
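If you don't have ImageVerifier, the kind of checksum spot check behind steps 3, 4, and 9 can be approximated with a short script. This is only a sketch of the idea: the bucket path and manifest filename are placeholders, and ImageVerifier itself keeps its own records and does a far more thorough job.

[code]
import hashlib
import json
import random
from pathlib import Path

# Keep an MD5 manifest per bucket (folder), then re-hash a random sample on
# later passes. Bucket path and manifest name are placeholders, not my layout.
BUCKET = Path("/Volumes/Backup_02/bucket_0142")
MANIFEST = BUCKET / "checksums.json"

def md5_of(path: Path) -> str:
    h = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MB at a time
            h.update(chunk)
    return h.hexdigest()

if not MANIFEST.exists():
    # First pass: record a hash for every file in the bucket.
    manifest = {str(p.relative_to(BUCKET)): md5_of(p)
                for p in BUCKET.rglob("*") if p.is_file()}
    MANIFEST.write_text(json.dumps(manifest, indent=2))
    print(f"wrote {len(manifest)} checksums to {MANIFEST}")
else:
    # Later passes: spot-check a random sample against the recorded hashes.
    manifest = json.loads(MANIFEST.read_text())
    for rel in random.sample(sorted(manifest), k=min(25, len(manifest))):
        ok = md5_of(BUCKET / rel) == manifest[rel]
        print(f"{rel}: {'OK' if ok else 'MISMATCH - restore from another copy'}")
[/code]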

I'll put all this into a movie on the dpBestflow site once it's finished, so you can see the process.

Peter
« Last Edit: November 21, 2009, 11:49:05 AM by peterkrogh »
R. Neil Haugen
Full Member
Posts: 121
« Reply #1 on: November 21, 2009, 01:07:14 PM »

Peter,

I read an analysis of drive failures by a large net organization (was it Google? I can't recall) which had literally tons of drives in enclosures, with tracking software recording the heat and usage of each drive over its life. They ... like you ... could find NO commonality whatever between ANY "condition" and higher failure rates. Heat was one of the first shockers ... some of the drives with the highest average running temperature over their lives and the highest peak temps lasted longer than an "average" drive, and there was no shred of correlation between the heat "life" of a drive and its lifespan before catastrophic failure. Drives that were "cooler" than average had no longer life spans.

The ONLY items that were correlative were 1) within the first six months and 2) after three years. Even those were only slightly higher than "average" time to failure. Some drives lasted three years, others were cycled-out at five with no sign of impending failure. The report (which ran to pages of graphs and charts) did mention growing "bad" sectors as indicative of impending failure but even that was not absolutely trackable, as some drives had a period where they reported noticeable failed sectors, but then stabilized and performed fine for quite some time. Some, of course, reported increased sector failures and fairly quickly made their final screech ...

Brand didn't correlate, nor did size, format, temp, number of accesses over the "life" of the drive, spin-up time, spin-down time ... nothing that one would think would matter. So ... we apparently just need to be careful to have good backup/restore plans carried out ... and when the "#hit" hits the fan, restore.

Like the age-old computer joke about God, Jesus, and Satan ... with Jesus and Satan competing in writing multimedia presentations from scratch code. A few minutes before the end of the allotted time, God points his finger and lightning bolts hit both systems' power supplies. Satan is screaming mad as he furiously hits the keyboard anew ... and Jesus just swaps drives out and awaits the bell. At the closing bell, Jesus announces he's ready to run his, and Satan is still screaming mad about the unfairness of it all. God merely rolls his eyes at Satan's prattle and replies: "Jesus saves."

I've had a couple of drives go bad over the last 15 years ... and several that were clearly in that sudden-decline state, from which we got everything off that wasn't already backed up, so I've suffered no catastrophic situation. Yet. Still, thanks to your work and that of others, we are constantly improving our file handling. Keep up the good work!

Neil

R Neil Haugen
MyPhotoMentor.com
rNeilPhotog.com
Haugensgalleri.com
peterkrogh
Administrator
Hero Member
Posts: 5682
« Reply #2 on: November 21, 2009, 03:33:05 PM »

Neil,
I like the joke.
And, yes, one of the most reliable indicators of drive failure - with a higher correlation than almost anything else, as I understand it - is a growing bad sector count. I consider a single bad sector to be an anomaly, but anything above that is cause for real worry and frequent monitoring.
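To make that frequent monitoring less of a chore, something like this sketch can log the count over time and flag growth. As before, smartmontools, the device path, and the log file are assumptions, not a description of my own tools.

[code]
import csv
import subprocess
from datetime import date
from pathlib import Path

# Log the raw reallocated-sector count over time so that growth (not just a
# single reading) stands out. Device path and log filename are assumptions.
DEVICE = "/dev/disk2"
LOG = Path("smart_history.csv")

out = subprocess.run(["smartctl", "-A", DEVICE],
                     capture_output=True, text=True).stdout
count = next((int(line.split()[-1])
              for line in out.splitlines() if "Reallocated_Sector_Ct" in line),
             None)
if count is None:
    raise SystemExit(f"could not read SMART attributes from {DEVICE}")

previous = None
if LOG.exists():
    rows = list(csv.reader(LOG.open()))
    if rows:
        previous = int(rows[-1][1])

with LOG.open("a", newline="") as f:
    csv.writer(f).writerow([date.today().isoformat(), count])

if previous is not None and count > previous:
    print(f"WARNING: reallocated sectors grew from {previous} to {count}")
elif count > 1:
    print(f"{DEVICE} already shows {count} reallocated sectors - monitor it often")
[/code]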
Peter
JoeThePhotographer
Full Member
Posts: 208
« Reply #3 on: November 23, 2009, 08:39:44 AM »

Peter,

Would having a RAID 1 or other RAID setup have prevented this failure?

I realize you only have a small sample of drives, but did you notice any higher or lower failure rates from bare drives v. retail packaged?

Joe
peterkrogh
Administrator
Hero Member
Posts: 5682
« Reply #4 on: November 23, 2009, 03:21:34 PM »

Joe,
In many ways, RAID can make this more complicated. Most desktop RAID does not pass SMART data through, so you don't know that you might have failing drives until they really start to fail. If I were having this problem on a RAID device that contained the entire archive, then I might feel obligated to validate the entire archive, instead of the 215,000 DNG files on this one drive.

My sample is not large enough to be relevant, as you point out. (And I don't buy OEM drives.)
Peter
JoeThePhotographer
Full Member
Posts: 208
« Reply #5 on: November 29, 2009, 07:04:32 AM »

Western Digital sells external enclosures that come with two drives inside. It's my understanding that they can be set up in a RAID 1 configuration. If one fails, the other is still good to go. Isn't that the idea?

Joe
peterkrogh
Administrator
Hero Member
Posts: 5682
« Reply #6 on: November 30, 2009, 10:42:38 AM »

Joe,
I'm not sure that would be handled correctly. It's my understanding that most desktop RAID is designed primarily to deal with a failed drive that must be reconstructed, rather than with errors that show up in individual files. So the question is, what does it do when the two drives return different data for the same file? Does it notice the difference and ask the user to determine which copy is correct?

Enterprise systems should be designed to do some data scrubbing - basically working in the background to make sure that the primary copy of the data and the parity data match.
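A very rough, file-level way to picture what a scrub does is sketched below. Real scrubbing happens at the block/parity level inside the controller or file system, and the mirror volume names here are invented.

[code]
import hashlib
from pathlib import Path

# File-level illustration of the scrubbing idea: read both copies and flag
# any pair that no longer matches. Volume names are invented examples.
COPY_A = Path("/Volumes/Mirror_A")
COPY_B = Path("/Volumes/Mirror_B")

def md5_of(path: Path) -> str:
    h = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

for file_a in COPY_A.rglob("*"):
    if not file_a.is_file():
        continue
    file_b = COPY_B / file_a.relative_to(COPY_A)
    if not file_b.exists():
        print(f"missing on the second copy: {file_b}")
    elif md5_of(file_a) != md5_of(file_b):
        print(f"MISMATCH: {file_a.relative_to(COPY_A)} - which copy is right?")
[/code]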

Cheap RAID may not help in these situations.

Peter