The DAM Forum
Topic: New ImageVerifier beta with hash checking (long post)
Marc Rochkind
« on: March 16, 2007, 05:40:05 PM »

Previous versions of ImageVerifier did what I now call structure checking, which is verifying the image file by reading through its various structures and decompressing any compressed image data, looking for errors. As I indicated in earlier posts, this can be effective in finding damage if the damage is large and/or the image is compressed. For highly compressed images like JPEGs, damage detection is very good. It's not so good for uncompressed raws, such as the DNGs that come straight from a Leica M8. It's better for compressed DNGs, but not as good as it is for JPEGs.

Another approach entirely is what I call hash checking: for each image known to be good, you maintain a fixed-length hash computed from all the bytes in the file, such that it's unlikely that two different files will produce the same hash. (Not impossible, since the hash is of fixed length and the number of possible image files is infinite.) If the two files are the good one and a copy (or even the original) that's been damaged, then comparing the hashes of the two files will show that the files are not the same.

Comparing the actual files is even better, but in the case of a single file that's been damaged you don't have two files. All you have is the damaged file and the hash from when it used to be good. Also, reading one file to compute its hash takes half as long as reading two files.
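As a rough illustration of the idea (not IV's actual code), the sketch below computes a hash over every byte of a file and compares it with a hash recorded earlier. It assumes SHA-512, the algorithm named later in this thread; the function names `file_hash` and `matches_stored_hash` are made up for the example.

```python
import hashlib


def file_hash(path, chunk_size=1024 * 1024):
    """Return the SHA-512 hex digest of the file at `path`, read in chunks."""
    h = hashlib.sha512()
    with open(path, "rb") as f:
        # Read in fixed-size chunks so large image files need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def matches_stored_hash(path, stored_hex):
    """True if the file's current hash equals the hash recorded when it was good."""
    return file_hash(path) == stored_hex
```

Only the one file is read; the stored hash stands in for the good copy you no longer have.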

The nice thing about structure checking is that no bookkeeping is involved--each file stands on its own. Hash checking, however, does create complications because you need to put the hash somewhere, and you need a way of associating the image with its hash. This is easy for a DAM system that controls all the assets, but much harder with a little passive utility like ImageVerifier. Putting the hash inside the file is one approach, but this has two problems: It's safe only for certain formats for which it's allowed, such as DNG, and it requires IV to write into the file, which I don't want to do because it raises the possibility of damage to the file during verification and because many photographers don't want to use any utilities that write into their files.

So, here's the scheme that IV uses: For each file, a key is generated that's rich enough so that two different images won't have the same key. The key is the concatenation of the filename (not the path, just the last component), the size, the modification date/time of the file, the EXIF DateTimeDigitized, the EXIF SubSecTimeDigitized, and the DateTime from the image header. It's still possible for two different images to have identical keys, but the worst that will happen in that case is that IV will erroneously say that they are different, and then later you can determine that they are not.

After the key is generated, a 512-bit hash is computed from the entire file, and both are stored in the IV database. Note that the location of the file--what folder it's in--is stored for reference purposes, but plays no role at all in associating keys (and therefore files) and their hashes.

Here's an example key:

DSC_0003.JPG  -  2591469  -  2007-02-05 22:56:43  -  2004:06:23 18:59:27.70  -  2007:02:03 14:00:20

The corresponding example hash is a 128-character string of hex digits.
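For illustration, here is a sketch of how a key like the one above could be assembled. The separator and the way the sub-second value is appended are inferred from the example key, not taken from IV's source. The function name and parameters are hypothetical, and the EXIF values are assumed to have been read already by some other tool.

```python
import os


def make_key(path, size, mtime, exif_digitized, exif_subsec, header_datetime):
    """Build a key from the components listed above.

    All date/time arguments are plain strings, already formatted.
    Only the last path component is used, so the folder plays no role.
    """
    filename = os.path.basename(path)
    # Assumption: SubSecTimeDigitized is appended to DateTimeDigitized after a dot.
    digitized = f"{exif_digitized}.{exif_subsec}" if exif_subsec else exif_digitized
    return "  -  ".join([filename, str(size), mtime, digitized, header_datetime])


key = make_key(
    "/photos/2007/DSC_0003.JPG",
    2591469,
    "2007-02-05 22:56:43",
    "2004:06:23 18:59:27",
    "70",
    "2007:02:03 14:00:20",
)
```

Because the folder is dropped, the same file in a backup location produces the same key.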

If the file is copied to a backup, or moved to another folder, IV will still find its hash, as long as none of the components of the key have changed. (Copying normally doesn't change the modification date, and you need to be careful not to rename your files when you back them up. Once a file is named, ideally during ingestion, it ought to keep that name forever.)
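The lookup-by-key behavior can be sketched with SQLite, which is what the IV database uses. The table layout below is purely hypothetical (IV's real schema is not described here), and for simplicity it stores the folder on every row instead of sharing it among the images in that folder.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open (or create) a database with a hypothetical key -> hash table."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS hashes ("
        " key TEXT PRIMARY KEY,"
        " hash TEXT NOT NULL,"
        " folder TEXT)"  # kept for reference only, never used for lookup
    )
    return db


def store_hash(db, key, hash_hex, folder):
    """Record a hash; an existing entry with the same key is overwritten."""
    db.execute(
        "INSERT OR REPLACE INTO hashes (key, hash, folder) VALUES (?, ?, ?)",
        (key, hash_hex, folder),
    )
    db.commit()


def lookup_hash(db, key):
    """Return the stored hash for `key`, or None if there is none."""
    row = db.execute("SELECT hash FROM hashes WHERE key = ?", (key,)).fetchone()
    return row[0] if row else None
```

Since `lookup_hash` never looks at `folder`, a copy of the file in any other location finds the same stored hash as long as its key is unchanged.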

A key and hash take up around 150 bytes in the database, depending on the length of the key. (Recall that the hash represents all the bytes in the file, which for most of us is around 10MB or more per image.) In addition, there's space needed for the parent folder of each image, but these are shared by all the images in that folder, so if you have, say, 50 images in a folder, storing the parent path adds only around 1 or 2 bytes per image. Doing the math, the space to store the keys and hashes for 200,000 images is only about 30MB, which is about the space of 3 images. A lot of overhead space is also needed for indexing, so the 30MB should perhaps be tripled. Still, the space cost is reasonable.
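The arithmetic behind those figures can be checked with a few lines. The sketch assumes 1MB = 1,000,000 bytes and, for the folder cost, a hypothetical 75-byte parent path; neither number comes from IV itself.

```python
def estimate_db_size(images, bytes_per_entry=150, index_factor=3):
    """Return (raw_mb, with_index_mb) for the given number of images."""
    raw_mb = images * bytes_per_entry / 1_000_000
    return raw_mb, raw_mb * index_factor


def folder_cost_per_image(path_bytes, images_per_folder):
    """Bytes of parent-path storage attributable to each image in a folder."""
    return path_bytes / images_per_folder


raw_mb, with_index_mb = estimate_db_size(200_000)   # 30.0 and 90.0
per_image = folder_cost_per_image(75, 50)           # 1.5
```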

When you run this new version of IV, you have the option of doing a hash check, a structure check, or both. To run a hash check, you must have previously generated the hashes for the files you want to check. If during a hash check no stored hash can be found for an image (that is, no match for the key), you get a failure message that's different from the message you get when the hashes fail to match.
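A hash check therefore has three possible outcomes, which the sketch below separates. It uses a plain dictionary in place of the database and takes the file's bytes directly; the names `Result` and `hash_check` are invented for the example, and the messages are not IV's actual wording.

```python
import hashlib
from enum import Enum


class Result(Enum):
    OK = "hash matches"
    MISMATCH = "hash does not match the stored hash"
    NO_STORED_HASH = "no stored hash found for this key"


def hash_check(key, data, stored):
    """Check `data` (the file's bytes) against the hash stored under `key`."""
    expected = stored.get(key)
    if expected is None:
        # No match for the key: reported separately from a mismatch.
        return Result.NO_STORED_HASH
    actual = hashlib.sha512(data).hexdigest()
    return Result.OK if actual == expected else Result.MISMATCH
```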

The hash check is pretty fast: Less than a half-second per image, compared to 2 - 4 seconds for a structure check.

In practice, the workflow is something like this: You take a folder of images known to be good and run IV on them with the Store Hashes checkbox checked. This automatically checks the Check Structure checkbox, because a hash is stored in the database only if the file passed the structure check. Then, after the hashes are stored, you can run a hash check on the same files now, or a year from now, or 5 years from now.

Here's a simple test I ran: I stored hashes from a folder of JPEGs, and then took one of them, copied it to another folder, and wrote a single "x" byte into it, 123 bytes from the end, being careful to reset the modification time back to its original value. The resulting damaged JPEG passed the structure check, but it failed the hash check.
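A test along these lines can be reproduced with a short script. This is a sketch of the same kind of experiment, not the procedure actually used; the helper names are hypothetical, and it reads the whole file into memory for brevity.

```python
import hashlib
import os


def sha512_of(path):
    """SHA-512 hex digest of a file (whole file read at once, for brevity)."""
    with open(path, "rb") as f:
        return hashlib.sha512(f.read()).hexdigest()


def damage_preserving_mtime(path, offset_from_end=123, byte=b"x"):
    """Overwrite one byte near the end of the file, then restore its timestamps."""
    st = os.stat(path)
    with open(path, "r+b") as f:
        f.seek(-offset_from_end, os.SEEK_END)
        f.write(byte)
    # Put the access and modification times back so the key components are unchanged.
    os.utime(path, ns=(st.st_atime_ns, st.st_mtime_ns))
```

The size and modification time are the same afterward, so the key still matches, but the hash of the damaged file differs from the stored one (unless the overwritten byte already happened to be "x").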

I think you'll find IV very satisfying to use. It seems to magically just remember all your images and what their bytes are supposed to be, and then dips into its memory whenever it sees a repeat image. All this works no matter what folder structure you use, how many backups you have, or how often you rearrange things. (Just don't change the file names.)

Administration of the hashes is limited in this version, but they're not supposed to be anything you need to tend to. There's a panel you can open up to see the folders whose images were hashed, along with a list of the keys for those images. You can purge a single folder, although you don't have to, since storing new hashes with the same keys will overwrite the old hashes.

You can try this new version of IV out for yourself. As usual, let me know how it works for you.

--Marc

peterkrogh
« Reply #1 on: March 16, 2007, 09:59:48 PM »

Marc,
This is very, very cool.
Is the database of hashes fully exposed for people like John to export to DAM software, once it's ready to make use of the information?  Would it need to be invoked by IV, or is hash checking a standard operation?
This is one of the most important steps in the preservation of digital photography to come along in a while, IMO.
Peter
Marc Rochkind
« Reply #2 on: March 16, 2007, 10:30:37 PM »

Peter--

Thanks for the encouraging words!

There's nothing another program could do with the hashes, as they're not standardized, and neither are my keys. What I have in the database is entirely open (SQLite3, used by IIP, LR, Aperture, and lots of other programs). If at some point any vendor wanted to talk about some standardized, or at least common, key and hash algorithms, I would welcome that. The hash uses the standardized SHA-512 algorithm, and the actual implementation is in some open source code that's available to anyone.

Let me know if you get a chance to try it out.

--Marc