Using Maresware to Validate Forensic File Hashes


Before you get into this article, you might read this associated sequence of articles.

Start here:

Inventory/Catalog files  Creating an inventory of evidentiary files
Forensic file copying  Article tests over 40 "forensic" file copiers
Forensic Hashing  Article tests over 30 "forensic" hash programs.
ZIP-IT for forensic retention  Article tests a few zipping programs, and
ZIP_IT_TAKE2  More tests for your zipping capabilities.
ZIP FILE/container  Hashing your zip container reliably
The one you are on: MATCH FILE HASHES  Demonstrates hash matches using Maresware.
A HASH software buffet   How-to use Maresware hash software


 

ABSTRACT

I decided to put this document together because a few days ago, a very intelligent forensic investigator said his co-workers asked how to easily use hashing software to compare hash values from point A to point B. His exact quote was: "A colleague did ask if we can get the tool just to hash source and hash destination, comparing differences without any copying".

So I got to thinking. I know, it's bad for the health. But I was thinking about how many others might at some point have wished they had a simple program or process to do just that. I realize that the large suites can compare hashes, but that involves creating a case, loading the data, etc. etc. etc. Other hashing software can compare the source and destination, but that usually occurs during the actual copy process and/or installation of the software (remember to read my hash test article). So what about a simple process or batch file (that's a script, for you millennials) that could match hashes from point A to point B any time you wish, without creating a case, and without going through a copy process, because you already have the two directories populated? Let's design a simple process that could be performed routinely.

I discuss three processes for validating and/or comparing hash values using a number of different Maresware programs. Don't be overwhelmed by the descriptions or processes. Because they are generic, they can be re-used and modified easily. Know full well that in my previous life, I actually taught internal auditors how to use the software efficiently. So you should have no trouble learning the process.

Also, I remind the reader that the operation of the software used in these descriptions (hash, hashcmp, disksort, total, compare) is extremely generic and can be (and has been) used to process other types of data which an analyst or investigator might generate in day-to-day operations. So when reading about the capabilities of the software, don't restrict your thoughts to merely matching hash values.

One other thing discussed at the end of this article is an alternative to hash matching: a program that will take a hash data file and tell you which hash values have duplicates. hash_dup   Help file, and hash_dup.exe download.


Before we start:
Review   this hash_test_article on testing your hashing software in a forensic/evidentiary environment, and how the hashing software will (notice I didn't say may) fail strict cross examination.

Table of contents of the sections of this article for quick jump:

Article Section:                                                   Help file and executable download
HASHCMP program - simplest but efficient.                          Manual   hashcmp.exe
DISKSORT & TOTAL programs - next, more versatile.                  Manual   disksort.exe   total.exe
COMPARE - extremely configurable process.                          Manual   compare.exe
HASH_DUP - find duplicate hash values within a single hash run.    Help file   hash_dup.exe
and links to: DISKCAT manual and diskcat.exe - a program to provide a complete file listing of the tree.


THE THREE PROCESSES

All the processes and programs described here are demonstrated in simple (basic) batch files located within this zip file: hash_test.zip.
In addition to containing test data, this zip file contains the following batch files:
1. HASH_TEST_DEMO.BAT - Contains all three versions (below) of the process.
2. HASH_TEST_HASHCMP.BAT - Contains only the hashcmp tests of hash_matching.
3. HASH_TEST_DISKSORT.BAT - Contains only the disksort tests of the hash_matching.
4. HASH_TEST_COMPARE.BAT - Contains the compare program tests of the hash_matching.

To make the demonstrations easy, when the zip file is extracted (maintain folder structure) it will create a few sample directories and a small number (about 10) of files within those directories to demonstrate the operation of the batch files. So once the data is extracted, running the batch files from the top level directory that was created will produce sample results showing how each works.

The processes in the above mentioned batch files are generic, so in the real world they may need minor modification depending on your own data layout and needs. All the programs mentioned here can be found via the links mentioned above and on the main home page, along with the appropriate help files. The zip file also contains the necessary versions of Maresware used for each step.

In addition, there is a folder with appropriate files with which to demonstrate the speed of the compare and bsearch programs when comparing MD5 values against the large NSRL MD5 data set. To run the batch file within that directory, you will need to download the NSRL MD5 data set (NSRL_MD5_271_RDS.zip, link at the end of this article) and the bsearch program, and make certain it is in the path. Then follow the instructions in the _README.1st file in the directory.


1. Unzip the zip file to an empty directory, while maintaining folder structure.
2. You will see a batch file (hash_test_demo.bat) in the top level folder once extracted.
3. Set the default command prompt to that top level folder, and run the hash_test_demo.bat

If not already done, the batch will
1. Extract two directories: SRCE and DEST.
2. Place some files in both which will demonstrate the operation of the below described processes.
3. Then display the results of each run.
Once satisfied, to work in the real world (you know, that's where you get a minuscule salary), just change the top level folder names from SRCE and DEST to your own values and rerun the batch for live results.
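For example, a typical demo run from the command prompt might look like this (the extraction folder name is illustrative, not part of the zip):

REM Change to whichever folder you extracted hash_test.zip into.
cd /d C:\hash_test
hash_test_demo.bat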

A very important thing to consider when reading about these processes is that this "hash matching" is done with the understanding that you are in a forensic evidentiary situation where you have two different trees/directories containing files whose hashes you wish to match. AND, this is very important: it is expected or assumed that the way you obtained these two trees was to "copy" one entire tree to the destination tree. This copy process (done outside the forensic suite's analysis) should have resulted in the same file structure in both the source and destination.

I will explain the errors that might be experienced if you do not have the expected identical copy of source to destination. However, we all know that your process is reliable and your copy process produced a valid duplicate file structure. HA HA.

A basic underlying assumption is that these two directory structures were created during your forensic process, and however they were created, you now have an original source (SRCE) tree and a destination (DEST) tree for which you are concerned that the hashes and the number of files in both locations are identical. It's now up to you to develop a process to confirm that the hashes and file counts match.


HASHCMP

The hashcmp program is the simplest way to compare two hash data sets to see which files either match or do not match from one hash run to the other. Hashcmp is the most basic of the three processes to compare two hash outputs.

Although hashcmp was originally designed to operate on the output of the Maresware hash.exe program, with a little thought and understanding of its operation, the hashcmp program can be adapted to process/compare any two files of identical fixed length records that have a common sorted field, such as the MD5 field (for instance, comparing two directory listings to see what matches or not).

Hashcmp takes two "fixed length record" files created with the hash.exe program and compares them either on the entire record length or just the hash value. In most cases you would want to compare ONLY on the hash value. The situation is that you have evidence files in a SOURCE1 directory, and "hopefully" accurate copies of ALL the files in a DESTINATION1 work location. What you want to do is make sure all the hashes in SOURCE and DESTINATION match. Might be a good idea, yes/no?

NOTE: Again, a reminder: hashcmp is designed to compare two identically formatted files on a single field. This generic compare is regulated by the appropriate -d and -l (ell) options. Not described here.

To accomplish this "hash" verification, you first create two files using the hash program. One file has hashes of the files in SOURCE1 (which may be your original evidence store), and the second hash file is created from DESTINATION1, which is (hopefully) an identical copy of the files from SOURCE1.

To do this first step, run the hash program on each of the top level directories of the data. The options used here are basic, and others can be included as needed. But depending on the options used, the output record length/size will change and may necessitate modifications to the options used throughout. Here are the basic hash commands to create the two reference output files. Replace the -p and -f options with appropriate items. (Again, these batch files are contained in the zip file referenced at the top of this article.)

hash  -p x:\source1_folder      -f files_to_hash(usually *.*) -w 300 -d "|" -o  SRCE.out  -R -1 logfile1
hash  -p x:\destination1_folder -f files_to_hash(usually *.*) -w 300 -d "|" -o  DEST.out  -R -1 logfile1

The -w 300 creates an output path field of 300 characters. Since ALL Maresware software works ONLY with fixed length records, this is a reasonable path length to accommodate any long filenames.
The -d "|" adds pipe delimiters to the output record. The hashcmp program expects pipe delimiters (unless certain specific hashcmp options are used, which we will not discuss here).
The -R says do NOT reset the file last access date. This is a precaution in case the registry setting is ON.
The -1 logfile (that's a one, not an ell) is totally optional; it creates a log of the process.

These runs should produce an output record of roughly 388 characters, with the hash value at position 311.
Check out this sample record, with spaces truncated for legibility. Assume C:\TMP holds the original data, and D:\TMP the duplicated data set.
C:\TMP\ZIP_IT.htm | C772D55C42A41B4E6F261F28B8DAA7FF | 12072 | 06/14/2019 09:02:22w EST

Now the 2nd run, of the destination output. Note the different hash value: D7 instead of C7 in the first 2 characters.
D:\TMP\ZIP_IT.htm | D772D55C42A41B4E6F261F28B8DAA7FF | 12072 | 06/14/2019 09:02:22w EST

Now that we have two hash runs available, we can run the hashcmp program.
The generic hashcmp command is:

hashcmp  SRCE.out  DEST.out  -o  mismatch.out  -h 
   or
hashcmp  SRCE.out  DEST.out  -1  -o  mismatch.out  -h
hashcmp  SRCE.out  DEST.out  -2  -o  mismatch.out  -h
As for the 2nd and 3rd versions, I'll leave it to you to figure out the difference between the -1 and -2 options.

What you will get from an actual run is an output file containing references like the following. (I seeded/altered the actual hash value in the output record from C7 to D7 so you can see a representation of the output file; spaces truncated for legibility.)
found in SRCE.out not in DEST.out | C:\TMP\ZIP_IT.htm | C772D55C42A41B4E6F261F28B8DAA7FF  | 12072 ....
found in DEST.out not in SRCE.out | D:\TMP\ZIP_IT.htm | D772D55C42A41B4E6F261F28B8DAA7FF  | 12072 ....
Notice that because the same file has two different hash values, you actually get two records in the output: one referencing the hash value in file1 not in file2, and conversely, one for the value found in file2 not in file1. Figure out for yourself the appropriate hashcmp command option to show only those in file1 or file2 in the output mismatch file.

The final three line batch file (contained in the zip file mentioned at the top of this article) is:

hash     -p  x:\source1_folder        -f  files_to_hash(usually *.*)  -w  300  -d "|"  -o  SRCE.out   -R  -1 logfile1
hash     -p  x:\destination1_folder   -f  files_to_hash(usually *.*)  -w  300  -d "|"  -o  DEST.out   -R  -1 logfile1  
hashcmp  SRCE.out  DEST.out  -o  mismatch.out  -h
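If you want the batch to announce the result, a quick size check on the mismatch file can be appended. This check is not part of the original batch; it is a minimal sketch using only standard batch-file syntax (%%~zF expands to the file size in bytes):

REM An empty mismatch.out is the good outcome; anything else deserves a look.
if exist mismatch.out for %%F in (mismatch.out) do if %%~zF GTR 0 echo WARNING: hash mismatches found - review mismatch.out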

Errors that might be experienced with incomplete or duplicate (hash) directory "copies".

You may be wondering why I delayed this error section until now and didn't place it above the hashcmp section. Maybe because the hashcmp process generally does not experience any of the problems described here, given the file structures used in its examples.

First, let's set out sample output of the two hash runs of the SRCE and DEST tree structures. One is called SRCE (the original file location) and the other DEST (the copied files' location). The first thing you should notice is that there are more files in one directory than in the other. This could be the result of a number of things, not discussed here.

C:\SRCE\D1\ZIP_IT1.htm | C772D55C42A41B4E6F261F28B8DAA7FF | 12072 ....
C:\SRCE\D1\ZIP_IT2.htm | D872D55C42A41B4E6F261F28B8DAA7FF | 12072 ....

C:\SRCE\D2\ZIP_IT1.htm | C772D55C42A41B4E6F261F28B8DAA7FF | 12072 ....
C:\SRCE\D2\ZIP_IT3.htm | E772D55C42A41B4E6F261F28B8DAA7FF | 12072 ....

D:\DEST\D1\ZIP_IT2.htm | D872D55C42A41B4E6F261F28B8DAA7FF | 12072 ....
D:\DEST\D2\ZIP_IT3.htm | F772D55C42A41B4E6F261F28B8DAA7FF | 12072 ....

Situation shown above:
You have identical tree structure from source\tree to destination\tree.
SRCE\D1
SRCE\D2
DEST\D1
DEST\D2
But faulty file copies: C7 was not copied at all, and E7/F7 are the differing (faulty copy) hash values for ZIP_IT3.htm.

In the source D1 you have two files with different hashes, C7 and D8. A normal occurrence. However, in source D2 you have a duplicate of the ZIP_IT1 file (C7) located in the D1 directory. Totally feasible. Take note of this duplicate for later use.

Now, let's look at what the copy process gave us in the DEST tree.
In DEST\D1 we have a duplicate/copy of the ZIP_IT2 (D8) file. A good copy. But we seem to be missing the D1\ZIP_IT1 (C7) file. Perhaps a faulty or missed copy. Regardless, we are missing the ZIP_IT1 file, with the C7 hash, from the source D1 directory.

Now we look at a copy error from source\D2 to destination\D2 of the file ZIP_IT3. In the source D2 it has an E7 hash, but the faulty copy produced the correct name with a hash of F7 in the destination D2 directory. This faulty copy will be looked for later.

So we now possibly have one or more of the following situations, regardless of how we got here:
1. A duplicate hash, C7, in the source tree. Maybe our forensic process merely found the same file in two locations.
2. A good copy, hash D8, from source to destination.
3. A failed/corrupted copy of ZIP_IT3, hash E7 to F7, from source to destination.
4. An incorrect file count: four (4) files in the source tree, while only two (2) ended up in the destination.

These are four situations which could occur, in which case you have to design a process which will locate and point out the possible anomalies described above. Hopefully, when explaining the processes below, I will also explain how each process overcomes these errors.

On the chance that any of these anomalies occurs, the first step might be to confirm that you have the correct number of files in both the source and the destination trees. This is a simple matter of obtaining a listing or catalog of the total number of files in each tree and confirming the counts are the same. diskcat.exe will perform a valid file listing and count. A novel idea!
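For a quick preliminary check of just the raw counts, plain Windows commands will also do; a minimal sketch, using the demo's SRCE/DEST paths as placeholders:

REM Count files (not directories) under each tree; the two numbers should match.
dir /s /b /a-d C:\SRCE | find /c /v ""
dir /s /b /a-d D:\DEST | find /c /v ""

Matching counts alone don't prove the contents match; the hash processes below do that.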


USING DISKSORT and TOTAL

The next process uses the disksort and total programs. Disksort and total are generic programs that perform their respective tasks (sort records, and total/count items). When used to sort on the MD5 and then total/count the number of occurrences of each MD5 value, the process shows whether each MD5 shows up as a unique MD5, or with a count of two (once in SRCE and once in DEST).

If the MD5s of the source and destination copies are the same, then the count for any MD5 will be 2. However, when an MD5 value shows a count of one, this indicates that the MD5 on the source is not the same as that on the destination, resulting in two unique output records, one for each different MD5.

This disksort/total process is a more generic way to find which (unique MD5) values might exist in FILEA and not in FILEB. The disksort/total process can be used on many different data files (but they must be fixed length records; a holdover requirement from my big iron data processing days). Fear not: if you have variable length records, Maresware has software to reform them into fixed length records for processing.

First, as always create a hash of the two directories in question:

hash  -p  x:\source1_folder       -f files_to_hash(usually *.*) -w 300 -d "|" -o source.out      -v -R -1 logfile1
hash  -p  x:\destination1_folder  -f files_to_hash(usually *.*) -w 300 -d "|" -o destination.out -v -R -1 logfile1

Notice this time we added the -v option. The -v option says NO verbose, meaning don't put any headers in the output files. All you get is pure data.
C:\SRCE\setup.bat | 564ACB6605F971C51483C6FE8F70CBD9 | 174 | 06/14/2019 09:02:22w EST

Now we have two files from different locations with the same format. Next we combine them into a single data file using the Windows copy command.
c:copy     /b    SOURCE.OUT + DESTINATION.OUT    COMBINED
This combines both files into a single "COMBINED" file. The /b is generally needed to make sure Windows does what you asked, not what it thinks you need. (How often does that happen?)

Now that we have all the data in a single COMBINED file, we need to sort on the MD5 field. As in the previous example, the MD5 field is located at position 311 and is 32 characters long. So the disksort command to sort the COMBINED data is:

c:disksort    COMBINED    COMBINED.SRT -r 388 -p 311 -l 32

This creates a file (COMBINED.SRT) which is now sorted on the MD5 field.

We use total to count the number of occurrences of each MD5 value. It is assumed that if the same MD5 shows up in both files, then the count for that specific MD5 will be at least 2. If the source and copy MD5s are different, then we will have single instances of non-matching MD5s.

A rough display of the content of the combined data file. Take notice of the hashes of setup2.bat:
C:\SRCE\setup.bat | 564ACB6605F971C51483C6FE8F70CBD9 | 174 | 06/14/2019 09:02:22w EST
D:\DEST\setup.bat | 564ACB6605F971C51483C6FE8F70CBD9 | 174 | 06/14/2019 09:02:22w EST
C:\SRCE\setup2.bat | 564ACB6605F971C51483C6FE8F70CBE9 | 174 | 06/14/2019 09:02:22w EST
D:\DEST\setup2.bat | 564ACB6605F971C51483C6FE8F70CBF9 | 174 | 06/14/2019 09:02:22w EST

After the total is performed you will see something like this in the output:
C:\SRCE\setup.bat | 564ACB6605F971C51483C6FE8F70CBD9 | 174 | 06/14/2019 09:02:22w EST +2
C:\SRCE\setup2.bat | 564ACB6605F971C51483C6FE8F70CBE9 | 174 | 06/14/2019 09:02:22w EST +1
D:\DEST\setup2.bat | 564ACB6605F971C51483C6FE8F70CBF9 | 174 | 06/14/2019 09:02:22w EST +1

Notice that the counts for setup2.bat, whose hash changed during the copy, now reflect a single instance on the C: drive and a single instance on the D: drive. The count for each of those MD5s is only one.

The command for the total program is:
C:total     COMBINED.SRT     COMBINED.CNT -r 388 -p 310 -l 32 -c

To wrap up this segment, the commands to perform this operation are:

hash -p x:\source1_folder      -f files_to_hash(usually *.*) -w 300 -d "|" -o source.out      -v -R -1 logfile1
hash -p x:\destination1_folder -f files_to_hash(usually *.*) -w 300 -d "|" -o destination.out -v -R -1 logfile1
copy /b    source.out  + destination.out   COMBINED    
disksort   COMBINED    COMBINED.SRT    -r 388  -p 311 -l 32 
total                  COMBINED.SRT   COMBINED.CNT  -r 388  -p 310  -l 32 -c
For efficiency I usually add a grep step after total to show me ONLY the items that have a count of +1:
grep +1 combined.cnt > mismatched_items
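If grep isn't installed, the Windows-native findstr performs the same literal search:

REM Native equivalent of the grep filter above.
findstr /C:"+1" combined.cnt > mismatched_items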

Since this process is super generic, you can swap in new source folders for the hash runs and routinely, consistently find items where hashes do not match.
The combination of disksort and total provides a lot of possibilities when you need counts from other data sets.

Errors that might be experienced with incomplete directory "copies".

A situation similar to one of the four errors previously identified, which may produce an error in this process, is the following:
you have two copies of an identical file in the source and no copy of that file in the destination.
sourceD1\ZIP_IT1.htm ABCD
sourceD2\ZIP_IT1.htm ABCD

destinD1\ZIP_IT3.htm 1234

Notice you have two files in the source with identical hashes, and a separate, different file in the destination. These two identical files can reside anywhere within the source tree, with either the same name in different directories (as shown in the above example) or different names in the same directory. This duplicate hash in the source will cause total to produce a report that is not totally correct. It will have to be remedied.

When this process is run, the total.exe counter will show a count of two for the ABCD hash, which is correct, BUT no copy of that file exists in the destination. And it will properly display ZIP_IT3 (1234) as a unique hash in the destination. This error, missing the fact that neither source copy of ZIP_IT1 shows up in the destination, needs to be addressed.

A sample output representation after the total run:
sourceD1\ZIP_IT1.htm ABCD +2
destinD1\ZIP_IT3.htm 1234 +1

However, ZIP_IT1 doesn't even exist in the destination location, yet its hash shows a count of two instances in the total run, which is BAD. So how do we overcome this inconsistency?

The simplest way to overcome this inconsistency in the output is to reduce both the source and destination hash files to unique items. That way, when the total is performed, you get the correct counts.
To modify the above batch file to correct for this inconsistency, we sort BOTH the original source and original destination hash outputs, and ask the disksort program to -u (unique) the values. This unique option leaves only a single instance of each hash value found in the respective tree. Then, when the two sorted files are merged and sorted again, you end up with the correct number of hash values per tree. Here is the corrected batch:

hash -p x:\source1_folder      -f files_to_hash(usually *.*) -w 300 -d "|" -o source.out      -v -R -1 logfile1
hash -p x:\destination1_folder -f files_to_hash(usually *.*) -w 300 -d "|" -o destination.out -v -R -1 logfile1
disksort   source.out      source.srt      -r 388  -p 311 -l 32  -u
disksort   destination.out destination.srt -r 388  -p 311 -l 32  -u 
copy /b    source.srt  + destination.srt   COMBINED    
disksort   COMBINED    COMBINED.SRT    -r 388  -p 311 -l 32 
total                  COMBINED.SRT   COMBINED.CNT  -r 388  -p 310  -l 32 -c
Notice, we decided to sort (and unique) both hash outputs before we merge/copy them into a single file. This sorting and uniquing leaves ONLY a single instance of each hash value in each of the hash files. Then, when they are combined and re-sorted on the MD5, you end up with the correct counts from the total.exe program. This refined process is the preferred process. Even though it requires two additional sort runs, it reduces any duplicates in the hash files to single instances.
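If you also want to know whether the -u pass actually dropped anything (i.e., whether a tree itself contained duplicate hashes), compare the record counts before and after the unique sort. A minimal sketch, assuming the hash records are newline-terminated text lines as the samples above suggest:

REM A smaller count after disksort -u means that tree held duplicate hash values.
find /c /v "" < source.out
find /c /v "" < source.srt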


USING COMPARE

The compare program is the most generic of the three and can be used to compare any two files which are sorted on a common field.

The data files you are comparing DO NOT need to have the same record format, BUT they do need to be fixed length records, which is easy for the Maresware software to create. For this reason, the compare program is ideal for comparing the NSRL data sets (below) with general outputs of the hash program. However, I will leave it to you to formulate the correct command line options for that process. For this segment, we will continue to use the two basic outputs of the hash command and provide them to the compare program to find any "MIS"matches between source and destination.

This next program, compare, because it can operate on two totally differently formatted files, needs you, the user, to provide it with some additional information about each of the input files. Are you ready to take on this responsibility?

The item which provides compare.exe its information on how to compare the two files, and which field(s) to compare on, is called a parameter file. This parameter file tells compare the parameters (formats) of the files. It provides:

the record lengths of each of the two input files (ie: 388 and 388; they could be different)
the location of the sorted field in fileA (ie: 310)
the location of the sorted field in fileB (ie: 310)
the length of the sorted field (in this case 32)
additional line(s) to tell the program, when it finds a matching record in the two files, how to build a new output record which can contain data (fields) from each of the "matching" source records. Because in most instances we will have FILE1 and FILE2 with a matching MD5, there is no reason why we can't take fields from each of these "matching" records and build a new output record, with data from both inputs. BUT BUT BUT: when there is no matching record in FILE1, the only place to get output data is from the FILE2 record which has the unmatched MD5.

A real life parameter file provided to compare for this process of comparing data from two hash outputs would look like:

5           a dummy place holder, held over from tape processing in my prior life 
388         (fixed) record length of the A file 
38800       block size of the A file (again a tape holdover). A multiple of the record length, max of 64k.
388         (fixed) record length of the B file. in this case identical, but can be totally different format.
3880        block size of the B file (again a tape holdover). A multiple of the record length, max of 64k.
310         displacement (from 0) of the location of the sorted field for FILE1. Both files sorted on their respective key field.
310         displacement (from 0) of the location of the sorted field for FILE2. Can be different in real life.
32          length of the field to compare. this is the sorted field length. obviously same for both files.
B000=388    From here on, multiple lines to build the new output record when a match, or in our case a mismatch is found.

The B000=388 line tells the program: when a match (or mismatch) is found, take 388 characters from the B file starting at displacement 000 to build the new output record. Since we have designed this run to look for mismatches, only the mismatched data file will provide output. However, in real life, when a match is found on the key field, since you have both an A and a B record, you can build the output record with data from both files. To learn how to do this, RTFM.
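For instance, on a matching run you might build each output record from the full A record followed by just the B file's path field. A hypothetical illustration of such build lines (the A-prefix syntax is my assumption, extrapolated from the B000=388 example; RTFM for the exact rules):

A000=388    the entire A record: displacement 000, 388 characters
B000=300    followed by the B record's 300-character path field from displacement 000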

Because we may end up with the same problem of having multiple occurrences of a file in the SRCE tree and none in the destination, we must perform a test to see not only which files/hashes show up in the destination and not in the source (meaning a faulty hash), but also which files show up in the source and not in the destination (two files with the same hash in the source, neither of which copied to the destination). To do this, we run the "unequal" compare in two directions. Just follow the batch below, and see if you can figure out the logic. It works.

So the simple batch file to run this more generic compare process is:

hash  -p  x:\source1_folder        -f  files_to_hash(usually *.*)  -w  300  -d "|"  -o  source.out      -v  -R  -1 logfile1
hash  -p  x:\destination1_folder   -f  files_to_hash(usually *.*)  -w  300  -d "|"  -o  destination.out -v  -R  -1 logfile1
disksort  source.out        source.srt       -r 388 -p 311 -l 32  -u
disksort  destination.out   destination.srt  -r 388 -p 311 -l 32  -u
compare   source.srt        destination.srt  mismatched.out  compare.par      -u
compare   destination.srt   source.srt       mismatches_src.out  compare.par  -u

The compare.par file referenced above is the parameter file we built.
The -u option tells the program we want only unequal values in the output. The two compare runs:
1. The first shows hash values in the destination not in the source (assume a bad copy; this is our ultimate goal).
2. The second shows hash values in the source not in the destination (assume no copy at all).

Again, super generic file comparing on a sorted key field. This process can easily be duplicated for routine work by merely altering the content of compare.par (the parameter file) to tell the compare program the makeup of the input data records.

Notice that we sort and unique the items before running the compare. This accomplishes the same uniquing of the hash values in the source and destination files as was done in the revised batch of the disksort/total run above.

It is always suggested, when performing these batches, that the source and destination hash files be uniqued on the hash value before the compare runs. This will ensure a valid result from the process.


Using compare on NSRL data

Also, how many of you have often thought of trying to compare your hash values with the known NSRL data sets?
I have re-processed the NSRL (v 2.71, circa Dec. 2020) data sets into single "uniqued" files which may be of assistance:
NSRL_MD5_271_RDS.zip
NSRL_MD5_IOS_271.zip
NSRL_MD5_271_ANDROID.zip
NSRL_MD5_270.zip
NSRL_MD5_267.zip
NSRL_MD5_V267-271.7z.zip
NSRL_SHA_267.7z.zip
NSRL_SHA_ANDROID_267.zip
NSRL_SHA_IOS_267.zip
(note: those with a .7z.zip extension are named zip only to allow for browser download. rename to .7z and use 7-zip to unpack)
You may wish to download and use these in your investigations. The full article is here: NSRL_HASH   article.

Using Maresware compare.exe to compare your hash values with the NSRL items above requires a slight modification of the compare.exe process explained above, which I will explain below. With a little practice and knowledge of what/how the software works, it will be easy to modify the "compare" batch files to accommodate any of the above data files.

In simple terms, here is the process:
Hash the suspect folder files.
Sort the data file of the hashed values on the MD5 field.
Compare the NSRL data with the hashes of your suspect files.
The unequal compare (-u) will show the files in your list that are NOT in the known NSRL data set.

c:hash  -p s:\suspect_folder  -o suspect_hash.out -d "|" -w 300 -v -R         (hash the suspect data directory)
c:disksort   suspect_hash.out suspect_hash.srt  -r 388 -p 311 -l 32           (sort on the MD5 value)
c:compare    MD5_271_RDS      suspect_hash.srt  not_on_nsrl_md5s   compare.par -u  (perform unequal compare)

The output will contain records for the files in your suspect folder that are NOT part of the known NSRL set.
The only thing that requires some design is the content of the compare.par file. If you need help designing that file, let me know.
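As a starting point, here is a hypothetical compare.par for the NSRL run, following the same layout as the parameter file above. The NSRL record length and MD5 displacement are placeholders (NNN/MMM); verify them against the layout of the downloaded NSRL file before use:

5            dummy place holder, as before
NNN          record length of the NSRL MD5 file (placeholder - check the actual file layout)
NNN0         block size of the NSRL file: a multiple of its record length, max of 64k
388          record length of your suspect_hash.srt file
38800        block size of the suspect hash file
MMM          displacement (from 0) of the MD5 field in the NSRL file (placeholder)
310          displacement (from 0) of the MD5 field in the suspect hash file
32           length of the MD5 compare field
B000=388     on the unequal (-u) run, output the entire suspect (B) record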




HASH_DUP

A short discussion here, because this is not about matching two hash data files; it deals with telling you which records in a single data file contain duplicate hash values.

The situation is that you conduct a hash run (it could be on a single tree, c:\tmp\..., or different trees, c:\tmp and d:\tmp) and place the output into a single hash data file. Then you wish to see which files in the data file have duplicate hashes. These duplicates could exist because you have the same file under two different names in the same directory, or the same file located in two different directories. (I can't tell you how often that has occurred on my file system.)

Anyway, to make this discussion short and sweet: the hash_dup program takes an input hash data file and produces an output file containing the filenames/paths of the files with duplicate hashes. This output file contains ONLY indications/records of the duplicate hash values. Then you can do what you wish with the results. I usually edit the duplicates file to remove one line of each duplicate set (to save a single instance of each hashed file), then use the rm command with the -S option to remove the remaining duplicates. If you wish more explanation of this duplicate-file removal, let me know.
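Pulled together, the de-duplication workflow might look like the sketch below. The hash_dup command line is an assumption based on the description above (hash data file in, duplicates report out); verify the exact syntax, and that of rm -S, against their help files:

REM 1. Hash the tree(s) into a single data file (paths illustrative).
hash -p c:\tmp -f *.* -w 300 -d "|" -o all_hashes.out -v -R
REM 2. Report records whose hash value is duplicated (syntax assumed - see the hash_dup help file).
hash_dup all_hashes.out duplicates.out
REM 3. Hand-edit duplicates.out so only the copies you want DELETED remain,
REM    then feed it to the Maresware rm command with its -S option to remove those files.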

The hash_dup   Help file, and hash_dup.exe download.

 

Take a look at these related articles.

Inventory/Catalog files  Creating an inventory of evidentiary files
Forensic file copying  Article tests over 40 "forensic" file copiers
Forensic Hashing  Article tests over 30 "forensic" hash programs.
ZIP-IT for forensic retention  Article tests a few zipping programs, and
ZIP_IT_TAKE2  More tests for your zipping capabilities.
A HASH software buffet   How-to use Maresware hash software

 


copyright © 2021-2023 by Dan Mares