SAMPLE DATA


Purpose

The purpose of this page is to provide one-stop shopping for Maresware data files that might be useful in your operations. Because of their size and/or content, some of the files are 7-zip files which have been given a .zip extension so that the browser will download them.

Once downloaded, unzip using the appropriate software. Some files were created with 7-zip because other zipping software failed to perform properly, and were then renamed with a .zip extension so that browsers would download them as zip files. In these cases, just rename the file to .7z and use 7-zip to unzip it. If a file is a 7-zip and you use other unzipping software, you may lose some of the data (forensically speaking).
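If you are not sure whether a downloaded .zip is really a 7-zip file, the first few bytes will tell you. The following is a minimal Python sketch, not a Maresware tool; the function names are made up for illustration. It relies only on the well-known file signatures: 7-zip archives start with the bytes 37 7A BC AF 27 1C, and ordinary non-empty zip files start with "PK" 03 04.

```python
from pathlib import Path

SEVENZIP_MAGIC = bytes.fromhex("377ABCAF271C")  # "7z" followed by BC AF 27 1C
ZIP_MAGIC = b"PK\x03\x04"                        # local file header of a non-empty zip


def sniff_archive(path):
    """Return '7z', 'zip', or 'unknown' based on the first bytes of the file."""
    with open(path, "rb") as fh:
        head = fh.read(6)
    if head.startswith(SEVENZIP_MAGIC):
        return "7z"
    if head.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"


def fix_extension(path):
    """Rename a mislabeled 7-zip file from .zip to .7z and return the final path."""
    p = Path(path)
    if p.suffix.lower() == ".zip" and sniff_archive(p) == "7z":
        target = p.with_suffix(".7z")
        p.rename(target)
        return target
    return p  # already correctly named, or not an archive we recognize
```

Calling fix_extension on a downloaded file leaves real zip files alone and renames only those that carry the 7-zip signature.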

In some of the sections below, you will see links to TEST_SUITE, a self-extracting WinRAR executable that contains a number of directories holding data and batch files that demonstrate the operation of the program being mentioned. To extract the files, set up an empty directory (e.g., X:\TEST_STUFF), place the executable in that directory, and then run TEST_SUITE.exe from the command line. It will extract a number of folders, most of which have self-explanatory names. The folders D1 and D2 contain the test data files, while the other folders contain the executables, the batch files that run them, an output file directory, and a directory of supporting files for the executables. Once extracted, the _READ_ME.TXT file explains what is contained in the extracted folders and how to run the batch files.


Long Filename Files

The file TEST_SUITE contains sample files and a batch file (script) which will allow you to create folders with files and directories whose filenames are longer than 255 characters. Once extracted, the D1 folder holds the long filename files. See if your software can find ALL the files (including alternate data streams) located in this tree.

Many stand-alone programs which perform file system recursion have a very basic flaw: they cannot process files whose paths/filenames are longer than 255 characters. With current operating systems, and people creating filenames that read like "War and Peace," it is not uncommon for files to be located in paths which are longer than 255 characters.

I have tested a number of these "forensic" programs to see if they can find and process files which fit this description. Many of the stand-alone programs which are supposed to recurse/traverse a directory fail at the 255 character limit. Maresware programs have been specifically coded to find these files.

Download this executable and run it from an empty directory. The D1 folder is the one that contains paths greater than 255 characters. Then see if your recursion programs can find and process these files.
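For a quick independent check of the extracted tree, here is a rough Python sketch (not Maresware code; the function names are hypothetical) that walks a directory and reports every file whose full path exceeds a given length. On Windows it adds the extended-length path prefix so that the standard library calls are not stopped by the MAX_PATH limit. It does not look at alternate data streams.

```python
import os
import sys


def extended(path):
    """On Windows, return the extended-length form of path; elsewhere, the absolute path."""
    path = os.path.abspath(path)
    if sys.platform == "win32" and not path.startswith("\\\\?\\"):
        if path.startswith("\\\\"):
            # UNC shares use a slightly different extended prefix.
            return "\\\\?\\UNC\\" + path[2:]
        return "\\\\?\\" + path
    return path


def find_long_paths(root, limit=255):
    """Yield every file under root whose full path is longer than limit characters."""
    for dirpath, _dirnames, filenames in os.walk(extended(root)):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if len(full) > limit:
                yield full
```

Note that on Windows the reported length includes the four-character prefix. Counting the results of find_long_paths on the D1 folder gives a number to compare against what your own recursion tools report.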


NSRL GENERAL

The main NIST NSRL page (NSRL-NIST overview) can be found here. This page is a very important read, and is where you can download the RDS hash sets. Since March 2022, they are large zip files (the single "MODERN" set is 67G+) which need to be extracted.

You can also check this page on my website for another overview of the NSRL data file processing. On this page you will also find download links to the MD5|SHA records which I have processed and made available for download. There are four zip files containing the 175+ million sorted MD5|SHA records. You need to recombine them to a single data file. If you need help re-combining them, let me know.

Starting in October 2021, I began to re-process and merge older versions of the NSRL data files which I have been accumulating since before 2017. These versions range from version 2.58 to ver 2.75, which was current as of Dec. 2021. However, I have not maintained all of them. See the list below of which versions I have saved.

Fortunately, or UNFORTUNATELY, as of March 2022 NIST reformatted the data set to be provided as an SQLite data base.

I took this new data set and extracted the MD5|SHA1 values, uniqued them from 552,037,702 records to 43,262,580, and merged them with 174,391,680 unique records from the older sets which I have mentioned below.
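The unique-and-merge step can be pictured with a small Python sketch. This is only an illustration of the general technique, assuming each input is already sorted; it is not the Maresware software used for the real processing, and merge_unique is a made-up name.

```python
import heapq


def merge_unique(*sorted_inputs):
    """Merge already-sorted iterables of records and drop exact duplicates."""
    previous = None
    for record in heapq.merge(*sorted_inputs):
        if record != previous:  # duplicates are adjacent once the stream is sorted
            yield record
            previous = record
```

Because heapq.merge is lazy, the inputs can be open file objects, so the full data sets never have to fit in memory at once.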

The NSRL data files date as far back as before 2009. Unfortunately I have only maintained the MD5 data for some of the older files. But, if I read the NIST website correctly, all the old data should be duplicated in the more current NSRL data bases. Also, up to the older 2.75 version I have included the NIST: LEGACY data set. However, NIST advised that even though its name is LEGACY, its contents can change from version to version. Go figure. SO: What I have done is take all the available versions in my inventory (Legacy:2021, 231, 258, 260, 262, 265, 270, 271, 273, 274, 275) and extract out the MD5|SHA values as they appear in the data files. A sample of the original NIST comma delimited record is shown below. For the column headers review the NIST documentation.

This is the legacy format:
"00000142988AFA836117B1B572FAE4713F200567","9B3702B0E788C6D62996392FE3C9786A","05E566DF","J0180794.JPG",32768,653,"358",""

This is a sample of the new format. Notice that the SHA256 is now added. The record is wrapped at the MD5 for legibility:
5C5C5124E85B06F7389FA37C1300A52464482911D834C973AD87A9D3FB607737|A971ECF3D3EFB1848C114036AE3158DA5BE0D2CF|
C63810FED5764442C8064E8234266C3D|205657^^zcs-NETWORK-8.8.15_GA_3953.RHEL8_64.20200629025823.tgz|472620260|234823
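To make the two layouts concrete, here is a hedged Python sketch that pulls the MD5|SHA1 pair out of each sample record. The field positions are read off the samples (SHA1 then MD5 in the legacy CSV; SHA256, SHA1, MD5 as the first three pipe-delimited fields in the new format); check the NIST documentation before relying on them. The function names are hypothetical.

```python
import csv


def pair_from_legacy(line):
    """Legacy comma-delimited record: column 0 is the SHA1, column 1 is the MD5."""
    row = next(csv.reader([line]))
    sha1, md5 = row[0], row[1]
    return f"{md5}|{sha1}"


def pair_from_new(line):
    """New pipe-delimited record: the first three fields are SHA256, SHA1, MD5."""
    _sha256, sha1, md5 = line.split("|")[:3]
    return f"{md5}|{sha1}"


legacy = ('"00000142988AFA836117B1B572FAE4713F200567",'
          '"9B3702B0E788C6D62996392FE3C9786A","05E566DF","J0180794.JPG",32768,653,"358",""')
print(pair_from_legacy(legacy))
```

The remaining fields of the new format are ignored here because only the hash pair is needed for the combined list.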

NOTE: all references to the new SQL data set at this time DO NOT INCLUDE any of the ANDROID or IOS versions. I merely used the 552,037,702 MODERN records for this new SQL processing.

To restate: with the recent versions, NIST has started to separate the files into IOS, ANDROID, and MODERN/RDS. MODERN (or, as I call it, RDS) holds the more generally available files found on most Windows machines. The others, IOS and ANDROID, are self-explanatory. However, since files can possibly migrate from one grouping to another, I have ultimately combined ALL the sets prior to the SQL version (LEGACY, IOS, ANDROID, MODERN/RDS) into a single data file containing the MD5|SHA values of ALL the NIST files. I have also uniqued all the MD5|SHA pairs, which results in a single instance of each pair. You can then take this list and massage it to your heart's content, or import it into any forensic engine capable of importing the sorted pairs. Have fun. After all, initially all you want to know is whether a file is in the data base. Because the full data file content (i.e., name, company, size, etc.) creates a very, very large data base, I have not made it available. However, if you find you really need the full pedigree of a file, get me the MD5 and I will attempt to find the rest of the record.

Also, be reminded that the files contained in these lists are not guaranteed to be GOOD files. They are just referenced as KNOWN files, which means that some of them may be bad files, viruses, etc. So read the NIST documentation to become familiar with the actual files included in the data bases.

Quick, confusing summary:
As of March 2022, NIST started a new SQL format. All my processing prior to that contains the MODERN, IOS, and ANDROID sets, which total just over 174 million unique MD5s. Then, from the SQL version which NIST calls RDS_2022.03.1_modern, I have at this time (March 2022) only included in the overall total the 43 million MODERN/SQL files.


My analysis and formatting of the lists.

To restate, because the original data files were so large, I have decided to only extract for this operation the MD5|SHA pair. If you need the full record, let me know. dm @ dmares . com

The final MD5|SHA pair in my data file (I swapped the SHA and MD5 from the original format) looks like the sample shown below. It is a fixed length record of 75 characters with the MD5 first, then the SHA. BUT the final data file DOES NOT have headers. So make sure that when you import the data to your forensic tool, it can figure out which is which. If it can't, you may want to find another tool. Also, if your program needs only the single MD5 or SHA field, figure out how to do that, or contact me for the appropriate software to split the MD5 from the MD5|SHA data records, of which there are over 175 million. Think Excel can do that efficiently?
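As one way to split the fields without Maresware software, here is a small Python sketch. It assumes the 75 characters break down as 32 (MD5) + 1 (pipe) + 40 (SHA1) + 2 (carriage return and line feed); that breakdown is an inference from the stated record length, not something confirmed here, so verify it against the actual file. The function names are hypothetical.

```python
RECORD_LEN = 75            # per the text; ASSUMED to be 32 + 1 + 40 + CR LF
MD5_LEN, SHA1_LEN = 32, 40


def read_pairs(path):
    """Yield (md5, sha1) tuples from a headerless fixed-length record file."""
    with open(path, "rb") as fh:
        while True:
            rec = fh.read(RECORD_LEN)
            if len(rec) < RECORD_LEN:
                break  # end of file (or a truncated trailing record)
            md5 = rec[:MD5_LEN].decode("ascii")
            sha1 = rec[MD5_LEN + 1:MD5_LEN + 1 + SHA1_LEN].decode("ascii")
            yield md5, sha1


def write_md5_only(src, dst):
    """Write just the MD5 column of src to dst, one value per CR LF terminated line."""
    with open(dst, "wb") as out:
        for md5, _sha1 in read_pairs(src):
            out.write(md5.encode("ascii") + b"\r\n")
```

Reading fixed-size blocks avoids any line parsing, which matters when the file holds well over a hundred million records.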

And this one is for you collision/collusion advocates. Of the total number of MD5|SHA values in the 2022 version, I did find a single collision. But the logic of the format has me totally baffled, and I contacted NIST for an explanation. So, in this new data set (MARCH 2022) there is one possible collision, yet to be formally explained, and it appears to be the reverse of what you would expect (different MD5s with the same SHA). The overall process I used was to sort on the entire 75 characters (MD5|SHA1), which resulted in records as shown below. I then uniqued on the entire 75 characters and counted duplicate MD5s. None showed up. But when I counted duplicate SHA1s, look what was found.

The first step is to sort the data on the entire 75 characters, which gets a file like the one you see below. Notice that there are two different MD5s but only a single SHA. Beats me what happened. It should be the other way around.

   MD5                          |      SHA
C63810FED5764442C8064E8234266C3D|A971ECF3D3EFB1848C114036AE3158DA5BE0D2CF
E46A40CC121F03229C726E4753309E17|A971ECF3D3EFB1848C114036AE3158DA5BE0D2CF

Again, the 2022 single (reverse) collision has yet to be explained. Go figure.
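The duplicate-SHA1 check can be reproduced on a small scale with the Python sketch below. It is an illustration only, using a hypothetical function name and an in-memory dictionary; for the full data set you would more likely sort on the SHA1 field and compare adjacent records, since holding every pair in memory is expensive.

```python
from collections import defaultdict


def sha1_with_multiple_md5(pairs):
    """Return {sha1: [md5, ...]} for every SHA1 paired with more than one MD5."""
    md5s_by_sha1 = defaultdict(set)
    for pair in pairs:
        md5, sha1 = pair.strip().split("|")
        md5s_by_sha1[sha1].add(md5)
    return {sha1: sorted(md5s)
            for sha1, md5s in md5s_by_sha1.items() if len(md5s) > 1}


sample = [
    "C63810FED5764442C8064E8234266C3D|A971ECF3D3EFB1848C114036AE3158DA5BE0D2CF",
    "E46A40CC121F03229C726E4753309E17|A971ECF3D3EFB1848C114036AE3158DA5BE0D2CF",
]
print(sha1_with_multiple_md5(sample))
```

Swapping the roles of the two fields in the loop gives the more conventional check, one MD5 paired with several SHA1 values.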

Below are some statistics from the NIST website. (After reading the NIST explanations, I still can't figure out what the total unique item count should be. If anyone can do that from reading these values, let me know.)



NSRL: RDS_2022.03.1_modern
Modern:            552,038,839 
Unique:             43,262,568 
   ===============================
RDS 2.75 December 2021 Hash Counts

Modern:             202,302,512
Modern (minimal):    41,850,362
Modern (unique):     22,366,821
Legacy:             134,570,414
Android:             50,308,347
iOS:                 13,124,271
   =============================
RDS 2.74 September 2021 Hash Counts

Modern:             192,677,749
Modern (minimal):    38,320,334
Modern (unique):     20,411,375
Legacy:             113,737,918
Android:             41,589,780
iOS:                    931,242

Below are my total counts for the various version items prior to 2022 which I have maintained. Notice that, except for the modern/unique which I didn't calculate in this list, all the version 274 NIST counts match mine. Was I glad to see that. What a novel idea. However, when uniqued on the entire MD5|SHA combined field, the numbers decrease in some instances, such as the android and ios ver 274 counts.

Version|  IOS      |  IOS_UNQ  |  ANDROID  |  ANDROID_UNQ | RDS/Modern | MODERN_UNQ |  LEGACY    |  LEGACY_UNQ
       |           |           |           |              |            |            |            |  
ver 231|           |           |           |              |            |  19,222,354|            |  
ver 258|           |           |           |              |            |  38,316,594|            |  
ver 260|           |           |           |              |            |  41,980,129|            |  
ver 262|           |           |           |              |            |   7,713,758|            |  
ver 265|           |           | 15,728,036|     5,177,636|            |            |            |  
ver 267| 14,390,472|  7,713,758|  8,396,701|     4,164,911|            |            |            |  
ver 270|  9,037,374|  5,334,511| 16,245,715|     7,043,597| 124,858,861|  31,638,076|            |  
ver 271| 46,447,082| 25,883,151| 18,890,716|     7,845,648| 130,274,166|            |            |  
ver 273| 46,447,082| 25,883,151|           |              |            |            |            |  
ver 274|    931,242|    568,223| 41,589,780|    14,861,596| 192,677,749|  38,320,334| 113,737,918| 46,111,042
ver 275| 13,124,271|  7,115,566| 50,308,347|    17,799,609| 202,302,512|  41,850,361| 134,570,414| 54,424,559       
        ---------------------------------------------------------------------------------------------------

Current total for all the items from 231 thru the new 2022 is:  175,831,092
FINAL UNIQUE VALUE for ALL combined is:    
Pre  2022 SQL:   174,391,680 
Post 2022 merge: 175,831,092 
So we added about 1.4 million (1,439,412) in this new data set.
Here is the breakdown by first character count: 0-F. Consistency and even distribution is the name of the game.
0 +10989701
1 +10995576
2 +10989640
3 +10989007
4 +10988843
5 +10989717
6 +10987534
7 +10993763
8 +10987264
9 +10987341
A +10990219
B +10990034
C +10978924
D +10991958
E +10989548
F +10992023
Nice, when a plan (or the totals) comes together :-)
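The per-character counts can be cross-checked against the grand total with a few lines of Python. The dictionary simply restates the published numbers; first_char_counts is a hypothetical helper showing how such a breakdown could be produced from the records themselves.

```python
from collections import Counter

# Per-first-character counts as published in the breakdown above.
published = {
    "0": 10989701, "1": 10995576, "2": 10989640, "3": 10989007,
    "4": 10988843, "5": 10989717, "6": 10987534, "7": 10993763,
    "8": 10987264, "9": 10987341, "A": 10990219, "B": 10990034,
    "C": 10978924, "D": 10991958, "E": 10989548, "F": 10992023,
}


def first_char_counts(records):
    """Count records by the first hex character of the leading MD5 field."""
    return Counter(rec[0].upper() for rec in records if rec)


total = sum(published.values())
spread = max(published.values()) - min(published.values())
print(f"{total:,}")  # 175,831,092, matching the post-2022 merge total
```

The spread between the largest and smallest bucket is a quick sanity check on how evenly the hashes are distributed.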

This final combined/uniqued MD5|SHA total is about 13G. The total zipped is about 7,051,790,764 (7+GIG). I have split the file into four small(er) zip files for easier download.
NSRL_0-3.zip, 1,763,191,848
NSRL_4-7.zip, 1,763,027,765
NSRL_8-B.zip, 1,762,833,850
NSRL_C-F.zip, 1,762,737,301
==============
Total zip: 7,051,790,764
The links to these files can be found on this page. The updated zips, which include the new MODERN NSRL data sets, were put on the page on March 21, 2022.
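Once the four zips are unzipped, recombining is plain concatenation. The Python sketch below assumes each zip yields one data file and that the parts are split by leading hex character as the zip names suggest (0-3, 4-7, 8-B, C-F), so joining them in that order keeps the combined file sorted. The extracted filenames in the usage note are hypothetical; use whatever names the zips actually contain.

```python
import shutil


def recombine(parts, dest):
    """Concatenate the extracted part files, in the order given, into dest."""
    with open(dest, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Copy in large blocks so multi-gigabyte parts are never held in memory.
                shutil.copyfileobj(src, out, length=16 * 1024 * 1024)
```

For example, recombine(["NSRL_0-3.txt", "NSRL_4-7.txt", "NSRL_8-B.txt", "NSRL_C-F.txt"], "NSRL_ALL.txt"), where all five names are placeholders.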


The entire NSRL SQL data base files can be downloaded from the NIST site mentioned above. As mentioned before, I only extracted out the MD5, SHA values for availability on this website. If you really want the entire data set, download the data base, and have fun.


DISKCAT    DEMO    BATCH

The file TEST_SUITE   contains a batch file with sample command lines to run diskcat.

Run the executable from a top level (empty) directory to extract all the files and programs. Then run the diskcat_demo.bat file, located in the run_exes directory, from the command line. The batch file should have all the correct paths laid out to provide adequate results. Compare the output of this diskcat (cataloging) program with that of your own programs used to provide listings of files located in specified evidentiary paths. See which provides more evidentiary information.


HASH    and    SEARCH    BATCH

The file TEST_SUITE contains a batch file which will create a data set of SHA256 values, then use the Maresware search program to search the hash data set for a specified number of SHA256 values.

Run the executable from a top level (empty) directory to extract all the files and programs. Then run the hash_demo.bat and search_demo.bat files, located in the run_exes directory, from the command line. The batch files should have all the correct paths laid out to provide adequate results. Compare the output of the hash program with that of your own programs used to provide hash listings of files located in specified evidentiary paths. See which provides more evidentiary information.

The search_demo.bat file is included for you to see how fast the search program can find records contained in appropriately formatted outputs. It is very useful when searching millions of data records for selected keys. (If you don't know what a key to search for means, don't even bother running the search batch.)


MD5    BATCH

The file TEST_SUITE contains a batch file and sample HASH and SHA data files which can be used to test and see the action of the MD5 program. It creates some sample MD5 output, and also compares some sample data files with preset MD5 and SHA values to display the action of the MD5 program.

The search command line shows how efficient the search program can be in finding specific HASH values in a data set containing HASH/SHA values. BUT the search program can only work on fixed length records.
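One reason fixed length records make fast lookups possible is that any record can be reached with a single seek, so a sorted file can be binary searched directly on disk. The Python sketch below illustrates that general idea; it is an assumption about why the constraint is useful, not a description of how the Maresware search program works internally. It assumes 75-byte records (32 MD5 + pipe + 40 SHA1 + CR LF) sorted on uppercase MD5, and find_md5 is a made-up name.

```python
import os

RECORD_LEN = 75  # ASSUMED layout: 32 MD5 + "|" + 40 SHA1 + CR LF
KEY_LEN = 32     # the MD5 is the leading field and the sort key


def find_md5(path, md5):
    """Binary search a sorted fixed-length file; return the matching record or None."""
    key = md5.upper().encode("ascii")
    with open(path, "rb") as fh:
        lo, hi = 0, os.path.getsize(path) // RECORD_LEN
        while lo < hi:
            mid = (lo + hi) // 2
            fh.seek(mid * RECORD_LEN)  # jump straight to record number mid
            rec = fh.read(RECORD_LEN)
            if rec[:KEY_LEN] < key:
                lo = mid + 1
            elif rec[:KEY_LEN] > key:
                hi = mid
            else:
                return rec.decode("ascii").rstrip("\r\n")
    return None
```

With about 175 million records, a search like this needs at most 28 reads per key, which is why variable length records (where record boundaries are unknown in advance) are so much slower to probe.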

Run the executable from a top level (empty) directory to extract all the files and programs. Then run the md5_demo.bat file, located in the run_exes directory, from the command line. The batch file should have all the correct paths laid out to provide adequate results. Compare the output of this md5 program with that of your own programs used to provide hash (md5) listings of files located in specified evidentiary paths. See which provides more evidentiary information. Also, examine how the format of this output differs from that of the hash program.


GENERIC    DATA    PROCESS with a hint of HASH

The file TEST_SUITE contains a generic batch file (run_test.bat) which demonstrates the speed and efficiency of some of the other data file processing programs, such as search, bsearch, compare, verticle. Don't confuse the Maresware "search" program with a typical string search program. The other programs (including the forensic "upcopy" program) and help files can be downloaded from the website; make sure they are in the path before running the batch.

These programs are for anyone who processes large amounts of RAW data, whether you created it through a forensic process or obtained it from a source. The data processing programs are fast, efficient, and programmable. When I had a real job, I used them to process 100 million mainframe records. But don't let that deter you from taking a look-see at the possibilities in your forensic work.


The Maresware help files may also contain additional zip and batch files which demonstrate the operation of the software.

If you find errors in the file links or process, please let me know.
Remember, this software doesn't contain bugs. It's just operationally challenged.

 
