NSRL Hash Set stats and stuff

Version 2024.03 SQL data set: (circa Mar. 2024)

This article deals with the March 2024 SQL release, version 2024.03.xxx. The massaged data discussed here is the result of my re-processing the NSRL data into a more usable format containing only the MD5 values. The NSRL goes back many years, and the data files were previously named by sequence number, such as 267, 273, 274 (<275), etc. In some cases below I have processed and included those older versions, because some of their MD5s were unique and warranted being added. However, where totals are posted, some may seem confusing, because some include the old-version MD5s and others include only the current 202403 items. So if the numbers seem irregular, that is probably the reason they don't match.

See the NSRL - NIST site for explanation of their processes and definition of what is included in the data sets.

The various NSRL segments (IOS, ANDROID, Legacy, Modern) contain a total of approximately 1.3 billion MD5 values. I have merged and uniqued the Legacy, Modern, ANDROID and IOS segments to obtain a total of 186+ million MD5 items from the 202403 set.
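For readers curious what "merged and uniqued" amounts to, here is a minimal Python sketch. It is not the process actually used to build these files. It assumes each segment has already been reduced to a sorted sequence of MD5 strings, and the function name merge_unique is made up for the illustration.

```python
import heapq
from typing import Iterable, Iterator


def merge_unique(*sorted_segments: Iterable[str]) -> Iterator[str]:
    """Merge already-sorted MD5 sequences and drop duplicate values.

    Assumption: every input sequence is sorted in ascending order.
    """
    previous = None
    # heapq.merge walks all inputs in step, so nothing is loaded whole.
    for value in heapq.merge(*sorted_segments):
        if value != previous:
            yield value
            previous = value


if __name__ == "__main__":
    legacy = ["1D6EBB5A789ABD108FF578263E1F40F3"]
    modern = ["1D6EBB5A789ABD108FF578263E1F40F3",
              "9B3702B0E788C6D62996392FE3C9786A"]
    for md5 in merge_unique(legacy, modern):
        print(md5)
```

Because the merge is streamed, the same idea works on inputs far larger than memory, as long as each input is sorted first.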

The record format of the file I have massaged is simply the 32-character MD5 value followed by a carriage return and line feed, making the entire fixed-width record 34 characters (32 MD5 and 2 CR/LF). Since there are no collisions, I didn't feel it necessary to accept the substantial (doubling) increase in size that including the SHA would bring.
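The 34-character record layout described above can be sketched in Python as follows. This is only an illustration. The helper names make_record and parse_record are hypothetical, and the use of uppercase hex is an assumption based on the sample values shown in this article.

```python
RECORD_LEN = 34   # 32 MD5 characters + CR + LF
MD5_LEN = 32


def make_record(md5_hex: str) -> bytes:
    """Build one fixed-width record from a 32-character MD5 hex string."""
    value = md5_hex.strip().upper()   # assumption: the file stores uppercase hex
    if len(value) != MD5_LEN or any(c not in "0123456789ABCDEF" for c in value):
        raise ValueError(f"not a 32-character hex MD5: {md5_hex!r}")
    return value.encode("ascii") + b"\r\n"


def parse_record(record: bytes) -> str:
    """Return the MD5 text held in one 34-byte record."""
    if len(record) != RECORD_LEN or not record.endswith(b"\r\n"):
        raise ValueError("record is not 32 characters followed by CR/LF")
    return record[:MD5_LEN].decode("ascii")


if __name__ == "__main__":
    rec = make_record("1d6ebb5a789abd108ff578263e1f40f3")
    print(len(rec), parse_record(rec))
```

With every record the same width, record number n always starts at byte n x 34, which is what makes a binary search of the file possible.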

I would have liked to include the application_id in the record, but since I can hardly spell database I couldn't reform the data to include the application_id. If you research the NSRL web site you will probably have some questions about the files which are actually included in the data set. Enough said, do your own research. You are after all a forensicator.

Go to this page and scroll down about 14 sections to the section on: which hashes are for known bad files.

The data files I am making available are sorted on the MD5 value and use a fixed-length record of 34 characters. Using a reliable binary search engine such as Maresware BSEARCH, I searched the 240+ million records for 20 MD5 values and it took less than a second. A sequential search of the same 240 million records took a little over 1 minute. Depending on the speed of your drive and machine, your times will obviously be different.

 
     Output stats from the linear SEARCH program. 
     Output record length is            34 
     No of records read =      240,428,306 
     No of records wrote=               20 
     Elapsed time: 0 hrs.  1 mins. 21 secs

By contrast, a binary search (BSEARCH) of the MD5 values is as fast as a traditional indexed search.
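To show why the binary search is so much faster than the linear pass, here is a minimal Python sketch of a binary search over a sorted file of 34-byte records. It is not how Maresware BSEARCH is implemented. The function name md5_in_file is hypothetical, and the sketch assumes uppercase hex values sorted in plain byte order.

```python
import os

RECORD_LEN = 34   # 32 MD5 characters + CR/LF
MD5_LEN = 32


def md5_in_file(path: str, md5_hex: str) -> bool:
    """Binary-search a sorted fixed-width MD5 file for one value."""
    target = md5_hex.strip().upper().encode("ascii")
    count = os.path.getsize(path) // RECORD_LEN
    with open(path, "rb") as f:
        lo, hi = 0, count            # search window is records [lo, hi)
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid * RECORD_LEN)  # jump straight to record number mid
            key = f.read(MD5_LEN)
            if key == target:
                return True
            if key < target:
                lo = mid + 1
            else:
                hi = mid
    return False
```

Each probe halves the remaining range, so a file of 240,428,306 records needs at most 28 reads per lookup, where the linear search has to read every record.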

Even though many suites can process MD5 lists, I would imagine 240+ million records might cause some to choke. Maresware SEARCH, BSEARCH and COMPARE are programs which were developed to process gigundo mainframe data files, and they can easily perform the searches and comparisons of the 200+ million records here. They are also batch file compatible. Each of these programs has its own specialty for the process it was designed for. Review of the help file is suggested.

The current (March 2024) RDS_2024.03 (combined/uniqued) values from NIST are 186+ million unique MD5's.

1D6EBB5A789ABD108FF578263E1F40F3
9B3702B0E788C6D62996392FE3C9786A

See the NIST - NSRL site for explanation of their processes and definition of what is included in the data sets. NSRL-NIST overview.

Current March 2024 Hash Counts (before combining and uniquing)
               Total              Unique
Modern:      879,510,365        69,437,521
Legacy:      289,938,900        61,814,050
Android:      97,148,886        29,406,405
IOS:          89,176,502        26,148,185
            ============       ===========
Total:     1,355,774,653       186,806,161

Older versions (V231-277, before 202403) were combined with the current set to build the more complete zip files below. The total comes to 240,428,306 unique MD5 values.

I have split the combined 240+ million items into 4 smaller zipped files and made them available for download. Each zip file is just over 760 Meg in size.
     NSRL_0-3.zip   contains 60,117,305 items with first character 0-3
     NSRL_4-7.zip   contains 60,108,371 items with first character 4-7
     NSRL_8-B.zip   contains 60,102,830 items with first character 8-B
     NSRL_C-F.zip   contains 60,099,800 items with first character C-F
     NSRL_DEMO.zip  contains sample command lines and a batch file to run Maresware.

You should unzip them and then merge them. Make certain the sort order is intact; otherwise you can't use a binary search or compare.

C:> copy /b    NSRL_0-3.MD5    +    NSRL_4-7.MD5    +    NSRL_8-B.MD5    +    NSRL_C-F.MD5    COMPLETE_SORTED_MD5
Merge the files in this sorted order to restore the entire data set. If you need help doing the merge, let me know. Once you have merged the files, I suggest you use the sortchek.exe program (and its help file) to verify that the total set is still sorted. The suggested command line for sortchek is:
D:>sortchek     COMPLETE_SORTED_MD5     -r 34 -p 0 -l 32
Replace COMPLETE_SORTED_MD5 with whatever your file is named. If sortchek finds a record out of order, it will show you.
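For anyone who wants to see what such a sort check involves, here is a rough Python equivalent. It is a sketch only and says nothing about how sortchek works internally. The function name first_unsorted_record is hypothetical, and the 34-byte record with the key in the first 32 bytes is taken from the record description in this article.

```python
import os

RECORD_LEN = 34   # full record, including CR/LF
MD5_LEN = 32      # key is the first 32 bytes of each record


def first_unsorted_record(path: str, chunk_records: int = 1_000_000):
    """Return the 0-based index of the first out-of-order record, or None.

    Equal neighbouring records are treated as in order.
    """
    if os.path.getsize(path) % RECORD_LEN != 0:
        raise ValueError("file size is not a multiple of 34 bytes")
    previous = b""
    index = 0
    with open(path, "rb") as f:
        while True:
            # Read a whole number of records at a time to keep I/O cheap.
            chunk = f.read(RECORD_LEN * chunk_records)
            if not chunk:
                return None
            for offset in range(0, len(chunk), RECORD_LEN):
                key = chunk[offset:offset + MD5_LEN]
                if key < previous:
                    return index
                previous = key
                index += 1
```

The size test up front is a cheap first sanity check after the copy /b merge: 240,428,306 records of 34 bytes should come to exactly 8,174,562,404 bytes.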

SAMPLE RUNS:

If you would like a sample MARESWARE batch file that demonstrates how to run MARESWARE against the NSRL MD5 data referenced above, send me an email request and I'll send the sample batch file.

Below are speed stats from running WIN10 on an i7 machine using a 2 TB external USB drive.

BSEARCH stats:
   Input filesize     =  8,174,562,404
   Number input records  = 240,428,306
   Records Written   =              20

   Finished:      Fri Jun 07 14:00:41 2024
   Elapsed time: 0 hrs. 0 mins. 1 secs

===========================================
Linear SEARCH stats:

     No of records read =      240,428,306 
     No of records wrote=               20 
     Elapsed time: 0 hrs.  1 mins. 21 secs 

===========================================
COMPARE stats from a smaller group
182,078,612 Records in IOS_MOD_ANDR_LEG.MD5
        262 Records in Maresware_hashes

Total records file 1 = 182,078,612
Total records file 2 =         262

Read  181,597,969 records from: IOS_MOD_ANDR_LEG.MD5
Read         262 records from: Maresware_hashes
Wrote         40 records to                junk
Final record length is =                     34
Elapsed time: 0 hrs. 1 mins. 2 secs

A reference page for algorithms and other documents may be found at: NIST. Research the documents link.
