URL_SRCH



Author: Dan Mares, dmares @ maresware . com (you will be asked for e-mail address confirmation)
Portions Copyright © 2005-2023 by Dan Mares and Mares and Company, LLC
Phone: 678-427-3275, leave message, otherwise I think you are selling something.
Last update: Apr 14, 2023

One liner: Searches the text of files for URLs, e-mail addresses, phone numbers, and credit card numbers.

Sample Maresware Batches  an executable with data that demonstrates various Maresware software. Download and run the appropriate _16_xx batch for url_srch demo.

Get url_srch.exe   This version may not be the most current. Check banner date url_srch -?

This is a command line program.
MUST be run within a command window as administrator.


PURPOSE

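Search files on a drive to determine if they contain any of the following indicators: IP addresses, e-mail addresses, web site URLs, U.S. phone numbers, U.S. Social Security Numbers, or bank-issued credit card numbers.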

CAVEAT:
Because the SSN and bank credit card number algorithms were written in the early 2000s, they may not be 100% accurate. Test to confirm that your hits are what you expect.
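
One way to test credit card hits independently is the Luhn check digit test that bank-issued card numbers follow. The sketch below is a stand-alone check written for this document; it is not taken from url_srch, and the `luhn_ok` name is hypothetical.

```python
def luhn_ok(number: str) -> bool:
    """Return True if the digits in number pass the Luhn check."""
    digits = [int(c) for c in number if c.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

# 4111 1111 1111 1111 is a widely used test number, not a real account.
print(luhn_ok("4111 1111 1111 1111"))
print(luhn_ok("4111 1111 1111 1112"))
```

A hit that fails this test is probably not a bank card number; a hit that passes still needs a human look.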

The program takes command line options to define the path and file types to search. Once it determines the files to search, it proceeds to open and examine the contents of each file for indications of IP addresses, e-mail addresses, web site URLs, U.S. phone numbers, U.S. Social Security Numbers, or credit card numbers of bank issuance (not gas or store cards).

This program has been found to be extremely useful in finding these items in exported freespace and unused space files. Run this program against freespace that has been exported from other forensic software.



OPERATION

===============================
First off, know ye all that the Maresware software is designed as generic processing software. Except as otherwise noted, the software is not designed to operate on only one type of input data, such as a database, docx, xlsx, pcap, zip file, or spreadsheet. It was designed originally to work on pure mainframe text files. So in the case of this program, url_srch, if the input file contains text data formatted as an IP, e-mail, URL, etc., then the software should be able to find it and include the data in the output file. If the data being searched for is contained in a binary or munged (that's a technical term) format, then the program won't find it. See the sample output at the end of this file.
===============================

The program opens each file identified by the command line parameters and proceeds to identify the items (IP, e-mail address, URL).

The program is set to identify:

Every effort has been made to eliminate false hits on IP addresses, e-mails, URLs, phone numbers, and SSNs that fall outside an acceptable range or format. However, it is always better to report some incorrect formats than to miss a meaningful item.
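
To illustrate the kind of range check described above, here is a minimal Python sketch. It is NOT the program's actual algorithm: the two patterns and the `find_hits` name are assumptions made for this example, and they handle only plain 8-bit text.

```python
import re

# Hypothetical patterns for illustration only; not url_srch's real rules.
IP_RE = re.compile(rb"(?<![\d.])(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})(?![\d.])")
MAIL_RE = re.compile(rb"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")

def find_hits(data: bytes):
    """Return (kind, offset, text) tuples for IPv4 and e-mail candidates."""
    hits = []
    for m in IP_RE.finditer(data):
        # Drop candidates with an octet outside the 0-255 range.
        if all(int(octet) <= 255 for octet in m.groups()):
            hits.append(("ip", m.start(), m.group().decode("ascii")))
    for m in MAIL_RE.finditer(data):
        hits.append(("mail", m.start(), m.group().decode("ascii")))
    # Report hits in the order they appear in the data.
    return sorted(hits, key=lambda h: h[1])

sample = b"From: sip:voi18063@sip.cybercity.dk via 192.168.1.2 not 999.1.1.1"
for kind, pos, text in find_hits(sample):
    print(kind, pos, text)
```

The 999.1.1.1 candidate is dropped because its first octet is out of range, while the e-mail address and the valid IP are kept.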

IPV6 formats
The basic IPV6 format is 8 sets of 4 hex (ABCD) digits separated with colons (:)
abcd:1234:5678:90ef:abcd:1234:5678:90ef
However, there are so many possible exceptions to this format that any one of them could be missed by this program.
Users should research the standard, familiarize themselves with the exceptions, and check whether any of those exceptions appear in the data they are searching.
Possible research:
Reference 1
ipv6.com
Oracle reference
Note special instance of bypassing 0000:0000 segments
IPV6 formats
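
The zero-segment shortcut mentioned above can be explored with Python's standard `ipaddress` module. This sketch only shows how one address can be written two ways; it says nothing about which forms url_srch itself recognizes, and the sample address is made up.

```python
import ipaddress

full = "abcd:1234:0000:0000:0000:0000:5678:90ef"
addr = ipaddress.IPv6Address(full)

# Short form: leading zeros dropped and the run of zero groups replaced by "::".
print(addr.compressed)
# Full 8-group form, handy for normalizing hits before comparing them.
print(addr.exploded)

# Both spellings refer to the same address.
print(ipaddress.IPv6Address("abcd:1234::5678:90ef") == addr)
```

If your evidence may contain the short form, normalizing both your search list and the hits to the exploded form makes comparison easier.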

Some users have a plain list of credit card numbers (i.e., just numbers, no other text on each line). Because of an anomaly (not a bug) in the logic, this format, a single item on the line, will not produce correct results. If you have credit card numbers with only one item per line, the best way to obtain correct processing is to add about 10 blank spaces, or some dummy text, to each line. Instead of this
1234567890123
do something like this
1234567890123 add any text here
Adding text to each line this way produces correct processing.
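
For a long list, the padding can be scripted. The following is a small sketch of one way to do it; the `pad_lines` helper and the filler text are illustrative choices, not part of url_srch.

```python
def pad_lines(lines, filler="add any text here"):
    """Append filler text to every line that holds only a single item."""
    padded = []
    for line in lines:
        stripped = line.strip()
        if stripped and len(stripped.split()) == 1:
            padded.append(f"{stripped} {filler}")
        else:
            # Lines that already carry other text are left as they are.
            padded.append(line.rstrip("\n"))
    return padded

print("\n".join(pad_lines(["1234567890123\n", "card 1234567890123 on file\n"])))
```

Write the padded lines to a new file and point url_srch at that file, leaving the original list untouched.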

The international country URLs have been covered as well as possible for those in use in the early 2000s, when this program was originally written. Obviously almost any .extension is valid today, so if you have a special one you are looking for, let me know, and I may be able to update the program. The following domains are currently accounted for:
Again: if you have an unusual domain that is not listed, please advise, and I will add it.

.com.mx
.com
.edu
.net.nz
.net
.org.uk
.org.nz
.org
.gov
.biz
.info
.win
.fun
.icu
.us
.co.uk
.co.nz
.me.uk
.name



OPTIONS

-?:  Get a help screen.

-p + path(s):  If more than one directory is needed to be looked at, then add the paths here as appropriate. (-p c:\windows    d:\work)   [PATH=path]

-f + filespec:  If more than one file type is needed, add them here. (-f   *.c   *.obj   *.dll)   [FILES=filetype]

If these options are used, the program builds a matrix of paths and file types. It searches all the requested directories for all the requested file types, giving a total of all the files in all the paths requested. These options are added to any default command line provided. (C:>hash c:\work\*.c -f *.dll -p d:\windows)
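
The path and file type matrix can be pictured with a short Python sketch. This is an illustration of the idea only; the `expand` function is hypothetical and the program's real directory walk may differ.

```python
from itertools import product
from pathlib import Path

def expand(paths, filespecs, recurse=True):
    """Yield each file that matches any (path, filespec) pair, once."""
    seen = set()
    for root, spec in product(paths, filespecs):
        base = Path(root)
        # rglob descends into subdirectories; glob stays in the top directory.
        matches = base.rglob(spec) if recurse else base.glob(spec)
        for f in matches:
            if f.is_file() and f not in seen:
                seen.add(f)
                yield f

# Two paths and three file types give six (path, filespec) pairs to search.
print(len(list(product([r"c:\windows", r"d:\work"], ["*.c", "*.obj", "*.dll"]))))

# Demo on the current directory, without recursion.
for f in expand(["."], ["*.txt", "*.log"], recurse=False):
    print(f)
```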

-r:  DO NOT recurse through path provided. Default is recurse through path (-p option).

-x + filespec:  e(x)clude these file types from listing. Maximum of 100 file types accepted. (same format as -f option) (-x thesefiles.txt)

-oO + filename:  Output file name. Place the output in the named file. If uppercase ‘O’ is used, the output is appended to any existing file. If you DO NOT want header/accounting/command line information included in the output file because you are taking the output to the next analysis step, be sure to use both the -a (accounting) or -1 (that's a one) logfile option and the -v option.

-U:  Search also for Unicode type hits. If this is not chosen, only 8 bit ascii values are looked at. This means that you might miss a lot of Unicode hits. This option slows down processing.

-[euiPSC6]:    Select ONLY (e)mails, (u)rls, (i)ps, (P)hone numbers, (S)SN's, (C)redit card numbers, (6)IPV6 values. Only those items meeting the criteria are found. The default is to find all items except credit card numbers.
NOTE: IPV6 in unicode files is not yet implemented

-m #[CLR]:   Where # is the new maximum length of the output line containing the hit. If -m is used, the hit string is encased within « and » (decimal 174, 175). After the number you can place a 'C', 'L', or 'R' (e.g. -m 90L). The 'C', 'L', or 'R' tells the program where to place the hit string. 'C' is the default.
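
As a rough picture of what hit placement could mean, the sketch below builds a context line of a given width with the hit at the left, center, or right. The exact way url_srch pads and truncates is not documented here, so the `context_line` function and its arithmetic are assumptions.

```python
def context_line(data, start, end, width, where="C"):
    """Return up to `width` characters of context with the hit marked by « »."""
    hit = "«" + data[start:end] + "»"
    room = max(width - len(hit), 0)
    if where == "L":
        before = 0            # hit first, context after it
    elif where == "R":
        before = room         # context first, hit last
    else:
        before = room // 2    # hit in the middle
    after = room - before
    left = data[max(start - before, 0):start]
    right = data[end:end + after]
    return left + hit + right

text = "aaaa192.168.1.2bbbb"
for where in "CLR":
    print(where, context_line(text, 4, 15, 17, where))
```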

--NOTEXT: (ver. 2023-05-09) This option eliminates the field that normally follows the filename field and contains the 80 characters of the hit string and its surrounding text. It takes priority over, and removes, any -m option values given. In some instances this surrounding text is not needed and all you are looking for is the filename, displacements, and the item (IP, URL, e-mail, etc.) that was hit. Users can then use other means to get to the actual surrounding text. This option significantly reduces the size of the output file.

--HITWIDTH=xx:   (ver. 2023-05-09) The final field is the "HIT" field containing ONLY the text of the hit item, i.e. the IP, URL, e-mail address, etc. The final field defaults to 60 characters to accommodate a large hit item. However, if you are looking for only IPs or URLs you might reduce this field to a smaller value to save output space. For instance, if looking for IPs only, you might use --HITWIDTH=20, which provides a reasonable size for any IP found and doesn't waste space.

-d + #:  Where # is the ASCII value of the delimiter to use between fields. The default delimiter is the pipe (|), ASCII decimal 124. If the value is a single digit, it must be preceded by a 0 (-d 02). -d is only available with a -m or -w option.

-Ww [#]:  Print single-line wide output for input into a database. The -d (delimiter) option is encouraged at this point. If -W is used, the output file header is not inserted, which is better for import into databases. Replace # with the maximum path length to print; the # is optional. -w is the default. To turn it off, use -w0.

-D [#]: begin processing files this many bytes in from beginning.

-E [#]: end processing at this location in file.
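
The -D and -E pair describes a byte window within each file. A minimal sketch of reading such a window follows; the `read_window` name is hypothetical, and treating the end offset as exclusive is an assumption, since the text above does not say whether the ending byte is included.

```python
import os
import tempfile

def read_window(path, begin=0, end=None):
    """Read bytes from offset `begin` up to (not including) offset `end`."""
    with open(path, "rb") as fh:
        fh.seek(begin)
        if end is None:
            return fh.read()
        return fh.read(max(end - begin, 0))

# Demo on a throwaway file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"0123456789 user@example.com tail")
print(read_window(tmp.name, begin=11, end=27))
os.remove(tmp.name)
```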

-1 + logfile:  Create a logfile of the operation.

-R:  Reset access date to original date before operation

-g + #:  Where the # is replaced by a number indicating, list all files ‘g’reater than # days old. You can use a -gl pair to bracket file ages. [OLDER=xxx]

-g + mm-dd-yyyy
-l + mm-dd-yyyy
:  (that's an ell, not a one). Process only those files (g)reater than (older) or (l)ess than (newer) than this mm-dd-yyyy date. The date MUST be in the form mm-dd-yyyy. It MUST have a two digit month and day (leading 0 if necessary), and it MUST have a 4 digit year. The date given mm-dd-yyyy is NOT included in the calculation. I.e., if today were 01-10-2003 and you entered -l 01-09-2003, you would only process today's files. If you wanted to include those on 01-09, you should have entered -l 01-08-2003.

-g + #    Where the # is replaced by a number indicating: list all files ‘g’reater than # days old. You can use a -gl pair to bracket file ages. [OLDER]=50

-l + #    (ell, not one) Where the # is replaced by a number indicating: list all files ‘l’ess than # days old. You can use a -gl pair to bracket file ages. To get today's files, use (-l 1) [NEWER]=10

-g + mm-dd-yyyy[acw]
Process only those files (g)reater (older) than this mm-dd-yyyy date. The date MUST be in the form mm-dd-yyyy. It MUST have a two digit month and day (leading 0 if necessary), and it MUST have a 4 digit year. The date calculation is made as of midnight on the date given for the -g option. For this reason, the day provided is NOT included in the calculation. I.e., if you entered -g 01-01-2006 you would only process dates PRIOR to 1/1/2006. This means all of 2005 and before. See below for the [acw] meanings.

-l + mm-dd-yyyy[acw]:  (that's an ell, not a one). Process only those files (l)ess than (newer) than this mm-dd-yyyy date. The date MUST be in the form mm-dd-yyyy. It MUST have a two digit month and day (leading 0 if necessary), and it MUST have a 4 digit year. The date calculation is made as of midnight on the date given for the -l option. For this reason, the day provided IS included in the calculation. I.e., if you entered -l 01-01-2006 you would process all of 2006 to the current date.
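
The midnight rule in the two mm-dd-yyyy[acw] descriptions above can be written out as a small Python sketch. The `parse_cutoff` and `keep` names are hypothetical; this shows the date arithmetic only, not how url_srch reads file times.

```python
import re
from datetime import datetime

def parse_cutoff(text):
    """Parse a strict mm-dd-yyyy date as midnight at the start of that day."""
    if not re.fullmatch(r"\d{2}-\d{2}-\d{4}", text):
        raise ValueError("date must be mm-dd-yyyy with leading zeros")
    return datetime.strptime(text, "%m-%d-%Y")

def keep(file_time, older_than=None, newer_than=None):
    """Apply -g (older_than) and -l (newer_than) style cutoffs to one file time."""
    # -g: only times before midnight of the given day, so the day is excluded.
    if older_than is not None and not file_time < parse_cutoff(older_than):
        return False
    # -l: times at or after midnight of the given day, so the day is included.
    if newer_than is not None and not file_time >= parse_cutoff(newer_than):
        return False
    return True

print(keep(datetime(2005, 12, 31, 23, 59), older_than="01-01-2006"))
print(keep(datetime(2006, 1, 1, 8, 0), older_than="01-01-2006"))
print(keep(datetime(2006, 1, 1, 8, 0), newer_than="01-01-2006"))
```

Passing both cutoffs brackets a date range, in the same spirit as the -gl pair.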

-L + #:  Where the # is replaced by a number indicating, list all files less than # bytes in size. (-L 100000) [LESSTHAN=xxx]

-G + #:  Where the # is replaced by a number indicating, list all files greater than # bytes in size. You can use a -GL pair to bracket file sizes. (-G 10000) (-G 10000 -L 100000) [GREATER]=10000

--email=textfile:    textfile contains a list, one per line, of the email addresses to look for. This restricts the output of the email searches to ONLY those emails listed in this text file. The file can contain a single domain to get all those emails. ie: @gmail.com or @yahoo.com etc. will get all yahoo and gmail emails.

--urls=textfile:    textfile contains a list, one per line, of the urls to look for. The format should be abc.com. This restricts the output of the URL searches to ONLY those listed in this text file. Do not include the http: unless you feel it is absolutely necessary. Sample: dmares.com, nist.gov

--ips=textfile:    textfile contains a list, one per line, of the IPs to look for. The format should be n[nn].n[nn].n[nn].n[nn]: 4 octets, which do not each have to be 3 digits. This restricts the output of the IP searches to ONLY those listed in this text file. Sample: 123.45.90.1, 69.89.12.222
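
The effect of such a filter list can be sketched for the e-mail case, where an entry may be either a full address or a bare domain such as @gmail.com. The helper names are hypothetical, and the case-insensitive comparison is an assumption rather than documented behavior.

```python
def load_filter(lines):
    """Build a lookup set from the lines of a filter text file."""
    return {line.strip().lower() for line in lines if line.strip()}

def email_allowed(address, wanted):
    """Keep an address if it, or its @domain part, is in the filter list."""
    address = address.lower()
    _, at, domain = address.partition("@")
    return address in wanted or (at + domain) in wanted

wanted = load_filter(["@gmail.com\n", "dmares@maresware.com\n"])
for hit in ["Bob@Gmail.com", "dmares@maresware.com", "someone@yahoo.com"]:
    print(hit, email_allowed(hit, wanted))
```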

--split[=xx]: Split the output file into files of this many records. The -v option is turned on to eliminate headers. If no count is given, 5,000 is the default. Example: --split, --split=5000



COMMAND LINES

Command lines can take one of three formats:

DO NOT use the full filename/path as the -p or -f option.
With the -p option, include only the path, and
with the -f option, place only the filename, without paths.
The -v option provides a much cleaner output for input to the next step. Hopefully you have a next step.

URL_SRCH
URL_SRCH -p  d:\path  -o  c:\tmp\IP_output   -i6        -v  -w -m 200 -d "|" 
URL_SRCH -p  d:\path  -o  c:\tmp\output                 -v  -w -m 200 -d "|" 
URL_SRCH -p  d:\path  -f  ccards.txt -o  c:\tmp\output  -v  -w -m 200 -d "|" 
URL_SRCH -p  d:\path  -o  c:\tmp\output -C              -v  -w -m 200 -d "|" 
URL_SRCH -p  d:\path  -o  c:\tmp\output -U              -v  -w -m 200 -d "|"   (add UNICODE to the search) 

Two (truncated) lines of sample output from url_srch searching a pcap file for e-mails and IP addresses. Notice that the suspect item is located in the middle of the text output. Again, this is formatted for display; the actual output is more verbose.

 item | POS  | filename      |  text                                                     | content  
|ip   | 1942 | D:\file1.pcap |48737-46Za715e192.168.1.2;rport..From: sip:voi18063@sip.cy |192.168.1.2
|mail | 1981 | D:\file1.pcap |sip:voi18063@sip.cybercity.dk;tag=903df0a..To: sip:voi18   |voi18063@sip.cybercity.dk
You can then take the suspect IPs and put them in the Strsrch program to search the file specifically for all the content surrounding each suspect IP.
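
For that next step, the delimited output can be pulled apart with a few lines of Python. This sketch assumes the five-field layout of the display sample above (item, position, filename, text, content); the real output is more verbose, so adjust the field positions to match what your version writes. The `parse_line` name is hypothetical.

```python
from collections import Counter

SAMPLE = [
    r" item | POS  | filename      |  text  | content",
    r"|ip   | 1942 | D:\file1.pcap |48737-46Za715e192.168.1.2;rport..From: sip:voi18063@sip.cy |192.168.1.2",
    r"|mail | 1981 | D:\file1.pcap |sip:voi18063@sip.cybercity.dk;tag=903df0a..To: sip:voi18   |voi18063@sip.cybercity.dk",
]

def parse_line(line, delim="|"):
    """Split one output line into fields; return None for headers or short lines."""
    parts = line.strip().lstrip(delim).split(delim)
    if len(parts) < 5:
        return None
    try:
        pos = int(parts[1])
    except ValueError:
        return None  # header line: the position column is not a number
    return {
        "kind": parts[0].strip(),
        "pos": pos,
        "file": parts[2].strip(),
        # The text field may itself contain the delimiter, so rejoin the middle.
        "text": delim.join(parts[3:-1]),
        "hit": parts[-1].strip(),
    }

records = [r for r in map(parse_line, SAMPLE) if r]
ips = Counter(r["hit"] for r in records if r["kind"] == "ip")
print(ips.most_common())
```

The resulting list of unique IPs is a convenient starting point for a Strsrch search list.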


Related software:
Strsrch   Search for specific text strings.
