Author: Dan Mares, dmares @ maresware . com
Portions Copyright © 2000-2014 by Dan Mares and Mares and Company, LLC
Phone: (770)242-6687 X119
Last update: September 27, 2010.
This is a simple program that reads a file containing html or xml tags (code), removes those tags, and writes the remaining text to a new file.
Often, in forensics and e-discovery, users carve out (or simply have) a significant number of files containing html or xml code, and wish to remove that code so that the text contents (clear text) can be more easily viewed by the recipient.
The resulting output file will have most, if not all, of the html or xml tags removed, leaving the user with an output file of clear text.
NOTE: At this time, the removal of html and xml code is mutually exclusive: a single pass can remove either the html code or the xml code, but NOT both. To remove both, the output from one run must be passed as the input to the next run.
The program accepts traditional diskcat-type options (see the diskcat options for most of the acceptable options). However, be aware that most of the diskcat output formatting options (-o, -w, etc.) have been removed from this program.
The simplest and default way to run this program is with a -f filetype option. Since it is designed to remove html or xml code, it would be wise to process only htm or xml files, so an option of -f *.htm would suffice.
Paths and other file types can also be processed. If only a path is provided (-p "c:\suspect files"), the program defaults to processing ALL files in the designated path.
Appropriate options can be included for MAC date/time restrictions of files, and other simple file selection options. But it is probably best to just have .htm or .xml files to process.
During the process, the source file is opened, read, and processed, and a new output file is created without the html or xml code. The new file has the same name as the original, with an added extension of .ttx. So an input file of myfile.htm will generate an output file of myfile.htm.ttx, and similarly for all files processed.
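The open, read, strip, and write cycle described above can be illustrated with a minimal Python sketch. This is NOT the NO_HTML source code. It uses Python's standard html.parser module as a stand-in for whatever tag removal the program performs, and the names TagStripper, strip_tags, output_name, and process_file are hypothetical, chosen only for this illustration.

```python
from html.parser import HTMLParser
from pathlib import Path


class TagStripper(HTMLParser):
    """Illustrative helper (not part of NO_HTML): keeps only text between tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for the text found between tags; the tags themselves are skipped.
        self.chunks.append(data)


def strip_tags(markup: str) -> str:
    """Return the text content of markup with the tags removed."""
    parser = TagStripper()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


def output_name(source: str) -> str:
    """Append .ttx to the full original name: myfile.htm -> myfile.htm.ttx."""
    return source + ".ttx"


def process_file(source: str) -> str:
    """Read source, strip its tags, write the result next to it, return the new name."""
    target = output_name(source)
    text = Path(source).read_text(errors="replace")
    Path(target).write_text(strip_tags(text))
    return target
```

Calling process_file("myfile.htm") in this sketch leaves myfile.htm untouched and writes the stripped text to myfile.htm.ttx, mirroring the naming rule described above. The real program may treat scripts, comments, and malformed tags differently.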
At least one option is required on the command line. If none is given, the help screen is displayed.
-p + path(s) If more than one directory is to be looked at, then add the paths here as appropriate. (-p c:\windows d:\work)
Some options may conflict with one another and be mutually exclusive. I have made every effort to notify the user when such conflicts occur. But when using convoluted mixtures of options, please test the results.
-f + filespec If more than one file type is needed, add them here. (-f *.htm *.xml *.txt carve.*) It is suggested that you process .htm or .xml files with this option.
-x + filespec E(x)clude these file types from listing (same format as -f option) (-x thesefiles.txt)
--noxml Use --noxml (or the variant --xml) to process the XML tags. Since the processing of html and xml tags is mutually exclusive, this option is mandatory if you want to remove the xml tags in a file.
Other options from the diskcat option set may work, but are not guaranteed.
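How the -f and -x filespecs might combine can be sketched in Python as follows. This is only an illustration of include/exclude wildcard matching, not the program's actual selection logic. The function name is_selected is hypothetical, and the case-insensitive comparison is an assumption based on Windows file naming.

```python
import fnmatch


def is_selected(name, include=(), exclude=()):
    """Illustrative only: decide whether a file name passes -f / -x style filters.

    include: wildcard patterns in the style of -f (empty means ALL files,
             as when only a path is given)
    exclude: wildcard patterns in the style of -x
    """
    lowered = name.lower()  # assumption: names are compared without regard to case
    if include and not any(
        fnmatch.fnmatchcase(lowered, pattern.lower()) for pattern in include
    ):
        return False
    if any(fnmatch.fnmatchcase(lowered, pattern.lower()) for pattern in exclude):
        return False
    return True


if __name__ == "__main__":
    names = ["index.htm", "data.xml", "notes.txt", "carve.001"]
    # Roughly equivalent to: -f *.htm *.xml carve.* -x data.*
    chosen = [n for n in names if is_selected(n, ["*.htm", "*.xml", "carve.*"], ["data.*"])]
    print(chosen)
```

In this sketch an empty include list selects every file, matching the documented default when only a path is supplied, and an exclude pattern always overrides an include pattern.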
C:>NO_HTML -[options] [--noxml]
C:>NO_HTML -f *.htm
process all html files and add the ttx extension.
C:>NO_HTML -f *.xml --noxml
process all xml files and remove the xml tags.