X_WAYS_META_PROCESSING




Author: Dan Mares, dmares @ maresware . com
Portions Copyright 1998-2016 by Dan Mares and Mares and Company, LLC
Phone: 678-427-3275
Last Update: 03/26/2012 and 4/10/2012

This program comes in two versions.
One version processes the metadata field found in the X-Ways "Export List" tab delimited file, which many users import to a spreadsheet.
The second version processes the Metadata: line found in the X-Ways generated HTML report.

Both versions operate the same way and have similar command lines, but they work on different input files. Read this help file with the input file you are using in mind.


A small amount of sample data is available for download as a zip file. The two files within the zip file are named:
metadata_docs_keywords
metadata_export.tab
Unzip the files, then run the command line:
x_ways_meta_processing    metadata_export.tab    metadata_docs_keywords
The output file is called:           metadata_export_tmp.tab
Import this file (metadata_export_tmp.tab) to a spreadsheet and take a look at the added columns. (Make sure the import is set up for a tab delimited file.)

This is a command line program.
It MUST be run within a command window as administrator.


Purpose

To find (and isolate) semi-colon (;) delimited fields within the X-Ways metadata field that is exported during the X-Ways "Export List" operation, or HTML report generation of X-Ways. The default X-Ways export list is tab delimited, and this program ONLY works on those tab delimited files.

The X-Ways report is a typical HTML report. It contains a Metadata: field whose contents are usually individual metadata items separated by a carriage return or the HTML line break tag (BR), similar to the sample shown here. It may contain a lot of irrelevant information.
Important note: See the section labelled HTML_REPORT below.

Metadata: Width: 4000
Height: 3000
Orientation: 1
Equipment make: SOME CAMERA MAKE
Model: KODAK
Keywords: help me
Date Original: 2009:08:20 10:27:26
Date digitized: 2009:08:20 10:27:26
Thumbnail: true
Focal length: 4.0
F number: 3.5

The "Export List" version, which deals with the X-Ways "Export List" output, takes the field(s) identified by the user, isolates and extracts each one, and sets each one up as its own tab delimited field within the output record generated by X-Ways. These newly inserted tab delimited fields can then be easily imported to a spreadsheet and manipulated by the user.

The resulting output record now has one additional tab delimited field for each semi-colon (;) delimited sub-field that was identified within the X-Ways metadata field. The original metadata field is not modified in any way, and is always retained in the output record.

HTML_REPORT

The HTML report processing version keeps only the selected or requested metadata items in the resulting HTML report. Any other unused or unnecessary metadata items are removed from the output HTML file. This reduces the erroneous and extra HTML data which the user feels may be unnecessary in the final HTML report. Each (new) line in the resulting HTML output file is also labelled Metadata:

The "Metadata:" line in the report MUST be the first item on the line in the html report and be left justified with no spaces. If you wish to change the Metadata: field to be BOLDED so it will stand out, you may do this. But you can only use the < B > and < / B > html tags. Those are the only html tags that the program understands when looking for the Metadata field name. If you use the < STRONG > tag it will not work. If the program does not respond properly, (meaning it does not properly find and parse the Metadata field) please open the html report with a text editor and do a search and replace. Remove the bolding or any other formatting around the word Metadata, and make it left justified on the line. In other words: replace the string < B >Metadata: < / b >, with just Metadata: Open and examine the X-Ways generated html report, and you will understand this above restriction.
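The search and replace described above can also be scripted. The following is only a minimal sketch and not part of either program: the function name unbold_metadata_label is hypothetical, and it assumes the label is wrapped only in B tags (in any letter case, with optional spaces inside the tags).

```python
import re

# Assumption: the label looks like <B>Metadata:</B>, in any letter case and
# with optional spaces inside the tags. <STRONG> and other tags are not handled.
BOLD_LABEL = re.compile(r"<\s*b\s*>\s*Metadata:\s*<\s*/\s*b\s*>", re.IGNORECASE)


def unbold_metadata_label(html_text: str) -> str:
    """Replace a bolded Metadata: label with the plain label."""
    return BOLD_LABEL.sub("Metadata:", html_text)
```

Run the report text through this function before handing the file to the report processor. It does not remove leading spaces, so the Metadata: label must still start the line.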

WHY THIS PROGRAM WAS WRITTEN

This program was created because when X-Ways finds and extracts metadata from a file, it extracts many different metadata items, and the items extracted differ depending on the type and content of the file being processed. A large number of variable items are placed within this metadata field. (See the list of metadata fields I have found, below.) Some of these, when available, may be the Last Printed date, Author, various Exif data, dates, camera type, and other useful metadata information. However, since the extracted metadata field is basically a free form field with semi-colon (;) delimited sub-fields, it is not easy to identify either the targeted item (e.g. Last Printed date) or its location within the metadata field. If you have ever imported the export list to a spreadsheet you have experienced this problem.

This unknown location within the metadata field, and the variability of the metadata information, make it almost impossible to isolate and reparse the targeted item (e.g. Last Printed:). This program identifies the targeted (user identified) sub-field, extracts and isolates its data, and makes a separate delimited item/column of the data. Then, when the resulting data file is imported to a spreadsheet such as Excel, that user identified sub-field is its own unique column within Excel, and can be processed like any other column. If you have ever tried to parse the metadata field you know what we are talking about.

Said another way: this program takes the metadata field and, based on the user's input (hopefully properly and correctly researched information), parses the metadata field to locate the semi-colon delimited field(s) needed. It then separates out each selected field into its OWN separate tab delimited column, which imports cleanly to the spreadsheet. The original data record is not changed, except that it now has added tab delimited items.

The original field which this program was written to parse was the "Last Printed:" date item within the metadata of documents and spreadsheets. It has also been tested on email, Exif, and link file metadata, and seems to work with all of these metadata fields. Any feedback on its operation is appreciated.

CAUTION: There is one caveat. X-Ways ALLOWS carriage returns to be embedded within the metadata field of the "Export List" output. These embedded carriage returns are usually a result of parsing email items but, regardless of the source file, they will cause a spreadsheet (Excel) to have major problems. This program finds those embedded carriage returns and converts them to spaces, so the resulting output imports easily and cleanly to the spreadsheet. If you add a redirection    "2> CR_error_filename"    the program will create an error file that lists which lines in the input file had embedded carriage returns and were fixed. This fix is ONLY needed in the "Export List" process, and not the report processing.

If you only want to convert the carriage returns embedded within the metadata field, simply select one metadata field (any field will do) and run the program. Your output contains the added field, and the metadata field itself now has the carriage returns converted.
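For readers who want to see what the carriage return fix amounts to, here is a minimal sketch. It is not the program's own code. It assumes that every record ends with a carriage return and line feed pair, and that an embedded carriage return is a lone carriage return inside a record. The function name fix_embedded_cr is hypothetical.

```python
def fix_embedded_cr(data: str):
    """Return (cleaned_text, numbers_of_records_that_were_fixed).

    Assumption: records are terminated by CR LF, and any other CR inside
    a record is an embedded carriage return to be converted to a space.
    """
    fixed = []
    cleaned = []
    for number, record in enumerate(data.split("\r\n"), start=1):
        if "\r" in record:
            fixed.append(number)               # remember which record was repaired
            record = record.replace("\r", " ")
        cleaned.append(record)
    return "\r\n".join(cleaned), fixed
```

When reading the export file for this sketch, open it with newline="" so that Python does not translate the line endings before the function sees them. The list of fixed record numbers plays the role of the CR_error_filename report.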

04-04-2012 NOTE: Thanks to Jimmy Weg, I have made a second program called x-ways_report_process.exe that is designed to work on the metadata field included in the X-Ways HTML report files. It searches the Metadata: line and selects only those segments which the user requests on the command line. The user can input up to 10 items to select on the command line. Alternatively (4/9/2012), if the 3rd item on the command line (which would normally be the metadata item searched for) is replaced by the name of a file containing the items, then those items are searched for. The metadata items must be one per line in a text file. See the command lines below.



Operation

This program takes a single filename as its input, and up to ten other items on the command line. Be careful, this IS a COMMAND LINE program.

The program takes the input filename and adds _tmp to the name, just before the extension, to generate the output filename. So an input file named xways_export.txt will generate an output file named xways_export_tmp.txt. Look for a new output filename similar to the input, with the added _tmp.
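The naming rule can be sketched in a few lines. This is only an illustration based on the examples in this help file, and the function name output_name is hypothetical.

```python
import os


def output_name(input_name: str) -> str:
    """Insert _tmp between the base name and the extension."""
    root, ext = os.path.splitext(input_name)
    return f"{root}_tmp{ext}"
```

With this sketch, xways_export.txt becomes xways_export_tmp.txt and metadata_export.tab becomes metadata_export_tmp.tab, matching the names shown in this file.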

The input file should be the usual tab delimited file which is exported through the X-Ways "Export List" option. THE INPUT MUST BE A TAB DELIMITED FILE, and you must advise the spreadsheet program of this when importing the data. This is the default export format of the X-Ways "Export List" operation. The user may include in the exported record any other fields they would usually include. HOWEVER: The last field in the exported record MUST be the metadata field. This last field is the ONLY one which is searched for the item(s) provided by the user on the command line. If the metadata field is NOT the last field, the output file will not have the expected content.

The traditional format of the X-Ways metadata field from the "Export List" process is a single field within the tab delimited record. Below is a sample of three fields (path, hash, metadata). I split the metadata over two lines for easy reading. Notice that the metadata contains semi-colon delimited sub-fields, each named with a trailing colon: File name:, Sequence:, Version:, Length:, Cluster:, Modification:, Last Printed:

\WINDOWS\system32\config\Newsid Backup	      F67ACE253768387C57471BE55F051ABC	
File name: temRoot\System32\Config\DEFAULT;Sequence: 831;Version: 1.5;Length: 499712;
Cluster: 1;Modification: 12/20/2011  04:50:23;Last Printed: 12/25/2011  06:50:23

In the report HTML file, we find the format below. Notice that this version uses the HTML BR tag to indicate a line break. (Carriage returns inserted for clarity.)

Metadata: Width: 3296 < BR > Height: 2472 < BR > Orientation: 1 < BR > 
Software: OLYMUS CAMERA MODEL < BR > Equipment make: YOUR CAMERA COMPANY < BR > 
Model: THE MODEL < BR > Maker note: (12728 bytes) < BR > 
Keywords: any words in the metadata < BR > 
Date Original: 2011:03:20 13:27:26 < BR > 
Date digitized: 2011:03:20 13: 27:26 < BR > Thumbnail: true < BR > 
Focal length: 4.0 < BR > F number: 4.70 
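To illustrate how a Metadata: line of this form can be reduced to only the requested items, here is a minimal sketch. It is not the report processor's own code. It assumes the items are separated by BR tags, that a requested name must match the start of an item, and that each kept item is written on its own line labelled Metadata:. The function name filter_metadata_line is hypothetical.

```python
import re

# Matches <BR>, <br>, <br/> and variants with spaces inside the tag.
BR = re.compile(r"<\s*br\s*/?\s*>", re.IGNORECASE)


def filter_metadata_line(line, wanted):
    """Return the output lines for one input line of the report.

    Lines that do not start with Metadata: are passed through unchanged.
    """
    label = "Metadata:"
    if not line.startswith(label):
        return [line]
    items = [item.strip() for item in BR.split(line[len(label):])]
    kept = [item for item in items
            if any(item.startswith(name) for name in wanted)]
    return [f"{label} {item}" for item in kept]
```

The sketch does not write any HTML tags back after each kept item. Whether the real program does so is not stated here, so check the actual output report.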

Within this "Export List" metadata field, X-Ways usually separates the metadata items with semi-colons (;), so the metadata field holds multiple semi-colon delimited items. One of these items is what the user will probably be looking for. One common item is the "Last Printed:" date of Office documents. If available, this "Last Printed:" date will be one of the semi-colon delimited items within the metadata field. In other instances, there may not be any metadata at all, or the item being looked for may not be part of the metadata extracted. These are the three possibilities. If you don't know what this is referring to, don't bother to read on.

On the command line, after the input filename, you are required to provide a search string (or a text file listing the fields to search for, one item per line). The search string is the name of the semi-colon delimited field to look for within the metadata field. For the purposes of further discussion we will use the "Last Printed:" field name, which is sometimes part of the metadata of Office documents. Notice that the actual name of the field usually ends with a colon (:). This is how X-Ways seems to identify the item name.

The program reads each record within the input file. It then finds the last tab delimited field (which MUST be the metadata field). Within the metadata field, it looks for the string(s) which the user has input, in this case "Last Printed:". The search is case sensitive, so be aware of any anomalies that might exist in the X-Ways data record, especially the CaSe of the item being sought.

Once the string is located, the program assumes it is the semi-colon delimited field to extract. It outputs the first part of the record, up to the metadata field. It then outputs the subset of the metadata field which the user asked for. Finally, it outputs the complete metadata field as it was in the original X-Ways output record.

What we end up with is the searched for field, tab delimited, inserted just BEFORE the original metadata field.

This stand alone tabbed field is now properly formatted, so that when the user imports the resulting output file into a spreadsheet the field is easily identified and processed.
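The record handling described above can be sketched as follows. This is only an illustration and not the program's own code. It assumes that a search string must match the start of a semi-colon delimited sub-field, that the inserted column holds the whole sub-field (name and value), and that a missing item produces an empty column. The function name add_columns is hypothetical.

```python
def add_columns(record, wanted):
    """Insert one tab delimited column per wanted item before the metadata field.

    Assumption: the metadata field is the last tab delimited field of the record.
    """
    fields = record.rstrip("\r\n").split("\t")
    metadata = fields[-1]
    subfields = [sub.strip() for sub in metadata.split(";")]
    extracted = []
    for name in wanted:
        # Case sensitive match on the start of each sub-field.
        match = next((sub for sub in subfields if sub.startswith(name)), "")
        extracted.append(match)
    # Leading fields, then the extracted columns, then the untouched metadata.
    return "\t".join(fields[:-1] + extracted + [metadata])
```

For a record holding path, hash and metadata, asking for "Last Printed:" gives four columns: path, hash, the Last Printed: sub-field, and the original metadata field.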



Command Lines


C:> x-ways_report_process.exe  report_input.html  file_containing_metadata_fields_to_look_for   (preferred version)
C:> x_ways_meta_processing     inputfilename.txt  "String_to_search_for:"  "Another_string_max_of_10:"
C:> x_ways_meta_processing     inputfilename.txt  "Last Printed:"
C:> x_ways_meta_processing     inputfilename.txt  "Last Printed:"   2> CR_error_filename
C:> x-ways_report_process.exe  report_input.html  "Last Printed:" "Keywords:" 

Notice that all the strings to search for in the command lines above terminate in a colon (:). This is because, in my research, most if not all of the metadata field names within the X-Ways metadata column end with a colon. The colon is not required, but it seems to be the standard.

The program will attempt to locate each "String_to_search_for" field within the metadata field, and extract it to another tab delimited field within a new output file.

Please note that x_ways_meta_processing.exe can search for a maximum of ten metadata strings per run. For this reason, it is preferred that you use a text file which contains your strings. This makes the list easily modified and reusable.

Sample string(s) file
Company:
Creation:
Modification:
Last Saved:
Last Printed:
Author:
Subject:
Last Saved By:
Version:
See the full list of items I have found.
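Choosing between strings typed on the command line and a strings file can be sketched as below. This is only an illustration and not the program's own code. It assumes a single item that names an existing file is treated as the strings file, and that at most ten items typed on the command line are used. The function name load_search_strings is hypothetical.

```python
import os


def load_search_strings(args):
    """args holds the command line items that follow the input filename."""
    if len(args) == 1 and os.path.isfile(args[0]):
        # One item per line, blank lines ignored.
        with open(args[0], encoding="utf-8") as handle:
            return [line.strip() for line in handle if line.strip()]
    return list(args[:10])
```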

The redirection 2> to the CR_error_filename, (only used in the "Export List" processing) finds and lists those records in the input file which contain embedded carriage returns in the metadata field, and changes the embedded carriage return to blanks. The result is that the data file can easily and cleanly be imported to the spreadsheet.

There is a way to get the "report html process" version to create separate tagged Metadata: lines for each and every metadata item. It makes the report a lot cleaner and easier to read. If you wish to learn how to do this, give us a call: 770-242-6687 X 119.



Options

None, but there is an unusual way to search for items case insensitively.

By default, the strings are searched for case sensitively.
So you had better get the case correct.
However, if you call the program with ALL UPPERCASE characters (X_WAYS_META_PROCESSING), then the search is case insensitive. This case insensitivity is NOT currently available in the report processor program.
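The uppercase program name convention can be sketched as follows. This is only an illustration of the idea and not the program's own code. It assumes the name is taken exactly as typed on the command line and that only the name itself, without folder or extension, is checked. Both function names are hypothetical.

```python
import ntpath  # Windows style paths, since this is a Windows command line program


def wants_case_insensitive(argv0):
    """True when the program name was typed in ALL UPPERCASE."""
    stem = ntpath.splitext(ntpath.basename(argv0))[0]
    return stem.isupper()


def subfield_matches(subfield, name, ignore_case):
    """Compare the start of a metadata sub-field with a search string."""
    if ignore_case:
        return subfield.lower().startswith(name.lower())
    return subfield.startswith(name)
```

With this sketch, a search for "last printed:" finds "Last Printed: ..." only when the program was called as X_WAYS_META_PROCESSING.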



Fields I Found

Below are the fields I have found in the metadata column of X-Ways. I have yet to add any email (eml) fields. The list is long. When using these in the program, if your research confirms what we have here, be sure to include the colon as part of the field name. The colon is usually the field name terminator. Also, do proper research to determine the case of the field you are searching for. Many programs arbitrarily alter the case. Notice that some items below (see Content-type) appear in more than one case variation.

_EmailSubject:     _NewReviewCycle:     action-uri=http:     application-name:     Application:     AppVersion:     assetid:     Attach:     attached to a shape:     Attributes:     Attributes:     Author:     author:     Bit count:     Build identifier:     Build Number:     Build year:     CACHE-CONTROL:     Cache-control:     cache-control:     Canon:     Caption:     Category:     Category:     Channels:     Char Count:     Characters:     CharactersWithSpaces:     CLASSIFICATION:     Cluster:     Code page:     Comment:     Comments:     Company:     Compression:     Consistent:     Contact:     Content-Language:     Content-type:     content-type:     Content-Type:     Copy ID:     Copyright:     Copyrighted:     Copyrighted:     Created-with:     Creation Time:     Creation:     Creator Application:     Creator Host OS:     Creator Version:     CREATOR:     Creator:     CreatorTool Date digitized:     Date Original:     Date taken:     Date:     Description rdf:     Description:     description:     Detach:     DocSecurity:     DocumentID>adobe:     DocumentID>uuid:     DriveLetter:     Duration:     EmbeddedFile:     End time:     Equipment make:     Expires:     expires:     F number:     falseCreationD:     File format revision:     File history flags:     File name:     File size:     Files:     Finish:     Firmware:     Flags:     Focal length:     Format Tag:     GENERATOR:     Generator:     Height:     Hidden count:     Host Name:     http:     http:     https:     ID List:     Image Number:     IMG:     INAM:     Interpretation:     ISRC:     Item:     Keywords:     Last Accessed:     Last Opened By:     Last Printed:     Last Saved By:     Last Saved:     Last Written:     Last-Modified:     Latitude:     Length:     Linearized:     Lines:     LinksUpToDate:     Local Path:     Locale identifier:     Logger name:     Longitude:     Lowest version:     MAC Address:     Machine:     mailto:     Maker note:     Manager:     Manufacturer:     mode :    
 Model:     Modification:     Moved to recycle bin:     Network share name:     noquick:     Note count:     Note:     Object ID:     Orientation:     Originator:     OS Version:     OS:     Owner:     Page count:     Pages:     Paragraphs:     PASSWORD:     pics-label:     Play Duration:     Pragma:     pragma:     Presentation Target:     Producer:     ProgId:     progid:     propID:     PROTECT:     RATING:     REFRESH:     Refresh:     refresh:     Relative Path:     Repair count:     ROBOTS:     Robots:     robots:     Root cell:     Saved State:     ScaleCrop:     searchid:     Security Level:     Sequence:     Serial number:     Service Pack:     Set ID:     SharedDoc:     signed:     Signing date:     Size:     Software:     Source Computer:     SourceModified:     Start time:     Start:     State:     Stream Type:     Subject:     subject:     Target Attributes:     Target Created:     Target File Size:     Target Path:     Template:     theme:     Thumbnail:     Timestamp:     Title:     TotalTime:     Type:     Unique ID:     URL=http:     url=http:     URL=https:     User Comment:     Version:     viewport:     Volume ID:     Volume Name:     Volume Serial:     Volume Type:     Volume:     Wantlive:     Width:     Words:     Work:    



Related Programs

EM_PROCESS  A sister program which can easily separate the header fields within eml files.

CSV2PIPE  Is capable of removing embedded carriage returns from csv files.
