X_WAYS_META_PROCESSING




Author: Dan Mares, dmares @ maresware . com
Portions Copyright 2012-2016 by Dan Mares and Mares and Company, LLC
Phone: 678-427-3275
Last Update: 4/28/2015


20140106: Added 'X' capability to the requested field file.
20150425: Explained how to modify metadata field of deleted files $R... to run the keywords

This program currently does NOT support Unicode input files.
This is a command line program.
MUST be run within a command window as administrator.


This program comes in two versions. Be sure you use the correct version: don't mix the HTML report with the export data.
One version (the preferred method) processes the metadata field found within the X-Ways produced "Export List" tab delimited file, which many users import into a spreadsheet.
The second version processes the metadata line found in the X-Ways generated HTML report.

Both versions operate similarly and have similar command lines, but they work on different input files. Read this help file with the input file you are using in mind.


A small amount of sample data is available for download in a zip file. The zip file contains:
a number of sample keyword files, such as metadata_docs_keywords;
two sample TAB delimited X-Ways metadata files: DOCS and JPEG_META.TAB;
and a commands.bat file with two sample commands that show how to process the sample data.
Unzip the files, then run that batch file.

The output files have _tmp added to the filename:        DOCS_META_tmp.tab
Load (import) this file into a spreadsheet and take a look at the added columns. (Make sure your import setting is a tab delimited file.)


Purpose

To find (and isolate) semi-colon (;) delimited fields within the X-Ways metadata field that is exported during the X-Ways "Export List" operation, or HTML report generation of X-Ways. The default X-Ways export list is tab delimited, and this program ONLY works on those TAB delimited files.

The report that X-Ways generates is a typical HTML report. It contains a Metadata: field whose contents are usually individual metadata items separated by a carriage return or an HTML line break (BR) tag, similar to the sample shown here. It may contain a lot of irrelevant information.
Important note: See the section labelled HTML_REPORT below.

Metadata: Width: 4000
Height: 3000
Orientation: 1
Equipment make: SOME CAMERA MAKE
Model: KODAK
Keywords: help me
Date Original: 2009:08:20 10:27:26
Date digitized: 2009:08:20 10:27:26
Thumbnail: true
Focal length: 4.0
F number: 3.5

The "Export List" version, which deals with the X-Ways "Export List" option, takes the user identified field(s), isolates and extracts each one, and sets each up as its own tab delimited field within the output record that X-Ways generated. These newly inserted tab delimited fields can now be easily imported into a spreadsheet and manipulated by the user.

The resulting output record, now has an additional tab delimited field(s) which was identified as the semi-colon (;) delimited sub-field within the X-Ways metadata field. The original metadata field is not modified in any way, and is always maintained in the newly designed output record.


HTML_REPORT

The HTML report processing version takes the selected or requested metadata fields and these are the only metadata components included in the resulting html report. Any additional unused/unnecessary Metadata items are removed from the output html file. This reduces erroneous, and extra html data which the user feels may be unnecessary in the final HTML report. Each (new) line in the resulting html output file is now also labelled Metadata:

The "Metadata:" line in the report MUST be the first item on the line in the html report and be left justified with no spaces. If you wish to change the Metadata: field to be BOLDED so it will stand out, you may do this. But you can only use the < B > and < / B > html tags. Those are the only html tags that the program understands when looking for the Metadata field name. If you use the < STRONG > tag it will not work. If the program does not respond properly, (meaning it does not properly find and parse the Metadata field) please open the html report with a text editor and do a search and replace. Remove the bolding or any other formatting around the word Metadata, and make it left justified on the line. In other words: replace the string < B >Metadata: < / b >, with just Metadata: Open and examine the X-Ways generated html report, and you will understand this above restriction.
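The manual search and replace described above can also be scripted. The following is only a sketch, not part of either program: the function name unbold_metadata is hypothetical, and it assumes the bolding is a plain B or STRONG tag pair wrapped directly around the word Metadata:.

```python
import re

# Matches "Metadata:" wrapped in <b>...</b> or <strong>...</strong>, in any
# letter case, tolerating spaces inside the tags and around the label.
BOLD_METADATA = re.compile(
    r"<\s*(b|strong)\s*>\s*Metadata:\s*<\s*/\s*\1\s*>[ \t]*",
    re.IGNORECASE,
)


def unbold_metadata(html_text):
    """Replace a bolded Metadata: label with the plain label."""
    return BOLD_METADATA.sub("Metadata: ", html_text)


print(unbold_metadata("<B>Metadata: </b>Width: 3296 <BR> Height: 2472"))
```

Run this over a copy of the report, and check the result in a text editor before feeding it to the report processor.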


WHY THIS PROGRAM WAS WRITTEN

This program was created because, when X-Ways finds and extracts metadata from a file, it extracts many different metadata items. The metadata extracted differs depending on the type and content of the file being processed, so a large number of variable items end up in this metadata field. (See the list of metadata fields I have found, below.) Some of these, when available, are the Last Printed date, Author, various Exif data, date(s), camera type, and other useful metadata. However, since the extracted metadata field is basically a field of free form, semi-colon (;) delimited sub-fields, it is not easy to identify the targeted item (e.g. Last Printed date) or to know where within the metadata field it is located. If you have ever imported the export list into a spreadsheet you have experienced this problem.

This unknown location within the metadata field, and the variability of the metadata information, make it almost impossible to isolate and reparse the targeted item (e.g. Last Printed:). This program identifies the targeted (user identified) sub-field, extracts and isolates its data, and makes a separate delimited item/column of the data. Then, when the resulting data file is imported into a spreadsheet such as Excel, that user identified sub-field is its own column and can be processed like any other column. If you have ever tried to parse the metadata field you know what we are talking about.

Said another way: this program takes the metadata field and, based on the user's input (hopefully properly and correctly researched), parses it to locate the semi-colon delimited field(s) needed. It then separates each selected field into its OWN tab delimited column, which imports into a spreadsheet very nicely. The original data record is not changed, except that it now has added tab delimited items.

The original field which this program was written to parse was the "Last Printed:" date item within the metadata of documents and spreadsheets. It has also been tested on email, Exif, and link file metadata, and seems to work with all of these metadata fields. Any feedback on its operation is appreciated.

CAUTION: There is one caveat. X-Ways ALLOWS carriage returns to be embedded within the metadata field during the "Export List" process. These embedded carriage returns are usually a result of parsing email items but, regardless of the source file, they will cause a spreadsheet (Excel) to have major problems. This program finds those embedded carriage returns and converts them to spaces. The resulting output can then be imported cleanly into the spreadsheet. If you add a redirection    "2> CR_error_filename"    the program will create an error file listing the lines in the input file that had embedded carriage returns and were fixed. This fix is ONLY needed in the "Export List" process, not in the report processing.

If you only want to convert the carriage returns embedded within the metadata field, simply select one metadata field (any field will do) and run the program. Your output contains the added field, and the metadata field itself now has the carriage returns converted.
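To illustrate the kind of repair described above, here is a rough sketch. It is NOT how the program itself is known to detect embedded carriage returns. It assumes that the first line of the export is a complete record, that every complete record has the same number of tabs, and that the metadata is the last field, so a line with fewer tabs must be the continuation of the record before it. The function name join_broken_records and the sample records are made up for the illustration.

```python
def join_broken_records(lines):
    """Merge continuation lines back into the record they belong to.

    Assumptions (not taken from the program itself): the first line is a
    complete record, every complete record has the same number of tabs, and
    the metadata is the last field, so a continuation line has fewer tabs.
    Returns (records, fixed), where fixed lists the 1-based input line
    numbers at which each repaired record started.
    """
    expected_tabs = lines[0].count("\t") if lines else 0
    records, starts, fixed = [], [], []
    for number, line in enumerate(lines, start=1):
        if records and line.count("\t") < expected_tabs:
            records[-1] += " " + line      # embedded break becomes a space
            if not fixed or fixed[-1] != starts[-1]:
                fixed.append(starts[-1])
        else:
            records.append(line)
            starts.append(number)
    return records, fixed


# Made-up sample: the second record is broken across two lines.
sample = [
    "Path\tHash\tMetadata",
    "a.eml\tH1\tSubject: hello",
    "world;Sender: someone",
    "b.doc\tH2\tAuthor: nobody",
]
records, fixed = join_broken_records(sample)
print(records[1])   # the two halves of the a.eml record, joined by a space
print(fixed)        # input line numbers of the repaired records
```

The fixed list plays the same role as the error file produced by the 2> redirection: it tells you which input records needed repair.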

04-04-2012 NOTE: Thanks to Jimmy Weg, I have made a second program called x-ways_report_process.exe that is designed to work on the metadata field included in the X-Ways HTML report files. It searches the Metadata: line and selects only those segments which the user requests on the command line. The user can enter up to 10 items on the command line. OR (4/9/2012), if the 3rd item on the command line (which would normally be the metadata item searched for) is replaced by the name of a file containing the items, then those items are searched for. The metadata items must be one per line in a text file. See the command lines below.



Operation

This program takes a single filename as its input, and up to ten other items on the command line. Be careful, this IS a COMMAND LINE program.

The program takes the input filename, parses it, and adds an _tmp to the input name. Thus generating a new filename and uses this as the output. So an initial input file named: xways_export.txt will generate an output name: xways_export_tmp.txt. Look for a new output filename similar to the input, with the added _tmp.
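The naming rule above is easy to reproduce if you need to predict the output filename in a script. A minimal sketch (the function name output_name is hypothetical):

```python
import os


def output_name(input_name):
    """Insert _tmp between the base name and the extension."""
    root, ext = os.path.splitext(input_name)
    return root + "_tmp" + ext


print(output_name("xways_export.txt"))   # xways_export_tmp.txt
```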

The input file should be the usual tab delimited file which is exported through the X-Ways "Export List" option. THE INPUT MUST BE A TAB DELIMITED FILE, and you must advise the spreadsheet program of this when importing the data. This is the default export format of the X-Ways "Export List" operation. The user may include in the output record any other fields they would usually include. HOWEVER: The last field in the exported record MUST be the metadata field. This last field is the ONLY one which is searched for the item(s) provided by the user on the command line. If the metadata field is NOT the last field, the output file will not have the expected content.

The traditional format of the X-Ways metadata field from the "Export List" process is a single field within the tab delimited record. Below is a sample of three fields (path, hash, metadata). I split the metadata over two lines for easy reading. Notice that the metadata contains semi-colon delimited sub-fields whose names end in a colon: File name:, Sequence:, Version:, Length:, Cluster:, Modification:

\WINDOWS\system32\config\Newsid Backup	      F67ACE253768387C57471BE55F051ABC	
File name: temRoot\System32\Config\DEFAULT;Sequence: 831;Version: 1.5;Length: 499712;
Cluster: 1;Modification: 12/20/2011  04:50:23;Last Printed: 12/25/2011  06:50:23

In the report HTML file, the format is different. Notice that this version uses the HTML BR tag to indicate a line break. (Carriage returns inserted for clarity.)

Metadata: Width: 3296 < BR > Height: 2472 < BR > Orientation: 1 < BR > 
Software: OLYMUS CAMERA MODEL < BR > Equipment make: YOUR CAMERA COMPANY < BR > 
Model: THE MODEL < BR > Maker note: (12728 bytes) < BR > 
Keywords: any words in the metadata < BR > 
Date Original: 2011:03:20 13:27:26 < BR > 
Date digitized: 2011:03:20 13: 27:26 < BR > Thumbnail: true < BR > 
Focal length: 4.0 < BR > F number: 4.70 

Within this "Export List" metadata field, X-Ways usually separates the metadata items with semi-colons (;), so the metadata holds multiple semi-colon delimited items. One of these items is what the user will probably be looking for. One common item is the "Last Printed:" date of Office documents. If available, this "Last Printed:" date will be one of the semi-colon (;) delimited items within the metadata field. In other instances, there may not be any metadata at all, or the item being looked for is not part of the metadata extracted. These are the three possibilities. If you don't know what this is referring to, don't bother to read on.

On the command line, after the user provides the input filename, you are required to input a search string (or a text file, one item per line of the fields to search for). This search string is the name of the semi-colon delimited field within the metadata field which is the item to look for. For the purposes of further discussion we will use the "Last Printed:" field name which is sometimes part of the metadata of Office documents. Notice that the actual name of the field usually ends with a colon (:). This is how X-Ways seems to identify the item name.


X in the field file

SPECIAL CASE for not adding field title in output record

The Last Printed: field is displayed in the output record as:
Last Printed: 2009/07/13 19:43:12

Some persons have expressed concern that they do not wish to have the field name included in every record. They would rather show it for sorting purposes as
" 2009/07/13 19:43:12" or as
"2009/07/13 19:43:12"
Notice in version 2, there is no leading space.

If you DO NOT want to include the field name in every record, then when you create the field text file, put an upper case X or a lower case x after the item. The next lines show a file in which Last Accessed: will have the field name included, while Last Printed: and Last Modified: will not.
Last Accessed:
Last Printed:X
Last Modified:x

Notice the 'X' and 'x' after the colon. This tells the program to not print the field title in the individual output record.
However, the upper case X and the lower case x produce slightly different output.
The upper case 'X' causes the leading space to be maintained in the output. This is exactly how X-Ways outputs the data. However, when importing this field into Excel, some persons have complained that they can't get Excel to do a correct date sort. It appears that they cannot get Excel to do a YYYYMMDD sort. (Go figure.)
If you use the lower case 'x', the output field will not include the leading space: "2009/07/13". Then, when importing into Excel, Excel in its infinite wisdom, knowing better than you how you want the field displayed, will turn the date around and display it as 07/13/2009. This format apparently sorts better.
So, depending on how you want the output data to be interpreted by Excel, use either the upper or lower case 'X'. I personally prefer to leave it as original, and deal with Excel in my own way.
Think about the countries that use different formats:

01/03/2009    MM/DD/YYYY
03/01/2009    DD/MM/YYYY
2009/01/03    YYYY/MM/DD

Which version of the above, will ALWAYS be interpreted correctly?
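The effect of the X and x markers in the field file can be summarized in a short sketch. This only mirrors the behavior described in this section and is not the program's own code. The function names are hypothetical, and it assumes a trailing X or x counts as a marker only when it comes directly after the colon.

```python
def parse_field_line(line):
    """Split a field-file line into (field_name, marker).

    marker is "" (keep the field name in the output), "X" (drop the name,
    keep the leading space) or "x" (drop the name and the leading space).
    Assumption: the marker must directly follow the colon.
    """
    line = line.strip()
    if line[-1:] in ("X", "x") and line[:-1].endswith(":"):
        return line[:-1], line[-1]
    return line, ""


def format_value(name, raw_value, marker):
    """raw_value is the text after the field name, leading space included."""
    if marker == "X":
        return raw_value               # exactly as X-Ways wrote it
    if marker == "x":
        return raw_value.lstrip(" ")   # leading space removed
    return name + raw_value            # default: name plus value


for entry in ("Last Accessed:", "Last Printed:X", "Last Modified:x"):
    name, marker = parse_field_line(entry)
    print(repr(format_value(name, " 2009/07/13 19:43:12", marker)))
```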

The program reads each record in the input file. It then finds the last tab delimited field (which MUST be the metadata field). Within the metadata field, it looks for the string(s) which the user has input, in this case "Last Printed:". The search is case sensitive, so be aware of any anomalies that might exist in the X-Ways data record, especially the CaSe of the item being sought.

Once the string is located, the program assumes it is the semi-colon delimited field to extract. It outputs the first part of the record, up to the metadata field; then the requested subset of the metadata field; and finally the complete metadata field as it was originally in the X-Ways output record.

What we end up with is the searched-for field, tab delimited, inserted just BEFORE the original metadata field.

This stand-alone tabbed field is now properly formatted so that when the user imports the resulting output file into a spreadsheet, that field is easily identified and processed.
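The record handling described in this section can be pictured with the following sketch. It is an illustration only, not the program's source. The function name add_field_columns is hypothetical, and the choice to emit an empty column when a field is not found is an assumption made to keep the column count the same in every record.

```python
def add_field_columns(record, field_names):
    """Insert one new tab column per requested field, just before the
    final (metadata) column. The metadata column itself is left unchanged."""
    head, sep, metadata = record.rpartition("\t")
    new_columns = []
    for name in field_names:
        start = metadata.find(name)        # case-sensitive substring search
        if start == -1:
            new_columns.append("")         # assumption: empty column if absent
            continue
        end = metadata.find(";", start)    # sub-field runs to the next ';'
        if end == -1:
            end = len(metadata)
        new_columns.append(metadata[start:end])
    parts = ([head] if sep else []) + new_columns + [metadata]
    return "\t".join(parts)


record = "\\some\\path\tF67ACE25\tSequence: 831;Last Printed: 12/25/2011  06:50:23"
print(add_field_columns(record, ["Last Printed:"]))
```

The printed record has four tab delimited columns: path, hash, the extracted Last Printed: item, and the untouched metadata.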



Command Lines


C:> x-ways_report_process.exe  report_input.html  file_containing_metadata_fields_to_look_for   (preferred version)
C:> x_ways_meta_processing     inputfilename.txt  "String_to_search_for:"  "Another_string_max_of_10:"
C:> x_ways_meta_processing     inputfilename.txt  "Last Printed:"
C:> x_ways_meta_processing     inputfilename.txt  "Last Printed:"   2> CR_error_filename
C:> x-ways_report_process.exe  report_input.html  "Last Printed:" "Keywords:" 

Notice that all the strings to search for in the commands above end in a colon (:). This is because, in my research, most if not all of the field names within the X-Ways metadata column are terminated by a colon. It is not required, but seems to be the standard.

The program will attempt to locate the "String(s)_to_search_for" field within the metadata field, and extract it to another tab delimited field within a new output file.

Please note: when using the command line to identify the metadata field(s) you wish to locate, x_ways_meta_processing.exe can search for a maximum of ten metadata strings per run.

For this reason, it is preferred that you use the text file which contains your strings. This makes the list easily modified and reusable.

A sample text file might contain

Sample string(s) file
Company:
Creation:
Modification:
Last Accessed:X
Last Saved:
Last Printed:
Author:
Subject:
Last Saved By:
Version:
Creation Time:X
See the full list of items I have found.
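If you script around the program, the two ways of supplying field names (strings on the command line, or a text file with one item per line) can be handled along these lines. This is a sketch under an assumption: how the program itself decides that an argument is a filename is not documented here, so the sketch simply checks whether a single argument names an existing file. The function name load_field_names is hypothetical.

```python
import os

MAX_COMMAND_LINE_FIELDS = 10   # limit stated for the command line


def load_field_names(args):
    """Return the field names given directly or listed in a text file."""
    # Assumption: a single argument that names an existing file is a field file.
    if len(args) == 1 and os.path.isfile(args[0]):
        with open(args[0], encoding="ascii", errors="replace") as handle:
            return [line.strip() for line in handle if line.strip()]
    if len(args) > MAX_COMMAND_LINE_FIELDS:
        raise ValueError("at most 10 field names may be given on the command line")
    return list(args)


print(load_field_names(["Last Printed:", "Keywords:"]))
```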

The redirection 2> to the CR_error_filename, (only used in the "Export List" processing) finds and lists those records in the input file which contain embedded carriage returns in the metadata field, and changes the embedded carriage return to blanks. The result is that the data file can easily and cleanly be imported to the spreadsheet.

There is a way to get the "report html process" version to create separate tagged Metadata: lines for each and every metadata item. It makes the report a lot cleaner and easier to read. If you wish to learn how to do this, give us a call: 770-242-6687 X 119.



Options

None, but there is a weird way to search for items case insensitively.

The default is to search for the strings case sensitively.
So you better get it correct.
However, if you call the program with ALL UPPERCASE characters (X_WAYS_META_PROCESSING), then the search is done case insensitively. This case insensitivity is NOT currently available in the report processor program.
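The idea of switching behavior on the way the program name was typed can be sketched as follows. This is not the program's code, just an illustration of the rule above; the function names are hypothetical.

```python
import ntpath   # Windows path rules, usable on any platform


def case_insensitive_requested(invoked_as):
    """True when the program name was typed in ALL UPPERCASE."""
    name = ntpath.splitext(ntpath.basename(invoked_as))[0]
    return name.isupper()


def find_field(metadata, name, ignore_case):
    """Return the position of name in metadata, or -1 if absent."""
    if ignore_case:
        return metadata.lower().find(name.lower())
    return metadata.find(name)


ignore = case_insensitive_requested("X_WAYS_META_PROCESSING")
print(ignore, find_field("Author: x;last printed: y", "Last Printed:", ignore))
```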



Fields I Found

Below are fields I've found in the metadata column of X-Ways. I have yet to add any email eml fields. The list is long. When using these in the program, if your research confirms what we have here, be sure to include the colon as part of the field name. That is usually the field delimiter. Also, do proper research to determine the case of the field you are searching for. Many programs arbitrarily alter the case. Notice some items below (see Content-type) have more than one version.

_EmailSubject:     _NewReviewCycle:     action-uri=http:     application-name:     Application:     AppVersion:     assetid:     Attach:     attached to a shape:     Attributes:     Attributes:     Author:     author:     Bit count:     Build identifier:     Build Number:     Build year:     CACHE-CONTROL:     Cache-control:     cache-control:     Canon:     Caption:     Category:     Category:     Channels:     Char Count:     Characters:     CharactersWithSpaces:     CLASSIFICATION:     Cluster:     Code page:     Comment:     Comments:     Company:     Compression:     Consistent:     Contact:     Content-Language:     Content-type:     content-type:     Content-Type:     Copy ID:     Copyright:     Copyrighted:     Copyrighted:     Created-with:     Creation Time:     Creation:     Creator Application:     Creator Host OS:     Creator Version:     CREATOR:     Creator:     CreatorTool Date digitized:     Date Original:     Date taken:     Date:     Description rdf:     Description:     description:     Detach:     DocSecurity:     DocumentID>adobe:     DocumentID>uuid:     DriveLetter:     Duration:     EmbeddedFile:     End time:     Equipment make:     Expires:     expires:     F number:     falseCreationD:     File format revision:     File history flags:     File name:     File size:     Files:     Finish:     Firmware:     Flags:     Focal length:     Format Tag:     GENERATOR:     Generator:     Height:     Hidden count:     Host Name:     http:     http:     https:     ID List:     Image Number:     IMG:     INAM:     Interpretation:     ISRC:     Item:     Keywords:     Last Accessed:     Last Opened By:     Last Printed:     Last Saved By:     Last Saved:     Last Written:     Last-Modified:     Latitude:     Length:     Linearized:     Lines:     LinksUpToDate:     Local Path:     Locale identifier:     Logger name:     Longitude:     Lowest version:     MAC Address:     Machine:     mailto:     Maker note:     Manager:     Manufacturer:     mode :    
 Model:     Modification:     Moved to recycle bin:     Network share name:     noquick:     Note count:     Note:     Object ID:     Orientation:     (Original Filename: *** see below)    Originator:     OS Version:     OS:     Owner:     Page count:     Pages:     Paragraphs:     PASSWORD:     pics-label:     Play Duration:     Pragma:     pragma:     Presentation Target:     Producer:     ProgId:     progid:     propID:     PROTECT:     RATING:     REFRESH:     Refresh:     refresh:     Relative Path:     Repair count:     ROBOTS:     Robots:     robots:     Root cell:     Saved State:     ScaleCrop:     searchid:     Security Level:     Sequence:     Serial number:     Service Pack:     Set ID:     SharedDoc:     signed:     Signing date:     Size:     Software:     Source Computer:     SourceModified:     Start time:     Start:     State:     Stream Type:     Subject:     subject:     Target Attributes:     Target Created:     Target File Size:     Target Path:     Template:     theme:     Thumbnail:     Timestamp:     Title:     TotalTime:     Type:     Unique ID:     URL=http:     url=http:     URL=https:     User Comment:     Version:     viewport:     Volume ID:     Volume Name:     (Volume Serial: *** see below)    Volume Type:     Volume:     Wantlive:     Width:     Words:     Work:    
Original Filename:   use with metadata of $R... files

SPECIAL INSTRUCTIONS: READ CAREFULLY

The "Volume Serial:" number is the serial number given to the disk by Microsoft at the time of formatting. It is most easily seen when doing a "dir" of the drive. The response shows up as " Volume Serial Number is 1442-13FE". However, Microsoft stores the volume serial number in the boot record in little-endian fashion at displacement 72 (from 0). So if you are trying to confirm/find the serial number 1442-13FE at displacement 72, you would actually need to look for: FE134214 (without the dash). The link file internal record of the serial number is displayed as it is in the DIR command, so when looking at the raw (boot record) data, you need to convert to little-endian.
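The byte swapping described above can be checked with a few lines of code. This is a small sketch with hypothetical function names; the displacement of 72 is taken from the text above.

```python
def serial_to_boot_record_hex(serial):
    """Convert a DIR-style serial to the byte order stored on disk."""
    raw = bytes.fromhex(serial.replace("-", ""))
    return raw[::-1].hex().upper()       # reverse the four bytes


def serial_from_boot_record(boot_record, displacement=72):
    """Read the four serial bytes and format them the way DIR shows them."""
    value = int.from_bytes(boot_record[displacement:displacement + 4], "little")
    return f"{value >> 16:04X}-{value & 0xFFFF:04X}"


print(serial_to_boot_record_hex("1442-13FE"))   # FE134214
```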

X-WAYS and $R.... MetaData for Original Filename:

X-Ways $R... (recycled files) and obtaining the Original Filename

When X-Ways exports the metadata of the $R files, it produces a "Moved to recycle bin:" field like:

Moved to recycle bin: 2015/03/24 22:46:58.0 +0;C:\Users\DAN\Documents\Admin\Filename_whatever.pdf

Notice the actual original filename doesn't have the traditional colon (:) terminated field name before it.
It is merged with the Moved to... item as a single field. The default operation of this program will not be able to parse the original filename because it is combined with the MOVED date.

To allow for correct parsing of the original filename into a field, do the following.
Look at the time offset which was used. In this case it is +0 followed by a semicolon delimiter. Assuming the entire file has the same +0; offset, we can change the +0; to introduce a proper field name.
Perform a search and replace with the following parameters (using the offset as the key).

Find:              +0;
Replace with:      +0;Original Filename: 
This will fix the fields so that now we have:
Moved to recycle bin: 2015/03/24 22:46:58.0 +0;Original Filename: C:\Users\DAN\Documents\Admin\Filename_whatever.pdf

This now contains a new delimited field holding the original filename. Obviously, if the time zone offset is not +0, you have to use the representation that X-Ways has used in that field.
Add the line: Original Filename:
to your metadata field file, and run the program. TH TH THats all FOLKS.
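If you would rather script the search and replace, here is one possible sketch. It is a variation on the manual method, not the same thing: instead of keying on the +0; offset, it keys on the Moved to recycle bin: field name and labels whatever follows the first semicolon after it, so it does not depend on the offset value. The function name is hypothetical.

```python
import re

# The "Moved to recycle bin:" item up to and including its first semicolon,
# provided the label has not already been added.
MOVED = re.compile(r"(Moved to recycle bin: [^;\t]*;)(?!Original Filename: )")


def label_original_filename(text):
    """Give the original filename its own colon-terminated field name."""
    return MOVED.sub(r"\1Original Filename: ", text)


line = ("Moved to recycle bin: 2015/03/24 22:46:58.0 +0;"
        "C:\\Users\\DAN\\Documents\\Admin\\Filename_whatever.pdf")
print(label_original_filename(line))
```

Running it a second time on the same text changes nothing, so it is safe to apply to a file that has already been fixed.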



Bugs: Operational Challenges

Situation: You are working with document meta data, and have the following two fields identified in your list as wanting to segregate: AppVersion: and Version:

Problem: Because of the way the keyword searching for the field (Version:) is conducted, the AppVersion: field is often inadvertently found and segregated instead of the proper Version:, and the output fields spreadsheet will often contain the AppVersion value in the column identified as Version.

Workable Solution: In the file which contains the fields to search, change the AppVersion: item to AppVersions:. (Add an S to the item making it unique for the search algorithm). Then open the X-ways export text file and do a search and replace. Replace AppVersion: with AppVersions:. Don't forget to include the colon (:) in the search/replace fields. The adding of the S to the AppVersion field makes it unique enough to allow for a proper search. This is an innocuous change, and shouldn't give you any heartache when presenting the data.

Unfortunately, I have tried to fix the coding to remove this challenge, but have been unsuccessful while keeping the output column format consistent. The workaround is simple and easy. The same logic can be applied to any of the fields with similar names (OS Version: to OS Versions:).
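To see why the collision happens and why the rename works, consider this sketch. It assumes the search is a plain substring search, which is consistent with the behavior described above but is not confirmed from the program's source. The metadata values are made up for the illustration.

```python
def extract(metadata, name):
    """Return the sub-field starting at the first occurrence of name."""
    start = metadata.find(name)          # plain substring search (assumption)
    if start == -1:
        return ""
    end = metadata.find(";", start)
    return metadata[start:] if end == -1 else metadata[start:end]


# Made-up sample values.
metadata = "AppVersion: 12.0000;Version: 1.5;Author: someone"

# "Version:" is found inside "AppVersion:", so the wrong value comes back.
print(extract(metadata, "Version:"))     # Version: 12.0000

# After the workaround, "Version:" no longer matches inside "AppVersions:".
renamed = metadata.replace("AppVersion:", "AppVersions:")
print(extract(renamed, "Version:"))      # Version: 1.5
print(extract(renamed, "AppVersions:"))  # AppVersions: 12.0000
```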

Related Programs

X-WAYS_ID_rename  A sister program to take the X-Ways export list data and rename the exported files.

EML_PROCESS  A sister program which can easily separate the header fields within eml files.

CSV2PIPE  Is capable of removing embedded carriage returns from csv files.
