EML_PROCESS




Author: Dan Mares, dmares @ maresware . com
Portions Copyright 2016 by Dan Mares and Mares and Company, LLC
Phone: 678-427-3275
Last Update: 07/31/2013
08-24-2012: found additional headers in Thunderbird forwards and now processes these items.
08-23-2012: added filesize to output fields.
08-22-2012: fixed all "known" date field conversion problems.
08-21-2012: fixed date field conversion when year < 2000.
08-20-2012: fixed comma delimited (csv) data output.
08-20-2012: removed the --NOQUOTE option in favor of better csv process.
08-17-2012: added a clean numeric (YYYY-MM-DD) date field for sorting.
08-15-2012: added --NOQUOTE option.
08-12-2012: added full unicode filename capability.
07-31-2013: added attachment identification for X-Ways extracted eml files.

This is a command line program.
MUST be run within a command window as administrator.


Purpose

This program was designed to work with .eml files and extract or parse the header items. It also attempts to pull "header" data from any forwards that might be located within the eml file.

It reads the eml file and attempts to find the header items, then selects those items and places them into a single output file that is pipe ( | ) delimited. This output file can then be imported to Excel, reprocessed, or filtered by the Verticle program.

It finds the filename and filesize. In addition, the header items it attempts to locate and parse are: To:, From:, CC:, BCC:, Date:, Subject:, Message Read Indicator:, Attachment information:, and Recent IP values. If IP values are of special importance, you should also consider running the .eml files through the URL_SRCH program.

Fields output:
All items are looked for first in the main or top level header of the email (eml) file. In the From, To, CC, BCC items, the #1 index will most likely reflect the first/main header content. Any additional headers will not be indexed, but will contain the indicator of (#body).
Any header items not located will contain blank data. Not all items are always found in the file headers.

Field #1: Filename: the system filename. If the eml file was output as an X-Ways child, the path will most likely contain unprintable unicode characters.
Field #2: Filesize: of the file on the disk.
Field #3: From: the first From: line located.
Field #4: To: The content of the destination To: line.
Field #5: CC: The content of the destination Cc: line.
Field #6: BCC: The content of the destination Bcc: line.
Field #7: Date: The literal Date: line found in the header. Often this may be blank.
Field #8: Date-Time: The Date line converted to a YYYY-MM-DD hh:mm:ss TZ format for easy sorting.
Field #9: Forward Date: If this appears to be a forwarded message, this date will be the original send date.
Field #10: Subject: The subject line.
Field #11: Message Read: If it can be determined that the message was read, a "YES" will appear. Otherwise an "Unknown" may be shown. Since not all mail processors appear to mark read messages the same way, this may be incomplete.
Field #12: Attachment Info: If available, all items located as attachments will be listed. They are not necessarily all tied to the current header.
Field #13: Recent IP: All the IP addresses located in the various headers. May be incomplete depending on how/where the IP's are listed in the headers.
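
For anyone reprocessing the output with a script rather than a spreadsheet, the field list above can be mapped onto a pipe delimited row as in the following minimal Python sketch. The column labels in FIELDS are hypothetical names chosen for this example (the program does not define them), and the sample row is made up for illustration.

```python
import csv
import io

# Column labels follow the order of the "Fields output" list above.
# The identifiers themselves are hypothetical names for this sketch.
FIELDS = [
    "filename", "filesize", "from", "to", "cc", "bcc", "date",
    "date_time", "forward_date", "subject", "message_read",
    "attachment_info", "recent_ip",
]

def read_records(stream, delimiter="|"):
    """Yield one dict per output row, padding missing trailing fields with ''."""
    reader = csv.reader(stream, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # skip blank lines
        padded = row + [""] * (len(FIELDS) - len(row))
        yield dict(zip(FIELDS, padded))

# Made-up sample row, for illustration only.
sample = (
    "msg1.eml|2048|(#1) a@example.com|(#1) b@example.com|||"
    "Wed, 9 May 2012 13:46:32 -0400|2012-05-09 13:46:32 -0400||Hello|YES||\n"
)
records = list(read_records(io.StringIO(sample)))
```

Fields that were not located come back as empty strings, matching the "blank data" behavior described above.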


Note: Caution

Please be aware, and do your own research (you do know how to do that, don't you?) to confirm, that many of the programs which "extract" eml data and create eml files format the header fields to their own liking: adding some, removing others, and altering the format of those that remain. This means that, for instance, in the date field some exports may include the day of the week (Mon.), some may include the UTC offset, and others may not include a date field at all. Specifically, the attachment information uses different notation when extracted/exported from various forensic software. If the program doesn't properly identify the attachment information, please provide a sample of the eml file which contains the attachment for research.

In designing and testing this program I found that depending on the source or program that generated the individual .eml file, the header items are often formatted differently. For instance, Thunderbird saves the eml file without a lot of "MAPI-" designators, especially when identifying any attachments. Nuix, on the other hand, appears to place MAPI- designators all over the place. And a third program apparently creates completely separate lines for the To: items if there are a number of persons the email is sent to. All of these differences (let's call them undocumented enhancements) may have some impact on the output accuracy. If the program fails to pick up all the appropriate header information, please let me see a copy of the item, so I can determine what new format was introduced into the eml header. Thank you. dmares at dmares dot com


 

(Mozilla) ThunderBird forwarded message headers.

When researching some of the Thunderbird generated eml messages, it was found that message forwards within the document often had the keyword lines (From:, To:, CC:, etc.) prefaced with either an asterisk (*From:) or a greater than sign (>From:), or some variation of either. Since this syntax does not appear in any standard, it had to be hard coded to find these embedded "forwarded" items within the Thunderbird text. The versions after 8/24/2012 should find these items with no problem and add them as (#body) items.
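
To check suspect files by hand, the prefixed keyword lines described above can be matched with a regular expression. This is a minimal Python sketch, not the program's actual code. The pattern and the function name match_forward_header are assumptions of this example, as is the tolerance for repeated prefixes and for an asterisk following the colon.

```python
import re

# Keywords come from the text above. Requiring at least one "*" or ">"
# prefix, and allowing repeats or a trailing "*" after the colon, are
# assumptions of this sketch.
FORWARD_HEADER = re.compile(
    r"^\s*[>*][\s>*]*(From|To|Cc|Bcc|Date|Subject):\*?\s*(.*)$",
    re.IGNORECASE,
)

def match_forward_header(line):
    """Return (keyword, value) for a prefixed keyword line, else None."""
    m = FORWARD_HEADER.match(line)
    if not m:
        return None
    return m.group(1).capitalize(), m.group(2).strip()
```

A line such as "*From: alice@example.com" yields the keyword and the value, while an unprefixed "From:" line returns None and would be handled as an ordinary header.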


Operation

This program finds files pointed to on the command line with the -f (file) option. It expects that you will only point it to eml formatted files (-f *.eml). The -f *.eml option is mandatory, unless all the files in the current tree/path are .eml files.

It opens and reads each file, attempting to locate the header items within each file. Because it reads one line at a time, if the file is large with embedded attachments, the program may seem slow. But it does perform well. If there are multiple header items, representing forwarded e-mails, it will find the additional headers also. In the output record you will see the (From:, To: and Cc:) columns with items indexed such as (#1) or (#body). If the index is (#1), then that item was found in the main or first set of headers. Any subsequent From and To headers are identified with the (#body) designator. This is to assist in identifying possible "forwards".

When dealing with the items associated with header #1, the program assumes the first header is ended when the first line identifying a "Subject:" is found. If the eml file does not contain a Subject: line, then the sequence and #body identifier may be in error. This was found in a number of oddly formatted eml files, and should be reviewed closely by the examiner. Fields other than the From:, To:, and CC: fields will get sequence numbers based on the occurrence in which they were found. There should be no assumption that date #1 refers to #1 From. It merely means it is the first date field located. As always, the user must take responsibility for confirming the actual content of the eml file.
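
The "first Subject: line ends header #1" rule described above can be illustrated with a short Python sketch. It is a simplified model under stated assumptions, not the program's source: the function name tag_headers is hypothetical, and continuation lines and prefixed forward keywords are ignored.

```python
def tag_headers(lines):
    """Tag From/To/Cc/Bcc lines '#1' until the first Subject: line, '#body' after."""
    tagged = []
    seen_subject = False
    for line in lines:
        keyword, sep, value = line.partition(":")
        if not sep:
            continue  # no colon, so not a keyword line
        key = keyword.strip().lower()
        if key in ("from", "to", "cc", "bcc"):
            tag = "#body" if seen_subject else "#1"
            tagged.append((tag, keyword.strip(), value.strip()))
        elif key == "subject":
            seen_subject = True  # everything after this counts as body
    return tagged

# Made-up message, for illustration only.
message = [
    "From: a@example.com",
    "To: b@example.com",
    "Subject: Fwd: hi",
    "",
    "From: c@example.com",
    "To: a@example.com",
]
tags = tag_headers(message)
```

The sketch also shows the weakness of the rule: a main-header To: line that happens to follow the Subject: line would be tagged (#body), which is why the examiner has to confirm the tags against the original file.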

Once it finds the header items, it copies them to the output file. The -o outputfilename option should be used, otherwise the output goes to the screen, and is probably useless. If you make the outputfile a .txt, it will import relatively painlessly into Excel. The information for each input file is placed in a new line or row, and the data is pipe delimited. When loading to a spreadsheet, you MUST be sure to identify the delimiter as a pipe, and not any other delimiter. This output file can then be imported into a spreadsheet, or reprocessed with the verticle program.

If you prefer to have the output with one line per item (field), then take the output of this program and pass it through the verticle filter. View the verticle help file for details.

The (#1) index numbers are added for each "main or primary" header item located within that message. To differentiate between a main or primary header and subsequent headers which are the results of "Forwards", the (#1) is used for the main header information, and if subsequent header information is located, then the (#body) indicator is used. The only test to determine if an item is within the primary or body header, is that we have already passed the first "Subject:" line. In some instances, this is not an accurate analysis, and the user must confirm the (#body) locations. Any match between the (#1) in the From, To, CC fields and the other columns provided MUST be validated by the user. A primary case might be the Attachment column. The indicators or indexes within the attachment column merely indicate that the (#x) value is a sequence for an attachment. The (#5) attachment for example, may be part of the main message, or one of the forwards. NO ATTEMPT IS MADE TO ASSOCIATE A SPECIFIC ATTACHMENT WITH ITS MAIN OR FORWARDED MESSAGE BODY

It is STRONGLY RECOMMENDED that the user review the header of the particular email format they are viewing, and become comfortable with the formats and layouts of each of the headers. Often the forwarded header information may be altered or in an unusual format.

The Message read column is found in the Mozilla status line, and is either a 0 or 1 within the header. This may not be totally accurate depending on other items in the status line, so the Message read indicator may be unclear. If a MAPI email is processed, it is obtained from the "Mapi-Message-Flags:" item. Neither source of the Read field is guaranteed to be accurate. The original message header should be reviewed in total to determine any other modifications.

A note about attachment file names with the format: ATTxxxxx.htm. This attachment is usually associated with a signature or other non-essential file which is part of a wireless communication. In research and review, it can usually be an ignored attachment.

A note about the date formats found in the files. I have found that the date will show up in one of three formats.
Date: Wed, 9 May 2012 13:46:32 -0400
Date: Wed, May 9, 2012 13:46:32 -0400
Date: Sun, Jun 20 04:40:13 1999 MS
Notice the 2nd version has the day of the month following the literal month, while the first has it before the month. Most of the programs will form the date as the first format (Date: Wed, 9 May 2012). The program picks up this format to convert the literal text to a numeric value in a new field for better sorting. If it finds the date in the 2nd format, the date converted field is generally correct, while the 3rd format is not yet processed properly. One day I might fix this, but for now, it doesn't seem to be much of a problem, as the 3rd date format does not appear to be a standard.
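
If you need to re-check or redo the conversion yourself, the three layouts listed above can be turned into the sortable YYYY-MM-DD form with Python's standard library. This is a minimal sketch and not the conversion code used by the program. The function name sortable_date is hypothetical, English day and month names are assumed, and the trailing "MS" in the third sample is simply dropped because its meaning is not documented here.

```python
from datetime import datetime

# strptime patterns for the three layouts listed above.
PATTERNS = [
    "%a, %d %b %Y %H:%M:%S %z",   # Wed, 9 May 2012 13:46:32 -0400
    "%a, %b %d, %Y %H:%M:%S %z",  # Wed, May 9, 2012 13:46:32 -0400
    "%a, %b %d %H:%M:%S %Y",      # Sun, Jun 20 04:40:13 1999
]

def sortable_date(header_line):
    """Return 'YYYY-MM-DD hh:mm:ss [TZ]' for a Date: line, or '' if unparsed."""
    text = header_line.strip()
    if text.lower().startswith("date:"):
        text = text[5:].strip()
    # Try the full text first, then the text with its last token removed
    # (assumption: a trailing token such as "MS" is not part of the date).
    candidates = [text, text.rsplit(" ", 1)[0]]
    for candidate in candidates:
        for pattern in PATTERNS:
            try:
                parsed = datetime.strptime(candidate, pattern)
            except ValueError:
                continue
            return parsed.strftime("%Y-%m-%d %H:%M:%S %z").strip()
    return ""
```

When the source line carries no UTC offset, as in the third layout, the result has no TZ portion, so such rows should not be compared against offset-bearing rows without review.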

On Aug. 22, 2012, these date field conversion problems were fixed. But without any beta testers willing to test the progress, this version has not been released. If you have problems with the date processing, contact us for the fixed version.



Command Lines

C:>   eml_process    -[OPTIONS]
C:>   eml_process    -f *.eml    -o outputfilename.txt


Options

Traditional path, file, output options.

-p + path(s)    If more than one directory is to be looked at, then add the paths here as appropriate. (-p c:\windows d:\work)

-f + filespec    If more than one file type is needed, add them here. (-f *.c *.obj *.dll)

If the above options are used, the program builds a matrix of paths and file types. It searches all the requested directories for all the requested file types, thus producing a total of all the files in all the paths requested. These options are added to any default command line provided.
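
The "matrix of paths and file types" described above amounts to pairing every requested path with every requested filespec. The Python sketch below illustrates the idea only. The function names are hypothetical, and whether the program itself descends into subdirectories is not stated here, so this example searches only the directories given.

```python
import itertools
from pathlib import Path

def build_search_matrix(paths, filespecs):
    """Pair every path with every filespec, as the -p and -f options imply."""
    return list(itertools.product(paths, filespecs))

def find_files(paths, filespecs):
    """Collect files matching any filespec in any path (non-recursive, an assumption)."""
    found = []
    for directory, spec in build_search_matrix(paths, filespecs):
        found.extend(sorted(Path(directory).glob(spec)))
    return found

# Two paths and three filespecs give six (path, filespec) pairs to search.
matrix = build_search_matrix(["c:\\windows", "d:\\work"], ["*.c", "*.obj", "*.dll"])
```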

-o outputfilename:     output file to create.

-d delimiter:    The default delimiter is the pipe (|) symbol. There are no programs that reprocess data that can't handle a pipe delimited file. This is because it is a reserved character, and is almost never (at least not as often as commas) found within a normal data field. To change the delimiter to something else, e.g. a tab, use the -d option followed by the decimal number of the delimiter value. So a tab delimiter would be: -d 09. A comma delimiter can be quoted: -d ",". If the comma delimiter is used, then the traditional quotes around text fields are inserted so that Excel and other programs properly recognize field delimiters. The quotes are only added for the -d , option.

When importing the output of this program to a spreadsheet, you MUST inform the spreadsheet that the data being imported is pipe delimited. Otherwise the import will NOT work.
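
The Python sketch below shows one way to read a -d style value (a decimal character code such as 09, or a literal character) and why quotes matter once commas are used as the delimiter: the Date field itself contains a comma. Both function names are hypothetical, quoting every field is an assumption of this example, and the program's own parsing of -d may differ.

```python
import csv
import io

def delimiter_from_option(value):
    """Interpret a -d style value: decimal character code ('09') or a literal character."""
    return chr(int(value)) if value.isdigit() else value

def pipe_to_csv(pipe_text):
    """Rewrite pipe delimited rows as comma delimited rows with every field quoted."""
    out = io.StringIO()
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for line in pipe_text.splitlines():
        writer.writerow(line.split("|"))
    return out.getvalue()

# Made-up row: the comma inside the date would split the field without quotes.
converted = pipe_to_csv("a.eml|Wed, 9 May 2012|Hello\n")
```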


Anomalies Located

In processing .eml files that have been exported from various source programs (e.g. X-Ways, NUIX, Thunderbird, etc.) I have found that the header information is not consistent across all platforms. One example is already identified above in the Date: parsing section. Some programs export the date in DD mon YYYY, others use mon DD YYYY, others use mon DD YY, Date: Sun, Jun 20 04:40:13 1999 MS, etc. So far, the format "Sun, Jun 20 04:40:13 1999", where the time is inserted prior to the year, is still not parsing correctly. It is a minor set of data and the user can easily find and correct the incorrect parsing. In other situations, the header fields have tab characters where there are usually spaces. This will throw off the delimiters if you ask for a tab delimiter (-d 09). In short, if the output is inconsistent with what you know (and you better research your own output) to be a valid format, please let me know, and if possible send a sample of the file. I can't fix what I don't know is broken.

The program is designed to output its format as a pipe (|) delimited file. Since this format has been around since before many knew what a PC was, it (the program) performs best when using the default output delimiter, the pipe. Unless you have a real need to use other delimiters, please use the default.


Related Programs

Verticle   turns the pipe delimited file into line items for insertion into reports.

X-Ways Metadata   processing. Processes the metadata field in the X-Ways "Export List" output.
