To search a file for occurrences of specific keys.
The program allows you to search fields in a record (row) for the occurrence of specified search keys. The file must be made up of fixed length records. If the records are not fixed length but are delimited, there are a number of Maresware programs that can make the records fixed length. There are a number of different modes which can be used to search a file.
If your input file is sorted on the key field, the BSEARCH program will be dramatically faster. BSEARCH can only be used with fixed length records.
This mode searches for keys in one specific field of each record.
The 16 bit version of SEARCH has a search key limit of 2000 keys or a parameter file size of 128000 bytes, whichever is larger. The 16 bit version also requires that the keys be sorted before the program is run. Since the 16 bit version is no longer actively supported and is superseded by the 32 bit version, these limits should be of little consequence.
In the 32 bit version the key file is limited only by available memory: there must be enough free memory to hold the entire key file. (On most machines this is in the multi megabyte range.) The 32 bit version thus has no realistic limit on the size of the key file. However, if the number of keys exceeds a few hundred, the compare program may be more efficient, although it requires the files to be sorted.
SEARCH can be used to search either fixed length records or certain specially formatted variable length records. If a variable length record is searched, the key field being searched MUST reside within the fixed section of the record. These types of records are most often found on mainframe data bases, and are not of much concern to PC users unless the data comes from a mainframe.
SEARCH will also allow you to search a second field of the record for a second key match. The search is done on the first field, and when the first field matches, it then goes through the keys in a second key file to determine if any of these keys in the second key file match the one in the record. Three possible searches to conduct on the second key field are logical AND, logical OR, and logical NOT.
The GSEARCH mode allows you to literally "program" the SEARCH program with Boolean logic to find records meeting your criteria.
In the GSEARCH mode you can search any number of fields for any number of conditions and keys. This mode allows for logical AND, logical OR and logical NOT searches. It allows simultaneous searches on up to 100 sets of boolean tests.
It can be used for both fixed and variable length records.
Using this mode allows the user to effectively “program” SEARCH into doing many different logical tests on each record. It is extremely versatile, but not easy to understand and the learning curve may be steep.
GSEARCH mode is automatically installed when the line of the parameter file that normally holds the length of the search key (fourth line) is set to 0. This tells the program that the length of the search key will be found elsewhere. The parameter file in this case takes on a completely different format than for a normal search function.
SEARCH will also allow you to build/rebuild your output record from pieces of the selected record. In this case the output can be only those fields that are of interest to you. This is the filbreak (-f) option, and its operation is the same as (but not as robust as) the stand alone filbreak program. If you wish to use the filbreak options, refer to the filbreak documentation which is located in a separate section.
The filbreak option can break up variable length records with a fixed portion at the beginning of a record, and multiple repeating sections following the fixed section. See the explanation that follows concerning the makeup of the parameter file associated with this operation. This filbreak output also allows you to make fixed length records from variable length records. (see repeating field)
FILBREAK OPTIONS FOR CREATING FIXED LENGTH OUTPUT RECORDS
The next sections deal with very specific mainframe type data base records containing repeating sections and are of little use to persons working with PC data bases.
Sometimes each record in a file is of a different record type and thus has a different record layout, but at least one field in the fixed section is always the same for each group of records (account). It then becomes necessary to select all records with a particular code (account #) located in the same field of each record.
The replacement parameter file will allow you to select all record types having a similar account # field.
When searching variable records occasionally the record will contain a fixed section and repeating sections. The repeating sections will all be the same format (length) and will occur after the fixed section.
If, and only if, your selected output records are all of the same format and contain similarly formatted (same length) repeating sections, you can use the following filbreak parameter file format.
The filbreak parameter file will cause the program (when it selects a record) to take the fixed length portion of the record and one of the variable sections and create an output record containing the fixed and 1 variable section.
FIXED PORTION1 ALWAYS SAME| VARIABLE1 | VARIABLE2| VARIABLE3
FIXED PORTION2 ALWAYS SAME| VARIABLE1 | VARIABLE2| VARIABLE3
The output for record selection 1 would be:
FIXED PORTION1 ALWAYS SAME| VARIABLE1 |
FIXED PORTION1 ALWAYS SAME| VARIABLE2 |
FIXED PORTION1 ALWAYS SAME| VARIABLE3 |
This procedure is repeated until there is one output record created for each variable section of the input record.
This procedure usually creates an output file greater in size than the original input.
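The expansion described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the program's own code. The function name expand_repeating and the lengths used are hypothetical, and the sketch assumes the repeating sections start immediately after the fixed portion and are all the same length.

```python
def expand_repeating(record, fixed_len, section_len):
    """Return one output record per repeating section of `record`."""
    fixed = record[:fixed_len]
    tail = record[fixed_len:]
    out = []
    # Step through the tail one repeating section at a time.
    for start in range(0, len(tail), section_len):
        section = tail[start:start + section_len]
        if len(section) == section_len:  # ignore a trailing partial section
            out.append(fixed + section)
    return out


# Hypothetical record: a 27 character fixed portion and three 11 character sections.
record = "FIXED PORTION1 ALWAYS SAME|" + "VARIABLE1 |" + "VARIABLE2 |" + "VARIABLE3 |"
for line in expand_repeating(record, fixed_len=27, section_len=11):
    print(line)
```

Because the fixed portion is copied once for every repeating section, the output grows with the number of sections per record.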
You can use any GSEARCH search parameter file with this filbreak option providing it will only select records of similar format(record layout).
SAMPLE FILBREAK PARAMETER FILE FOR MULTIPLE REPEATING SECTIONS
mult
40 /* full length of repeating section */
0000=026 /* fixed section, or part of it */
0035=030 /* location and chosen length of repeating section to place to output record */
Final output record will be 56 characters.
The format is:
Line 1: MUST contain the word “mult” to signify a multiple repeating section record.
Line 2: MUST be the length of the repeating section.
Line 3: The normal displacement and character count of the fixed portion of the record.
Line 4: The displacement and length of the variable/repeating section of the record.
SEARCH also has the option of searching an ASCII file that has a packed decimal as the field to be searched. This packed decimal ASCII file is assumed to have been generated from an IBM mainframe EBCDIC tape that was processed through filbreak to convert EBCDIC to ASCII while leaving the packed decimal fields alone.
The command line option -p for packed field will tell the program that this is a packed field. It has only been tested in a limited fashion so use caution until you are confident the program is doing what is expected of it. The filbreak option of search can take a packed field identified in the filbreak parameter file as a ‘u’ field and unpack it to 2x its listed size.
The filbreak option of search contains the Informix date and integer handling routines to expand these fields. It cannot as yet search them directly. If you need one of these fields searched, use filbreak to expand the field, and then search the resulting output file. Check with the author for additional explanation of this if you need it.
Variable length records must be preceded by a 4 byte ASCII count of the record length.
RECL| here is the data portion
0050| DATA UP TO record length of 50
These are generally generated from mainframe data bases. If tape is the input medium, a record cannot span blocks, and the program will not take blocks that are preceded by a block count. The associated options when accessing variable length records are ‘U4F’. The F, 4 and -f filbreak options are not necessarily compatible with each other (see options below).
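As a rough illustration of this record layout, the following Python sketch reads records that are each preceded by a 4 byte ASCII count. It is not the program's own logic. The function name is hypothetical, and whether the count includes the 4 byte prefix itself is an assumption exposed here as a parameter, because the text above does not say.

```python
import io


def read_variable_records(stream, count_includes_prefix=True):
    """Yield (length_prefix, data) for each record in a binary stream."""
    while True:
        prefix = stream.read(4)
        if len(prefix) < 4:
            return  # end of input (or a truncated prefix)
        length = int(prefix.decode("ascii"))
        # Assumption: the count covers the prefix too unless told otherwise.
        data_len = length - 4 if count_includes_prefix else length
        if data_len < 0:
            raise ValueError("record length smaller than its own prefix")
        data = stream.read(data_len)
        if len(data) < data_len:
            return  # truncated final record
        yield prefix, data


sample = io.BytesIO(b"0010ABCDEF0008WXYZ")
for prefix, data in read_variable_records(sample):
    print(prefix.decode(), data.decode())
```

Keeping the prefix alongside the data mirrors what the -4 option is described as doing: the 4 byte length travels with the record for later reprocessing.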
If you use this variable length input option, the parameter file is identical to a normal parameter file with two exceptions.
The blocksize listed (line 1) is ignored. However, if using tape input, this blocksize must be at least as large as the largest tape block you expect to encounter or data will be lost. Usually 32000 is sufficient to ensure no loss of data. If using disk input, just place a reasonable size on line 1; it will be ignored by the program but must be present as a place holder.
The record length given should be one of two sizes. The easiest and most reliable is to place a 0 on line 2 of the parameter file. The 0 record length indicates to the program you don’t know what the record length is, and SEARCH will expect to find variable length records.
The second method is to place the record length you expect a fixed length record to have on line 2, which lets you use the same parameter file on fixed length records. The value is ignored once the program determines that the variable length option is chosen. In this case you must use the -U option (for Unisys) on the command line to tell the program that it is a Unisys file and that the records are variable. Placing a value on line 2 and using the -U option is thus an alternative to placing a 0 on line 2.
When the 0 is used on the second line of the parameter file, the -4 option is also automatically installed and cannot be ignored (unless the -f option is also used, which negates any -4 option chosen; you can’t have a filbreak option and expect the 4 byte record length to remain accurate).
-U4 option is automatically installed with the -f option.
It is also NOT advisable to use the carriage return option (-r) with the -4 option. This is because the -4 option copies the record length (4 bytes) to the output record for later reprocessing. If the carriage return is added (thus changing the actual record length), then these 4 bytes will give an erroneous record length for further processing of the output records.
The program also has an option -F + # to make the output a fixed length record of # characters long. It has logic to automatically adjust for the -r and -R options. If using variable length records input, care must be taken to make the fixed length at least as great as the longest expected record. Unexpected results will occur if any of the variable records are shorter than the fixed length requested. No error checking is done to truncate longer records.
When using the variable record length options, be careful that the options used all are logically compatible as not too much error checking has been placed in the program. The program gives you a lot of flexibility, and as such cannot check for operator error.
The major difference is that the parameter file items take a different format, which is explained in the GSEARCH parameter file section. Note that some (most) options are not compatible with the GSEARCH operation. This is nothing to be alarmed about, since the GSEARCH capability is better than most of these options. However, because of its power and versatility, the GSEARCH mode is not simple to use and thus the user should become familiar with it before doing any extensive analysis.
The GSEARCH mode can be used on both variable and fixed length records. Its operation should be reserved for specific situations needing its search power. This is because the GSEARCH operation is slower than a normal search operation.
See the explanation of the GSEARCH parameter file below for full details of its use.
C:>search input output parameter_file -[options]
Item 1: Program name [search].
Item 2: Input file name. (multiple input files and wildcards can be used)
Item 3: Output file name.
Item 4: Parameter file name.
Item 5: Options (if used).
c:>search inputfile output.file paramfile -c
convert unprintable characters to blanks
c:>search inputfile output.file paramfile -M 9999
convert output field to characters 9999
c:>search input output paramfile -S keyfile
use keys from “keyfile”
c:>search input output param -c -z
convert unprintables, and zap the matched records. show only mismatched records
c:>search input output param -f break.param -r
use filbreak option and add carriage return at end of record
c:>search input output param -c -h 2 -z
convert, zap, and pass a tape header
c:>search input output param -1 65
convert unprintable characters to decimal 65, which is upper case A
c:>search input output param -1 Z
convert unprintable characters to zeros (0)
c:>search input output param -c -n -h 3
convert all unprintables, but leave the newlines alone, with three tape headers
c:>search input output param -k 2ndkey.fle
search on a second field for field and keys as defined in the file 2ndkey.fle
c:>search input output param -k 2ndkey.fle -K 2
Search on a second field for field and keys as defined in the file 2ndkey.fle, and use the NOT logic. Meaning field 1 key present in record, but field 2 keys not.
c:>search input output param -p
search the input on a field that is packed decimal
c:>search input output param -U4
process variable length records and maintain 4 byte record length.
c:>search input output param -F 20 -f break.par
force a 20 character output record after using filbreak option
(GSEARCH operation takes a different parameter file format. see next section )
The parameter file contains information about the input file, and it also contains the search keys. The search keys must be one to a line, each ending with a carriage return. Create the file with the unix screen editor “vi”, DOS edit, or a non-document text editor.
The parameter file layout is as follows:
Line 1: Input blocksize. (ie 20000) The blocksize MUST be an even multiple of the record length except when using variable length input records. In this case use any number except a 0. Maximum blocksize is 65535. If the value has an 'A' following it (ie 20000A) then the 32bit version automatically adjusts the value for optimum performance.
If using tape input, must be exact blocksize. If using disk input, can be blocksize up to max of 32767.
If using tape input with variable length records should be large enough to hold biggest tape block expected. (Usually 30000).
Line 2: Input record length (no leading zeros). If a variable length record is used as the input, then this item can be a 0 to indicate to the program that input is variable. If you put a number here and are using variable length input, then use the -U option on the command line.
Line 3: Displacement to the key field in the record. First character of record is displacement 0.
If variable length records are used as the input, this displacement MUST lie within the fixed length section of the record and cannot lie outside the length of the shortest record.
Line 4: The length of the keys we will be searching for.
Line 5-end: One key per line. Must be carriage return delimited. If using the -S option, then the second key file must have one key per line.
NOTE: If you wish the last character of a key to be a blank, be aware that some editors do not let you end a line with a blank. They automatically truncate the line at the last printable character.
SAMPLE PARAMETER FILE:
9000 \* line 1: input blocksize - max of 32767. multiple of record length *\
90 \* line 2: input record length == 90 *\
7 \* line 3: searched field begins at displacement 7, first character is displacement 0 *\
9 \* line 4: length of field to be searched is 9 *\
123456789 \* line 5-X: individual keys to compare against *\
234567899 \* if -g or -l option is used, then only the first *\
390483244 \* search key is used *\
444223456

Parameter file comments after at least one blank line.
After the last search key input, it is advisable to enter one carriage return on a line by itself. This is because, depending on the program used to create the parameter file, the proper file ending sequence may not be detected by the program.
After the last line and two blank lines, comments may be added to the parameter file.
NOTE: If searching for a one character key, you CANNOT enter any comments. Sorry, that’s the way it is.
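To make the layout concrete, here is a minimal Python sketch that reads a standard parameter file and applies it to fixed length records. It is an approximation written from the description above, not the program's code. The function names are hypothetical, the sketch treats the first blank line as the end of the key list, and it ignores blocksize apart from parsing it.

```python
def parse_parameter_text(text):
    """Parse a standard parameter file: four header lines, then one key per line."""
    lines = text.splitlines()
    # Only the first token of each header line is used; trailing comments are dropped.
    header = [line.split()[0] for line in lines[:4]]
    blocksize = int(header[0].rstrip("A"))  # line 1 may carry a trailing 'A' (e.g. 20000A)
    reclen, disp, keylen = (int(token) for token in header[1:])
    keys = set()
    for line in lines[4:]:
        if line == "":
            break  # a blank line ends the key list; comments may follow it
        keys.add(line[:keylen])
    return blocksize, reclen, disp, keylen, keys


def search_records(data, reclen, disp, keylen, keys):
    """Return the fixed length records whose key field matches any key."""
    hits = []
    for start in range(0, len(data) - reclen + 1, reclen):
        record = data[start:start + reclen]
        if record[disp:disp + keylen] in keys:
            hits.append(record)
    return hits


# Hypothetical data: three 10 character records, key field at displacement 2, length 3.
params = "30\n10\n2\n3\nABC\nXYZ\n\ncomments go here\n"
data = "00ABC00000" "11DEF11111" "22XYZ22222"
blocksize, reclen, disp, keylen, keys = parse_parameter_text(params)
for record in search_records(data, reclen, disp, keylen, keys):
    print(record)
```

Holding the keys in a set keeps the lookup cost per record roughly constant, which is one plausible reading of why the key file only needs to fit in memory.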
Line 1-2: Same information as standard parameter file. i.e. blocksize, record length. If you are using variable length input records, then record length (line 2) will be 0, variable options (-U4) are automatic. Alternatively you can use the -U option if the input is variable and you desire to drop the 4 byte record length.
Line 3-4: MUST be a 0. The 0’s on the line that would normally be the displacement to the field being searched is the indication to the program that the GSEARCH operation is needed. This then sets into motion checks for the appropriate keys in the lines following.
Line 5-: (max of 225 lines):Each line contains 4 items, one after the other, with no spaces in between, and each line must end with a carriage return. The 4 items are:
Item 1: 4 characters designating the displacement(from 0) to the field to check. Use leading zeros if necessary. (i.e. 24 is 0024)
NOTE: For each logic group, the displacements should be placed as much as possible in ascending order. For logical reasons, if you are using a 0 or 1 condition (see next section), this is mandatory. Situations have occurred where this policy wasn’t followed and the program gave erroneous results.
Item 2: One character depicting the type of logical operation to perform on the field. Currently the following are available:
An ‘=’ (equal sign) means the field must match item 4 exactly
A ‘>’ (greater than) means choose all items greater than that in item 4.
A ‘<’ (less than) means choose all items less than that in item 4.
A ‘!’ (not equal) means choose if the record key is NOT EQUAL to this key.
A 'b' for the Boyer-Moore string search method. See end of the search documentation for additional details. This search basically allows you easy searching of a name in a “floating” field.
A 'B' for a "REVERSE" search of the Boyer-Moore string search method. See end of the search documentation for additional details. This search basically allows you easy searching of a name in a "floating" field. The upper case 'B' causes those records NOT meeting the match to be selected.
An 'h' for 'h'alloween Mask. This is similar to the 'v' section below, but will validate alpha numeric according to the mask. (See the validate section below.)
An ‘m’ for the multiple/repeating section.
An ‘e’ must be used with the m for repeating sections.
A ‘j’ for the Jerry option.
An ‘s’ for SOUNDEX search.
An ‘r’ for the very special search where records repeat. See end of the search documentation for details.
Item 3: A zero(0), one(1) or two(2) which is called a condition operator. See items below that describe these condition operators.
The 1 may be replaced by a + (meaning AND), and the 2 may be replaced by a pipe | (meaning OR).
Item 4: The string to match on (max of 70 characters). MUST end with a carriage return.
Blanks are counted as legal characters to check for, with one major exception: the BOYER-MOORE ‘b’ search and the SOUNDEX ‘s’ search have a special ending sequence. See the BOYER-MOORE or SOUNDEX section at end of documentation. (NOTE: some word processors do not let you end a line with a blank; they truncate the line, which is not what you want.)
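The four items can be pulled apart mechanically, as this short Python sketch shows. It is an illustration only: the names GLine and parse_gsearch_line are hypothetical, and the sketch does not handle the shorter ‘e’ line that ends a repeating section.

```python
from collections import namedtuple

# displacement, operation character, condition operator, string to match
GLine = namedtuple("GLine", "disp op cond key")


def parse_gsearch_line(line):
    """Split one GSEARCH parameter line into its four items."""
    line = line.rstrip("\r\n")  # drop the line ending but keep trailing blanks
    if len(line) < 6:
        raise ValueError("line too short: %r" % line)
    disp = int(line[0:4])  # item 1: 4 digit displacement, leading zeros allowed
    op = line[4]           # item 2: =, >, <, !, b, B, ...
    # Item 3: '+' is another spelling of 1 (AND), '|' of 2 (OR).
    cond = {"+": "1", "|": "2"}.get(line[5], line[5])
    if cond not in ("0", "1", "2"):
        raise ValueError("bad condition operator in %r" % line)
    return GLine(disp, op, cond, line[6:])  # item 4: everything that remains


print(parse_gsearch_line("0026=0ABC"))
print(parse_gsearch_line("0050<+50"))
```

Note that only the line ending is stripped, so a key that ends in blanks keeps them, as the text requires.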
Lines 1-2:
10000 /*blocksize, if variable tape input, use 32000 */
100 /*record length or if variable input a 0 */
Lines 3+4:
0 /*dummy (place holder) displacement */
0 /*this is key length line, MUST be 0 to indicate GSEARCH option */
NOTE: Each time lines 5-: is indicated, this means it would be an example of a completely new parameter file. Spaces are included for legibility: DO NOT USE them in real parameter files.
Line 5-: 0026=0ABC
/* look at displacement 26 and match the 3 letters ABC. The 0 indicates that this is the beginning of a sub grouping. In this case it contains only one statement. */
Line 5-:
0026=0ABC /*look at displacement 26 and match ABC */
0050=1CDE /*AND because this is a 1 condition also match CDE at displacement 50 */
If BOTH are not matched the record is not chosen. Any number of 1’s can follow the 0 condition, and because a one (1) means AND, ALL must be met to select the record. In the above list, both must be matched to select the record.
Line 5-:
0026>01000 /* set 1: at disp 26, select all (0 condition) greater than 1000 */
0050<150 /* AND (one condition) all that are less than 50 at position 50 */
0100=0ABC /* OR (new 0 set 2:) at displacement 100 select if ABC is matched. */
Here, the first set of 0 conditions 1000 and 50 can be matched, OR the second 0 condition set can also be matched.
If EITHER 0 condition set is matched, the record is chosen. In the above list there are 2 possible ways to select the record, because there are two 0 groupings.
Line 5-:
0010=034 /* set 1: look at position 10 and match 34 OR */
0020=012 /* set 2: look at position 20 and match a 12 OR */
0100>09 /* set 3: look at position 100, match greater than 9 OR */
0024=0ABC /* set 4: look at position 24 and match ABC AND */
0030=1987 /* look at position 30 and match 987 AND */
0050=230 /* match at least 1 of the following conditions */
0050=240 /* if any of 2 conditions is met choose the record */
0050=245
In the above parameter file there is a choice of 4 possible (0 condition) sets to select a record.
Each grouping must begin with a 0 line. All 0 groups logically OR with each other. All 1’s following a 0 must be met and are ANDed with the associated 0 line, and at least ONE of the 2’s (not necessarily all) must be met.
Line 5-:
0010=0 20 /* one 0 set, match a 20 AND */
0030=2 10 /* match 1 but not all of the items of the (2 sets) following */
0030=2 20
0040=2 12
for multiple/repeating fields use this format
30000
0
0
0
0026=020
0040m0
0036=148
0040e
The 0040m0 indicates the length of the repeating section is 40.
After that line you can place normal GSEARCH parameter lines pertaining to the fields located in the repeating section.
After the parameter lines, you must end it with a line 0040e indicating that this is the end of the repeating section.
Searches using the GSEARCH mode are done in groups.
A group consists of one or more related lines in the parameter file, with a maximum of 200 parameter lines for the entire file.
Each group MUST begin with a zero ‘0’, and only one zero ‘0’, is allowed in each group.
Each group is treated as an independent logic test and separate from the other groups in the parameter file. This is what allows you to develop multiple logic strategies. Each group can contain separate and unique logic strategies.
Within each group tests are conducted on the position and string described in each parameter.
IF, all the CONDITIONS are met within a group, the record is selected for output and additional groups need not be checked. The record, once selected is selected because it meets a single group logic test. Therefore additional group tests are not required. This may have an effect on the accounting file numbers.
If the conditions in a group are not met, then each group (0 condition set) is checked in sequence until the record is either selected, or all groups have been exhausted and the record is NOT selected.
Within a particular group certain logical operations can take place. These operations are grouped together under the conditional operators as follows:
Operator 0: numeric 0 not letter O. This MUST be the first operator of any group. The conditions on this line MUST be met before additional grouped items are checked. If you have groups containing only 0 conditions (one line groups) then if any of the 0 group conditions is met, the record is selected. The 0 groups are logically ORed with each other.
0010=0MUST BE THIS SINGLE ITEM: GROUP1
0010=0OR MUST BE THIS SINGLE ITEM: GROUP2. A LOGICAL OR with GROUP1
There is always an IMPLIED AND between a 0 line and a subsequent condition line of either a 1 (+, AND), or 2 (|, OR) condition.
Operator 1 or + (plus sign): If you want to add additional criteria to a group the 1 or 2 operator is used for subsequent lines (logic tests) within a group.
1’s ( or +'s) and 2’s (or '|' pipes=OR's) cause different logical tests to be done before the record is either selected or rejected.
The 1 operators are used to indicate that the conditions on ALL the 1 lines must also be met (along with the leading 0 line) in order to select the record. (LOGICAL ANDing)
You can have as many 1 lines following a 0 group as are necessary to complete your logic.
0010=0 MUST BE THIS
0030=1 AND THIS LINE
0040=1 AND THIS LINE ALSO
(Remember: ALL 1 lines must also be met. if not, the record is not selected and the next 0 group (if present) is checked).
Also, the displacements for the 1 lines must be in ascending order, because when the program hits a short record, and the 1 displacement is assumed past the end of the record then that record logically cannot be selected because that displacement does not exist to check.
Operator 2 or | (pipe symbol): The 2 operator (if used) must follow the 0 and (optional) 1 operator lines.
The logic that a 2 operator uses is if ANY one of the 2 operator lines meet the criteria then the record is selected and no further checks of subsequent 2’s are necessary.
In other words, the 2 operator is an OR operator and the 1 operator is an AND operator.
0010=0 MUST BE THIS
0030=2 AND ITEM2
0040=2 OR ITEM3
0010=0 MUST BE THIS
0060=1 AND MUST BE THIS
0030=2 AND THIS
0040=2 OR THIS
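The group logic described above can be modelled in Python roughly as follows. This is a sketch under stated assumptions, not the program's implementation: the function names are hypothetical, only the =, !, > and < operations are covered, > and < are assumed to be plain character comparisons, and "at least one" is assumed for the 2 lines.

```python
def parse(line):
    """Split a parameter line into (displacement, operation, condition, key)."""
    cond = {"+": "1", "|": "2"}.get(line[5], line[5])
    return int(line[:4]), line[4], cond, line[6:]


def line_matches(record, disp, op, key):
    field = record[disp:disp + len(key)]
    if len(field) < len(key):
        return False  # the displacement lies past the end of a short record
    if op == "=":
        return field == key
    if op == "!":
        return field != key
    if op == ">":
        return field > key  # assumption: character by character comparison
    if op == "<":
        return field < key
    raise ValueError("operation not covered by this sketch: " + op)


def record_selected(record, param_lines):
    """Apply the 0 / 1 / 2 group logic to one record."""
    groups = []
    for line in param_lines:
        disp, op, cond, key = parse(line)
        if cond == "0":
            groups.append([])  # every 0 line starts a new group
        elif not groups:
            raise ValueError("the first line must be a 0 line")
        groups[-1].append((disp, op, cond, key))
    for group in groups:
        ands = [t for t in group if t[2] in ("0", "1")]
        ors = [t for t in group if t[2] == "2"]
        all_met = all(line_matches(record, d, o, k) for d, o, _, k in ands)
        any_met = not ors or any(line_matches(record, d, o, k) for d, o, _, k in ors)
        if all_met and any_met:
            return True  # first satisfied group selects; later groups are skipped
    return False


lines = ["0010=0AB", "0020=1CD", "0030=2EF", "0040=2GH"]
rec = "." * 10 + "AB" + "." * 8 + "CD" + "." * 8 + "EF" + "." * 18
print(record_selected(rec, lines))
```

Returning as soon as one group is satisfied matches the statement that additional groups need not be checked once a record is selected.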
The v and V characters as the second item on a line in the GSEARCH parameter file indicate that the user wishes to perform numeric verification on this field.
The lower case v indicates the record is to be selected ONLY IF this field is ALL numerics. Numerics are 0-9 only. Commas, periods and dollar signs ($), are NOT considered part of a number.
The upper case V reverses the lower case v logic indicating the record is to be selected ONLY IF the field is made up of alpha numeric characters.
Item 4 on the parameter line which is usually the item searched for should now just be a string of characters (any characters) that is as long as the field you wish to check. Detecting the length of this string is the only way the program has of identifying how many characters are in the field you wish to ‘v’alidate.
The 0, 1, 2 logic of item 3 still apply to this option.
A sample line might be 0010v0888888888
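A small Python sketch of this check is shown below. It is illustrative only: the function name is hypothetical, and it assumes that upper case V simply selects whatever lower case v would reject.

```python
DIGITS = "0123456789"


def validate_numeric(record, line):
    """Apply a 'v' or 'V' parameter line such as 0010v0888888888 to a record."""
    disp = int(line[:4])
    op = line[4]           # 'v' or 'V'
    width = len(line[6:])  # the filler string only supplies the field length
    field = record[disp:disp + width]
    # Commas, periods and dollar signs are not digits, so they fail the test.
    all_numeric = len(field) == width and all(c in DIGITS for c in field)
    return all_numeric if op == "v" else not all_numeric


record = "ACCT      123456789"
print(validate_numeric(record, "0010v0888888888"))
print(validate_numeric(record, "0010V0888888888"))
```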
The validate 'h' action, for 'H'alloween Mask option is similar to the 'v' except that the validation checks for alpha, numeric or special characters depending on what is in that specific position. To check for an alpha character, place an alpha character, say an 'a' in that position. To check for a number, place a number, say a '9'. Any character other than alpha, or number (say a dash -) will be expected to match EXACTLY. So if you wanted to match a hyphen (dash - ) then you would place the dash in the location. A field to check a telephone number would be: 999-999-9999. (THERE IS NO NEGATION TEST POSSIBLE WITH THIS MASK)
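The mask rules can be expressed position by position, as in this Python sketch. It is not the program's code and the function name is hypothetical. It also relies on Python's isalpha and isdigit, which are a little broader than plain ASCII letters and digits.

```python
def mask_matches(field, mask):
    """Check a field against a mask such as 999-999-9999."""
    if len(field) != len(mask):
        return False
    for ch, m in zip(field, mask):
        if m.isalpha():
            ok = ch.isalpha()  # a letter in the mask accepts any letter
        elif m.isdigit():
            ok = ch.isdigit()  # a digit in the mask accepts any digit
        else:
            ok = ch == m       # anything else must match exactly
        if not ok:
            return False
    return True


print(mask_matches("412-555-0100", "999-999-9999"))
print(mask_matches("412 555 0100", "999-999-9999"))
```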
Here is a modification which will allow you to check two fields in the same record and determine if they have the same value. (it will also work for unequal tests with the correct g-search options)
To get it to operate, you place a ‘j’ for jerry, in the location (item 2) that the ‘=’ sign would normally be placed in the parameter file. (0010j0filler_length)
The field displacement is still to be the location of the field. (0010)
The type of match (i.e. 0, 1, or 2 ) should at this time be only a 0. (If not, a 0 is forced anyway).
Then enter a key. (filler_length)
This key to search for can be any data (it is only a filler, identifying the length of the field).
The length of the key must be exactly equal to the length of the field you are looking at. (i.e. if you are looking at a field of ssns (length of 9), then the key would be aaaaaaaaa. It can be any value, just as long as it is the proper length.)
The parameter line to start this jerry match would look like this, if the location of the field was at displacement 14.
0014j0aaaaaaaaa
Then, on the next line, you create a standard g-search condition parameter line.
Place the displacement (0030), the type of match, (=, !, g, l etc), and then the 1, or 2 as necessary (a 1 is more logical), and then again insert enough characters to fill the length of the key that will be searched. Here is the second line of the jerry search.
0030=1aaaaaaaaa
The field length for each line must be the same.
The first time a j is found the contents of that field in the input record are copied into the next search key thus seeding the next search key with the contents of the current key of the data record.
The first key (with the j) is forced to a match which then continues on to the next key and processes it with a 1 or 2 as necessary.
The 9 a’s or anything you want to put there are merely place holders. The two search key lines would look like this.
0014j0aaaaaaaaa
0030=1aaaaaaaaa
These say: go to displacement 14, force a jerry match of the 0 set, and copy the 9 characters you find at location 14 of the input record to the second key to search for.
Then look at displacement 30, and search for the 9 characters you just copied into the 2nd search key location. The logic should also work for a ! (not equal) and greater (g) and less than (l) test in addition to the ‘=’ test.
After the first field is located and the characters are seeded (copied) into the second key, then all other g-search logic should work.
Here is a sample that will also pick up both 2 and 3 fields of the same value
40
40
0
0
0004j0aaaaa seed the 2nd key.
0014=1aaaaa check the 2nd
0014j0aaaaa now that the second key is seeded, seed the
0026=1aaaaa seed the last key.
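The seeding idea can be sketched in Python as follows. This is only a model of the description above, not the program's code. The function name jerry_match is hypothetical, the two parameter lines are the displacement 14 and displacement 30 case described in the text, and only the = and ! tests are covered.

```python
def jerry_match(record, seed_line, test_line):
    """Seed the second key from the record, then test the second field with it."""
    seed_disp = int(seed_line[:4])
    length = len(seed_line[6:])  # the filler only tells us the field length
    # The 'j' line always "matches"; its job is to copy the first field.
    seeded_key = record[seed_disp:seed_disp + length]
    disp = int(test_line[:4])
    op = test_line[4]
    field = record[disp:disp + length]
    if len(seeded_key) < length or len(field) < length:
        return False  # the record is too short to hold both fields
    if op == "=":
        return field == seeded_key
    if op == "!":
        return field != seeded_key
    raise ValueError("operation not covered by this sketch: " + op)


# Hypothetical record with the same 9 characters at displacements 14 and 30.
rec = "x" * 14 + "123456789" + "x" * 7 + "123456789"
print(jerry_match(rec, "0014j0aaaaaaaaa", "0030=1aaaaaaaaa"))
```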
The BOYER-MOORE search is designed to allow you to search for a string of characters in what I call a floating field.
A floating field in this definition is one in which the characters in the field can be located anywhere within the limits of the field itself.
The key to coding this search is the asterisks (*) which identify the field length.
A prime example is one where you have the capability of entering the entire first and last name in one field. In some instances you will find the first name first, and in others you will find it last. There is also a possibility that spaces may be added freely which will cause the characters to “float”. A sample of what I mean follows. The *’s represent the beginning and end of the field, (Not the syntax for coding the search)
*DAN MARES*
*MARES,DAN*
*DANIEL J. MARES*
Searching a field like this for a specific character sequence (i.e. a last name) is very difficult.
You would have to take into account all the possible locations of the name and provide a separate parameter line for each possible displacement. For instance, if the name field began in displacement 10 and you were looking for Dan Mares, you might need up to 5 different parameter lines to cover all the possibilities. Like these: (the ! shows where the field may end)
0010=0MARES!
0011=0 MARES !
0012=0 MARES!
0013=0 MARES !
0014=0 MARES !
This would continue until the entire field was covered to make sure that no matter where the MARES was located, it could be found. This would be necessary in order to make sure you found MARES even if it was entered into the field like MARES, DAN and not DAN MARES.
(In this case the stars (*) are actually part of the formatting, and need to be included to delineate the end of the string, and end of the field length to search.)
converted to boyer moore parameter would be:
0010b0MARES* *
or to negate the search use upper case B:
0010B0MARES* *
If the format of the key is:
with a value right after the search item, the length of the boyer key is changed to that value.
This fixes cases where a very, very long field needs to be searched, and eliminates filling the line with a lot of blanks. The max length of the field is 2048 bytes. Hopefully you will never have one that long.
With a short name this is not too much of a problem. But think about the case where a character string can show up anywhere in a 35 position field. This is the case with street addresses and PO Box numbers. The BOYER-MOORE option fixes this predicament.
What you do is this.
1: Use the proper displacement to the beginning of the field. (i.e. 0010).
2: Replace the = (equal sign) with a 'b' or upper case 'B'. This routine should only go in the place of an ‘=’ condition.
3: Then enter the normal ‘0’ for a 0 (zero) condition set.
4: Last, enter the string you wish to search for with the special formatting explained here.
Immediately after the string, enter a ‘*’ (asterisk) as the string terminator, but DO NOT terminate the line.
Now, enter blanks to fill out the line so that the length of this line in the parameter file is as long as the number of characters in the field you wish to search. (Usually it is the field length, but can be less, and most often is). If you are using an editor that does not allow blanks at the end of a line, enter some obscure character, like another ‘*’ as the end of the line. So a parameter line to search 20 characters for SMITH might look like:
0010b0SMITH* *
0050B0A CITY* * // not this city
the second * asterisk here denotes where the END OF THE FIELD IS
Obviously, the only restriction is that you can’t search a field longer than you can enter as a single line with the text editor you are using to create the parameter file. (80 characters minus the amount of space the leading 0000b0 takes up)
This parameter line can then be followed with any other reasonable parameter sequence. (i.e. 0, 1, or 2 sets).
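The steps above can be sketched as a small helper that builds a floating-search line. This is only an illustration under stated assumptions: the helper name is made up, and it assumes the text after the six-character prefix spans exactly the field length, with the closing asterisk in the last position of the field. Check a generated line against your own data before relying on it.

```python
MAX_FIELD_LEN = 2048  # maximum field length given in the text


def build_floating_line(displacement: int, key: str, field_len: int,
                        negate: bool = False, set_digit: str = "0") -> str:
    """Build one floating-search parameter line (hypothetical helper)."""
    if not 0 <= displacement <= 9999:
        raise ValueError("displacement must fit in four digits")
    if field_len > MAX_FIELD_LEN:
        raise ValueError("field is longer than 2048 bytes")
    if len(key) + 2 > field_len:
        raise ValueError("key and both asterisks must fit in the field")
    operator = "B" if negate else "b"  # upper case B negates the search
    # Assumption: key, string terminator, blank fill, then the closing '*'
    # together occupy exactly field_len characters.
    body = (key + "*").ljust(field_len - 1) + "*"
    return f"{displacement:04d}{operator}{set_digit}{body}"


line = build_floating_line(10, "SMITH", 20)
print(repr(line))
```

Ending the line with an asterisk rather than trailing blanks follows the advice above for editors that strip blanks at the end of a line.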
Soundex is a way of searching for names based on a phonetic match. It is not an exact match and often creates more hits than are necessary. However, the phone company uses this method to locate names when you use directory assistance. And they are pretty successful at it. The Soundex search I have installed is not as good as the phone company and you should experiment with it before relying on it. It is only an approximation and you should make yourself familiar with its limitations before using it.
You would only use this option if you did not know the exact spelling of a name, which often happens. Or you were looking for a street name in an address field, and didn’t know the exact spelling.
I caution anyone who uses this option, that it is only a phonetic approximation and may not always be as accurate as you expect. You should experiment thoroughly with this option so you are familiar with its capability.
An example of what it can do is to find the name MARES if you were to ask it to find MARS, MARZ, MARIES, MARIS and others. If you can get close with your spelling, it will work fine.
The searching algorithm is relatively slow (about 4 times the normal search time), but there are tricks to make it seem as fast as a normal search. I will give you one hint later.
HOW TO USE THE SOUNDEX SEARCH
It is virtually identical to the Boyer-Moore search. The parameter line is similar, with these differences.
1: Instead of placing an ‘=’ or ‘b’ in the position for the logical operation to be performed you would place an ‘s’ indicating a SOUNDEX search is to be performed. This has the same effect as an ‘=’ search. Meaning that it must return a match for the record to be considered.
2: Then place your set criteria, ‘0’, ‘1’, or ‘2’.
3: Then place the key (name) you are searching for. Be sure, here again, to make the ultimate length of the line at least long enough to cover the field it is searching in. The program WILL parse as many names as it finds in the field and check each one. So, DAN MARES and MARES, DAN would be searched on both names to find a soundex match on MARS. A sample line is:
The ‘*’ was included to show the end of the field.
The algorithm I used is one which is supposedly used by the Georgia drivers license bureau, and it is relatively accurate. I say relatively, because there are obviously some phonetic spellings which it doesn’t contain. If you should find that it consistently misses on a specific phonetic match, let me know and I can add it. However, there are tricks (too complicated for this manual) that you can use to overcome these problems.
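To give a feel for how a phonetic match behaves, here is a minimal sketch of the standard American Soundex code. This is an assumption chosen for illustration: it is not necessarily the Georgia variant the program uses, so its hits and misses may differ from SEARCH. The function names are made up. The second function mirrors the description above of parsing every name found in the field.

```python
import re

# Letter groups of the standard American Soundex code.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name: str) -> str:
    """Return the 4-character Soundex code, or '' if name has no letters."""
    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "HW":  # H and W do not separate letters with equal codes
            prev = digit
    return (code + "000")[:4]


def field_soundex_match(field: str, key: str) -> bool:
    """Check every name found in the field against the key's code."""
    target = soundex(key)
    return any(soundex(word) == target
               for word in re.findall(r"[A-Za-z]+", field))


for guess in ("MARS", "MARZ", "MARIES", "MARIS"):
    print(guess, soundex(guess), field_soundex_match("MARES, DAN", guess))
```

Under this standard code the spellings MARS, MARZ, MARIES and MARIS all reduce to the same code as MARES, which is why they are found.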
One trick, to make it run as fast as a normal search is this. Assume we are looking for CUSICK. We don’t know how it might be spelled in the record so we will search for CUSIK. The one thing we are sure of in most instances is the first one or two characters of the name (CU). We use this to our advantage. First we set up a normal 0 set to search for records containing CU as their 1st 2 characters.
Then once we have located only those items, we use the soundex search as a ‘1’ set to check the entire name.
The entire two lines of the parameter file are:
This combination was up to 4 times faster in test runs. But remember, you must know and be confident that the original search will find an exact match on that many characters.
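The reason the trick helps can be sketched as follows: a cheap exact test on the first characters runs on every record, and the slow test runs only on the survivors. This is an illustration, not the program's code. The function name, the sample names, and the stand-in for the slow phonetic test are all assumptions.

```python
from typing import Callable, Iterable, List, Tuple


def prefiltered_search(records: Iterable[str], displacement: int, prefix: str,
                       slow_match: Callable[[str], bool]) -> Tuple[List[str], int]:
    """Return (hits, number of times the slow test was actually run)."""
    hits: List[str] = []
    slow_calls = 0
    for rec in records:
        # Cheap exact comparison, like a normal 0 set on the first characters.
        if rec[displacement:displacement + len(prefix)] != prefix:
            continue
        # Expensive test, like the soundex 1 set, only for the survivors.
        slow_calls += 1
        if slow_match(rec):
            hits.append(rec)
    return hits, slow_calls


# Hypothetical name field starting at displacement 0.
names = ["CUSICK JOHN", "CUSACK MARY", "SMITH ANN", "CURTIS LEE", "MARES DAN"]


def sounds_like_cusik(rec: str) -> bool:
    # Stand-in for a real phonetic comparison.
    return rec.split()[0] in {"CUSICK", "CUSACK"}


hits, slow_calls = prefiltered_search(names, 0, "CU", sounds_like_cusik)
print(hits, slow_calls)
```

The saving depends on how many records the prefix rules out, and the caveat above still applies: a record whose first characters are spelled differently is never given to the slow test.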
This is a specialized parameter file which will only search records with repeating fields, or something that has a similar format.
Let me explain.
When certain records are returned from mainframe data bases, a complete account is contained in any number of contiguous records.
The first record usually is a specific record type or identifier (let’s call it type 13) which has, among other information, a key (sequence number) to all the other subsequent records. This sequence number might be the account number of the individual identified in the initial type 13 record.
This key is a character field containing the sequence number or individual identifier.
Subsequent records contain this sequence character field (located in the same place as the 13 record has it located) to identify those records as belonging to the same individual account holder.
Prior to this enhancement, obtaining all records for an account took two steps. You first had to pass the file and obtain the sequence numbers for the accounts you were looking for. Then you had to rerun the search program for ALL records containing the sequence identifier. This took some time.
What this new option does is this.
You tell it what to identify in the first record of a sequence of records.
In this case, identify the type 13 record for those accounts you are looking for.
Then once it finds the 13 record, it extracts the field which is the key or identifier to the following records in the set. In our case it will be 15 characters relating to an account number.
The program builds a ‘0’ set using these 15 characters, and subsequently extracts all the appropriate succeeding records belonging to the same account.
Then when it locates a new type 13 record matching the 13 parameter, the process starts again.
This is how you set up the parameter sequence to do the search.
First, set up a 0 set, AND IF IT IS THE ONLY LINE IN THE SET replace the ‘=’ sign with an ‘r’ for (r)eplicate. So that line would look like
0030r013 (look in 30 for a 13)
If it is to be followed by 1’s or 2’s, use the = instead of the r, and follow it with any appropriate 1 or 2 set lines.
In the last line of the set, whether it is a single 0 set or has 1’s and 2’s (if you use 2’s, each 2 must follow the following format), replace the ‘=’ with an ‘r’ for replicate:
0030r013, or 0030r113.
Then follow this set with a ‘0’ set identifying the field that the sequence identifier or account number will be found in.
Be sure to fill out the item with enough field filler to let the program know how long the field is that it is supposed to search for. Example:
0000=0abcdefghijklmno
This says, look in displacement 0 for a field of 15 characters (here identified by the a-o character string). Don’t worry about what you place in the string. It is only used as a place holder. The program will replace it as soon as it finds a proper match for the previous set.
When the previous 'r' set is matched, the characters in this set will be replaced in memory by the proper number of characters found at location 0 in the currently matched 'r' set record.
This now creates in memory a ‘0’ set with the proper search key taken from the good(hit) record (13) which we have already identified as a positive match on the first set, so that record is selected and safe.
And now since we have a properly designed second ‘0’ set, any records immediately following with that character sequence or identifier will also match this second ‘0’ set and be selected.
Here is the entire parameter file that would be used to pick out all records requested with the DAN MARES at location 125, and account ID 07999.....
0026=013 /*find a type 13 record, assume ID record */
0060=10799999999 /* find our account id. number 0799.... */
0125r1DAN MARES /* test this field and if a hit, then copy data this record to another 0 set memory location as if */
0000=0place 15 chars. /* it was contained in position 0 and place the 15 chars here */
the 15 characters it places in the second 0 set, are those found in the current hit record containing the type 13 ID account information
or, to be fancy, and add more criteria to the 13 search:
0030=013
0050r1this_one
0000=0abcdefghijklmno
I caution the user to test this out and become familiar with its operation before putting it into production. It works fine. But its logic takes some getting used to. And above all a thorough knowledge of your input file is required.
This type of selection can be used for any file where there is a related field from the first record in a set to subsequent records.
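The replicate logic described in this section can be sketched roughly as below. This is an illustration of the idea, not the program's implementation. The function names and the 48-byte record layout are made up, and it assumes the copied key stays in effect until the next matching header record replaces it.

```python
from typing import Callable, Iterable, List, Optional


def replicate_select(records: Iterable[str],
                     is_header_hit: Callable[[str], bool],
                     key_disp: int, key_len: int) -> List[str]:
    """Select each matching header record plus the records sharing its key."""
    selected: List[str] = []
    current_key: Optional[str] = None
    for rec in records:
        if is_header_hit(rec):
            # The 'r' set matched: copy the key field out of this record.
            current_key = rec[key_disp:key_disp + key_len]
            selected.append(rec)
        elif current_key is not None and rec[key_disp:key_disp + key_len] == current_key:
            # Following record carrying the same sequence identifier.
            selected.append(rec)
    return selected


def make_record(account: str, rtype: str, name: str = "") -> str:
    # Hypothetical layout: account at 0 (15), type at 26 (2), name at 28 (20).
    return f"{account:<15}{'':11}{rtype:<2}{name:<20}"


records = [
    make_record("079999999900001", "13", "DAN MARES"),
    make_record("079999999900001", "20"),
    make_record("079999999900001", "21"),
    make_record("081111111100002", "13", "JOHN DOE"),
    make_record("081111111100002", "20"),
]


def header_hit(rec: str) -> bool:
    return rec[26:28] == "13" and rec[28:37] == "DAN MARES"


selected = replicate_select(records, header_hit, 0, 15)
print(len(selected))
```

The whole account comes out in a single pass, which is the saving over the older two-pass method.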
-a: Appends output records to existing output file.
-A: Create/append to the accounting file ACCT-ING.
/A: Turns off auto-accounting set by environment option. If environment variable ACCT is set to on (set ACCT=ON), then -A is automatically installed each time program is run. The /A negates this operation.
-[8|9]: This option operates the same as the -A (accounting) option.
In addition, at the end of the search keys, if you place at least one blank line, you can then add comments to the parameter file. These comments will be added to the accounting file. If the byte count of the search keys is less than 250, then the entire parameter file is added to the accounting file.
-b or s: Used with the -f option. It causes certain specified output fields to be replaced by blanks. It was done for those who can’t decide whether to use a space or a blank in the field. See filebreak.doc for more specific information on the c (convert) section of its parameter file.
-C + #: Begin processing at # record number. This is similar to a checkpoint restart. *NOTE: If disk input is used, a control C will cause the program to abort and write whatever output is obtained to the output file. You can then resume later with the -C option.
-c: To convert any unprintable characters in the input file to the printable character ‘~’. Use this if you plan on printing the output file, as unprintable characters may play havoc with the printer or CRT. Mainframe filler characters (0x00’s) are especially troublesome.
-1 + #: To convert unprinting characters to #, where # is the DECIMAL representation of the character you wish to convert to (e.g. an upper case A is decimal 65).
-1 + B: Special default, to convert unprinting characters to a ‘B’lank (decimal 32).
-1 + Z: Special default, to convert unprinting characters to a ‘Z’ero (decimal 48). The #, B, or Z MUST be the next argument, and it must be all by itself.
-d: Used with the -f option. It causes that output field to be replaced by dashes. (similar to -b option except it creates dashes instead of blanks) Ideal for reformatting ssn’s. See filebreak.doc for more specific info.
-e: Converts EBCDIC input to ASCII
-F + #: Make output record fixed length of # characters. No error checking on truncation of longer records is done. Appropriate adjustment is made for -r and -R options. However if the carriage returns ‘already’ exist in the record, they are not moved to the end. And others may be added.
-f + filename: Indicates that the filebreak option is to be used, and next argument is the name of the filebreak parameter file. This option allows you to build an output file from certain selected parts of the input record. The filebreak.doc has specifics on how to use filebreak.
If you attempt to use the variable length -4 option (retain the 4 byte record length) with the -f (filebreak option), there is an incompatibility conflict and the -4 is ignored. (You can’t retain the 4 byte record length of a record in which you are changing its size).
-g: Picks only those records where the record key is greater than the search key.
Only one search key can be used in the parameter file. If more than one search key exists, then only the first is checked.
-h: Designates that the input tape has headers in it, and the program should bypass the headers. After the -h option, the next argument should be the number of header records (including the EOF) to pass. In most cases this will be 4. (3 headers and an EOF). Not available on 32 bit NT version.
-i: Ignore case of the search keys. This will convert all the search keys to upper case before checking for a match of the keys.
-I: Input tape is IBM format. The tape has a 4 byte BLOCK size at the beginning of each block. This option drops/ignores those 4 bytes. Not available on 32 bit NT version.
-k + filename: Filename is the name of the file containing keys to search for in a second field. See -K below on 2nd key.
-K + #: Indicates which logical search to perform on the second key field.
Currently they are:
1 = A key in the first parameter file must match the record key, and a key in the second parameter file must match a key in the second record key field. This equates to a logical AND (field 1 AND field 2).
2 = Field one matches and none of the keys in field 2 match. This equates to a logical NOT (field 1 AND NOT field 2). In simple terms, a key was matched in field 1 but not in field 2.
The -k and -K options are used to designate that a secondary field is also to be searched. The second key file is identical in structure to the first parameter key file. It must have the blocksize and record length the same as the parameter file. The displacement and length of the key field are naturally going to have to reflect the proper size and placement in the record of the second field to check. See parameter file descriptions and sample command lines in the SEARCH program.
The -kK options will not work with variable length records.
-L: Remove banner from screen for Long runs.
-l: (lower case L, not 1(one)) Picks only those records where the record key is less than the search key. Only one search key can be used in the parameter file. (similar to -g option)
-M + string: Indicates to the filebreak option that you want ‘string’ to replace specific characters in the output record. See ‘m’ convert option in filebreak documentation.
-n: For use only with the -c option. This option when used with the -c option will cause any newline characters in the file to be untouched and will not be converted to printable characters. This is a way to convert all but the newline characters in a file. You cannot use this option by itself. If you select this option, you must also select the conversion option.
-N: No rewind of the input IDT tape drive after program execution. This is an outdated option.
-P: Used with filebreak (-f) option to automatically insert pipes ‘|‘ after each field in the output record
-p: This indicates that the search field is a packed decimal field. The program will automatically expand the field in the record to match the length indicated in the parameter file. (see filbreak ‘u’ option for unpacking the output record) (Do not confuse with same option in filbreak program. It is similar, but not exactly the same).
-r: Use to insert a carriage return/linefeed after the last character of the output record.
-R: Same as -r except that -R adds only line feed 0x0a, not both 0x0d and 0x0a.
-S + filename: Use this file as the source for the search keys. This means the search keys are NOT in the parameter file but are in their own separate file. The keys in this file should be ONE per line, delimited with carriage returns and sorted.
-t: Indicates that the IDT tape drive is the input file. If mt0 is the input file name, then -t is a default. This is a very specialized option and is not used very often.
-T + #: Indicates the reel number to place in the accounting file if tape is used. This can also contain up to a total of 15 consecutive alphanumeric characters. Use this just to input minimal information into accounting file.
-u: Do not unload tape after processing. Default is to unload tape from tape drive.
-U: Indicates that the input is of variable length records. (see also the -4 option).
NOTE: If the second line (record length) of the parameter file is 0, this option is automatically installed, along with the -4 option. (only the -f option can override the -4 option once it is installed).
-4: This option is only valid when using the -U option. It tells the program to include the 4 byte record length counter in the output record. This is helpful if the output record is going to be used for future processing. If the -4 is not used, the 4 bytes are stripped and you get just the data record without the record length indicator.
Because the -f (filbreak) option creates records of a different length, this option is not allowed if the filebreak option is used.
NOTE on U option: If the second line of the parameter file (the record length line) is a 0, then the -U4 option is automatic, and need not be entered on the command line. If you wish to have only the -U option installed, you must first, place any number other than 0 on the record length line 2 of the parameter file, then choose the -U option, and DON’T attempt to use the -f (filbreak) options.
If the -A option is used with the -4 option, an additional line is added to the accounting file. It lists the largest record length found in the records selected. The size listed includes the 4 byte header.
-v: Used with the $ (signed field conversion) of the filbreak parameter file. Causes leading blanks or zeros to all be converted to zeros (0). This applies only if the filbreak option is used and the appropriate field is identified in the filbreak parameter file.
-V: Same as -v except that leading blanks or zeros are all converted to blanks.
-x: Converts the first four characters of an informix version 3.x data record to blanks. Normally the first characters are the record number in hex. This will play havoc with most programs, as there are hex 00’s in part of the first four characters. If this option is used, the -c option need not be used to convert the 00 unprintable characters.
-z: Negates the search option. Instead of putting to the output file only records which match the search keys, this option deletes (zaps) from the file those records which match the search keys. All records not matching the search keys will be put to the output file.
-2: (UNDOCUMENTED in program help) If you have a variable length record with a leading record length, but have made a mistake and added CR/LF (so the record length is now off by 2), use this option to eliminate the CR/LF from the output record.
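The second-key logic of the -k and -K options can be pictured with the short sketch below. It is only an illustration: the function name, the tuple layout for a key field, and the sample record are assumptions, and only the two -K values described in the list above are covered.

```python
from typing import Set, Tuple

# (displacement, length, keys) for one key field; a made-up structure.
FieldSpec = Tuple[int, int, Set[str]]


def second_key_select(record: str, first: FieldSpec, second: FieldSpec,
                      mode: int) -> bool:
    """mode 1: field 1 AND field 2.  mode 2: field 1 AND NOT field 2."""
    def field_hit(spec: FieldSpec) -> bool:
        disp, length, keys = spec
        return record[disp:disp + length] in keys

    if not field_hit(first):
        return False  # the first field must always match
    if mode == 1:
        return field_hit(second)
    if mode == 2:
        return not field_hit(second)
    raise ValueError("only -K values 1 and 2 are sketched here")


# Hypothetical record: 9-digit id at displacement 0, 2-letter state at 9.
first = (0, 9, {"123456789", "390483244"})
second = (9, 2, {"GA", "FL"})
print(second_key_select("123456789GA", first, second, 1))
print(second_key_select("123456789NY", first, second, 2))
```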
NOTE: The following options are automatically negated if the GSEARCH options are instituted: -zgklS.
Options can be in any order and grouped. But if you designate headers, the header lengths will be the last items on the line.
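The effect of the -c, -1 and -n conversion options can be sketched as below. This is an illustration under assumptions rather than the program's code: "unprintable" is taken to mean any byte outside the ASCII range 0x20 to 0x7E, a newline is taken to mean the line feed byte 0x0A, and the function name is made up.

```python
def convert_unprintable(data: bytes, replacement: int = ord("~"),
                        keep_newlines: bool = False) -> bytes:
    """Replace unprintable bytes, mimicking -c (default '~') or -1 + #.

    replacement is the DECIMAL value of the output character, as with -1.
    keep_newlines=True mimics adding -n.
    """
    out = bytearray(data)
    for i, byte in enumerate(out):
        if 0x20 <= byte <= 0x7E:
            continue  # printable, leave alone
        if keep_newlines and byte == 0x0A:
            continue  # assumed meaning of "newline characters"
        out[i] = replacement
    return bytes(out)


sample = b"AB\x00C\n"
print(convert_unprintable(sample))                      # like -c
print(convert_unprintable(sample, keep_newlines=True))  # like -c -n
print(convert_unprintable(sample, replacement=32))      # like -1 B
```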
NORMAL SEARCH PARAMETER FILE
9000 /* input blocksize - max of 65535. multiple of record length */
90 /* input record length == 90 */
7 /* search field begins at 7, first character is displacement 0 */
9 /* length of field to be searched is 9 */
123456789 /* individual keys to compare against */
234567899 /*if -g or -l option is used, then only the first*/
390483244 /* search key is used */
444223456

Now I can put in comments after a blank line
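To show how the pieces of a normal parameter file fit together, here is a minimal sketch that reads one and applies it to fixed length records held in memory. It is not the program's code. The names are made up, the comment stripping is an assumption about how the annotated sample should be read, and the record layout in the usage lines is invented for the example.

```python
import re
from typing import List, NamedTuple, Set


class NormalParams(NamedTuple):
    blocksize: int
    record_len: int
    displacement: int
    key_len: int
    keys: Set[str]
    comments: List[str]


# Assumption: anything from "/*" (or "\*") to the end of a line is a comment.
_COMMENT = re.compile(r"\s*[/\\]\*.*$")


def parse_normal_params(text: str) -> NormalParams:
    raw = text.splitlines()
    clean = [_COMMENT.sub("", line).rstrip() for line in raw]
    header = [int(value) for value in clean[:4]]
    keys: Set[str] = set()
    comments: List[str] = []
    for idx in range(4, len(raw)):
        if not raw[idx].strip():  # the first blank line ends the key list
            comments = [line for line in raw[idx + 1:] if line.strip()]
            break
        keys.add(clean[idx])
    return NormalParams(header[0], header[1], header[2], header[3], keys, comments)


def search_fixed(data: str, params: NormalParams) -> List[str]:
    """Return the records whose key field equals one of the search keys."""
    hits = []
    step = params.record_len
    for start in range(0, len(data) - step + 1, step):
        record = data[start:start + step]
        field = record[params.displacement:params.displacement + params.key_len]
        if field in params.keys:
            hits.append(record)
    return hits


SAMPLE = """60 /* input blocksize */
20 /* input record length */
7 /* search field begins at displacement 7 */
9 /* length of field to be searched */
123456789 /* keys to compare against */
390483244

comments after the blank line
"""
params = parse_normal_params(SAMPLE)
data = "AAAAAAA123456789ZZZZ" + "BBBBBBB555555555ZZZZ" + "CCCCCCC390483244ZZZZ"
print(search_fixed(data, params))
```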
GSEARCH PARAMETER FILE FOR VARIABLE LENGTH RECORDS
30000
0
0
0
0010=034 /* look at position 10 and match 34 OR */
0020=012 /* look at position 20 and match a 12 OR */
0100>09 /* look at position 100, match greater than 9 OR */
0024=0ABC /* look at position 24 and match ABC AND*/
0030=1987 /* look at position 30 and match 987 AND*/
0050=230 /* match at least 1 of the following conditions*/
0050=240 /* if any of 2 conditions is met choose the record*/
0050=245
GSEARCH PARAMETER FILE FOR FIXED LENGTH RECORDS
30000
300
0
0
0010=034 /* look at position 10 and match 34 OR */
0020=012 /* look at position 20 and match a 12 OR */
0100>09 /* look at position 100, match greater than 9 OR */
0024=0ABC /* look at position 24 and match ABC AND*/
0030=1987 /* look at position 30 and match 987 AND*/
0050=230 /* match at least 1 of the following conditions*/
0050=240 /* if any of 2 conditions is met choose the record*/
0050=245
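The layout of a GSEARCH condition line (four-digit displacement, one operator character, one set digit, then the key) can be illustrated with the sketch below. It is not the program's code and rests on assumptions read from the comments in the samples above: 0 set lines are ORed together, every 1 set line must also match, and at least one 2 set line must match when any are present. Only the '=' and '>' operators are handled, and all names are made up.

```python
import re
from typing import List, NamedTuple


class Condition(NamedTuple):
    displacement: int
    operator: str
    set_digit: str
    key: str


_LINE = re.compile(r"^(\d{4})(.)([012])(.*)$")
_COMMENT = re.compile(r"\s*/\*.*$")


def parse_condition(line: str) -> Condition:
    match = _LINE.match(_COMMENT.sub("", line))
    if match is None:
        raise ValueError(f"not a condition line: {line!r}")
    return Condition(int(match.group(1)), match.group(2),
                     match.group(3), match.group(4))


def condition_hit(record: str, cond: Condition) -> bool:
    field = record[cond.displacement:cond.displacement + len(cond.key)]
    if cond.operator == "=":
        return field == cond.key
    if cond.operator == ">":
        return field > cond.key  # plain character comparison (assumption)
    raise NotImplementedError(f"operator {cond.operator!r} is not sketched")


def record_selected(record: str, conds: List[Condition]) -> bool:
    by_set = {d: [c for c in conds if c.set_digit == d] for d in "012"}
    any_zero = any(condition_hit(record, c) for c in by_set["0"])
    all_one = all(condition_hit(record, c) for c in by_set["1"])
    any_two = (not by_set["2"]
               or any(condition_hit(record, c) for c in by_set["2"]))
    return any_zero and all_one and any_two


conds = [parse_condition(text) for text in (
    "0010=034 /* look at position 10 and match 34 OR */",
    "0020=012",
    "0030=1987",
    "0050=230",
    "0050=240",
)]
print(conds[0])
```

Treat the combined set logic as a reading aid only, and test a parameter file against known records before trusting it.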
JERRY CAIN option parameter file
40
40
0
0
0004j0aaaaa seed the 2nd key.
0014=1aaaaa check the 2nd
0014j0aaaaa now that the second key is seeded, seed the
0026=1aaaaa seed the last key.
REPLACEMENT SEARCH FROM ONE RECORD TO NEXT, USE THIS ON MULTIPLE RECORDS of SAME ACCOUNT
30000
0
0
0
0026=013 /* find a type 13 record */
0060=10799999999 /* find our id number 0799.... */
0125r1DAN MARES /* find our field and copy data from it to 0 */
0000=0place 15 chars. /* place 15 dummy chars here */
MULTIPLE REPEATING SECTIONS: forming fixed length outputs using filbreak -f option.
mult 40 /* full length of repeating section */
0000=026 /* fixed section, or part of it */
0035=030 /* location and chosen length of repeating section to place to output record*/