Steve writes:
> In article <[log in to unmask]>, [log in to unmask] (Dave Anderson) writes:
>
> > I have to sort and merge several files with possible duplicates.  Is
> > there a way to exclude duplicates??
>
> sort file_list | uniq > output_file

The dangers of Unix, :) ... it makes it too easy to combine tools, sometimes.

I looked at the above, and said: gee, if a file has a lot of duplicates,
that's an expensive proposition...there ought to be a sorter that would
optionally remove duplicates (and thereby speed up the sort, somewhat).

Sure enough, a "man sort" shows the "-u" option:

   -u   Unique: suppress all but one in each set of lines having
        equal keys.  If used with the -c option, check to see that
        there are no lines with duplicate keys, in addition to
        checking that the input file is sorted.

Thus, the above would be more efficient with:

   sort -u file_list > output_file

Now (Hi Dan!), I don't know if "-u" is POSIX or not, but both HP-UX and
AIX have it.

Interestingly, "man sort" on MPE, HP-UX, and AIX all produce slightly
different explanations for "-u":

MPE:
   -u   ensures that output records are unique.  If two or more input
        records have equal sort keys, sort writes only the first
        record to the output.  When you use -u with -c, sort prints a
        diagnostic message if the input records have any duplicates.

AIX:
   -u   Suppresses all but the first line in each set of lines that
        sort equally according to the sort keys and options.

HP-UX:
   -u   Unique: suppress all but one in each set of lines having
        equal keys.  If used with the -c option, check to see that
        there are no lines with duplicate keys, in addition to
        checking that the input file is sorted.

NOTE: when read closely, and tested, only the HP-UX man page is not
misleading.  MPE and AIX refer to "the first record"... they should say
(based on testing) "the first output record", since they may end up
suppressing earlier (chronologically) records when suppressing
"duplicates".
I.e., consider the file foo:

   cat cat3 foo
   cat cat2 foo
   cat cat2 foo

   sort -k3 -u < foo

produces:

   cat cat2 foo

The "first record" of the *INPUT* file with the key "foo" was
"cat cat3 foo", which never appeared in the output.

Instead, if you had sorted the input with an explicit list of fields like:

   sort -k3 -k1 -k2 < foo

you'd have seen that the first OUTPUT record was "cat cat2 foo",
followed by "cat cat2 foo", and "cat cat3 foo".

Anyway, the points are:

   1) the "-u" suppresses all but one record with the same set of keys.
      If you specify a set of keys that don't "cover" the entire record,
      you will lose records that differed only in non-key fields ...
      and the ones you get may not be the first chronological ones.

   2) sort doesn't preserve the order of records with identical key
      fields.  (There's a name for this, but I sure can't remember it!)

   3) man pages often have to be read *very* carefully, and between
      the lines.

-- 
Stan Sieler                                          [log in to unmask]
                                     http://www.allegro.com/sieler.html
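
For anyone who wants to try points 1 and 2 on their own system, here is a rough sketch rather than a definitive test. It feeds the three lines of foo through a small helper function, which is made up for this example (as are the variable names), and it assumes a sort that accepts the "-k" key syntax. LC_ALL=C is set only to keep the collation predictable.

```shell
# Hypothetical helper: emits the three lines of the file "foo" from the post.
foo_lines() {
    printf '%s\n' 'cat cat3 foo' 'cat cat2 foo' 'cat cat2 foo'
}

# Whole-line comparison: both forms should leave the same two distinct lines.
piped=$(foo_lines | LC_ALL=C sort | uniq)
unique=$(foo_lines | LC_ALL=C sort -u)

# Key restricted to field 3: all three lines tie on "foo", so only one
# record survives.  Which one is printed depends on the sort implementation.
keyed=$(foo_lines | LC_ALL=C sort -k3 -u)

echo "sort | uniq:"
echo "$piped"
echo "sort -u:"
echo "$unique"
echo "sort -k3 -u:"
echo "$keyed"
```

The first two results should match each other, since with no key given the whole line is the key. The last one should be a single line ending in "foo"; in the test above the survivor was "cat cat2 foo", but another sort may keep a different record. The property in point 2, keeping records with equal keys in their input order, is usually called "stability" of the sort.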