HP3000-L Archives

January 1996, Week 3

HP3000-L@RAVEN.UTC.EDU

Subject:
From:
Stan Sieler <[log in to unmask]>
Reply To:
Stan Sieler <[log in to unmask]>
Date:
Fri, 19 Jan 1996 11:36:38 -0800
Steve writes:
> In article <[log in to unmask]>, [log in to unmask] (Dave Anderson) writes:
>
> > I have to sort and merge several files with possible duplicates.  Is
> > there a way to exclude duplicates??
>
> sort file_list | uniq > output_file
 
 
The dangers of Unix :)  ... sometimes it makes it too easy to combine
tools.  I looked at the above and said: gee, if a file has a lot of
duplicates, that's an expensive proposition... there ought to be a
sorter that would optionally remove duplicates (and thereby speed up the
sort somewhat).  Sure enough, a "man sort" shows the "-u" option:
 
           -u          Unique: suppress all but one in each set of lines
                       having equal keys.  If used with the -c option, check
                       to see that there are no lines with duplicate keys,
                       in addition to checking that the input file is
                       sorted.
 
Thus, the above would be more efficient with:
 
    sort -u file_list > output_file
 
Now (Hi Dan!), I don't know if "-u" is POSIX or not, but both HP-UX and
AIX have it.
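 
For what it's worth, "-u" is among the options listed in the POSIX
specification for sort.  Here is a minimal sketch comparing the two forms
when whole lines are the key.  The file names a.txt and b.txt and their
contents are made up for illustration; they are not from the original
question.

```shell
# Build two small sample files (hypothetical names and contents)
# in a scratch directory.
tmpdir=$(mktemp -d)
printf 'pear\napple\napple\n' > "$tmpdir/a.txt"
printf 'apple\nfig\npear\n'   > "$tmpdir/b.txt"

# Two-process version: sort everything, then let uniq drop
# adjacent duplicate lines.
piped=$(sort "$tmpdir/a.txt" "$tmpdir/b.txt" | uniq)

# One-process version: sort drops the duplicates itself.
direct=$(sort -u "$tmpdir/a.txt" "$tmpdir/b.txt")

printf '%s\n' "$direct"
rm -rf "$tmpdir"
```

With no keys specified, both forms should leave one copy of each distinct
line, so either one answers the original question.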
 
Interestingly, "man sort" on MPE, HP-UX, and AIX each produces a slightly
different explanation for "-u":
 
 
MPE:
    -u    ensures that output records are unique.  If two or more input
          records have equal sort keys, sort writes only the first record
          to the output.  When you use -u with -c, sort prints a diagnostic
          message if the input records have any duplicates.
 
AIX:
  -u      Suppresses all but the first line in each set of lines that
          sort equally according to the sort keys and options.
 
HP-UX:
  -u      Unique: suppress all but one in each set of lines
          having equal keys.  If used with the -c option, check
          to see that there are no lines with duplicate keys,
          in addition to checking that the input file is
          sorted.
 
NOTE: when read closely and checked by testing, only the HP-UX man page is
not misleading.  MPE and AIX refer to "the first record"...  they should say
(based on testing) "the first output record", since they may end up
suppressing chronologically earlier records when suppressing "duplicates".
 
For example, consider the file foo:
   cat cat3 foo
   cat cat2 foo
   cat cat2 foo
 
sort -k3 -u < foo
produces:
   cat cat2 foo
 
The "first record" of the *INPUT* file with the key "foo" was "cat cat3 foo",
which never appeared in the output.  If you had instead sorted the input
with an explicit list of fields, like:  sort -k3 -k1 -k2 < foo
you'd have seen that the first OUTPUT record was "cat cat2 foo",
followed by "cat cat2 foo" and "cat cat3 foo".
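 
The experiment can be repeated with a short script.  This is only a sketch:
it recreates foo in a scratch directory (the mktemp directory is my
addition), and it makes no claim about WHICH record survives the key-only
sort, since that is exactly the part that varies between implementations.

```shell
# Recreate the three-line file from the example in a scratch directory.
tmpdir=$(mktemp -d)
printf 'cat cat3 foo\ncat cat2 foo\ncat cat2 foo\n' > "$tmpdir/foo"

# Key on field 3 onward only: every record has the same key, so one
# record survives.  Which one is up to the sort implementation.
by_key=$(sort -k3 -u < "$tmpdir/foo")

# Add field 1 onward as a second key, so the keys cover the whole
# record: only the exact duplicate is dropped.
by_record=$(sort -k3 -k1 -u < "$tmpdir/foo")

printf 'key only:\n%s\nwhole record:\n%s\n' "$by_key" "$by_record"
rm -rf "$tmpdir"
```

The key-only run should print a single record, and the whole-record run
should print two.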
 
Anyway, the points are:
 
   1) the "-u" suppresses all but one record with the same set of keys.
      If you specify a set of keys that doesn't "cover" the entire record,
      you will lose records that differed only in non-key fields ... and
      the ones you get may not be the chronologically first ones.
 
   2) sort doesn't preserve the order of records with identical key fields.
      (There's a name for this, but I sure can't remember it!)
 
   3) man pages often have to be read *very* carefully, and between the lines.
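 
The property in point 2 is usually called stability: a stable sort keeps
records with equal keys in their input order.  If what you want is the
chronologically first record for each key, one way around the problem is
to drop the later records BEFORE sorting.  A minimal sketch using awk
follows; the function name first_per_key is made up here, and the key is
hard-wired to field 3 to match the foo example.

```shell
# Keep only the first input record for each value of field 3,
# preserving input order.  seen[] counts how often a key has appeared;
# a record is printed only when its count was still zero.
first_per_key() {
    awk '!seen[$3]++' "$@"
}

printf 'cat cat3 foo\ncat cat2 foo\ncat cat2 foo\n' | first_per_key
```

On the foo data this keeps "cat cat3 foo", the first input record.  The
result can then be piped through sort -k3 if sorted output is also needed;
each key now appears only once, so ties on that key no longer arise.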
 
 
 
--
Stan Sieler                                          [log in to unmask]
                                     http://www.allegro.com/sieler.html
