LISTSERV - HP3000-L Archives

July 2002, Week 2

HP3000-L@RAVEN.UTC.EDU

Subject:	Re: Deduping files (wait a minute)
From:	Wirt Atmar <[log in to unmask]>
Date:	Sat, 13 Jul 2002 18:13:26 EDT

Jeff writes:

>  > Since sort can take a list of files to sort as arguments, that's a wasted
>  > use of cat.  And sort also has the -u option which removes duplicate keys,
>  > so that's a wasted use of uniq; use:
>  >
>  > sort -u FILE [FILE2 ...] >NODUPE
>
>  OK, I hand off the shortest answer award to Jeff Woods :-)  Kudos.

As far as succinct goes, let me show you how you would accomplish the whole of
Allen's original request in QueryCalc. I'm going to presume that his list of
original, non-sorted, possibly duplicated names exists in a simple flat file,
40 characters wide, 50,000 records long. I'm also going to presume that his
master list of 5 million exists in an IMAGE automatic master, with a matching
40-character name field.

To do what Allen wants would require three cells in QueryCalc:

     @using sets, define !a from myflatfile;type=X40
     @using mymasterset, store in !b name when name=!a
     @using sets, !c=!a-!b

In QueryCalc syntax, we read !a, !b, and !c as Set A, Set B, and Set C, all
of which are boolean algebraic sets. It's simply the nature of a boolean set
that no element can appear twice in the same set, thus the first cell, the
one that defines Set A, will automatically eliminate all duplicates as it
creates Set A (it does this by first sorting the list, thus the list is not
only deduplicated but sorted as well). For 50,000 records, this operation
should only take a couple of seconds.
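
For readers without QueryCalc at hand, the sort-then-deduplicate idea can be
sketched in Python. This only illustrates the technique; it is not how
QueryCalc is implemented, and the function name dedupe_sorted and the sample
names are made up for the example.

```python
from itertools import groupby


def dedupe_sorted(names):
    """Sort the names, then keep one entry from each run of equal neighbours."""
    # After sorting, duplicates sit next to each other, so groupby
    # collapses each run of equal values into a single key.
    return [key for key, _ in groupby(sorted(names))]


# Hypothetical sample data standing in for the 40-character flat file.
sample = ["SMITH, JOHN", "ADAMS, ANN", "SMITH, JOHN", "BAKER, LEE"]
print(dedupe_sorted(sample))
```

The result comes back both sorted and free of duplicates, which is the same
side effect described for Set A.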

The second cell defines Set B as those values in Set A for which a matching
entry exists in the master list. This operation would require perhaps only
10 to 15 seconds for 45,000 finds (presuming we eliminated 5,000 duplicates
in the first operation).

The third cell creates Set C via a boolean set subtraction of the values
found in Set B (the set of all records that exist in the master list) from
Set A, the original deduplicated list. This operation should take just a
second or two. Perhaps only 200 records now populate Set C.
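
As a rough analogy only, the three cells map onto ordinary set operations in
Python. The function name missing_from_master is hypothetical, and a Python
set stands in for the keyed lookup against the IMAGE master; the timings
quoted above do not apply to this sketch.

```python
def missing_from_master(flat_names, master_names):
    """Return the unique flat-file names that are absent from the master."""
    # Cell 1: building a set discards duplicates (Set A).
    set_a = set(flat_names)
    # Cell 2: keep the members of Set A that the master already holds (Set B).
    # master_names is assumed to support fast membership tests, e.g. a set.
    set_b = {name for name in set_a if name in master_names}
    # Cell 3: set subtraction leaves the names not in the master (Set C).
    set_c = set_a - set_b
    return sorted(set_c)


# Hypothetical data: a small flat file and a small master list.
flat = ["ADAMS", "BAKER", "ADAMS", "CLARK"]
master = {"ADAMS", "CLARK", "ZIMMER"}
print(missing_from_master(flat, master))
```

Sorting the final result is a convenience for printing; the set subtraction
itself does not depend on order.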

Set C is the set of records that Allen wanted (a list of all of the unique
entries in his original list that don't appear in his master list). Now that
it exists, you can do anything with that set you want: print it, store it in
an auxiliary database, or use it as a list to print letters.

Wirt Atmar
