HP3000-L Archives

July 2002, Week 3

HP3000-L@RAVEN.UTC.EDU

Subject: Re: [HP3000-L] Deduping files
From: Danny van Delft <[log in to unmask]>
Reply To: Danny van Delft <[log in to unmask]>
Date: Mon, 15 Jul 2002 17:02:04 -0500
Content-Type: text/plain
Parts/Attachments: text/plain (148 lines)
In article <[log in to unmask]>, "Porter, Allen"
<[log in to unmask]> wrote:

> Lots of good ideas; has anyone done any kind of speed comparisons?


Well, not a comparison, just one try with Perl. It will probably be good
enough for a "run occasionally" scenario. I ran the following program with
a check file of about 51k lines and a master list of 12M lines; both files
had 70-character lines. It took about 5 minutes to code and about 35
seconds to run (on a 1.2GHz Linux machine, just to give a meaningless
number). Run time should be roughly linear in the master file size, so
your example would take about 2 minutes.

Note that the programming style is not, not, not what you'd want to use in
real life: a week from now it will take you at least a minute to figure
out what's going on. But for a one-shot... use at your own risk. I haven't
done rigorous tests, but the output was what I expected.

--CUT--
#!/usr/local/bin/perl
# First argument: file containing the lines to check for dups; the key
#   starts at position 30 (base 0) and is 40 characters long.
# Second argument: file containing the keys, starting at position 10,
#   also 40 characters long.
# Output is sorted into the original order of the first file.
# The first file's size determines memory use; the second file's size is
#   irrelevant. Neither file needs to be sorted.

while(<>){push(@{$m{substr($_,30,40)}},[$.,$_]); last if(eof);}
while(<>){delete $m{substr($_,10,40)} if exists($m{substr($_,10,40)});}
print map {$_->[1]} sort {$a->[0] <=> $b->[0]} map {@{$_}} values(%m);
--CUT--


Run it as "whatever list_file master_file" if the program above is saved
as "whatever".

The first line of code (starting with "while") builds a hash table from
list_file, keyed on the wanted field. Each entry in the table holds a
list whose elements are two-element lists containing the line number ($.)
and the corresponding line of list_file ($_).

The second line iterates over the master list. If a key is found in the
hash table, its entry is deleted. What remains is a hash table containing
only the entries whose keys do not appear in the master list.

The third, convoluted line fetches all the lists from the hash table,
sorts them back into original order, and prints them.
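For readers who don't speak Perl, here is a sketch of the same algorithm
in Python. The function name and the default key positions (30 in the
list file, 10 in the master file, 40 characters long, matching the Perl
above) are assumptions for illustration, not part of the original:

```python
def dedup(list_path, master_path,
          list_key_start=30, master_key_start=10, key_len=40):
    """Return lines of list_path whose key never appears in master_path,
    in their original order."""
    # Pass 1: build a dict mapping key -> [(line_number, line), ...]
    # from the (small) list file. This mirrors the Perl hash %m.
    pending = {}
    with open(list_path) as f:
        for lineno, line in enumerate(f, start=1):
            key = line[list_key_start:list_key_start + key_len]
            pending.setdefault(key, []).append((lineno, line))

    # Pass 2: stream the (possibly huge) master file line by line,
    # deleting every key we encounter. Memory use stays proportional
    # to the list file, not the master file.
    with open(master_path) as f:
        for line in f:
            pending.pop(line[master_key_start:master_key_start + key_len], None)

    # Whatever survived has no match in the master list; restore the
    # original order using the saved line numbers.
    survivors = [pair for pairs in pending.values() for pair in pairs]
    survivors.sort(key=lambda p: p[0])
    return [line for _, line in survivors]
```

The design point is the same as in the Perl: one dictionary lookup per
master line, so the master file is never sorted or held in memory.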


Of course, if your master file contains only keys, if you don't need to
retain original order, if you only want a list of keys not present, if you
don't ..., the program above can be simplified and its run time shortened.
If you want readability, however, add some 50 lines of syntactic sugar,
options for specifying the start and length of the keys, error checking,
and comments.
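As one illustration of how far the simplified cases collapse: if the
master file holds one bare key per line and original order doesn't
matter, the whole job is a set difference. A hypothetical Python sketch
(function name and key positions are assumptions):

```python
def missing_keys(list_path, master_path, key_start=30, key_len=40):
    """Keys from list_path that never appear in master_path.
    Assumes master_path holds one bare key per line; order not preserved."""
    with open(master_path) as f:
        seen = {line.rstrip("\n") for line in f}
    with open(list_path) as f:
        # Trailing blanks are stripped so fixed-width keys compare
        # equal to the bare keys in the master file.
        return {line[key_start:key_start + key_len].rstrip()
                for line in f} - seen
```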

There's a whole lot more to tell, and the approach has more possibilities
than meet the eye, but I'll stop here.

regards,
Danny

> -----Original Message-----
> From: Michael Abootorab [mailto:[log in to unmask]] Sent: Friday, July 12,
> 2002 4:34 PM
> To: [log in to unmask]
> Subject: Re: [HP3000-L] Deduping files
>
>
> If you have Suprtool, then table lookup is the fastest way.
>
> If not, use a short Perl script.
>
> thanks
> Michael
>
>
>
> On Fri, 12 Jul 2002 17:14:10 -0400, Porter, Allen
> <[log in to unmask]> wrote:
>
>>I'm looking for opinions and experiences with deduping large fixed ASCII
>>files.  For instance, if you have a list of names (50,000 records) and
>>you want to bounce that against a master list of names (5 million
>>records) to produce a third file of non-matching records (something
>>less than 50,000 records), what would be the best tool to use?  Also,
>>for this little example, let's say that the matching field is a
>>40-character name field.
>>
>>There are a multitude of ways to do this.  If you were patient, you
>>could even use QEdit, but who has that kind of patience?  So, what would
>>be your tool of choice...Image? SQL? Access? A custom C program?  Some
>>mystery UNIX utility?  Whatever your favorite solution would be.  I'm
>>interested in finding out what everyone thinks is the easiest and the
>>fastest way to accomplish something like this.
>>
>>> Allen Porter
>>> ENVOY
>>> ISO 9001 Registered
>>> Phone:  636-827-5704
>>> Fax:  636-827-5874
>>>
>>> Visit our Web site @ http://www.yourenvoy.com
>>>
>>>
>>>
>>>
>>Confidentiality Warning:  This e-mail contains information
> intended only for the use of the individual or entity named above.  If
> the reader of this e-mail is not the intended recipient or the employee
> or agent responsible for delivering it to the intended recipient, any
> dissemination, publication or copying of this e-mail is strictly
> prohibited. The sender does not accept any responsibility for any loss,
> disruption or damage to your data or computer system that may occur
> while using data contained in, or transmitted with, this e-mail.   If
> you have received this e-mail in error, please immediately notify us by
> return e-mail.  Thank you.
>>

* To join/leave the list, search archives, change list settings, *
* etc., please visit http://raven.utc.edu/archives/hp3000-l.html *
