HP3000-L Archives

July 2002, Week 3

HP3000-L@RAVEN.UTC.EDU

Subject: Re: [HP3000-L] Deduping files
From: Michael Abootorab <[log in to unmask]>
Reply-To: Michael Abootorab <[log in to unmask]>
Date: Mon, 15 Jul 2002 19:14:04 -0400
Content-Type: text/plain
Parts/Attachments: text/plain (219 lines)

Hi,

Another variation of the program follows:

#!/usr/bin/perl -w
#
# run with 3 parms: masterfile, inputfile and outputfile
#
#   outputfile = inputfile - masterfile
#   (the input records that do not appear in the master file)
#

my (%masterkey,$record,$total1,$cnt);

$total1=$cnt=0;

if (@ARGV != 3) {
    print "$0 requires 3 parms: masterfile, inputfile, outputfile\n";
    exit(1);
}

if (!open(MAST, $ARGV[0])) {
   print "file open failed on $ARGV[0]: $!, abort\n";
   exit(1);
}

if (!open(INPUT, $ARGV[1])) {
   print "file open failed on $ARGV[1]: $!, abort\n";
   exit(1);
}

if (!open(OUTPUT, ">$ARGV[2]")) {
   print "file open failed on $ARGV[2]: $!, abort\n";
   exit(1);
}



# load every master record (newline included) into a lookup hash
while (defined($record = <MAST>)) {
      $masterkey{$record}++;
      $total1++;
}


# copy each input record to the output unless it appears in the master
while (defined($record = <INPUT>)) {
      next if exists $masterkey{$record};
      $cnt++;
      print OUTPUT $record;   # $record already ends with its newline
}


print <<EOT;
 total number of input records not found in master : $cnt
 total number of records in master                 : $total1
EOT
print "end of program $0\n";


On Mon, 15 Jul 2002 17:02:04 -0500, Danny van Delft <[log in to unmask]> wrote:

> In article <[log in to unmask]>, "Porter, Allen"
><[log in to unmask]> wrote:
>
>> Lots of good ideas; has anyone done any kind of speed comparisons?
>
>Well, not a comparison, just one try with Perl. It would probably be good
>enough for me for a "run occasionally" scenario. I ran the following
>program with a check file of about 51k lines and a master list of 12M
>lines. Both files had 70-character lines. It took about 5 minutes to code
>and about 35 seconds to run (on a 1.2GHz linux machine, just to give a
>meaningless number). Run time should be about linear with master file
>size, so your example would take about 2 minutes.
>
>Note that the programming style is not, not, not what you'd want to use in
>real life. After a week, it'll take you at least a minute to figure out
>what's going on. But for a one-shot... Use at your own risk. I haven't done
>rigorous tests, but the output was what I expected...
>
>--CUT--
>#!/usr/local/bin/perl
>#first argument is filename containing lines to check for dups with; key
>#starts at position 30 (base 0), length 40
>#second argument is filename containing keys, starting at position 10
>#output will be sorted into original order of first argument file.
>#size of first file determines memory size, second file size is irrelevant
>#neither file needs to be sorted
>
>while(<>){push(@{$m{substr($_,30,40)}},[$.,$_]); last if(eof);}
>while(<>){delete $m{substr($_,10,40)} if exists($m{substr($_,10,40)});}
>print map {$_->[1]} sort {$a->[0] <=> $b->[0]} map {@{$_}} values(%m);
>--CUT--
>
>
>Run it as "whatever list_file master_file" if the above program is named
>"whatever".
>
>The first line of code (starting with "while") builds a hashtable keyed on
>the wanted substring of each list_file line. Each entry in the table
>contains a list whose elements are two-element lists holding the line
>number ($.) and the full line of the list_file ($_).
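>
>As a toy illustration (made-up data of mine, not from the test above): if
>list_file line 1 has key "SMITH..." and line 2 has key "JONES...", the
>first loop leaves %m looking like
>
>  %m = (
>    'SMITH...' => [ [1, "line 1 text\n"] ],
>    'JONES...' => [ [2, "line 2 text\n"] ],
>  );
>
>so duplicate keys in list_file simply collect several [line, text] pairs
>under one entry.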
>
>The second line iterates over the master list. If a key is found in the
>hashtable, its entry is deleted. What remains is a hashtable containing
>only the entries whose keys do not appear in the master list.
>
>The third convoluted line just fetches all the lists from the hashtable,
>sorts them into original order and prints them.
>
>
>Of course, if you have a master file with only keys, if you don't need to
>retain original order, if you only want a list of keys not present, if you
>don't ..., the above program can be simplified and its run time shortened.
>However, if you want readability, add some 50 lines of syntactic sugar:
>options for providing the start and length of the keys, error checking and
>comments (see the sketch below).
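>
>A readable equivalent might look something like this (my untested sketch
>of the same logic; the option letters and defaults are my own invention):
>
>#!/usr/local/bin/perl -w
>use strict;
>use Getopt::Std;
>
># -l: key start in list_file, -m: key start in master_file, -n: key length
>my %opt = (l => 30, m => 10, n => 40);   # defaults match the one-liner
>getopts('l:m:n:', \%opt);
>
>my ($list_file, $master_file) = @ARGV;
>die "usage: $0 [-l pos] [-m pos] [-n len] list_file master_file\n"
>    unless defined $master_file;
>
># remember each list_file line under its key, tagged with its line number
>open(LIST, $list_file) or die "cannot open $list_file: $!\n";
>my %m;
>while (<LIST>) {
>    push @{ $m{ substr($_, $opt{l}, $opt{n}) } }, [$., $_];
>}
>close(LIST);
>
># drop every entry whose key shows up in the master list (the "if exists"
># in the one-liner is redundant; delete on a missing key is harmless)
>open(MASTER, $master_file) or die "cannot open $master_file: $!\n";
>while (<MASTER>) {
>    delete $m{ substr($_, $opt{m}, $opt{n}) };
>}
>close(MASTER);
>
># flatten, restore original line order, print
>print map  { $_->[1] }
>      sort { $a->[0] <=> $b->[0] }
>      map  { @$_ } values %m;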
>
>There's a whole lot more to tell, and there's more to this approach than
>meets the eye, but I'll stop here.
>
>regards,
>Danny
>
>> -----Original Message-----
>> From: Michael Abootorab [mailto:[log in to unmask]]
>> Sent: Friday, July 12, 2002 4:34 PM
>> To: [log in to unmask]
>> Subject: Re: [HP3000-L] Deduping files
>>
>>
>> If you have Suprtool, then table lookup is the fastest way.
>>
>> If not, use a short Perl script.
>>
>> thanks
>> Michael
>>
>>
>>
>> On Fri, 12 Jul 2002 17:14:10 -0400, Porter, Allen
>> <[log in to unmask]> wrote:
>>
>>>I'm looking for opinions on and experiences with deduping large fixed
>>>ASCII files.  For instance, if you have a list of names (50,000 records)
>>>and you want to bounce it against a master list of names (5 million
>>>records) to produce a third file of non-matching records (something less
>>>than 50,000 records), what would be the best tool to use?  Also, for this
>>>little example, let's say that the matching field is a 40-character name
>>>field.
>>>
>>>There are a multitude of ways to do this.  If you were patient, you
>>>could even use QEdit, but who has that kind of patience?  So, what would
>>>be your tool of choice... Image? SQL? Access? A custom C program?  Some
>>>mystery UNIX utility?  Whatever your favorite solution would be, I'm
>>>interested in finding out what everyone thinks is the easiest and fastest
>>>way to accomplish something like this.
>>>
>>>> Allen Porter
>>>> ENVOY
>>>> ISO 9001 Registered
>>>> Phone:  636-827-5704
>>>> Fax:  636-827-5874
>>>>
>>>> Visit our Web site @ http://www.yourenvoy.com
>>>>

* To join/leave the list, search archives, change list settings, *
* etc., please visit http://raven.utc.edu/archives/hp3000-l.html *
