HP3000-L Archives

June 2002, Week 3

HP3000-L@RAVEN.UTC.EDU

Subject:
From: Ken Hirsch <[log in to unmask]>
Reply To: Ken Hirsch <[log in to unmask]>
Date: Tue, 18 Jun 2002 17:56:46 -0400
Content-Type: text/plain

> Wirt Atmar wrote:
>
> > By comparison, to accomplish the same thing in 30-year-old BASIC, using the
> > 25-year-old IMAGE database that I just referenced requires this much code:
> >
> >      CALL XDBGET(B$,"MASTERSET;",M5,S[*],"WORD;",W$,"")
> >
> > That's it. Just stick that line in anywhere in your code. If S[1] = 0, the
> > word is spelled correctly. If not, it's not. But even more than that, it's
> > also really quite efficient.

This line
 CALL XDBGET(B$,"MASTERSET;",M5,S[*],"WORD;",W$,"")
is equivalent to this part of the Perl program:
  $dict{$_}
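
Spelled out as a runnable fragment (just a sketch, with a toy in-line
dictionary standing in for the word list the programs below load), the
check is one hash lookup per word:

  #!perl -w
  use strict;

  # Toy stand-in dictionary; the real programs below load the full word list.
  my %dict = map { $_ => 1 } qw(the word is spelled correctly);

  my $word = "correclty";                       # deliberately misspelled
  if ($dict{lc($word)}) {
    print "\"$word\" is spelled correctly\n";   # like S[1] = 0 from XDBGET
  } else {
    print "\"$word\" is not in the dictionary\n";
  }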

> i started to mention that your solution was far more efficient....but i figured
> someone else would point that out. :-)

Not necessarily.  If you are spell checking a long document, the Perl
program you gave is probably faster since it holds the dictionary in memory.
That's a reasonable approach for a stand-alone program, but not ideal for a
subroutine.
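
For the subroutine case, something like the following rough sketch would do
(the module name and check_word are my own; it assumes the hashed_dict file
built below), since a tied dictionary costs essentially nothing to "load":

  # SpellCheck.pm -- tie the on-disk dictionary once; each check is one lookup.
  package SpellCheck;
  use strict;
  use DB_File;

  my %dict;
  tie %dict, "DB_File", "hashed_dict", O_RDONLY, 0640, $DB_HASH
          or die "Cannot open file 'hashed_dict': $!\n";

  sub check_word {
    my ($word) = @_;
    return exists $dict{lc($word)};   # true if the word is in the dictionary
  }

  1;

A caller would then just "use SpellCheck;" and call SpellCheck::check_word($w).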

I just tried it as an experiment.
  Test dictionary: 238,640 words (2,757,518 bytes)
  Test text: "Losing the War" by Lee Sandlin (about 33,000 words, or 198K bytes).

This program took 2.1 seconds on a Windows/2000 PC with a Pentium III.
Almost all of that was to load the dictionary.
  #!perl -w
  use strict;

  # Load the whole dictionary (one word per line) into an in-memory hash.
  my %dict;
  open D,"<words" or die "Cannot open file 'words': $!\n";
  while(<D>){
    chomp;
    $dict{lc($_)}=1;
  }
  close D;

  # Split each input line into words; report any word not in the hash.
  while (<>) {
    my @words=split /[^a-zA-Z0-9']+/,$_;
    foreach (@words){
      if(!$dict{lc($_)}) {
        print "\"$_\" is not in the dictionary\n";
      }
    }
  }


By comparison, I loaded the dictionary into a hashed database:
    #!perl -w
    use strict;
    use DB_File;

    # Tie %dict to an on-disk hash file so the dictionary persists between runs.
    my %dict;
    tie %dict, "DB_File", "hashed_dict", O_RDWR|O_CREAT, 0640, $DB_HASH
            or die "Cannot open file 'hashed_dict': $!\n";

    # Read the word list (one word per line) and store each word as a key.
    while (<>) {
      chomp;
      $dict{lc($_)}=1;
    }

That took 70 seconds, but of course you only need to do it once.
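
In practice the build step could be guarded so it only runs when the file is
missing (a sketch; build_dict.pl is an assumed name for the loader above):

    # Rebuild the hashed dictionary only if the file doesn't exist yet.
    use strict;

    unless (-e "hashed_dict") {
      system($^X, "build_dict.pl", "words") == 0
        or die "Building hashed_dict failed: $?\n";
    }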

Then I used the hashed dictionary to spell check the same text file.  This
time it took 5.5 seconds.  So, I was right.  For any lengthy text, the first
approach is faster (but uses a lot more memory).

    #!perl -w
    use strict;
    use DB_File;

    # Open the prebuilt hashed dictionary read-only; there is nothing to load.
    my %dict;
    tie %dict, "DB_File", "hashed_dict", O_RDONLY, 0640, $DB_HASH
            or die "Cannot open file 'hashed_dict': $!\n";

    # Same checking loop as before, but each lookup goes to the disk file.
    while (<>) {
      my @words=split /[^a-zA-Z0-9']+/,$_;
      foreach (@words){
        if(!$dict{lc($_)}) {
          print "\"$_\" is not in the dictionary\n";
        }
      }
    }

For a short file (1359 words), it was the other way around.  The in-memory
program took 2.0 seconds, and the hash file program took 0.3 seconds.
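
The comparison is easy to rerun; here is a sketch of one way to take the
timings (the script and text file names are placeholders, and the figures
above weren't necessarily measured this way):

    # Time each spell checker end-to-end with Time::HiRes.
    use strict;
    use Time::HiRes qw(gettimeofday tv_interval);

    for my $script ("inmemory_check.pl", "hashed_check.pl") {
      my $start = [gettimeofday];
      system($^X, $script, "losing_the_war.txt") == 0
        or die "$script failed: $?\n";
      printf "%s: %.1f seconds\n", $script, tv_interval($start);
    }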

* To join/leave the list, search archives, change list settings, *
* etc., please visit http://raven.utc.edu/archives/hp3000-l.html *
