LISTSERV - HP3000-L Archives

HP3000-L Archives

January 1995, Week 4

HP3000-L@RAVEN.UTC.EDU

	LISTSERV Archives
	HP3000-L Home
	HP3000-L January 1995, Week 4

	Log In
	Register

	Subscribe or Unsubscribe

	Search Archives

Options:	Use Monospaced Font Show Text Part by Default Show All Mail Headers
Message:	[<< First] [< Prev] [Next >] [Last >>]
Topic:	[<< First] [< Prev] [Next >] [Last >>]
Author:	[<< First] [< Prev] [Next >] [Last >>]

Subject:	Re: "Fuzzy" string matching?
From:	Wirt Atmar <[log in to unmask]>
Reply To:	[log in to unmask]
Date:	Fri, 27 Jan 1995 14:20:27 -0500
Content-Type:	text/plain
Parts/Attachments:	text/plain (66 lines)

Jeff Kell writes:

>Does anyone have, know of, or have source code for a "fuzzy" string
>match?  Not just a phonetic key (Soundex), but rather one which could
>tell you "how closely" two strings match?

>Ideally it would be "sort of" like a spelling checker, but extended to
>a string.  Simple transpositional errors (2 letters reversed), spelling,
>omitted substrings, etc., would be accounted for.

Precisely this kind of "fuzzy" string matching is extremely commonly done
nowadays in most every university's molecular genetics department. Indeed,
it's a good portion of their bread and butter. The general technique is
called PAUP (phylogenetic analysis using parsimony).

The idea behind PAUP is that there initially existed a single source string
(of DNA) that is now represented by a whole handful of existing strings,
where each existing string has very likely been permuted multiple times by a
number of simple character substitutions (called point mutations),
translocations (sections of the string are rearranged), insertions,
deletions, and/or inversions (parts of the strings inverted, including simple
transpositions).

The manner by which information gets screwed up genetically is exactly the
same as the way information gets screwed up in computers: one error at a
time.

The problem you're asking to solve is quite a bit easier than the one PAUP is
applied to. In yours, I'm presuming that you will have a dictionary of fixed
strings to match an unknown string against. In a phylogenetic (family tree)
analysis, the problem is inverted -- and more complex. You are given a whole
dictionary of strings and your problem is to cluster the strings together so
that the least number of changes are required (parsimony) to get the strings
to converge back to a single (often hypothetical) ancestral root string. By
doing this, you can draw a similarity diagram for the similarity distance
that exists between each of the strings or (higher up in the diagram) between
each of the clusters of strings.

To achieve a minimized PAUP diagram, you generally have to make many
interations of all possible combinations; the problem quickly becomes an
optimization problem on an optimization surface pocked by many local optima.
 But your problem is much easier. All that you need is a single sweep across
your existing dictionary of strings such that a resulting similarity measure
will be provided for each of the strings; this part of the analysis requires
only the core processor of PAUP that is called in each iteratative sweep
through the dictionary of strings.

The key to making PAUP work well is the initial recognition of all of the
various types of  mutational events that can (or are likely to) corrupt your
string. To some great degree, this part of the programming process is
governed simply by intuition and experience. If a mutational event, such as
the transposition of two letters, is common and you don't account for it
properly, your similarity measure between two strings will be different than
a routine that does. That's even more true for bulk insertions of new string
text or the bulk deletions of old text. A routine that doesn't account for
these mutational events will declare very little similarity.

Walk across the campus to your molecular genetics department, Jeff. I'm sure
that the people there would be quite pleased to talk about what they're doing
and explain the PAUP algorithm in greater detail. But don't let them get away
with dumping words like "anamorphies" and "synapomorphisms" on you. The ideas
underlying these words are very simple; it's just the words that cost a
dollar apiece.

Wirt Atmar

ATOM RSS1 RSS2

RAVEN.UTC.EDU