HP3000-L Archives

June 1995, Week 3

HP3000-L@RAVEN.UTC.EDU

Subject:
From: Denys Beauchemin <[log in to unmask]>
Reply To:
Date: Thu, 15 Jun 1995 10:18:44 -0400
Content-Type: text/plain
Parts/Attachments: text/plain (100 lines)
Martin Knapp writes:
 
<<Guy Smith wrote asking how best to randomise distribution of keys in his
master dataset.
I passed this on to the FLEXIBASE/IMAGE development team, and thought people out
there might like to share the words of wisdom they came up with....
 
The nature of character-key hashing is to randomize similar strings of
characters (yo!). For example, with an X8 key and a capacity of
2,000,001, the following key values give these primary key
positions:
A0000000 = 0
A0000010 = 1,714,307
A0000011 = 1,196,469
 
So it is not necessarily the "similarity" of keys that causes a
problem. The algorithm takes the first 4 bytes and creates a number
from them (which will be the same for all three keys above); it then
converts the next 4 bytes and "randomizes" the first number by
cycling its bits according to the second number. An X12 key would
randomize even more (even an X10 would be better). If disc space
allows, then adding 4 more bytes to the key may help.
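To make the scheme described above concrete, here is a minimal sketch
in C: pack the first 4 bytes into a number, pack the next 4 into a
second number, cycle the bits of the first according to the second,
then reduce by the capacity. This is a toy model, not the actual IMAGE
algorithm, so it will not reproduce the positions quoted above:

#include <stdio.h>
#include <stdint.h>

/* Pack 4 key bytes into one 32-bit number. */
static uint32_t pack4(const char *p)
{
    return ((uint32_t)(unsigned char)p[0] << 24) |
           ((uint32_t)(unsigned char)p[1] << 16) |
           ((uint32_t)(unsigned char)p[2] <<  8) |
            (uint32_t)(unsigned char)p[3];
}

/* Rotate x left by n bits ("cycling the bits"). */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31;
    return n == 0 ? x : (x << n) | (x >> (32 - n));
}

/* Toy X8 hash: the first 4 bytes give a number, the next 4 decide
   how its bits are cycled; the result modulo the capacity is the
   primary key position. */
static uint32_t toy_hash(const char *key, uint32_t capacity)
{
    uint32_t first  = pack4(key);      /* same for all three keys   */
    uint32_t second = pack4(key + 4);  /* differs in the last bytes */
    return rotl32(first ^ second, second) % capacity;
}

int main(void)
{
    const char *keys[] = { "A0000000", "A0000010", "A0000011" };
    for (int i = 0; i < 3; i++)
        printf("%s -> %u\n", keys[i],
               (unsigned)toy_hash(keys[i], 2000001u));
    return 0;
}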
However, the choice of capacity may well be the real problem. Some
capacities can be really bad for a given distribution of key
values. Even numbers and multiples of 250 are almost always bad.
How do you find a good capacity? Well, if you don't have one of the
main 3rd-party tools to help manage your database, then trial and
error is probably the only method (though there may be something on
the CSL to help; I don't know). FLEXIBASE does have a function to
recommend good capacities based on the actual contents of the
dataset, and I believe that Adager and DBGeneral will do the same
(please note this is _not_ a plug).
Of course, the advantage of improving performance simply by changing
capacity is that you don't have to change the existing keys, or even
necessarily the algorithm to produce them.
If you wanted to use integer key values, then there is one caveat.
You must make sure that the maximum possible value that you generate
for your key does not exceed the capacity of the dataset.
 
Hope this provides useful food for thought.
 
Yours sincerely,
 
Martin Knapp
 ==========================================================================
Any opinions expressed herein are my own, and should not be used to
incriminate my employer or anybody else: living, deceased, or imaginary.
 
Martin Knapp                             Voice:  (+44)-1689-877-933
Proactive Systems                          FAX:  (+44)-1689-891-475
Central Court                           E-Mail:  [log in to unmask]
Knoll Rise
Orpington
Kent, BR6 0JA
United Kingdom
 
>>
 
Well, there are a number of errors in this one.
 
1- Similarity of the keys does indeed affect distribution.  I understand the
concept behind the hashing algorithm and the underlying theme of breaking up
similar strings, but. . .  A few years ago, I conducted some (not very)
scientific tests and came to the following conclusion: when an alpha key's
beginning sequence was very similar for a large proportion of the entries,
hashing generated 33% more synonyms than when the keys were more
randomized.  1/3 more synonyms is not a tremendous impact, but it definitely
is there.  I have neither the time nor the desire to repeat the experiment;
I leave it to someone else to prove it wrong.  I do not specifically say that
Martin's contention is erroneous, but I do not like such a clear-cut,
sweeping statement.  So debate on this one.
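 
For anyone who wants to try something similar, here is a rough harness
along the lines of that experiment, built on the same sort of toy
rotate-and-mix hash as the sketch quoted above.  Its absolute numbers
say nothing about IMAGE itself; it only shows the shape of the test:
 
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

#define CAPACITY 100003u   /* arbitrary capacity for the test */
#define NKEYS    50000

static uint32_t pack4(const char *p)
{
    return ((uint32_t)(unsigned char)p[0] << 24) |
           ((uint32_t)(unsigned char)p[1] << 16) |
           ((uint32_t)(unsigned char)p[2] <<  8) |
            (uint32_t)(unsigned char)p[3];
}

static uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31;
    return n == 0 ? x : (x << n) | (x >> (32 - n));
}

static uint32_t toy_hash(const char *key)
{
    return rotl32(pack4(key) ^ pack4(key + 4),
                  pack4(key + 4)) % CAPACITY;
}

/* A synonym is an entry whose primary address is already taken. */
static int count_synonyms(char keys[][9])
{
    static unsigned char used[CAPACITY];
    int synonyms = 0;
    memset(used, 0, sizeof used);
    for (int i = 0; i < NKEYS; i++) {
        uint32_t h = toy_hash(keys[i]);
        if (used[h]) synonyms++;
        else used[h] = 1;
    }
    return synonyms;
}

int main(void)
{
    static char similar[NKEYS][9], randomized[NKEYS][9];
    for (int i = 0; i < NKEYS; i++) {
        sprintf(similar[i], "AB%06d", i);   /* similar beginnings */
        for (int j = 0; j < 8; j++)         /* fully random keys  */
            randomized[i][j] = 'A' + rand() % 26;
        randomized[i][8] = '\0';
    }
    printf("synonyms, similar keys:    %d\n", count_synonyms(similar));
    printf("synonyms, randomized keys: %d\n", count_synonyms(randomized));
    return 0;
}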
 
2- Increasing the key length will not really help, because the new characters
will probably be zeros for the most part.  For example:
 
A0000001    will become   A000000001.
 
In French we call this "un coup d'epee dans l'eau" (a sword stroke in
the water, i.e. lots of energy expended for no result).
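 
To see the sword stroke in code, widen every key with zeros that are
identical for all entries and compare synonym counts.  The sketch
below uses a simple byte-at-a-time toy hash (again, not IMAGE's);
since the added characters carry no information that distinguishes one
key from another, the two counts should land in the same ballpark:
 
#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define CAPACITY 100003u
#define NKEYS    50000

static uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31;
    return n == 0 ? x : (x << n) | (x >> (32 - n));
}

/* Byte-at-a-time toy hash over a key of any length. */
static uint32_t toy_hash(const char *key)
{
    uint32_t h = 0;
    for (const char *p = key; *p; p++)
        h = rotl32(h, 5) ^ (unsigned char)*p;
    return h % CAPACITY;
}

static int count_synonyms(char keys[][12])
{
    static unsigned char used[CAPACITY];
    int synonyms = 0;
    memset(used, 0, sizeof used);
    for (int i = 0; i < NKEYS; i++) {
        uint32_t h = toy_hash(keys[i]);
        if (used[h]) synonyms++;
        else used[h] = 1;
    }
    return synonyms;
}

int main(void)
{
    static char x8[NKEYS][12], x10[NKEYS][12];
    for (int i = 0; i < NKEYS; i++) {
        sprintf(x8[i],  "A%07d", i);   /* A0000001, ...          */
        sprintf(x10[i], "A%09d", i);   /* ... becomes A000000001 */
    }
    printf("synonyms, X8 keys:  %d\n", count_synonyms(x8));
    printf("synonyms, X10 keys: %d\n", count_synonyms(x10));
    return 0;
}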
 
3- Even numbers and multiples of 250 (???) are not inherently bad capacities
from which to choose.  Numbers which are powers of 2 may adversely affect the
hashing.  Prime numbers, as a rule, are not the most detrimental values to
pick; however, they are usually easily supplanted by another number.  In
other words, you probably will not go wrong picking a prime, but you can
usually do better.  For hashing keys, sampling can help, but for integer
keys, one must calculate the appropriate capacity (see below).
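 
Lacking one of the tools mentioned above, the sampling can at least be
automated.  The sketch below hashes a sample of keys once with a toy
32-bit hash, then scores a few candidate capacities by synonym count,
on the assumption that the capacity only enters as the final modulo.
The key values and candidates are placeholders; real ones should be
sampled from the dataset in question:
 
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define NKEYS 50000

static uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31;
    return n == 0 ? x : (x << n) | (x >> (32 - n));
}

/* Full 32-bit toy hash; only the final modulo depends on capacity. */
static uint32_t toy_hash32(const char *key)
{
    uint32_t h = 0;
    for (const char *p = key; *p; p++)
        h = rotl32(h, 5) ^ (unsigned char)*p;
    return h;
}

static int synonyms_for(const uint32_t *hashes, int n, uint32_t cap)
{
    unsigned char *used = calloc(cap, 1);
    int synonyms = 0;
    if (used == NULL)
        return -1;
    for (int i = 0; i < n; i++) {
        uint32_t addr = hashes[i] % cap;
        if (used[addr]) synonyms++;
        else used[addr] = 1;
    }
    free(used);
    return synonyms;
}

int main(void)
{
    static uint32_t hashes[NKEYS];
    char key[16];
    for (int i = 0; i < NKEYS; i++) {
        sprintf(key, "AB%06d", i);      /* stand-in key values */
        hashes[i] = toy_hash32(key);
    }
    /* 2^17, multiple of 250, even, prime, prime */
    const uint32_t candidates[] =
        { 131072u, 100000u, 100002u, 99991u, 100003u };
    for (int c = 0; c < 5; c++)
        printf("capacity %6u -> %d synonyms\n",
               (unsigned)candidates[c],
               synonyms_for(hashes, NKEYS, candidates[c]));
    return 0;
}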
 
4- Integer keys are the best bet, IF you control the key values and know the
lower and upper limits.  It is simply incorrect to say that an integer key
value must not be greater than the capacity of the set.  Since IMAGE uses the
key value modulo the capacity to yield the record number, I leave it as an
exercise to the reader to figure out why a value greater than the capacity is
not inherently bad.
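 
To illustrate the arithmetic (glossing over the exact record-numbering
details), a minimal sketch:
 
#include <stdio.h>

int main(void)
{
    unsigned capacity = 1000;
    unsigned keys[] = { 7u, 1007u, 123456789u, 2000000007u };
    for (int i = 0; i < 4; i++)
        printf("key %10u -> location %u\n",
               keys[i], keys[i] % capacity);
    /* 7 and 1007 land on the same location, while 2000000007
       exceeds the capacity two-million-fold and is still legal.
       The hazard is not a large key value but a key-generation
       scheme whose step shares a factor with the capacity, which
       crowds the entries onto a fraction of the locations. */
    return 0;
}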
 
 
Kind regards,
 
Denys. . .
