LISTSERV - HP3000-L Archives

HP3000-L Archives

October 1995, Week 2

HP3000-L@RAVEN.UTC.EDU

	LISTSERV Archives
	HP3000-L Home
	HP3000-L October 1995, Week 2

	Log In
	Register

	Subscribe or Unsubscribe

	Search Archives
Options:	Use Monospaced Font Show Text Part by Default Show All Mail Headers
Message:	[<< First] [< Prev] [Next >] [Last >>]
Topic:	[<< First] [< Prev] [Next >] [Last >>]
Author:	[<< First] [< Prev] [Next >] [Last >>]
Subject:	Re: Adager's date-conversion project
From:	Wirt Atmar <[log in to unmask]>
Reply To:	[log in to unmask]
Date:	Thu, 12 Oct 1995 01:36:49 -0400
Content-Type:	text/plain
Parts/Attachments:	text/plain (207 lines)
As expected, several people wrote back defending their use of numeric dates.
However, let me extend and further explain my earlier comments concerning
dates being encoded only as text (parsible) fields. Let me also add two other
comments about common mistakes in date fields in many databases that we often
see.
 
1. NUMERIC DATE FIELDS
 
Tony Furnivall writes in response, "Of course the fact that every digit in
every numerical field means something to some may tend to reduce the impact
of your elegant rhetoric! I agree that important semantic constraints should
not be coded in non-eye-readable formats. This is *exactly* the reason why we
generally code our dates in the internal packed decinal format I describe
elsewhere."
 
There are numbers, of course, where the character position means nothing, as
with the case of a counter. In these types of numbers, there is no logical
reason to pull the third character position out of the number and use it as
the basis of a query compare. But that's not true for a date value or for any
other form of concatenated values. Intermediate character positions do carry
information, information that is often quite valuable.
 
Tony writes, "Our experience has been that when people want to do do extracts
of data over date ranges for comparison, they are generally far more
interested in comparing equivalent date ranges (week 34 of the year, all of
May except Memorial Day, etc) rather than the simple 'anything with a '05' in
the third and fourth digits."
 
Tony is quite correct, of course. The most likely condition for most queries
is a simple date range, and in that circumstance, any date format that is
properly collatable will suffice, whether it is numeric- or text-formatted.
But if you need to extract the third and fourth digits out of a set of eight
-- and it is a far more common demand than Tony gives it credit --  you could
be quite stuck. Human-readable encodings (such as I- or P-datatypes) are not
sufficient, as Tony suggests they might be. Date fields must be readily
parsible if they are to be of real value.
 
The question is one of productivity, and there are two distinct attributes to
productivity. The group that reads this list is inherently prone to
technophilia. While the condition may not be curable, it can be controlled
with some self-restraint. The point of entering dates into a database is not
to satisfy our innate curiosities or intellectual proclivities, but to make
the end product as useful and usable to the greatest possible number of
users, especially so that fundamental business decisions can be made from the
data.
 
There exists a "pain threshold" that is associated with every product and
every project. If the pain is too high, then a project is never completed, or
even begun. If a process is made just a little bit too complex, it simply
doesn't get done. And that lack of completion represents an extraordinary
productivity loss -- one that cannot be begun to be quantified.
 
But there is a second productivity attribute, too, and that is processing
time. Let us presume that requirement is to find all of the invoices for May,
regardless of the year, and the date is numerically encoded as a Julian date,
offset from some arbitrary date, as it is MANMAN, then the query must be
formulated something like this (using a generic query syntax):
 
     find when substring(ascii(date(jd(DATE-DATAITEM,offset)))) = 05
 
In comparison, if the date were text-field encoded, the same query would be:
 
     find when DATE-DATAITEM(3,4) = 05
 
What's the difference? If the date was encoded as an X8 (8 bytes) field (in
CCYYMMDD format) and there were 10 million records in the dataset, the date
field would account for 80 megabytes of data by itself. On the other hand, if
the date were encoded as an I2 (4  bytes) field (still CCYYMMDD), then the
date would require only 40 megabytes of storage. Encoding the date field as a
text item carries with it a penalty of 40 MB.
 
But disc space has become inexpensive when compared to costs of CPU
utilization, especially on a large, multiuser system. If you had to perform a
serial search of the 10 million records, then the
"substring(ascii(date(jd(...,offset))))" function would have to be performed
10 million times. The second version is essentially without CPU cost.
 
The difference in processing times for a very fast processor reading 10
million records serially, using MR-NOBUF, would likely expand from 3-5
minutes for a single substring read, using the text-encoded field, to 1 to 4
hours, using the numerically encoded field (indeed, a 20x processing penalty
may be quite conservative).
 
By encoding dates as a text field, you've shot yourself in the foot twice
over: once for the dramatically increased processing time that you've
burdened the system with, and secondly for the increase in complexity that
you've foisted on your users -- who, for the most part, will not share your
joy in the intracacies of the screwy date format scheme that you've devised.
 
2. ROBUSTNESS
 
Duane Percox writes, "Respectively I ask: Why not [process data this way]?  I
think the folks who write report writers should be able to do this. You don't
always get to control the format of the database you might be reporting
against, so it would be a real shame to have to depend on a tool that wasn't
robust enough to do the conversion automatically.
 
"The databases came first. Then the report writers. The authors of the report
writers should be turning out tools that address the needs of the users. The
users of the report writers shouldn't have to adapt to the limitations of the
report writers."
 
The answer to the first part of Duane's question is answered above. But he is
absolutely correct in the second part of his comments. The databases did come
first. The report writer is a secondary purchase. And people do call us every
day asking how to formulate a query to process data in some manner to the
example above. At some point in the conversation, they almost always say,
"Damn, this seems unneccessarily complicated." And I agree with them.
 
I tell them that the database could have been designed a great deal simpler
and more accessible -- with enormously faster processing times. But I also
tell them that this is the way their database is and there's nothing really
that can be done about it now, but together we'll make the best of the
situation that they're facing. It's also at this point that I tend to quote
the immortal lines of Gary Cooper, "A man's gotta do what a man's gotta do"
-- meaning that if this is what you really want to do, we're just going to
have to plow through the data and take our lumps.
 
However, Duane presumes perhaps too much as to most report writers. The
high-end report writers can do all of what is generally neccessary, but the
most common report writer in use, by far and away, remains Query/3000. And
while everyone knows its limitations, a great many more reports would be
capable of being written in Query if only the databases had been designed
with more care and consideration of the end users.
 
In this era of bad public poetry, let me add to Johnny Cochran's catchphrase,
"If it doesn't fit, you must acquit!", the equally poetic line, "If you can't
use the item in Query, then it shouldn't be in your database, Mary!" -- or
alternatively, "If it doesn't fit, you must omit". While Query may have its
limitations, it is also represents the minimum universal set. If your
database proves to be successful, its use will outlive your tenure, and even
that of the data entry language you chose many times over and spread much
further than you ever initially imagined.
 
3. DDMMYYCC
 
Continuing the Bibical theme started in the initial posting, formatting a
date in the DDMMYYCC format in a database is an abomination in the sight of
the Lord. No matter how such a date field is built, text or numerically, it
saves no space over the proper format, but this format cannot be collated,
and any attempt to do so will require some additional (significant)
processing in the report writer to properly reverse the order.
 
It must be remembered that for whatever form of additional processing you
impose on a dataitem, you don't perform that manipulation just once, but once
for every record that the query touches. As databases grow larger, as they
will for the foreseeable future, you are imposing tremendous -- and wholly
unneccessary -- burdens on your machines.
 
4. DEGREES OF FREEDOM
 
Thomas Knoedler writes, "The date fields [in our database], while stored in
TurboImage as X8 fields, are known to Dictionary, Transact and BRW as 9(8)
fields, with four (4) subfields:  year  as  9(4), month  as  9(2),  day  as
 9(2),  month-day  as  9(4)."
 
There is absolutely nothing wrong with entering the subfields into a report
writer's dictionary in this manner; indeed, this method allows the easy
possibility of asking just for the month of May's invoices regardless of the
year, which was the question that concerned us above. But the complete X8
field must also remain in the dictionary. Without the whole of the date being
available to the report writer, each of Thomas' four subfields is free to
vary independently and you cannot logically ask query questions such as:
 
     find when year>90 and month>05 and year<95 and month<10
 
The month of January, 1992 does not satisfy the query as it is written,
although the user would almost certainly generally mean it to. If the date
fields are regarded only as subfields, and not as a concatenated whole, then
the only way to ask the question properly is to write:
 
     find when year*100+month>9005 and year*100+month<9510
 
The imposition of all of this additional processing burden on the machine can
be completely eliminated by retaining the date as a unit whole. While we do
have approx. 20 customers who do have their databases put together without a
complete date appearing anywhere in their records (they have only YEAR,
MONTH, DAY), this mistake in database construction is not all that common.
The far more common mistake we see is when the two fields, DATE, TIME, are
separate fields in the same record.
 
If DATE and TIME are separate items, they become independent variables, and
each assumes one degree of freedom. The trick necessary to rectify the
problem is the same as used above. If the user wishes to find all of the
activity reports from one 6AM period to the next, the query he must write is:
 
     find when DATE*10000+TIME ib 9510190600,9510200559
 
In some very lucky instances, in some databases, the DATE and TIME items are
adjacent to each other in the database and both are text, so that they may be
concatenated by nothing more than a dictionary redefinition.
 
Every process has an inherent level of time granularity. If that granular
level is the minute at one end and the century on the other, the date must be
recorded as a unit whole (CCYYMMDDHHMM). Nonetheless, the moral remains the
same. If the date is to be queried as to any interior part of that whole, as
it almost certainly will be, then the date/time stamp must be recorded as a
text-encoded item so that it may either be addressed as Thomas has done it,
through the dictionary redefinition of subfields, or by substring parsing in
the query portion of the report writer in use.
 
Any other form of formatting is going to impose significant processing
burdens on your machines -- and unnecessarily raise the barrier to its common
use among your users.
 
Wirt Atmar
ATOM RSS1 RSS2
RAVEN.UTC.EDU