HP3000-L Archives

January 2000, Week 3

HP3000-L@RAVEN.UTC.EDU

Subject:
From:
Randall Davis <[log in to unmask]>
Reply To:
Randall Davis <[log in to unmask]>
Date:
Sat, 15 Jan 2000 18:58:24 -0700
Content-Type:
multipart/mixed
Parts/Attachments:
text/plain (6 kB) , rdavis.vcf (6 kB)
Our 989/250 has also experienced a great number of problems over the past 5
months.  We started out with socket table overflows, caused by a combination of
MPE bugs and application bugs that together would "shut down" the system from a
networking perspective.  The only way to recover was to reboot.  After making
changes to the application, we relieved the problem enough that a scheduled
weekly reboot was adequate to prevent unscheduled outages.  Unfortunately, this
relief unmasked the next problem: the inbound buffer table filling up.  This
problem, again, was the result of both MPE and application issues.  Fortunately,
a simple shutdown/restart of NS relieved it.  Not great, but much preferable to
a reboot.  We limped through our Christmas season with these problems.  HP had
identified quite a number of patches that they felt would relieve the issues,
but due to the seasonality of our business, we couldn't make the changes until
after Christmas.  On Dec. 26, I installed power patch 1 (these are all 6.0
issues, by the way) and 11 supplemental patches that HP recommended.  I hoped
that my problems were in the past.  On Jan. 4, the system crashed, and it has
crashed, on average, once per day since then.  HP was quick to review the memory
dumps, and was quite adamant that these "new" issues were not a result of the
patches.  Since the crashes only started after the patches, I found that
difficult to believe, but I encouraged (strongly) HP to continue to pursue a
resolution.  On Thursday of this week, I was told that the lab had identified
the issue (after 10 days).  On Friday, I was supplied with a patch.  HP rolled
this particular patch into their current NS-related patch (NSTFDM2), and
although it had not gone through HP's standard 72-hour regression testing, I
installed it last night (Friday).  After 24 hours, we have not crashed again.  I
know that does not sound like much, but I'm encouraged that maybe we've found a
stable point.
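For anyone facing the same thing, the NS shutdown/restart workaround mentioned
above amounts to a short MPE/iX command sequence along these lines (the network
interface name LAN1 is a site-specific assumption; substitute your own NI name,
and note the exact sequence your site uses may differ):

```
NSCONTROL STOP
NETCONTROL STOP;NET=LAN1
NETCONTROL START;NET=LAN1
NSCONTROL START;SERVER=ALL
```

This bounces the network services without a full reboot, which is what made it
so much less painful than the socket-table problem.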

Okay, my issue does not sound quite like the one described below, but it does
add credence to the thought that HP has some serious problems on the 989
platform.  In my call with HP yesterday, in response to the question of "what's
different about our environment," HP even alluded to the possibility that the
newer, faster CPUs in the 989 series are exposing bugs that previously had not
shown themselves.  I also think that 6.0, with its new BSD-compliant network
services, is not quite up to previous reliability standards.

Our next problem is one of performance.  Our I/O throughput is not close to what
I've previously experienced using older HP disk arrays.  We're using AutoRAID
disk arrays, and have not been very thrilled with the performance.  HP has been
very involved in helping to discover what is going on in this arena as well.  At
this point, I have nothing to report, as we're still in the information-gathering
phase.  HP's first thought was that it had to do with the firmware level on our
hardware.  However, after upgrading firmware on both the SCSI cards and the
AutoRAID controllers, we have not seen any significant change.  The thought that
these arrays were contributing to our system stability issues has also been
considered, but I do not really believe that's the case at this time.

The jury is still out on the new patch we installed, and the analysis of our I/O
issues is still underway.  However, HP has not told me that they are unwilling
to work on these issues in deference to 6.5.  I have been pushing HP very hard
from every angle (sales, support, and marketing), and have had our system in an
escalated state since November.  I recommend that you continue to push HP for
research and resolution of any and all problems you have.  We long-time HP3000
supporters and users are not used to poor reliability like this, and we should
not allow HP to consider it the norm.  I've tried to describe our problems, and
would love to hear others' situations as well.  I've been so focused on our NS
issues that I don't want to relax if there are problems in other subsystems.

Randall Davis

Wirt Atmar wrote:

> Gary writes:
>
> > We have an HP3000 989/450.  We have what we consider a very serious problem
> >  with it.  Evidently, the system is so fast and powerful that it is possible
> >  to fill up certain system tables before it can post it all to disk.  When
> >  this happens the system aborts.  HP has stated that they will not fix this
> >  problem (and we are on CSS support) because evidently, they are reworking
> >  the system tables in MPEiX 6.5 so they do not see a need to fix this issue.
> >
> >  We exposed this problem by using Adager.  I will stress, however, that it is
> >  not Adager's problem.  Their product is merely the vehicle by which it was
> >  exposed.  Adager has gone all out to try to find the problem, and they have
> >  produced a version which is actually somewhat crippled (my words) because it
> >  keeps track of how much data it has changed and, when it reaches a certain
> >  point, it stops and forces the OS to post.  In my opinion this was
> >  nice of Adager to do, but completely irresponsible of HP to force a vendor
> >  to cripple their product so that it can work on one of HP's highest end
> >  machines.  From what I have heard from people at HP, this is not the first
> >  instance of this problem on the 989, but not all of them were exposed by
> >  Adager.  I guess we were just lucky that a vendor willing to go the extra
> >  mile was the vehicle for us.
>
> Ken Sletten found a very similar problem recently while he was integrating
> his new, massively resized databases with gigabytes of data that had long
> been archived. The details, as I understand them, are that he crashed his
> machine (a 959, I believe) by doing too many DBPUTs in a row, while doing
> nothing else. In that process, a table overflowed and the machine crashed, a
> condition that sounds very much like the one Gary describes.
>
> Although Ken can speak to all of this far better than I can, they solved
> their problem by updating their IMAGE datasets in pieces, thereby allowing
> the OS to complete its tasks and clear out whatever table was overflowing,
> which again, is a solution that sounds very much like Gary's.
>
> All in all, it sounds like the symptoms of internal scheduling failure in
> MPE. Critical tasks are simply not being performed on a timely basis, perhaps
> because some IMAGE-based process is not periodically releasing itself to the
> control of a task scheduler.
>
> Wirt Atmar
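
The chunked-update workaround described above can be sketched generically.  This
is a hypothetical Python illustration, not the IMAGE intrinsics or Adager's
actual mechanism: the record source, batch size, and flush calls are all
illustrative assumptions.  The point is simply to bound the amount of unposted
change by forcing a flush to disk every N records, rather than letting one
monolithic run of puts fill a system table before the OS can post:

```python
# Hypothetical sketch of "update in pieces so the OS can post."
# Not IMAGE/Adager code; file appends stand in for DBPUTs.
import os

BATCH_SIZE = 1000  # stop and force a post after this many changes (assumed value)

def chunked_puts(records, path, batch_size=BATCH_SIZE):
    """Append records to a file, forcing data to disk after each batch."""
    written = 0
    pending = 0
    with open(path, "a", encoding="utf-8") as f:
        for rec in records:
            f.write(rec + "\n")
            written += 1
            pending += 1
            if pending >= batch_size:
                f.flush()              # hand buffered data to the OS
                os.fsync(f.fileno())   # force the OS to post it to disk
                pending = 0
        f.flush()                      # post the final partial batch
        os.fsync(f.fileno())
    return written
```

In this sketch, the periodic `fsync` plays the role of the forced post that
Adager's modified version triggers after a fixed amount of changed data.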

