HP3000-L Archives

March 1996, Week 1

HP3000-L@RAVEN.UTC.EDU

From: Ken Paul <[log in to unmask]>
Date: Tue, 5 Mar 1996 19:14:05 -0500

Dear Fellow Listers,
 
As I said when I started this thread, "I will report back to the group!".
 
Here is my article that appears in the current issue of InterexPress.
 
I would like to thank all of those people on this list for feedback
on this very important topic.
 
I thought that I would post it here for those who don't receive
InterexPress.
 
Comments welcome.
 
Start of article:
 
Reliability of 2-GB Disk Drives Update.
 
When I posted my "rantings" to the HP3000 list at the end of September
1995, I had no idea that I would spend the next 3 months gathering
information from several users and also visiting two HP sites and three HP
divisions looking into this issue.  I guess I have Dick Kranz (editor of
InterexPress) to thank/blame for this.  This issue probably would have died
on the HP3000 list if Dick hadn't called me and asked if he could print my
posting in an upcoming InterexPress Open Forum.  I said yes and thought
that that would be the end of it.  I did not know that Dick had gone to HP to
get an official reply to my posting.
 
In early November, I received a call from Orly Larson of HP who is the
liaison between Interex and HP.  Orly said that Clive Surfleet (Operations
Manager for CSO Mass Storage Operation) would like me to look into this
issue in conjunction with HP.  I sent Orly a fax of my original posting and
also copies of the responses I had received off of the Internet to my
posting.  Clive called a few days later and told me that HP had indeed
looked into this issue already and had found "no smoking gun" which would
suggest a problem.  He also told me that HP had nothing to hide on this
issue and wanted to know if I would be willing to join him in Boise, Idaho
and Roseville, California to talk to several people within HP about this
issue.  I accepted his offer and we agreed to meet on December 4-6.
 
The night of December 4th, I met with Clive, Susan Thom (HP - OEM
Sales Rep - Information Storage Group) and Bob Tillman (General
Manager - Disk Memory Division) for dinner in Boise, Idaho.  We talked at
length about where I was coming from on this issue and my feelings on
why this was becoming such a hot issue.  I told them that from an HP3000
perspective disk drives arriving DOA or failing within the first 90 days was
a very unusual, if not unheard of, problem.  HP3000 customers have
always been able to count on their hardware for reliable performance and
more importantly little downtime.  HP had built a 20+ year relationship with
HP3000 users and their expectations had been set very high and those
expectations were not being met with the problems being experienced with
the 2GB disk drives.
 
It was a very enjoyable dinner with everyone laying their cards out on the
table and voicing where they were coming from.  As a humorous aside,
Bob Tillman told a wonderful story of when he was in England working with
the division that created the DDS DAT tape drives.  It seems that after the
first DDS DAT drives were released, HP started hearing stories about
tapes being stuck in the units.  The engineers could find nothing wrong with
the units yet the problems still were being reported.  Finally someone
decided to place several DDS DAT drives outside the cafeteria and allow
people to play with the units as they were going to or coming from lunch.
Within no time several units had tape cartridges stuck in them and the HP
engineers found that if you pushed on one side of the tape or the other
more, the tape may get stuck.  The Engineers then wrote a spec on how to
insert tape cartridges into the drives so that they do not get stuck.  This
reminded me of the programmer saying that "Programmers should never
test their own code".
 
The morning of December 5th, Clive, Susan and I met with several people
within the Disk Memory Division (DMD).  The meeting started with Clive
discussing the HP Quality Task Force that was created shortly after Clive
assumed his current position at the beginning of 1995.  This Task Force
studied the 2GB disk drive issue for 3 months and found "no smoking gun".
The Task Force looked into the areas of Mech Quality, Systems
Manufacturing, Marketing/Sales, Delivery & Installation, Repair
Parts/Materials and Support.
 
Corvin Kuklinski of DMD gave a presentation on Mean Time Between
Failure (MTBF).  This was an excellent discussion on how MTBF has
become (and possibly always has been) a very worthless representation of
how long your disk drive (or light bulbs) will last.  Within this discussion
it became clear that MTBF was the MIPS (Millions of Instructions Per
Second) of the 90s.  During the 80s, MIPS came to stand for Meaningless
Indicator of Processor Speed.  The only correlation I can come up with for
MTBF is Meaningless Thing By Far.  If you have a better one (and I'm sure
you do) let me know.
 
What MTBFs have done is set the customer expectation much higher than
can be delivered.  One example that came up was a recent advertisement
by a disk manufacturer in which they were boldly stating an MTBF of 34
YEARS.  This represents approximately 300,000 hours, which is where
disk technology is today.  The unfortunate thing about this advertisement,
however, is that it sets the end user's expectations unrealistically high.  The
laws of physics make it impossible for any bearing grease to last 34 years in an
active disk drive.  I was always under the impression that MTBF had to do
with the arithmetic mean and if the MTBF of a disk drive was 34 years then
that meant that some drives would last less than 34 years and some would
last more but on average they would last 34 years.  I now know that this is
not what is meant by MTBF and I don't know if I could really give a good
definition of what MTBF means.
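
One common way to read an MTBF figure, offered here only as a sketch, is
as the inverse of a constant failure rate across a large population of
drives, not as the expected life of any single drive.  Under that
assumption (an exponential failure model, which is an assumption made here
and not something HP stated in the meeting), a 300,000 hour MTBF means that
a little under 3 percent of drives running around the clock would be
expected to fail in a given year.  The function names below are made up
for illustration.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8,760 power-on hours, assuming 24-hour operation


def mtbf_hours_to_years(mtbf_hours: float) -> float:
    """Express an MTBF figure as years of continuous operation."""
    return mtbf_hours / HOURS_PER_YEAR


def annual_failure_probability(mtbf_hours: float) -> float:
    """Chance that one drive fails within a year of continuous use.

    Assumes a constant failure rate (exponential model).  This is an
    illustrative assumption, not a definition taken from the article.
    """
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


if __name__ == "__main__":
    mtbf = 300_000  # hours, the figure quoted in the advertisement
    print(f"{mtbf:,} hours is {mtbf_hours_to_years(mtbf):.1f} years")
    print(f"implied yearly failure chance: {annual_failure_probability(mtbf):.1%}")
```

Read this way, the "34 years" says something about how many drives in a
large group fail each year, and nothing about how long the bearing grease
in one particular drive will hold up.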
 
This whole discussion of MTBF and its false expectations is why HP tried
to find an industry standards committee to develop and use a different
metric for classifying disk drive reliability.  One possible suggestion was
Annualized Failure Rate (AFR), which is a cumulative measurement of
disk drives working over time compared to the number of drives that have
failed so far.  Unfortunately, no industry standards committee would take
up the cause.  While MIPS has been replaced by megahertz (MHz),
SPECmarks, and TPC benchmarks, there is no replacement for MTBF in
the foreseeable future unless users let manufacturers know how worthless
MTBF is as a metric.
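
Read literally, the AFR described above is the number of failures divided
by the accumulated drive-years of operation.  The following is a minimal
sketch under that reading; the function name, the fleet size and the
failure count are hypothetical and do not come from HP.

```python
def annualized_failure_rate(failures: int, drive_hours: float,
                            hours_per_year: float = 8760.0) -> float:
    """Failures per drive-year of accumulated operation.

    drive_hours is the total running time summed over every drive in the
    population, including drives that have since failed.
    """
    if drive_hours <= 0:
        raise ValueError("drive_hours must be positive")
    drive_years = drive_hours / hours_per_year
    return failures / drive_years


if __name__ == "__main__":
    # Hypothetical fleet: 1,000 drives, each run for half a year (4,380 hours),
    # with 18 failures reported so far.
    afr = annualized_failure_rate(failures=18, drive_hours=1000 * 4380)
    print(f"AFR: {afr:.1%}")
```

Unlike MTBF, a figure of this kind can be checked by any site that keeps
count of its own drives and failures.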
 
Jeff Allen (DMD R & D Manager) spoke on Performance/Capacity/
Reliability Tradeoffs.  This was a history lesson on where disk drive
technology has come from within the last 15 years.  Storage capacity of
disk drives has increased from 6MB in 1980 to 380MB in 1987 to 2GB in
1993 to 9GB in 1996.  The size of the platters has gone from 14 inches to
5 1/4 inches to 3 1/2 inches in that same time period.  Average access
times have gone from 80 ms to 18 ms to 10 ms to 9 ms, and MTBF has
gone from 10,000 hours (> 1 year) to 30,000 hours (3.5 years) in 1983 to
50,000 hours (5.7 years) in 1984 to 150,000 hours (> 17 years) in 1987 to
300,000 hours (> 34 years) in 1993.  Overall physical size of the units has
obviously decreased and the tolerances allowed have become smaller and
smaller.  The distance the read/write heads are from the surface of the
platter is now about 1/4 of the width of a piece of hair.
 
Rick Seymour (DMD R & D Manager) led the next section on Disk Drive
Reliability Improvement.  This presentation focused on how HP is using the
latest state-of-the-art technologies in developing their next generation of
disk drives.  The Printed Circuit Assembly (PCA), also known to HP3000
users as the Controller, is being pushed to greater and greater tolerances.
Previous PCAs had to withstand voltage variances from -7 to 10 percent,
whereas the current PCAs have to withstand voltage variances from -15 to
15 percent.  Temperature variances used to be 0 degrees Centigrade to 60
degrees Centigrade but new PCAs have to handle -40 degrees Centigrade
to 100 degrees Centigrade.  DMD is also using Design Verification Testing
(DVT) to study the success rate of the manufacturing process.
 
The morning ended with a tour of the manufacturing process of the latest
DMD creation.  The COUGAR disk drive has an 8.75GB capacity on ten
3 1/2 inch platters enclosed in a case 1.625 inches high.  Seeing the process
that these drives go through it is hard to imagine any of them showing up
DOA.  Each drive goes through an extensive burn-in process both before
and after it has been formatted.  This was definitely a worthwhile part of
my visit because I was able to see the great care that is put into every disc
drive that is manufactured by HP.
 
During lunch we discussed several issues.  The 2GB disk drives in
question are the C3010 which in DMD is known as the Coyote IV and the
C2490 which is known in DMD as the Wolverine III.  I was able to see a
Cougar drive and a Lynx drive (1 inch high disc drive with 1GB or 1.6 GB
capacity) but unfortunately not a Coyote IV or a Wolverine III.  One of the
major issues that had come up on the HP3000 list after my initial posting
was whether the Controller (PCA) was soldered to the Mech (Head Disk
Assembly (HDA)).  All of the engineers present from DMD had read
through the postings that I had forwarded to Clive and they were very
surprised by this.  The DMD people felt that there was no way that the PCA
was soldered to the HDA.  They were not happy that Customer Engineers
(CEs) or customers themselves may be switching the PCAs from one HDA
unit to another to see if it was the PCA that had failed.  They worried about
the fact that the right tools may not be used and that the environment may
not be suitable for doing such work (i.e. static free).  They also told me
that on the Cougar drive the PCA would be finely tuned to the HDA in order to
get the performance and capacity that was required so that swapping PCAs
would not even be an option.
 
(Huge aside)
 
This is the issue which is probably the reason for the "problems" with the
2GB drives.  The question is whether CEs are told to replace 2GB drives
when a failure occurs (no matter what the cause).  Several end-users have
confirmed that the PCA can be removed from one 2GB drive and placed
onto another one.  Other end-users have been told by CEs that the PCA
and HDA are an integral unit and cannot be separated.  If DMD does not
like the PCA to be removed from one drive and placed on another, then it is
very possible that CEs are being trained that this is an integral unit and
should be replaced as a whole.  Other CEs, who have been around a while
or who are forced/encouraged by their customers, may be replacing just a
PCA when a disk drive fails.
 
Several times the discussion centered around the philosophy of what HP
thinks the customers should be doing and what the HP3000 users have
been used to for over 20 years.  HP states that if your data is so critical
that you cannot afford to lose it then you should be using Disk Mirroring
or RAID technology for your data.  I stated that HP3000 users have had very
few problems with losing data off of their disk drives.  Their disk drives
rarely failed and when they did it was usually a controller that went bad and
a new controller was put on with no loss of data.  Because the disk drives
were so reliable they were moved from one HP3000 to the next as the
systems were upgraded to a faster machine.  HP would like users to buy
Disk Mirroring software which requires twice the amount of disk space or
RAID devices which are the latest and greatest (i.e. expensive) disk
devices.  End users can not go to management and say we need to buy
twice as much disc with no added disk space or we need the latest RAID
box.  Companies have a hard enough time getting permission to upgrade
the speed of their machine.
 
In my opinion, HP is asking the user base to make a significant change in
the way they think about the integrity of their system.  If I have not had to
worry about the integrity of my data for 20 years on this machine, how
come I have to start worrying about it even though I'm using the same
machine?  Is there something wrong with my new peripherals?  Is it just a
ploy by HP to get me to buy more hardware?
 
The long time HP3000 users have always been able to tinker with their
machines.  Many have collected old devices that they have been able to
cannibalize when an active device fails.  All this has been done with very
little if any loss of data.  Now when a 2GB disk drive fails, a CE comes in,
replaces the whole unit with a new one, and takes the old one out the door.
The recovery process is very painful for customers.  If the disk drive was
on the system volume they have the choice of reloading the system
volume from tape (and possibly losing some work that was done since the
backup) or they can try to get the data off of the old 2GB disk drive using
an HP diagnostic utility.  This second process requires unloading up to
2GB of data to tape and then loading the 2GB of data back onto the new
disk from tape.  Both processes can take a considerable amount of time.
One user told me that this process took them 13 hours before they were
back up and running.
 
An old HP3000 user would wonder what harm there would be in removing
the PCA off the old drive and replacing it with a new PCA to see if the
problem is the controller.  Several people have told me that this process
can take from 15 minutes to at most 1 hour and there is potentially no loss
of data.  If this process does not work then I can decide which of the other
two painful processes I would like to perform.  HP's approach used to be
let's try a bypass and if that doesn't work we may be looking at a heart
transplant.  Now HP is saying that the only option is a heart transplant even
if the problem is a clogged valve.  This is a big, painful pill for HP3000
users to swallow given their past history and it is even more painful when
no warning is given that it is the new policy.
 
A lot of the above is speculation on my part but the questions still remain.
Who decides the repair policy of 2GB disk drives?  Is it DMD, the CE
organization, CSO or some other party?  I would hope that these new
issues are resolved.
 
As more food for thought, I was talking to a customer when they brought up
the point that when a terminal, memory board, CPU board, DDS tape drive
etc. etc. goes bad and the CE replaces it they are not leaving with any
data.  When a 2GB disk drive fails and it is replaced, as much as 2GB of
their company's data is going out the door and they have to figure out how
they are going to recover it.
 
Since my visit to HP in December, I have read the HP3000 list very closely
and have found two very good quotes from HP employees that seem to
say it all.  The first was on a discussion of spool files being lost and the
quote was "HP considers data loss among its most urgent priorities".  The
second was on the topic of users doing unsupported things and the quote
was "Remember, your peripherals are the lifeblood of your system.".  I
know that everyone within HP would agree with both of these statements
but they have very different ways of achieving them.  If a disk drive
replacement is necessary, then utilities need to be available that will
facilitate the transfer of data from a bad drive to a good one and do it in a
more timely fashion than is currently available.
 
(Back to lunch)
 
I was also asked if there was anything else I wanted to know that had not
been covered yet.  I asked about getting actual numbers as far as the
failure rates were concerned because we were always getting nebulous
phrases like "the failure rate is below the norm".  I was told by Greg
Engelbreit (Technical Marketing Manager) that every drive that fails comes
back to DMD for analysis to find out what the problem was.  The AFR for
the Coyote IV drive is 3.6% and for the Wolverine III is 3.8%.  Greg
stressed that disk manufacturers may calculate AFR differently.  Some
may use a figure of their disks only running 12 hours or 16 hours a day but
DMD uses a figure of 24 hours a day usage.  If HP used 12 hours a day
instead of 24 these AFR numbers would be cut in half.  He also brought up
the fact that the Wolverine III had not done as well as HP had hoped
because it was the first family on the 3.5 inch media and HP may have been
a little too optimistic.
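
The halving Greg described follows if failures are assumed to accrue in
proportion to powered-on hours.  That proportionality is an assumption
here, and the helper name below is made up for illustration.  Under it,
the quoted figures can be restated for any assumed duty cycle, which is
worth doing before comparing AFRs from different manufacturers.

```python
def rescale_afr(afr_percent: float, from_hours_per_day: float,
                to_hours_per_day: float) -> float:
    """Restate an AFR quoted at one daily duty cycle at another.

    Assumes failures scale linearly with powered-on hours.
    """
    if from_hours_per_day <= 0 or to_hours_per_day <= 0:
        raise ValueError("hours per day must be positive")
    return afr_percent * to_hours_per_day / from_hours_per_day


if __name__ == "__main__":
    # AFRs quoted in the article, both stated for 24-hour-a-day usage.
    quoted = {"Coyote IV (C3010)": 3.6, "Wolverine III (C2490)": 3.8}
    for drive, afr_24h in quoted.items():
        afr_12h = rescale_afr(afr_24h, from_hours_per_day=24, to_hours_per_day=12)
        print(f"{drive}: {afr_24h}% at 24 h/day, {afr_12h:.1f}% at 12 h/day")
```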
 
I also brought up the fact that Clive was sitting on the Management
Roundtable in Toronto and that the work of the HP Quality Task Force was
not brought up.  I felt that this was an excellent opportunity for HP to
bring up the issue and clear the air since it was at the last Management
Roundtable in Denver that the problem was brought up and the participants
were told "we'll get back to you".  Clive explained that he was not aware of
that promise being made at Denver and had he known he would definitely
have brought it up.
 
In the afternoon I visited the Storage Systems Division (SSD) and learned
about their future offerings in the area of HP AutoRAID.  The next day,
Clive and I traveled to Roseville, California and met with the System
Interconnect Lab (SIL) and learned about the extensive testing that all of
the HP equipment goes through before being shipped and the new
approach being taken by HP to provide users with the best peripherals in
the least amount of time.  Those two visits will be covered in a future
article (I hope).
 
In closing this latest rambling I would like to thank Clive Surfleet and
Susan Thom and all the people I met at the three HP divisions.  They were
very willing to listen to what I had to say and were not at all evasive in
answering any of my questions.  There are still several areas that I would
like to follow up on and I know that they will try to answer all of my
questions.
 
Respectfully submitted,
 
Ken Paul
[log in to unmask]
