HP3000-L Archives

August 2003, Week 3

HP3000-L@RAVEN.UTC.EDU

From: Wirt Atmar
Date: Tue, 19 Aug 2003 13:13:13 EDT

I've enclosed below a story that I told the list six years ago that explains
a bit about why grids have to cascade their failures. Although it recounts a
New Mexico state-wide power failure, the number of people who were affected was
much smaller than last Thursday's blackout, only about 2 million. In this
instance, we (AICS Research) designed and built a controller for a generating
facility at Kennecott Copper. We were only responsible for the "grid" that was
Kennecott, but you can mentally scale up the entire process and substitute New
York City for Kennecott, if you wish. The source of the failure started outside
of Kennecott, but the system we designed was designed to react to exactly
that sort of situation -- and it did, but in the end, it had to kill the power on
the grid internal to Kennecott as well, just as each system on the East Coast
had to.

Electricity is the oddest of all products: it's consumed the instant it's
produced, and supply and demand must always be in near-perfect balance. If demand
exceeds supply, the rotational frequency of the generators (and thus the
frequency of the AC current on the line) begins to drop. But on the other hand, if
the load suddenly disappears, the generators immediately begin to overspeed.
Either condition, if abruptly applied, can do an enormous amount of harm,
primarily to the generating equipment and secondarily to the transmission
lines. Generator shafts will warp under the strains.
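
To put rough numbers on that balance, here is a minimal sketch using the
textbook swing-equation approximation, df/dt = f0 * (P_gen - P_load) /
(2 * H * S). The inertia constant H, the machine rating S, and the function
name are illustrative assumptions, not measurements from any real plant, and
governor action and load damping are ignored.

```python
def frequency_after(p_gen_mw, p_load_mw, seconds,
                    h_s=4.0, s_base_mva=60.0, f0_hz=60.0):
    """Estimate line frequency shortly after a sudden supply/demand imbalance.

    Linearized swing equation: df/dt = f0 * (P_gen - P_load) / (2 * H * S).
    h_s (inertia constant, seconds) and s_base_mva (machine rating) are
    assumed values chosen only for illustration.
    """
    rate_hz_per_s = f0_hz * (p_gen_mw - p_load_mw) / (2.0 * h_s * s_base_mva)
    return f0_hz + rate_hz_per_s * seconds


if __name__ == "__main__":
    # Demand exceeds supply: the frequency falls.
    print(frequency_after(15.0, 45.0, 1.0))
    # Load suddenly disappears: the generators overspeed.
    print(frequency_after(45.0, 0.0, 1.0))
```

Even with these made-up constants, a deficit of a few tens of megawatts moves
the frequency by several hertz per second, which is why protective systems
have only moments to act.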

The material below was excerpted from a much longer posting that appeared on
HP3000-L on June 9, 1997:

======================================

AICS Research is somewhat of a schizophrenic engineering company. The HP3000,
until recently, only represented a small part of our activities...

A second, primary part of our work, up until the fall of the Berlin Wall and
the end of the Cold War, was contract engineering. We were a standard (for
New Mexico) scientific prototype instrumentation engineering organization.
Because I worked for eleven years in nuclear weapons development and testing
work, AICS has always adamantly rejected any form of classified weapons
development contract since its founding in 1976. But we did take on a fair
number of contracts to develop elaborate instrumentation for NASA, USDA, Navy
Research Labs, and most especially, the Atmospheric Research Laboratory of
the Army Research Labs.

The reason that I mention this is that there have been any number of kids
that have worked here and who have all had to sit through my lengthy and
often repeated lectures on the value of building extremely well designed
power distribution systems before you build anything else. The
instrumentation that we tended to build was primarily composed of
time-synchronized, high-speed analog computers. You can get a lot more
behavior in a much smaller volume, with far less power consumption, in an
analog computer than you can in a digital one. But analog computers are
exceptionally sensitive to noise. Moreover, these computers were mounted
inside lasers, nephelometers, scanning infrared spectrophotometers, and all
sorts of other electronic instruments, and had to be indistinguishably
capable of being run off of battery-powered inverters, diesel generators, and
clean lab power. Secondly, they had to be capable of withstanding
high-explosives set off directly adjacent to them. In ten years of building
these kinds of systems, we had no failures (up to the point of catastrophic
failure) and no lost data.

In these situations, where you have no clear idea about grounding, voltage
levels, voltage regulation, spectral composition, or even line frequency, it
is fundamentally critical that you create an electrical environment that is
exceptionally stable and completely isolated from the rest of the world.

Ideally, a UPS is going to do something similar for you. But it's important
to understand, you aren't going to get much of this for a hundred dollars.

[As long as I'm going on, let me also tell a power regulation story that I'm
particularly proud of -- even though it also represents a system that we
designed and built and caused us (and HP) to come as close to getting sued as
we ever came -- and losing the company.

AICS not only designed small, highly specialized computer systems, we have
also designed large power controllers in our past -- for items such as
rapidly positioning large radio telescopes, solar power reflectors, and so
on. One of the systems we designed was an autoswitching, fault-intolerant
power regulating system for Kennecott Copper using two HP 21MX, RTE-M based
computers, coupled with a small, hand-crafted, extremely simple (=reliable)
computer that constantly interrogated the two 21MX's to be sure that they
were still operating. If it found the primary machine to be failed, it would
automatically switch all control over to the second and sound an alarm.
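
The supervisory idea can be sketched in a few lines. This is only an
illustration of the polling-and-failover pattern described above, not the
logic that ran on the hand-built monitor; the class name, the callback, and
the choice to stay on the second machine until someone resets it are all
assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Watchdog:
    """Hypothetical supervisor that polls the primary and fails over once."""
    primary_alive: Callable[[], bool]   # assumed health check for machine 1
    active: str = "primary"
    alarms: List[str] = field(default_factory=list)

    def poll(self) -> str:
        """Interrogate the primary; switch control and alarm if it has failed."""
        if self.active == "primary" and not self.primary_alive():
            self.active = "secondary"
            self.alarms.append("primary failed; control switched to secondary")
        # Assumption: control stays on the secondary until an operator
        # resets it, even if the primary starts answering again.
        return self.active


if __name__ == "__main__":
    health = {"primary": True}
    dog = Watchdog(lambda: health["primary"])
    print(dog.poll())
    health["primary"] = False
    print(dog.poll(), dog.alarms)
```

Keeping the monitor this small is the point: the less it does, the less there
is in it to fail.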

The purpose of this system was to buy as much power off of the commercial
grid as possible, simply because of its price. Although Kennecott had 60MW of
local power generation capability (half of it resident in one relatively new
steam turbine generator), it was much cheaper for Kennecott to buy all the
power possible from the grid and make up whatever difference was necessary
locally. Kennecott was given a diurnal table of their allowed power draw and
the acceptable power factor angle for every fifteen minute period during the
day. If they went over either condition (as an average) for the 15 minute
interval, they were charged an enormous penalty.

It was the responsibility of the system that we designed to get as close as
we could to drawing every bit of power as legitimately allowed -- but never,
ever allow the draw to go over.
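
The arithmetic behind that goal is simple to sketch. Assuming the limit is an
average over each 15-minute interval, the energy already drawn in the current
interval fixes how hard the grid can be drawn on for the remaining minutes.
The function name and the example figures are hypothetical, and the power
factor constraint is left out entirely.

```python
def max_grid_draw_mw(limit_mw, drawn_mwh, elapsed_min, interval_min=15.0):
    """Largest average grid draw allowed for the rest of the interval.

    limit_mw    -- allowed average draw for this interval (from the table)
    drawn_mwh   -- energy already taken from the grid in this interval
    elapsed_min -- minutes of the interval already used
    """
    remaining_h = (interval_min - elapsed_min) / 60.0
    if remaining_h <= 0:
        raise ValueError("interval is already over")
    budget_mwh = limit_mw * interval_min / 60.0
    # Never return a negative draw: once the budget is spent, buy nothing more.
    return max(0.0, (budget_mwh - drawn_mwh) / remaining_h)


if __name__ == "__main__":
    # Hypothetical case: 30 MW allowed, 2.0 MWh drawn in the first 5 minutes.
    allowed = max_grid_draw_mw(30.0, 2.0, 5.0)
    load_mw = 45.0
    print(allowed, max(0.0, load_mw - allowed))  # grid share, local share
```

Whatever the load needs beyond the returned figure has to come from local
generation.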

Kennecott operated an open-pit copper mine about 10 miles from its generating
facility. The blades of the enormous electric shovels that operate in the
open pit are large enough to comfortably park three large pickup trucks,
side-by-side, inside a blade. The shovels are DC driven. On board each
electric shovel is a three-phase AC/DC motor-generator. Each time a shovel's
blade strikes the ground, approx. 2 megawatts of power is electrically
transmitted to the shovel as it picks up about 2 million pounds of earth. At
any one time, several shovels could be operating. The whole operation is so
gargantuan that it teeters at the edge of incredible. It is the kind of
engineering that even amazes the engineers.

In addition to the power draws of the shovels, the ore that was returned to
the refinery was separated electrically, thus the average draw tended to be
about 45MW. Evening out all of that draw at the absolute cheapest cost was
our responsibility. If we did it correctly, Kennecott would save several
million dollars a year in power costs.

The system we put together was housed in a large ray-proof box on the
generating facility's floor (on a mezzanine, actually) for absolute EMI/RFI
protection against all of the electrical noise in the building. The
40-column, 10-line display mechanism was a then-new gas plasma display built
by Raytheon, which itself was sealed behind an oil-proof wire mesh, necessary
for electrical shielding. Input to the terminal was accomplished using a
membrane switch, mounted on the outside of the box, again with an oil-proof
covering.

The two 21MXs were powered by a triply redundant set of Topaz power
conditioners, which themselves were powered from (or charged) a large bank of
NiCad batteries in a separate, adjacent ray-proof box. Electric power was
supplied to the Topaz conditioners through a single ferroresonant isolation
transformer and RFI suppression system. Although the system drew its power
from right in the middle of a power generating facility, that power was never
to be trusted. The power system associated with the computers had to be
capable of sustaining independent operation, including operating all of the
actuators and annunciators for 24 hours.

Why were we so very close to getting sued (and losing AICS)? Two reasons. The
more (very) minor of the two was that the gas plasma displays built by
Raytheon repeatedly failed. The individual cells of the display leaked and
lost their pressure. That was unacceptably irritating -- and eventually we
had to cannibalize an HP2645 and build a CRT-based display in its stead.

But the far and away more major reason was that there was a severe error in
the scheduling table in RTE (HP's operating system at the time for HP1000s)
-- and it had always been there, but no one had ever found it before. And
that was the worst of all possible situations. Although I came to strongly
believe that the error was in the event scheduler, HP, quite reasonably, said
that there were thousands of these systems in operation and no one had ever
reported such a problem before.

But the damn 21MXs continuously failed unexpectedly once we put them into
operation -- without warning. Mining is a tough business and everyone was
already very much afraid for their jobs. Moreover, people in these industries
don't sit around a coffee table and talk in genteel fashions about esoteric
subjects. They scream at each other over a great deal of noise. What they do
is dangerous -- and extraordinarily expensive if they screw up. And people
were screaming at everyone involved in putting in this system. They were most
especially screaming at me. That was probably the hardest environment that
I've ever worked in while attempting to debug a significant problem.

Back in Las Cruces, we built all sorts of test rigs to try to simulate the
problems that we were seeing at Kennecott -- and ultimately, I wound up
sleeping on the floor next to the computer system for three weeks in order to
get enough data to deduce what the problem was. In the meantime, Kennecott
wanted back the quarter million dollars they had paid us as an advance, the
system we built taken out, and us gone. Things got so desperate that at one
time I called Bill Little, our local HP district manager (and a good friend)
at 2 in the morning and talked for an hour so that I could see what kind of
help we could get from HP.

All of the time that I was sleeping on the floor, I kept cursing myself that
I hadn't written an RTE-like operating system from scratch, using 8080
single-board computers, as we had done in most of our other similar projects.

Ultimately, I was able to create a three-line program that could fail the RTE
scheduler reliably in the same manner that we were seeing. An RTE expert in
the HP St. Paul/Minneapolis office (name unfortunately no longer remembered)
verified the problem almost immediately and got a fix back to us in just a
few days.

Even with all of this (what I considered heroic) effort, things were not
smoothed over at Kennecott. In their eyes, RTE's problems were my problems --
and if I was dumb enough to choose HP (a company they weren't familiar with),
then it was my fault. To a degree, I agreed with them. But I had absolutely
no way to pay them back a quarter million dollars, either -- especially since
our out-of-pocket expenses for this project were approaching $400,000.
Finding the RTE scheduler error (just one error in one line of code) saved
the company, literally.

But the story isn't over yet. Indeed, the most movie-like event that I've
ever experienced in real life occurred a week or so later. When I absolutely
certified to Kennecott (and to myself) my total satisfaction that the system
was operating reliably, Kennecott had several engineering officers fly down
from Salt Lake to observe the switchover to the new system. They were fully
aware of all of the problems that we'd had and they were more than aware that
we were now months late in getting this system on-line.

With absolutely minimum fanfare, I pushed one button and the system was
on-line. After about 30 minutes of observing the quality of its control, the
group broke up and I called Valerie, my wife and co-founder of AICS, and told
her that it looked like we didn't owe Kennecott a quarter million dollars
after all and that we could probably soon bill them for the remaining
$200,000 they owed us. Just then, all hell broke loose. An incredibly large bang
occurred in the building, which Valerie heard over the phone. She said that
power went out at exactly the same time and asked, with her voice shaking,
"Did we do this?"

I didn't know -- because things got very much worse very quickly. We turned
over responsibility to the new system mid-day, precisely the time that
Kennecott was allowed to purchase its maximum power draw off of the grid --
but the 21MXs just dropped the connection to the grid. That extremely violent
bang was the central breakers in the high-voltage park just outside of the
building tripping.

All of a sudden, the Kennecott power generation facility was faced with an
approx. 30MW power deficit, falling frequency, and an abysmal power factor,
so the 21MXs commanded the only two turbines that were on line to
dramatically rev up their power production. Up until that point, I never knew
that it was possible for a 30MW generator to extinguish the fire in one of
its boilers because of the instantaneous power draw, but that happened almost
immediately. But, before things got wildly out of hand, the 21MXs
disconnected the local generators from the internal grid. If things had gone
on for another few seconds the way they were, the shaft on the main turbine
would have warped and caused millions of dollars of damage.

As soon as the local generators dropped their load they immediately began to
overspin, thus the 21MXs popped the pressure relief valves on the boilers.
Now we were sitting in a large, darkened building with a noise that sounded
like someone just lit all four engines of a 747 in the room. Sirens were
going off. Clangers were going off. All of the annunciator alarms had
tripped. And everyone was looking at me -- while I was still on the phone
with Valerie. I was only about 30 years old at the time -- and I thought that
AICS was now down perhaps a million dollars in damage. I should have been
more scared than I was.

One of the more reasonable Kennecott engineers, name forgotten, and Hugh
Gardner, my partner in this design and a superior engineer, who now works for
NASA, and I spent the rest of the night analyzing what had happened while
Kennecott brought the power system back up by hand.

But what happened was exactly what was supposed to happen. About 100 miles
away, midway between Las Cruces and Kennecott, El Paso Electric was putting
into place a new high-voltage interstate interconnect tie line. They were in
the process of stringing a section of the new line over an existing, fully
operational high-voltage tie line when they dropped one of the wires,
instantly shorting out the line and causing a power failure that took out
most of New Mexico, El Paso, Texas, and a good part of eastern Arizona.

The 21MXs immediately recognized that the power flow had reversed and was now
going out of Kennecott, onto the grid, and instantly popped the primary
relays, exactly as they had been programmed to do. It was just bad luck that
this happened at a time when Kennecott's local capacity was only 1/3 that of
current draw. The computers first immediately tried to make up the difference
-- and then decided that the generators were in danger, so they dropped them
off of the line and worked to bring them to a gentle idle.
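
Laid out as a decision sequence, that response looks roughly like the sketch
below. It is a reconstruction for illustration only, not the 21MX code: the
function name, the sign convention, and especially the test used for
"generators in danger" (load exceeding local capacity) are assumptions.

```python
def protection_step(tie_flow_mw, tie_closed, load_mw, local_capacity_mw):
    """Return the ordered list of actions for one pass of the protection logic.

    tie_flow_mw > 0 means power is imported from the grid; < 0 means the
    flow has reversed and power is leaving the plant (assumed convention).
    """
    actions = []
    if tie_closed and tie_flow_mw < 0:
        # Reversed flow: the outside grid has become a load. Cut it loose.
        actions.append("trip tie breaker")
        tie_closed = False
    if not tie_closed:
        if load_mw <= local_capacity_mw:
            actions.append("raise local generation to carry load")
        else:
            # Assumed danger test: the islanded load exceeds what the
            # local machines can supply, so protect the generators.
            actions.append("raise local generation to maximum")
            actions.append("disconnect generators from internal grid")
            actions.append("open relief valves and bring turbines to idle")
    return actions


if __name__ == "__main__":
    # Roughly the situation in the story: 45MW of load, 15MW of local capacity.
    for action in protection_step(-5.0, True, 45.0, 15.0):
        print(action)
```

With the load inside local capacity, the same logic would simply island the
plant and carry on.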

Everything worked perfectly well, exactly as designed -- but it took us a day
to absolutely certify that, too.

If anything, this story does strongly reinforce the value of placing the
computers in their own electrical universe -- unperturbable, isolated, and
stable -- because none of that was true otherwise for those few minutes. In
fact, it is hardly ever true. But that moral was no different for the tiny
analog computers we also built inside of quantum-tuned lasers.

It also profoundly reinforces the conclusion that you're better off writing
your own operating systems from scratch :-)]

=======================================

Wirt Atmar

