HP3000-L Archives

January 1998, Week 4

HP3000-L@RAVEN.UTC.EDU

From: Stan Sieler <[log in to unmask]>
Reply-To: Stan Sieler <[log in to unmask]>
Date: Thu, 22 Jan 1998 14:10:31 -0800

Wirt writes:
> Bill's comments rekindle one of my primary disgruntlements with simple
> performance analyses. We manufacture a report writer, QueryCalc, in which we
> do everything we can to drive CPU utilization to 100%. In fact, if we were to
> do anything else, we would be derelict in our duty. One hundred per cent CPU
> utilization is not a sin in a report writer, it is a profound virtue.

I beg to differ.  To paraphrase Bill, "it depends".

If your process is using 100% of the CPU, and nearly 100% of the
theoretical maximum number of I/Os per second, you might at first
glance say "this is good".  But let's probe deeper and take a shot
at understanding why the simple rule of "100% CPU usage is good" doesn't
hold.

Let's take the above scenario ... we're driving the machine to the max,
and we're happy.  But ... the *OTHER* users aren't happy, are they?

Ok...let's discuss single-job batch performance.  We're still happy, right?
Not necessarily!  If half the I/Os are needless for some reason, and
if half of the CPU usage is wasted somehow, then we can easily see how
performance tuning could potentially cut the elapsed time in half.
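
As a toy illustration of the CPU half of that claim, here's a contrived
Python sketch (nothing to do with MPE or any real report writer): both
runs keep a CPU core pegged at roughly 100%, but one of them wastes half
of its cycles and so takes about twice as long to produce the same answer.

    import time

    def toy_report(passes):
        """Compute the same total 'passes' times; every pass after the
        first is pure wasted CPU."""
        total = 0
        for _ in range(passes):
            total = sum(range(10_000_000))
        return total

    for label, passes in [("wasteful (2 passes)", 2), ("tuned (1 pass)", 1)]:
        start = time.perf_counter()
        toy_report(passes)
        elapsed = time.perf_counter() - start
        print(f"{label}: {elapsed:.2f} s elapsed; CPU ~100% busy either way")

A CPU-utilization figure alone can't tell the wasted pass from the useful one.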

*That's* why performance analysis is important.

I've found it very difficult to do enough "work" to keep a CPU busy
during the processing of large amounts of data from disk.  The imbalance
between the CPU speed and the I/O rate & speed is just too large.

> quiet. Reading information out of main memory is 100,000 times faster than

One of the problems with the simple "100% CPU is good" rule is that we don't
know what the I/O utilization is, we don't know whether we could benefit
from a faster CPU, and, critically, we don't know whether we could be doing
the same work (with the same "wait for disk" time) using less CPU.

I.e., if my single-threaded batch job (on a single-user machine) takes
60 minutes of elapsed time, of which 30 are CPU minutes, then we pretty
well know that we were waiting for something for a total of 30 minutes
(or 50% of the time).  Since we said "single user", it's probably not
various interlocks (DBPUT semaphore, etc.); it's probably wait-for-disk
(generally due to a read request).

Now, if we take that same process (30 minutes CPU) and re-write it in,
say, CM Pascal, and recompile it ... we might see that the batch job
now takes 60 minutes of CPU out of 70 to 90 minutes of elapsed time ...
we're closer to 100% CPU usage, but is that better?  NO!!!!!!!!!
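
To put numbers on those two hypothetical runs, here's a small Python
sketch (the minutes are the assumed figures from the paragraphs above,
with 80 minutes taken as a mid-range elapsed time for the slower
version; nothing here is a measurement):

    def summarize(label, cpu_min, elapsed_min):
        """Report CPU-busy percentage and wait time for one hypothetical run."""
        wait_min = elapsed_min - cpu_min
        busy_pct = 100.0 * cpu_min / elapsed_min
        print(f"{label:15}: {elapsed_min} min elapsed, {cpu_min} min CPU "
              f"({busy_pct:.0f}% busy), {wait_min} min waiting")

    summarize("original code", 30, 60)    # 50% busy
    summarize("slower rewrite", 60, 80)   # 75% busy -- "better" utilization,
                                          # yet the job finishes 20 minutes later

By the "100% CPU" yardstick the second run looks healthier, but the user
waits 20 minutes longer for the same answer.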

In short, there are two reasons you could be using 100% CPU:

   1) you've got the best possible code in the world, and your CPU
      usage just about exactly matches the prefetch performance
      of MPE (for file access).  I.e., you're taking X+k milliseconds
      to process each record, and each record takes X milliseconds to be
      read (or written).  The question is: what is "k"?  (See the next section.)

   2) you don't have the best possible code in the world, and your
      CPU usage could be cut down, providing more machine resources for
      other people *and* probably cutting your elapsed time too.

"k"...
In many cases, people fail to do the arithmetic to determine if they
*could* benefit from a faster CPU or a faster I/O mechanism.
Assuming you have optimal code, let's look at a case where you're
handling 10,000 records in 100 elapsed seconds (and 100 CPU seconds).

Each record takes about 10 milliseconds to "handle".  That's fine.  But,
how long did it take to get the record from disk?  We don't know, because
we don't know what percentage of the 10 milliseconds was used by the
memory manager (resolving page faults and prefetch requests) or by
the file system (issuing a prefetch, transferring data).  All we know is that
the prefetch logic makes it look like the records are read very quickly.

Let's assume it takes 30 milliseconds to fetch 64KB from disk, and that
every record is 64 bytes (1024 records per 64 KB chunk, a typical size
prefetched by the file system).  So, each 64 KB block from disk ought to
take about 10240 milliseconds to handle (10 millisecs/rec * 1024),
which means the file system would have had to prefetch it about  *TEN*
seconds before the block was going to be needed.

but...

the file system doesn't prefetch that far ahead, time-wise.

I.e., this would have required that 10 seconds ago, the file system
said "you're about to read the record at byte offset QQQ ... I'll
initiate a no-wait prefetch for a 64 KB chunk at QQQ + 64 KB, which
needs to be ready about 10 seconds from now".

But, it takes far less than 10 seconds for the block to come in ...
say, 30 milliseconds (remember?!).  This means that the I/O rate
is *WAY* below what the maximum is ... which, in turn, implies that
a reduction in CPU per record would pay off in direct elapsed time savings.
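
Here's that arithmetic as a small Python sketch, using the assumed
numbers above (10 ms of CPU per record, 64-byte records, 64 KB prefetch
chunks, 30 ms per physical read); it's a back-of-envelope model, not a
measurement of any real file system:

    cpu_ms_per_record = 10          # ~100 CPU seconds / 10,000 records
    record_bytes      = 64
    chunk_bytes       = 64 * 1024   # typical file-system prefetch size
    fetch_ms          = 30          # assumed time to read one 64 KB chunk

    records_per_chunk = chunk_bytes // record_bytes       # 1024
    consume_ms = cpu_ms_per_record * records_per_chunk    # 10,240 ms, ~10 s per chunk

    print(f"time to chew through one chunk: {consume_ms / 1000:6.2f} s")
    print(f"time to fetch one chunk:        {fetch_ms / 1000:6.3f} s")
    print(f"disk-busy fraction:             {fetch_ms / consume_ms:.1%}")
    # ~0.3%: the disk sits idle almost all of the time, so any reduction
    # in CPU per record shows up almost directly as less elapsed time.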

> reading it from a disc drive. More than that, when you read something off of a
> disc, you're placing the disc heads where somebody else (even your own

Remember that "somebody else" tends to imply multi-user, which reinforces
the concept that you want to use CPU and I/O efficiently to be a
good "computer citizen".

> machine we have all to ourselves, if we don't drive CPU utilization to 100%,
> it means nothing more complicated than the CPU is sitting idle while we're
> sitting and waiting for information to be retrieved from the various discs --
> and that's our fault. We're not getting the data off of the discs in the most
> efficient manner.

Ah...but we *may be* getting the data off in the most efficient manner ...
we don't know from this data point!  If we are, it may turn out that
CPU usage reduction translates to elapsed time reduction.

Tests that one can do to see whether the CPU or the I/O is the bottleneck
(a rough timing-harness sketch follows after these three tests):

   1) prefetch a big test file (but not bigger than, say, 1/2 of
      physical memory).
      time your program.

      knock the test file out of memory.
      time your program.

      Although CPU usage will probably rise slightly, the elapsed time
      should rise a lot (how much? see (2) below).
      If it doesn't, then you know you're using a lot of CPU, and
      aren't blocked by I/O all of the time.

   2) modify the program to use a canned input source, and $NULL as the
      output.  (Later, try replacing the output writes with empty calls,
      i.e., no disk I/O at all.)
      time your program.

      This should give you an idea of how much CPU time was used per record
      by your program.  (Note: the real amount of CPU time needed will be
      slightly larger, due to the CPU used by file system calls and
      memory manager activity on your behalf.)

      This test should give you a handle on how much CPU *you* use per
      record (excluding file system, etc.)

      You can compare this value to the per-record CPU usage of a normal
      run to get an idea of the average amount of file system (and other
      MPE) overhead you're using.

   3) modify your code to not "process" the input records, and time
      a sample run (still reading the input records, and
      writing a dummy output record).

      time your program.

      This should give you an idea of how much CPU is used by
      various MPE items, and how much elapsed time is spent waiting for disk.
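
For what it's worth, here's a rough, generic sketch of that style of
test harness in Python.  The file name and the per-record "work" are
hypothetical placeholders, and on MPE you'd use file equations (and
$NULL) rather than os.devnull; the point is only the technique of
comparing CPU time against elapsed time across stripped-down runs.

    import os
    import time

    DATA_FILE = "testdata.dat"   # hypothetical large input file
    RECORD_BYTES = 64

    def handle(record: bytes) -> bytes:
        """Stand-in for the real per-record processing."""
        return record.upper()

    def timed_run(label, do_work=True):
        """Read every record, optionally process it, write to a bit bucket,
        and report CPU time vs. elapsed time for the whole pass."""
        wall0, cpu0 = time.perf_counter(), time.process_time()
        with open(DATA_FILE, "rb") as inp, open(os.devnull, "wb") as out:
            for record in iter(lambda: inp.read(RECORD_BYTES), b""):
                out.write(handle(record) if do_work else record)
        wall = time.perf_counter() - wall0
        cpu = time.process_time() - cpu0
        print(f"{label:28s} elapsed {wall:7.2f}s  "
              f"CPU {cpu:7.2f}s  waiting {wall - cpu:7.2f}s")

    # Test (1): run twice -- the second pass should find much of the file
    # already in memory, so the drop in elapsed time approximates the
    # cost of the physical reads.
    timed_run("cold file, full processing")
    timed_run("warm file, full processing")

    # Test (3): same reads and writes, but no per-record processing; what
    # remains is roughly OS/file-system CPU plus time waiting for disk.
    # (Test (2) goes further still: feed canned in-memory records instead
    # of reading DATA_FILE at all.)
    timed_run("warm file, no processing", do_work=False)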

Note: these aren't easy tests, and it's easy to get misled when analyzing
the data.  I remember one case where a program that did:

    read
    work a little
    write

took *LESS* time to run than the same program modified to do:

    read
    write

Our guess is that subsequent FREADs were getting requested before the
prior prefetches were being completed, and at such a point that the
file system took longer to block our process (until the read was
complete) than it would have if we'd waited a little before doing the
fread.  Highly unusual, but computer science has a lot of examples
of unusual corner cases fouling people up (the classic one is where
adding extra memory to a machine might slow it down, because of the
impact on page-fault handling).


One final note...I remember going to a user site where they were
writing 30 records per second to disk.  They were happy with their
program, but wanted to test the (then new) Kelly RAMDISC to see how
much the program would speed up.

We put in the RAMDISC, did a file equate to direct the output to a file
in a group on the RAMDISC, and saw the program writing ...  30 records
per second.

Suspicious, I did a file equate to direct the output to $NULL...
the program still wrote 30 records per second.

In all three cases, it was following Wirt's rule of "use 100% of the
CPU" ... and it was a report writer (no, not Wirt's excellent QueryCalc,
which I recommend).

I'd modify Wirt's statement to be:

   On a single-user machine, you want to design a reasonably efficient
   program, one that uses CPU and I/O in a balanced manner.  If necessary,
   err on the side of using a little more CPU, because processors
   get faster more quickly than I/O does.  If you're not hitting
   the theoretical max number of I/Os per second, investigate it ...
   regardless of CPU usage.  If you're hitting the theoretical max
   number of I/Os per second, and using 100% of the CPU ... don't
   congratulate yourself too much ... if you reduced that CPU usage, maybe
   something else could creep in from time to time.
   Or, if you re-think things, perhaps you can do fewer I/Os and still get
   the job done faster.

--
Stan Sieler                                          [log in to unmask]
                                     http://www.allegro.com/sieler.html
