HP3000-L Archives

June 1995, Week 4

HP3000-L@RAVEN.UTC.EDU

Subject:
From: Jeff Kell
Reply-To: Jeff Kell
Date: Wed, 21 Jun 1995 17:09:13 EDT
Content-Type: text/plain

Due to popular demand, I'll go ahead and post this on hp3000-l as well as
the new list.  If the hp3000-l folks don't mind this thread, we'll let
the hp-posix list die quietly.  I'm not sure where to suggest that you
send follow-ups (presumably to hp-posix).
 
Several people (myself included) are experiencing problems with httpd.
Chris Bartram and Stan Sieler have both reported CPU-bound "hangs" of
their servers.  I am having fork(), exec(), and malloc() problems.  In
the process of tracing down my troubles, I cleaned up a few things but
still haven't really nailed down the problem.  A little history...
 
In its initial distribution, httpd (from Mike Belshe) was supplied in
two basic forms:  the source distribution (make it yourself) and the
full distribution with object code, directory structure, and sample
files.  We were using the latter, and until recently I had never even
compiled the code (fear of the unknown :-) ).
 
Meanwhile, I had also retrieved some things that Steve Elmer had ported,
particularly gopher.  His ports were based on a common "libbsd" package
that he had assembled (the beginnings of which were shipped in-line with
the httpd package).  I was able to "make" gopher and some other packages
with success.  He has since expanded the original libbsd distribution in
a more comprehensive package.
 
All of the early experiments were with 5.0 Pull, which had a host of other
quirks as well to deal with (c89, ar, make) that are outlined very well in
Steve Elmer's "Porting" paper (http://jazz.external.hp.com/src.bsd/porting).
Essentially several changes are required to the standard shell to get
things to work.  Most (if not all) of these are irrelevant on 5.0 Push,
at least after the PowerPatch -- I had done all of my Posix "playing" on
our library system and had to make the changes.  After 5.0 Push+Patches
the supplied shell files work properly (well, let's just say they work :-)
 
I then set out to build a "clean" porting environment on my previously
untouched administrative system.  In the process I was able to remove the
"in-line" library code from httpd and substitute Steve's latest offering.
If Steve/Mike don't mind, I can package this and offer it via FTP for
anyone that wants it.  If you want to "roll your own" here's a capsule
summary:
 
(1) Extract libbsd.R from jazz into /usr/include/bsd and /usr/lib/libbsd.a
(2) Change /usr/include/bsd/strings.h to grab <string.h> instead of the
    existing /bsd/include/string.h
(3) chmod 444 /usr/include/bsd/* (and subdirs); chmod 555 /usr/lib/libbsd.a
(4) Copy old httpd.../include/arpa/inet.h to /usr/include/bsd/arpa/inet.h
(5) Tweak the Makefile (here's a diff):
 
   31,32c31,32
   < -DNO_KILLPG -I./include -I/usr/include
   < EXTRA_LIBS= -s ../../lib/libbsd.a -s /usr/lib/libsocket.a
   ---
   > -DNO_KILLPG -I/usr/include/bsd -I/usr/include
   > EXTRA_LIBS= -s /usr/lib/libbsd.a -s /usr/lib/libsocket.a
 
Then "make".  You'll get a few warnings about a redefined macro, but
otherwise it's clean.
 
This has nothing to do with the problems though, as the old and new
object code exhibit the same errors.  Speaking of which...
 
On 5.0 Pull the server worked fine except for an occasional "idle hang"
where it would refuse connections.  Abortjob was ineffective, and you
had to shut the system down to clear it out.  I presume this to be the
error that Mike Belshe created the "httpd_1.3-p1" patch to fix.  The
httpd patch alone did not fix the problem, but in Mike's notes about
installing the patch he suggests that you obtain HP patch NSTDDP1.
 
I called the RC about this patch and was told it didn't exist, but
there was a patch NSTDDT1 and co-requisite NMSDDT3.  After installing
these, the hang went away.  Then the seemingly "random" errors started
popping up as I mentioned previously on the list.  Specifically, when
trying to invoke a cgi-bin script, you get the following abort:
 
**** Data memory protection trap (TRAPS 68).
 ABORT: SH.HPBIN.SYS
        PC=551.000428ac malloc+$32c
 NM* 0) SP=418498f8 RP=551.000427ac malloc+$22c
 NM  1) SP=41849878 RP=551.00023424 clearenv+$4c
 NM  2) SP=418497f8 RP=551.000236a8 mkenviron+$20
 NM  3) SP=418497b8 RP=551.0001e768 e_cmd+$30c
 NM  4) SP=41849738 RP=551.0001d14c execute+$2f8
 NM  5) SP=418495f8 RP=551.0001d208 execute+$3b4
 NM  6) SP=418494b8 RP=551.0001d3f4 execute+$5a0
 NM  7) SP=41849378 RP=551.0001d208 execute+$3b4
 NM  8) SP=41849238 RP=551.0001d3f4 execute+$5a0
 NM  9) SP=418490f8 RP=551.0001a18c shell+$1f4
 NM  a) SP=41848fb8 RP=551.00019ccc mks_main+$9c0
 NM  b) SP=41848f38 RP=551.0003f7ec main+$80
 NM  c) SP=41848e38 RP=551.00046668 _start+$bc
 NM  d) SP=41848d38 RP=551.000192ec $START$+$1c
 NM  e) SP=41847bb8 RP=551.00000000
      (end of NM stack)
 
Another, more common abort was the server receiving a SIGBUS while in
malloc(); the signal handler then trapped back into itself while trying to
log the abort to error_log.  I removed the SIGBUS trap in hopes of getting
more meaningful trace/error messages, but the new error is:
 
  **** Data memory protection trap (TRAPS 68).
 
  ABORT: HTTPD.BIN.WWW
 
         PC=674.000295d4 malloc+$32c
  NM* 0) SP=41841870 RP=674.000294d4 malloc+$22c
  NM  1) SP=418417f0 RP=674.0001ed04 get_mime_headers+$1f8
  NM  2) SP=41841770 RP=674.00019274 process_request+$110
  NM  3) SP=418414f0 RP=674.00018ac4 standalone_main+$218
  NM  4) SP=41837470 RP=674.00018c78 main+$15c
  NM  5) SP=418373b0 RP=674.0002dc2c _start+$bc
  NM  6) SP=41837330 RP=674.00015bcc $START$+$1c
  NM  7) SP=418361b0 RP=674.00000000
       (end of NM stack)
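
On the SIGBUS handler that trapped back into itself: one way to keep a
fault handler from re-entering is to install it one-shot and to do nothing
inside it that can allocate.  The sketch below assumes a POSIX sigaction()
with SA_RESETHAND, which may well not match what MPE/iX 5.0 provides;
install_oneshot_handler() and on_fault() are made-up names, not httpd's.

```c
#define _POSIX_C_SOURCE 200809L   /* ask for POSIX declarations */
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Set by the handler so the main line can see that a fault was reported. */
volatile sig_atomic_t fault_seen = 0;

/* Hypothetical handler: uses only write(), which is async-signal-safe,
 * so it never calls malloc() or stdio while the heap may be damaged. */
void on_fault(int sig)
{
    static const char msg[] = "httpd: caught fatal signal\n";

    (void)sig;
    fault_seen = 1;
    if (write(STDERR_FILENO, msg, sizeof msg - 1) < 0) {
        /* nothing useful can be done about a failed write here */
    }
}

/* Install on_fault() one-shot: SA_RESETHAND restores the default action
 * on entry, so a second fault terminates the process instead of
 * trapping back into the handler.  Returns 0 on success, -1 on error. */
int install_oneshot_handler(int sig)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_fault;
    sa.sa_flags = SA_RESETHAND;
    sigemptyset(&sa.sa_mask);
    return sigaction(sig, &sa, NULL);
}
```

With a real SIGBUS, returning from the handler re-runs the faulting
instruction; since the default action is back in place by then, the
process simply aborts with its normal trace.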
 
Both are aborts from the same location in malloc(), and the first one
matches a known SR 4701-275941 relating the problem to low disc space.
But I didn't have low disc space -- millions of free sectors.  There
was a beta patch MPEHXB5 to address this.  I've tried it (twice) and
the only difference is that with the patch, httpd will log an error
message from this code fragment in httpd.c:
 
        /* we do this here so that logs can be opened as root */
        if((pid = fork()) == -1)
            log_error("unable to fork new process");
 
It will log the error ONCE, then promptly fail to fork on any subsequent
attempts to connect (browser gets a null page).  After changing this code
fragment to log errno as well, we get error 11, EAGAIN, "Resource busy,
try again".
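
The errno-logging change mentioned above might look roughly like the
following.  This is a minimal sketch rather than the actual httpd.c edit:
format_fork_error() and fork_or_report() are hypothetical names, and plain
stderr stands in for httpd's log_error().

```c
#define _POSIX_C_SOURCE 200809L   /* ask for POSIX declarations */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not part of httpd): build a fork-failure message
 * that carries both the errno value and its text. */
int format_fork_error(char *buf, size_t len, int err)
{
    return snprintf(buf, len, "unable to fork new process: errno %d (%s)",
                    err, strerror(err));
}

/* Hypothetical wrapper: fork(), and on failure report errno on stderr
 * (a stand-in for httpd's log_error()).  Returns fork()'s result. */
pid_t fork_or_report(void)
{
    pid_t pid = fork();

    if (pid == -1) {
        int saved = errno;      /* save before another call can change it */
        char msg[256];

        format_fork_error(msg, sizeof msg, saved);
        fprintf(stderr, "%s\n", msg);
        errno = saved;          /* leave errno intact for the caller */
    }
    return pid;
}
```

Saving errno first matters because the logging calls themselves may
overwrite it before it gets printed.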
 
On a whim I did a volutil: contigvol on ldev 1 to be sure to have high
levels of contiguous transient space available, but it appeared to have
little if any effect (there are nearly a million contiguous sectors there).
I then did a contigvol on the other system volumes and voila, the errors
went away.  This morning, I had a few errors logged, and found that ldev 2
had a largest contiguous space of just over 8K sectors.  Another run of
contigvol and no more problems.
 
So why won't the system allocate transient space on ldev 1?  I recall an
earlier post on hp3000-l about permanent file allocation tending to
avoid ldev 1, but transient space?  And why are only malloc() and/or fork()
having trouble, while everything else runs fine?  Stan suggested it
might be a memory leak, but the error can occur when you first start the
server just as easily as it can two days later, and seems bound to the
lower limit of contiguous free space on the system volumes.
 
That question is in the response center as of this morning.  For now I'm
running without the patch (I still get errors, but the server doesn't die
altogether like it does with the patch).
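
To check whether malloc() and fork() also fail outside of httpd, a tiny
standalone probe might help narrow things down.  This is only a sketch
under assumptions: probe_malloc() and probe_fork() are invented names, and
the code is untested on MPE/iX.

```c
#define _POSIX_C_SOURCE 200809L   /* ask for POSIX declarations */
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical probe: try one malloc() of the given size.
 * Returns 0 on success, else errno (ENOMEM if errno was left unset). */
int probe_malloc(size_t bytes)
{
    void *p;

    errno = 0;
    p = malloc(bytes);
    if (p == NULL)
        return errno ? errno : ENOMEM;
    free(p);
    return 0;
}

/* Hypothetical probe: try one fork(); the child exits at once and the
 * parent reaps it.  Returns 0 on success, else the errno from fork(). */
int probe_fork(void)
{
    pid_t pid = fork();

    if (pid == -1)
        return errno;
    if (pid == 0)
        _exit(0);                   /* child: nothing to do */
    (void)waitpid(pid, NULL, 0);    /* parent: avoid leaving a zombie */
    return 0;
}
```

Running both probes right after a contigvol and again once errors start
showing up would indicate whether the failures track free space or httpd.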
 
Sorry if I rambled too long...  Anyone else have httpd war stories?
 
[\] Jeff
