HP3000-L Archives

December 2006, Week 5

HP3000-L@RAVEN.UTC.EDU

From: Mark Wonsil <[log in to unmask]>
Reply To: Mark Wonsil <[log in to unmask]>
Date: Sat, 30 Dec 2006 09:44:35 -0500
Content-Type: text/plain

Hi John,
> The data consists of snapshots of web pages (made with wget).  The
> snapshots are captured every 15 minutes by a PHP script running as a "run
> at boot" task on Windows XP Pro. Most of the files are small (0k to 10k
> bytes) but there are millions of them in the course of a month.
> 
> The directory structure is:
> 
> Page_name\yyyy\mm\dd\Page_name_yyyy-mm-dd_HH-MM\files
...
> Each day adds about 12,200 files per web page (on average) and about 132 MB
> of used storage (on average).  Thus, there about 378,200 files per web page
> per month, utilizing about 4.092 GB of storage.  Space wise, the files
> should fit.  In reality, I can't make it work.  I can get about 9 to 10
> days of files for a single web page on a DVD at best, and that is if I'm
> willing to let Roxio run overnight to do the burning (it takes HOURS to
> burn those 10 days of data while the system just thrashes).

In production and inventory control, we learn that there are two kinds of lead
time: internal and external. You have no control over external lead time
(vendor deliveries, time to bake, etc.), but internal lead time is something
you can control, and there's a LOT you could squeeze out of your current
process.

When Roxio (or other burning software) creates a disc, it has to copy each
and every file into the burn image first and then burn that image to the disc.
This is what is thrashing your system: you are choosing to wait and add nearly
380K files to the "project" all at once.

In the Theory of Constraints [1], one looks for the bottleneck in the process,
and creating your burn image is clearly your bottleneck. So what can you do
about it? You have 15 minutes between wget calls, and I would guess that for
ten of those minutes the PC is doing nothing. Why not use this time to
preprocess the files?

The first thought would be to see if Roxio has a command-line option that
would allow you to add the newly created folder to your data project. If it
does, add your folder to the data project immediately after the wget
completes. This eliminates a lot of the directory traversal you're doing at
burn time.

However, I don't know whether Roxio actually adds files to the burn image at
that time. If not, you may be able to build an ISO image on disk incrementally
instead. You can check the size of this image and, when it reaches a certain
size, copy it to another folder for burning.
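The size check itself is easy to script. Here is a minimal sketch of that
hand-off step, assuming a single-layer DVD; the file names, folder names, and
4.3 GB threshold are all illustrative, and the demo fabricates a tiny stand-in
image rather than building a real one with mkisofs:

```shell
#!/bin/sh
# Move a growing disc image to a burn folder once it nears DVD capacity.
# All names and the threshold below are assumptions for illustration.

IMAGE="snapshots.iso"        # image grown between wget runs (stand-in here)
BURN_DIR="ready_to_burn"     # folder your burning software picks up from
LIMIT=4300000000             # ~4.3 GB, safely under a 4.7 GB single-layer DVD

# demo setup: fabricate a small stand-in image so the sketch runs as-is
dd if=/dev/zero of="$IMAGE" bs=1024 count=10 2>/dev/null
mkdir -p "$BURN_DIR"

size_bytes=$(wc -c < "$IMAGE")

if [ "$size_bytes" -ge "$LIMIT" ]; then
    # timestamp the hand-off so successive images don't collide
    mv "$IMAGE" "$BURN_DIR/snapshots_$(date +%Y-%m-%d_%H-%M).iso"
    echo "handed off for burning"
else
    echo "still growing: ${size_bytes} bytes"
fi
```

Run from the scheduled task right after each wget; most runs fall through to
the "still growing" branch and cost next to nothing.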

I would also take Bruno's advice and compress your files: HTML is highly
compressible. Instead of adding files to the data project after each wget, you
could use a compressing archiver (tar -Z, WinZip, etc.) to add the new folder
to a daily archive and then burn the 30 compressed archive files in relatively
short order.
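As a sketch of that approach, each wget run appends its folder to a per-day
tar archive, and the archive is compressed once at end of day (the paths and
names here are assumptions, gzip stands in for the older compress that tar -Z
invokes, and the demo fabricates a tiny snapshot folder so it runs as-is):

```shell
#!/bin/sh
# After each wget run, append the new snapshot folder to a per-day tar
# archive; compress it once when the day is over. Names are illustrative.

SNAPSHOT_DIR="Page_name_demo"             # folder the latest wget produced
ARCHIVE="snapshots_$(date +%Y-%m-%d).tar" # one archive per day

# demo setup: fabricate a tiny snapshot so the sketch is self-contained
mkdir -p "$SNAPSHOT_DIR"
echo "<html><body>demo page</body></html>" > "$SNAPSHOT_DIR/index.html"

# -r appends to the archive (creating it on first use), so each 15-minute
# run only touches the new files instead of re-reading the whole day
tar -rf "$ARCHIVE" "$SNAPSHOT_DIR"

# end-of-day step: compress the finished archive (HTML shrinks well)
gzip -f "$ARCHIVE"
```

Burning 30 such archives is a handful of large sequential files instead of
hundreds of thousands of tiny ones, which is exactly what burners like.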

HTH,

Mark W.

1. http://en.wikipedia.org/wiki/Theory_of_constraints

* To join/leave the list, search archives, change list settings, *
* etc., please visit http://raven.utc.edu/archives/hp3000-l.html *
