RE: [XML-Binary] ZIP file format using XPATH for directory entries proposal

Hey, that's really interesting.  Learn something every day. Sounds more
workable than my initial impression of what you were proposing.
Basically, however, I'll let the experts in XBC evaluate it -- I just
wanted to comment from the point of view of that use case and make sure
that our environment is clearly understood.

Yes, at least one of the problems with directories containing huge
numbers of files was ls.  You might say, "Well, don't do an ls", but
that's really not very workable in practice.  I don't recall if there
were other problems.  This was a LONG time ago.

I am, however, mercifully out of this business now -- we just retired
the last system I built with mega-numbers of small files.  It was
essentially a hack, devised early in the history of the Web ('95 or so),
to get around the lack of tools and techniques which have since appeared.  I
made huge directories of little html files (including certain metadata,
essentially in XML but this was more or less pre-XML -- I was thinking
SGML if I was thinking at all) and put a metadata-aware fulltext search
engine on top of them.  Unfortunately, the systems I built this way were
rather successful and their functionality was a bit hard to duplicate
with more modern systems at a reasonable cost, so it was hard to get rid
of them.

-----Original Message-----
From: Fred P. [mailto:fprog26@hotmail.com] 
Sent: Friday, February 18, 2005 1:37 PM
To: Cutler, Roger (RogerCutler); public-xml-binary@w3.org
Subject: RE: [XML-Binary] ZIP file format using XPATH for directory
entries proposal


Hi Mr. Cutler,

First of all, let me say that I'm not an expert in the seisdata field in
any way, and I don't wish to claim to be one either.

Sorry if it looked that way; that was not my intent in any case.

I'm just saying that your kind of dataset, among others that I use more
frequently, inspired me to write this proposal email.

>We have not found the performance of any compression algorithm adequate
>for our usage cases.  We have extensive experience with this, and are
>very confident that compression is NOT a good thing for seismic data in
>this context.  Quoting from the Usage Case document (which is quoting
>an industry expert), "Been there, done that, doesn't work, not
>interested". Doesn't mean it might not work for other use cases, of
>course.
>
>I'm not sure if I should say this, but I will -- Please don't think you
>know more about compression than our people.  That would really be a
>mistake.  We may be a bunch of redneck roughnecks in the field, but
>we've got a LONG history of cutting edge involvement with digital
>signal processing. We invented some of the key techniques, in fact.
>(That's not me personally, incidentally.  I'm very modestly
>knowledgeable about these things.)

I read in the use case that "file compression" is not a good idea for
such data. I'm fully aware of that. What you might not be aware of is
that ZIP file content can be stored "uncompressed": on the good old DOS
prompt this means using the "pkzip -e0" or "no compression" option, or
"zip -0" (that is a zero) on *nix systems.

This will give you what the Java community calls a JAR file, which
consists of data packed within a ZIP file; it is used to store data
uncompressed when needed.
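
As a rough sketch, building such a stored (uncompressed) archive from
Perl, assuming the Archive::Zip module from CPAN is available, could
look like this (the file names are made up for illustration):

#!/usr/bin/perl
# Sketch: pack binary trace files into a ZIP with no compression
# (the "stored" method), i.e. the same thing "zip -0" does.
# Assumes Archive::Zip from CPAN; file names are made up.
use strict;
use warnings;
use Archive::Zip qw( :ERROR_CODES :CONSTANTS );

my $zip = Archive::Zip->new();
for my $file ( glob 'trace*.bin' ) {
    my $member = $zip->addFile($file);
    $member->desiredCompressionMethod( COMPRESSION_STORED );  # store, don't deflate
}
$zip->writeToFileNamed('traces.zip') == AZ_OK
    or die "could not write traces.zip";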

Of course, you could similarly use a TAR file, which would also be
sufficient for your use case; however, that doesn't help other use cases
where compression is actually needed. That's why I was proposing a ZIP
format for binary packaging that can be either compressed or not.

This means that even though the archive contains millions of small
binary files, the uncompressed format can be used efficiently to work
around this. You use the ZIP file like an internal file system: you can
random-access it directly, with no real parsing to perform, and extract
the binary data straight into memory, since each entry is a small binary
chunk. It should be quite easy and efficient to implement.
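
As a rough illustration, again assuming Archive::Zip from CPAN (the
archive and member names are made up), pulling one small entry straight
into memory could look like this:

#!/usr/bin/perl
# Sketch: random-access one entry of an uncompressed ZIP directly into
# memory, via the central directory, without scanning the whole archive.
# Archive and member names are made up.
use strict;
use warnings;
use Archive::Zip qw( :ERROR_CODES );

my $zip = Archive::Zip->new();
$zip->read('traces.zip') == AZ_OK or die "could not read traces.zip";

my $member = $zip->memberNamed('trace42.bin')
    or die "no such entry";
my $data = $member->contents();   # raw bytes, in memory, no temp file
printf "trace42.bin is %d bytes\n", length $data;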

It is probably very similar to the binary data format you are actually
using, where absolute offsets are presumably used to access data
efficiently, like a binary dump of a float array. If not, I would be
happy to learn more about how it is currently stored in the industry.
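
By way of comparison, reading one value out of a raw float dump by
absolute offset might look like the sketch below; this is pure guesswork
about your format, and the file name, 4-byte floats and native byte
order are all assumptions:

#!/usr/bin/perl
# Sketch: fetch one 32-bit float out of a raw binary dump by absolute
# offset.  The file name, record size and byte order are assumptions.
use strict;
use warnings;

my $file  = 'traces.f32';
my $index = 123_456;        # which sample to fetch
my $size  = 4;              # bytes per 32-bit float

open my $fh, '<', $file or die "open $file: $!";
binmode $fh;
seek $fh, $index * $size, 0 or die "seek: $!";
read( $fh, my $buf, $size ) == $size or die "short read";
my $value = unpack 'f', $buf;   # native-endian float
print "sample[$index] = $value\n";
close $fh;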

Of course, some file systems do not like having thousands of files in
one directory; that's okay if you work directly within the JAR/ZIP file,
as I said above.

A workaround, if you actually need to extract all the files, would be to
create multiple subdirectories and extract by range (see the sketch
below). It's feasible... or get a modern file system that supports it,
like ReiserFS or similar, and use it as a data storage server.
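
Here is roughly what that extract-by-range step could look like, again
assuming Archive::Zip and bucketing on the first character of each
member name; the archive name and layout are made up:

#!/usr/bin/perl
# Sketch only: extract every member of an uncompressed ZIP into
# subdirectories keyed on the first character of its name, so no single
# directory ends up with millions of entries.  Archive name is made up.
use strict;
use warnings;
use File::Path qw(mkpath);
use Archive::Zip qw( :ERROR_CODES );

my $zip = Archive::Zip->new();
$zip->read('traces.zip') == AZ_OK or die "could not read traces.zip";

for my $member ( $zip->members() ) {
    my $name  = $member->fileName();
    my ($key) = $name =~ /^(\w)/;      # bucket by leading character
    $key = '_' unless defined $key;
    my $subdir = "extracted/$key";
    mkpath($subdir) unless -d $subdir;
    $zip->extractMember( $member, "$subdir/$name" ) == AZ_OK
        or die "failed to extract $name";
}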

Another issue with ZIP is the 2 GB limit, which can be solved with the
ZIP64 format or the 7-zip format.

Feel free to try it out and see whether it works for you or not.

In my experience, the more you brainstorm about a given subject,
the more likely you are to find solutions to those problems.

>About your specific proposal for handling the seismic data (which is
>our contribution -- including an example dataset), compression aside, I
>still don't know.  Is it really reasonable to fling millions of small
>files around?
>
>I recall that some operating systems don't like that at
>all.  As a specific example, I have experience on Solaris Unix systems
>making directories containing hundreds of thousands of small
>auto-generated files.

I have worked on Solaris, HP-UX and Unix System V, among various Linux
distributions and BSD variants.

>The OS choked -- really fundamentally choked --
>if you tried to put them all in one directory.

Well, on Linux it chokes if you try to use bash on it: doing things like
"ls trace*" won't work, because bash will try to expand this into
"ls trace[1].bin trace[2].bin trace[3].bin ..." and at some point the
argument list will exceed the limit.

Like I said, one way to handle this is to create subdirectories when you
extract it:

.\1\trace[1*].bin
.\2\trace[2*].bin

As for Solaris, I tried it out on one of our Solaris servers (SunOS 5.9
sun4u sparc SUNW,Ultra-2) with a few hundred MB of data. Having 25,000
touched files in /tmp didn't do any harm, and "ls 4*" actually completed
in a few ms with tcsh, compared to RedHat Linux on ext2 with bash.
I will try with a bigger set of files (250,000) and let you know if it
works well. I will also try tcsh on RedHat Linux to see if that works too.

time zip -0 4.zip 4*.txt
took 0.37u 0.72s 0:02.37 45.9%

time zip -0 all25k.zip *.txt
took 2.55u 4.43s 0:43.65 15.9%

Here's the little script:

#!/usr/bin/perl
# Create 100 x 1000 empty test files in /tmp to see how the file system
# and the shells cope with a large flat directory.
use strict;
use warnings;

my $path = "/tmp/";
for my $j ( 1 .. 100 )
{
  print "$j\t";                      # progress indicator, one tick per 1000 files
  for my $i ( 1 .. 1000 )
  {
     my $file = $path . "$j.$i.test.txt";
     print qx( touch $file );        # shell out to touch; prints nothing unless touch complains
  }
}


>I was forced to make
>directory trees with leaf directories that had some max number of files
>in them (I used 1000, if I recall correctly).  This necessitated, of
>course, a bunch of pain-in-the-neck logic and code.

Yes, the 1024 limit might be due to your shell program.
It can be worked around by using Perl to perform your "ls *" logic (see
the sketch below); I had a case where plain "ls" worked on 30,000 files
but "ls a*" did not. Or use a different shell.
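
For example, a small Perl listing that never goes through the shell, so
no glob-expansion limit applies (the directory and prefix are made up):

#!/usr/bin/perl
# Sketch: list files matching a prefix without going through the shell,
# so no glob expansion or argument-list limit is involved.
# Directory and prefix are made up for illustration.
use strict;
use warnings;

my $dir    = '/tmp';
my $prefix = 'trace';

opendir my $dh, $dir or die "opendir $dir: $!";
my @matches = grep { /^\Q$prefix\E/ } readdir $dh;
closedir $dh;

print scalar(@matches), " files match $prefix*\n";
print "$_\n" for sort @matches;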

Another way, like I was saying, is to work directly from the ZIP file
and only extract what you need on the fly.

As a quick reminder, ZIP files are not "solid archives" like RAR, ACE or
tar.gz, where one needs to decompress the entire archive to work with it.

Also,
time unzip -o all25k.zip 1.850.test.txt
time unzip -o all25k.zip 4.680.test.txt
time unzip -o all25k.zip 9.880.test.txt

took
0.09u 0.01s 0:00.10 100.0%
0.06u 0.04s 0:00.10 100.0%
0.07u 0.03s 0:00.10 100.0%

As you can see, it's quite fast!

>This was a while ago, so maybe things have improved -- I throw the 
>experience out for what it is worth.  But I am dubious and would 
>certainly want to see demonstrations before committing to this 
>approach.

I fully agree with you.
Scepticism is a *good thing*. =]
The best is to see whether it could work or not:
see what works and what does not, and try to see what could be fixed,
if it's fixable.

As a result, it would be quite interesting to see if this technique
could work for you or not.

You could see this proposal as an alternative similar to XOP, which was
also proposed:
http://www.w3.org/TR/2005/REC-xop10-20050125/

Just so you know, one of the advantages of the <seisdata> use case 3.2
was that it had an explicit XML sample, while most use cases did not, so
it was a better candidate to work on. It was also the first use case in
XBC.

If you have any more comments, suggestions, improvements or feedback,
please send them! =)

Sincerely yours,

Fred.

Received on Friday, 18 February 2005 19:58:22 UTC