RE: [XML-Binary] ZIP file format using XPATH for directory entries proposal

Hi M. Cutler,

First of all, let me say that I'm not an expert in the seisdata field in any way,
and I don't wish to claim to be one either.

Sorry if it came across like that; it was not my intention.

I'm just saying that your kind of dataset, among others that I use more
frequently, is what inspired me to write this proposal email.

>We have not found the performance of any compression algorithm adequate
>for our usage cases.  We have extensive experience with this, and are
>very confident that compression is NOT a good thing for seismic data in
>this context.  Quoting from the Usage Case document (which is quoting an
>industry expert), "Been there, done that, doesn't work, not interested".
>Doesn't mean it might not work for other use cases, of course.
>
>I'm not sure if I should say this, but I will -- Please don't think you
>know more about compression than our people.  That would really be a
>mistake.  We may be a bunch of redneck roughnecks in the field, but
>we've got a LONG history of cutting edge involvement with digital signal
>processing. We invented some of the key techniques, in fact. (That's not
>me personally, incidentally.  I'm very modestly knowledgeable about these 
>things.)

I read in the use case that "file compression" is not a good idea for such data.
I'm fully aware of that. What you might not be aware of is that ZIP file content
can be stored "uncompressed". At the good old DOS prompt this means using the
"pkzip -e0" ("no compression") option, or "zip -0" (zero) on *nix systems.

This gives you what the Java community used to call a JAR file,
which consists of data packed within a ZIP file, uncompressed.
It is used to store uncompressed data when needed.

Of course, you could similarly use a TAR file, which would also be sufficient
for your use case; however, it doesn't help other use cases
where compression is actually needed. That's why I was proposing a ZIP format
for binary packaging that can be compressed or not.
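
To illustrate, here is a minimal Perl sketch of that "stored" (uncompressed)
packing, using the Archive::Zip CPAN module; the file names are made up, and
the "zip -0" command line above does the same job:
#!/usr/bin/perl
# A minimal sketch: pack small binary files into a ZIP archive with no
# compression at all ("stored" entries, JAR-style). File names are made up.
use strict;
use warnings;
use Archive::Zip qw( :ERROR_CODES :CONSTANTS );

my $zip = Archive::Zip->new();
for my $file ( glob("trace*.bin") ) {                        # hypothetical traces
    my $member = $zip->addFile($file);
    $member->desiredCompressionMethod( COMPRESSION_STORED ); # store, don't deflate
}
$zip->writeToFileNamed("traces.zip") == AZ_OK
    or die "could not write traces.zip";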

This means that even though the file contains millions of small binary files,
the uncompressed format can be used 'efficiently' to work around this.
You use the ZIP file like an internal file system, which you can random access
directly without any real parsing, and extract the binary data straight into
memory, since each entry is a small binary chunk.
It should be quite easy and efficient to implement; a small sketch follows.
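
For example, a minimal Perl sketch of that random access, again with the
Archive::Zip module (archive and entry names are hypothetical); the central
directory is read once, and any single member can then be pulled straight into memory:
#!/usr/bin/perl
# A minimal sketch: open a ZIP archive and read one small member into memory
# without extracting anything to disk. Names are hypothetical.
use strict;
use warnings;
use Archive::Zip qw( :ERROR_CODES );

my $zip = Archive::Zip->new();
$zip->read("traces.zip") == AZ_OK or die "could not read traces.zip";

my $member = $zip->memberNamed("trace42.bin")   # hypothetical entry name
    or die "no such entry";
my $data = $member->contents();                 # raw bytes, in memory
print length($data), " bytes read\n";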

It is probably very similar to the binary data format you are actually using,
where absolute offsets are presumably used to access data efficiently, like a
binary dump of a float array.
If not, I would be happy to learn more about how it is currently stored in the
industry.

Of course, some file systems do not like having thousands of files; that's okay
if you work directly within the JAR/ZIP file, as I said above.

A workaround, if you actually need to extract all the files, would be to create
multiple subdirectories and extract by range. It's feasible...
Or get a modern file system that supports it, like ReiserFS or similar,
and use it as a data storage server.

Another problem with ZIP is the 2 GB limit,
which can be solved with the ZIP64 format or the 7-zip format.

Feel free to try it out and see whether it works for you or not.

As far as life tells me, the more you brainstorm about a given subject,
the more likely you are to find solutions to those problems.

>About your specific proposal for handling the seismic data (which is our
>contribution -- including an example dataset), compression aside, I
>still don't know.  Is it really reasonable to fling millions of small
>files around?

>I recall that some operating systems don't like that at
>all.  As a specific example, I have experience on Solaris Unix systems
>making directories containing hundreds of thousands of small
>auto-generated files.

I have worked on Solaris, HP-UX, UNIX System V, and various Linux distributions
and BSD variants.

>The OS choked -- really fundamentally choked --
>if you tried to put them all in one directory.

Well, on Linux it chokes if you try to use bash on it,
since something like "ls trace*" won't work: bash will try to expand it into
"ls trace[1].bin trace[2].bin trace[3].bin ..." and at some point it will
exceed the argument-length limit.

As I said, one way to handle this is to create subdirectories when you extract
it, for example (see the sketch below):

.\1\trace[1*].bin
.\2\trace[2*].bin
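
Here is a minimal sketch of that bucketing with Archive::Zip (the archive name
and the 1000-files-per-directory limit are just assumptions):
#!/usr/bin/perl
# A minimal sketch: extract a large archive into numbered subdirectories,
# at most 1000 files per directory. Archive name and limit are made up.
use strict;
use warnings;
use Archive::Zip qw( :ERROR_CODES );

my $zip = Archive::Zip->new();
$zip->read("all25k.zip") == AZ_OK or die "could not read archive";

my ( $count, $bucket ) = ( 0, 0 );
for my $name ( $zip->memberNames() ) {
    $bucket++ if $count++ % 1000 == 0;          # start a new subdirectory
    mkdir "$bucket" unless -d "$bucket";
    $zip->extractMember( $name, "$bucket/$name" ) == AZ_OK
        or die "could not extract $name";
}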

As for Solaris, I tried it out on one of our Solaris servers (SunOS 5.9 sun4u
sparc SUNW,Ultra-2) with a few hundred MB of data. Having 25,000 touched files
in /tmp didn't do any harm, and "ls 4*" actually worked in a few ms with tcsh,
compared to Red Hat Linux on ext2 with bash.
I will try with a bigger set of files (250,000) and let you know if it works well.
I will also try tcsh on Red Hat Linux to see if that works too.
For reference, a couple of zip timings on that set of files:

time zip -0 4.zip 4*.txt
took 0.37u 0.72s 0:02.37 45.9%

time zip -0 all25k.zip *.txt
took 2.55u 4.43s 0:43.65 15.9%

Here's the little script:
#!/usr/bin/perl
# Create lots of small empty test files in /tmp (100 batches of 1000).
use strict;
use warnings;

my $path = "/tmp/";
for my $j ( 1 .. 100 )
{
  print "$j\t";
  for my $i ( 1 .. 1000 )
  {
     my $file = $path . "$j.$i.test.txt";
     print qx( touch $file );    # shell out to touch; prints nothing on success
  }
}


>I was forced to make
>directory trees with leaf directories that had some max number of files
>in them (I used 1000, if I recall correctly).  This necessitated, of
>course, a bunch of pain-in-the-neck logic and code.

Yes, the 1024 limit might be due to your shell program.
It can be worked around by using Perl to perform your "ls *" logic
(I had a case where "ls" worked on 30,000 files but "ls a*" did not),
or by using a different shell. A small sketch follows.
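
For instance, a minimal Perl sketch (directory and prefix are hypothetical) that
lists matching entries without going through shell glob expansion at all:
#!/usr/bin/perl
# A minimal sketch: list directory entries matching a prefix without relying
# on the shell to expand "4*" into one huge argument list.
use strict;
use warnings;

my $dir = "/tmp";                                        # hypothetical directory
opendir( my $dh, $dir ) or die "cannot open $dir: $!";
my @matches = grep { /^4/ && /\.txt$/ } readdir($dh);    # same idea as "ls 4*.txt"
closedir($dh);
print "$_\n" for @matches;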

Another way, as I was saying, is to work directly from the ZIP file
and only extract what you need on the fly.

As a quick reminder, ZIP files are not "solid archives" like RAR, ACE, or tar/gz,
where one needs to decompress the entire archive to work with it.

Also, a few single-file extraction timings:
time unzip -o all25k.zip 1.850.test.txt
time unzip -o all25k.zip 4.680.test.txt
time unzip -o all25k.zip 9.880.test.txt

took
0.09u 0.01s 0:00.10 100.0%
0.06u 0.04s 0:00.10 100.0%
0.07u 0.03s 0:00.10 100.0%

As you can see, it's quite fast!

>This was a while ago, so maybe things have improved -- I throw the
>experience out for what it is worth.  But I am dubious and would
>certainly want to see demonstrations before committing to this approach.

I fully agree with you.
Scepticism is a *good thing*. =]
The best thing is to see whether it could work or not:
to see what works and what does not, and to try to figure out what could be
fixed, if it's fixable.

As a result, it would be quite interesting to see whether this technique could
work for you or not.

You could see this proposal as an alternative, similar to XOP, which was also
proposed:
http://www.w3.org/TR/2005/REC-xop10-20050125/

Just so you know, one of the advantages of the <seisdata> use case 3.2 was that
it had an explicit XML sample, while most use cases did not,
so it was a better candidate to work on.
It was also the first use case in XBC.

If you have any more comments, suggestions, improvements or feedback,
please send them! =)

Sincerely yours,

Fred.

Received on Friday, 18 February 2005 19:39:23 UTC