RE: [XML-Binary] ZIP file format using XPATH for directory entries proposal

Since you mention the "Floating Point Arrays in the Energy Industry"
usage case, I probably should comment:

We have not found the performance of any compression algorithm adequate
for our usage cases.  We have extensive experience with this, and are
very confident that compression is NOT a good thing for seismic data in
this context.  Quoting from the Usage Case document (which is quoting an
industry expert), "Been there, done that, doesn't work, not interested".
Doesn't mean it might not work for other use cases, of course.

I'm not sure if I should say this, but I will -- please don't think you
know more about compression than our people.  That would really be a
mistake.  We may be a bunch of redneck roughnecks in the field, but
we've got a LONG history of cutting-edge involvement with digital signal
processing. We invented some of the key techniques, in fact. (That's not
me personally, incidentally.  I'm very modestly knowledgeable about these
things.)

About your specific proposal for handling the seismic data (which is our
contribution -- including an example dataset), compression aside, I
still don't know.  Is it really reasonable to fling millions of small
files around?  I recall that some operating systems don't like that at
all.  As a specific example, I have experience on Solaris Unix systems
making directories containing hundreds of thousands of small
auto-generated files.  The OS choked -- really fundamentally choked --
if you tried to put them all in one directory.  I was forced to make
directory trees with leaf directories that had some max number of files
in them (I used 1000, if I recall correctly).  This necessitated, of
course, a bunch of pain-in-the-neck logic and code.

This was a while ago, so maybe things have improved -- I throw the
experience out for what it is worth.  But I am dubious and would
certainly want to see demonstrations before committing to this approach.

-----Original Message-----
From: public-xml-binary-request@w3.org
[mailto:public-xml-binary-request@w3.org] On Behalf Of Fred P.
Sent: Friday, February 18, 2005 1:40 AM
To: public-xml-binary@w3.org
Subject: [XML-Binary] ZIP file format using XPATH for directory entries
proposal



Hi everyone,

Here's a very straightforward proposal:

This proposal is made to address the following use cases:

- 3.2 Floating Point Arrays in the Energy Industry
- 3.3 X3D Graphics Model Compression, Serialization and Transmission
- 3.5 Web Services within the Enterprise
- 3.6 Embedding External Data in XML Documents
- 3.7 Electronic Documents

and, to some extent, this use case and others where complexity matters:
- 3.4 Web Services for Small Devices


Many here might know the OASIS OpenDocument format,
which consists of a ZIP file of XML documents.

The following proposal is an extension of that idea.

It was derived by looking at FixML 4.3, SVG, seisdata
and various other use cases which need binary content.


Proposed name/extension:
------------------------
.BML  = Binary Markup Language
.ZML  = Zipped Markup Language
.7ML  = 7-zip Markup Language
.XMLZ = eXtensible Markup Language Zipped (similar to svgz)

One of the use cases needed a 'very small footprint' for a decompressor.

I looked around for commonly used compression formats; the main ones are:
bzip2, gzip, tar, jar/zip, arj, rar, ace, 7-zip

bzip2 and gzip were already considered; their main problems are:
- They do not allow random access within a file (solid archive)
- They cannot contain multiple files (only via tar)
- They are complex to implement for small devices.

tar does no compression of its own, so it's eliminated.

rar and ace are proprietary, with only extraction algorithms being
provided, so they're eliminated.

arj was interesting, but not quite enough:
+ Random access
+ Can contain multiple files
+ Source code available on SourceForge (GPL)
+ Did about the same as zip for mixed and binary content
- Did worse than zip on compression of English text, log files or sorted
  word lists
- Is claimed to be: "ARJ is a CPU-intensive program"

zip/jar was interesting, but with some algorithm/size limitations:
+ Random access
+ Can contain multiple files
+ Algorithm is available
+ Source code available
+ Industry standard
+ Compresses/decompresses at about the same speed/size as gzip in fast mode
- 2 GB limitation
- Does not support Unicode file names
- Output is 1.5x to 3x bigger than 7-zip

7-zip was interesting, with some speed limitations:
+ Random access
+ Can contain multiple files
+ Algorithm is available
+ Source code available (LGPL)
+ Small code size for decompressing: about 5 KB
+ Very suitable for embedded applications
+ Small memory requirements for decompressing (depends on dictionary size)
+ Supports encryption with the AES-256 algorithm
+ Uses LZMA, derived from LZ77
+ Supports Unicode file names
+ Compression of archive headers
+ Supports more than 2 GB of content (up to 2^64 bytes)
- Not an industry standard
- Decompressing speed: about 10-20 MB/s on a 2 GHz CPU
- Compressing   speed: about     1 MB/s on a 2 GHz CPU
-  4 times slower than gzip when compressing/decompressing in fast mode (-mx=1)
- 10 times slower than gzip when compressing in maximum mode
- Total time including transfer is 1.5x slower than gzip or zip

=========================================================

As a result, zip and 7-zip can be considered.

The obvious advantage of zip is that it is an industry standard and
performs similarly to gzip in speed and size.

The obvious advantage of 7-zip is file size over slow links (28 Kbps or
slower), Unicode file names and the small 5 KB footprint for the
decompressor.
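For anyone wanting to reproduce this kind of size/speed trade-off on
their own data, here is a minimal sketch using Python's standard zlib
module (the deflate algorithm used by zip and gzip) and lzma module (the
algorithm family used by 7-zip; note the stdlib module emits xz/raw LZMA
streams, not .7z containers).  The sample file name is only an assumption.

  import lzma
  import time
  import zlib

  def compare(data):
      # Deflate roughly stands in for zip/gzip, LZMA for the 7-zip family.
      for label, compress in (("deflate (zip/gzip)", lambda d: zlib.compress(d, 6)),
                              ("LZMA (7-zip family)", lzma.compress)):
          start = time.perf_counter()
          out = compress(data)
          print("%-20s %9d bytes  %6.3f s"
                % (label, len(out), time.perf_counter() - start))

  compare(open("seisdata.xml", "rb").read())  # hypothetical sample file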

=========================================================

The goal is to use an XPath-like syntax for directories within the
archive.

With 7-zip, this means Unicode element names in those paths can be
supported, while this is not possible with zip.

Questions to debate are:
- Do we put file extensions or not?
  + Easier for external viewers.
  - XPath association must be done on the filename WITHOUT the
    extension.

- Do we want to support 2 GB+ archives (zip64, 7-zip)?
- Do we want to support Unicode XPath (7-zip)?
- Is file size more important than compression speed?

- Should we support several compression schemes: gzip, bzip2, zip and
  7-zip?


Assuming the following test case:
http://www.w3.org/TR/xbc-use-cases/#3.2

<seisdata>
  <lineheader>...</lineheader>
  <header>
    <linename>westcam 2811</linename>
    <ntrace>1207</ntrace>
    <nsamp>3001</nsamp>
    <tstart>0.0</tstart>
    <tinc>4.0</tinc>
    <shot>120</shot>
    <geophone>7</geophone>
  </header>
  <trace>0.0 0.0 468.34 3.245672E04 6.9762345E05 ... (3001
floats)</trace>
  <header>
    ...
  </header>
  <trace> ... </trace>
  ...
</seisdata>

It would be stored like this:
==========================
/seisdata.xml
/seisdata/trace[1].bin
/seisdata/trace[2].bin
/seisdata/trace[3].bin
/seisdata/trace[4].bin
/seisdata/trace[5].bin
/seisdata/trace[6].bin


Where /seisdata.xml contains:
============================
<seisdata>
  <lineheader>...</lineheader>
  <header>
    <linename>westcam 2811</linename>
    <ntrace>1207</ntrace>
    <nsamp>3001</nsamp>
    <tstart>0.0</tstart>
    <tinc>4.0</tinc>
    <shot>120</shot>
    <geophone>7</geophone>
  </header>
  <trace><![BDATA[/seisdata/trace[1].bin]]></trace>
  <header>
    ...
  </header>
  <trace><![BDATA[/seisdata/trace[2].bin]]></trace>
  ...
</seisdata>

Where /seisdata/trace[1].bin contains opaque
IEEE floating-point binary data.
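For concreteness, here is a minimal sketch (not part of the proposal
itself) of producing that layout with Python's standard zipfile and
struct modules, using a hypothetical helper pack_seisdata().  Zip entry
names normally omit the leading "/", so "seisdata/trace[1].bin" is used
inside the archive; the trace members are stored uncompressed, given the
comments in this thread about compressing seismic floats.

  import struct
  import zipfile

  def pack_seisdata(path, header_xml, traces):
      # traces: one sequence of floats per <trace> element.
      with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as zf:
          body = [header_xml]
          for i, samples in enumerate(traces, start=1):
              entry = "seisdata/trace[%d].bin" % i
              # Opaque IEEE 754 data: big-endian 32-bit floats, no text encoding.
              zf.writestr(entry, struct.pack(">%df" % len(samples), *samples),
                          compress_type=zipfile.ZIP_STORED)
              body.append("  <trace><![BDATA[/%s]]></trace>" % entry)
          zf.writestr("seisdata.xml",
                      "<seisdata>\n%s\n</seisdata>\n" % "\n".join(body))

  pack_seisdata("seisdata.zml", "  <header><ntrace>2</ntrace></header>",
                [[0.0, 0.0, 468.34], [1.5, 2.5, 3.5]])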


The <trace> node could also be left empty, like this: <trace></trace>

The advantage of the placeholder is that conventional DOM manipulation
can detect that zipped binary data exists and should be fetched on the
fly, as needed.

In this case, if you just want the XML without the binary,
it should load quickly and be easy to parse/modify/save.


+ This means accessing individual traces is extremely fast and easy,
  since the archive does not have to be fully extracted or parsed.

+ Also, no encoding is needed.

+ It's very easy to create/modify/extract/view any file within the
  archive.

+++ Readability is preserved.
+++ No big changes to existing XML tools/parser
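As a sketch of that on-demand access (assuming the layout and the
hypothetical pack_seisdata() archive above): only the manifest and the
one requested member are read, because the zip central directory gives
random access to individual entries.

  import struct
  import zipfile

  def read_trace(path, index):
      with zipfile.ZipFile(path) as zf:
          manifest = zf.read("seisdata.xml").decode("utf-8")   # small, loads quickly
          data = zf.read("seisdata/trace[%d].bin" % index)     # just this member
          return manifest, struct.unpack(">%df" % (len(data) // 4), data)

  manifest, samples = read_trace("seisdata.zml", 1)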


Another way would be the following:
====================================

/seisdata.xml
/seisdata/header[1].xml
/seisdata/trace[1].bin
/seisdata/header[2].xml
/seisdata/trace[2].bin
/seisdata/header[3].xml
/seisdata/trace[3].bin
/seisdata/header[4].xml
/seisdata/trace[4].bin
/seisdata/header[5].xml
/seisdata/trace[5].bin
/seisdata/header[6].xml
/seisdata/trace[6].bin

This means that new nodes are appended using zip operations instead.
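A minimal sketch of such an append, again assuming the entry naming
above and a hypothetical helper append_record(): zipfile's append mode
only adds new members and rewrites the central directory, so the
original entries remain byte-for-byte in place, which is what supports
the safety point below.

  import struct
  import zipfile

  def append_record(path, index, header_xml, samples):
      # Mode "a" appends new members without touching the existing ones.
      with zipfile.ZipFile(path, "a") as zf:
          zf.writestr("seisdata/header[%d].xml" % index, header_xml)
          zf.writestr("seisdata/trace[%d].bin" % index,
                      struct.pack(">%df" % len(samples), *samples))

  append_record("seisdata.zml", 7, "<header><shot>121</shot></header>", [0.0, 1.0])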

It also means that the XML parser must work a bit harder.
It is also safer, since the original is kept as-is: useful for
financial, banking and other crucial data where a modification log is
needed.

/seisdata.xml

could contain <![XDATA[/seisdata/header[6].xml]]> or not.

Obviously, adding such a placeholder for each XML add-on would be penalising.

However, an alternative scenario or syntax could be derived:

<seisdata>
  <lineheader>...</lineheader>
  <header><![XDATA[/seisdata/header[*].xml]]></header>
  <trace><![BDATA[/seisdata/trace[*].bin]]></trace>
</seisdata>


e.g. pkunzip -d seisdata.zip /seisdata/trace[*].bin


XDATA: XML data     [parsable by DOM]
BDATA: Binary data  [non parsable]
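
Resolving such a "[*]" wildcard amounts to matching it against the
archive's directory.  A minimal sketch with a hypothetical resolve()
helper, where treating "[*]" as "any numeric index" is my own assumption
about the intended syntax:

  import re
  import zipfile

  def resolve(zf, pattern):
      # "seisdata/trace[*].bin" -> every entry whose index is a number.
      regex = re.compile(re.escape(pattern).replace(r"\[\*\]", r"\[(\d+)\]") + "$")
      hits = []
      for name in zf.namelist():
          match = regex.match(name)
          if match:
              hits.append((int(match.group(1)), name))
      # Sort numerically so trace[10] comes after trace[9], not after trace[1].
      return [name for _, name in sorted(hits)]

  with zipfile.ZipFile("seisdata.zml") as zf:
      print(resolve(zf, "seisdata/trace[*].bin"))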


Another use case, Word 2000 HTML:
=================================

Currently, it saves into a "folder"
with external metadata, images and sounds.
That could be zipped like this:

/html.xhtml
/html/head/metadata.bin
/html/body/img[1].gif
/html/body/img[2].jpg
/html/body/img[3].png
/html/body/img[4].bmp
/html/body/img[5].svg
/html/body/embed[1].mp3
/html/body/embed[2].mid
/html/body/embed[3].au

========================================================

Reference:
==========

XML Binary Characterization Use cases:
http://www.w3.org/TR/xbc-use-cases/

OASIS OpenDocument:
http://www.oasis-open.org/committees/download.php/10765/office-spec-1.0-cd-2.pdf

Compression Algorithm Comparison:
http://www.maximumcompression.com/data/text.php
http://www.maximumcompression.com/data/summary_sf.php
http://en.wikipedia.org/wiki/Comparsion_of_file_archivers
http://www.malayter.com/compressiontest.html
http://www.cs.tut.fi/~warp/ArchiverComparison/
http://flashlight.slad.cz/files/compression.pdf
http://www.compression.ru/artest/texts25.html

7-zip file format:
http://www.7-zip.org/7z.html
http://en.wikipedia.org/wiki/7-Zip
http://en.wikipedia.org/wiki/7z

Arj file format:
http://datacompression.info/ArchiveFormats/arj.txt
http://arj.sourceforge.net/

Rar file format:
http://www.geocities.com/marcoschmidt.geo/rar-archive-file-format.html

Zip file format:
http://www.geocities.com/marcoschmidt.geo/zip-archive-file-format.html
http://datacompression.info/Zip.shtml
http://www.faqs.org/rfcs/rfc1951.html
http://www.geocities.com/SiliconValley/Lakes/6686/zip-archive-file-format.html
http://en.wikipedia.org/wiki/ZIP_file_format

XPath tutorial:
http://www.w3schools.com/xpath/xpath_syntax.asp

CDATA tutorial:
http://www.w3schools.com/xml/xml_cdata.asp


=========================================================

Comments, suggestions, improvements, feedback welcome? =)


Sincerely yours,

Fred.
