[XML-Binary] ZIP file format using XPATH for directory entries proposal

Hi everyone,

Here's a very straightforward proposal:

This proposal is intended to address the following use cases:

- 3.2 Floating Point Arrays in the Energy Industry
- 3.3 X3D Graphics Model Compression, Serialization and Transmission
- 3.5 Web Services within the Enterprise
- 3.6 Embedding External Data in XML Documents
- 3.7 Electronic Documents

and, to some extent, this use case and others where complexity matters:
- 3.4 Web Services for Small Devices


Many here might know the OASIS OpenDocument format,
which consists of a ZIP file of XML documents.

This proposal is an extension of that idea.

It was derived by looking at FixML 4.3, svg and seisdata
and various other use cases which need binary content.


Proposed name/extension:
------------------------
.BML  = Binary Markup Language
.ZML  = Zipped Markup Language
.7ML  = 7-zip Markup Language
.XMLZ = eXtensible Markup Language Zipped (similar to svgz)

One of the use cases needed a 'very small footprint' for a decompressor.

I looked around for commonly used compression formats. They are mainly:
bzip2, gzip, tar, jar/zip, arj, rar, ace, 7-zip

bzip2 and gzip were already considered. Their main problems are:
- They do not allow random access to a file (solid archive)
- They cannot contain multiple files (that is mostly done via tar)
- They are complex to implement for small devices.

tar does not compress on its own, so it's eliminated.

rar and ace are proprietary, with only the extraction algorithms being provided,
so they are eliminated.

arj was interesting, but not quite enough:
+ Random access
+ Can contain multiple files
+ Source code available on sourceforge (GPL)
+ Did about the same as zip for mixed and binary content
- Did worse than zip on compression of English text, log files or sorted word lists
- Is claimed to be CPU-intensive: "ARJ is a CPU-intensive program"

zip/jar was interesting, but with some algo/size limitations:
+ Random access
+ Can contain multiple files
+ Algorithm is available
+ Source code available
+ Industry standard
+ Compresses/decompresses at about the same speed/size as gzip in fast mode
- 2 GB limitation
- Does not support Unicode file names
- 1.5x to 3x bigger than 7-zip

7-zip was interesting, with some speed limitations:
+ Random access
+ Can contain multiple files
+ Algorithm is available
+ Source code available (LGPL)
+ Small code size for decompressing: about 5 KB
+ Very suitable for embedded applications
+ Small memory requirements for decompressing (depends on dictionary size)
+ Supports encryption with AES-256 algorithm
+ Uses LZMA, derived from LZ77
+ Supports Unicode file names
+ Compression of archive headers
+ Supports more than 2 GB content (2^64 bytes)
- Not an industry standard
- Decompressing speed: about 10-20 MB/s on 2 GHz CPU
- Compressing   speed: about     1 MB/s on 2 GHz CPU
-  4 times slower than gzip when compressing/decompressing in fast mode (-mx=1)
- 10 times slower than gzip when compressing in maximum mode
- Total time including transfer is 1.5x slower than gzip or zip
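
These speed and size figures depend heavily on the data and the settings,
so they are worth re-measuring on your own files. Here is a minimal sketch,
assuming Python's standard zlib, bz2 and lzma modules as stand-ins: zlib
implements the deflate algorithm used by gzip and zip, and lzma implements
the LZMA algorithm used by 7-zip (it writes an .xz stream, not a .7z archive).
The helper name compare() and the sample data are made up for illustration.

```python
import bz2
import lzma
import time
import zlib


def compare(data):
    """Return {codec name: (compressed size, seconds)} for three stdlib codecs."""
    codecs = {
        "deflate (zlib, level 1)": lambda d: zlib.compress(d, 1),
        "bzip2 (level 9)": lambda d: bz2.compress(d, 9),
        "lzma (preset 6)": lambda d: lzma.compress(d, preset=6),
    }
    results = {}
    for name, compress in codecs.items():
        start = time.perf_counter()
        size = len(compress(data))
        results[name] = (size, time.perf_counter() - start)
    return results


if __name__ == "__main__":
    # Hypothetical sample: repetitive XML-like text, not real seismic data.
    sample = b"<trace>0.0 0.0 468.34 3.245672E04 6.9762345E05</trace>\n" * 20000
    for name, (size, seconds) in compare(sample).items():
        print(f"{name}: {size} bytes in {seconds:.3f} s")
```

Timings from such a script only compare the codecs on one machine and one
input; they say nothing about decompressor code size or memory use.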

=========================================================

As a result, zip and 7-zip can be considered.

The obvious advantage of zip is that it is an industry standard
and that its speed and output size are similar to gzip's.

The obvious advantage of 7-zip is file size over slow links (28 Kbps or slower),
Unicode file names and the small 5 KB footprint for the decompressor.

=========================================================

The goal is to use an XPath-like syntax for directories within the archive.

With 7-zip, this means Unicode characters can be supported in entry names,
while this is not possible with zip.

Questions to debate are:
- Do we put file extensions or not?
  + Easier for external viewer.
  - XPATH Association must be done on the filename WITHOUT the extension.

- Do we want to support 2 GB+ archives (zip64, 7-zip) ?
- Do we want to support Unicode XPath (7-zip) ?
- Is file size more important than compression speed ?

- Should we support many compression schemes: gzip, bzip2, zip and 7-zip ?
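
To make the file extension question concrete, here is a minimal sketch of
the name-to-XPath association, assuming every entry carries an extension made
of letters and digits only. The function names entry_to_xpath() and
xpath_to_entry() are hypothetical, not part of any existing tool.

```python
import re

# Assumption: an extension is a final ".suffix" made of letters and digits.
_EXTENSION = re.compile(r"\.[A-Za-z0-9]+$")


def entry_to_xpath(entry_name):
    """Strip the file extension so the entry name can be used as an XPath."""
    return _EXTENSION.sub("", entry_name)


def xpath_to_entry(xpath, extension):
    """Inverse mapping: append an extension such as 'bin' or 'xml'."""
    return f"{xpath}.{extension}"


if __name__ == "__main__":
    print(entry_to_xpath("/seisdata/trace[1].bin"))
    print(xpath_to_entry("/seisdata/trace[1]", "bin"))
```

If extensions were optional, an element name that itself contains a dot
(and has no [n] predicate) would be ambiguous under this rule, which is one
argument for making the choice mandatory either way.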


Assuming the following test case:
http://www.w3.org/TR/xbc-use-cases/#3.2

<seisdata>
  <lineheader>...</lineheader>
  <header>
    <linename>westcam 2811</linename>
    <ntrace>1207</ntrace>
    <nsamp>3001</nsamp>
    <tstart>0.0</tstart>
    <tinc>4.0</tinc>
    <shot>120</shot>
    <geophone>7</geophone>
  </header>
  <trace>0.0 0.0 468.34 3.245672E04 6.9762345E05 ... (3001 floats)</trace>
  <header>
    ...
  </header>
  <trace> ... </trace>
  ...
</seisdata>

Would be stored like this:
==========================
/seisdata.xml
/seisdata/trace[1].bin
/seisdata/trace[2].bin
/seisdata/trace[3].bin
/seisdata/trace[4].bin
/seisdata/trace[5].bin
/seisdata/trace[6].bin


Where /seisdata.xml contains:
============================
<seisdata>
  <lineheader>...</lineheader>
  <header>
    <linename>westcam 2811</linename>
    <ntrace>1207</ntrace>
    <nsamp>3001</nsamp>
    <tstart>0.0</tstart>
    <tinc>4.0</tinc>
    <shot>120</shot>
    <geophone>7</geophone>
  </header>
  <trace><![BDATA[/seisdata/trace[1].bin]]></trace>
  <header>
    ...
  </header>
  <trace><![BDATA[/seisdata/trace[2].bin]]></trace>
  ...
</seisdata>

Where /seisdata/trace[1].bin contains opaque
IEEE floating point binary digits.
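
A minimal sketch of how such an archive could be written, using Python's
standard zipfile and struct modules. Several things here are assumptions:
the function name build_archive(), the choice of little-endian 4-byte floats
for the .bin payload, and the entry names without the leading "/" (the ZIP
specification does not allow a leading slash in stored paths, so the root is
implied).

```python
import io
import struct
import zipfile


def build_archive(xml_text, traces):
    """Return ZIP bytes holding the XML plus one .bin entry per trace."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("seisdata.xml", xml_text)
        for index, samples in enumerate(traces, start=1):
            # Assumed encoding: little-endian IEEE 754 single-precision floats.
            payload = struct.pack(f"<{len(samples)}f", *samples)
            zf.writestr(f"seisdata/trace[{index}].bin", payload)
    return buffer.getvalue()


if __name__ == "__main__":
    data = build_archive("<seisdata/>", [[0.0, 0.0, 468.34], [1.5, 2.5]])
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for name in zf.namelist():
            print(name)
```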


The <trace> node could also be left empty, like this: <trace></trace>

The advantage of the placeholder is that conventional DOM manipulation
can tell that zipped binary data exists for the node
and should be fetched on the fly, as needed.

In this case, if you just want the XML without the binary,
it should load quickly and be easy to parse/modify/save.


+ This means accessing an individual trace is extremely fast and easy,
  since the archive does not have to be fully extracted or parsed.

+ Also, no encoding is needed.

+ It's very easy to create/modify/extract/view any file within the archive.

+++ Readability is preserved.
+++ No big changes to existing XML tools/parsers


Another way would be the following:
===================================

/seisdata.xml
/seisdata/header[1].xml
/seisdata/trace[1].bin
/seisdata/header[2].xml
/seisdata/trace[2].bin
/seisdata/header[3].xml
/seisdata/trace[3].bin
/seisdata/header[4].xml
/seisdata/trace[4].bin
/seisdata/header[5].xml
/seisdata/trace[5].bin
/seisdata/header[6].xml
/seisdata/trace[6].bin

This means that new nodes are appended using zip operations
instead of by editing the XML file.

It also means that the XML parser must work a bit harder.
On the other hand, it is safer, since the original is kept as is,
which suits financial, banking and other critical data where a modification log is needed.
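
A minimal sketch of such an append, assuming Python's standard zipfile module
opened in "a" (append) mode, which adds entries after the existing ones and
rewrites only the central directory. The function name append_record() and
the entry names without a leading "/" are assumptions for illustration.

```python
import io
import re
import zipfile

_TRACE = re.compile(r"seisdata/trace\[(\d+)\]\.bin")


def append_record(archive, header_xml, trace_bytes):
    """Append header[n].xml and trace[n].bin; existing entries stay untouched."""
    with zipfile.ZipFile(archive, "a", zipfile.ZIP_DEFLATED) as zf:
        used = [int(m.group(1)) for m in map(_TRACE.fullmatch, zf.namelist()) if m]
        index = max(used, default=0) + 1
        zf.writestr(f"seisdata/header[{index}].xml", header_xml)
        zf.writestr(f"seisdata/trace[{index}].bin", trace_bytes)
    return index


if __name__ == "__main__":
    archive = io.BytesIO()
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("seisdata.xml", "<seisdata/>")
    print(append_record(archive, "<header/>", b"\x00" * 8))
    print(zipfile.ZipFile(archive).namelist())
```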

/seisdata.xml

could contain <![XDATA[/seisdata/header[6].xml]]> or not.

Obviously, adding such a placeholder for every XML add-on would be costly.

However, an alternative scenario or syntax could be derived:

<seisdata>
  <lineheader>...</lineheader>
  <header><![XDATA[/seisdata/header[*].xml]]></header>
  <trace><![BDATA[/seisdata/trace[*].bin]]></trace>
</seisdata>


e.g. pkunzip -d seisdata.zip /seisdata/trace[*].bin
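
How the [*] wildcard gets matched is left open above. Here is one possible
reading as a minimal sketch, assuming [*] stands for any numeric position.
Note that Python's shell-style fnmatch would treat [*] as a character class
matching a literal asterisk, so the pattern is translated to a regular
expression instead. The function name select_entries() is hypothetical.

```python
import re


def select_entries(names, pattern):
    """Return the names matching pattern, where "[*]" means any position."""
    # Escape the literal parts and let "[*]" match any numeric index.
    regex = r"\[\d+\]".join(re.escape(part) for part in pattern.split("[*]"))
    return [name for name in names if re.fullmatch(regex, name)]


if __name__ == "__main__":
    entries = [
        "/seisdata.xml",
        "/seisdata/header[1].xml",
        "/seisdata/trace[1].bin",
        "/seisdata/trace[2].bin",
    ]
    print(select_entries(entries, "/seisdata/trace[*].bin"))
```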


XDATA: XML data     [parsable by DOM]
BDATA: Binary data  [non parsable]


Another use case, Word2000 HTML:
================================

Currently, it saves into a "folder"
with external metadata, images and sounds.
That could be zipped like this:

/html.xhtml
/html/head/metadata.bin
/html/body/img[1].gif
/html/body/img[2].jpg
/html/body/img[3].png
/html/body/img[4].bmp
/html/body/img[5].svg
/html/body/embed[1].mp3
/html/body/embed[2].mid
/html/body/embed[3].au

========================================================

References:
===========

XML Binary Characterization Use cases:
http://www.w3.org/TR/xbc-use-cases/

OASIS OpenDocument:
http://www.oasis-open.org/committees/download.php/10765/office-spec-1.0-cd-2.pdf

Compression Algorithm Comparison:
http://www.maximumcompression.com/data/text.php
http://www.maximumcompression.com/data/summary_sf.php
http://en.wikipedia.org/wiki/Comparsion_of_file_archivers
http://www.malayter.com/compressiontest.html
http://www.cs.tut.fi/~warp/ArchiverComparison/
http://flashlight.slad.cz/files/compression.pdf
http://www.compression.ru/artest/texts25.html

7-zip file format:
http://www.7-zip.org/7z.html
http://en.wikipedia.org/wiki/7-Zip
http://en.wikipedia.org/wiki/7z

Arj file format:
http://datacompression.info/ArchiveFormats/arj.txt
http://arj.sourceforge.net/

Rar file format:
http://www.geocities.com/marcoschmidt.geo/rar-archive-file-format.html

Zip file format:
http://www.geocities.com/marcoschmidt.geo/zip-archive-file-format.html
http://datacompression.info/Zip.shtml
http://www.faqs.org/rfcs/rfc1951.html
http://www.geocities.com/SiliconValley/Lakes/6686/zip-archive-file-format.html
http://en.wikipedia.org/wiki/ZIP_file_format

XPath tutorial:
http://www.w3schools.com/xpath/xpath_syntax.asp

CDATA tutorial:
http://www.w3schools.com/xml/xml_cdata.asp


=========================================================

Comments, suggestions, improvements and feedback are welcome! =)


Sincerely yours,

Fred.

Received on Friday, 18 February 2005 07:40:36 UTC