- From: Cutler, Roger (RogerCutler) <RogerCutler@chevrontexaco.com>
- Date: Fri, 18 Feb 2005 10:58:18 -0600
- To: "Fred P." <fprog26@hotmail.com>, public-xml-binary@w3.org
Since you mention the "Floating Point Arrays in the Energy Industry" usage case, I probably should comment: We have not found the performance of any compression algorithm adequate for our usage cases. We have extensive experience with this, and are very confident that compression is NOT a good thing for seismic data in this context. Quoting from the Usage Case document (which is quoting an industry expert), "Been there, done that, doesn't work, not interested". Doesn't mean it might not work for other use cases, of course. I'm not sure if I should say this, but I will -- Please don't think you know more about compression than our poeople. That would really be a mistake. We may be a bunch of redneck roughnecks in the field, but we've got a LONG history of cutting edge involvement with digital signal processing. We invented some of the key techniques, in fact. (That's not me personally, incidentally. I'm very modestly knowledgable about these things.) About your specific proposal for handling the seismic data (which is our contribution -- including an example dataset), compression aside, I still don't know. Is it really reasonable to fling millions of small files around? I recall that some operating systems don't like that at all. As a specific example, I have experience on Solaris Unix systems making directories containing hundreds of thousands of small auto-generated files. The OS choked -- really fundamentally choked -- if you tried to put them all in one directory. I was forced to make directory trees with leaf directories that had some max number of files in them (I used 1000, if I recall correctly). This necessitated, of course, a bunch of pain-in-the-neck logic and code. This was a while ago, so maybe things have improved -- I throw the experience out for what it is worth. But I am dubious and would certainly want to see demonstrations before committing to this approach. -----Original Message----- From: public-xml-binary-request@w3.org [mailto:public-xml-binary-request@w3.org] On Behalf Of Fred P. Sent: Friday, February 18, 2005 1:40 AM To: public-xml-binary@w3.org Subject: [XML-Binary] ZIP file format using XPATH for directory entries proposal Hi everyone, Here's some very straight-forward proposal: The following proposal is made to address, the following use cases: - 3.2 Floating Point Arrays in the Energy Industry - 3.3 X3D Graphics Model Compression, Serialization and Transmission - 3.5 Web Services within the Enterprise - 3.6 Embedding External Data in XML Documents - 3.7 Electronic Documents and to some point to this use case and others where complexity matters: - 3.4 Web Services for Small Devices Many here might know the OASIS OpenDocument format, which consist of a ZIP files of XML documents. The following idea is an extension of this idea. It was derived by looking at FixML 4.3, svg and seisdata and various other use cases, which needs binary content. Proposed name/extension: ------------------------ .BML = Binary Markup Language .ZML = Zipped Markup Language .7ML = 7-zip Markup Language .XMLZ = eXtensible Markup Language Zipped (similar to svgz) One of the use cases needed a 'very small footprint' for a decompressor. I looked around for commonly used compression format, they are mainly: bzip2, gzip, tar, jar/zip, arj, rar, ace, 7-zip bzip2 and gzip were already considered there main problem are: - They cannot random access a file (solid archive) - They cannot contain multiple file (mostly via tar) - They are complex to implement for small device. tar cannot be compressed, so it's eliminated. rar and ace are proprietary with only extracting algorithms being provided, so it's eliminated. arj was interesting, but not quite enough: + Random access + Can contain multiple file + Source code available on sourceforge (GPL) + Did about the same as zip for mixed and binary content - Did worst than zip on compression of english text, log file or sorted word list - Is claimed to be: "ARJ is a CPU-intensive program" zip/jar was interesting, but with some algo/size limitations: + Random access + Can contain multiple file + Algorithm is available + Source code available + Industry standard + Compress/Decompress about the same speed/size than gzip in fast mode - 2 GB limitation - Does not support Unicode file names - 1.5x to 3x bigger than 7-zip 7-zip was interesting, with some speed limitation: + Random access + Can contain multiple file + Algorithm is available + Source code available (LGPL) + Small code size for decompressing: about 5 KB + Very suitable for embedded applications + Small memory requirements for decompressing (depend from dictionary + size) Supports encryption with AES-256 algorithm Use LZMA derived from + LZ77 Support Unicode file names + Compression of archive headers + Support more than 2 GB content (2^64 bytes) - Not an industry standard - Decompressing speed: about 10-20 MB/s on 2 GHz CPU - Compressing speed: about 1 MB/s on 2 GHz CPU - 4 times slower than gzip when compressing/decompressing in fast mode -mx=1 - 10 times slower than gzip when compressing in maximum mode - Total time including transfer is 1.5x slower than gzip or zip ========================================================= As a result, zip and 7-zip can be considered. The obvious advantage of zip is that it is an industry standard and works fast/size and similar to gzip. The obvious advantage of 7-zip is file size over slow links (28Kbps or slower), Unicode file names and the small 5KB footprint for the decompressor. ========================================================= The goal is to use an XPath like syntax for directories within the archive. With 7-zip, this means Unicode Entities can be supported, while this is not possible with zip. Question to debate are: - Do we put file extensions or not? + Easier for external viewer. - XPATH Association must be done on the filename WITHOUT the extension. - Do we want to support 2 GB+ archives (zip64, 7-zip) ? - Do we want to support Unicode XPath (7-zip) ? - Is file size more important than compression speed ? - Should we support many compression scheme: gzip, bzip2, zip and 7-zip ? Assuming the following test case: http://www.w3.org/TR/xbc-use-cases/#3.2 <seisdata> <lineheader>...<lineheader> <header> <linename>westcam 2811</linename> <ntrace>1207</ntrace> <nsamp>3001</nsamp> <tstart>0.0</tstart> <tinc>4.0</tinc> <shot>120</shot> <geophone>7</geophone> </header> <trace>0.0 0.0 468.34 3.245672E04 6.9762345E05 ... (3001 floats)</trace> <header> ... </header> <trace> ... </trace> ... </seisdata> Would be stored like this: ========================== /seisdata.xml /seisdata/trace[1].bin /seisdata/trace[2].bin /seisdata/trace[3].bin /seisdata/trace[4].bin /seisdata/trace[5].bin /seisdata/trace[6].bin Where /seisdata.xml contains: ============================ <seisdata> <lineheader>...<lineheader> <header> <linename>westcam 2811</linename> <ntrace>1207</ntrace> <nsamp>3001</nsamp> <tstart>0.0</tstart> <tinc>4.0</tinc> <shot>120</shot> <geophone>7</geophone> </header> <trace><![BDATA[/seisdata/trace[1].bin]]></trace> <header> ... </header> <trace><![BDATA[/seisdata/trace[2].bin]]></trace> ... </seisdata> Where /seisdata/trace[1].bin contains opaque IEEE floating point binary digits. The <trace> node could be empty like this <trace></trace> The advantage of the placeholder is for conventional DOM manipulation to assess that such thing as zipped binary data exist... and should be fetched on the fly, as needed. In this case, if you just want the XML without the binary, it should load quickly and be easy to parse/modify/save. + This means accessing individual trace is extremelly fast and easy, since the archive does not have to be fully extracted or parsed. + Also, no encoding is needed. + It's very easy to create/modify/extract/view any files within the + archive. +++ Readability is preserved. +++ No big changes to existing XML tools/parser Another way, would be the following: ==================================== /seisdata.xml /seisdata/header[1].xml /seisdata/trace[1].bin /seisdata/header[2].xml /seisdata/trace[2].bin /seisdata/header[3].xml /seisdata/trace[3].bin /seisdata/header[4].xml /seisdata/trace[4].bin /seisdata/header[5].xml /seisdata/trace[5].bin /seisdata/header[6].xml /seisdata/trace[6].bin This means that new nodes are appended using zip operations instead. It also means that the XML parser must work a bit harder. It is also safer since the original is kept as is for financial, banking, crucial data, where a modification log is needed. /seisdata.xml could contain <![XDATA[/seisdata/header[6].xml]]> or not. Obviously, adding such placeholder for XML add-on would be penalising. However, an alternative scenario or syntax could be derived: <seisdata> <lineheader>...<lineheader> <header><![XDATA[/seisdata/header[*].xml]]></header> <trace><![BDATA[/seisdata/trace[*].bin]]></trace> </seisdata> i.e. pkunzip -d seisdata.zip /seisdata/trace[*].bin XDATA: XML data [parsable by DOM] BDATA: Binary data [non parsable] Another use case, Word2000 HTML: ================================ Currently, it save into a "folder" with external metadata, images, sounds. That could be zipped like this: /html.xhtml /html/head/metadata.bin /html/body/img[1].gif /html/body/img[2].jpg /html/body/img[3].png /html/body/img[4].bmp /html/body/img[5].svg /html/body/embed[1].mp3 /html/body/embed[2].mid /html/body/embed[3].au ======================================================== Reference: ========== XML Binary Characterization Use cases: http://www.w3.org/TR/xbc-use-cases/ OASIS OpenDocument: http://www.oasis-open.org/committees/download.php/10765/office-spec-1.0- cd-2.pdf Compression Algorithm Comparision: http://www.maximumcompression.com/data/text.php http://www.maximumcompression.com/data/summary_sf.php http://en.wikipedia.org/wiki/Comparsion_of_file_archivers http://www.malayter.com/compressiontest.html http://www.cs.tut.fi/~warp/ArchiverComparison/ http://flashlight.slad.cz/files/compression.pdf http://www.compression.ru/artest/texts25.html 7-zip file format: http://www.7-zip.org/7z.html http://en.wikipedia.org/wiki/7-Zip http://en.wikipedia.org/wiki/7z Arj file format: http://datacompression.info/ArchiveFormats/arj.txt http://arj.sourceforge.net/ Rar file format: http://www.geocities.com/marcoschmidt.geo/rar-archive-file-format.html Zip file format: http://www.geocities.com/marcoschmidt.geo/zip-archive-file-format.html http://datacompression.info/Zip.shtml http://www.faqs.org/rfcs/rfc1951.html http://www.geocities.com/SiliconValley/Lakes/6686/zip-archive-file-forma t.html http://en.wikipedia.org/wiki/ZIP_file_format XPath tutorial: http://www.w3schools.com/xpath/xpath_syntax.asp CDATA tutorial: http://www.w3schools.com/xml/xml_cdata.asp ========================================================= Comments, suggestions, improvements, feed back welcome ? =) Sincerely yours, Fred.
Received on Friday, 18 February 2005 16:59:12 UTC