RE: [XML-Binary] ZIP file format using XPATH for directory entries proposal from Fred P. on 2005-02-20 (public-xml-binary@w3.org from February 2005)

From: Fred P. <fprog26@hotmail.com>
Date: Sun, 20 Feb 2005 06:04:06 -0500
To: RogerCutler@chevrontexaco.com, public-xml-binary@w3.org
Message-ID: <BAY104-F412BE9CE92BAA0B02C6105A7600@phx.gbl>

Hi Mr. Cutler,

I ran a small experiment with two files:

hi.html:
<html><body>HI</body></html>

hi.pl:
#!/usr/bin/perl
print "hi\n";
exit;

pkzip -a -e0 hi.zip hi.html
pkzip -a -e0 hi.zip hi.pl
copy hi.zip hi2.zip
pkzip -a -e0 hi2.zip hi.zip

The last step checks whether there are any "translation/encoding" issues.
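
For readers without pkzip at hand, the same experiment can be approximated in Python. This is only a sketch: it assumes the standard zipfile module with ZIP_STORED as a stand-in for pkzip's -e0, and the helper name build_stored_zip is made up for illustration. The header bytes it writes (version, timestamps) will differ from the pkzip dump, but the stored data should be identical.

```python
import io
import zipfile

# File contents from the experiment, with the CRLF line endings seen in the dump.
HI_HTML = b"<html><body>HI</body></html>\r\n"
HI_PL = b'#!/usr/bin/perl\r\nprint "hi\\n";\r\nexit;\r\n'


def build_stored_zip(members):
    """Return the bytes of a zip whose members are stored without compression."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        for name, data in members:
            zf.writestr(name, data)
    return buf.getvalue()


hi_zip = build_stored_zip([("hi.html", HI_HTML), ("hi.pl", HI_PL)])
hi2_zip = build_stored_zip(
    [("hi.html", HI_HTML), ("hi.pl", HI_PL), ("hi.zip", hi_zip)]
)

# Stored members appear byte for byte inside the archive, even when nested.
print(HI_HTML in hi_zip, HI_PL in hi_zip, hi_zip in hi2_zip)
```

The two content lengths, 30 and 39 bytes, are the \x1E and ' (0x27) size fields visible in the dump.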

It gives the following zip output in binary:

PK\x3\x4\xA
\0\0\0\0\0D)T2\x10\x91\0\xDD\x1E
\0\0\0\x1E\0\0\0\x7\0\0\0
hi.html
<html><body>HI</body></html>\xD\xA

PK\x3\x4\xA
\0\0\0\0\0[)T2\xAE\xD8S0'
\0\0\0'\0\0\0\x5\0\0\0
hi.pl
#!/usr/bin/perl\xD\xA
print "hi\n";\xD\xA
exit;\xD\xA

PK\x3\x4\xA
\0\0\0\0\0i)T2{2\xE5F\xB\x1\0\0
\xB\x1\0\0\x6\0\0\0
hi.zip

PK\x3\x4\xA
\0\0\0\0\0D)T2\x10\x91\0\xDD
\x1E\0\0\0\x1E\0\0\0\x7\0\0\0
hi.html
<html><body>HI</body></html>\xD\xA

PK\x3\x4\xA
\0\0\0\0\0[)T2\xAE\xD8S0'\0\0\0'\0\0\0\x5\0\0\0
hi.pl
#!/usr/bin/perl\xD\xA
print "hi\n";\xD\xA
exit;\xD\xA

PK\x1\x2\x19\0\xA
\0\0\0\0
\0D)T2\x10\x91\0\xDD\x1E\0\0\0\x1E\0\0\0\x7
\0\0\0\0\0\0\0\x1\0 \0\0\0\0\0\0\0
hi.html

PK\x1\x2\x19\0\xA
\0\0\0\0\0[)T2\xAE\xD8S0'\0\0\0'\0\0\0\x5
\0\0\0\0\0\0\0\x1\0 \0\0\0C\0\0\0
hi.pl

PK\x5\x6
\0\0\0\0
\x2\0\x2\0h\0\0\0\x8D\0\0\0\0\0

PK\x1\x2\x19\0\xA
\0\0\0\0
\0D)T2\x10\x91\0\xDD
\x1E\0\0\0
\x1E\0\0\0\x7
\0\0\0\0\0\0\0\x1\0 \0\0\0\0\0\0\0
hi.html

PK\x1\x2\x19\0\xA
\0\0\0\0
\0[)T2\xAE\xD8S0'\0\0\0'\0\0\0\x5
\0\0\0\0\0\0\0\x1\0 \0\0\0C\0\0\0
hi.pl

PK\x1\x2\x19\0\xA
\0\0\0\0
\0i)T2{2\xE5F\xB\x1\0\0\xB\x1\0\0\x6
\0\0\0\0\0\0\0\x1\0 \0\0\0\x8D\0\0\0
hi.zip

PK\x5\x6
\0\0\0\0
\x3\0\x3\0\x9C\0\0\0\xBC\x1
\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0
\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0


As you can see, the stored text is not modified or encoded in any way,
so you can fseek() to your data and fread() it directly from the file.
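
A rough Python sketch of that direct read follows, with seek()/read() standing in for fseek()/fread(). It assumes the member is stored (method 0) and uses the zipfile module only to look up where the local header starts; read_stored_member is a hypothetical helper name.

```python
import io
import struct
import zipfile


def read_stored_member(fp, name):
    """Read a stored (uncompressed) member using only seek() and read()."""
    with zipfile.ZipFile(fp) as zf:  # used only to look up the header offset
        info = zf.getinfo(name)
    if info.compress_type != zipfile.ZIP_STORED:
        raise ValueError("member is compressed, so its bytes are not verbatim")

    fp.seek(info.header_offset)
    header = fp.read(30)  # fixed part of the local file header
    if header[:4] != b"PK\x03\x04":
        raise ValueError("no local file header at the recorded offset")
    name_len, extra_len = struct.unpack_from("<HH", header, 26)

    # The data starts right after the variable-length name and extra fields.
    fp.seek(info.header_offset + 30 + name_len + extra_len)
    return fp.read(info.file_size)


# Demo on an in-memory archive.
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_STORED) as zf:
    zf.writestr("hi.html", b"<html><body>HI</body></html>\r\n")
print(read_stored_member(archive, "hi.html"))
```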

You could also fwrite() over it in place, as long as the size of the data
does not change, but the CRC stored in the headers would no longer match
and would have to be recomputed.
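
One way the in-place rewrite could look is sketched below in Python. The assumptions are worth stating: the member is stored, the new data has exactly the old length, the archive has no trailing comment (so the end record is the last 22 bytes), and no data descriptor is in use. The CRC-32 is kept in two places, the local header and the central directory, so both are patched. patch_stored_member is a made-up name.

```python
import io
import struct
import zipfile
import zlib


def patch_stored_member(fp, name, new_data):
    """Overwrite a stored member in place and refresh both CRC-32 copies."""
    with zipfile.ZipFile(fp) as zf:
        info = zf.getinfo(name)
    if info.compress_type != zipfile.ZIP_STORED:
        raise ValueError("member is not stored")
    if len(new_data) != info.file_size:
        raise ValueError("new data must have exactly the old size")
    crc = struct.pack("<I", zlib.crc32(new_data) & 0xFFFFFFFF)

    # Local header: CRC-32 at offset 14, name/extra lengths at offsets 26/28.
    fp.seek(info.header_offset + 26)
    name_len, extra_len = struct.unpack("<HH", fp.read(4))
    fp.seek(info.header_offset + 14)
    fp.write(crc)
    fp.seek(info.header_offset + 30 + name_len + extra_len)
    fp.write(new_data)

    # End record (assumed comment-free): entry count at 10, directory offset at 16.
    fp.seek(-22, io.SEEK_END)
    eocd = fp.read(22)
    (count,) = struct.unpack_from("<H", eocd, 10)
    (pos,) = struct.unpack_from("<I", eocd, 16)
    for _ in range(count):
        fp.seek(pos)
        entry = fp.read(46)  # fixed part of a central directory entry
        n, m, k = struct.unpack_from("<HHH", entry, 28)
        (local_offset,) = struct.unpack_from("<I", entry, 42)
        if local_offset == info.header_offset:
            fp.seek(pos + 16)  # CRC-32 field of this directory entry
            fp.write(crc)
            return
        pos += 46 + n + m + k
    raise ValueError("central directory entry not found")


archive = io.BytesIO()
with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_STORED) as zf:
    zf.writestr("hi.html", b"<html><body>HI</body></html>\r\n")
    zf.writestr("hi.pl", b"exit;\r\n")
patch_stored_member(archive, "hi.html", b"<html><body>YO</body></html>\r\n")
with zipfile.ZipFile(archive) as zf:
    print(zf.testzip(), zf.read("hi.html"))
```

testzip() returning None means every member still passes its CRC check after the patch.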

The header "PK\x3\x4\xA" for each entry is not even translated
when you add a zip inside a zip.

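So you could simply scan for this string to locate zip entries, using
memchr(buf, 'P', len) and then !memcmp(p, "PK\x3\x4\xA", 5) on the pointer p
it returns, although it's not safe: member data is not escaped at all, so the
same byte string can also occur inside a stored file (the nested hi.zip above
is an example).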
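
The pitfall can be demonstrated with a small Python sketch. It scans for the four-byte signature "PK\x03\x04" rather than the five-byte string above, because Python's zipfile writes a different version byte than pkzip; the helper names are invented for the example.

```python
import io
import zipfile


def stored_zip(members):
    """Build an in-memory zip with every member stored uncompressed."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        for name, data in members:
            zf.writestr(name, data)
    return buf.getvalue()


def find_signatures(blob, sig=b"PK\x03\x04"):
    """Return every offset where the local-header signature occurs."""
    hits = []
    pos = blob.find(sig)
    while pos != -1:
        hits.append(pos)
        pos = blob.find(sig, pos + 1)
    return hits


files = [
    ("hi.html", b"<html><body>HI</body></html>\r\n"),
    ("hi.pl", b'#!/usr/bin/perl\r\nprint "hi\\n";\r\nexit;\r\n'),
]
inner = stored_zip(files)
outer = stored_zip(files + [("hi.zip", inner)])

with zipfile.ZipFile(io.BytesIO(outer)) as zf:
    real_entries = len(zf.namelist())
print(len(find_signatures(outer)), "signatures,", real_entries, "entries")
```

The scan reports five signatures for an archive with three entries, because the two headers inside the nested hi.zip match as well.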

Notice also that the filename/path is not encoded,
so it could be loaded via memcpy and searched for.

The conventional way is to read the length fields of each local header
(the uncompressed length at byte offset 22, LOCLEN), then fseek() past the
data to the next entry in the file stream and check its header info.

Since files are appended, you need to visit every entry header
using an O(n) algorithm to find your desired file.
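
A minimal Python sketch of such a linear walk follows. It assumes every entry records its sizes in the local header (no data descriptor, flag bit 3 clear) and that names are plain ASCII; index_local_headers is a hypothetical name. The returned dictionary is the kind of index that could be kept in memory afterwards.

```python
import io
import struct
import zipfile

LOCAL_SIG = b"PK\x03\x04"


def index_local_headers(fp):
    """Walk local headers front to back; return {name: (data_offset, size)}."""
    index = {}
    pos = 0
    while True:
        fp.seek(pos)
        header = fp.read(30)
        if len(header) < 30 or header[:4] != LOCAL_SIG:
            break  # reached the central directory or the end of the file
        (flags,) = struct.unpack_from("<H", header, 6)
        if flags & 0x08:
            raise ValueError("sizes are in a data descriptor, not the header")
        csize, usize, name_len, extra_len = struct.unpack_from("<IIHH", header, 18)
        name = fp.read(name_len).decode("cp437")
        data_offset = pos + 30 + name_len + extra_len
        index[name] = (data_offset, usize)
        pos = data_offset + csize  # next header follows the member data
    return index


archive = io.BytesIO()
with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_STORED) as zf:
    zf.writestr("hi.html", b"<html><body>HI</body></html>\r\n")
    zf.writestr("hi.pl", b"exit;\r\n")
index = index_local_headers(archive)
offset, size = index["hi.pl"]
archive.seek(offset)
print(index, archive.read(size))
```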

However, you can cache this index information in memory for future retrieval,
since those headers are quite small (30 bytes each, plus the filename/path).

So even for 1000 files, the index stays under 64 KB.
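
As a quick sanity check of that estimate, here is the arithmetic in Python. It assumes the 30-byte fixed part of a local file header and member names of the form /seisdata/trace[N].bin for N from 1 to 1000, which is an illustrative naming scheme rather than anything the format requires.

```python
LOCAL_HEADER_FIXED = 30  # bytes in a local file header before the name

# Hypothetical member names, one per trace.
names = [f"/seisdata/trace[{i}].bin" for i in range(1, 1001)]

index_bytes = sum(LOCAL_HEADER_FIXED + len(name) for name in names)
print(index_bytes, index_bytes < 64 * 1024)
```

Longer paths raise the total, so the 64 KB figure only holds while names stay short, roughly 35 bytes or less on average.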

Once you have found your desired /seisdata/trace[1].bin
you can fread() it directly into a float array and use it right away.
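
The float read might look like this in Python, with array.array('f') playing the role of the C float array. The sketch assumes the trace holds little-endian 32-bit IEEE 754 samples and is stored uncompressed; the member name is written without the leading slash, and read_float_trace is an invented helper.

```python
import array
import io
import struct
import sys
import zipfile


def read_float_trace(fp, name):
    """Load a stored member of raw little-endian float32 samples."""
    with zipfile.ZipFile(fp) as zf:
        info = zf.getinfo(name)
    if info.compress_type != zipfile.ZIP_STORED:
        raise ValueError("member is compressed; it cannot be read verbatim")

    # Skip the 30-byte fixed header plus the name and extra fields.
    fp.seek(info.header_offset + 26)
    name_len, extra_len = struct.unpack("<HH", fp.read(4))
    fp.seek(info.header_offset + 30 + name_len + extra_len)

    samples = array.array("f")
    samples.frombytes(fp.read(info.file_size))
    if sys.byteorder != "little":
        samples.byteswap()  # file is assumed little-endian
    return samples


archive = io.BytesIO()
with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_STORED) as zf:
    zf.writestr("seisdata/trace[1].bin", struct.pack("<3f", 1.0, -2.5, 0.25))
print(list(read_float_trace(archive, "seisdata/trace[1].bin")))
```

The direct read only works when the file's byte order and float format match what the reader expects, which is why the byte order is checked explicitly here.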

As I said before, there's no encoding/translation/compression for -e0,
so the data is packed as is.


The unzip algorithm can be found here, in fewer than 200 lines of code:

http://www.koders.com/c/fidC5CE35109E7F4A32464FB8B809E311E324085A6F.aspx

funzip.c file content can be found here:
http://computing.ee.ethz.ch/sepp/unzip-551-rs.SEPP/src/unzip-5.51/funzip.c


Sincerely yours,

Fred.

Received on Sunday, 20 February 2005 11:05:33 UTC