Re: [XML-Binary] ZIP file format using XPATH for directory entries proposal

Hi M. Williams,

>We don't want to get too far afield with operating system details, but the 
>'problem' with ls and the 'shell' issues mentioned in an earlier note are 
>things like ls doing a qsort by default on filenames before returning them 
>and operating system limitations of command line length.  While in 
>Unix/Linux command line length is relatively huge, wildcards must be 
>expanded by the shell before being passed to a command like ls which isn't 
>reasonable when the list needs to be gigantic.  As a result, you need to 
>use commands like 'find' and 'xargs' when dealing with large numbers of 
>files.

Exactly, although tcsh on Sun OS 5.9 seems to cope well with more than 
40,000 files.
I didn't have enough quota on /tmp to create more than 1GB.

>I believe that JAR files are usually compressed zip files, although 
>certainly the components don't need to be compressed.

I did some research on that.
In the old days of JDK 1.0.x and JDK 1.1.x,
JAR files were not compressed at all,
but compression was introduced in JDK 1.2.x.

To my knowledge, .zip files used to be compressed and .jar files were not;
nowadays, however, it seems that both may or may not be compressed using ZLIB.


Some people don't ship compressed JARs because of issues with old JREs,
or because the bandwidth saved is not worth the cost of decompression.
JAR = Java ARchive

http://www.rgagnon.com/javadetails/java-0153.html

>The problem I have with the ZIP file format approach to representing 
>arbitrary XML is that it's not going to be efficient for every case.

Some use cases, like FixML, are better served by a less verbose XML than by 
anything else,
or by other standards like WBXML:

http://www.w3.org/TR/wbxml/

Also, you might notice that JAR files have a META-INF/ directory,
which can be digitally signed and indexed.

Tools already exist on almost every platform to deal with this format.
http://java.sun.com/j2se/1.3/docs/guide/jar/jar.html

>Some of the characteristics that make it somewhat useful should be 
>considered in a new format, but it is designed for the granularity of 
>files, not tags, and it doesn't seem especially elegant for representing 
>many proposed instances of data.

Could you be more precise?

The granularity of files = XML tags, as far as I'm concerned,
since the content of a given file is the children of some XML tag,
and the file name without the extension is equivalent to the XPath of that 
XML tag.

I don't see what's not elegant in having:

/html.xml
/html/body/img[1].svgz
/html/body/img[2].gif
/html/body/img[3].png

It's very easy to parse: if you trim the extension, you get pure XPath with 
the content attached,
e.g. s/\.[a-zA-Z0-9_\$\!]+$//;

/html --> returns all child nodes of /html
/html/body/img[1] --> returns all child nodes of /html/body/img[1] --> binary SVGz
/html/body/img[2]
/html/body/img[3]
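
To make the mapping concrete, here is a rough Python sketch of the same idea. It assumes entry names are stored without a leading slash (as ZIP tools usually do) and that each entry carries exactly one extension; entry_to_xpath, index_archive and build_demo are names made up for this illustration, not part of any existing tool.

```python
import io
import re
import zipfile

# Same character class as the substitution above: strip one trailing extension.
EXT_RE = re.compile(r"\.[a-zA-Z0-9_$!]+$")


def entry_to_xpath(name: str) -> str:
    """Map a ZIP entry name to the XPath it stands for (hypothetical helper)."""
    path = name if name.startswith("/") else "/" + name
    return EXT_RE.sub("", path)


def index_archive(data: bytes) -> dict:
    """Return {xpath: entry name} for every entry of an in-memory ZIP."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return {entry_to_xpath(n): n for n in zf.namelist()}


def build_demo() -> bytes:
    """Build a small archive laid out like the listing above (dummy content)."""
    names = (
        "html.xml",
        "html/body/img[1].svgz",
        "html/body/img[2].gif",
        "html/body/img[3].png",
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in names:
            zf.writestr(name, b"placeholder")
    return buf.getvalue()


if __name__ == "__main__":
    for xpath, entry in index_archive(build_demo()).items():
        print(xpath, "<-", entry)
```

Looking up an XPath is then a plain dictionary access on the archive's directory, without opening any of the entries.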

Another thing is that some files may be compressed while others are not,
for efficiency purposes.

e.g. you compress the XML file,
but you don't compress binary dumps of float arrays, mp3, jpg, ...
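
As a sketch of that per-entry choice, Python's standard zipfile module lets each entry pick its own method (the compress_type argument of writestr needs Python 3.7 or later). The pack and compression_report helpers and the sample entries are invented for this example.

```python
import io
import zipfile


def pack(entries) -> bytes:
    """Build a ZIP from (name, data, compress) tuples, one method per entry."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, data, compress in entries:
            # DEFLATE for text-like content, plain storage for the rest.
            method = zipfile.ZIP_DEFLATED if compress else zipfile.ZIP_STORED
            zf.writestr(name, data, compress_type=method)
    return buf.getvalue()


def compression_report(data: bytes) -> dict:
    """Return {entry name: (method, original size, stored size)}."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return {
            info.filename: (info.compress_type, info.file_size, info.compress_size)
            for info in zf.infolist()
        }


if __name__ == "__main__":
    xml = b"<html><body>" + b"<img/>" * 200 + b"</body></html>"
    png = bytes(range(256)) * 4  # stand-in for already-compressed binary data
    archive = pack([
        ("html.xml", xml, True),
        ("html/body/img[3].png", png, False),
    ])
    for name, row in compression_report(archive).items():
        print(name, row)
```

The repetitive XML entry shrinks under DEFLATE, while the stored entry keeps its original size and costs nothing to decompress.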

The only problem I can see is Unicode XML tag names, which could be handled 
with 7-zip
or with another revision/extension of the ZIP file format.

Another way around would be to 'encode' them in &#decimal; notation or in 
HEX notation.

So, <Lotus\u2081\u2082\u2083> could become <Lotus&#8321;&#8322;&#8323;>

That could then be represented as one of these, maybe:

/Lotus&#8321;&#8322;&#8323;.xml

/Lotus#8321#8322#8323.xml

/Lotus%2081%2082%2083.xml
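
A quick Python sketch of the three notations, under the assumption that only non-ASCII characters get encoded. The function names and the style labels ("entity", "hash", "percent") are made up here. Note that the %XXXX form is not standard URL percent-encoding, and that only the &#decimal; form is self-delimiting, so it is the only one decoded below.

```python
import re

STYLES = ("entity", "hash", "percent")


def encode_name(tag: str, style: str = "entity") -> str:
    """Encode non-ASCII characters of a tag name for use in a file name."""
    if style not in STYLES:
        raise ValueError(f"unknown style: {style!r}")
    out = []
    for ch in tag:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)            # plain ASCII is kept as-is
        elif style == "entity":
            out.append(f"&#{cp};")    # decimal character reference
        elif style == "hash":
            out.append(f"#{cp}")      # decimal, no terminator
        else:
            out.append(f"%{cp:04X}")  # hex code point, at least 4 digits
    return "".join(out)


def decode_entity_name(name: str) -> str:
    """Reverse the 'entity' style only; the other two have no terminator."""
    return re.sub(r"&#(\d+);", lambda m: chr(int(m.group(1))), name)


if __name__ == "__main__":
    tag = "Lotus\u2081\u2082\u2083"
    for style in STYLES:
        print("/" + encode_name(tag, style) + ".xml")
```

The terminator matters: with the '#' or '%' forms, a digit that follows an encoded character in the tag name cannot be told apart from the code itself, so those two would need a fixed width or an escape rule to be reversible.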

If you have any more comments, suggestions, improvements or feedback,
please send them! =)

Sincerely yours,
Fred.

Received on Saturday, 19 February 2005 00:38:33 UTC