RE: Tagged/untagged size ratio (was: Re: [x3d-contributors] Arrays in XML Schema - Last Call Issue LC-84 - Schema WG response)

Wasn't this whole compression issue beaten into the ground in 1997?  

After following this thread, I don't understand the problem.  For arrays in
markup, Schema allows you to do fixed-length multi-dimensional arrays
through minOccurs and maxOccurs, which is a huge step forward from DTDs.  For
attributes and textOnly content, all you get are vectors, but they can also be
of fixed length.  If you want multidimensional arrays in textOnly, then have
an attribute on the open tag giving the shape of the array as a list of
numbers.  Encoding multidimensional arrays in attributes is a bit harder.
Or (tada) put the shape information into the appinfo tag - we're putting
together such a suggestion for child order.
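To make the shape-attribute idea concrete, here is a minimal sketch in Python. The element and attribute names (`matrix`, `shape`) are illustrative assumptions, not taken from any actual schema: the textOnly content is a whitespace-separated list, and the open tag carries the dimensions.

```python
import xml.etree.ElementTree as ET

# Hypothetical shape-attribute encoding: element content is a flat
# whitespace-separated list; the "shape" attribute gives the dimensions.
doc = '<matrix shape="2 3">1 2 3 4 5 6</matrix>'

elem = ET.fromstring(doc)
dims = [int(d) for d in elem.get("shape").split()]
flat = [int(v) for v in elem.text.split()]
assert len(flat) == dims[0] * dims[1]  # sanity-check declared shape

# Reshape the flat vector into rows of length dims[1].
matrix = [flat[i:i + dims[1]] for i in range(0, len(flat), dims[1])]
print(matrix)  # → [[1, 2, 3], [4, 5, 6]]
```

A validator could not enforce the shape constraint itself, but the receiving application can check it in one line, as above.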

CR is upon us.  Provide concrete evidence of how lack of the desired feature
significantly hampers development of your implementation.  Provide a
concrete proposal and demonstrate how it generally improves XSDL and cannot
be achieved by other means.

Matthew

> -----Original Message-----
> From: C. M. Sperberg-McQueen [mailto:cmsmcq@acm.org]
> Sent: Wednesday, October 11, 2000 2:45 PM
> To: ht@cogsci.ed.ac.uk
> Cc: Don Brutzman; Frank Olken; Joe D Willliams; Robert Miller; Jane
> Hunter; X3D Contributors; mpeg7-ddl; www-xml-schema-comments@w3.org;
> w3c-xml-schema-ig@w3.org; fallside@us.ibm.com
> Subject: Tagged/untagged size ratio (was: Re: 
> [x3d-contributors] Arrays
> in XML Schema - Last Call Issue LC-84 - Schema WG response
> 
> 
> At 2000-10-05 02:56, Henry S. Thompson wrote:
> >I compared an array of 614400 numbers of the form n.n encoded with
> >whitespace and markup separators:
> >
> >  wc raw64k
> >   614400  614400 2457600 raw64k
> >  wc cooked64k
> >   614400  614400 6758400 cooked64k
> >
> >So a factor of 2.75 larger as is
> >
> >   gzip -c raw64k | wc
> >        0       3    7218
> >   gzip -c cooked64k | wc
> >        0       3   23014
> >
> >and a factor of 3.2 compressed
> 
> I am puzzled here.  I have just spent a few minutes 
> generating 100, then
> 10,000, then 100,000, and finally 1,000,000 random integers, 
> and putting
> them into (a) a file, one integer per line and (b) an XML document,
> with each number tagged and each 10, 100, or 1000 numbers grouped into
> a superelement.  The start of the second file runs like this:
> 
>    <array>
>    <!--* Untagged random numbers generated by genarray.rexx
>        * 11 Oct 2000 15:21:40
>        *-->
>    <row>
>    <cell>78886</cell>
>    <cell>73598</cell>
>    <cell>66082</cell>
>    ...
> 
> Like Henry, I find the tagged version about three times as 
> large as the
> untagged version, in raw form:
> 
>    RAW
>    Size of matrix   Untagged     Tagged    Ratio tagged/untagged
> 
>    10 by 10              779      2,258    2.89
>    100 by 100         68,994    200,523    2.906
>    100 by 1000       688,955  1,990,484    2.889
>    1000 by 1000    6,887,949 19,902,978    2.889
> 
> But after running gzip, I find the ratio is not 3.2 but about 1.18
> on average:
> 
>    COMPRESSED
>    Size of matrix   Untagged     Tagged    Ratio tagged/untagged
> 
>    10 by 10              445        554    1.244
>    100 by 100         28,674     32,934    1.148
>    100 by 1000       278,165    323,894    1.164
>    1000 by 1000    2,771,874  3,236,422    1.167
> 
> This seems more plausible to me, since the two files should
> contain almost the same amount of raw information (the tagged
> version does, in my case, contain more information, since it
> explicitly tags row boundaries while the untagged version does
> not).
> 
> I don't draw any immediate conclusions from this except that
> under compression tagging does not necessarily triple the size of an
> array.
> 
> Michael Sperberg-McQueen
> 
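For anyone who wants to repeat the measurement, the quoted experiment can be sketched in a few lines of Python. This is a hedged reconstruction, not Michael's genarray.rexx: the `<row>`/`<cell>` element names come from his sample above, the matrix is scaled down to 100 by 100, and Python's gzip module stands in for the gzip command line.

```python
import gzip
import random

# Reproduce the tagged-vs-untagged comparison on a 100x100 matrix
# of random 5-digit integers (sizes and names are assumptions).
random.seed(0)
rows = [[random.randint(10000, 99999) for _ in range(100)]
        for _ in range(100)]

# Untagged: whitespace-separated numbers, one row per line.
raw = "\n".join(" ".join(str(n) for n in row) for row in rows)

# Tagged: each number in <cell>, each row in <row>, all in <array>.
tagged = ("<array>\n"
          + "\n".join("<row>\n"
                      + "\n".join(f"<cell>{n}</cell>" for n in row)
                      + "\n</row>" for row in rows)
          + "\n</array>")

raw_gz = gzip.compress(raw.encode())
tagged_gz = gzip.compress(tagged.encode())

ratio_raw = len(tagged) / len(raw)
ratio_gz = len(tagged_gz) / len(raw_gz)
print(f"raw ratio: {ratio_raw:.2f}, compressed ratio: {ratio_gz:.2f}")
```

The uncompressed ratio lands near 3, matching both tables above, while the compressed ratio is far smaller, since gzip's dictionary absorbs the highly repetitive tags; this supports Michael's point that tagging does not necessarily triple the size once compression is applied.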

Received on Wednesday, 11 October 2000 18:54:11 UTC