Tagged/untagged size ratio (was: Re: [x3d-contributors] Arrays in XML Schema - Last Call Issue LC-84 - Schema WG response)

From: C. M. Sperberg-McQueen <cmsmcq@acm.org>
Date: Wed, 11 Oct 2000 15:44:40 -0600
Message-Id: <4.3.2.7.1.20001011152501.024227e8@espanola.com>
To: ht@cogsci.ed.ac.uk (Henry S. Thompson)
Cc: Don Brutzman <brutzman@nps.navy.mil>, Frank Olken <olken@lbl.gov>, Joe D Willliams <JOEDWIL@earthlink.net>, Robert Miller <Robert.Miller@gxs.ge.com>, Jane Hunter <jane@dstc.edu.au>, X3D Contributors <x3d-contributors@web3d.org>, mpeg7-ddl <mpeg7-ddl@darmstadt.gmd.de>, www-xml-schema-comments@w3.org, w3c-xml-schema-ig@w3.org, fallside@us.ibm.com
At 2000-10-05 02:56, Henry S. Thompson wrote:
>I compared an array of 614400 numbers of the form n.n encoded with
>whitespace and markup separators:
>
>  wc raw64k
>   614400  614400 2457600 raw64k
>  wc cooked64k
>   614400  614400 6758400 cooked64k
>
>So a factor of 2.75 larger as is
>
>   gzip -c raw64k | wc
>        0       3    7218
>   gzip -c cooked64k | wc
>        0       3   23014
>
>and a factor of 3.2 compressed

I am puzzled here.  I have just spent a few minutes generating 100, then
10,000, then 100,000, and finally 1,000,000 random integers, and putting
them into (a) a plain file, one integer per line, and (b) an XML document,
with each number tagged and each run of 10, 100, or 1000 numbers grouped
into a superelement.  The start of the second file runs like this:

   <array>
   <!--* Untagged random numbers generated by genarray.rexx
       * 11 Oct 2000 15:21:40
       *-->
   <row>
   <cell>78886</cell>
   <cell>73598</cell>
   <cell>66082</cell>
   ...
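
A minimal Python sketch of the two layouts described above, for illustration
only. The original generator (genarray.rexx) is not shown, so the function
names, the seed, and the 0..999999 number range are assumptions, and the
comment header of the XML file is omitted.

```python
import random


def make_rows(n_rows, n_cols, seed=0, upper=999_999):
    """Return n_rows lists of n_cols random integers (range is an assumption)."""
    rng = random.Random(seed)
    return [[rng.randint(0, upper) for _ in range(n_cols)] for _ in range(n_rows)]


def untagged(rows):
    """One integer per line; row boundaries are not recorded."""
    return "".join(f"{n}\n" for row in rows for n in row)


def tagged(rows):
    """Each number in <cell>, each row grouped in <row>, all inside <array>."""
    lines = ["<array>"]
    for row in rows:
        lines.append("<row>")
        lines.extend(f"<cell>{n}</cell>" for n in row)
        lines.append("</row>")
    lines.append("</array>")
    return "\n".join(lines) + "\n"


rows = make_rows(100, 100)
raw_ratio = len(tagged(rows)) / len(untagged(rows))
print(f"raw tagged/untagged ratio: {raw_ratio:.3f}")
```

Each cell adds a fixed 13 characters of markup to a number that is only a
few digits long, which is why the raw ratio lands near three.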

Like Henry, I find the tagged version about three times as large as the
untagged version, in raw form:

   RAW
   Size of matrix   Untagged     Tagged    Ratio tagged/untagged

   10 by 10              779      2,258    2.89
   100 by 100         68,994    200,523    2.906
   100 by 1000       688,955  1,990,484    2.889
   1000 by 1000    6,887,949 19,902,978    2.889

But after running gzip, I find the ratio is not 3.2 but about 1.18
on average:

   COMPRESSED
   Size of matrix   Untagged     Tagged    Ratio tagged/untagged

   10 by 10              445        554    1.244
   100 by 100         28,674     32,934    1.148
   100 by 1000       278,165    323,894    1.164
   1000 by 1000    2,771,874  3,236,422    1.167
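
The comparison above can be approximated with Python's standard gzip module.
This is a sketch under stated assumptions, not the procedure used for the
tables: the number range, the seed, the row length, and the helper name
`encode_both` are all hypothetical, and `gzip.compress` uses its own default
compression level, so the exact byte counts will differ from the gzip
command line.

```python
import gzip
import random


def encode_both(numbers, per_row=100):
    """Return (untagged, tagged) encodings of the same numbers as bytes."""
    plain = "".join(f"{n}\n" for n in numbers)
    parts = ["<array>\n"]
    for start in range(0, len(numbers), per_row):
        parts.append("<row>\n")
        parts.extend(f"<cell>{n}</cell>\n" for n in numbers[start:start + per_row])
        parts.append("</row>\n")
    parts.append("</array>\n")
    return plain.encode("ascii"), "".join(parts).encode("ascii")


rng = random.Random(0)  # fixed seed so the run is repeatable
numbers = [rng.randint(0, 999_999) for _ in range(10_000)]  # assumed range

plain, marked = encode_both(numbers)
raw_ratio = len(marked) / len(plain)
gz_ratio = len(gzip.compress(marked)) / len(gzip.compress(plain))

print(f"raw ratio:        {raw_ratio:.3f}")
print(f"compressed ratio: {gz_ratio:.3f}")
```

The tags repeat identically around every number, so the compressor can
encode them cheaply, while the random digits cost about the same in both files.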

This seems more plausible to me, since the two files should
contain almost the same amount of raw information (the tagged
version does, in my case, contain more information, since it
explicitly tags row boundaries while the untagged version does
not).

I don't draw any immediate conclusions from this except that,
under compression, tagging does not necessarily triple the size of an
array.

Michael Sperberg-McQueen
Received on Wednesday, 11 October 2000 17:56:38 GMT