RE: Tagged/untagged size ratio (was: Re: [x3d-contributors] Arrays in XML Schema - Last Call Issue LC-84 - Schema WG response from Matt Timmermans on 2000-10-11 (www-xml-schema-comments@w3.org from October to December 2000)

From: Matt Timmermans <mtimmerm@opentext.com>
Date: Wed, 11 Oct 2000 18:37:44 -0400
To: "'C. M. Sperberg-McQueen'" <cmsmcq@acm.org>, "'Henry S. Thompson'" <ht@cogsci.ed.ac.uk>
Cc: "'Don Brutzman'" <brutzman@nps.navy.mil>, "'Frank Olken'" <olken@lbl.gov>, "'Joe D Willliams'" <JOEDWIL@earthlink.net>, "'Robert Miller'" <Robert.Miller@gxs.ge.com>, "'Jane Hunter'" <jane@dstc.edu.au>, "'X3D Contributors'" <x3d-contributors@web3d.org>, "'mpeg7-ddl'" <mpeg7-ddl@darmstadt.gmd.de>, <www-xml-schema-comments@w3.org>, <w3c-xml-schema-ig@w3.org>, <fallside@us.ibm.com>
Message-ID: <000c01c033d3$e0217920$8f82a8c0@ott.opentext.com>

I got this 3 times, so perhaps it's directed to too many lists, and I
apologize to people who get this 3 times or more.

In any case, the discrepancy comes from how each of you generated your numbers.

In Henry's case:

> >  wc raw64k
> >   614400  614400 2457600 raw64k
> >  wc cooked64k
> >   614400  614400 6758400 cooked64k
> >
> >So a factor of 2.75 larger as is
> >
> >   gzip -c raw64k | wc
> >        0       3    7218
> >   gzip -c cooked64k | wc
> >        0       3   23014

we see incredible compression ratios, suggesting that his numbers were all
the same (probably all 0), or at least had _very_ little variation.

The tagging still compresses reasonably well, though.  23014-7218 = 15796
bytes to represent all 614400 tag pairs -- about 0.026 bytes/tag_pair.  Again,
this is possible due to the incredible uniformity in the file.
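
As a quick check of that arithmetic, here is a minimal Python sketch.  The
function name is my own, and it assumes exactly one tag pair per line of
Henry's files (614400 lines), which is how the figure above is computed.

```python
def bytes_per_tag_pair(raw_gz, tagged_gz, tag_pairs):
    """Extra compressed bytes attributable to each tag pair."""
    return (tagged_gz - raw_gz) / tag_pairs

# Figures taken from Henry's `wc` output quoted above.
henry_raw_gz = 7218        # gzip -c raw64k | wc
henry_tagged_gz = 23014    # gzip -c cooked64k | wc
henry_tag_pairs = 614400   # assumed: one tag pair per line

henry_overhead = bytes_per_tag_pair(henry_raw_gz, henry_tagged_gz, henry_tag_pairs)
print(f"{henry_overhead:.4f} bytes/tag_pair")
```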

Michael's stats are more realistic:
>    RAW
>    Size of matrix   Untagged     Tagged    Ratio tagged/untagged
>    1000 by 1000    6,887,949 19,902,978    2.889
>    COMPRESSED
>    1000 by 1000    2,771,874  3,236,422    1.167

Michael's numbers are random, and can only be compressed because they don't
use the entire 8-bit character set.  The result is closer to typical for
numeric data -- about 2.6 bits/symbol.  Real numeric data will often
compress to about 2 bits/symbol.

And here we see about 0.46 bytes/tag_pair spent in the compressed file.
It's a lot more than in Henry's, but still pretty reasonable.  Henry did so
much better only because gzip works by looking for repeated strings.
Michael's tags repeat, but Henry's entire file has very large repeating
sections.  Michael has to spend the extra bits to indicate where the tags
go.
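
The effect can be reproduced in miniature.  The sketch below is only an
illustration under my own assumptions: it uses Python's zlib module (the same
DEFLATE algorithm as gzip, with a different header), a made-up tag name
`<i>`, one integer per line, and an arbitrary array size.  It does not
reproduce either Henry's or Michael's actual files.

```python
import random
import zlib

N = 20_000  # hypothetical array size, chosen only so this runs quickly

def overhead_per_tag_pair(values):
    """Compressed bytes added per value when each is wrapped in <i>...</i>."""
    raw = "".join(f"{v}\n" for v in values).encode("ascii")
    tagged = "".join(f"<i>{v}</i>\n" for v in values).encode("ascii")
    raw_size = len(zlib.compress(raw, 9))
    tagged_size = len(zlib.compress(tagged, 9))
    return (tagged_size - raw_size) / len(values)

rng = random.Random(0)  # fixed seed so repeated runs give the same data

# All-zero data: the whole file is one long repeated string.
uniform_overhead = overhead_per_tag_pair([0] * N)

# Random data: only the tags repeat, so each one must be located separately.
random_overhead = overhead_per_tag_pair(
    [rng.randrange(1_000_000) for _ in range(N)]
)

print(f"uniform data: {uniform_overhead:.4f} bytes/tag_pair")
print(f"random data:  {random_overhead:.4f} bytes/tag_pair")
```

The exact values depend on the data and the compressor settings, but the
uniform case should come out far cheaper per tag pair than the random one.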

In actual practice, for uniform coding of integer arrays and matrices, you
can probably expect to spend about 0.2 bytes/tag_pair, if you're careful to
tag every integer in exactly the same way.  Michael appears to use more only
because, as he said, his tagging also indicates row boundaries, while the
raw markup did not.
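
To put the two figures side by side, here is a small sketch using the
compressed sizes from Michael's table.  It assumes one tag pair per matrix
element (1,000,000 for the 1000 by 1000 case) and ignores the extra row
tags; the 0.2 bytes/tag_pair figure is my estimate above, not a measurement.

```python
# Compressed sizes from Michael's table (1000 by 1000 matrix).
untagged_gz = 2_771_874
tagged_gz = 3_236_422
elements = 1000 * 1000  # assumed: one tag pair per matrix element

observed_per_pair = (tagged_gz - untagged_gz) / elements
estimated_per_pair = 0.2  # rough estimate for uniform tagging, from the text

# Total compressed overhead implied by each figure, in bytes.
observed_total = tagged_gz - untagged_gz
estimated_total = estimated_per_pair * elements

print(f"observed:  {observed_per_pair:.2f} bytes/tag_pair, {observed_total} bytes total")
print(f"estimated: {estimated_per_pair:.2f} bytes/tag_pair, {estimated_total:.0f} bytes total")
```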

Received on Wednesday, 11 October 2000 18:39:13 UTC