- From: Matt Timmermans <mtimmerm@opentext.com>
- Date: Wed, 11 Oct 2000 18:37:44 -0400
- To: "'C. M. Sperberg-McQueen'" <cmsmcq@acm.org>, "'Henry S. Thompson'" <ht@cogsci.ed.ac.uk>
- Cc: "'Don Brutzman'" <brutzman@nps.navy.mil>, "'Frank Olken'" <olken@lbl.gov>, "'Joe D Willliams'" <JOEDWIL@earthlink.net>, "'Robert Miller'" <Robert.Miller@gxs.ge.com>, "'Jane Hunter'" <jane@dstc.edu.au>, "'X3D Contributors'" <x3d-contributors@web3d.org>, "'mpeg7-ddl'" <mpeg7-ddl@darmstadt.gmd.de>, <www-xml-schema-comments@w3.org>, <w3c-xml-schema-ig@w3.org>, <fallside@us.ibm.com>
I got this 3 times, so perhaps it's directed to too many lists, and I apologize to people who get this 3 times or more.

In any case, the discrepancy is with how you generated your numbers. In Henry's case:

> > wc raw64k
> >  614400  614400 2457600 raw64k
> > wc cooked64k
> >  614400  614400 6758400 cooked64k
> >
> >So a factor of 2.75 larger as is
> >
> > gzip -c raw64k | wc
> >       0       3    7218
> > gzip -c cooked64k | wc
> >       0       3   23014

we see incredible compression ratios, suggesting that his numbers were all the same (probably all 0), or at least have _very_ little variation. The tagging still compresses reasonably well, though. 23014-7218 = 15796 bytes to represent all 614400 tags -- about 0.025 bytes/tag_pair. Again, this is possible due to the incredible uniformity in the file.

Michael's stats are more realistic:

> RAW
> Size of matrix    Untagged     Tagged       Ratio tagged/untagged
> 1000 by 1000      6,887,949    19,902,978   2.889
> COMPRESSED
> 1000 by 1000      2,771,874    3,236,422    1.167

Michael's numbers are random, and can only be compressed because they don't use the entire 8-bit character set. The result is closer to typical for numeric data -- about 2.6 bits/symbol. Real numeric data will often compress to about 2 bits/symbol.

And here we see about 0.46 bytes/tag_pair spent in the compressed file (3,236,422 - 2,771,874 = 464,548 bytes for 1,000,000 tagged values). It's a lot more than in Henry's, but still pretty reasonable. Henry did so much better only because gzip works by looking for repeated strings. Michael's tags repeat, but Henry's entire file has very large repeating sections. Michael has to spend the extra bits to indicate where the tags go.

In actual practice, for uniform coding of integer arrays and matrices, you can probably expect to spend about 0.2 bytes/tag_pair, if you're careful to tag every integer in exactly the same way. Michael appears to use more only because, as he said, his tagging also indicates row boundaries, while the raw markup did not.
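For anyone who wants to check the per-tag-pair figures, here is a minimal sketch of the arithmetic in Python. The function name is made up for illustration, and it assumes one tag pair per value: 614400 for Henry's file (one per line of wc output) and 1000 * 1000 for Michael's matrix. The byte counts are the ones quoted in this message.

```python
def bytes_per_tag_pair(tagged_gz: int, untagged_gz: int, tag_pairs: int) -> float:
    """Extra compressed bytes attributable to the markup, per tagged value."""
    return (tagged_gz - untagged_gz) / tag_pairs


# Henry's gzip'd sizes; assumes one tag pair per line (614400 lines).
henry = bytes_per_tag_pair(23014, 7218, 614400)

# Michael's gzip'd sizes; assumes one tag pair per matrix element.
michael = bytes_per_tag_pair(3236422, 2771874, 1000 * 1000)

print(f"Henry:   {henry:.4f} bytes/tag_pair")
print(f"Michael: {michael:.4f} bytes/tag_pair")
```

Michael's tagging also marks row boundaries, so dividing by element count alone slightly overstates what each element tag costs.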
Received on Wednesday, 11 October 2000 18:39:13 UTC