- From: C. M. Sperberg-McQueen <cmsmcq@acm.org>
- Date: Wed, 11 Oct 2000 15:44:40 -0600
- To: ht@cogsci.ed.ac.uk (Henry S. Thompson)
- Cc: Don Brutzman <brutzman@nps.navy.mil>, Frank Olken <olken@lbl.gov>, Joe D Willliams <JOEDWIL@earthlink.net>, Robert Miller <Robert.Miller@gxs.ge.com>, Jane Hunter <jane@dstc.edu.au>, X3D Contributors <x3d-contributors@web3d.org>, mpeg7-ddl <mpeg7-ddl@darmstadt.gmd.de>, www-xml-schema-comments@w3.org, w3c-xml-schema-ig@w3.org, fallside@us.ibm.com
At 2000-10-05 02:56, Henry S. Thompson wrote:

>I compared an array of 614400 numbers of the form n.n encoded with
>whitespace and markup separators:
>
> > wc raw64k
>  614400  614400 2457600 raw64k
> > wc cooked64k
>  614400  614400 6758400 cooked64k
>
>So a factor of 2.75 larger as is
>
> > gzip -c raw64k | wc
>       0       3    7218
> > gzip -c cooked64k | wc
>       0       3   23014
>
>and a factor of 3.2 compressed

I am puzzled here. I have just spent a few minutes generating 100, then 10,000, then 100,000, and finally 1,000,000 random integers, and putting them into (a) a file, one integer per line, and (b) an XML document, with each number tagged and each 10, 100, or 1000 numbers grouped into a superelement. The start of the second file runs like this:

    <array>
    <!--* Untagged random numbers generated by genarray.rexx
        * 11 Oct 2000 15:21:40
        *-->
    <row>
    <cell>78886</cell>
    <cell>73598</cell>
    <cell>66082</cell>
    ...

Like Henry, I find the tagged version about three times as large as the untagged version, in raw form:

    RAW
    Size of matrix     Untagged       Tagged   Ratio tagged/untagged
    10 by 10                779        2,258   2.89
    100 by 100           68,994      200,523   2.906
    100 by 1000         688,955    1,990,484   2.889
    1000 by 1000      6,887,949   19,902,978   2.889

But after running gzip, I find the ratio is not 3.2 but about 1.18 on average:

    COMPRESSED
    Size of matrix     Untagged       Tagged   Ratio tagged/untagged
    10 by 10                445          554   1.244
    100 by 100           28,674       32,934   1.148
    100 by 1000         278,165      323,894   1.164
    1000 by 1000      2,771,874    3,236,422   1.167

This seems more plausible to me, since the two files should contain almost the same amount of raw information (the tagged version does, in my case, contain more information, since it explicitly tags row boundaries while the untagged version does not).

I don't draw any immediate conclusions from this except that, under compression, tagging does not necessarily triple the size of an array.

Michael Sperberg-McQueen
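
For anyone who wants to repeat the comparison described in this message, here is a minimal sketch in Python. It is not the genarray.rexx script mentioned in the message, which is not shown here. The function names, the five-digit integer range, the fixed seed, and the use of Python's gzip module in place of the gzip command are all assumptions made for illustration, so the byte counts will not match the tables exactly.

```python
import gzip
import random


def build_files(rows, cols, seed=0):
    """Return (untagged, tagged) byte strings for a rows x cols array.

    Assumption: values are random five-digit integers; the original
    message does not state the range that was used.
    """
    rng = random.Random(seed)
    values = [[rng.randint(10000, 99999) for _ in range(cols)]
              for _ in range(rows)]

    # (a) one integer per line, with no row boundaries marked
    untagged = "".join(f"{v}\n" for row in values for v in row)

    # (b) each number in <cell>, each row of numbers in <row>
    parts = ["<array>\n"]
    for row in values:
        parts.append("<row>\n")
        parts.extend(f"<cell>{v}</cell>\n" for v in row)
        parts.append("</row>\n")
    parts.append("</array>\n")
    tagged = "".join(parts)

    return untagged.encode("ascii"), tagged.encode("ascii")


def ratios(rows, cols, seed=0):
    """Return (raw_ratio, compressed_ratio) of tagged size over untagged size."""
    untagged, tagged = build_files(rows, cols, seed)
    raw_ratio = len(tagged) / len(untagged)
    compressed_ratio = len(gzip.compress(tagged)) / len(gzip.compress(untagged))
    return raw_ratio, compressed_ratio


if __name__ == "__main__":
    for rows, cols in [(10, 10), (100, 100), (100, 1000)]:
        raw, compressed = ratios(rows, cols)
        print(f"{rows} by {cols}: raw {raw:.3f}, compressed {compressed:.3f}")
```

The point of the sketch is only the shape of the result: the repeated tag strings are highly redundant, so the compressed ratio should come out well below the raw ratio, while the random digits themselves compress about equally in both files.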
Received on Wednesday, 11 October 2000 17:56:38 UTC