- From: C. M. Sperberg-McQueen <cmsmcq@acm.org>
- Date: Wed, 11 Oct 2000 15:44:40 -0600
- To: ht@cogsci.ed.ac.uk (Henry S. Thompson)
- Cc: Don Brutzman <brutzman@nps.navy.mil>, Frank Olken <olken@lbl.gov>, Joe D Willliams <JOEDWIL@earthlink.net>, Robert Miller <Robert.Miller@gxs.ge.com>, Jane Hunter <jane@dstc.edu.au>, X3D Contributors <x3d-contributors@web3d.org>, mpeg7-ddl <mpeg7-ddl@darmstadt.gmd.de>, www-xml-schema-comments@w3.org, w3c-xml-schema-ig@w3.org, fallside@us.ibm.com
At 2000-10-05 02:56, Henry S. Thompson wrote:
>I compared an array of 614400 numbers of the form n.n encoded with
>whitespace and markup separators:
>
> wc raw64k
> 614400 614400 2457600 raw64k
> wc cooked64k
> 614400 614400 6758400 cooked64k
>
>So a factor of 2.75 larger as is
>
> gzip -c raw64k | wc
> 0 3 7218
> gzip -c cooked64k | wc
> 0 3 23014
>
>and a factor of 3.2 compressed
I am puzzled here. I have just spent a few minutes generating 100, then
10,000, then 100,000, and finally 1,000,000 random integers, and putting
them into (a) a file, one integer per line and (b) an XML document,
with each number tagged and each 10, 100, or 1000 numbers grouped into
a superelement. The start of the second file runs like this:
<array>
<!--* Untagged random numbers generated by genarray.rexx
* 11 Oct 2000 15:21:40
*-->
<row>
<cell>78886</cell>
<cell>73598</cell>
<cell>66082</cell>
...
Like Henry, I find the tagged version about three times as large as the
untagged version, in raw form:
RAW

Size of matrix     Untagged       Tagged   Ratio tagged/untagged
10 by 10                779        2,258   2.89
100 by 100           68,994      200,523   2.906
100 by 1000         688,955    1,990,484   2.889
1000 by 1000      6,887,949   19,902,978   2.889
But after running gzip, I find the ratio is not 3.2 but about 1.18
on average:
COMPRESSED

Size of matrix     Untagged       Tagged   Ratio tagged/untagged
10 by 10                445          554   1.244
100 by 100           28,674       32,934   1.148
100 by 1000         278,165      323,894   1.164
1000 by 1000      2,771,874    3,236,422   1.167
This seems more plausible to me, since the two files should
contain almost the same amount of raw information (the tagged
version does, in my case, contain more information, since it
explicitly tags row boundaries while the untagged version does
not).
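For anyone who wants to repeat the comparison, here is a minimal Python sketch of the experiment described above. It is not the genarray.rexx script that produced the figures in this message. The function names, the value range 0 to 99999, the fixed seed, and the LF line endings are all assumptions made for illustration, so the byte counts it gives will not match the tables exactly; only the general pattern (a raw ratio near 3, a much smaller ratio after gzip) should be comparable.

```python
import gzip
import random


def build_files(rows, cols, seed=0):
    """Return (untagged, tagged) byte strings for a rows-by-cols array.

    Assumption: values are uniform random integers in 0..99999.
    """
    rng = random.Random(seed)
    plain_lines = []
    xml_lines = ["<array>"]
    for _ in range(rows):
        xml_lines.append("<row>")
        for _ in range(cols):
            n = rng.randint(0, 99999)
            plain_lines.append(str(n))             # one integer per line
            xml_lines.append(f"<cell>{n}</cell>")  # same integer, tagged
        xml_lines.append("</row>")
    xml_lines.append("</array>")
    untagged = ("\n".join(plain_lines) + "\n").encode("ascii")
    tagged = ("\n".join(xml_lines) + "\n").encode("ascii")
    return untagged, tagged


def size_ratios(rows, cols, seed=0):
    """Return (raw_ratio, gzip_ratio), each computed as tagged/untagged."""
    untagged, tagged = build_files(rows, cols, seed)
    raw = len(tagged) / len(untagged)
    packed = len(gzip.compress(tagged)) / len(gzip.compress(untagged))
    return raw, packed


if __name__ == "__main__":
    for rows, cols in [(10, 10), (100, 100), (100, 1000)]:
        raw, packed = size_ratios(rows, cols)
        print(f"{rows} by {cols}: raw {raw:.3f}, gzip {packed:.3f}")
```

The sketch compresses in memory with gzip.compress rather than calling the gzip command, so the compressed sizes omit the stored file name that the command-line tool adds to its header; the effect on the ratios is small for all but the smallest array.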
I don't draw any immediate conclusions from this except that,
under compression, tagging does not necessarily triple the size of an
array.
Michael Sperberg-McQueen
Received on Wednesday, 11 October 2000 17:56:38 UTC