- From: Fuchs, Matthew <matthew.fuchs@commerceone.com>
- Date: Wed, 11 Oct 2000 15:54:00 -0700
- To: "'C. M. Sperberg-McQueen'" <cmsmcq@acm.org>, ht@cogsci.ed.ac.uk
- Cc: Don Brutzman <brutzman@nps.navy.mil>, Frank Olken <olken@lbl.gov>, Joe D Willliams <JOEDWIL@earthlink.net>, Robert Miller <Robert.Miller@gxs.ge.com>, Jane Hunter <jane@dstc.edu.au>, X3D Contributors <x3d-contributors@web3d.org>, mpeg7-ddl <mpeg7-ddl@darmstadt.gmd.de>, www-xml-schema-comments@w3.org, w3c-xml-schema-ig@w3.org, fallside@us.ibm.com
Wasn't this whole compression issue beaten into the ground in 1997? After following this thread, I don't understand the problem.

For arrays in markup, Schema allows you to do fixed-length multi-dimensional arrays through minOccurs and maxOccurs, which is a huge step forward from DTDs. For attributes and textOnly content, all you get are vectors, but they too can be of fixed length. If you want multi-dimensional arrays in textOnly content, then have an attribute on the open tag giving the shape of the array as a list of numbers. Encoding multi-dimensional arrays in attributes is a bit harder. Or (tada) put the shape information into the appinfo tag; we're putting together such a suggestion for child order.

CR (Candidate Recommendation) is upon us. Provide concrete evidence of how lack of the desired feature significantly hampers development of your implementation. Provide a concrete proposal and demonstrate how it generally improves XSDL and cannot be achieved by other means.

Matthew

> -----Original Message-----
> From: C. M. Sperberg-McQueen [mailto:cmsmcq@acm.org]
> Sent: Wednesday, October 11, 2000 2:45 PM
> To: ht@cogsci.ed.ac.uk
> Cc: Don Brutzman; Frank Olken; Joe D Willliams; Robert Miller; Jane
> Hunter; X3D Contributors; mpeg7-ddl; www-xml-schema-comments@w3.org;
> w3c-xml-schema-ig@w3.org; fallside@us.ibm.com
> Subject: Tagged/untagged size ratio (was: Re: [x3d-contributors] Arrays
> in XML Schema - Last Call Issue LC-84 - Schema WG response)
>
> At 2000-10-05 02:56, Henry S. Thompson wrote:
> >I compared an array of 614400 numbers of the form n.n encoded with
> >whitespace and markup separators:
> >
> >  wc raw64k
> >   614400  614400 2457600 raw64k
> >  wc cooked64k
> >   614400  614400 6758400 cooked64k
> >
> >So a factor of 2.75 larger as is
> >
> >  gzip -c raw64k | wc
> >        0       3    7218
> >  gzip -c cooked64k | wc
> >        0       3   23014
> >
> >and a factor of 3.2 compressed
>
> I am puzzled here.
> I have just spent a few minutes generating 100, then 10,000, then
> 100,000, and finally 1,000,000 random integers, and putting them into
> (a) a file, one integer per line, and (b) an XML document, with each
> number tagged and each 10, 100, or 1000 numbers grouped into a
> superelement. The start of the second file runs like this:
>
>   <array>
>   <!--* Untagged random numbers generated by genarray.rexx
>     * 11 Oct 2000 15:21:40
>     *-->
>   <row>
>   <cell>78886</cell>
>   <cell>73598</cell>
>   <cell>66082</cell>
>   ...
>
> Like Henry, I find the tagged version about three times as large as
> the untagged version, in raw form:
>
>   RAW
>   Size of matrix     Untagged      Tagged   Ratio tagged/untagged
>
>   10 by 10                779       2,258   2.89
>   100 by 100           68,994     200,523   2.906
>   100 by 1000         688,955   1,990,484   2.889
>   1000 by 1000      6,887,949  19,902,978   2.889
>
> But after running gzip, I find the ratio is not 3.2 but about 1.18
> on average:
>
>   COMPRESSED
>   Size of matrix     Untagged      Tagged   Ratio tagged/untagged
>
>   10 by 10                445         554   1.244
>   100 by 100           28,674      32,934   1.148
>   100 by 1000         278,165     323,894   1.164
>   1000 by 1000      2,771,874   3,236,422   1.167
>
> This seems more plausible to me, since the two files should contain
> almost the same amount of raw information (the tagged version does, in
> my case, contain more information, since it explicitly tags row
> boundaries while the untagged version does not).
>
> I don't draw any immediate conclusions from this except that under
> compression tagging does not necessarily triple the size of an array.
>
> Michael Sperberg-McQueen
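[Editorial note: the experiment quoted above is easy to reproduce. The following is a minimal Python sketch, not the original genarray.rexx script; the exact file layout, the 0-99999 integer range, and the function names are assumptions based on the descriptions in the thread. It builds an untagged file (one integer per line) and a tagged file (`<cell>` per number, `<row>` per group), then compares raw and gzip-compressed sizes.]

```python
# Sketch of the tagged/untagged size comparison discussed above.
# Assumptions (not from the original scripts): integers in 0..99999,
# a fixed RNG seed, and the element names <array>/<row>/<cell>.
import gzip
import random

def make_files(rows, cols, seed=0):
    rng = random.Random(seed)
    nums = [[rng.randint(0, 99999) for _ in range(cols)] for _ in range(rows)]
    # (a) untagged: one integer per line
    raw = "\n".join(str(n) for row in nums for n in row) + "\n"
    # (b) tagged: each number in a <cell>, each group of cols in a <row>
    tagged = "<array>\n" + "".join(
        "<row>\n" + "".join(f"<cell>{n}</cell>\n" for n in row) + "</row>\n"
        for row in nums
    ) + "</array>\n"
    return raw.encode(), tagged.encode()

def ratios(rows, cols):
    """Return (raw size ratio, gzip-compressed size ratio), tagged/untagged."""
    raw, tagged = make_files(rows, cols)
    raw_ratio = len(tagged) / len(raw)
    gz_ratio = len(gzip.compress(tagged)) / len(gzip.compress(raw))
    return raw_ratio, gz_ratio

raw_ratio, gz_ratio = ratios(100, 100)
print(f"raw ratio: {raw_ratio:.2f}, compressed ratio: {gz_ratio:.2f}")
```

Consistent with the figures quoted in the thread, the raw ratio comes out near 3 while the compressed ratio is much smaller, since gzip's LZ77 stage absorbs the highly repetitive tags.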
Received on Wednesday, 11 October 2000 18:54:11 UTC