- From: Henry S. Thompson <ht@cogsci.ed.ac.uk>
- Date: 05 Oct 2000 09:56:29 +0100
- To: Don Brutzman <brutzman@nps.navy.mil>
- Cc: Frank Olken <olken@lbl.gov>, Joe D Willliams <JOEDWIL@earthlink.net>, Robert Miller <Robert.Miller@gxs.ge.com>, Jane Hunter <jane@dstc.edu.au>, X3D Contributors <x3d-contributors@web3d.org>, mpeg7-ddl <mpeg7-ddl@darmstadt.gmd.de>, www-xml-schema-comments@w3.org, w3c-xml-schema-ig@w3.org, fallside@us.ibm.com
Don Brutzman <brutzman@nps.navy.mil> writes:

<bigSnip/>

> > > Thus detailed mark up syntax:
> > >
> > >   <array>
> > >     <arrayElement> 1.0 </arrayElement>
> > >     <arrayElement> 2.0 </arrayElement>
> > >     <arrayElement> 3.0 </arrayElement>
> > >     <arrayElement> 4.0 </arrayElement>
> > >   </array>
> > >
> > > would be preferred to the flattened syntax you suggested
> > > in your comment:
> > >
> > >   <array> 1.0 2.0 3.0 4.0 </array>
>
> but for numeric data of big sizes, the detailed approach isn't really
> feasible. the relationship of compression is understood, but that isn't a
> complete solution since text-editing or text-searching must also be
> feasible.

This is the crux of the matter. The WG position is that complex structures are best addressed by markup, on what one might call a Total Cost of Ownership basis. There are at least four factors here:

 1) The marked-up version is larger measured in characters (4 chars (" 2.0") vs. 10 ("<e>2.0</e>"));
 2) The marked-up version scales to multi-dimensional arrays without ad-hoc punctuation;
 3) The marked-up version is less liable to undetected typing errors (most character dropouts will be harmless (spaces) or detected (markup));
 4) The marked-up version makes every element available as the target of a path/pointer.

It's not obvious to me why editing and searching are harder in the marked-up case, even for large datasets. See the appendix to this message for a quick experiment.

We understand that your long familiarity has made you comfortable with using whitespace delimiters, but it seems clear to us that the TCO is much lower for the marked-up approach, and since it fits well with the overall XML architecture, and the whitespace approach does not, there continues to be little sympathy for special-casing arrays in the way you have requested.

Speaking for myself, I would strongly urge you to try working with us to define some standard complex types for use in array markup, and see if that does not in fact meet your needs.
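[A sketch illustrating factor 3 above, not part of the original message: dropping a character from whitespace-delimited data can corrupt it silently, while the same damage to the marked-up form is caught by the parser. The <e> element name is taken from the character-count example; the corruption pattern is just one assumed case.]

```python
# Compare how a one-character dropout behaves in each encoding.
import xml.etree.ElementTree as ET

raw = "1.0 2.0 3.0 4.0"
cooked = "<array><e>1.0</e><e>2.0</e><e>3.0</e><e>4.0</e></array>"

# Drop one character from each encoding.
raw_damaged = raw.replace("2", "", 1)        # "1.0 .0 3.0 4.0"
cooked_damaged = cooked.replace("<", "", 1)  # breaks the markup

# The whitespace version still parses -- it just silently yields a wrong value.
values = [float(tok) for tok in raw_damaged.split()]
assert len(values) == 4 and values[1] == 0.0  # was 2.0; corruption undetected

# The marked-up version fails loudly.
try:
    ET.fromstring(cooked_damaged)
    detected = False
except ET.ParseError:
    detected = True
assert detected
```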
We recently committed [1] to helping coordinate an effort to define a library of much-needed complex types, and since arrays should certainly be among them, we would welcome your participation in this effort.

ht

APPENDIX

I compared an array of 614400 numbers of the form n.n, encoded with whitespace and markup separators:

  wc raw64k
   614400  614400 2457600 raw64k
  wc cooked64k
   614400  614400 6758400 cooked64k

So a factor of 2.75 larger as-is;

  gzip -c raw64k | wc
        0       3    7218
  gzip -c cooked64k | wc
        0       3   23014

and a factor of 3.2 compressed. Crucially, those factors are preserved when searching; that is, searching the compressed cooked version is only about three times slower than searching the uncompressed raw version:

  time gegrep -c '\b7\.8\b' raw64k
  61440

  real    0m0.201s
  user    0m0.121s
  sys     0m0.082s

  time sh -c "zcat cooked64k.gz | gegrep -c '\b7\.8\b'"
  61440

  real    0m0.599s
  user    0m0.461s
  sys     0m0.122s

(Searching the uncompressed cooked version is only half the speed of searching the uncompressed raw version.)

-- 
Henry S. Thompson, HCRC Language Technology Group, University of Edinburgh
W3C Fellow 1999--2001, part-time member of W3C Team
2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440
Fax: (44) 131 650-4587, e-mail: ht@cogsci.ed.ac.uk
URL: http://www.ltg.ed.ac.uk/~ht/
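[A sketch reproducing the size part of the appendix experiment, not part of the original message. The <e> tag, one value per line, and the repeating-digit n.n pattern are assumptions inferred from the wc character counts (4 chars/value raw, 11 chars/value cooked); the exact compressed sizes depend on the data, so only the ratio direction is checked.]

```python
import gzip

N = 614400
nums = [f"{i % 10}.{i % 10}" for i in range(N)]

raw = "\n".join(nums) + "\n"                            # whitespace-delimited
cooked = "\n".join(f"<e>{n}</e>" for n in nums) + "\n"  # marked up

# 4 chars/value vs 11 chars/value -> the 2.75x factor reported above.
print(len(raw), len(cooked), round(len(cooked) / len(raw), 2))
# 2457600 6758400 2.75

raw_gz = len(gzip.compress(raw.encode()))
cooked_gz = len(gzip.compress(cooked.encode()))
print(raw_gz, cooked_gz)  # cooked compresses larger, as in the appendix
```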
Received on Thursday, 5 October 2000 04:57:05 UTC