Re: [x3d-contributors] Arrays in XML Schema - Last Call Issue LC-84 - Schema WG response from Henry S. Thompson on 2000-10-05 (www-xml-schema-comments@w3.org from October to December 2000)

From: Henry S. Thompson <ht@cogsci.ed.ac.uk>
Date: 05 Oct 2000 09:56:29 +0100
To: Don Brutzman <brutzman@nps.navy.mil>
Cc: Frank Olken <olken@lbl.gov>, Joe D Willliams <JOEDWIL@earthlink.net>, Robert Miller <Robert.Miller@gxs.ge.com>, Jane Hunter <jane@dstc.edu.au>, X3D Contributors <x3d-contributors@web3d.org>, mpeg7-ddl <mpeg7-ddl@darmstadt.gmd.de>, www-xml-schema-comments@w3.org, w3c-xml-schema-ig@w3.org, fallside@us.ibm.com
Message-ID: <f5bem1vll8i.fsf@cogsci.ed.ac.uk>

Don Brutzman <brutzman@nps.navy.mil> writes:

<bigSnip/>

> > > Thus detailed mark up syntax:
> > >
> > >         <array>
> > >                 <arrayElement> 1.0 </arrayElement>
> > >                 <arrayElement> 2.0 </arrayElement>
> > >                 <arrayElement> 3.0 </arrayElement>
> > >                 <arrayElement> 4.0 </arrayElement>
> > >         </array>
> > >
> > > would be preferred to the flattened syntax you suggested
> > > in your comment:
> > >
> > >         <array> 1.0 2.0 3.0 4.0 </array>
> 
> but for numeric data of big sizes, the detailed approach isn't really 
> feasible.  the relationship of compression is understood, but that isn't a
> complete solution since text-editing or text-searching must also be
> feasible.

This is the crux of the matter.  The WG position is that complex
structures are best addressed by markup, on what one might call a
Total Cost of Ownership basis.  There are at least 4 factors here:

  The marked up version is larger measured in characters (4 chars for
  " 2.0" vs. 10 for "<e>1.0</e>");

  The marked up version scales to multi-dim arrays without ad-hoc
  punctuation;

  The marked up version is less liable to undetected typing errors
  (most character dropouts will be harmless (spaces) or detected
  (markup));

  The marked up version makes every element available as the target of 
  a path/pointer.
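
As a rough illustration of the first and last factors, here is a minimal
Python sketch. It assumes a short element name <e> (as in the character
count above) and uses the standard library's xml.etree.ElementTree; the
variable names are purely illustrative and not part of any proposal.

```python
import xml.etree.ElementTree as ET

# Two encodings of the same four numbers (element name <e> is an assumption).
marked_up = "<array><e>1.0</e><e>2.0</e><e>3.0</e><e>4.0</e></array>"
flattened = "<array> 1.0 2.0 3.0 4.0 </array>"

# Per-element cost in characters: whitespace separator vs. markup.
cost_flat = len(" 2.0")
cost_markup = len("<e>1.0</e>")
print(cost_flat, cost_markup)

# Marked up: the third element is directly addressable by a path
# (ElementTree positions are 1-based).
third = ET.fromstring(marked_up).find("e[3]")
print(third.text)

# Flattened: the path stops at <array>; the consumer must split the
# text content itself to reach an individual value.
values = ET.fromstring(flattened).text.split()
print(values[2])
```

Both routes reach the same value, but only the marked up form lets a
generic path or pointer do so without array-specific parsing.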

It's not obvious to me why editing and searching are harder in the
marked up case, even for large datasets.  See the appendix to this
message for a quick experiment.

We understand that long familiarity has made you comfortable
with using whitespace delimiters, but it seems clear to us that the
TCO is much lower for the marked up approach.  As it fits well with
the overall XML architecture, and the whitespace approach does not,
there continues to be little sympathy for special-casing arrays in the
way you have requested.

Speaking for myself, I would strongly urge you to try working with us
to define some standard complex types for use in array markup, and see
if that does not in fact meet your needs.  We recently committed [1] to
helping coordinate an effort to define a library of much-needed
complex types, and since arrays should certainly be among them, we
would welcome your participation in this effort.

ht

APPENDIX

I compared an array of 614400 numbers of the form n.n encoded with
whitespace and markup separators:

 wc raw64k
  614400  614400 2457600 raw64k
 wc cooked64k
  614400  614400 6758400 cooked64k

So a factor of 2.75 larger as is

  gzip -c raw64k | wc
       0       3    7218
  gzip -c cooked64k | wc
       0       3   23014

and a factor of 3.2 compressed

and crucially, those factors are preserved when searching, that is,
searching the compressed cooked version is only 3 times slower than searching
the uncompressed raw version:

   time gegrep -c '\b7\.8\b' raw64k
   61440

   real    0m0.201s
   user    0m0.121s
   sys     0m0.082s

    time sh -c "zcat cooked64k.gz | gegrep -c '\b7\.8\b'"
   61440

   real    0m0.599s
   user    0m0.461s
   sys     0m0.122s

(Searching the uncompressed cooked version is only half the speed of
searching the uncompressed raw version).
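
The size comparison can be approximated with a minimal Python sketch.
The contents of the original test files are not given, so the values
below (a repeating 0.0 .. 9.9 cycle), the one-value-per-line layout and
the <e> element name are assumptions chosen to match the wc line and
character counts; the compressed sizes will depend on the data and need
not match the gzip figures above.

```python
import gzip

N = 614400  # same number of values as in the experiment above

# Assumption: values of the form n.n, cycling through 0.0 .. 9.9.
values = [f"{(i // 10) % 10}.{i % 10}" for i in range(N)]

# "raw": one value per line (4 bytes each, newline included).
raw = "".join(v + "\n" for v in values).encode("ascii")
# "cooked": one <e>n.n</e> per line (11 bytes each, newline included).
cooked = "".join(f"<e>{v}</e>\n" for v in values).encode("ascii")

ratio_plain = len(cooked) / len(raw)
print(len(raw), len(cooked), ratio_plain)

# Compressed sizes; the ratio here is data dependent.
raw_gz = gzip.compress(raw)
cooked_gz = gzip.compress(cooked)
print(len(raw_gz), len(cooked_gz), len(cooked_gz) / len(raw_gz))
```

The uncompressed ratio is fixed by the per-line lengths (11 vs. 4
bytes); only the compressed ratio varies with the data.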

-- 
  Henry S. Thompson, HCRC Language Technology Group, University of Edinburgh
          W3C Fellow 1999--2001, part-time member of W3C Team
     2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440
	    Fax: (44) 131 650-4587, e-mail: ht@cogsci.ed.ac.uk
		     URL: http://www.ltg.ed.ac.uk/~ht/

Received on Thursday, 5 October 2000 04:57:05 UTC