Re: SD1 - Proposal: allow in valid XML-Documents only

From: Steven J. DeRose <sjd@eps.inso.com>
Date: Tue, 20 May 1997 15:07:28 -0400
To: w3c-sgml-wg@w3.org
At 12:08 AM 05/19/97 +0900, Weichel Bernhard (K3/EES4) wrote:
>I am strongly supporting SD1 - Short End Tags, primarily with respect to
>the usage of XML to exchange databases. In some examples from my desk -
>we are exchanging data for engine management systems - I was saving 40%
>of the size of the instance by simply shortening the end tags.

This sounds a bit like "I really want smaller heads on these hammers you're
selling -- I'm using them to turn screws and they're just not very
efficient." Usually if 80% of your files is markup (roughly that, since by
shortening half the tags you say you're saving 40%), there's a bigger
problem going on.

The biggest savings come if you've got a whole lot of little tiny fields.
Little tiny fields seldom have much internal structure; we're usually
talking about integers, dates, social security and phone numbers, etc. In
such cases, you're a whole lot better off sticking them in attributes.
Conceptually, it makes at least as much sense: attributes are good at
representing little chunks of info that each have little internal structure.
SGML and XML can also validate a number of datatypes for attributes (and
perhaps will add a few), whereas it is not possible even to require that
#PCDATA content *exist*.

To me, the big problem is not that you have to give the GI twice. It's that
you have to give it on every *instance*, which simply doesn't make sense for
RDBs. If you're shipping 100 records with 10 fields each, you don't want to
just get from 2000 GIs to 1000 GIs. You want to get to 10 GIs. 
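
To make the counting concrete, here is a small Python sketch of the arithmetic above. The function name gi_occurrences and its dictionary keys are made up for illustration; the only figures taken from the message are the 100 records and 10 fields.

```python
def gi_occurrences(records: int, fields: int) -> dict[str, int]:
    """Count how many times a GI is written out under each scheme."""
    total_fields = records * fields
    return {
        # <gi>...</gi>: the GI appears in both the start and the end tag
        "full_end_tags": 2 * total_fields,
        # <gi>...</>: the GI appears only in the start tag
        "short_end_tags": total_fields,
        # GIs named once up front (like column headers), never per record
        "declared_once": fields,
    }


counts = gi_occurrences(records=100, fields=10)
for scheme, count in counts.items():
    print(f"{scheme:15} {count:5} GIs")
```

Short end tags halve the count, but naming each GI once takes it down by two orders of magnitude in this example.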

The total costs for different forms, given an n-char GI are:

<gi> </gi>    or 5+2n chars per field
<gi> </>      or 5+n  chars per field
<gi/ /        or 3+n  chars per field
gi=" "        or 5+n  chars per field
|gi           or 1    char per field.
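
The per-field figures above can be tabulated for a particular GI length. This is only a sketch: the function name markup_cost and the choice of n = 8 are arbitrary, and the formulas are copied from the list above as given. The last form assumes the GIs are declared once elsewhere, so that one-time cost is not counted per field.

```python
def markup_cost(n: int) -> dict[str, int]:
    """Markup characters per field for an n-character GI, per the list above."""
    return {
        "<gi> </gi>": 5 + 2 * n,
        "<gi> </>": 5 + n,
        "<gi/ /": 3 + n,
        'gi=" "': 5 + n,
        "|gi": 1,  # constant: does not grow with the GI length
    }


n = 8  # hypothetical GI length, chosen only for the example
for form, cost in markup_cost(n).items():
    print(f"{form:12} {cost:3} chars per field")
```

Only the last row stays fixed as n grows; every other form pays for the GI on each field.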

Maybe I'm an incorrigible CS nerd, but I find O(k) a lot better than O(n),
and O(n) not very much better than O(2n). You can save the 2n vs. n amount
just by shortening GIs, and most modems will compress it away anyway.
Doesn't seem worth it, esp. since we have a known constituency that it will
hurt (the DPH, not the parser writers, of course). 
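
For anyone who wants to check the compression point on their own data, here is a rough Python sketch. It uses zlib from the standard library as a stand-in for whatever a modem actually does (an assumption; modem compression schemes differ), and the helper build, the GI names, and the field values are all invented for the example.

```python
import zlib


def build(records: int, fields: int, short_end_tags: bool) -> bytes:
    """Build a flat run of tagged fields with full or short end tags."""
    parts = []
    for r in range(records):
        for f in range(fields):
            gi = f"field{f:02d}"  # made-up 7-character GI
            end = "</>" if short_end_tags else f"</{gi}>"
            parts.append(f"<{gi}>{r * fields + f}{end}")
    return "".join(parts).encode("ascii")


full = build(100, 10, short_end_tags=False)
short = build(100, 10, short_end_tags=True)

# Bytes saved by short end tags, before and after compression.
raw_gap = len(full) - len(short)
packed_gap = len(zlib.compress(full)) - len(zlib.compress(short))

print("saved before compression:", raw_gap)
print("saved after compression: ", packed_gap)
```

Repeated end tags are exactly the kind of redundancy a general-purpose compressor removes, so the saving left over after compression should be much smaller than the raw one.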

Saving part of the end-tag is just a patch; if we really want to address
this case we should save an awful lot more than that, and it seems to me
that is best done in some more radical way, if at all.

Steven J. DeRose, Ph.D., Chief Scientist
Inso Electronic Publishing Solutions
   (formerly EBT)
Received on Tuesday, 20 May 1997 15:10:32 UTC
