some statistics on the impact of various proposals

I put these statistics together a few days ago in the momentary
confusion over document sizes. I was curious about the implications
of various proposals on the size of a document. When the confusion
was cleared up, I began wondering why we were concerned about
being able to parse without the DTD. So, I decided to send this
out anyway just to provide a few data points.

The sample I used for these statistics was a little over 44 Mbytes of
SGML in 138 documents. The average document size is 320 Kbytes; the
range was from 2 Kbytes to 2.7 Mbytes.

Empty end tags
    The names of end tags comprise 15% of the total size of the
    document. I didn't try to differentiate, but this is probably
    due in large part to the number of tables in these docs
    (just over 5000 of them).

    I didn't try to determine the mode, which would have been
    interesting but more work than I was willing to invest. The range
    was from 7.5% to 20.6%.
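
    For anyone who wants to try this on their own documents, here is a
    rough sketch of one way to measure the share of a file taken up by
    end-tag names. It is not the script I used; the regular expression
    and the function name are assumptions made for illustration, and it
    treats one character as one byte.

```python
import re

# Assumed pattern: an end tag such as </para>, capturing only the name.
END_TAG = re.compile(r"</([A-Za-z][A-Za-z0-9.-]*)\s*>")

def end_tag_name_share(text: str) -> float:
    """Return the fraction of characters spent on end-tag names."""
    if not text:
        return 0.0
    name_chars = sum(len(m.group(1)) for m in END_TAG.finditer(text))
    return name_chars / len(text)

sample = "<doc><para>Hello</para><para>World</para></doc>"
share = end_tag_name_share(sample)
print(f"{share:.1%} of the sample is end-tag names")
```

    The names counted here are exactly what empty end tags (</>) would
    remove, so the result is the potential saving for that file.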

"Pseudo-element" delimiters
    Eliminating mixed content, as James and Charles have proposed, would
    increase the average document size in this set by 0.98% for every
    character used to bound the "pseudo-element." So, for example, if
    you had a single-character name for the element, file size would
    increase by 6.86% (three characters in the start tag, four in the
    end tag).

    The range was from 0.5% to 1.9%.
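
    The arithmetic can be sketched as follows, assuming the increase
    scales linearly with the number of delimiter characters, as the
    0.98% per-character figure implies. The function names are made up
    for this example.

```python
PER_CHAR_INCREASE = 0.98  # percent per delimiter character, average for this sample

def delimiter_chars(name_length: int) -> int:
    """Characters in <name> plus characters in </name>."""
    start_tag = name_length + 2  # "<" + name + ">"
    end_tag = name_length + 3    # "</" + name + ">"
    return start_tag + end_tag

def size_increase_percent(name_length: int) -> float:
    """Estimated average file size increase, in percent."""
    return PER_CHAR_INCREASE * delimiter_chars(name_length)

for n in (1, 2, 4):
    print(n, delimiter_chars(n), round(size_increase_percent(n), 2))
```

    A one-character name gives seven delimiter characters and the
    6.86% quoted above; each extra character in the name adds two more
    delimiter characters, or another 1.96%.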

End tags on EMPTY elements
    Adding end tags to EMPTY elements increased file size by 0.9%. The
    range was from 0.3% to 3.3%.
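
    A minimal sketch of how such a figure could be estimated, not the
    method I actually used: find every start tag whose name is declared
    EMPTY and add the length of the matching end tag. The pattern is a
    simplification (it would be fooled by a ">" inside an attribute
    value), and the set of EMPTY names is assumed to come from the DTD.

```python
import re

# Assumed pattern: a start tag, capturing the element name.
START_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9.-]*)[^<>]*>")

def empty_end_tag_increase(text: str, empty_names) -> float:
    """Fractional size increase from adding </name> after each EMPTY element."""
    if not text:
        return 0.0
    names = {n.lower() for n in empty_names}  # names compared case-insensitively
    added = sum(
        len(m.group(1)) + 3  # "</" + name + ">"
        for m in START_TAG.finditer(text)
        if m.group(1).lower() in names
    )
    return added / len(text)

sample = '<p>See <xref id="a1"> and <xref id="b2">.</p>'
increase = empty_end_tag_increase(sample, {"xref"})
print(f"{increase:.1%} larger with end tags on EMPTY elements")
```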

Attribute value literals
    Currently, all of the attribute values in these docs are quoted,
    i.e., attribute value literals. If I were to strip out all
    unnecessary quotes, I could reduce the file size by 2%. The
    range was from 0.3% to 4.2%.
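
    Here is a rough sketch of how the saving could be counted. It
    assumes that a value can go unquoted when it consists only of name
    characters (letters, digits, periods and hyphens), and it matches
    any quoted string following "=", so stray matches in text content
    are possible. The names are hypothetical and this is not the script
    I ran.

```python
import re

# Assumed pattern: = followed by a double- or single-quoted literal.
ATTR_LITERAL = re.compile(r"""=\s*(?:"([^"]*)"|'([^']*)')""")
# Values made only of these characters are treated as safe to unquote.
NAME_CHARS = re.compile(r"[A-Za-z0-9.-]+")

def removable_quote_share(text: str) -> float:
    """Return the fraction of the file that unnecessary quotes occupy."""
    if not text:
        return 0.0
    saved = 0
    for m in ATTR_LITERAL.finditer(text):
        value = m.group(1) if m.group(1) is not None else m.group(2)
        if NAME_CHARS.fullmatch(value):
            saved += 2  # one opening and one closing quote
    return saved / len(text)

sample = '<fig id="f1" title="Pump layout">'
saving = removable_quote_share(sample)
print(f"{saving:.1%} of the sample is removable quotes")
```

    In the sample only id="f1" qualifies; the title contains a space
    and has to stay quoted.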

DTD size
    As is, our DTD, without comments, is 25 Kbytes. If I were to trim
    out all of the fat, and rebuild it with an emphasis on keeping size
    to a minimum, I'm sure I could get it to around 20 Kbytes. There
    are lots of declarations that could be paired up.

    At 20 Kbytes, the DTD is 6.3% the size of an average document.
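
    The same ratio can be worked out for the extremes of the sample
    using the sizes given at the top of this note. This is only a
    sketch; it assumes 2.7 Mbytes is about 2700 Kbytes, and the
    function name is invented for the example.

```python
def dtd_share_percent(dtd_kbytes: float, doc_kbytes: float) -> float:
    """Size of the DTD as a percentage of the document size."""
    return 100.0 * dtd_kbytes / doc_kbytes

DTD_KBYTES = 20  # the trimmed-down DTD estimate

average = dtd_share_percent(DTD_KBYTES, 320)    # average document
smallest = dtd_share_percent(DTD_KBYTES, 2)     # smallest document in the sample
largest = dtd_share_percent(DTD_KBYTES, 2700)   # largest document (assumed 2700 Kbytes)

print(f"average: {average:.2f}%  smallest: {smallest:.0f}%  largest: {largest:.2f}%")
```

    The average works out to 6.25%, which I rounded to 6.3%. For the
    smallest document the DTD is ten times the size of the document
    itself, while for the largest it is under 1%.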

    The DynaText stylesheets associated with these documents are 94
    Kbytes, 29% of the size of an average document and 4.7 times the
    size of the DTD. There are some oddities in the DynaText stylesheet
    language that might make it more verbose than a corresponding
    DSSSL stylesheet, but I kind'a doubt it since there are also a
    lot of pieces that encapsulate a lot of behavior into single

Robert Streich				streich@slb.com
Schlumberger				voice: 1 512 331 3318
Austin Research				fax:   1 512 331 3760