some statistics on the impact of various proposals
I put these statistics together a few days ago in the momentary
confusion over document sizes. I was curious about the implications
of various proposals on the size of a document. When the confusion
was cleared up, I began wondering why we were concerned about
being able to parse without the DTD. So, I decided to send this
out anyway just to provide a few data points.
The sample I used to get these statistics was a little over 44 Mbytes
of SGML in 138 documents. The average size of a document is 320 Kbytes.
The range was from 2 Kbytes to 2.7 Mbytes.
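As a quick sanity check on those figures, the average can be recomputed from the totals. This is only a sketch: it assumes 1 Mbyte = 1000 Kbytes (the post does not say which convention was used) and treats "a little over 44 Mbytes" as exactly 44. The names are illustrative.

```python
# Figures quoted in the post; constant and function names are illustrative.
TOTAL_KBYTES = 44 * 1000   # "a little over 44 Mbytes", assuming 1 Mbyte = 1000 Kbytes
DOCUMENTS = 138


def average_kbytes(total_kbytes: float, documents: int) -> float:
    """Mean document size in Kbytes."""
    return total_kbytes / documents


print(f"{average_kbytes(TOTAL_KBYTES, DOCUMENTS):.0f} Kbytes")
```

Under these assumptions the result lands just under the quoted 320 Kbytes, which fits a total slightly above 44 Mbytes.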
Empty end tags
The names of end tags make up 15% of the total size of the
documents. I didn't try to break this down, but it is probably
due in large part to the number of tables in these docs
(just over 5000 of them).
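A figure like this can be measured roughly as follows. This is a hypothetical sketch, not the script actually used for the numbers above: it counts characters rather than bytes, assumes names contain only letters, digits, '.' and '-', and makes no attempt to skip comments or marked sections.

```python
import re

# Matches the name inside an end tag, e.g. the "table" in "</table>".
# Assumption: names use only letters, digits, '.' and '-'.
END_TAG_NAME = re.compile(r"</([A-Za-z][A-Za-z0-9.\-]*)")


def end_tag_name_share(text: str) -> float:
    """Fraction of the characters in `text` taken up by end-tag names."""
    if not text:
        return 0.0
    name_chars = sum(len(m.group(1)) for m in END_TAG_NAME.finditer(text))
    return name_chars / len(text)


# Made-up table fragment: 11 name characters out of 39 in total.
sample = "<row><cell>a</cell><cell>b</cell></row>"
print(f"{end_tag_name_share(sample):.1%}")
```

Dense table markup like the sample pushes the share well above the average, which is in line with the guess that tables account for much of the 15%.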
I didn't try to determine the mode, which would have been interesting
but more work than I was willing to invest. The range was from 7.5%
Mixed content
Eliminating mixed content, as James and Charles have proposed, would
increase the average document size in this set by 0.98% for every
character used to bound the "pseudo-element." So, for example, if
you had a single-character name for the element, file size would
increase by 6.86% (three characters in the start tag, four in the
The range was from 0.5% to 1.9%.
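The arithmetic behind the 6.86% example can be sketched as below. It assumes every pseudo-element is written with a full start tag and a full end tag (no minimization) and that the 0.98% per-character average scales linearly with name length. The function names are illustrative.

```python
PER_CHAR_INCREASE = 0.98  # percent growth per bounding character (average quoted above)


def wrapper_overhead_chars(name_length: int) -> int:
    """Characters added by wrapping text in <name>...</name>."""
    start_tag = 2 + name_length  # '<' + name + '>'
    end_tag = 3 + name_length    # '</' + name + '>'
    return start_tag + end_tag


def size_increase_percent(name_length: int) -> float:
    """Estimated average growth in document size, in percent."""
    return PER_CHAR_INCREASE * wrapper_overhead_chars(name_length)


for n in (1, 2, 4):
    print(f"name length {n}: {size_increase_percent(n):.2f}%")
```

A one-character name gives 7 bounding characters and the 6.86% quoted above; each extra character in the name adds two more, one in each tag.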
End tags on EMPTY elements
Adding end tags to EMPTY elements increased file size by 0.9%. The
range was from 0.3% to 3.3%.
Attribute value literals
Currently, all of the attribute values in these docs are quoted,
i.e., written as attribute value literals. If I were to strip out all
unnecessary quotes, I could reduce the file size by 2%. The
range was from 0.3% to 4.2%.
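For illustration, here is one way such a stripping pass could look. It is a rough sketch rather than the tool used for these numbers, and it rests on simplifying assumptions: a value is treated as safe to unquote only if it consists of letters, digits, '.' and '-'; tags are found with a naive pattern that would break on a '>' inside a quoted value; and limits from the SGML declaration (such as name length) are ignored.

```python
import re

# Naive tag matcher. Assumption: no '<' or '>' inside attribute values.
TAG = re.compile(r"<[^<>]+>")
# A quoted value made only of name characters, directly after '='.
SIMPLE_QUOTED = re.compile(r"""=(["'])([A-Za-z0-9.\-]+)\1""")


def strip_unneeded_quotes(text: str) -> str:
    """Drop quotes around simple attribute values inside tags only."""
    return TAG.sub(lambda m: SIMPLE_QUOTED.sub(r"=\2", m.group(0)), text)


def percent_saved(text: str) -> float:
    """Size reduction from stripping, as a percentage of the original."""
    if not text:
        return 0.0
    return 100.0 * (len(text) - len(strip_unneeded_quotes(text))) / len(text)


before = '<table cols="3" id="t-1" title="My table">'
print(strip_unneeded_quotes(before))
```

Values containing spaces, such as the title in the sample, keep their quotes, and quoted text outside of tags is left alone.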
As is, our DTD, without comments, is 25 Kbytes. If I were to trim
out all of the fat, and rebuild it with an emphasis on keeping size
to a minimum, I'm sure I could get it to around 20 Kbytes. There
are lots of declarations that could be paired up.
At 20 Kbytes, the DTD is 6.3% of the size of an average document.
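The DTD-to-document ratio is simple enough to check directly. This sketch just reuses the sizes quoted above; the helper name is illustrative.

```python
AVERAGE_DOC_KBYTES = 320  # average document size quoted earlier


def percent_of_average_doc(size_kbytes: float) -> float:
    """Size of a file as a percentage of the average document."""
    return 100.0 * size_kbytes / AVERAGE_DOC_KBYTES


print(f"trimmed DTD (20 Kbytes): {percent_of_average_doc(20):.2f}%")
print(f"current DTD (25 Kbytes): {percent_of_average_doc(25):.2f}%")
```

The trimmed DTD comes to 6.25%, which is the 6.3% quoted above after rounding. The untrimmed 25 Kbyte DTD is a little under 8%.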
The DynaText stylesheets associated with these documents are 94
Kbytes, 29% of the size of an average document and 4.7 times the
size of the trimmed 20 Kbyte DTD. There are some oddities in the
DynaText stylesheet language that might make it more verbose than a
corresponding DSSSL stylesheet, but I kind of doubt it, since there
are also a
lot of pieces that encapsulate a lot of behavior into single
Robert Streich firstname.lastname@example.org
Schlumberger voice: 1 512 331 3318
Austin Research fax: 1 512 331 3760