Re: Make DTDs optional? from Paul Prescod on 1996-09-30 (w3c-sgml-wg@w3.org from September 1996)

From: Paul Prescod <papresco@calum.csclub.uwaterloo.ca>
Date: Mon, 30 Sep 1996 09:50:59 -0400
To: Robert Streich <streich@slb.com>, Paul Prescod <papresco@calum.csclub.uwaterloo.ca>
Cc: w3c-sgml-wg@w3.org
Message-Id: <1.5.4.32.19960930135059.00696fbc@csclub.uwaterloo.ca>
At 01:04 AM 9/30/96 CDT, Robert Streich wrote:
>At 10:25 AM 9/28/96 -0400, Paul Prescod wrote:
>> * Because XML is widely supported (we hope). 
>> * Because XML is compact (we hope). 
>> * Because XML will preserve whatever structure exists in the document (and
>>Word allows quite a bit). 
>> * Because XML is more portable, more device-independent and more widely
>>supported than Word for Windows format.
>> * Because XML is "open" and standardized.
>> * Because XML is easy to full-text index.
>
>I don't get it. Which of these doesn't apply to HTML? 

>> * Because XML is compact (we hope). 

HTML represented structured information is _not_ compact because you have to
duplicate data hundreds of times: in every "xref", in the table of contents,
in the header, the footer, in the index. In short, every time information
references other information.

>> * Because XML will preserve whatever structure exists in the document (and
>>Word allows quite a bit). 

Conversion to HTML invariably destroys the structure in the document. Let's
say that the internal Word structure is:

/section This is section 1
/paragraph{type=Normal} Please see /reference{target=BOOKMARK_SECTION_2}
/index{"section 2, reference to"}

Then the HTML structure is:

<H1>
<P CLASS=NORMAL>Please see <A HREF="BOOKMARK-SECTION-2">Section 2</A></P>
<A CLASS="INDEX-TARGET" NAME="#INDEX-TARGET-6"> </A>

I've been forced to throw away a TON of the "true information" in this
document. First, there is know way of knowing that the reference is supposed
to be expanded and changed by the style sheet, not hard-coded to the text
"section 2". Second, the index entry has now been split up into a source (in
an index file) and a target (in this file). Third, the index target now
contains data, because HTML anchor targets must contain data. Fourth,
CLASSes will be used throughout the document to make MAJOR semantic
distinctions. That's distasteful because most SGML (and XML) tools work on
the premise that major semantic distinctions will be made in GIs. Fifth,
I've had to generate names for link targets and anchors that were created
that did not exist in the Word doc. Finally, there are going to be hundreds
of small hacks, like the "space" in the anchor target to try to work around
a hundred browser implementations because HTML was not so much designed as
"grown."

In other words, there can be no round trip conversion here. The true
information has been thrown away. This will happen in a hundred small ways
because I cannot choose my GIs and attributes to match the needs of the data.

Here's what I think the XML would look like:

<section>This is section 1
<Normal>Please see <reference target=BOOKMARK-SECTION-2></NORMAL>
<INDEX TARGET="section 2, reference to">

>> * Because XML is more portable, more device-independent and more widely
>>supported than Word for Windows format.

I think that my XML version is likely to render well on a much larger range
of devices than either the proprietary Word for Windows file or the
dumbed-down HTML file.

>And why would Jane
>Author want to mess with creating a structured doc?

To preserve as much of the information as is in the original document in the
Web delivery.

>If a bunch of authors create a bunch of documents with a bunch of DTDs
>and a bunch of stylesheets then how is that any better than what we
>have now? 

Aren't you describing the current usages of SGML? Aren't there a bunch of
authors with a bunch of documents with a bunch of DTDs and a bunch of
stylesheets? I don't understand your point.

>If this were useful, I could just buy a bunch of DynaTags
>and automap all of the styles. Yeccch.

And some people _do_ do that, and it _is_ useful. It doesn't necessarily
give you the quality of information that you get if data is Designed for
SGML, but as I've shown above it gives you better quality information than
if you try to convert it into an even stupider format: HTML.

 Paul Prescod
Received on Monday, 30 September 1996 09:56:10 UTC