- From: Paul Prescod <papresco@calum.csclub.uwaterloo.ca>
- Date: Mon, 30 Sep 1996 09:50:59 -0400
- To: Robert Streich <streich@slb.com>, Paul Prescod <papresco@calum.csclub.uwaterloo.ca>
- Cc: w3c-sgml-wg@w3.org
At 01:04 AM 9/30/96 CDT, Robert Streich wrote: >At 10:25 AM 9/28/96 -0400, Paul Prescod wrote: >> * Because XML is widely supported (we hope). >> * Because XML is compact (we hope). >> * Because XML will preserve whatever structure exists in the document (and >>Word allows quite a bit). >> * Because XML is more portable, more device-independent and more widely >>supported than Word for Windows format. >> * Because XML is "open" and standardized. >> * Because XML is easy to full-text index. > >I don't get it. Which of these doesn't apply to HTML? >> * Because XML is compact (we hope). HTML represented structured information is _not_ compact because you have to duplicate data hundreds of times: in every "xref", in the table of contents, in the header, the footer, in the index. In short, every time information references other information. >> * Because XML will preserve whatever structure exists in the document (and >>Word allows quite a bit). Conversion to HTML invariably destroys the structure in the document. Let's say that the internal Word structure is: /section This is section 1 /paragraph{type=Normal} Please see /reference{target=BOOKMARK_SECTION_2} /index{"section 2, reference to"} Then the HTML structure is: <H1> <P CLASS=NORMAL>Please see <A HREF="BOOKMARK-SECTION-2">Section 2</A></P> <A CLASS="INDEX-TARGET" NAME="#INDEX-TARGET-6"> </A> I've been forced to throw away a TON of the "true information" in this document. First, there is know way of knowing that the reference is supposed to be expanded and changed by the style sheet, not hard-coded to the text "section 2". Second, the index entry has now been split up into a source (in an index file) and a target (in this file). Third, the index target now contains data, because HTML anchor targets must contain data. Fourth, CLASSes will be used throughout the document to make MAJOR semantic distinctions. That's distasteful because most SGML (and XML) tools work on the premise that major semantic distinctions will be made in GIs. Fifth, I've had to generate names for link targets and anchors that were created that did not exist in the Word doc. Finally, there are going to be hundreds of small hacks, like the "space" in the anchor target to try to work around a hundred browser implementations because HTML was not so much designed as "grown." In other words, there can be no round trip conversion here. The true information has been thrown away. This will happen in a hundred small ways because I cannot choose my GIs and attributes to match the needs of the data. Here's what I think the XML would look like: <section>This is section 1 <Normal>Please see <reference target=BOOKMARK-SECTION-2></NORMAL> <INDEX TARGET="section 2, reference to"> >> * Because XML is more portable, more device-independent and more widely >>supported than Word for Windows format. I think that my XML version is likely to render well on a much larger range of devices than either the proprietary Word for Windows file or the dumbed-down HTML file. >And why would Jane >Author want to mess with creating a structured doc? To preserve as much of the information as is in the original document in the Web delivery. >If a bunch of authors create a bunch of documents with a bunch of DTDs >and a bunch of stylesheets then how is that any better than what we >have now? Aren't you describing the current usages of SGML? Aren't there a bunch of authors with a bunch of documents with a bunch of DTDs and a bunch of stylesheets? I don't understand your point. >If this were useful, I could just buy a bunch of DynaTags >and automap all of the styles. Yeccch. And some people _do_ do that, and it _is_ useful. It doesn't necessarily give you the quality of information that you get if data is Designed for SGML, but as I've shown above it gives you better quality information than if you try to convert it into an even stupider format: HTML. Paul Prescod
Received on Monday, 30 September 1996 09:56:10 UTC