- From: Joe English <jenglish@crl.com>
- Date: Tue, 01 Oct 1996 14:35:30 -0700
- To: w3c-sgml-wg@w3.org
Regarding the proposal to require that all data content in XML be delimited:

As I understand it, this proposal attempts to solve two problems: that of distinguishing element content from mixed content, and that of RS/RE processing.

Since ISO 8879:1986 has no PELO/PELC ("pseudoelement open"/"pseudoelement close") delimiter roles, the issue of SGML compatibility must be addressed. The following options are available for processing an XML document as SGML:

1a) use a native XML parser to build a grove or ESIS stream;
1b) use an XML-to-SGML translator;
1c) wait for ISO 8879:199X (and tools that support it);
1d) use SHORTREF tricks in the DTD;
1e) something else that I've missed.

<myopinion>
(1a), (1b), and to a lesser extent (1c) would severely limit XML's utility. If XML can't be directly processed with existing SGML tools, there is little point in putting an SGML-like notation on the Web. If a separate XML parser is required, we'd be better off using a completely new syntax that is *significantly* easier to parse than something SGMLoid; Lisp S-expressions might be a good choice.
</myopinion>

(1d) would allow XML instances to be parsed as SGML, but introduces the new problem of DTD incompatibility. (I'm still operating under the assumption that *some* XML applications will need to examine the DTD.) We can get around this by stating in the XML spec something to the effect of "Ignore <!SHORTREF> and <!USEMAP> declarations; they're black magic intended for consumption by SGML parsers." Inelegant, but workable.

It is unclear (to me) what the SHORTREF black magic should be. If I understand correctly, something like the following was proposed:

    <!-- where "pel" is an element type name not defined
         elsewhere in the DTD, and the DTD uses "pel" instead
         of #PCDATA in all other content models.
    -->
    <!ELEMENT pel - - (#PCDATA) >
    <!ENTITY sr.pelo STARTTAG pel >
    <!ENTITY sr.pelc ENDTAG pel >
    <!ENTITY sr.escaped-pelc CDATA '"'>

    <!SHORTREF element-content
        '"'     sr.pelo >
    <!SHORTREF data-content
        '"'     sr.pelc
        '\"'    sr.escaped-pelc -- NB: won't work with RCS --
    >
    <!USEMAP data-content pel>
    <!USEMAP element-content ...everything else...>

With this scheme, an XML parser will *not* produce the same output as an SGML parser: the latter will construct a grove with all PEL nodes inside "wrapper" EL nodes that will be absent in the grove constructed by the former. Presumably this can be handled by an application convention on the SGML side that "pel" elements are to be ignored, or with another rule for XML that the PELO/PELC delimiters are equivalent to "pel" start- and end-tags -- in other words, requiring XML to interpret (or act as if it interprets) the short references after all.

This scheme doesn't solve all of the RS/RE weirdness:

    <foo>
    "
    This data contains two delimited REs that are discarded by SGML parsers.
    "
    </foo>

Maybe:

    <!ENTITY sr.pelo "<pel>&#RE;" >
    <!ENTITY sr.pelc "&#RE;</pel>" >

would do the trick.

OK... so I've convinced myself at least that option (1d) is workable from a technical standpoint (though it excludes SGML parsers that don't support extending the short reference delimiter set, or don't support SHORTREF at all). Is there a better solution that I've missed?

This leaves the problem in the hands of XML producers.

It seems to me that the best way to produce XML will be to author in "full" SGML and down-translate. Ideally the down-translation will be a simple normalization process, and with any luck most "native" SGML editors already save files in a form that will be suitably normalized for consumption as XML. The only restriction on XML-able DTDs so far is that they cannot contain EMPTY elements or #CONREF attributes (unless we can come up with a way for DTD-less parsers to handle those too).
However, very few (if any) existing DTDs utilize the short reference tricks required by option (1d); if these are required to make a DTD XML-able, it places a significant burden on XML producers. Do we expect them to use XML-able DTDs for all data in its "native" format? Is there a way to automatically convert existing DTDs to add the required SHORTREF declarations and perform #PCDATA substitutions? Do we require that providers maintain two copies of all their DTDs, one for use with their local SGML tools and one for publication as XML? I'm not sure how to address this issue.

<myopinion> <![ TEMP [

I think that the problem of distinguishing element content from mixed content is best solved by a markup convention forbidding SEPCHARs in element content. The cost of this solution -- that XML input cannot be formatted for editing the way most users would like -- is less than the cost of adopting option (1d).

I think that the problem of RS/RE processing is best handled simply by keeping the rules from ISO 8879. If the rules can be explained in 14 lines of prose and implemented with a 4-state DFA, then RS/RE processing is at most a minor annoyance; it's *not* a major problem.

I think that "RE delenda est" is the other best way to handle the RS/RE problem. Moreover, if we adopt the markup convention above then Charles' objections concerning the ontogeny of individual record ends are moot: if authors are forbidden to use whitespace to format element content, why should data content be any different?

]]> </myopinion>

--Joe English

  jenglish@crl.com
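
[Editorial sketch, not part of the original message.] For readers who want to see what the two short reference maps in option (1d) amount to, the following Python function is a rough emulation of them as a text rewrite, in the spirit of the translator in option (1b). It is an illustration under stated assumptions, not a description of how any SGML parser implements SHORTREF: the function name expand_short_refs is hypothetical, tags are assumed to end at the first ">" (so attribute values containing ">" are not handled), and no RS/RE processing is attempted.

```python
def expand_short_refs(text: str) -> str:
    """Rewrite '"'-delimited data as <pel>...</pel>.

    Emulates the two maps from the message:
      element-content map:  '"'  -> <pel>     (sr.pelo)
      data-content map:     '"'  -> </pel>    (sr.pelc)
                            '\\"' -> literal '"' (sr.escaped-pelc)
    """
    out = []
    in_data = False  # False: element-content map; True: data-content map
    i = 0
    while i < len(text):
        ch = text[i]
        if not in_data and ch == "<":
            # Assumption: copy a tag through verbatim so that quotes in
            # attribute values are not mistaken for data delimiters.
            end = text.find(">", i)
            if end == -1:
                raise ValueError("unterminated tag")
            out.append(text[i:end + 1])
            i = end + 1
        elif in_data and text.startswith('\\"', i):
            out.append('"')  # escaped delimiter becomes a literal quote
            i += 2
        elif ch == '"':
            out.append("</pel>" if in_data else "<pel>")
            in_data = not in_data  # switch maps, as USEMAP would
            i += 1
        else:
            out.append(ch)
            i += 1
    if in_data:
        raise ValueError("unterminated data content")
    return "".join(out)


if __name__ == "__main__":
    print(expand_short_refs('<foo>"say \\"hi\\""</foo>'))
```

The output makes the grove difference discussed in the message visible: every run of data ends up inside a "pel" wrapper element that a native XML parser would not produce.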
Received on Tuesday, 1 October 1996 17:35:19 UTC