- From: Michael Sperberg-McQueen <U35395@UICVM.CC.UIC.EDU>
- Date: Thu, 17 Oct 96 14:10:29 CDT
- To: W3C SGML Working Group <w3c-sgml-wg@w3.org>
The SGML ERB met today, Oct 17th, and voted on several items already submitted to the SGML WG.

Participating: Bosak, Clark, Maler (in part), Magliery (in part), Paoli, Sperberg-McQueen, and Sharpe.
Absent: Bray, DeRose, Hollander, Kimber, and Connolly.

All decisions were by consensus of all those participating in the call, and thus carry a majority of the membership of the ERB.

The text below is substantially the same as the drafts discussed by the ERB, but was edited after the meeting to reflect the ERB's decisions; the ERB has thus not seen and approved the precise wording given, and may choose to correct any editorial errors made in the revision.

-C. M. Sperberg-McQueen

-----------------------------------------------------------------------
B.1 What should XML's character-set rules be? Should conforming XML documents be restricted to particular character sets? Should conforming XML processors be required to be able to parse all conforming XML documents (13.1)?

It had already been agreed that:

- the character repertoire of XML documents is that of ISO 10646
- conforming XML documents may be in UTF-8 or UCS-2 form
- all XML processors must accept documents in UTF-8 and UCS-2 (or optionally UTF-16) form
- XML processors may provide a user option which causes them to accept documents in other coded character sets (e.g. ISO 8859 or JIS 0208) or other encodings of 10646 or other coded character sets (e.g. Extended Unix Code) -- this behavior must be optional [at least in validating processors, we decided today] (i.e. the user must be able to turn it off, so that documents not in UTF-8 or UCS-2 raise errors).

In discussing the mechanism to be used for signaling the encoding and/or coded character set in use, the ERB decided the following.

[Editorial note: if the ERB decides that XML will have external text entities, then everything said below about documents will also apply to all external text entities.]

The character repertoire of XML documents is that of ISO 10646.
All XML processors are required to accept documents in the UTF-8 and UCS-2 encodings of 10646. It is recognized that accepting documents in the UTF-16 variant would be desirable.

Documents encoded in UCS-2 must begin with the Byte Order Mark described by ISO 10646 Annex E and Unicode Appendix B (the ZERO WIDTH NO-BREAK SPACE character, U+FEFF) -- this is an encoding signature, and not (for SGML purposes) part of the document. XML processors must be able to use this character to differentiate between UTF-8 and UCS-2 encoded documents.

XML does not explicitly sanction the use of any other encodings. It is recognized, however, that many documents exist in other encodings. To support processors in dealing with this situation, an XML document may contain at its beginning, before any other text, markup, PIs, or white space, an Encoding Declaration PI matching

    EncDecl ::= '<?XML' S 'encoding' Eq ("'" Encoding "'")|('"' Encoding '"') S? '>'

An XML processor may choose to read Encoding Declaration PIs and accept nonstandard encodings so declared. In validating processors such behavior must be at user option.

An XML document which lacks both the Byte Order Mark and an Encoding Declaration PI must be in the UTF-8 encoding. It is an error for a document to be in an encoding other than that declared in its Encoding Declaration PI.

The XML specification shall include (possibly by reference to relevant IETF documentation) a list of standard declarations for the nonterminal "Encoding" in the above production, to support interoperability, including names for at least ISO-Latin-X and the JIS family.

-----------------------------------------------------------------------
B.2 Should XML require each document instance to have a DTD or not (7.1)?

In discussing this item, the ERB made the following decisions:

1. Well-formedness

The XML spec shall define two characteristics which an XML document may possess, called "well-formedness" and "validity".
A well-formed document, informally, is one for which no content model checking has been done, but which can be read by an XML processor with confidence in producing a correct ESIS. Questions remaining open include:

(a) the specific definition of well-formedness -- it is expected to include at least (1) a containing root element with no text outside it, (2) properly nested elements, (3) properly structured tags, and possibly other constraints on entity references, empty elements, etc.
(b) whether two distinct levels of well-formedness (e.g. strong and weak) are necessary
(c) the nature of well-formedness when there is no DTD or a partial DTD

2. Required Markup Declaration (votable Y/N)

XML markup declarations are divided into DTDs pointed at by the <!DOCTYPE, and internal subsets contained within the <!DOCTYPE. Markup declarations necessary to produce a correct parse may be contained either in the DTD or the subset. XML will include a signalling method whereby instances may contain statements indicating whether the declarations in the DTD and/or the subset are necessary to produce a correct parse.

XML documents may contain a Required Markup Declaration PI as follows:

    RMDDecl ::= '<?XML' S 'rmd' Eq ('NONE'|'INTERNAL'|'ALL') S? '>'

The RMD PI must appear after the Encoding Declaration PI, if any, and before the document type declaration itself, if any. Should the RMD state that the DTD is required ('DTD' or 'ALL'), it is a reportable error if the DTD cannot be retrieved.

3. Interpretation of Required Markup Declaration

If no RMD PI is given, then

- if a document type declaration is given, an XML processor must assume that the DTD is required, and read and process both the internal subset and the external DTD; it is a reportable error if the external DTD cannot be retrieved. This is as if <?XML rmd='ALL'> had been specified.
- if no document type declaration is given, an XML processor may do as it likes.
  For example, (a) signal an error, (b) behave as if <?XML rmd='NONE'> were declared, (c) guess, on the basis of the root element's GI, and retrieve the appropriate well-known DTD if possible or act on hard-coded knowledge of the DTD (e.g. HTML).

If an RMD PI is given, then

- for the value NONE, a validating processor may check the DTD and/or instance to verify that in fact the DTD is not necessary to the correct construction of the ESIS; it's an error if the DTD is necessary but <?XML rmd='NONE'> is specified.
- for the value INTERNAL, a validating processor may check the accuracy of the RMD PI; a non-validating processor may read the internal subset and skip the external DTD. If the RMD PI is correct, the non-validating processor can construct the same parse as a validating parser.
- for the value ALL, an XML processor must read and process the entire DTD, and construct the ESIS accordingly. (The DTD may be skipped only by applications which don't construct an ESIS in any meaningful sense.) A processor may issue an informational message if in fact the DTD could have been skipped, for this instance or for all documents using the given DTD.

-----------------------------------------------------------------------
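Illustration (not part of the ERB's text): the B.1 encoding rules and the B.2 RMD interpretation can be sketched in a few lines of Python. This is only a rough sketch under stated assumptions. The message does not define the nonterminals S, Eq, or Encoding, so the sketch assumes S is ASCII white space, Eq is '=' with optional white space around it, and Encoding is an ASCII name. The function names detect_encoding and declarations_to_read are hypothetical, chosen for this example only.

```python
import re

# Assumptions (not defined in the message): S is ASCII white space, Eq is '='
# with optional white space around it, and Encoding is an ASCII name made of
# letters, digits, '.', '_' and '-'. Input without a Byte Order Mark is assumed
# to be ASCII-compatible, so the declaration can be read before the encoding
# is known.
_ENC_DECL = re.compile(
    rb"<\?XML\s+encoding\s*=\s*(?:'([A-Za-z0-9._-]+)'|\"([A-Za-z0-9._-]+)\")\s*>"
)

# U+FEFF serialized as two bytes, in big-endian and little-endian order.
_UCS2_BOMS = (b"\xfe\xff", b"\xff\xfe")


def detect_encoding(data: bytes) -> str:
    """Apply the B.1 rules: Byte Order Mark, then Encoding Declaration PI,
    then the UTF-8 default."""
    if data[:2] in _UCS2_BOMS:
        return "UCS-2"
    # The PI must come before any other text, markup, or white space,
    # so it is only looked for at offset 0.
    match = _ENC_DECL.match(data)
    if match:
        return (match.group(1) or match.group(2)).decode("ascii")
    return "UTF-8"


def declarations_to_read(rmd, has_doctype):
    """Apply the B.2 rules for a non-validating processor.

    rmd is 'NONE', 'INTERNAL', 'ALL', or None when no RMD PI is present.
    Returns the set of declaration sources the processor has to read, or
    None when the message leaves the choice to the processor.
    """
    if rmd is None:
        if has_doctype:
            # Same as if rmd='ALL' had been specified.
            return frozenset({"internal", "external"})
        return None  # "an XML processor may do as it likes"
    if rmd == "NONE":
        return frozenset()
    if rmd == "INTERNAL":
        return frozenset({"internal"})
    if rmd == "ALL":
        return frozenset({"internal", "external"})
    raise ValueError("unknown RMD value: %r" % (rmd,))
```

A processor that offers the optional nonstandard-encoding behavior would call detect_encoding on the first bytes of the document and raise an error for any result other than UTF-8 or UCS-2 when the user has turned that option off.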
Received on Thursday, 17 October 1996 16:04:05 UTC