B.1 and B.2 results

The SGML ERB met today, Oct 17th, and voted on several items already
submitted to the SGML WG.  Participating:  Bosak, Clark, Maler (in
part), Magliery (in part), Paoli, Sperberg-McQueen, and Sharpe.  Absent:
Bray, DeRose, Hollander, Kimber, and Connolly.  All decisions were by
consensus of all those participating in the call, and thus carry a
majority of the membership of the ERB.

The text below is substantially the same as the drafts discussed by the
ERB, but was edited after the meeting to reflect the ERB's decisions;
the ERB has thus not seen and approved the precise wording given, and
may choose to correct any editorial errors made in the revision.

-C. M. Sperberg-McQueen


B.1 What should XML's character-set rules be?  Should conforming
XML documents be restricted to particular character sets?  Should
conforming XML processors be required to be able to parse all conforming
XML documents (13.1)?

It had already been agreed that:
  - the character repertoire of XML documents is that of ISO 10646
  - conforming XML documents may be in UTF-8 or UCS-2 form
  - all XML processors must accept documents in UTF-8 and UCS-2 (or
    optionally UTF-16) form
  - XML processors may provide a user option which causes them to accept
    documents in other coded character sets (e.g. ISO 8859 or JIS 0208)
    or other encodings of 10646 or other coded character sets (e.g.
    Extended Unix Code) -- this behavior must be optional [at least
    in validating processors, we decided today] (i.e. the user must
    be able to turn it off, so that documents not in UTF-8 or UCS-2
    raise errors).

In discussing the mechanism to be used for signaling the encoding and/or
coded character set in use, the ERB decided the following.  [Editorial
note:  if the ERB decides that XML will have external text entities,
then everything said below about documents will also apply to all
external text entities.]

The character repertoire of XML documents is that of ISO 10646.  All XML
processors are required to accept documents in the UTF-8 and UCS-2
encodings of 10646.  It is recognized that accepting documents in the
UTF-16 variant would be desirable.  Documents encoded in UCS-2 must
begin with the Byte Order Mark described by ISO 10646 Annex E and
Unicode Appendix B (the ZERO WIDTH NO-BREAK SPACE character, U+FEFF) --
this is an encoding signature, and not (for SGML purposes) part of the
document.  XML processors must be able to use this character to
differentiate between UTF-8 and UCS-2 encoded documents.
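
As an illustration only, the signature check described above might be
sketched as follows.  The function name and the returned labels are
hypothetical, and the sketch deliberately ignores the Encoding
Declaration PI discussed below; it only separates UCS-2 input (with
either byte order) from everything else.

```python
import codecs


def split_signature(data: bytes):
    """Return (encoding_label, remaining_bytes) for a raw document.

    The labels are illustrative names, not values defined by the ERB.
    """
    # U+FEFF serialized big-endian is FE FF; little-endian is FF FE.
    if data.startswith(codecs.BOM_UTF16_BE):
        return "UCS-2 (big-endian)", data[2:]
    if data.startswith(codecs.BOM_UTF16_LE):
        return "UCS-2 (little-endian)", data[2:]
    # No signature: treated here as UTF-8.
    return "UTF-8", data
```

The signature bytes are stripped from the returned data because the
Byte Order Mark is an encoding signature and not part of the document.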

XML does not explicitly sanction the use of any other encodings.  It is
recognized, however, that many documents exist in other encodings.  To
support processors in dealing with this situation, an XML document may
contain at its beginning, before any other text, markup, PIs, or white
space, an Encoding Declaration PI matching

EncDecl ::=
  '<?XML' S 'encoding' Eq (("'" Encoding "'") | ('"' Encoding '"')) S? '>'

An XML processor may choose to read Encoding Declaration PIs and accept
nonstandard encodings so declared.  In validating processors such
behavior must be at user option.

An XML document which lacks both the Byte Order Mark and an Encoding
Declaration PI must be in the UTF-8 encoding.  It is an error for a
document to be in an encoding other than that declared in its Encoding
Declaration PI.
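
A rough sketch of how a processor might apply these rules is given
here.  It makes several assumptions not settled by the text above:
that Eq is an equals sign with optional white space around it, that
the nonterminal "Encoding" is a run of ASCII letters, digits, '.', '_'
and '-', and that a nonstandard encoding is ASCII-compatible enough for
the PI itself to be read as bytes.  The function names are
hypothetical.

```python
import re

# Assumed shape of the Encoding Declaration PI; must start the document.
ENC_DECL = re.compile(
    rb"""<\?XML\s+encoding\s*=\s*(['"])([A-Za-z0-9._-]+)\1\s*>"""
)


def declared_encoding(data: bytes):
    """Return the declared encoding name, or None if no PI starts the data."""
    m = ENC_DECL.match(data)
    return m.group(2).decode("ascii") if m else None


def effective_encoding(data: bytes) -> str:
    """Byte Order Mark first, then the declaration, else UTF-8."""
    if data[:2] in (b"\xfe\xff", b"\xff\xfe"):
        return "UCS-2"
    return declared_encoding(data) or "UTF-8"
```

Because the match is anchored at the first byte, a PI preceded by white
space or any other text is not recognized, in keeping with the
requirement that it appear before anything else in the document.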

The XML specification shall include (possibly by reference to relevant
IETF documentation) a list of standard declarations for the nonterminal
"Encoding" in the above production, to support interoperability,
including names for at least ISO-Latin-X and the JIS family.


B.2 Should XML require each document instance to have a DTD or not?

In discussing this item, the ERB made the following decisions:

1. Well-formedness

The XML spec shall define two characteristics which an XML document may
possess, called "well-formedness" and "validity".  A well-formed
document, informally, is one for which no content model checking has
been done, but which can be read by an XML processor with confidence in
producing a correct ESIS.

Questions remaining open include:
  (a) the specific definition of well-formedness -- it is expected to
      include at least (1) a containing root element with no text
      outside it, (2) properly nested elements, (3) properly structured
      tags, and possibly other constraints on entity references, empty
      elements, etc.
  (b) whether two distinct levels of well-formedness (e.g. strong
      and weak) are necessary
  (c) the nature of well-formedness when there is no DTD or only a
      partial DTD

2. Required Markup Declaration (votable Y/N)

XML markup declarations are divided into DTDs pointed at by the
<!DOCTYPE, and internal subsets contained within the <!DOCTYPE.  Markup
declarations necessary to produce a correct parse may be contained
either in the DTD or the subset.  XML will include a signalling method
whereby instances may contain statements indicating whether the
declarations in the DTD and/or the subset are necessary to produce a
correct parse.

XML documents may contain a Required Markup Declaration PI as follows:

RMDDecl ::= '<?XML' S 'rmd' Eq ('NONE'|'INTERNAL'|'ALL') S? '>'

The RMD PI must appear after the Encoding Declaration PI, if any, and
before the document type declaration itself, if any.

Should the RMD state that the external DTD is required ('ALL'), it is a
reportable error if the DTD cannot be retrieved.
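
For illustration, recognizing an RMD PI might look like the sketch
below.  The production shows the keywords bare, while the examples in
this message quote the value (<?XML rmd='ALL'>); the sketch accepts
both forms, which is an assumption, as is the reading of Eq as an
equals sign with optional white space.  The name parse_rmd is
hypothetical.

```python
import re

# Optional matching quotes around one of the three keywords.
RMD_DECL = re.compile(
    r"""<\?XML\s+rmd\s*=\s*(['"]?)(NONE|INTERNAL|ALL)\1\s*>"""
)


def parse_rmd(pi: str):
    """Return 'NONE', 'INTERNAL' or 'ALL', or None if pi is not an RMD PI."""
    m = RMD_DECL.fullmatch(pi)
    return m.group(2) if m else None
```

Keywords are matched case-sensitively here, so a lower-case value is
rejected; whether case should be significant is not stated above.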

3. Interpretation of Required Markup Declaration

If no RMD PI is given, then
  - if a document type declaration is given, an XML processor must
    assume that the DTD is required, and read and process both the
    internal subset and the external DTD; it is a reportable error
    if the external DTD cannot be retrieved.  This is as if
    <?XML rmd='ALL'> had been specified.
  - if no document type declaration is given, an XML processor may
    do as it likes; for example, (a) signal an error, (b) behave as
    if <?XML rmd='NONE'> were declared, (c) guess, on the basis of the
    root element's GI, and retrieve the appropriate well-known DTD
    if possible or act on hard-coded knowledge of the DTD (e.g. HTML).

If an RMD PI is given, then
  - for the value NONE, a validating processor may check the DTD and/or
    instance to verify that in fact the DTD is not necessary to the
    correct construction of the ESIS; it is an error if the DTD is
    necessary but <?XML rmd='NONE'> is specified.
  - for the value INTERNAL, a validating processor may check the
    accuracy of the RMD PI; a non-validating processor may read the
    internal subset and skip the external DTD.  If the RMD PI is
    correct, the non-validating processor can construct the same
    parse as a validating parser.
  - for the value ALL, an XML processor must read and process the
    entire DTD, and construct the ESIS accordingly.  (The DTD may be
    skipped only by applications which don't construct an ESIS in any
    meaningful sense.)  A processor may issue an informational message
    if in fact the DTD could have been skipped, for this instance or
    for all documents using the given DTD.
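
The cases above can be summarized, for a non-validating processor, in a
small decision function.  This is a sketch under stated assumptions:
the function name and the (read_internal, read_external) return shape
are hypothetical, and the case of an RMD PI with no document type
declaration is not addressed by the text, so the sketch simply follows
the RMD value there.

```python
from typing import Optional, Tuple


def subsets_to_read(rmd: Optional[str],
                    has_doctype: bool) -> Optional[Tuple[bool, bool]]:
    """Return (read_internal_subset, read_external_dtd).

    None means the behavior is left to the processor.
    """
    if rmd is None:
        if not has_doctype:
            # No RMD and no DOCTYPE: the processor may do as it likes.
            return None
        # No RMD but a DOCTYPE: treated as if rmd='ALL' were given.
        rmd = "ALL"
    if rmd == "NONE":
        return (False, False)
    if rmd == "INTERNAL":
        return (True, False)
    if rmd == "ALL":
        return (True, True)
    raise ValueError("unknown RMD value: %r" % (rmd,))
```

A validating processor would additionally be free to check that a NONE
or INTERNAL declaration is accurate; that check is not modeled here.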