Re: C.4 Undeclared entities?

This is a long article, but I hope that it will be a useful summary
and reference, and will help those working on the XML specification.

It relates to the information about a document and/or at the start of
the document.  We must remember, when deleting the SGML declaration and
other goodies, that they serve multiple functions in SGML, and even if
we can't use them as they stand, we still need some of the things
that they do.

The following information needs to be conveyed, and very preferably in
a way that SGML can accommodate:

(1) that this is an XML document and not a GIF image.
    This may come from a MIME-type or filename.
    This includes the version of the spec for which this document
    was written to conform.

(2) the character set and encoding of what follows.
    This may come from a MIME header, if there is one.
    If not, it must be found very near to the beginning of the document.

(3) if there is a DTD, what it is called and where to get it

(4) if there are other auxiliary SGML files, such as SDATA entity definitions
    (if such things are permitted) or other included objects, where
    to get them

(5) if there are other auxilliary application-specific files, such as
    style sheets, navigator/table-of-contents definitions, VRML mappings,
    odour personality packs, or whatever, what they are and how to get them.
    I know style sheet attachification is a later toic, but it is
    not (I think) entirely unrelated.

(7) whether to expect the following data to be a complete document or
    an incomplete (unfinished partial) document or a reusable a fragment,
    and, if it is complete, whether to expect it to be valid or merely
    (This last is not necessarily the same as having a DTD)

Currently, for SGML,
(1) is external to SGML, although once you've got an SGML document,
    if the SGML declaration says ISO8879-1986 you know you've got a
    pre-amendment SGML document, and if it says ISO8879:1986 you have
    a post-amendment pre-SML97 document, etc.
(2) is derived from the SGML declaration;
(3) is included in or pointed to (or both) by the document type wotsit;
    SGML uses the DOCTYPE external identifier to refer to an external
    file it then can't find, but see (4).
(4) is outside the scope of SGML, but well within the scope of an
    application such as HTML or XML; SGML OPEN TR9501/CATLOG is one
    such mechanism.
(5) is the same as (4) --  e.g., Panorama looks up th public identifier
    from the DOCTYPE wotsit in ENTITYRC, and Author/Editor uses a
    combination of the ${styles_path} preference and voodoo magick.
(6) there isn't a 6.
(7) SGML always expects a complete document; if the element given in
    the DOCTYPE declaration isn't the same as the first open tag
    encountered, an SGML application can go crazy if it likes (A/E
    reduces the strictness of parsing, and assumes that you simply
    haven't finished editing the document yet, for example).

For XML, not all of these apply.

(1) determining the document format is external to XML.  I would suggest,
    however, that a MIME type be registered with iana and also that a
    file extension (.xml I suppose) be suggested.  It's also worth
    adding XML to the Microsoft, Apple and OSF/CDE/Sun type registries.

    If you can't tell that a document is an XML file by looking at
    the start of it, so be it.

    Once you've determined it's an XML document, you need to know the
    version *before* anything else.  Anything transmitted after
    and before
	BUT IT's XML 3.1
    can never, ever change.
    It is possible to say, if there is no version identifier it's XML 1.0,
    but that's a little dangerous.

(2) the character set and encoding must be available by looking at the
    start of the document, just after the XML version.

    This can be a MIME header, or a processing instruction or a
    significant comment (I don't see much difference between the last two)
    or an attribute on a fixed element name (polluting the element
    namespace somewhat as has already been agreed, I think, for IDs)
    A DOCTYPE line doesn't do this.

    (this has to be after the version, so that if the meaning of
    an encoding is changed in some future revision of XML,
    the parser can understand in time)

    I have suggested combining (1) and (2) with
	<XML REV="1.0" encoding="xxx" charset="yyy">
    but I accept that this means you can't use an element called
    XML.  If @ is added as a namestart (until SGML can be changed
    to have a separate MDO for empty elements) then the element could
    be empty, and we could use
	<@XML REV="1.0" ....>
    Another way would be with a PROCINST,
	<?XML REV="1.0" encoding="xxx" charset="yyy">
    This might not appeal to SGML users accustomed to thinking of
    processing instructions as a kludge for the things you couldn't
    see how to do properly (yes, there are many such people), but
    it has the advantage that it can precede a DOCTYPE line.

(3) if there is a DTD, what it is called and where to get it
    If there is a DTD, I have no strong objection to a DOCTYPE line,
    even though it's not very pretty.

    If there is no DTD, there might as well be no DOCTYPE line, since
    it doesn't help SGML conformance in practice.  In principle, I
    understand Charles' <!DOCTYPE XML SYSTEM> hack, and he has
    almost convinced me that this does not conflict with [110:37]
    (p. 403 I think in the handbook, I'm at home)
	A document type declaration must contain an
	element declaration for the document type name.
    on the grounds that in this case the application must invent a
    DTD that so conforms.  But in practice I doubt that there are
    any SGML applications that cope with this.

(4) how to get ancilliary files:

    HTML allows a base URL for the current document to be specified.
    This is necessary for use with databases and CGI scripts.

    SoftQuad Panorama allows processing instructions in the document
    to say
	* this is my URL (new in Panorama 2.0 because there were
	  situations we absolutely couldn't cope with without it)
	  (i.e. same as BASE)
	* this is the URL or a style sheet for me, and this is its name;
	* same for other associated files (e.g. navspec)
	  (see (5) below for these)
    This is a minimum.

    For Panorama, system identifiers in the definitions of external
    identifiers are taken to be partial URLs and interpreted relative
    to the file containing their definition.

    XML could do this with a processing instruction, although note that
    since you can't easily put a > in there, there are some files that
    you can't access (name space pollution again!).
    In practice, an application can decide to recognise &#nnn; or &gt;
    inside a PI if it likes (after ESIS, if you will), and thus allow
    a quoting mechanism.  You have to have &ampl too in that case,
    of course.

    An attribute on the <@XML> element would be a more robust way
    to do this:
    This starts to look like the (very useful but problematic) HTML
    HEAD structure...

(5) how to get smile-sheets
    This is a later topic, I know, but if we get rid of PIs for
    DTDs, we can no longer use the PI of a DTD to bind a document
    and its style sheet.

    The mechanism used has to be DTD-independent, and cn't rely
    on architectural forms in DTDs (since we might not have a DTD to put
    them in).

    It needs to occur before the first non-header element in the document
    (so we can display without backtracking).

    Again, a PROCINST could be used.

(6) There is no 6.

(7) whether to expect the following data to be a complete document or an
    incomplete (unfinished partial) document or a reusable a fragment...

    This function is normally served by the DOCTYPE.

	<!Doctype boy system>
    we do not have a complete document, since Simon and boy don't match.
    In XML, the functions of the DOCTYPE might be
    (a) confuse systems that use the contents of the file rather than
	an externally specified file type or extension into thinking
	that this is an SGML file and firing up Adept on it (say).
	I'm not sure this is a real problem in practice.
    (b) let us know whether the element named in the doctype matches
	the first element in the document.
	This information is not needed when there is no DTD, as we
	can't do anything useful with it.  Can we?
    (c) if a system identifier is given, we know where we would fetch the
	DTD from if we chose to do so.  A separate mechanism must indicate
	whether that is appropriate or not.
    ...and, if it is complete, whether to expect it to be valid or
    merely well-formed.

    I think that having the URL to a DTD is not the same as saying that
    this document must be valid according to that DTD.
    If that's so, we also need
	<?!@#validity: [valid|well-formed|broken]>

Summary of fields:

A:    It's XML
B:    REV="1.0"
C:    encoding="xxx"
D:    charset="yyy"
E:    base="http://where.you.com/and/get/me.xml"
F:    style set:
	(style sheet, name, valid-devices)
	(or whatever, probably application-dependent, we'll see)
G:    DTD name and version and location
H:    conformance; validity

Notes on all the fields together in XML:

If you have field (B), you can assume that the document is intended
to conform to the spec. -- that is, it is well-formed (or you can
report an error and quit!) and may also be valid.

Of course, all the other HTTP sorts of fields that one can
specify with META, such as
    <META HTTP-EQUIV="Expires" Value="IETF standard format date+time">
    <META HTTP-EQUIV="Pragma" Value="no-cache">
and so on, are also going to be issues for XML files, as are the
metadata (such as the Dublin Core) for search engines and other uses.

One way might simply be to have a surrogate document:
    <?XML:metadata "URL:xxx.html">
with much of this information in it.

But id would be for the best if all these things can be included
in an XML file, and in a stndard way, and in a way that does not
conflict with SGML.

Where there is a standard way of doing something, and that standard
can be used, use it.