- From: <lee@sq.com>
- Date: Mon, 28 Oct 96 23:04:35 EST
- To: w3c-sgml-wg@w3.org
This is a long article, but I hope that it will be a useful summary and reference, and will help those working on the XML specification. It relates to the information about a document and/or at the start of the document. We must remember, when deleting the SGML declaration and other goodies, that they serve multiple functions in SGML, and even if we can't use them as they stand, we still need some of the things that they do. The following information needs to be conveyed, and very preferably in a way that SGML can accommodate: (1) that this is an XML document and not a GIF image. This may come from a MIME-type or filename. This includes the version of the spec for which this document was written to conform. (2) the character set and encoding of what follows. This may come from a MIME header, if there is one. If not, it must be found very near to the beginning of the document. (3) if there is a DTD, what it is called and where to get it (4) if there are other auxiliary SGML files, such as SDATA entity definitions (if such things are permitted) or other included objects, where to get them (5) if there are other auxilliary application-specific files, such as style sheets, navigator/table-of-contents definitions, VRML mappings, odour personality packs, or whatever, what they are and how to get them. I know style sheet attachification is a later toic, but it is not (I think) entirely unrelated. (7) whether to expect the following data to be a complete document or an incomplete (unfinished partial) document or a reusable a fragment, and, if it is complete, whether to expect it to be valid or merely well-formed. (This last is not necessarily the same as having a DTD) Currently, for SGML, (1) is external to SGML, although once you've got an SGML document, if the SGML declaration says ISO8879-1986 you know you've got a pre-amendment SGML document, and if it says ISO8879:1986 you have a post-amendment pre-SML97 document, etc. (2) is derived from the SGML declaration; (3) is included in or pointed to (or both) by the document type wotsit; SGML uses the DOCTYPE external identifier to refer to an external file it then can't find, but see (4). (4) is outside the scope of SGML, but well within the scope of an application such as HTML or XML; SGML OPEN TR9501/CATLOG is one such mechanism. (5) is the same as (4) -- e.g., Panorama looks up th public identifier from the DOCTYPE wotsit in ENTITYRC, and Author/Editor uses a combination of the ${styles_path} preference and voodoo magick. (6) there isn't a 6. (7) SGML always expects a complete document; if the element given in the DOCTYPE declaration isn't the same as the first open tag encountered, an SGML application can go crazy if it likes (A/E reduces the strictness of parsing, and assumes that you simply haven't finished editing the document yet, for example). For XML, not all of these apply. (1) determining the document format is external to XML. I would suggest, however, that a MIME type be registered with iana and also that a file extension (.xml I suppose) be suggested. It's also worth adding XML to the Microsoft, Apple and OSF/CDE/Sun type registries. If you can't tell that a document is an XML file by looking at the start of it, so be it. Once you've determined it's an XML document, you need to know the version *before* anything else. Anything transmitted after THIS IS XML and before BUT IT's XML 3.1 can never, ever change. It is possible to say, if there is no version identifier it's XML 1.0, but that's a little dangerous. (2) the character set and encoding must be available by looking at the start of the document, just after the XML version. This can be a MIME header, or a processing instruction or a significant comment (I don't see much difference between the last two) or an attribute on a fixed element name (polluting the element namespace somewhat as has already been agreed, I think, for IDs) A DOCTYPE line doesn't do this. (this has to be after the version, so that if the meaning of an encoding is changed in some future revision of XML, the parser can understand in time) I have suggested combining (1) and (2) with <XML REV="1.0" encoding="xxx" charset="yyy"> but I accept that this means you can't use an element called XML. If @ is added as a namestart (until SGML can be changed to have a separate MDO for empty elements) then the element could be empty, and we could use <@XML REV="1.0" ....> Another way would be with a PROCINST, <?XML REV="1.0" encoding="xxx" charset="yyy"> This might not appeal to SGML users accustomed to thinking of processing instructions as a kludge for the things you couldn't see how to do properly (yes, there are many such people), but it has the advantage that it can precede a DOCTYPE line. (3) if there is a DTD, what it is called and where to get it If there is a DTD, I have no strong objection to a DOCTYPE line, even though it's not very pretty. If there is no DTD, there might as well be no DOCTYPE line, since it doesn't help SGML conformance in practice. In principle, I understand Charles' <!DOCTYPE XML SYSTEM> hack, and he has almost convinced me that this does not conflict with [110:37] (p. 403 I think in the handbook, I'm at home) A document type declaration must contain an element declaration for the document type name. on the grounds that in this case the application must invent a DTD that so conforms. But in practice I doubt that there are any SGML applications that cope with this. (4) how to get ancilliary files: HTML allows a base URL for the current document to be specified. This is necessary for use with databases and CGI scripts. SoftQuad Panorama allows processing instructions in the document to say * this is my URL (new in Panorama 2.0 because there were situations we absolutely couldn't cope with without it) (i.e. same as BASE) * this is the URL or a style sheet for me, and this is its name; * same for other associated files (e.g. navspec) (see (5) below for these) This is a minimum. For Panorama, system identifiers in the definitions of external identifiers are taken to be partial URLs and interpreted relative to the file containing their definition. XML could do this with a processing instruction, although note that since you can't easily put a > in there, there are some files that you can't access (name space pollution again!). In practice, an application can decide to recognise &#nnn; or > inside a PI if it likes (after ESIS, if you will), and thus allow a quoting mechanism. You have to have &l too in that case, of course. An attribute on the <@XML> element would be a more robust way to do this: <@XML REV="1.0" encoding="xxx" charset="yyy" base="http://where.you.com/and/get/me.xml" > This starts to look like the (very useful but problematic) HTML HEAD structure... (5) how to get smile-sheets This is a later topic, I know, but if we get rid of PIs for DTDs, we can no longer use the PI of a DTD to bind a document and its style sheet. The mechanism used has to be DTD-independent, and cn't rely on architectural forms in DTDs (since we might not have a DTD to put them in). It needs to occur before the first non-header element in the document (so we can display without backtracking). Again, a PROCINST could be used. (6) There is no 6. (7) whether to expect the following data to be a complete document or an incomplete (unfinished partial) document or a reusable a fragment... This function is normally served by the DOCTYPE. In <!Doctype boy system> <Simon> we do not have a complete document, since Simon and boy don't match. In XML, the functions of the DOCTYPE might be (a) confuse systems that use the contents of the file rather than an externally specified file type or extension into thinking that this is an SGML file and firing up Adept on it (say). I'm not sure this is a real problem in practice. (b) let us know whether the element named in the doctype matches the first element in the document. This information is not needed when there is no DTD, as we can't do anything useful with it. Can we? (c) if a system identifier is given, we know where we would fetch the DTD from if we chose to do so. A separate mechanism must indicate whether that is appropriate or not. ...and, if it is complete, whether to expect it to be valid or merely well-formed. I think that having the URL to a DTD is not the same as saying that this document must be valid according to that DTD. If that's so, we also need <?!@#validity: [valid|well-formed|broken]> Summary of fields: A: It's XML B: REV="1.0" C: encoding="xxx" D: charset="yyy" E: base="http://where.you.com/and/get/me.xml" F: style set: (style sheet, name, valid-devices) (or whatever, probably application-dependent, we'll see) G: DTD name and version and location H: conformance; validity Notes on all the fields together in XML: If you have field (B), you can assume that the document is intended to conform to the spec. -- that is, it is well-formed (or you can report an error and quit!) and may also be valid. Of course, all the other HTTP sorts of fields that one can specify with META, such as <META HTTP-EQUIV="Expires" Value="IETF standard format date+time"> <META HTTP-EQUIV="Pragma" Value="no-cache"> and so on, are also going to be issues for XML files, as are the metadata (such as the Dublin Core) for search engines and other uses. One way might simply be to have a surrogate document: <?XML:metadata "URL:xxx.html"> with much of this information in it. But id would be for the best if all these things can be included in an XML file, and in a stndard way, and in a way that does not conflict with SGML. Where there is a standard way of doing something, and that standard can be used, use it. Lee
Received on Monday, 28 October 1996 23:04:44 UTC