- From: William F. Hammond <hammond@csc.albany.edu>
- Date: Fri, 29 Jun 2001 08:29:32 -0400 (EDT)
- To: www-talk@w3.org
Arjun --

I've not disagreed with your statements about HTML documents when viewed as SGML applications.

My view, however, of the various HTML-under-SGML specs that begin with RFC1866, then the 3.2, 4.0, 4.01 versions, and now the several versions of HTML as XML is that an HTML document is an SGML application that must also meet other requirements.

Formally, this means that an HTML document is something that always gives rise to an SGML application, but it is not correct to say that it "is" an SGML application. One can say that it "degrades to" an SGML application as a short form of indicating that for a given HTML document there is a canonically associated SGML application. A validating agent must know how to perform the canonical association.

It would be mischief to ship the completely assembled version of an HTML 2.0 document as an SGML application (SGML declaration, document type definition, and instance) under the purview of RFC1866 through HTTP. But if an HTML document "is" an SGML application, that should be sensible.

The distinction between an HTML document and the SGML application to which it canonically gives rise is the reason why Ian Hickson's example of a valid HTML-as-SGML-application that begins with "<?xml ... >" fails to prove his point. That example, as I explained here some time ago, when put under the eye of an XHTML-aware user agent must give first priority to the PI named "xml", a reserved name in PI space for XHTML-aware user agents, and then must reject the document for not qualifying as conforming XML.

> > If a late version of HTML has a larger charset than an early version,
> > then it is formally wrong to allow the larger charset in something
> > specified as the early version.
> I don't see how, if the newer set were a proper superset of the older
> one. (The whole character set business has been handled less than
> optimally, IMHO, but that's a separate discussion.)
The example

    <title>A Test</title><body><p>Œ</body>

can be validated against the 2.0 DTD if the SGML declaration for 4.01 is used but not with the correct declaration. It would be wrong to use an FPI for 2.0 or to say that it is an HTML 2.0 document because char 338 is not in the character set for 2.0.

> > and has specified a particular form of document type declaration
> > construction using one of a small list of FPI's.
> Actually, no. They have done the right thing in publishing FPIs for

In RFC1866 it's not required, but for W3C/3.2 a doctype declaration is required and for each subsequent W3C version it's required.

> of convenience. The core validation requirement is that an instance
> validate with respect to a (specific) declaration subset. To this
> end, the particular form of a document type declaration - or even, in
> fact, its presence - is irrelevant.

It's relevant in the context of the web where one cannot ship anything more than the instance with a short prolog. For example, in the 4.01 spec, section 7.1 says that a document must begin with a "line containing HTML version information".

> I disagree. In a nutshell, you're proposing that a validation system
> do nothing until it has sniffed an FPI in a document type declaration,

The word "sniffing" is inappropriate. A doctype declaration is required for all but 2.0. So formally if there is no doctype declaration and the HTML is assumed to be in the W3C family of document types, one should assume 2.0. (But in practice I would assume 4.01, and that might be a glitch in the system.) There are other definitions of HTML, but they are not really suited for use on the web.

> at which point it should use the FPI to resolve all other requirements
> such as appropriate SGML declarations and the like. The main purpose
> of such guesswork, apparently, is to *hide* from ordinary people the
> fact that XHTML and HTML4 documents will *not* validate in "identical
> regimes". ...
As HTML 3.2 and HTML 4.01 will not validate in identical regimes, and as Docbook 3 and Docbook 4 will not validate in identical regimes. Or Docbook 4 and its XML counterpart.

Basically the same relation exists between XHTML 1.0 and HTML 4.01. "Tidy" is visibly out there, so I don't see anything hidden.

The fact that one of them is also XML is not that important, and certainly not important enough to justify the claim of a user agent advocate that HTML 4.01 and XHTML 1.0 should live under different HTTP and SMTP content types. That is the current issue.

No public spec so far has suggested that text/xml should be the preferred content type for XHTML although RFC3023 allows any XML document type that degrades to text/plain to be so shipped.

From the standpoint of vocabulary and content, XHTML is HTML. The text/xml content type is an umbrella for a vast world of entirely different things, many of which have no place under the eye of mass market user agents except possibly for transport.

-- Bill
Received on Friday, 29 June 2001 08:30:22 UTC
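
Not part of the original message: the two formal rules argued in the post (a missing doctype declaration means HTML 2.0, and char 338 does not belong in an HTML 2.0 document) can be illustrated with a minimal Python sketch. The names `FPI_TO_VERSION`, `declared_version`, and `chars_outside_html2` are hypothetical, the FPI table lists only one public identifier per version, and treating the HTML 2.0 character set as code points 0 to 255 is a simplifying assumption about its ISO 8859-1 based SGML declaration.

```python
import re

# One published FPI per version, to keep the sketch short (assumption:
# real validators know the Transitional/Frameset variants as well).
FPI_TO_VERSION = {
    "-//IETF//DTD HTML 2.0//EN": "HTML 2.0",
    "-//W3C//DTD HTML 3.2 Final//EN": "HTML 3.2",
    "-//W3C//DTD HTML 4.0//EN": "HTML 4.0",
    "-//W3C//DTD HTML 4.01//EN": "HTML 4.01",
    "-//W3C//DTD XHTML 1.0 Strict//EN": "XHTML 1.0",
}

# Matches only the double-quoted PUBLIC form of a doctype declaration.
DOCTYPE_RE = re.compile(r'<!DOCTYPE\s+html\s+PUBLIC\s+"([^"]+)"', re.IGNORECASE)


def declared_version(document):
    """Return the version named by the doctype FPI.

    With no doctype declaration, fall back to HTML 2.0, the only version
    for which the declaration is optional. Unknown FPIs give None.
    """
    match = DOCTYPE_RE.search(document)
    if match is None:
        return "HTML 2.0"
    return FPI_TO_VERSION.get(match.group(1))


def chars_outside_html2(document):
    """List characters beyond code point 255 (simplified HTML 2.0 charset)."""
    return [ch for ch in document if ord(ch) > 255]


example = "<title>A Test</title><body><p>\u0152</body>"
print(declared_version(example))                        # formal default version
print([ord(ch) for ch in chars_outside_html2(example)])  # offending code points
```

Under these assumptions the example from the post is formally an HTML 2.0 document by default, yet it contains a character that HTML 2.0 does not admit, which is the mismatch the message describes.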