Re: FPI Mythology (was: XHTML Considered Harmful)

Arjun --

I've not disagreed with your statements about HTML documents when
viewed as SGML applications.

My view, however, of the various HTML-under-SGML specs (RFC1866, then
versions 3.2, 4.0, and 4.01, and now the several versions of HTML as
XML) is that an HTML document is an SGML application that must also
meet other requirements.

Formally, this means that an HTML document is something that always
gives rise to an SGML application, but it is not correct to say that
it "is" an SGML application.  One can say that it "degrades to" an
SGML application as a short form of indicating that for a given
HTML document there is a canonically associated SGML application.

A validating agent must know how to perform the canonical association.

It would be mischief to ship the completely assembled version of an
HTML 2.0 document as an SGML application (SGML declaration, document
type definition, and instance) under the purview of RFC1866 through
HTTP.

But if an HTML document "is" an SGML application, then shipping it
that way should be sensible.

The distinction between an HTML document and the SGML application to
which it canonically gives rise is the reason why Ian Hickson's
example of a valid HTML-as-SGML-application that begins with "<?xml
... >" fails to prove his point.  As I explained here some time ago,
an XHTML-aware user agent presented with that example must give first
priority to the PI named "xml", a reserved name in PI space for
XHTML-aware user agents, and must then reject the document for not
qualifying as conforming XML.
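
A rough sketch of that priority rule, in Python.  The function name,
the three outcome labels, and the sample documents are my own
illustrations, not anything taken from a spec or from Ian's actual
example, and a well-formedness check with a non-validating parser
only approximates "conforming XML":

```python
import re
import xml.etree.ElementTree as ET

# Matches an XML declaration at the very start of the document,
# but not other PIs such as <?xml-stylesheet ...?>.
XML_DECL = re.compile(r"<\?xml[\s?]")

def classify(document: str) -> str:
    """Hypothetical triage by an XHTML-aware user agent."""
    if not XML_DECL.match(document):
        return "handle-as-sgml"
    try:
        # The "xml" PI takes priority, so the document must parse as XML.
        ET.fromstring(document.encode("utf-8"))
    except ET.ParseError:
        return "reject"
    return "handle-as-xml"

sgml_style = '<?xml version="1.0"?><title>A Test</title><body><p>text</body>'
xml_style = ('<?xml version="1.0"?><html><head><title>A Test</title></head>'
             '<body><p>text</p></body></html>')
```

With these samples, the first document is rejected because the
omitted tags leave it not well-formed, while the second is accepted
as XML.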

> > If a late version of HTML has a larger charset than an early version,
> > then it is formally wrong to allow the larger charset in something
> > specified as the early version.
> 
> I don't see how, if the newer set were a proper superset of the older
> one.  (The whole character set business has been handled less than
> optimally, IMHO, but that's a separate discussion.)

The example

            <title>A Test</title><body><p>&#338;</body>

can be validated against the 2.0 DTD if the SGML declaration for 4.01
is used, but not with the correct (2.0) declaration.  It would be
wrong to use an FPI for 2.0 or to say that it is an HTML 2.0 document
because character 338 is not in the character set for 2.0.
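
The character set check can be sketched as follows.  This is a
simplification under stated assumptions: I take the upper bound of
the 2.0 document character set to be code position 255 (ISO 8859-1)
and that of 4.01 to be 0x10FFFF (ISO 10646), and I ignore the unused
control positions in both declarations.  The names are hypothetical.

```python
import re

# Assumed upper bounds of the document character set per version.
MAX_CODE_POSITION = {"2.0": 255, "4.01": 0x10FFFF}

# Decimal numeric character references such as &#338;
NUMERIC_REF = re.compile(r"&#(\d+);")

def out_of_range_refs(instance: str, version: str) -> list:
    """Return the referenced code positions that exceed the version's bound."""
    limit = MAX_CODE_POSITION[version]
    return [int(n) for n in NUMERIC_REF.findall(instance) if int(n) > limit]

sample = "<title>A Test</title><body><p>&#338;</body>"
```

Under these assumptions the sample yields one out-of-range reference
(338) when checked as 2.0 and none when checked as 4.01.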

> > and has specified a particular form of document type declaration
> > construction using one of a small list of FPI's.
> 
> Actually, no.  They have done the right thing in publishing FPIs for

In RFC1866 a doctype declaration is not required, but it is required
for W3C HTML 3.2 and for each subsequent W3C version.

> of convenience.  The core validation requirement is that an instance
> validate with respect to a (specific) declaration subset.  To this
> end, the particular form of a document type declaration - or even, in
> fact, its presence - is irrelevant. 

It's relevant in the context of the web where one cannot ship anything
more than the instance with a short prolog.  For example, in the 4.01
spec, section 7.1 says that a document must begin with a "line
containing HTML version information".

> I disagree.  In a nutshell, you're proposing that a validation system
> do nothing until it has sniffed an FPI in a document type declaration,

The word "sniffing" is inappropriate.

A doctype declaration is required for all but 2.0.  So formally if
there is no doctype declaration and the HTML is assumed to be in the
W3C family of document types, one should assume 2.0.  (But in practice
I would assume 4.01, and that might be a glitch in the system.)  There
are other definitions of HTML, but they are not really suited for use
on the web.
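
As a sketch of that formal rule: the lookup below maps the FPI in
the doctype declaration to a version and falls back to 2.0 when no
declaration is present.  The table is deliberately partial (the
Transitional and Frameset variants are omitted), and the function
name and the "unknown" label are my own assumptions.

```python
import re

# A partial table of published FPIs; variants omitted for brevity.
FPI_TO_VERSION = {
    "-//IETF//DTD HTML 2.0//EN": "2.0",
    "-//W3C//DTD HTML 3.2 Final//EN": "3.2",
    "-//W3C//DTD HTML 4.0//EN": "4.0",
    "-//W3C//DTD HTML 4.01//EN": "4.01",
    "-//W3C//DTD XHTML 1.0 Strict//EN": "XHTML 1.0",
}

DOCTYPE = re.compile(r'<!DOCTYPE\s+html\s+PUBLIC\s+"([^"]+)"', re.IGNORECASE)

def html_version(document: str) -> str:
    """Pick the version whose declaration and DTD should be used."""
    match = DOCTYPE.search(document)
    if match is None:
        # Only 2.0 permits omitting the doctype declaration.
        return "2.0"
    return FPI_TO_VERSION.get(match.group(1), "unknown")
```

A validating agent following this rule would then select the SGML
declaration and DTD that belong to the returned version.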

> at which point it should use the FPI to resolve all other requirements
> such as appropriate SGML declarations and the like.  The main purpose
> of such guesswork, apparently, is to *hide* from ordinary people the
> fact that XHTML and HTML4 documents will *not* validate in "identical
> regimes".  ...

Just as HTML 3.2 and HTML 4.01 will not validate in identical regimes,
and just as DocBook 3 and DocBook 4 will not, nor DocBook 4 and its
XML counterpart.  Basically the same relation exists between
XHTML 1.0 and HTML 4.01.

"Tidy" is visibly out there, so I don't see anything hidden.

The fact that one of them is also XML is not that important, and
certainly not important enough to justify the claim of a user agent
advocate that HTML 4.01 and XHTML 1.0 should live under different HTTP
and SMTP content types.

That is the current issue.  No public spec so far has suggested that
text/xml should be the preferred content type for XHTML although
RFC3023 allows any XML document type that degrades to text/plain to be
so shipped.

From the standpoint of vocabulary and content, XHTML is HTML.

The text/xml content type is an umbrella for a vast world of entirely
different things, many of which have no place under the eye of mass
market user agents except possibly for transport.

                                    -- Bill

Received on Friday, 29 June 2001 08:30:22 UTC