Re: Revising the parse mode detection code from Bjoern Hoehrmann on 2004-09-06 (public-qa-dev@w3.org from September 2004)

From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Mon, 06 Sep 2004 03:30:18 +0200
To: Nick Kew <nick@webthing.com>
Cc: public-qa-dev@w3.org
Message-ID: <4166b8cb.184563368@smtp.bjoern.hoehrmann.de>

* Nick Kew wrote:
>>   * the document has a document type declaration with a public
>>     identifier that when split at // has a third component which
>>     matches /^DTD\s+(\S+)/ for which $1 matches /XHTML/
>>
>>   * no public/system identifier but a <html> root element with an
>>     explicitly *specified* xmlns attribute with a value of
>>     "http://www.w3.org/1999/xhtml"
>
>That's too eager IMO.

I should point out that this is less eager than the current code in
terms of which documents are considered XHTML.
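
For illustration only, the two quoted conditions could be sketched in
Python roughly as below. This is not the validator's code: the function
names and the idea of passing in a dict of explicitly specified root
attributes are assumptions made for the example, and it covers only the
two conditions quoted above.

```python
import re

XHTML_NS = "http://www.w3.org/1999/xhtml"


def fpi_looks_like_xhtml(public_id):
    # Split the public identifier at "//" and look at the third component.
    parts = public_id.split("//")
    if len(parts) < 3:
        return False
    # Equivalent of /^DTD\s+(\S+)/ ; re.match anchors at the start.
    m = re.match(r"DTD\s+(\S+)", parts[2])
    # $1 must match /XHTML/ (case-sensitive substring test).
    return bool(m and "XHTML" in m.group(1))


def root_looks_like_xhtml(root_name, specified_attrs):
    # specified_attrs is assumed to hold only attributes written in the
    # start tag, not values defaulted from a DTD.
    return root_name == "html" and specified_attrs.get("xmlns") == XHTML_NS


def is_xhtml(public_id, system_id, root_name, specified_attrs):
    # Condition 1: a public identifier that names an XHTML DTD.
    if public_id is not None:
        return fpi_looks_like_xhtml(public_id)
    # Condition 2: no public/system identifier, but an <html> root with
    # an explicitly specified XHTML namespace.
    if system_id is None:
        return root_looks_like_xhtml(root_name, specified_attrs)
    return False
```

With this sketch, "-//W3C//DTD XHTML 1.1//EN" is treated as XHTML while
"-//W3C//DTD HTML 4.01//EN" is not, because the token captured after
"DTD" is "HTML" and does not contain "XHTML".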

>Appendix C applies only to XHTML 1.0.  So we should permit XHTML-as-
>text/html only if the document uses one of the three XHTML 1.0 FPIs.

I am not sure about that. http://www.w3.org/TR/xhtml-media-types/ says
SHOULD NOT, not MUST NOT, and there are already hundreds of thousands of
XHTML 1.1 text/html documents on the web, including the XHTML 1.1
Recommendation itself. It is too late for us to tell them they are all
wrong.

And even if we agreed to tell them they are all wrong, why would doing
that by improperly processing the document using the HTML 4.01 SGML
declaration be better than properly processing it using the XML SGML
declaration, or even a proper XML processor, and complaining only about
the MIME type? IOW, how would less eager XHTML detection code make the
service any better for its users?

>Will your code do a better job of dealing with hixie's pathological
>use of comments?  Do we parse them as SGML or XML to determine whether
>the document is SGML or XML?

The document would be processed using the HTML 4.01 SGML declaration,
and yes, it would fix "bug" #14 and a number of related "bugs".

Received on Monday, 6 September 2004 01:31:07 UTC