Re: Revising the parse mode detection code

On Sun, 5 Sep 2004, Bjoern Hoehrmann wrote:

>   * the document has a document type declaration with a public
>     identifier that when split at // has a third component which
>     matches /^DTD\s+(\S+)/ for which $1 matches /XHTML/
>
>   * no public/system identifier but a <html> root element with an
>     explicitly *specified* xmlns attribute with a value of
>     "http://www.w3.org/1999/xhtml"

That's too eager IMO.

Appendix C applies only to XHTML 1.0.  So we should permit XHTML-as-
text/html only if the document uses one of the three XHTML 1.0 FPIs.
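
To make the difference concrete, here is a rough sketch of the two checks
side by side.  The function names are made up for illustration and are not
part of the validator; the whitespace normalisation of the public
identifier is an assumption about how the FPI reaches this code.

```python
import re

# The three XHTML 1.0 formal public identifiers.
XHTML10_FPIS = frozenset({
    "-//W3C//DTD XHTML 1.0 Strict//EN",
    "-//W3C//DTD XHTML 1.0 Transitional//EN",
    "-//W3C//DTD XHTML 1.0 Frameset//EN",
})


def quoted_heuristic(public_id):
    """The looser test quoted above: split at //, then pattern-match
    the third component (hypothetical helper, for comparison only)."""
    if public_id is None:
        return False
    parts = public_id.split("//")
    if len(parts) < 3:
        return False
    # re.match anchors at the start, like /^DTD\s+(\S+)/
    m = re.match(r"DTD\s+(\S+)", parts[2])
    return bool(m) and "XHTML" in m.group(1)


def allows_xhtml_as_text_html(public_id):
    """The stricter test: exact match against the XHTML 1.0 FPIs."""
    if public_id is None:
        return False
    # Assumption: collapse whitespace runs before comparing.
    normalised = " ".join(public_id.split())
    return normalised in XHTML10_FPIS
```

With the XHTML 1.1 identifier "-//W3C//DTD XHTML 1.1//EN", the quoted
heuristic accepts the document while the whitelist rejects it, which is
the case I am worried about.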

For documents served as text/html that are not identifiably XHTML 1.0,
we should expect HTML, and emit a stern warning if they look like
XML, for example any document starting with an XML declaration.
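
A minimal sketch of that warning, assuming the caller already knows the
media type and whether the doctype was identified as XHTML 1.0.  The
function name, parameters and message text are all hypothetical, and
skipping a leading byte order mark is my assumption about the input.

```python
import re

# "<?xml" followed by whitespace, so that "<?xml-stylesheet" does not count.
_XMLDECL = re.compile(r"<\?xml\s")


def parse_mode_warnings(media_type, document_text, is_xhtml10):
    """Return warnings for a text/html document that looks like XML
    but is not identifiably XHTML 1.0 (illustrative only)."""
    warnings = []
    # Assumption: ignore a leading BOM before looking for the declaration.
    text = document_text.lstrip("\ufeff")
    if media_type == "text/html" and not is_xhtml10 and _XMLDECL.match(text):
        warnings.append(
            "Document is served as text/html and is not XHTML 1.0, "
            "but starts with an XML declaration; parsing as HTML."
        )
    return warnings
```

The document is still parsed as HTML in this sketch; the warning only
tells the author that the XML-looking prologue is not being honoured.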

Will your code do a better job of dealing with Hixie's pathological
use of comments?  Do we parse them as SGML or XML to determine whether
the document is SGML or XML?  Hixie gives a first line that claims to
be valid SGML, but suggesting it as valid HTML4 seems to be stretching
a point.  Hixie's valid point is that Appendix C is trouble, but we
can't do anything about that.

> I would like to know whether there are any good reasons to use a
> different algorithm to determine the parse mode, whether everyone is
> okay to use SGML::Parser::OpenSP to do that, where I could maintain the
> tests in CVS and where code as the fragment above should go at this
> point (CVS repository, module names, etc.)

I'm not sure.  But using Hixie's contrivance as a yardstick looks to
be the way of madness.

-- 
Nick Kew

Received on Monday, 6 September 2004 00:43:36 UTC