[whatwg] HTML5 doctypes incompatible with XHR if named entities present

I already filed a bug
<http://www.w3.org/Bugs/Public/show_bug.cgi?id=8268>, but figured I'd
copy it here to get more discussion.

Wikipedia just experimented with switching to an HTML5 doctype.  A lot
of user tools broke, and after two hours of investigation, we
determined that the problem is intractable and switched back to XHTML
1.0 Transitional.

XMLHttpRequest was historically intended only for XML, and lots of
scripts rely on the responseXML property being set to a Document.  In
current browsers, this only happens when the document is actually
well-formed XML.  XML itself predefines only five named entities
(&lt;, &gt;, &amp;, &apos;, &quot;), and browsers treat all other
named entities differently based on the doctype.  Consider this
document:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html><head>
<title>Hello</title>
</head>
<body>
<p>&nbsp;</p>
</body>
</html>

This parses just fine in all the browsers I tested (recent versions of
Firefox, Chrome, and Opera).  However, if you serve the exact same
document but replace the doctype with <!DOCTYPE html>, all of them
throw a syntax error on &nbsp;.
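
The same well-formedness failure can be reproduced outside a browser.
The following is a minimal sketch, not the browsers' XHR code path: it
assumes Python's standard xml.etree.ElementTree parser (backed by
expat) as a stand-in for a generic XML parser, and the helper name
parses_as_xml is made up for illustration.  It only shows that &nbsp;
is rejected under <!DOCTYPE html>, while the numeric reference &#160;
is accepted.

```python
import xml.etree.ElementTree as ET

# Same body as the example above, with the short HTML5 doctype.
HTML5_NAMED = (
    "<!DOCTYPE html>\n"
    "<html><head>\n"
    "<title>Hello</title>\n"
    "</head>\n"
    "<body>\n"
    "<p>&nbsp;</p>\n"
    "</body>\n"
    "</html>\n"
)

# Identical, except the named entity is written as a numeric reference.
HTML5_NUMERIC = HTML5_NAMED.replace("&nbsp;", "&#160;")


def parses_as_xml(text):
    """Return True if `text` is accepted by the XML parser (hypothetical helper)."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    print(parses_as_xml(HTML5_NAMED))    # named entity, no DTD declaring it
    print(parses_as_xml(HTML5_NUMERIC))  # numeric character reference
```

Numeric character references need no declaration, which is why the
second document is accepted regardless of doctype.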

Practically speaking, this means that any site that wants to serve
content compatible with XHR cannot use either of the two doctypes that
the spec recommends for authors.  There are a variety of widely-used
scripts on Wikipedia that rely on XHR, so this is currently a blocker
for us.  It's very unlikely that we'll deploy HTML5 in the foreseeable
future if it means our users have to rewrite all their scripts.  I'm
pretty sure that XHR is used for screen-scraping beyond Wikipedia, so
this will probably crop up elsewhere too.

I don't know the extent of the magic that causes this problem.  Could
some reasonably minimal, distinctive doctype be invented that
would avoid the problem but not make the document look to humans and
validators like it thinks it's some old version of XHTML?  If an
existing XHTML doctype must be reused, should validators continue to
raise warnings as they do now, or should an XHTML doctype be promoted
from "obsolete permitted DOCTYPE" to a fully permitted doctype?
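
One candidate to evaluate is a doctype that declares the entity itself
in an internal subset.  This is a sketch under assumptions, not a
recommendation: it uses Python's xml.etree.ElementTree as a stand-in
XML parser, it has not been checked against any browser's XHR
implementation, and HTML parsers don't process an internal subset, so
it says nothing about how the page behaves when served as text/html.
The name DOC is illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical doctype: the short HTML5 form plus an internal subset
# that declares &nbsp; as U+00A0 (NO-BREAK SPACE).
DOC = (
    '<!DOCTYPE html [<!ENTITY nbsp "&#160;">]>\n'
    "<html><head>\n"
    "<title>Hello</title>\n"
    "</head>\n"
    "<body>\n"
    "<p>&nbsp;</p>\n"
    "</body>\n"
    "</html>\n"
)

root = ET.fromstring(DOC)
paragraph = root.find("body/p")
# The entity reference is expanded to the declared replacement text.
print(repr(paragraph.text))
```

This only addresses XML well-formedness; every named entity a page
uses would need its own declaration.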

Also, is this a wider problem?  Are there any other tools besides
browsers that might be magically allowing named entities for some
doctypes only?

Received on Wednesday, 11 November 2009 20:16:02 UTC