Re: XHTML character entity support from Ian Hickson on 2010-02-11 (public-html@w3.org from February 2010)

From: Ian Hickson <ian@hixie.ch>
Date: Thu, 11 Feb 2010 11:13:16 +0000 (UTC)
To: HTML WG <public-html@w3.org>
Message-ID: <Pine.LNX.4.64.1002111059430.29686@ps20323.dreamhostps.com>

On Fri, 30 Oct 2009, Alexey Proskuryakov wrote:
> 
> As noted in 
> <http://www.whatwg.org/specs/web-apps/current-work/#writing-xhtml-documents>, 
> there is no guarantee that authors can use character entity references 
> such as &nbsp; in XHTML, because XML parsers are not required to process 
> external DTD subsets. This works in at least Firefox, Safari and Opera, 
> but it's depressing that such a major feature is not interoperable per 
> the spec.

HTML5 now attempts to navigate the XML spec in a manner that encourages 
interoperability here as much as possible without strictly violating XML's 
requirements on the matter.
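
As an illustration of the underlying problem outside a browser, here is a 
minimal sketch using Python's standard-library XML parser, which (like 
many XML parsers) does not fetch the external DTD subset. The helper name 
`parses` and the sample documents are made up for this example; the 
behaviour shown is that of this one parser, not of any browser.

```python
import xml.etree.ElementTree as ET


def parses(doc: str) -> bool:
    """Return True if the stock parser accepts the document."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


# Only the five entities predefined by XML itself.
BUILTIN_ONLY = "<p>&amp;&lt;&gt;&quot;&apos;</p>"

# No DOCTYPE at all: &nbsp; is undeclared.
DTDLESS_NBSP = "<p>a&nbsp;b</p>"

# XHTML DOCTYPE present, but this parser never loads the external subset,
# so it has no declaration for &nbsp; either.
WITH_DOCTYPE_NBSP = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<body><p>a&nbsp;b</p></body></html>"
)

for label, doc in [
    ("built-in only", BUILTIN_ONLY),
    ("DTDless &nbsp;", DTDLESS_NBSP),
    ("DOCTYPE + &nbsp;", WITH_DOCTYPE_NBSP),
]:
    print(label, "->", "ok" if parses(doc) else "rejected")
```

The same markup that browsers accept by special-casing the DOCTYPE is 
rejected here, which is the interoperability gap being discussed.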


> I think that it's important to guarantee that character entity 
> references work in XHTML

Insofar as XML allows us to guarantee interoperability at all, I have now 
done so.


> (even when parsing fragments, e.g. with innerHTML - which doesn't 
> currently work in Firefox or Safari, and is confusing to authors).

I have not done this; innerHTML on elements does not support entities in 
XML documents. I would discourage use of this API, and true XHTML is 
pretty rare anyway, so it doesn't seem worth the additional potential 
engineering cost to add this.


On Mon, 2 Nov 2009, Henri Sivonen wrote:
>
> There are three classes of documents:
> 
> DTDless: Entities other than the 5 built-in ones must not "work" in 
> these. Here we have interop: 
> http://hsivonen.iki.fi/test/moz/entity-without-dtd.xhtml
> 
> Known DTD: The browser pretends to have loaded the DTD from the network 
> but actually does something else. Here we have interop, too, to the 
> extent the list of known DTDs is the same: 
> http://hsivonen.iki.fi/test/moz/entity-with-known-dtd.xhtml
> 
> Bogus DTD: Here we don't have interop: Opera falls back to behaving like 
> an XML parser that hasn't loaded the DTD. Gecko and WebKit resolve the 
> bogus DTD to a zero-length stream and then let the XML parser proceed 
> thinking it has read the DTD (hence invoking the clauses of the XML spec 
> that make unknown entity refs fatal). Well, that's what Gecko does. I 
> didn't check WebKit's code, but the black-box behavior is the same. 
> http://hsivonen.iki.fi/test/moz/entity-with-bogus-dtd.xhtml

A strict reading of the text in the spec now implies Opera's behaviour, I 
believe. I can change that if people think we should make all external 
entities resolve; it seemed like that would cross the line into violating 
XML more explicitly, which is why I avoided doing this.
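
The three classes above can be sketched as a small decision function. 
This is a rough model under stated assumptions, not spec text: the names 
`Outcome`, `entity_outcome` and `bogus_dtd_is_fatal` are hypothetical, 
`KNOWN_PUBLIC_IDS` is only an illustrative subset of a known-DTD list, 
internal subsets are ignored, and "unresolved, non-fatal" is one reading 
of how a parser that has not loaded the DTD treats an undeclared entity.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    EXPANDED = "expanded"
    FATAL = "fatal error"
    UNRESOLVED = "unresolved, non-fatal"


# The five entities predefined by XML.
BUILTIN = {"amp", "lt", "gt", "quot", "apos"}

# Illustrative subset only; a real list would be longer.
KNOWN_PUBLIC_IDS = {
    "-//W3C//DTD XHTML 1.0 Strict//EN",
    "-//W3C//DTD MathML 2.0//EN",
}


def entity_outcome(
    name: str,
    public_id: Optional[str],
    bogus_dtd_is_fatal: bool = False,
) -> Outcome:
    """Model what happens to &name; in a document.

    public_id is None for a DTDless document. For a known DTD, `name`
    is assumed to be a real HTML entity name such as "nbsp".
    """
    if name in BUILTIN:
        return Outcome.EXPANDED
    if public_id is None:
        # DTDless: anything beyond the built-in five must not work.
        return Outcome.FATAL
    if public_id in KNOWN_PUBLIC_IDS:
        # Known DTD: the browser pretends it loaded the DTD.
        return Outcome.EXPANDED
    # Bogus DTD: behave as if the DTD was not loaded (default), or as if
    # an empty DTD was read, which makes the reference fatal.
    return Outcome.FATAL if bogus_dtd_is_fatal else Outcome.UNRESOLVED


print(entity_outcome("nbsp", None))
print(entity_outcome("nbsp", "-//W3C//DTD XHTML 1.0 Strict//EN"))
print(entity_outcome("nbsp", "-//EXAMPLE//DTD Bogus//EN"))
print(entity_outcome("nbsp", "-//EXAMPLE//DTD Bogus//EN", bogus_dtd_is_fatal=True))
```

The `bogus_dtd_is_fatal` switch is the only point where the behaviours 
described above diverge; the other two classes already interoperate.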


> IIRC, WebKit's list of known doctypes doesn't cover the legacy MathML 
> doctypes that Gecko's list covers and that are used in legacy content, 
> so I think we should standardize Gecko's list--not WebKit's list--if we 
> end up standardizing a list.

The only difference was "-//W3C//DTD MathML 2.0//EN" which was only 
present in Gecko's list:

   http://trac.webkit.org/browser/trunk/WebCore/dom/XMLTokenizerLibxml2.cpp#L1245
   http://mxr.mozilla.org/mozilla-central/source/parser/htmlparser/src/nsExpatDriver.cpp#287

I've used Gecko's list here. This is probably a violation of MathML's 
rules too, though I haven't mentioned that in the spec.


> If we standardize a list, there's the question of whether the list is a 
> minimum list or a closed list forever. (See 
> http://groups.google.com/group/mozilla.dev.tech.mathml/browse_thread/thread/e7f7efbb5e161348/9fde74f46fb0b5d2 
> )

Given that the entity list is no longer growing, and that DTDs are no 
longer useful other than for entities, I've made it a closed list.


The list of entities is the complete list of entities supported in 
text/html, for all public identifiers. This will cause a small memory 
footprint increase in WebKit, though that hit would be taken anyway when 
implementing the HTML5 parser. It will also impose some engineering cost 
on Gecko and Opera to avoid a performance regression (since their parsers 
parse the external subset each time), though for optimal performance in 
XML modes, such engineering work would likely be needed anyway.
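
One way to picture the "single entity table for all public identifiers" 
approach, as opposed to re-parsing an external subset per document, is a 
lookup table built once and shared. The sketch below uses the `html5` 
mapping from Python's standard `html.entities` module as a stand-in for 
the text/html entity list; whether it matches the spec's list exactly is 
an assumption, and `build_entity_table` and `expand` are hypothetical 
names.

```python
from html.entities import html5
from typing import Dict, Optional


def build_entity_table() -> Dict[str, str]:
    """Build a name -> replacement-text table once.

    html5 keys look like "nbsp;" (a few legacy names also appear without
    the semicolon). XML references always end in ";", so only the
    semicolon-terminated keys are kept, with the ";" stripped.
    """
    return {
        name[:-1]: value
        for name, value in html5.items()
        if name.endswith(";")
    }


# Built once at start-up and reused for every document, whatever its
# public identifier, instead of parsing a DTD each time.
ENTITY_TABLE = build_entity_table()


def expand(name: str) -> Optional[str]:
    """Return the replacement text for &name;, or None if unknown."""
    return ENTITY_TABLE.get(name)


print(repr(expand("nbsp")))
print(repr(expand("nosuchentity")))
```

The memory cost mentioned above corresponds to keeping `ENTITY_TABLE` 
resident; the performance point corresponds to never re-parsing a DTD.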

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Thursday, 11 February 2010 11:13:45 UTC