Re: XHTML character entity support

On Oct 31, 2009, at 01:10, Alexey Proskuryakov wrote:

> As noted in <http://www.whatwg.org/specs/web-apps/current-work/#writing-xhtml-documents>,
> there is no guarantee that authors can use character entity
> references such as &nbsp; in XHTML, because XML parsers are not  
> required to process external DTD subsets. This works in at least  
> Firefox, Safari and Opera, but it's depressing that such a major  
> feature is not interoperable per the spec.

The above is an oversimplification. There are three classes of  
documents:

DTDless: Entities other than the 5 built-in ones must not "work" in  
these. Here we have interop:
http://hsivonen.iki.fi/test/moz/entity-without-dtd.xhtml

Known DTD: The browser pretends to have loaded the DTD from the  
network but actually does something else. Here we have interop, too,  
to the extent the list of known DTDs is the same:
http://hsivonen.iki.fi/test/moz/entity-with-known-dtd.xhtml

Bogus DTD: Here we don't have interop: Opera falls back to behaving  
like an XML parser that hasn't loaded the DTD. Gecko and WebKit  
resolve the bogus DTD to a zero-length stream and then let the XML  
parser proceed thinking it has read the DTD (hence invoking the  
clauses of the XML spec that make unknown entity refs fatal). Well,  
that's what Gecko does. I didn't check WebKit's code, but the
black-box behavior is the same.
http://hsivonen.iki.fi/test/moz/entity-with-bogus-dtd.xhtml
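
To make the difference between the DTDless case and "an XML parser that
hasn't loaded the DTD" concrete, here is a minimal sketch using Python's
Expat binding (xml.parsers.expat). This is not any browser's code; the
helper name parse() and the sample markup are made up for illustration.
By default the binding does not read the external DTD subset, so it
behaves like a processor that never loaded the DTD.

```python
import xml.parsers.expat as expat

XHTML_BODY = ('<html xmlns="http://www.w3.org/1999/xhtml">'
              '<body><p>a&nbsp;b</p></body></html>')
DOCTYPE = ('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
           '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">')


def parse(doc):
    """Parse doc without loading any external DTD subset.

    Returns (outcome, text, unresolved): outcome is "ok" or the Expat
    error string, text is the character data seen so far, and unresolved
    lists entity names that were skipped rather than expanded.
    """
    text, unresolved = [], []
    parser = expat.ParserCreate()
    parser.CharacterDataHandler = text.append

    def on_default(data):
        # Expat passes unexpanded entity references to the default
        # handler verbatim (e.g. "&nbsp;"); other markup is ignored here.
        if data.startswith("&") and data.endswith(";"):
            unresolved.append(data[1:-1])

    parser.DefaultHandler = on_default
    try:
        parser.Parse(doc, True)
    except expat.ExpatError as err:
        return expat.ErrorString(err.code), "".join(text), unresolved
    return "ok", "".join(text), unresolved


dtdless = parse(XHTML_BODY)                # no doctype at all
unloaded_dtd = parse(DOCTYPE + XHTML_BODY)  # doctype present, DTD not read
builtin_only = parse("<p>a &amp; b</p>")    # one of the 5 built-in entities
```

As far as I know, the DTDless document fails with Expat's "undefined
entity" error, while the document with a doctype parses and merely
reports nbsp as unresolved, because the declaration might have been in
the external subset that was never read.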

IIRC, WebKit's list of known doctypes doesn't cover the legacy MathML
doctypes that Gecko's list covers and that are used in legacy content,  
so I think we should standardize Gecko's list--not WebKit's list--if  
we end up standardizing a list. If we standardize a list, there's the
question of whether the list is a minimum list or a list that is
closed forever.
(See http://groups.google.com/group/mozilla.dev.tech.mathml/browse_thread/thread/e7f7efbb5e161348/9fde74f46fb0b5d2 )

Opera's behavior in the unknown (bogus) DTD case is cleaner than
Gecko's and WebKit's behavior from the XML spec POV, but I don't know
if the off-the-shelf parsers used by Gecko and WebKit have enough API
surface for that behavior. (I don't like it that Opera reportedly has
put new doctypes on the list of known doctypes, though. I'm in the
frozen list camp myself.)

> I think that it's important to guarantee that character entity  
> references work in XHTML (even when parsing fragments, e.g. with  
> innerHTML - which doesn't currently work in Firefox or Safari, and  
> is confusing to authors).

Test cases:
http://hsivonen.iki.fi/test/moz/innerHTML-no-doctype.xhtml
http://hsivonen.iki.fi/test/moz/innerHTML-xhtml1-doctype.xhtml

Opera supports entities in the innerHTML setter regardless of the doctype
of the document. Gecko and WebKit don't support entities in the
innerHTML setter.
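
For authors who need markup that survives a DTDless XML fragment parse,
numeric character references are the portable spelling. The following is
only a sketch of that author-side workaround, written in Python for
illustration rather than as browser script; to_numeric_refs() is a
hypothetical helper, and it uses the HTML 4 entity table from Python's
html.entities module as its name list. It is a plain regex rewrite, so it
does not understand CDATA sections or comments.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# The five entities every XML parser predefines; these are left alone.
PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
_ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def to_numeric_refs(markup):
    """Rewrite known named entity references as numeric ones."""
    def replace(match):
        name = match.group(1)
        if name in PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave built-ins and unknown names as-is
        return "&#%d;" % name2codepoint[name]
    return _ENTITY_REF.sub(replace, markup)


fragment = "<p>a&nbsp;b &amp; c</p>"
portable = to_numeric_refs(fragment)

# The rewritten fragment is well-formed without any DTD.
paragraph = ET.fromstring(portable)

# The original fragment is not: &nbsp; is undeclared.
try:
    ET.fromstring(fragment)
    original_is_well_formed = True
except ET.ParseError:
    original_is_well_formed = False
```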

Frankly, I'm a bit annoyed to see Opera supporting entities here,
because now we don't have a stable state and Gecko and WebKit may end
up putting engineering cycles into tweaking stuff that's marginal on
the Web scale, since application/xhtml+xml doesn't work in IE at all.

Why does Opera support entities here? It seems logical (as far as the  
XML spec goes) not to support entities here. Authors who use  
application/xhtml+xml are explicitly asking for XML. If they don't  
want XML the way it is, they shouldn't ask for it. I think we  
shouldn't paper over the flaws of XML one by one. Instead, I think we  
should take XML 1.0 as it is until the time is ripe and XML Core does  
XML5 all at once (with all the MathML entities predefined, the  
tokenizer state machine borrowed from HTML5, non-Draconian tree  
builder, no DTDs, etc.).

> For obvious performance reasons, it is impractical to ask UAs to  
> utilize validating XML parsers, so this guarantee may need to be  
> specified in a way that doesn't require full DTD support.

There are three classes of XML processors:
  1) Non-validating XML processors that don't process the external DTD  
subset.
  2) Non-validating XML processors that process the external DTD subset.
  3) Validating XML processors that process the external DTD subset.

It's not a dichotomy between #1 and #3.
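
As an illustration of class #2, here is a sketch of a non-validating
parse that does process the external DTD subset, but takes it from a
local table keyed by public ID instead of the network. It uses Python's
Expat binding; KNOWN_DTDS and parse_with_catalog() are hypothetical
names, and the two-entity "DTD" is a stand-in for a real entity set, not
the actual XHTML DTD. Nothing is validated: the sample document would not
pass XHTML validation (it has no head), yet it parses fine.

```python
import xml.parsers.expat as expat

# Hypothetical catalog: public ID -> locally stored declarations.
# Only two entities are declared here to keep the sketch short.
KNOWN_DTDS = {
    "-//W3C//DTD XHTML 1.0 Strict//EN":
        '<!ENTITY nbsp "&#160;">\n<!ENTITY eacute "&#233;">\n',
}

DOC = ('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
       '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
       '<html xmlns="http://www.w3.org/1999/xhtml">'
       '<body><p>caf&eacute;&nbsp;bar</p></body></html>')


def parse_with_catalog(doc):
    """Return the character data of doc, expanding catalog entities."""
    chunks = []
    parser = expat.ParserCreate()
    parser.CharacterDataHandler = chunks.append
    # Ask Expat to hand the external DTD subset to our resolver.
    parser.SetParamEntityParsing(
        expat.XML_PARAM_ENTITY_PARSING_UNLESS_STANDALONE)

    def resolve(context, base, system_id, public_id):
        dtd = KNOWN_DTDS.get(public_id)
        if dtd is not None:
            # Feed the local declarations in place of the remote DTD.
            sub = parser.ExternalEntityParserCreate(context)
            sub.Parse(dtd, True)
        # For unknown public IDs the external subset is left unread.
        return 1

    parser.ExternalEntityRefHandler = resolve
    parser.Parse(doc, True)
    return "".join(chunks)


text = parse_with_catalog(DOC)
```

The system identifier is never dereferenced here; only the public ID is
used as a lookup key, which is roughly the shape of the "Known DTD"
behavior described earlier in this message.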

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/

Received on Monday, 2 November 2009 12:53:02 UTC