Re: XHTML character entity support

On Oct 31, 2009, at 5:00 PM, Boris Zbarsky wrote:

> On 10/31/09 5:32 PM, Alexey Proskuryakov wrote:
>> WebKit does not use a validating parser, but it does support XHTML
>> named entities. I'm not quite sure about Firefox.
>
> Likewise.  Firefox loads
> http://hg.mozilla.org/mozilla-central/file/4597c9ddc1ff/content/xml/content/src/xhtml11.dtd
> (which pretty much just defines the relevant named entities) when
> it detects certain doctypes.  See the table at
> http://hg.mozilla.org/mozilla-central/file/4597c9ddc1ff/parser/htmlparser/src/nsExpatDriver.cpp#l287
>
> Firefox can also load external DTDs if they satisfy certain
> constraints (e.g. being installed as part of the app itself).  See
> http://hg.mozilla.org/mozilla-central/file/4597c9ddc1ff/parser/htmlparser/src/nsExpatDriver.cpp#l785
>
> The DTD is only really used for ID attribute names and named  
> entities; no validation is performed.

It would be good for some spec to define the Gecko/WebKit behavior.  
The crux of the issue is this. The XML spec allows XML processors to  
be one of the following:

A) A validating processor, which must read all external DTDs, process
the declared entities, expand the entities when appropriate, and
report violations of DTD constraints.
B) A non-validating processor, which does not read external DTDs and
does not provide any entities other than the ones predefined for XML
and any defined in the internal subset.
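
To make option B concrete, here is a minimal sketch using Python's
xml.etree.ElementTree, which I am using only as a convenient example of
a parser that does not fetch external DTDs (it is not what any browser
does). The helper name try_parse and the sample documents are made up
for illustration. An entity declared in the internal subset expands,
while the same entity declared only in the external XHTML DTD is
reported as undefined.

```python
import xml.etree.ElementTree as ET

# Entity declared in the internal subset: the parser sees the declaration
# directly in the document, so the reference can be expanded.
internal = '<!DOCTYPE p [<!ENTITY nbsp "&#160;">]><p>a&nbsp;b</p>'

# Entity declared only in the external XHTML DTD, which this parser
# never fetches, so the reference stays undefined.
external = (
    '<!DOCTYPE p PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
    '<p>a&nbsp;b</p>'
)


def try_parse(document):
    """Return the root element's text, or a short error string on failure."""
    try:
        return ET.fromstring(document).text
    except ET.ParseError as exc:
        return "error: " + str(exc)


print(repr(try_parse(internal)))
print(repr(try_parse(external)))
```

The second document is the case that real XHTML content hits, which is
why B on its own is not workable for browsers.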

Neither A nor B is practical for the Web. Running a full validating  
parser is too heavyweight, so A is not an option. But there's XHTML  
content out there that does use XHTML entities; failing to expand  
these entity references results in undesired behavior. So B is also  
not an option.

In theory interoperable XML content should never use anything but the  
built-in XML entities, unless it can guarantee that it will only ever  
be processed with a validating parser. In practice, that's not what  
happens. XHTML content uses the XHTML entities. And browsers that  
don't handle it are perceived as broken.

In practice, browsers compromise: they recognize certain DTDs and
define the relevant entities, but do not validate. This is arguably
against the spirit of the XML spec, but I think it is the practical
choice.
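
The compromise can be sketched roughly as follows, using Python's
binding to Expat. This is an illustration under my own assumptions, not
the actual Gecko or WebKit code: the BUNDLED_DTDS catalog, the text_of
helper, and the two-entity "DTD" are all hypothetical stand-ins for the
full entity set a browser would ship. The parser looks up the doctype's
public ID in a local catalog, feeds the bundled entity declarations to
Expat, and never fetches anything or validates.

```python
import xml.parsers.expat as expat

# Hypothetical catalog: doctype public ID -> bundled entity declarations.
# A real browser would ship the full XHTML entity set; two entries are
# enough for this sketch.
BUNDLED_DTDS = {
    "-//W3C//DTD XHTML 1.0 Strict//EN":
        '<!ENTITY nbsp "&#160;"><!ENTITY eacute "&#233;">',
}

XHTML_STRICT = (
    '<!DOCTYPE p PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
)


def text_of(document):
    """Return the character data of document, expanding bundled entities."""
    chunks = []
    parser = expat.ParserCreate()
    # Ask Expat to report the external DTD subset instead of ignoring it.
    parser.SetParamEntityParsing(expat.XML_PARAM_ENTITY_PARSING_ALWAYS)

    def load_external(context, base, system_id, public_id):
        declarations = BUNDLED_DTDS.get(public_id)
        if declarations is not None:
            # Parse the bundled declarations in a sub-parser; nothing is
            # fetched from system_id.
            sub = parser.ExternalEntityParserCreate(context)
            sub.Parse(declarations, True)
        return 1  # nonzero tells Expat to carry on with the document

    parser.ExternalEntityRefHandler = load_external
    parser.CharacterDataHandler = chunks.append
    parser.Parse(document, True)
    return "".join(chunks)


print(text_of(XHTML_STRICT + "<p>caf&eacute;</p>"))
```

Note that the sample document uses p as its root element, which the
XHTML DTD would not allow; the parse still succeeds because the DTD is
consulted only for entity definitions, never for validation.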

Regards,
Maciej

Received on Sunday, 1 November 2009 00:44:39 UTC