Re: XHTML character entity support

On Oct 31, 2009, at 7:37 PM, Shelley Powers wrote:

> On Sat, Oct 31, 2009 at 8:33 PM, Maciej Stachowiak <mjs@apple.com> wrote:
>>
>> On Oct 31, 2009, at 6:22 PM, Shelley Powers wrote:
>>
>>>
>>>
>>> Yes, how the browsers work when it comes to DTDs and named entities
>>> has come up in the past [1][2].
>>>
>>> Case in point, Firefox, Safari, and Chrome don't allow named entities
>>> in XHTML+RDFa documents, even though the XHTML+RDFa DTD does reference
>>> the named entities.
>>>
>>> Oops
>>>
>>> But, still, we manage. We use numeric entities.
>>
>> I think it's fine to omit named entities from newly minted DTDs. In fact,
>> probably a good idea since it's the strict XML behavior and nothing stops
>> you from using an NCR or just a literal Unicode character in new content.
>>
>> But browsers need to handle named entities when some specific XHTML DTDs are
>> present, since there is a body of legacy content that depends on having the
>> XHTML set of entities. Handling content with the XHTML+RDFa DTD does not
>> have this constraint.
>>
>
> I can understand, and not. XHTML from the very beginning had rules
> having to do with named entities, and this has always been a
> constraint.

The problem is that content did not stick to the narrow path of these
rules. I suspect this comes from two unusual conditions: (1) XHTML 1.x
validators were validating XML processors, so they respected the
entities and did not flag them as errors; (2) chameleon content, served
as HTML to some UAs but as XHTML to others, worked fine with entities in
HTML mode. I believe this contributed to pressure for browsers to
support the standard XHTML named entities in XHTML in some form. On the
other hand, as I said, it's not practical for a browser to be a
validating XML processor.

I think it's a problem with the XHTML specs that they made named  
entity processing so unpredictable. The wisest thing for new content  
to do is to never use named entities other than the five predefined by  
XML. In the meantime, we have some old content already using named  
entities in XHTML, and it works today in Gecko-based and WebKit-based  
browsers (and thus, in most browsers that support XHTML at all). (I'm  
not sure what Opera does offhand.)
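
To illustrate why only the five predefined entities are safe, here is a
minimal sketch using Python's standard xml.etree.ElementTree, which is
backed by a non-validating parser. The helper name parses() is my own,
and this shows the behavior of that one parser with no DTD present, not
of any browser.

```python
import xml.etree.ElementTree as ET


def parses(markup):
    """Return True if the non-validating parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# The five entities predefined by XML need no DTD.
predefined = "<p>&lt;&gt;&amp;&apos;&quot;</p>"

# A numeric character reference (NCR) for U+00A0 also needs no DTD.
numeric = "<p>a&#160;b</p>"

# A named entity from the XHTML set is undefined without a DTD.
named = "<p>a&nbsp;b</p>"

for label, markup in [("predefined", predefined),
                      ("numeric", numeric),
                      ("named", named)]:
    print(label, "accepted" if parses(markup) else "rejected")
```

With no DTD in play, the first two documents are accepted and the third
is rejected as containing an undefined entity.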

>
> Regardless, there is no legacy content for HTML5.

HTML5 recommends using no DTD at all for XHTML5 content, or the short  
HTML5 <!doctype html> doctype. I agree that special entity processing  
is not necessary (or arguably even desirable) in those cases. However,  
when an HTML5 UA is faced with content using an XHTML1 DTD (and  
probably a short whitelist of other DTDs), it should do the special  
entity handling. This should be defined by a specification. I think  
that spec could be HTML5, since it strives to define compatible  
processing for older versions of HTML and XHTML, such that you can  
implement HTML5 in an existing browser engine without introducing  
additional mode switches.
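
As a rough sketch of what such whitelist-based handling could look
like: the function name resolve_entity() and the contents of the
whitelist below are my own assumptions for illustration, not the list
that WebKit or Gecko actually uses and not anything HTML5 specifies
today. The entity table comes from Python's html.entities module, which
covers the HTML 4.01 names that the XHTML 1.0 entity sets share.

```python
from html.entities import name2codepoint

# Hypothetical whitelist of doctype public identifiers that switch on
# the XHTML entity set. A real list would have to come from a spec.
XHTML1_PUBLIC_IDS = frozenset({
    "-//W3C//DTD XHTML 1.0 Strict//EN",
    "-//W3C//DTD XHTML 1.0 Transitional//EN",
    "-//W3C//DTD XHTML 1.0 Frameset//EN",
    "-//W3C//DTD XHTML 1.1//EN",
    "-//W3C//DTD XHTML Basic 1.0//EN",
})

# The five entities predefined by XML are always available.
PREDEFINED = {"lt": "<", "gt": ">", "amp": "&", "apos": "'", "quot": '"'}


def resolve_entity(name, doctype_public_id):
    """Return the replacement text for an entity name, or None if the
    name should be treated as undefined for this document."""
    if name in PREDEFINED:
        return PREDEFINED[name]
    if doctype_public_id in XHTML1_PUBLIC_IDS and name in name2codepoint:
        return chr(name2codepoint[name])
    return None


# Legacy XHTML 1.0 content gets the named entity.
print(repr(resolve_entity("nbsp", "-//W3C//DTD XHTML 1.0 Strict//EN")))

# The XHTML+RDFa doctype is not on the whitelist, so &nbsp; stays undefined.
print(repr(resolve_entity("nbsp", "-//W3C//DTD XHTML+RDFa 1.0//EN")))
```

The point of keying on the public identifier is that the DTD itself
never has to be fetched or parsed; the doctype only selects a built-in
table.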

>> Note: we'd rather not have this behavior in WebKit but we added it due to
>> compatibility bugs being filed. I expect any XHTML-capable browser would
>> eventually be pressured to add similar behavior. Non-browser tools that
>> process XHTML from the Web may also benefit from doing the same thing.
>>
>
> But those that don't will respect named entities in RDFa in XHTML,
> while browsers don't. You start bending rules, and you add, rather
> than remove inconsistencies.

As things currently stand, only a validating XML processor would  
respect any named entities in RDFa+XHTML (using the RDFa+XHTML  
doctype). I think it is not common for any software other than
DTD-based validators to use a validating XML processor.

Regards,
Maciej

Received on Sunday, 1 November 2009 03:03:21 UTC