Re: Validating XHTML5 with XML entities

Hi Jeff,

On Aug 27, 2008, at 8:49 PM, Jeff Schiller wrote:

> Can you share more thoughts and/or address my other question
> concerning XHTML5 adopting all HTML entities?

Sure, I have written about this before[1]. First I'll expose my bias.
I think the XML recommendation is too dependent upon DTDs and that a
future XML recommendation should decouple the two and raise other
schema languages to peers alongside DTD. One of the most important
advances of XML over SGML is that it made a structured generalized
markup language that could stand on its own with no inherent need for
a schema. The problem is that it made one slip: it tied the entity
references and DTDs into the specification and did so in a way that
didn't allow XML UAs to treat the general entity references as
opaque. In many ways the XML namespaces recommendation is more
integral to the modern use of XML than the use of DTDs and DocType
identifiers.

So what does that mean for HTML5 and an XML serialization for HTML5?
Well, it means that if we want to be processed by XML UAs that are not
also HTML5 UAs (that is, UAs with no knowledge of the HTML5 infoset),
we need to provide a DTD with at least some character entity
references or at most the entire DTD-definable HTML5 schema. Obviously
such an "XML but not HTML5" validating UA would not be able to perform
other conformance checking that cannot be expressed through a DTD, but
it could at least perform some validation or other processing of a
document. At the very least, users would gain the ability to use HTML5
named character entities within a standard XML UA.
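
To make the point above concrete, here is a minimal sketch. It assumes
Python's standard xml.etree.ElementTree (an expat-based, non-validating
parser) as a stand-in for an off-the-shelf XML UA. The one-entity
internal subset and the names WITHOUT_DTD, WITH_DTD and parse_text are
only illustrative and are not part of any proposed HTML5 DTD.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# No DTD at all: a generic XML parser has no definition for &nbsp;
# and has to reject the document as not well-formed.
WITHOUT_DTD = f'<p xmlns="{XHTML_NS}">a&nbsp;b</p>'

# Same document with an internal DTD subset declaring the one entity.
# (Illustrative only; a real DTD would declare the full entity set.)
WITH_DTD = (
    '<!DOCTYPE p [<!ENTITY nbsp "&#160;">]>'
    f'<p xmlns="{XHTML_NS}">a&nbsp;b</p>'
)


def parse_text(source):
    """Return the root element's text, or None if parsing fails."""
    try:
        return ET.fromstring(source).text
    except ET.ParseError:
        return None


without_dtd = parse_text(WITHOUT_DTD)  # parser rejects the undefined entity
with_dtd = parse_text(WITH_DTD)        # entity is expanded to U+00A0
```

The namespace declaration alone does nothing for the generic parser;
only the entity declaration lets it map the reference to a character.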

So in summary, it makes no sense for us to specify an XML
serialization for HTML5 yet not provide the anachronistic DTD and
DocType identifiers necessary for standard off-the-shelf XML UAs to
process HTML5 (the same could be said for SGML UAs and the text/html
serialization, but I don't feel as strongly about that). Granted,
providing a DTD will not make an off-the-shelf XML UA into an HTML5
UA, but it will enable some processing capabilities: enough, perhaps,
to satisfy the needs of some authors and some users. Not doing so
leads to authors such as you meticulously entering the entity
definitions over and over, when we as spec writers should take that
burden off our authors and users.

For authors targeting Gecko, WebKit, Presto, etc., the DocType can be
omitted, since those UAs will recognize the XML as HTML5 simply by the
namespace URI declaration. They will properly process the character
entity references without any DocType or DTD (they already do for
XHTML). Nothing in the XML recommendation prohibits this processing
for these browsers: they have to bring knowledge of the HTML5 infoset
not available in a machine schema anyway. Obviously, XML applications
(in the XML sense of application) such as HTML5 cannot use a DTD to
tell the XML processor what a link is, so a DTD is insufficient to
turn an off-the-shelf XML UA into an XML + HTML5 UA anyway.

The only hiccup, then, is that off-the-shelf XML processors
(non-HTML5-aware processors) will need a schema and a schema linking
mechanism (typically DTD and DocType identifiers up until now) to map
the character entity references to their corresponding characters
(and perhaps for other HTML5 infoset processing). XML could have
allowed UAs to treat unknown entities as opaque and treat validation
of transcluded content in an atomic fashion, but it didn't. So I think
we should give the XML processors what they need: a DTD schema and a
DocType identifier (though only for the validating and generic XML
UAs, and not required for authors targeting other UAs).

Take care,
Rob

[1]: <http://lists.w3.org/Archives/Public/public-html/2008Jul/0252.html>

Original thread:
> On 8/27/08, Robert J Burns <rob@robburns.com> wrote:
>>
>> On Aug 27, 2008, at 4:14 PM, Jeff Schiller wrote:
>>
>>
>>> Hi Robert,
>>>
>>> On Wed, Aug 27, 2008 at 2:04 AM, Robert J Burns <rob@robburns.com> wrote:
>>>
>>>>
>>>>>
>>>>> I'd appreciate some insight.  Yes, I can continue to hack on
>>>>> WordPress and get it to emit "&#160;" instead of "&nbsp;" and then
>>>>> go through my database and replace all instances for the last
>>>>> several years, but...
>>>>>
>>>>
>>>> Can't you have WordPress emit U+00A0, or are you using a charset
>>>> encoding other than a UTF encoding?
>>>>
>>>>
>>>
>>> Again, maybe I don't understand what you're suggesting.
>>>
>>> I'm using UTF-8.  I can go through the WordPress source and change
>>> all their PHP files that use &nbsp; &raquo; and &laquo; to their
>>> equivalent numeric references but there are over 100 instances of
>>> this.
>>>
>>> I can create a ticket and submit a 100-line patch to the WP project,
>>> but I'm worried that getting this accepted by the WordPress
>>> powers-that-be will be challenging, especially considering my last
>>> few patches that languished for months (and those patches prevented
>>> Yellow Screens of Death - the XHTML equivalent of a 'segfault').
>>> What are the chances of a 100-line patch that has no observable user
>>> benefit (since declaring these entities is a quick 3-line fix that
>>> can be done by the theme creator)?
>>>
>>> So if that patch doesn't get accepted (or it takes a long chunk of
>>> time), then next time I upgrade to the new version of WP (happens
>>> every 6 months or so), I have to remember to manually search/replace
>>> those three entities.
>>>
>>
>> Well, this isn't really the list to discuss WordPress development
>> issues. However, this is a problem that should be solved by
>> WordPress by emitting Unicode characters rather than named or
>> numbered character entity references. The reason to use character
>> entity references is to facilitate documents in non-UTF encodings (or
>> perhaps where the author is concerned the document will be converted
>> or round-tripped through non-UTF encodings). For pure UTF charset
>> documents, it's advisable to simply use the literal characters (and
>> not references to them). Some like the source readability of named
>> character references, but that readability depends solely on the
>> reader's familiarity with the characters. If I'm a reader of a
>> Cyrillic script based language, I'm not going to find reading the
>> source easier if all of the characters are replaced with named
>> references to the characters.
>>
>> In terms of your present problem, I don't know enough about
>> WordPress. If it cannot be fixed through configuration tweaks, it
>> still is something that is better handled in the long-term by
>> WordPress through literal characters rather than references.
>>
>> Take care,
>> Rob
>>
>
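
The advice quoted above (emit literal characters in a UTF-encoded
document rather than named references) could be sketched as a
post-processing filter. This is only an illustration under stated
assumptions: it uses Python's standard html.entities table (the HTML 4
entity names), the function name literalize and the sample string are
hypothetical, and it says nothing about how WordPress itself builds
its output.

```python
import re
from html.entities import name2codepoint

# The five references every XML parser knows; these must stay escaped.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# Matches a named reference such as &nbsp; and captures the name.
NAMED_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def literalize(markup):
    """Replace HTML named references with the literal characters."""

    def replace(match):
        name = match.group(1)
        # Leave XML's predefined references and unknown names untouched.
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)
        return chr(name2codepoint[name])

    return NAMED_REF.sub(replace, markup)


# Hypothetical sample resembling a theme's pagination link text.
sample = "&laquo;&nbsp;Older posts &amp; pages&nbsp;&raquo;"
converted = literalize(sample)
```

The result needs no entity declarations, so it stays well-formed for
any XML parser as long as the document is actually served in a UTF
encoding.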

Received on Wednesday, 27 August 2008 20:39:28 UTC