- From: Robert J Burns <rob@robburns.com>
- Date: Wed, 27 Aug 2008 23:38:44 +0300
- To: Jeff Schiller <codedread@gmail.com>
- Cc: "HTML WG" <public-html@w3.org>
Hi Jeff,

On Aug 27, 2008, at 8:49 PM, Jeff Schiller wrote:

> Can you share more thoughts and/or address my other question
> concerning XHTML5 adopting all HTML entities?

Sure, I have written about this before[1]. First I'll expose my bias. I think the XML recommendation is too dependent upon DTDs and that a future XML recommendation should decouple the two and raise other schema languages to peers alongside DTD. The problem is that one of the most important advances of XML over SGML is that it made a structured generalized markup language that could stand on its own with no inherent need for a schema. Except for one slip: it tied the entity references and DTDs into the specification and did so in a way that didn't allow XML UAs to treat the general entity references as opaque. In many ways the XML namespaces recommendation is more integral to the modern use of XML than the use of DTDs and DocType identifiers.

So what does that mean for HTML5 and an XML serialization for HTML5? Well, it means that if we want to be processed by XML UAs that are not also HTML5 UAs (that is, UAs with no knowledge of the HTML5 infoset), we need to provide a DTD with at least some character entity references or at most the entire DTD-definable HTML5 schema. Obviously such an "XML but not HTML5 validation UA" would not be able to perform other conformance checking that cannot be expressed through a DTD, but it could at least perform some validation or other processing of a document. At the very least, users would gain the ability to use HTML5 named character entities within a standard XML UA.

So in summary, it makes no sense for us to specify an XML serialization for HTML5, yet not provide the anachronistic DTD and DocType identifiers necessary for standard off-the-shelf XML UAs to process HTML5 (though the same could also be said for SGML UAs and text/html serialization, but I don't feel as strongly about that).

Granted, providing a DTD will not make an off-the-shelf XML UA into an HTML5 UA, but it will enable some processing capabilities: enough perhaps to satisfy the needs of some authors and some users. Not doing so leads to authors such as you meticulously entering the entity definitions over and over, when we as spec writers should take on that burden and lift it off our authors and users.

For authors targeting Gecko, WebKit, Presto, etc., the DocType can be omitted since those will recognize the XML as HTML5 simply by the namespace URI declaration. Those UAs will properly process the character entity references without any DocType or DTD (they already do for XHTML). Nothing in the XML recommendation prohibits this processing for these browsers: they have to bring knowledge of the HTML5 infoset not available in a machine schema anyway. Obviously XML applications (in the XML sense of application) such as HTML5 cannot use a DTD to tell the XML processor what a link is, so a DTD is insufficient to turn an off-the-shelf XML UA into an XML + HTML5 UA anyway.

The only hiccup then is that off-the-shelf XML processors (non-HTML5 aware processors) will need a schema and a schema linking mechanism (typically DTD and DocType identifiers up until now) to map the character entity references to their corresponding characters (and perhaps other HTML5 infoset processing). XML could have allowed UAs to treat unknown entities as opaque and treat validation of transcluded content in an atomic fashion, but it didn't. So I think we should give the XML processors what they need: a DTD schema and a DocType identifier (though only for the validating and generic XML UAs and not required for authors targeting other UAs).
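
To make the "hiccup" concrete, here is a minimal sketch in Python, using the standard library's expat-based `xml.etree.ElementTree` as a stand-in for an off-the-shelf, non-HTML5-aware XML processor. The helper names `internal_subset` and `parse_ok` and the sample document are hypothetical and only for illustration; the entity values are taken from Python's `html.entities` table (the HTML 4 set), not from any HTML5 DTD, since none exists to point at.

```python
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Sample document using two named references that plain XML does not define.
BODY = (
    f'<html xmlns="{XHTML_NS}"><body>'
    "<p>a&nbsp;b &raquo;</p>"
    "</body></html>"
)


def internal_subset(names):
    """Build a DOCTYPE whose internal subset declares the given entities."""
    decls = "".join(
        f'<!ENTITY {name} "&#{name2codepoint[name]};">' for name in names
    )
    return f"<!DOCTYPE html [{decls}]>"


def parse_ok(document):
    """Return True if the generic XML parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# Without any declaration the generic parser rejects the document.
bare_ok = parse_ok(BODY)

# With the entities declared in the internal subset, the same body parses
# and the references are mapped to their characters.
root = ET.fromstring(internal_subset(["nbsp", "raquo"]) + BODY)
para = root.find(f".//{{{XHTML_NS}}}p")
```

The internal subset here is the per-document workaround that authors currently have to repeat by hand; a published DTD and DocType identifier would supply the same declarations once, for every generic XML UA.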
Take care,
Rob

[1]: <http://lists.w3.org/Archives/Public/public-html/2008Jul/0252.html>

Original thread:

> On 8/27/08, Robert J Burns <rob@robburns.com> wrote:
>>
>> On Aug 27, 2008, at 4:14 PM, Jeff Schiller wrote:
>>
>>> Hi Robert,
>>>
>>> On Wed, Aug 27, 2008 at 2:04 AM, Robert J Burns <rob@robburns.com> wrote:
>>>
>>>>> I'd appreciate some insight. Yes, I can continue to hack on WordPress
>>>>> and get it to emit "&#160;" instead of "&nbsp;" and then go through my
>>>>> database and replace all instances for the last several years, but...
>>>>
>>>> Can't you have WordPress emit U+00A0, or are you using a charset encoding
>>>> other than a UTF encoding.
>>>
>>> Again, maybe I don't understand what you're suggesting.
>>>
>>> I'm using UTF-8. I can go through the WordPress source and change all
>>> their PHP files that use &raquo; and &laquo; to their
>>> equivalent numeric references but there are over 100 instances of this.
>>>
>>> I can create a ticket and submit a 100-line patch to the WP project,
>>> but I'm worried that getting this accepted by the WordPress
>>> powers-that-be will be challenging, especially considering my last few
>>> patches that languished for months (and those patches prevented Yellow
>>> Screens of Death - the XHTML equivalent of a 'segfault'). What are
>>> the chances of a 100-line patch that has no observable user benefit
>>> (since declaring these entities is a quick 3-line fix that can be done
>>> by the theme creator)?
>>>
>>> So if that patch doesn't get accepted (or it takes a long chunk of
>>> time), then next time I upgrade to the new version of WP (happens
>>> every 6 months or so), I have to remember to manually search/replace
>>> those three entities.
>>
>> Well, this isn't really the list to discuss WordPress development issues.
>> However, this is a problem that should be solved by WordPress by emitting
>> Unicode characters rather than named or numbered character entity
>> references. The reason to use character entity references is to facilitate
>> documents in non-UTF encodings (or perhaps where the author is concerned the
>> document will be converted or round-tripped through non-UTF encodings). For
>> pure UTF charset documents, it's advisable to simply use the literal
>> characters (and not references to them). Some like the source readability of
>> named character references, but that readability depends solely on the
>> reader's familiarity with the characters. If I'm a reader of a Cyrillic
>> script based language, I'm not going to find reading the source easier if
>> all of the characters are replaced with named references to the characters.
>>
>> In terms of your present problem, I don't know enough about WordPress. If
>> it cannot be fixed through configuration tweaks, it still is something that
>> is better handled in the long-term by WordPress through literal characters
>> rather than references.
>>
>> Take care,
>> Rob
>>
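
The advice in the quoted reply (emit literal characters rather than references in a UTF-8 document) can be sketched roughly as follows. This is an illustration in Python, not anything WordPress itself does; `to_literal` and `XML_PREDEFINED` are hypothetical names, and the name table is Python's `html.entities.name2codepoint` (the HTML 4 set). The five references that XML itself predefines are left alone, since replacing them could break well-formedness.

```python
import re
from html.entities import name2codepoint

# References predefined by XML; these must stay escaped in markup.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# Matches a named reference such as &nbsp; or &raquo;
_NAMED_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def to_literal(text):
    """Replace known named character references with literal characters."""

    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            # Leave XML's own references and unknown names untouched.
            return match.group(0)
        return chr(name2codepoint[name])

    return _NAMED_REF.sub(replace, text)


sample = to_literal("a&nbsp;b &raquo; next &laquo; prev &amp; &bogus;")
```

The result is only safe to serve if the output really is encoded as UTF-8 (or another UTF encoding), which is the precondition the quoted reply states.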
Received on Wednesday, 27 August 2008 20:39:28 UTC