Re: Validating XHTML5 with XML entities from Jeff Schiller on 2008-08-27 (public-html@w3.org from August 2008)

From: Jeff Schiller <codedread@gmail.com>
Date: Wed, 27 Aug 2008 16:12:55 -0500
To: "Robert J Burns" <rob@robburns.com>
Cc: "HTML WG" <public-html@w3.org>
Message-ID: <da131fde0808271412p7edda08k4abe93fdb82065ed@mail.gmail.com>

Robert,

I get it now - and I agree it's a shame that non-numeric entities are
not treated opaquely in XML (but even namespaces require a prefix
declaration to avoid yellow screens of death).

So in your proposal HTML-aware UAs will inherently know about the HTML
character entities while standalone parsers will need the help of a
DTD that someone will have to write?  In essence, browsers will have
no need of a DTD.
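
To make the difference concrete, here is a minimal sketch. It assumes Python's standard xml.etree.ElementTree as a stand-in for a standalone (non-HTML-aware) XML parser; the sample documents and the parse_text helper are made up for illustration. The named reference is rejected unless the document declares it, while the numeric reference needs no declaration at all.

```python
import xml.etree.ElementTree as ET

# Three hypothetical one-element documents that differ only in how the
# no-break space (U+00A0) is written.
NAMED = "<p>a&nbsp;b</p>"        # named entity, no declaration
NUMERIC = "<p>a&#160;b</p>"      # numeric character reference
DECLARED = (                     # named entity declared in an internal DTD subset
    "<!DOCTYPE p [\n"
    '  <!ENTITY nbsp "&#160;">\n'
    "]>\n"
    "<p>a&nbsp;b</p>"
)


def parse_text(doc):
    """Return the root element's text, or None if the parser rejects the document."""
    try:
        return ET.fromstring(doc).text
    except ET.ParseError:
        # A generic XML parser must treat an undeclared entity as a fatal
        # error; in a browser this is the yellow screen of death.
        return None


named_result = parse_text(NAMED)
numeric_result = parse_text(NUMERIC)
declared_result = parse_text(DECLARED)

print(repr(named_result), repr(numeric_result), repr(declared_result))
```

The DECLARED case is the kind of short per-document entity declaration I have been adding by hand in my theme.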

Personally, I hate the concept of named XML entities in the first
place and would prefer numeric character references throughout (and
yes, ideally I'd like to use the Unicode characters directly in the
markup - but what I'm dealing with is 'WordPress chrome', not content
that I write).
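
Rewriting named references mechanically is not hard. What follows is a rough sketch only, assuming Python's html.entities.name2codepoint table (the HTML 4 entity names) as the source of definitions. The to_numeric and to_literal helpers are hypothetical names of my own, not anything that exists in WordPress.

```python
import re
from html.entities import name2codepoint

# The five entities every XML parser already knows; leave these untouched.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# Matches a named reference such as &nbsp; (numeric references do not match).
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def _rewrite(markup, render):
    """Replace each known HTML named entity using render(codepoint)."""
    def repl(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # keep predefined or unknown names as-is
        return render(name2codepoint[name])
    return ENTITY_RE.sub(repl, markup)


def to_numeric(markup):
    """&nbsp; becomes &#160; (safe in any encoding, no DTD required)."""
    return _rewrite(markup, lambda cp: "&#%d;" % cp)


def to_literal(markup):
    """&nbsp; becomes the literal U+00A0 character (for UTF-8 output)."""
    return _rewrite(markup, chr)


sample = "Next&nbsp;&raquo; &amp; &laquo;&nbsp;Prev"
print(to_numeric(sample))
print(to_literal(sample))
```

Either form parses in a plain XML processor without any entity declarations.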

Regards,
Jeff

On 8/27/08, Robert J Burns <rob@robburns.com> wrote:
> Hi Jeff,
>
>  On Aug 27, 2008, at 8:49 PM, Jeff Schiller wrote:
>
>
> > Can you share more thoughts and/or address my other question
> > concerning XHTML5 adopting all HTML entities?
> >
>
>  Sure, I have written about this before[1]. First I'll expose my bias. I
> think the XML recommendation is too dependent upon DTDs and that a future
> XML recommendation should decouple the two and raise other schema languages
> to peers alongside DTD. The problem is that one of the most important
> advances of XML over SGML is that it made a structured generalized markup
> language that could stand on its own with no inherent need for a schema.
> Except for one slip: it tied the entity references and DTDs into the
> specification and did so in a way that didn't allow XML UAs to treat the
> general entity references as opaque. In many ways the XML namespaces
> recommendation is more integral to the modern use of XML than the use of
> DTDs and DocType identifiers.
>
>  So what does that mean for HTML5 and an XML serialization for HTML5? Well
> it means that if we want to be processed by XML UAs that are not also HTML5
> UAs (have no knowledge of the HTML5 infoset), we need to provide a DTD with
> at least some character entity references or at most the entire DTD
> definable HTML5 schema. Obviously such an "XML but not HTML5 validation UA"
> would not be able to perform other conformance checking that cannot be
> expressed through a DTD, but it could at least perform some validation or
> other processing of a document. At the very least, users would gain the
> ability to use HTML5 named character entities within a standard XML UA.
>
>  So in summary, it makes no sense for us to specify an XML serialization for
> HTML5, yet not provide the anachronistic DTD and DocType identifiers
> necessary for standard off-the-shelf XML UAs to process HTML5 (though the
> same could also be said for SGML UAs and text/html serialization, but I
> don't feel as strongly about that). Granted, providing a DTD will not make
> an off-the-shelf XML UA into an HTML5 UA, but it will enable some processing
> capabilities: enough perhaps to satisfy the needs of some authors and some
> users. Not doing so leads to authors such as you meticulously entering the
> entity definitions over and over, when we as spec writers should take on that
> burden so that it is lifted from our authors and users. For authors
> targeting Gecko, WebKit, Presto, etc., the DocType can be omitted since
> those will recognize the XML as HTML5 simply by the namespace URI
> declaration. Those UAs will properly process the character entity references
> without any DocType or DTD (they already do for XHTML). Nothing in the XML
> recommendation prohibits this processing for these browsers: they have to
> bring knowledge of the HTML5 infoset not available in a machine schema
> anyway. Obviously XML applications (in the XML sense of application) such as
> HTML5, cannot use a DTD to tell the XML processor what a link is so a DTD is
> insufficient to turn an off-the-shelf XML UA into an XML + HTML5 UA anyway.
> The only hiccup then is that off-the-shelf XML processors (non-HTML5 aware
> processors) will need a schema and a schema linking mechanism (typically DTD
> and DocType identifiers up until now) to map the character entity references
> to their corresponding characters (and perhaps other HTML5 infoset
> processing). XML could have allowed UAs to treat unknown entities as opaque
> and treat validation of transcluded content in an atomic fashion, but it
> didn't. So I think we should give the XML processors what they need: a DTD
> schema and a DocType identifier (though only for the validating and generic
> XML UAs and not required for authors targeting other UAs).
>
>  Take care,
>  Rob
>
>  [1]:
> <http://lists.w3.org/Archives/Public/public-html/2008Jul/0252.html>
>
>  Original thread:
>
> > On 8/27/08, Robert J Burns <rob@robburns.com> wrote:
> >
> >
> > >
> > > On Aug 27, 2008, at 4:14 PM, Jeff Schiller wrote:
> > >
> > >
> > >
> > > > Hi Robert,
> > > >
> > > > On Wed, Aug 27, 2008 at 2:04 AM, Robert J Burns <rob@robburns.com> wrote:
> > > >
> > > >
> > > > >
> > > > >
> > > > > >
> > > > > > I'd appreciate some insight.  Yes, I can continue to hack on WordPress
> > > > > > and get it to emit "&#160;" instead of "&nbsp;" and then go through my
> > > > > > database and replace all instances for the last several years, but...
> > > > > >
> > > > > >
> > > > >
> > > > > Can't you have WordPress emit U+00A0, or are you using a charset encoding
> > > > > other than a UTF encoding?
> > > > >
> > > > >
> > > > >
> > > >
> > > > Again, maybe I don't understand what you're suggesting.
> > > >
> > > > I'm using UTF-8.  I can go through the WordPress source and change all
> > > > their PHP files that use &nbsp; &raquo; and &laquo; to their
> > > > equivalent numeric references but there are over 100 instances of
> > > > this.
> > > >
> > > > I can create a ticket and submit a 100-line patch to the WP project,
> > > > but I'm worried that getting this accepted by the WordPress
> > > > powers-that-be will be challenging, especially considering my last few
> > > > patches that languished for months (and those patches prevented Yellow
> > > > Screens of Death - the XHTML equivalent of a 'segfault').  What are
> > > > the chances of a 100-line patch that has no observable user benefit
> > > > (since declaring these entities is a quick 3-line fix that can be done
> > > > by the theme creator)?
> > > >
> > > > So if that patch doesn't get accepted (or it takes a long chunk of
> > > > time), then next time I upgrade to the new version of WP (happens
> > > > every 6 months or so), I have to remember to manually search/replace
> > > > those three entities.
> > > >
> > > >
> > >
> > > Well, this isn't really the list to discuss WordPress development issues.
> > > However, this is a problem that should be solved by WordPress by emitting
> > > Unicode characters rather than named or numbered character entity
> > > references. The reason to use character entity references is to facilitate
> > > documents in non-UTF encodings (or perhaps where the author is concerned the
> > > document will be converted or round-tripped through non-UTF encodings). For
> > > pure UTF charset documents, it's advisable to simply use the literal
> > > characters (and not references to them). Some like the source readability of
> > > named character references, but that readability depends solely on the
> > > reader's familiarity with the characters. If I'm a reader of a Cyrillic
> > > script based language, I'm not going to find reading the source easier if
> > > all of the characters are replaced with named references to the characters.
> > >
> > > In terms of your present problem, I don't know enough about WordPress. If
> > > it cannot be fixed through configuration tweaks, it still is something that
> > > is better handled in the long-term by WordPress through literal characters
> > > rather than references.
> > >
> > > Take care,
> > > Rob
> > >
> > >
> >
> >
>
>
Received on Wednesday, 27 August 2008 21:13:37 UTC