
Re: XHTML character entity support

From: David Carlisle <davidc@nag.co.uk>
Date: Wed, 11 Nov 2009 12:47:40 GMT
Message-Id: <200911111247.nABCleVL005332@edinburgh.nag.co.uk>
To: public-html@w3.org
Cc: public-xml-core-wg@w3.org

Henri wrote: 

> One can use <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1 plus MathML
> 2.0//EN" "http://www.w3.org/Math/DTD/mathml2/xhtml-math11-f.dtd"> and
> trade compat with WebKit and Opera into the ability to use the MathML
> entities in shipped Gecko. (Here's a point where interop between
> browsers is lacking, BTW.)

Using an xhtml1+mathml2 dtd with an xhtml5+mathml3 document would
work in a browser, but it would be confusing and fragile, and it would
break in other xml pipelines.

Presumably compatibility with xml workflows would be a major reason for 
use of the xml serialisation of html5, so saying you have to use a
doctype that makes the document invalid would seem pretty odd.

> I'd expect it to map the public ids listed at
> http://mxr.mozilla.org/mozilla-central/source/parser/htmlparser/src/nsExpatDriver.cpp#287
> to a bogo-DTD that defines either the XHTML 1.0 entities or the
> *latest* MathML entity set (depending on which one of the two DTD
> files is named in nsExpatDriver.cpp), and I'd expect it to map other
> public ids and lone system ids to the empty stream.

Personally, I think that the spec should not mandate any particular
entity resolver. It's a fact of life with xml entities that some
systems will report errors and some will read the dtd and use the
definitions. Authors worried about that can use character data or
numeric references instead (which is a good idea in any case).
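
As a rough illustration of the numeric-reference route, here is a minimal
Python sketch that rewrites named references as hexadecimal character
references before a document is served. It assumes the names come from the
HTML5 table shipped in Python's standard library (html.entities.html5),
which is not necessarily identical to any particular dtd; the function
name to_numeric is hypothetical. It works on raw text with a regular
expression, so it does not skip CDATA sections or comments.

```python
import re
from html.entities import html5  # maps "name;" -> replacement text

# The five references every XML parser knows without a dtd are left alone.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
NAMED_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def to_numeric(markup: str) -> str:
    """Replace known named references with hexadecimal character references."""

    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED:
            return match.group(0)
        expansion = html5.get(name + ";")
        if expansion is None:
            # Unknown name: leave it untouched rather than guess.
            return match.group(0)
        # Some names expand to more than one character.
        return "".join(f"&#x{ord(ch):X};" for ch in expansion)

    return NAMED_REF.sub(replace, markup)
```

The result parses the same way whether or not the consuming system reads
a dtd, which is the point of the advice above.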

However, I think the html5 spec could suggest (perhaps even as an RFC 2119
"should" requirement) that a system using a non-validating parser on the
xml representation served as application/xhtml+xml uses (or acts as if it
uses) a catalog that supplies a default dtd when the document has no
doctype, and an entity resolver that maps any external dtd to that same
default dtd, which could be essentially

       Public identifier: -//W3C//ENTITIES Combined Set//EN//XML
       System identifier: http://www.w3.org/2003/entities/2007/w3centities-f.ent

which just looks like:

<!ENTITY Aacgr            "&#x00386;" ><!--GREEK CAPITAL LETTER ALPHA WITH TONOS -->
<!ENTITY aacgr            "&#x003AC;" ><!--GREEK SMALL LETTER ALPHA WITH TONOS -->
<!ENTITY Aacute           "&#x000C1;" ><!--LATIN CAPITAL LETTER A WITH ACUTE -->
<!ENTITY aacute           "&#x000E1;" ><!--LATIN SMALL LETTER A WITH ACUTE -->
<!ENTITY Abreve           "&#x00102;" ><!--LATIN CAPITAL LETTER A WITH BREVE -->
<!ENTITY abreve           "&#x00103;" ><!--LATIN SMALL LETTER A WITH BREVE -->
<!ENTITY ac               "&#x0223E;" ><!--INVERTED LAZY S -->

a sorted list of all the entities. Actually that file is a bit bigger
than the xhtml+mathml set proposed for html 5, as it contains some ISO
entity sets not normally included, but if it were thought useful, a
similar sorted list containing just the html5 entities could be produced.
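
The resolver behaviour suggested above can be sketched with Python's
pyexpat bindings. This is only an illustration under stated assumptions,
not how any browser does it: DEFAULT_DTD is a hypothetical three-entry
stand-in for the real entity file, parse_text is a made-up helper name,
and external general entities are simply treated as empty.

```python
from xml.parsers import expat

# Hypothetical stand-in for the default dtd: a few declarations in the
# style of w3centities-f.ent, kept inline so nothing is ever fetched.
DEFAULT_DTD = (
    b'<!ENTITY Aacute "&#x000C1;">'
    b'<!ENTITY aacute "&#x000E1;">'
    b'<!ENTITY ac "&#x0223E;">'
)


def parse_text(document: bytes) -> str:
    """Parse document and return its character data with entities expanded."""
    chunks = []
    parser = expat.ParserCreate()
    # Ask expat to hand the external subset to our handler ...
    parser.SetParamEntityParsing(expat.XML_PARAM_ENTITY_PARSING_ALWAYS)
    # ... and to call the handler even when there is no doctype at all.
    parser.UseForeignDTD(True)
    parser.CharacterDataHandler = chunks.append

    def resolve(context, base, system_id, public_id):
        if context is not None:
            # External general entity in content: treat it as empty.
            return 1
        # Whatever identifiers the doctype named (or none), load the default.
        child = parser.ExternalEntityParserCreate(context)
        child.Parse(DEFAULT_DTD, True)
        return 1

    parser.ExternalEntityRefHandler = resolve
    parser.Parse(document, True)
    return "".join(chunks)
```

With this in place, a document with no doctype and one with the
xhtml+mathml doctype quoted earlier both see the same definitions, and
the system identifier is never dereferenced.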

The format used there is mainly for human consumption. If there were any
possibility of systems really fetching this over the web, it could of
course be compressed a lot by losing all the white space and comments,
and using character data rather than numeric references for the
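
A rough Python sketch of that kind of compaction follows. It assumes the
simple one-declaration-per-line layout shown above; the name compact is
hypothetical. As a conservative choice of this sketch, only non-ASCII
characters are inlined, so references that expand to markup-significant
characters such as & or % stay numeric. Anything that is not a plain
double-quoted general entity declaration is dropped.

```python
import re

COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
# Matches only simple general entity declarations with a double-quoted value.
DECL = re.compile(r'<!ENTITY\s+(\S+)\s+"([^"]*)"\s*>')
CHAR_REF = re.compile(r"&#x([0-9A-Fa-f]+);")


def _inline(match):
    """Turn a hex character reference into the character if it is non-ASCII."""
    ch = chr(int(match.group(1), 16))
    return ch if ord(ch) > 0x7F else match.group(0)


def compact(entity_file: str) -> str:
    """Strip comments and white space and inline non-ASCII characters."""
    without_comments = COMMENT.sub("", entity_file)
    return "".join(
        f'<!ENTITY {name} "{CHAR_REF.sub(_inline, value)}">'
        for name, value in DECL.findall(without_comments)
    )
```

The output would need to be saved as UTF-8 (or carry a matching encoding
declaration) so that the literal characters are read back correctly.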



While I have the attention of the HTML and XML core WGs, just a heads up
that we hope to ask for the xml entities draft to go to last call next
week, and would again appreciate any reviews that the working groups, or
individuals within those groups, could give to the spec, the current
editors' draft version of which is always available at


Received on Wednesday, 11 November 2009 12:48:21 GMT
