Re: missing letter gcedil in isolat2 from David Carlisle on 2008-01-07 (www-math@w3.org from January 2008)

From: David Carlisle <davidc@nag.co.uk>
Date: Mon, 7 Jan 2008 13:35:48 GMT
To: hsivonen@iki.fi
CC: www-math@w3.org
Message-Id: <200801071335.m07DZmJm025362@quetzal.nag.co.uk>
> Therefore, I think it would be a mistake and Bad-for-the-Web if any WG  
> of the W3C tried to push a DTD change or a new DTD for Web deployment.  
> I find http://www.w3.org/TR/2007/WD-xml-entity-names-20071214/ very  
> alarming if the intent is to serve those entities over the wire.

We do actually echo your arguments against using entity references in the
current MathML 3 draft, and all examples in the MathML 3 draft use
numeric references (with a comment giving the Unicode name) rather than
named entity references.
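
As a rough illustration of that convention (a numeric reference followed by a comment with the Unicode name), here is a minimal Python sketch. The helper name ncr_with_name is hypothetical, and the exact output layout is an assumption rather than anything the draft prescribes.

```python
import unicodedata

def ncr_with_name(ch):
    """Return a hexadecimal numeric character reference for ch,
    followed by an XML comment carrying its Unicode name.
    (Hypothetical helper; the layout is an assumption.)"""
    # Fall back to a placeholder for code points without a name.
    name = unicodedata.name(ch, "UNNAMED CHARACTER")
    return "&#x{:04X};<!--{}-->".format(ord(ch), name)

print(ncr_with_name("\u2062"))  # &#x2062;<!--INVISIBLE TIMES-->
print(ncr_with_name("\u03c6"))  # &#x03C6;<!--GREEK SMALL LETTER PHI-->
```

The comment keeps the source readable without requiring any DTD to resolve a name.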


I word the arguments against using entities rather more strongly in my
blog entry where I introduced that draft
http://dpcarlisle.blogspot.com/2007/11/xml-entities-definitions-for-characters.html


Apart from gcedil in this thread, I notice a recent request on
public-html for sub[123]
http://lists.w3.org/Archives/Public/public-html/2007Dec/0228.html


Personally I'm very much against either adding or removing any entity
names. The names that we have are a somewhat arbitrary collection but I
don't see how changing the set of names in any way can make the situation
better. I do think, however, that changing the definitions can improve the
situation, by using characters that were not previously available. And
certainly it is not possible to get a consistent set of definitions
across html/mathml/docbook/tei unless _someone_ changes.



For various reasons the entity names will persist for some time yet in
several contexts, and if they are going to persist in several places,
personally I think it is worth the effort to get a consistent set of
definitions in terms of Unicode.

It's _much_ easier to safely replace entity references by character data
if there are agreed universal definitions, as that makes the change
essentially reversible. If different vocabularies don't agree on the
definition of phi then it is very hard to make a global change that
expands out phi, as this may or may not be losing or corrupting
information.
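
To make the reversibility point concrete, here is a minimal Python sketch. The two one-entry tables are illustrative assumptions (phi as U+03C6 in an HTML-style set and as U+03D5 in a MathML 2-style set), and the function names expand and collapse are hypothetical.

```python
import re

# Illustrative, assumed definition tables: one entry each, not full sets.
HTML_DEFS = {"phi": 0x03C6}      # assumed HTML-style definition
MATHML2_DEFS = {"phi": 0x03D5}   # assumed MathML 2-style definition

ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")
NUMERIC_REF = re.compile(r"&#x([0-9A-Fa-f]+);")

def expand(text, defs):
    """Replace named entity references found in defs with hex numeric references."""
    def repl(m):
        name = m.group(1)
        if name in defs:
            return "&#x{:04X};".format(defs[name])
        return m.group(0)  # unknown names are left untouched
    return ENTITY_REF.sub(repl, text)

def collapse(text, defs):
    """Inverse of expand: turn hex numeric references back into names from defs."""
    by_codepoint = {cp: name for name, cp in defs.items()}
    def repl(m):
        name = by_codepoint.get(int(m.group(1), 16))
        return "&{};".format(name) if name else m.group(0)
    return NUMERIC_REF.sub(repl, text)

doc = "angle &phi; and &amp;"
# With one agreed table the change round-trips.
print(collapse(expand(doc, HTML_DEFS), HTML_DEFS) == doc)
# Expanded under one table and read under another, the name is not recovered.
print(collapse(expand(doc, MATHML2_DEFS), HTML_DEFS))
```

With a single agreed table the expansion can be undone; with two disagreeing tables the same source text expands to different characters, so a global replacement cannot know which one the author meant.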

> Mnemonic character input should be between the author and his/her  
> MathML converter. What goes over the wire to the browser should be  
> unescaped UTF-8.

That, actually, I don't agree with. I can't see any reason not to use
numeric references if you so choose (apart from file size) and using
numeric references has several advantages.

1) A document that uses numeric references is much more likely to be
   served with a correct encoding in the http headers. It _ought_ to be
   easy to get your http server to serve documents with the correct
   encoding but experience shows that this is wrong as often as it's
   right. A document that's ascii + numeric references will almost
   always be correctly served. A document that's utf8 will as often as
   not end up being served from somewhere as latin 1, with resulting
   mangling of the character data. (I wish this wasn't true, but it's
   what I observe.) Note that the author of the document often has no
   control over, or even knowledge of, the web server being used. For
   example we put documentation on CD which end users may (or may not)
   put on a web server or may just read off the file system. Using
   ASCII + NCR (and no DTD!) simplifies the installation instructions
   enormously.


2) A document with numeric character data is self describing in tutorial
   examples. If you see a document with & # 1 2 3 4 ; in its source and
   you want to generate a similar document, then you can see how to
   generate it. If you see a document with some explicit character then
   you may not know how to recreate that character (except by cut and
   paste, if that's available).
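
Both points can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name to_ascii_ncr and the sample string are hypothetical, and the input is assumed to be text content whose markup characters (& and <) are already escaped. It relies on the standard "xmlcharrefreplace" error handler, which writes decimal numeric references for anything outside ASCII.

```python
def to_ascii_ncr(text):
    """Encode text as pure ASCII bytes, turning every non-ASCII
    character into a decimal numeric character reference.
    (Hypothetical helper; assumes & and < are already escaped.)"""
    return text.encode("ascii", "xmlcharrefreplace")

sample = "x \u2264 \u03c6"          # assumed sample: "x", less-than-or-equal, phi

ascii_ncr = to_ascii_ncr(sample)
utf8 = sample.encode("utf-8")

# The ASCII + NCR bytes read the same whichever encoding the server claims.
as_latin1 = ascii_ncr.decode("latin-1")
as_utf8 = ascii_ncr.decode("utf-8")
print(as_latin1)                    # x &#8804; &#966;
print(as_latin1 == as_utf8)         # True

# The UTF-8 bytes, mislabelled as latin 1, no longer decode to the original.
mangled = utf8.decode("latin-1")
print(mangled == sample)            # False
```

The printed source also shows the second point: the references are visible as plain ASCII, so a reader can see exactly what to type to reproduce them.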

David
Received on Monday, 7 January 2008 13:36:38 UTC