Re: Character Entities: An XML Core WG View from David Carlisle on 2002-11-01 (www-math@w3.org from November 2002)

From: David Carlisle <davidc@nag.co.uk>
Date: Fri, 1 Nov 2002 10:05:22 GMT
To: pgrosso@arbortext.com
CC: www-math@w3.org
Message-Id: <200211011005.KAA21058@penguin.nag.co.uk>
Paul,

Thanks for the pointer. I replied to this document on xml-dev and
include a copy below.

I should probably note here that this is a personal response not a
Working Group one (although obviously my personal views are somewhat
flavoured by working on the MathML DTD and also my failure to persuade
any end users that they would rather write &#x2192; than &rightarrow;).

David




Date: 1 Nov 2002 09:22:22 +0000
From: David Carlisle <davidc@nag.co.uk>
CC: xml-dev@lists.xml.org
In-reply-to: <200210302155.QAA19553@mail2.reutershealth.com> (message from
	John Cowan on Wed, 30 Oct 2002 16:41:46 -0500 (EST))
Subject: Re: [xml-dev] Character Entities: An XML Core WG View



 
Comments on "Character Entities: An XML Core WG View"

While the comments on character entities are (mostly) technically
correct, they fail to acknowledge the real problems that authors
currently face when trying to use character entities alongside other
aspects of current XML technologies. The general flavour is "existing
mechanisms suffice", but that may be compared to comments that "XML isn't
needed as SGML could do everything needed in that area". Technically
that is true, but it misses the point. XML has proved to have many
advantages over SGML. Reasonable schema languages using XML instance
syntax may prove to have real advantages over DTDs, and a different
approach to entity definition may yet prove to have real advantages
too. Unfortunately the document as published does not address these
issues at all and just states the obviously true fact that entities
already have a definition mechanism in DTDs.

Acknowledging the usability problems with the current mechanisms and
investigating the possibilities for alternatives should not commit
anyone to adding any mechanism in a future XML 2 (if there were ever to
be such a version). As already seen in the XML 1.1 debates, the costs of
any version increment are high, and it may be a reasonable position to
take that the changes required would be too great. However, unless
the desirability of new functionality is accepted, and there is some
rough idea of what changes could be made to meet that requirement,
it is impossible to weigh the benefits against the costs of a
version change.

Responses to individual (quoted) points are contained below.


>  The existing mechanism, DTDs, is entirely adequate to the purpose.
>  Although some subsets of XML have outlawed DTDs in the name of
>  interoperability, all conforming XML processors (parsers) must be able
>  to recognize at least some DTD information,


If you are using an XML application that forbids (at the application
level) the use of <!DOCTYPE, the fact that this is allowed by the XML
spec does not really help. SOAP is probably the main example of this,
although SOAP is perhaps not so often used with hand authored
documents. However, given the pressure from some quarters to move from
DTDs to schema languages of one sort or another, this is likely to become
more rather than less common.


>  At worst, then, the character entities actually used in a given
>  document (generally a small subset of those available) can be declared
>  in the internal subset, and are 100% interoperable across processors. 

As noted above, this facility may not be available at all. Even when it
is, it is only barely usable for hand authored documents (which, as you
comment in the introduction, is a main use case for entities of this
form). Requiring that every time you use a character by name you have to
(a) know the required definition and (b) go up to the top of the
document to add the entity declaration creates severe usability problems.
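
As a minimal sketch of what this means in practice, assuming Python's
standard xml.etree.ElementTree parser (my choice for illustration, not
anything the Core WG document mentions): the same content parses when
the entity is declared in the internal subset, and is rejected when the
DOCTYPE has to be left out. The names WITH_DECL, WITHOUT_DECL and
parse_text are made up for this example.

```python
import xml.etree.ElementTree as ET

# Document that declares the one entity it uses in the internal subset.
WITH_DECL = (
    '<!DOCTYPE doc [\n'
    '  <!ENTITY rightarrow "&#x2192;">\n'
    ']>\n'
    '<doc>a &rightarrow; b</doc>'
)

# Same content with no DOCTYPE, as an application that forbids DTDs
# would require.
WITHOUT_DECL = '<doc>a &rightarrow; b</doc>'


def parse_text(xml_source):
    """Return the text of the root element, or None if parsing fails."""
    try:
        return ET.fromstring(xml_source).text
    except ET.ParseError:
        # An undeclared entity reference is a well-formedness error here.
        return None


declared = parse_text(WITH_DECL)
undeclared = parse_text(WITHOUT_DECL)
print(repr(declared))
print(repr(undeclared))
```

The author of the first form still had to know that rightarrow means
U+2192 and had to add the declaration by hand at the top of the file,
which is exactly the usability cost described above.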

 
>  However, different XML applications such as XHTML and MathML do not
>  need to declare differently named entities for the same
>  characters. Most character names have already been standardized by
>  ISO, and these names should be and are used wherever possible. 

"Most" characters have not had names standardised by ISO (or anyone
else), unless you are thinking solely of characters used in common
European languages.

Also, XHTML is incompatible with the usual ISO definitions
(asymp and circ for example), which causes some problems for MathML, which
tries to be in agreement with both.
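
For anyone who wants to check the XHTML side of this, a small sketch
using Python's standard html.entities table (which holds the HTML 4
entity definitions that XHTML 1.0 shares) prints what those two names
resolve to there. The ISO side is deliberately not encoded here; it
would need to be compared by hand against the ISO entity set files. The
helper name describe is made up for this example.

```python
import unicodedata
from html.entities import name2codepoint


def describe(name):
    """Return 'U+XXXX CHARACTER NAME' for an HTML 4 / XHTML entity name."""
    cp = name2codepoint[name]
    return "U+%04X %s" % (cp, unicodedata.name(chr(cp)))


xhtml_asymp = describe("asymp")
xhtml_circ = describe("circ")
print("asymp ->", xhtml_asymp)
print("circ  ->", xhtml_circ)
```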

In addition, Unicode/ISO chose not to support the full set of characters
that have ISO entity names, even in the additions in Unicode 3.x, so
several so-called "standard" names have wildly different definitions in
common XML DTDs, depending on whether the DTD author chose to pick
something "close" or to mark the character as unsupported.
DocBook, for example, maps several characters to U+FFFD
http://www.oasis-open.org/docbook/specs/wd-docbook-xmlcharent-0.3.html#d0e184
whereas MathML attempts to map all of these names to some more (or
less) suitable character.

At the very least, the W3C XML Activity could agree a standard set of
entity names together with a mapping to Unicode for use across W3C specs
(which may possibly just be a matter of rubber stamping the MathML ones
http://www.w3.org/Math/characters and moving them out of the math
area). Interest has previously been expressed in coordinating
such a set with ISO and/or OASIS.


>  There is no need for such a facility, because of the Unicode Private
>  Use Area (PUA).

I would agree with this, but it is worth noting that the W3C I18N group
takes a very hard line against any public use of the PUA.



David Carlisle




Received on Friday, 1 November 2002 05:06:20 UTC