Re: Named character entities from Henri Sivonen on 2003-05-24 (www-html@w3.org from May 2003)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Sat, 24 May 2003 22:38:43 +0300
To: www-html@w3.org
Message-Id: <53F2DD48-8E1F-11D7-80D6-003065B8CF0E@iki.fi>
On Monday, May 19, 2003, at 19:03 Europe/Helsinki, William F Hammond 
wrote:

>
> Ian Hickson <ian@hixie.ch> writes:
>
>> More to the point, XHTML can't make restrictions on XML parsers beyond
>> those of XML. This has to be the case so that arbitrary XML parsers can
>> be re-used in XHTML environments, otherwise XHTML processors must have
>> specialised XML parsers.
>
> It may be useful for interoperability to require that XHTML be
> parsable by arbitrary XML parsers, but it is not inherently impossible
> to bring up XHTML as a markup language that, rather than _being_ an
> XML application, has a canonically associated XML application.  That
> way additional requirements can be imposed.

It is *possible* to define a markup language called XHTML 2 that isn't 
an application of XML and only leverages the 'X' for marketing 
purposes. But does defining XHTML 2 that way have any technical merit? 
Why would it make sense to give up the ability to use ready-made 
off-the-shelf XML tools (most importantly XML processors) in exchange 
for having a larger set of predefined entities--especially considering 
that the problems the larger set of predefined entities is designed to 
solve are better solved in a different way?

>> In the case of entities, XML says that non-validating parsers need not
>> recognise anything outside of the five pre-defined entities. Thus, an
>> arbitrary non-validating XML parser will probably not recognise the
>> XHTML entities. By the time the XHTML-specific part of the UA gets
>> involved, it is likely that the entities are long lost.
>
> The 5 entities are "amp", "lt", "gt", "quot", and "apos".  The list
> does NOT include "copy".
>
> "&copy;" is an interesting case because it is neither used in markup
> nor (AFAIK) used natively in association with the language of a
> non-ascii locale.

The character encoding is part of the concept of "locale" on some 
platforms which have the design limitation of tightly coupling 
characters with bytes. However, even those platforms don't inherently 
make it impossible for applications to write UTF-8 to disk.
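
A minimal sketch of the point, assuming Python's standard
xml.etree.ElementTree module as the generic off-the-shelf XML
processor (the module and the helper name parse_text are illustrative
choices, not something from this thread). With no DTD available, the
named entity is rejected, while a numeric character reference or the
literal character encoded as UTF-8 parses fine:

```python
import xml.etree.ElementTree as ET


def parse_text(xml_source):
    """Return the root element's text, or None if parsing fails.

    parse_text is a hypothetical helper for this illustration only.
    """
    try:
        return ET.fromstring(xml_source).text
    except ET.ParseError:
        # Without a DTD, an entity other than the five predefined ones
        # is undefined, so the document is not well-formed.
        return None


# Named entity from the XHTML entity sets, but no DTD to declare it.
named = parse_text("<p>&copy; 2003</p>")

# Numeric character reference: needs no declaration at all.
numeric = parse_text("<p>&#169; 2003</p>")

# The literal character, written out as UTF-8 bytes.
literal = parse_text("<p>\u00a9 2003</p>".encode("utf-8"))

# The five predefined entities always work.
predefined = parse_text("<p>&amp; &lt; &gt; &quot; &apos;</p>")

print(named, numeric, literal, predefined)
```

In this sketch only the first document fails to parse; the other
three go through the same unmodified parser.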

> It is an exception to the idea that CDATA encoding,
> e.g., the processing route from an author's keystroke to UTF-8, should
> be locale special.

Rather, the processing route from user actions to UTF-8 should depend 
on the input method in use.

I am currently using a keyboard with Apple's Finnish keyboard layout as 
the logical keyboard layout even though I am writing English. I'm 
located in Finland and I use an email client that currently shows me the 
UI strings in U.S. English. If I wanted to, I could switch to the Greek 
keyboard layout and type some Greek letters for use as variable names. 
Or I could open the Character Palette which allows me to pick 
characters from the Unicode chart.

The language being written is not bound to the input method. The input 
method is not bound to the UI language. (And yes, Apple's Finnish 
keyboard layout allows me to type the copyright sign. It takes one 
modifier key and one ordinary key.)

> Named character entities are importantly useful when one needs to
> refer to characters outside of those provided natively in one's
> locale.

Editors that restrict the set of available characters to a subset of 
Unicode because of the user's locale are ill-suited for editing XML 
documents. Editors that properly use the Unicode features of Mac OS X 
and Windows XP don't suffer from that kind of limitation.

The main practical problem is that X11 systems aren't quite 
Unicode-savvy yet. However, I don't think it makes sense to work 
around the limitations of X11 systems at the markup language level. 
History suggests that it takes *at least* six years for a W3C spec to 
become implemented widely enough on the client side for authors to use 
the spec features casually. Six years should be enough for X11 to catch 
up in terms of Unicode-savviness.

> I think it desirable for XHTML user agents to provide some form of
> imitation of Mozilla's handling of the 253 character entities defined
> in XHTML 1.0.

Mozilla's approach isn't forward-compatible with e.g. XHTML 1.1.1, 
because Mozilla uses a finite list of existing public ids.
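
A rough sketch of why a finite list is not forward-compatible. This is
not Mozilla's actual code: the set contents, the function name
entities_resolvable and the XHTML 1.1.1 public id are assumptions made
up for illustration.

```python
# Hypothetical fixed list of public ids known when the UA shipped.
KNOWN_XHTML_PUBLIC_IDS = frozenset({
    "-//W3C//DTD XHTML 1.0 Strict//EN",
    "-//W3C//DTD XHTML 1.0 Transitional//EN",
    "-//W3C//DTD XHTML 1.0 Frameset//EN",
    "-//W3C//DTD XHTML 1.1//EN",
})


def entities_resolvable(public_id):
    """Return True if the UA would supply the XHTML entity definitions.

    Hypothetical helper: the lookup succeeds only for ids on the list.
    """
    return public_id in KNOWN_XHTML_PUBLIC_IDS


# An existing doctype is on the list, so &copy; would be defined.
print(entities_resolvable("-//W3C//DTD XHTML 1.1//EN"))

# An invented future public id is not on the list, so the same
# entity references would be undefined in an already-shipped UA.
print(entities_resolvable("-//W3C//DTD XHTML 1.1.1//EN"))
```

Any doctype published after the list was frozen falls through the
lookup, even if it declares exactly the same entities.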

-- 
Henri Sivonen
hsivonen@iki.fi
http://www.iki.fi/hsivonen/
Received on Saturday, 24 May 2003 15:38:53 UTC