Re: XML spec: Letter, Digit from Tim Bray on 1999-06-13 (xml-editor@w3.org from April to June 1999)

From: Tim Bray <tbray@textuality.com>
Date: Sun, 13 Jun 1999 15:10:48 -0700
To: John Stracke <francis@thibault.org>, xml-editor@w3.org
Message-Id: <3.0.32.19990613150823.01226010@pop.intergate.bc.ca>

At 05:13 PM 6/13/99 -0400, John Stracke wrote:
>I'm building an XML parser, and I'm somewhat confused by the
>spec's productions Letter and Digit.  My concern is that, if
>a new character set is defined next week, then existing XML
>parsers won't consider any of its characters to be Letters
>or Digits

You've put your finger on one of the really hard problems with XML.
Production [2], for Char, makes it clear that you can use, as a
character, pretty well anything that the appropriate committees
add to Unicode.  On the other hand, every time they add a new character
set, it will in general contain some things that should fall under
"letter" and others that shouldn't.  Note that the XML spec outlines
the algorithm that we used to identify what we consider a "letter"; is
this extensible to new character sets?  At the moment we just
don't know.
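
To make the mismatch concrete, here is a minimal sketch in Python,
not taken from any real parser.  The Char ranges are those of
Production [2].  The Letter table is only a small excerpt of the
spec's BaseChar and Ideographic ranges, and the names CHAR_RANGES,
LETTER_RANGES_EXCERPT, is_xml_char and is_letter_excerpt are
made up for illustration.

```python
# Production [2] Char: broad ranges, so code points assigned by future
# Unicode versions are accepted as characters without a spec change.
CHAR_RANGES = [
    (0x9, 0x9), (0xA, 0xA), (0xD, 0xD),
    (0x20, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF),
]

# EXCERPT ONLY of the Letter production (a few BaseChar ranges plus the
# main Ideographic range). The real table is much longer, but it is just
# as fixed: nothing gets in unless the spec itself is revised.
LETTER_RANGES_EXCERPT = [
    (0x0041, 0x005A), (0x0061, 0x007A),
    (0x00C0, 0x00D6), (0x00D8, 0x00F6), (0x00F8, 0x00FF),
    (0x4E00, 0x9FA5),
]


def _in_ranges(code_point, ranges):
    """Return True if code_point falls inside any inclusive (lo, hi) range."""
    return any(lo <= code_point <= hi for lo, hi in ranges)


def is_xml_char(code_point):
    """Check a code point against the Char production."""
    return _in_ranges(code_point, CHAR_RANGES)


def is_letter_excerpt(code_point):
    """Check a code point against the excerpted, fixed Letter table."""
    return _in_ranges(code_point, LETTER_RANGES_EXCERPT)


# 0x10400 stands in for any code point outside the fixed table.
for cp in (ord("A"), 0x00D7, 0x10400):
    print(hex(cp), is_xml_char(cp), is_letter_excerpt(cp))
```

Any code point outside the fixed table passes the Char test but fails
the Letter test, whatever Unicode later says about it, and that is
exactly the gap being asked about.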

For what it's worth, XML 1.0 is 100% clear on what's a letter
and what isn't, and covers most of the languages that most people
are going to be using... but there's no doubt that there's a problem
lurking out there that's going to have to be solved sometime.
Fortunately, the key committees in the XML, IETF, and Unicode
spaces know about the problem and are already worrying. -Tim

Received on Sunday, 13 June 1999 18:10:53 UTC