RE: BaseChar problem in XML 1.0? from Mark Davis on 2001-02-08 (xml-editor@w3.org from January to March 2001)

From: Mark Davis <mark.davis@us.ibm.com>
Date: Thu, 8 Feb 2001 14:30:35 -0800
To: Francois Yergeau <FYergeau@alis.com>
Cc: "Andy Heninger" <heninger@us.ibm.com>, xml-editor@w3.org, "Glenn Marcy" <gmarcy@us.ibm.com>, "Arnaud Le Hors" <lehors@us.ibm.com>
Message-ID: <OF6CB012FC.EB18745F-ON882569ED.007B6328@LocalDomain>

Unfortunately we can't escape visual misidentification: look at Omicron. I
think we do need to consider carefully how we can expand the repertoire of
identifiers to deal with Unicode 3.1 anyway.

From a practical point of view in parsers, it would be simpler and much
faster to say that an identifier is anything that doesn't contain a syntax
character (whitespace, <, >, ...). If one wanted to reduce visual
ambiguity, one could say that two identifiers are considered identical if
they both have the same folded form, then define the folded form to be NFKC
or something like it. However, that is probably impossible at this point.
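
A minimal sketch of the two ideas above, in Python, assuming NFKC from the standard unicodedata module as the folded form. The SYNTAX_CHARS set and the function names are illustrative assumptions; the message names only whitespace, < and > and leaves the rest of the list open.

```python
import unicodedata

# Assumed, illustrative set of syntax characters. The message names only
# whitespace, '<' and '>' and leaves the rest of the list open ("...").
SYNTAX_CHARS = set(" \t\r\n<>")


def is_identifier(s: str) -> bool:
    """True if s is non-empty and contains no syntax character."""
    return bool(s) and not any(ch in SYNTAX_CHARS for ch in s)


def folded(s: str) -> str:
    """Folded form used for comparison; here simply NFKC."""
    return unicodedata.normalize("NFKC", s)


def same_identifier(a: str, b: str) -> bool:
    """Two identifiers are considered identical if their folded forms match."""
    return folded(a) == folded(b)


# Spelled with ROMAN NUMERAL SIX (U+2165) and ROMAN NUMERAL ONE (U+2160).
roman_spelling = "\u2165S\u2160BLE"
print(same_identifier(roman_spelling, "VISIBLE"))

# GREEK CAPITAL LETTER OMICRON (U+039F) versus LATIN CAPITAL LETTER O.
print(same_identifier("\u039f", "O"))
```

Under NFKC the Roman-numeral spelling compares equal to plain VISIBLE, while Omicron and Latin O remain distinct, so a fold of this kind reduces visual ambiguity without removing it.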

Mark
___
Mark Davis, IBM GCoC, Cupertino
(408) 777-5850 [fax: 5891], mark.davis@us.ibm.com, president@unicode.org
http://maps.yahoo.com/py/maps.py?Pyt=Tmap&addr=10275+N.+De+Anza&csz=95014



Francois Yergeau <FYergeau@alis.com> on 02-08-2001 10:52:15

To:   Andy Heninger/Cupertino/IBM@IBMUS, xml-editor@w3.org
cc:   Glenn Marcy/Cupertino/IBM@IBMUS, Arnaud Le Hors/Cupertino/IBM@IBMUS,
      Mark Davis/Cupertino/IBM@IBMUS
Subject:  RE: BaseChar problem in XML 1.0?




I believe the choice was made to avoid "ambiguous" identifiers, i.e. ones
that you cannot unambiguously re-type if you see them.

Imagine the identifier VISIBLE. Is that "V", then "I", then "S" etc. or
"VI" (U+2165), then "S", then "I" (U+2160), etc.? Ambiguous. Out go all
the Roman numerals from 2160 to 217F. No such problem for 2180-2182, the
glyphs are distinct enough (same for U+2183, but it wasn't around when XML
1.0 was designed).
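
One rough way to see the split described above is to ask which Roman numeral code points have a compatibility decomposition into ASCII letters. This Python sketch uses NFKC as a stand-in for "can be mistaken for ordinary letters"; that is an assumption made for illustration, not the criterion the XML 1.0 editors are known to have applied, and folds_to_ascii is a hypothetical helper name.

```python
import unicodedata


def folds_to_ascii(cp: int) -> bool:
    """True if NFKC rewrites the code point into ASCII characters only."""
    original = chr(cp)
    normalized = unicodedata.normalize("NFKC", original)
    return normalized != original and all(ord(c) < 128 for c in normalized)


# Scan the Roman numeral range discussed in the thread, U+2160..U+2183.
ambiguous = [cp for cp in range(0x2160, 0x2184) if folds_to_ascii(cp)]
distinct = [cp for cp in range(0x2160, 0x2184) if not folds_to_ascii(cp)]

print("fold to ASCII letters:", [f"U+{cp:04X}" for cp in ambiguous])
print("left unchanged:       ", [f"U+{cp:04X}" for cp in distinct])
```

With the Unicode data shipped in current Python releases, U+2160 through U+217F fold to ASCII letters and U+2180 through U+2183 are left unchanged.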
--
François Yergeau

-----Original Message-----
From: Andy Heninger [mailto:heninger@us.ibm.com]
Sent: February 5, 2001 13:09
To: xml-editor@w3.org
Cc: Glenn Marcy; Arnaud Le Hors
Subject: BaseChar problem in XML 1.0?



Hello XML Editors,

Here's a question that just came up regarding the definition
of allowable identifier characters in XML.

From the XML spec,

Production [85] BaseChar includes the characters [#x2180-#x2182].

These are Roman Numerals
     1000    CD
     5000    (No reasonable ASCII approximation)
    10000    (No reasonable ASCII approximation)

BaseChar does not include the remaining Unicode Roman Numerals,
which encompass the range [#x2160-#x2183].

I checked with Mark Davis, and there is nothing from a
Unicode perspective that sets the three included characters
apart from the rest of the Unicode Roman Numerals. It would
seem that they either all ought to be allowed or disallowed as
BaseChars.

Unicode's recommendations for Identifier characters allow them
all.

Something does not seem right. Is there some logic here
that escapes me, or is it possible that the inclusion of
these characters is an editing error, or ???



 -- Andy Heninger, IBM Cupertino, XML Technology Group
    heninger@us.ibm.com

Received on Thursday, 8 February 2001 17:30:46 UTC