- From: Mark Davis <mark.davis@us.ibm.com>
- Date: Thu, 8 Feb 2001 14:30:35 -0800
- To: Francois Yergeau <FYergeau@alis.com>
- Cc: "Andy Heninger" <heninger@us.ibm.com>, xml-editor@w3.org, "Glenn Marcy" <gmarcy@us.ibm.com>, "Arnaud Le Hors" <lehors@us.ibm.com>
Unfortunately we can't escape visual misidentification: look at Omicron.

I think we do need to consider carefully how we can expand the repertoire of identifiers to deal with Unicode 3.1 anyway. From a practical point of view in parsers, it would be simpler and much faster to say that an identifier is anything that doesn't contain a syntax character (whitespace, <, >, ...).

If one wanted to reduce visual ambiguity, one could say that two identifiers are considered identical if they both have the same folded form, then define the folded form to be NFKC or something like it. However, that is probably impossible at this point.

Mark
___
Mark Davis, IBM GCoC, Cupertino
(408) 777-5850 [fax: 5891], mark.davis@us.ibm.com, president@unicode.org
http://maps.yahoo.com/py/maps.py?Pyt=Tmap&addr=10275+N.+De+Anza&csz=95014

Francois Yergeau <FYergeau@alis.com> on 02-08-2001 10:52:15

To: Andy Heninger/Cupertino/IBM@IBMUS, xml-editor@w3.org
cc: Glenn Marcy/Cupertino/IBM@IBMUS, Arnaud Le Hors/Cupertino/IBM@IBMUS, Mark Davis/Cupertino/IBM@IBMUS
Subject: RE: BaseChar problem in XML 1.0?

I believe the choice was made to avoid "ambiguous" identifiers, i.e. ones that you cannot unambiguously re-type if you see them. Imagine the identifier VISIBLE. Is that "V", then "I", then "S", etc., or "VI" (U+2165), then "S", then "I" (U+2160), etc.? Ambiguous. Out go all the Roman numerals from 2160 to 217F. No such problem for 2180-2182: the glyphs are distinct enough (same for U+2183, but it wasn't around when XML 1.0 was designed).

--
François Yergeau

-----Original Message-----
From: Andy Heninger [mailto:heninger@us.ibm.com]
Sent: 5 February 2001 13:09
To: xml-editor@w3.org
Cc: Glenn Marcy; Arnaud Le Hors
Subject: BaseChar problem in XML 1.0?

Hello XML Editors,

Here's a question that just came up regarding the definition of allowable identifier characters in XML.

From the XML spec, Production [85] BaseChar includes the characters [#x2180-#x2182]. These are Roman Numerals

    1000     CD
    5000     (No reasonable ASCII approximation)
    10000    (No reasonable ASCII approximation)

BaseChar does not include the remaining Unicode Roman Numerals, which encompass the range [#x2160-#x2183].

I checked with Mark Davis, and there is nothing from a Unicode perspective that sets the three included characters apart from the rest of the Unicode Roman Numerals. It would seem that they either all ought to be allowed or disallowed as BaseChars. Unicode's recommendations for Identifier characters allow them all.

Something does not seem right. Is there some logic here that escapes me, or is it possible that the inclusion of these characters is an editing error, or ???

--
Andy Heninger, IBM Cupertino, XML Technology Group
heninger@us.ibm.com
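
The folded-form comparison suggested in the reply at the top of this thread could be sketched roughly as follows. This is only an illustration, assuming NFKC as the folded form; the helper names fold and same_identifier are hypothetical and come from neither the XML spec nor any parser. It uses the VISIBLE example from the thread, and shows that Omicron remains a distinct character even after folding.

```python
import unicodedata


def fold(identifier: str) -> str:
    """Hypothetical folded form: NFKC normalization (an assumption)."""
    return unicodedata.normalize("NFKC", identifier)


def same_identifier(a: str, b: str) -> bool:
    """Treat two identifiers as identical when their folded forms match."""
    return fold(a) == fold(b)


plain = "VISIBLE"
# U+2165 ROMAN NUMERAL SIX, then "S", then U+2160 ROMAN NUMERAL ONE, then "BLE"
roman = "\u2165S\u2160BLE"

# The Roman numerals in 2160-217F carry compatibility decompositions
# to Latin letters, so NFKC folds both spellings to the same string.
print(same_identifier(plain, roman))

# Greek capital Omicron (U+039F) has no mapping to Latin "O",
# so this kind of folding does not remove that visual ambiguity.
print(same_identifier("O", "\u039F"))

# U+2180 (Roman numeral one thousand C D) has no decomposition
# and is left unchanged by NFKC.
print(fold("\u2180") == "\u2180")
```

Under this assumption the two spellings of VISIBLE compare equal while the Omicron pair still compares unequal, which matches the observation that visual misidentification cannot be escaped entirely.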
Received on Thursday, 8 February 2001 17:30:46 UTC