RE: Algorithm for mapping an application defined name to an XML name from Addison Phillips [wM] on 2002-11-25 (public-i18n-ws@w3.org from November 2002)

From: Addison Phillips [wM] <aphillips@webmethods.com>
Date: Mon, 25 Nov 2002 11:14:39 -0800
To: <public-i18n-ws@w3.org>
Cc: <w3-i18n-wg@w3.org>, <asirv@webmethods.com>
Message-ID: <PNEHIBAMBMLHDMJDDFLHMEHEGBAA.aphillips@webmethods.com>
Team,

One of the issues we reviewed in the WSTF face-to-face was the ugly Appendix B (app name to XML name mapping) section [1]. I had a hand in addressing previous versions of this awful, impenetrable text (see mail thread below).

I would like to forward a more readable version to the XMLP editors. I composed one such on the flight home yesterday, which is between the --- lines in this message. Before I do that, I am offering an opportunity for review here. There are, no doubt, some issues with this text.

[1] http://www.w3.org/2000/xp/Group/2/06/LC/soap12-part2.html#namemap

----------------

B. Mapping Application Defined Names to XML Names

This appendix details an algorithm for taking an application defined name, such as the name of a variable or field in a programming language, and mapping it to the Unicode characters that are legal in the names of XML elements and attributes as defined in [Namespaces in XML].

Hex Digits

[5]    hexDigit    ::=    [0-9A-F]
* Note, only uppercase letters A-F are defined here.

B.1 Rules for mapping application defined names to XML Names

1. An XML Name has two parts: Prefix and LocalPart. Let Prefix be determined per the rules and constraints specified in Namespaces in XML [Namespaces in XML]. The LocalPart will be determined by transforming the application name of the object as follows.

2. Let "T" be the application name. "T" must be represented by a sequence of Unicode characters in Unicode Normalization Form C (NFC). Let "M" be the output of the algorithm, also as a sequence of Unicode characters.

Note: If the name in the application is represented in some non-Unicode character encoding, it must be converted to Unicode before starting the name mapping process. Ideally any such conversion will be reversible (so that the original byte sequence can be obtained from the Unicode sequence), although this is not possible for some encodings.

Note: Characters in the application name's original encoding that do not have a mapping to Unicode should be handled in some reasonable, application defined manner.
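
As an illustration of the two notes above, here is a minimal Python sketch of the preparation step: decode a legacy-encoded name to Unicode, normalize it to NFC, and check whether the conversion round-trips. The function name to_unicode_nfc and the choice of strict decoding are assumptions made for this example, not part of the proposed text; how unmappable bytes are handled remains application defined.

```python
import unicodedata


def to_unicode_nfc(raw: bytes, encoding: str) -> str:
    """Decode a legacy-encoded name and return it in NFC.

    Hypothetical helper for illustration. Strict decoding raises
    UnicodeDecodeError for bytes with no mapping; an application
    may choose a different policy.
    """
    text = raw.decode(encoding, errors="strict")
    return unicodedata.normalize("NFC", text)


def is_reversible(raw: bytes, encoding: str) -> bool:
    """Report whether re-encoding the decoded text yields the original bytes."""
    return raw.decode(encoding).encode(encoding) == raw


# A Latin-1 byte string decodes to a precomposed character.
name = to_unicode_nfc(b"caf\xe9", "latin-1")

# A decomposed sequence (e + U+0301) composes to U+00E9 under NFC.
composed = unicodedata.normalize("NFC", "cafe\u0301")
```

Both name and composed end up as the same four-character string, which is the reason the mapping is defined on NFC input.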

Note: "A sequence of Unicode characters" means a sequence of code points (sometimes called Unicode Scalar Values). This should not be taken to imply that the characters are using any particular encoding of Unicode. The conversion algorithm itself may use whatever Unicode encoding is most convenient on that platform. However, it is important to note that surrogate pair sequences (pairs of UTF-16 code units that represent characters in Unicode above U+FFFF) must be handled as a single Unicode code point, rather than as the individual bytes or surrogate characters. Unpaired surrogates are not permitted.

  That is, the UTF-16 sequence 0xD800 0xDC00 represents the Unicode character U+10000. When performing the following steps, the value U+10000 is considered to be a single Unicode character in the sequence T. 
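
For platforms that hold names as UTF-16, the note above can be sketched as follows. This is a rough illustration only: the function name code_points_from_utf16 is made up for this example, and raising ValueError on an unpaired surrogate is one assumed way of "stopping with an error".

```python
def code_points_from_utf16(units):
    """Combine UTF-16 code units into Unicode code points.

    Hypothetical helper: surrogate pairs become one code point,
    unpaired surrogates raise ValueError.
    """
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: must be followed by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                low = units[i + 1]
                out.append(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00))
                i += 2
                continue
            raise ValueError("unpaired high surrogate at index %d" % i)
        if 0xDC00 <= u <= 0xDFFF:
            raise ValueError("unpaired low surrogate at index %d" % i)
        out.append(u)
        i += 1
    return out


# The pair 0xD800 0xDC00 is one character, U+10000.
assert code_points_from_utf16([0xD800, 0xDC00]) == [0x10000]
```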

3. Let "i" be an integer representing the current position in "T", T(i), with a starting value of 1, such that T(1) is the first character in T, T(2) is the second, and so forth. Let "n" represent the last position in T, T(n).

4. Starting with i = 1, iterate across T, increasing i by 1 after each character until i exceeds n, and perform the following evaluation on each character T(i).

   a. If i = 1 and T begins with the string "xml" (or any upper/lower case variation, such as "xmL", "XML", or "xMl"), encode T(1) using rule 4.c.i (that is, output either "_x0078_" for "x" or "_x0058_" for "X"), output T(2) and T(3) to M unchanged, and increase "i" to 4.

   Note: If the sequence starts with "xml"/"XML"/"xMl"/etc. and is followed by a combining character, the combining character is not considered part of the letter "l" for processing purposes. By contrast, precomposed characters such as U+013B (Latin capital letter L with cedilla) do not trigger this rule. That is, "xmĻ" is encoded as xmĻ, whereas xml(U+0300) is encoded as _x0078_ml_x0300_. [For reviewers, note that NFC means that these sequences will always evaluate identically.]

   b. If T(i) falls in the range 0xD800 through 0xDFFF (that is, there is an unpaired surrogate character in the sequence), stop with an error.

   c. Else if T(i) is not a valid XML NCName character (see [Namespaces in XML]) or if i=1 and T(i) is not a valid first character of an XML NCName (see [Namespaces in XML]) then:

      i. If T(i) < 0x10000 and T(i) is not in the range 0xD800 through 0xDFFF, output to "M" the sequence "_x" followed by four hexDigits representing the Unicode Scalar Value followed by an underscore ("_") (for example, "x" (U+0078) would be encoded as "_x0078_").

      ii. Else if T(i) > 0xFFFF, output to "M" the sequence "_x" followed by eight hexDigits representing the Unicode Scalar Value followed by an underscore ("_"). For example, the character 0x10FFFE would be encoded as "_x0010FFFE_".

   d. Else if T(i) = "_" (low line) and T(i+1) is either "x" or "X", output "_x005F_" to M.

   e. Else output T(i) to M.
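
The rules in step 4 can be sketched in Python roughly as follows. This is a sketch under stated assumptions, not a reference implementation: the names map_name, escape, is_name_start and is_name_char are invented for this example, and the two default predicates are deliberately simplified ASCII-only stand-ins for the NCName character tables in [Namespaces in XML]. With those defaults every non-ASCII character is escaped, so a real implementation would pass in predicates built from the full tables. The sketch also emits T(2) and T(3) unchanged after an "xml" prefix, which is what the "xml" example implies.

```python
import unicodedata


def is_name_start(cp: int) -> bool:
    # ASSUMPTION: ASCII-only stand-in for "valid first NCName character".
    return cp == 0x5F or 0x41 <= cp <= 0x5A or 0x61 <= cp <= 0x7A


def is_name_char(cp: int) -> bool:
    # ASSUMPTION: ASCII-only stand-in for "valid NCName character".
    return is_name_start(cp) or 0x30 <= cp <= 0x39 or cp in (0x2D, 0x2E)


def escape(cp: int) -> str:
    # Rules 4.c.i and 4.c.ii: four or eight uppercase hex digits.
    width = 4 if cp < 0x10000 else 8
    return "_x{:0{w}X}_".format(cp, w=width)


def map_name(t: str, name_start=is_name_start, name_char=is_name_char) -> str:
    t = unicodedata.normalize("NFC", t)  # step 2: T is in NFC
    out = []
    i = 0  # zero-based here; the text counts from 1
    # Rule 4.a: a leading "xml" in any case variation.
    if len(t) >= 3 and t[0] in "xX" and t[1] in "mM" and t[2] in "lL":
        out.append(escape(ord(t[0])))
        out.append(t[1:3])
        i = 3
    while i < len(t):
        cp = ord(t[i])
        if 0xD800 <= cp <= 0xDFFF:
            # Rule 4.b: unpaired surrogate.
            raise ValueError("unpaired surrogate at position %d" % (i + 1))
        if not name_char(cp) or (i == 0 and not name_start(cp)):
            out.append(escape(cp))   # rule 4.c
        elif t[i] == "_" and t[i + 1:i + 2] in ("x", "X"):
            out.append("_x005F_")    # rule 4.d
        else:
            out.append(t[i])         # rule 4.e
        i += 1
    return "".join(out)
```

Passing the predicates as parameters keeps the control flow of the rules separate from the character tables, which are the part most likely to differ between XML versions.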

Examples:



Hello world -> Hello_x0020_world  // space not permitted
Hello_xorld -> Hello_x005F_xorld  // _x rule
Helloworld_ -> Helloworld_        

          x -> x
        xml -> _x0078_ml   // starts with xml
       -xml -> _x002D_xml  // starts with hyphen-minus
       x-ml -> x-ml        // not the string xml

     Ælfred -> Ælfred
   άγνωστος -> άγνωστος
ᜉᜅᜎᜈ        -> _x1709__x1705__x170E__x1708_   // the Tagalog block is new and not permitted in XML 1.0
ᏙᏚᎥ         -> _x13D9__x13DA__x13A5_  // The Cherokee block is new and not permitted in XML 1.0

xml̀moo -> _x0078_ml_x0300_moo  // Starts with "xml". Note that combining character U+0300 is considered as separate from the "l" in "xml".
      



----------------

Hopefully this is more legible. Comments VERY welcome.

Addison

Addison P. Phillips
Director, Globalization Architecture
webMethods, Inc.

+1 408.962.5487 (phone)  +1 408.210.3569 (mobile)
-------------------------------------------------
Internationalization is an architecture.
It is not a feature. 

Chair, W3C-I18N-WG Web Services Task Force
To participate see http://www.w3.org/International/ws 


> -----Original Message-----
> From: Martin Gudgin [mailto:mgudgin@microsoft.com]
> Sent: Sunday, August 25, 2002 6:03 PM
> To: Addison Phillips [wM]; asirv@webmethods.com
> Cc: W3C Public Archive; Jean-Jacques Moreau; Marc Hadley; Nilo Mitra;
> Noah Mendelson; Henrik Frystyk Nielsen
> Subject: RE: Algorithm for mapping an application defined name to an XML
> name
> 
> 
> Addison,
> 
> Thanks very much for your detailed comments. I've commented inline
> 
> Martin
> 
> > -----Original Message-----
> > From: Addison Phillips [wM] [mailto:aphillips@webmethods.com]
> > Sent: 24 August 2002 00:05
> > To: Martin Gudgin; asirv@webmethods.com
> > Cc: W3C Public Archive; Jean-Jacques Moreau; Marc Hadley; 
> > Nilo Mitra; Noah Mendelson; Henrik Frystyk Nielsen
> > Subject: RE: Algorithm for mapping an application defined 
> > name to an XML name
> > 
> > 
> > Hi Martin,
> > 
> > Thanks for the note. It's been awhile since I thought about this.
> 
> Sorry it's taken us so long to incorporate your feedback.
> 
> > 
> > My edits were done from the original proposal. Although I
> > modified the text to be more correct about various Unicode 
> > issues, I didn't change the structure of the original at all. 
> > (FWIW, I would have designed and written it differently. And 
> > I hate standards that obfuscate what's going on as much as 
> > this one does. It not being my document, I didn't rewrite it. 
> > I just edited the text to be more correct.)
> 
> If you have ( or have time to produce ) a more readable version, I'm
> sure the editorial team would be very grateful.
> 
/* much more deleted... */
Received on Monday, 25 November 2002 14:14:43 UTC