XMLP SOAP1.2 Issue 270: Further clarification...

Dear XMLP Editors,

Some time ago I worked with webMethods' XMLP rep, Asir, to produce a rewrite of SOAP 1.2 Part 2 Appendix B [1]. This is issue #270 [2] on your issues list. More recently I became the chair of the I18N-WG's Web Services task force, which has been delegated the review of Web Services recommendations, etc., by the I18N-WG Core group. At our recent face-to-face we reviewed the various i18n issues in SOAP 1.2 to ensure that we understood the actions taken by XMLP/SOAP WG and that we had no further comments.

One of the issues our task force reviewed was Appendix B. As I remarked in our last exchange (see the bottom of this email for a refresher), the text was pretty hard to read and very difficult to understand. The I18N-WG resolved therefore to prepare an improved version, which is included in this message between the ---- lines.

This version is functionally identical to the previous version. However, it should be easier to understand and therefore to correctly implement. Would you please incorporate this version into SOAP 1.2 Part 2 in place of the current version?

Best Regards,

Addison (on behalf of W3C-I18N-WG)

Addison P. Phillips
Director, Globalization Architecture
webMethods, Inc.

+1 408.962.5487 (phone)  +1 408.210.3569 (mobile)
Internationalization is an architecture.
It is not a feature. 

Chair, W3C-I18N-WG Web Services Task Force
To participate see http://www.w3.org/International/ws 

[1] http://www.w3.org/2000/xp/Group/2/06/LC/soap12-part2.html#namemap
[2] http://www.w3.org/2000/xp/Group/xmlp-lc-issues.html#x270


B. Mapping Application Defined Names to XML Names

This appendix details an algorithm for taking an application defined name, such as the name of a variable or field in a programming language, and mapping it to the Unicode characters that are legal in the names of XML elements and attributes as defined in [Namespaces in XML].

Note: Application defined names are generally subject to the specific restrictions of their underlying development environment. Ideally these names should be restricted to the subset of Unicode characters in accordance with the guidelines in Unicode Standard Annex #15, Annex 7 ("Programming Language Identifiers")[1]. Names that follow these guidelines will generally also follow the guidelines in [Namespaces in XML].

Hex Digits

[5]    hexDigit    ::=    [0-9A-F]
* Note, only uppercase letters A-F are defined here.

B.1 Rules for mapping application defined names to XML Names

1. An XML Name has two parts: Prefix and LocalPart. Let Prefix be determined per the rules and constraints specified in Namespaces in XML [3]. The LocalPart will be determined by transforming the application name of the object as follows.

2. Let "T" be the application name. "T" must be represented by a sequence of Unicode characters in Unicode Normalization Form C (NFC) [2]. Let "M" be the output of this algorithm, also as a sequence of Unicode characters.

Note: If the name in the application is represented in some non-Unicode character encoding, it must be converted to Unicode before starting the name mapping process. Ideally any such conversion will use a reversible conversion (so that the original byte sequence can be obtained from the Unicode sequence), although this is not possible for some encodings.

Note: Characters in the application name's original encoding that do not have a mapping to Unicode should be handled in some reasonable, application defined manner.

Note: "A sequence of Unicode characters" means a sequence of code points (sometimes called Unicode Scalar Values). This should not be taken to imply that the characters are using any particular encoding of Unicode. The conversion algorithm itself may use whatever Unicode encoding is most convenient on that platform. However, it is important to note that surrogate pair sequences (pairs of UTF-16 code units that represent characters in Unicode above U+FFFF) must be handled as a single Unicode code point (their Scalar Value, that is, the specific supplementary character they represent), rather than as the individual bytes or surrogate code units in UTF-16. Unpaired surrogates are not permitted. For example:

The UTF-16 sequence 0xD800 0xDC00 represents the Unicode character U+10000. When performing the following steps, the value U+10000 is considered to be a single Unicode character in the sequence T.

3. Let "i" be an integer representing the current position in "T", T(i), with a starting value of 1, such that T(1) is the first character in T, T(2) is the second, and so forth. Let "n" represent the last position in T, T(n).

4. Starting with 1, iterate across T by increasing i by 1 and perform the following evaluation on each character. 

   a. If T begins with the string "xml" (or any upper/lower case variation, such as "xmL", "XML", or "xMl"), encode T(1) using rule 4.c.i (that is, output either "_x0078_" for "x" or "_x0058_" for "X") and increase "i" to 4.

   Note: if the sequence starts with "xml"/"XML"/"xMl"/etc., and is followed by a Unicode combining character, the combining character is not considered part of the letter "l" for processing purposes. By contrast, precomposed characters such as U+013B (Latin capital letter L with cedilla) do not trigger this rule. That is, "xmĻ" is encoded as xmĻ, whereas xml(U+0300) is encoded as _x0078_ml_x0300_. [Note that Normalization Form C means that this special case will always produce the same resulting sequence.]

   b. If T(i) falls in the range 0xD800 through 0xDFFF (that is, there is an unpaired surrogate character in the sequence), stop with an error.

   c. Else if T(i) is not a valid XML NCName character (see [3]) or if i=1 and T(i) is not a valid first character of an XML NCName (see [3]) then:

      i. If T(i) < U+10000 and T(i) is not in the range 0xD800 through 0xDFFF, output to "M" the sequence "_x" followed by four hexDigits representing the Unicode Scalar Value followed by an underscore ("_") (for example, "x" (U+0078) would be encoded as "_x0078_").

      ii. Else if T(i) > U+FFFF, output to "M" the sequence "_x" followed by eight hexDigits representing the Unicode Scalar Value followed by an underscore ("_"). For example, the character 0x10FFFE would be encoded as "_x0010FFFE_".

   d. Else if T(i) = "_" (lowline) and T(i+1) is either "x" or "X", output "_x005F_" to M.

   e. Else output T(i) to M.
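
For illustration only, rules 2 through 4 above can be sketched in Python roughly as follows. This is not part of the proposed appendix text. The names map_name, escape, and approx_is_name_char are hypothetical. The character test is a loose approximation based on Python's own character properties, not the Letter, Digit, CombiningChar, and Extender tables of XML 1.0, so it accepts some characters (for example, scripts added to Unicode after XML 1.0) that the real test rejects. A real implementation would pass in an exact predicate. The sketch also assumes, following the examples, that the "m" and "l" after an escaped leading "x" are copied to the output unchanged, and it normalizes its input to NFC as a convenience rather than requiring NFC input.

```python
import unicodedata


def approx_is_name_char(ch: str, first: bool) -> bool:
    """Loose stand-in for the XML 1.0 NCName tests (an assumption, not exact)."""
    if ch == "_" or ch.isalpha():
        return True
    if first:
        return False  # digits, "-", "." and combining marks cannot start a name
    return ch in "-." or ch.isdigit() or unicodedata.combining(ch) != 0


def escape(cp: int) -> str:
    """Rule 4.c: four uppercase hex digits below U+10000, otherwise eight."""
    return f"_x{cp:04X}_" if cp < 0x10000 else f"_x{cp:08X}_"


def map_name(name: str, is_name_char=approx_is_name_char) -> str:
    # Rule 2: operate on NFC code points (normalizing here is a convenience).
    t = unicodedata.normalize("NFC", name)
    out = []
    i = 0  # 0-based index; the rules count from 1

    # Rule 4.a: a leading "xml" in any case mix gets its first letter escaped.
    if len(t) >= 3 and t[0] in "xX" and t[1] in "mM" and t[2] in "lL":
        out.append(escape(ord(t[0])))
        out.append(t[1:3])  # "m" and "l" copied as-is, as in the examples
        i = 3

    while i < len(t):
        ch = t[i]
        cp = ord(ch)
        if 0xD800 <= cp <= 0xDFFF:
            # Rule 4.b: an unpaired surrogate is an error.
            raise ValueError("unpaired surrogate in application name")
        if not is_name_char(ch, i == 0):
            # Rule 4.c: not allowed in an NCName at this position.
            out.append(escape(cp))
        elif ch == "_" and t[i + 1:i + 2] in ("x", "X"):
            # Rule 4.d: protect a literal "_x" so it is not read as an escape.
            out.append("_x005F_")
        else:
            # Rule 4.e: copy the character through.
            out.append(ch)
        i += 1
    return "".join(out)


print(map_name("Hello world"))  # Hello_x0020_world
print(map_name("Hello_xorld"))  # Hello_x005F_xorld
print(map_name("xml"))          # _x0078_ml
print(map_name("-xml"))         # _x002D_xml
```

Passing the predicate as a parameter keeps the escaping logic separate from the question of which edition of the XML character tables applies.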

[1] http://www.unicode.org/unicode/reports/tr15/#Programming_Language_Identifiers
[2] http://www.unicode.org/unicode/reports/tr15/
[3] http://www.w3.org/TR/REC-xml-names/
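
As a small illustration of the surrogate pair note under rule 2 (again, not part of the proposed text): on platforms that store strings as UTF-16, a high and low surrogate must be combined into one code point before the rules are applied. The helper name combine_surrogates is hypothetical. The arithmetic is the standard UTF-16 decoding formula.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one Unicode code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The pair from the example above maps to the single code point U+10000.
print(f"U+{combine_surrogates(0xD800, 0xDC00):04X}")  # prints U+10000
```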


Hello world -> Hello_x0020_world  // space not permitted
Hello_xorld -> Hello_x005F_xorld  // _x rule
Helloworld_ -> Helloworld_        

          x -> x
        xml -> _x0078_ml   // starts with xml
       -xml -> _x002D_xml  // starts with hyphen-minus
       x-ml -> x-ml        // not the string xml

     Ælfred -> Ælfred
   άγνωστος -> άγνωστος
ᜉᜅᜎᜈ        -> _x1709__x1705__x170E__x1708_   // the Tagalog block is newer and not permitted in XML 1.0
ᏙᏚᎥ         -> _x13D9__x13DA__x13A5_  // The Cherokee block is newer and not permitted in XML 1.0

xml̀moo -> _x0078_ml_x0300_moo  // Starts with "xml". Note that combining character U+0300 is considered as separate from the "l" in "xml".

Note to editor> The last example is the only one that I have changed (added).
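
To illustrate why Normalization Form C makes the "xml" special case stable, here is a short Python check using the standard unicodedata module. It only demonstrates normalization, not the mapping itself. "L" followed by a combining cedilla has a precomposed form, so NFC folds it into a single character and the name no longer begins with "xml". "l" followed by a combining grave has no precomposed form, so the "l" stays a separate character.

```python
import unicodedata

# "xmL" + U+0327 (combining cedilla): NFC composes L + cedilla into U+013B.
composed = unicodedata.normalize("NFC", "xmL\u0327")
print([f"U+{ord(c):04X}" for c in composed])
# ['U+0078', 'U+006D', 'U+013B']

# "xml" + U+0300 (combining grave): no precomposed l-with-grave exists,
# so the sequence is unchanged and still begins with "xml".
separate = unicodedata.normalize("NFC", "xml\u0300")
print([f"U+{ord(c):04X}" for c in separate])
# ['U+0078', 'U+006D', 'U+006C', 'U+0300']
```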


> -----Original Message-----
> From: Martin Gudgin [mailto:mgudgin@microsoft.com]
> Sent: Sunday, August 25, 2002 6:03 PM
> To: Addison Phillips [wM]; asirv@webmethods.com
> Cc: W3C Public Archive; Jean-Jacques Moreau; Marc Hadley; Nilo Mitra;
> Noah Mendelson; Henrik Frystyk Nielsen
> Subject: RE: Algorithm for mapping an application defined name to an XML
> name
> Addison,
> Thanks very much for your detailed comments. I've commented inline
> Martin
> > -----Original Message-----
> > From: Addison Phillips [wM] [mailto:aphillips@webmethods.com]
> > Sent: 24 August 2002 00:05
> > To: Martin Gudgin; asirv@webmethods.com
> > Cc: W3C Public Archive; Jean-Jacques Moreau; Marc Hadley; 
> > Nilo Mitra; Noah Mendelson; Henrik Frystyk Nielsen
> > Subject: RE: Algorithm for mapping an application defined 
> > name to an XML name
> > 
> > 
> > Hi Martin,
> > 
> > Thanks for the note. It's been awhile since I thought about this.
> Sorry it's taken us so long to incorporate your feedback.
> > 
> > My edits were done from the original proposal. Although I
> > modified the text to be more correct about various Unicode 
> > issues, I didn't change the structure of the original at all. 
> > (FWIW, I would have designed and written it differently. And 
> > I hate standards that obfuscate what's going on as much as 
> > this one does. It not being my document, I didn't rewrite it. 
> > I just edited the text to be more correct.)
> If you have ( or have time to produce ) a more readable version, I'm
> sure the editorial team would be very grateful.
/* much more deleted... */

Received on Wednesday, 11 December 2002 19:10:18 UTC