- From: Addison Phillips [wM] <aphillips@webmethods.com>
- Date: Wed, 11 Dec 2002 17:46:03 -0800
- To: "Mark Davis" <mark.davis@jtcsv.com>, <xmlp-comments@w3.org>
- Cc: <public-i18n-ws@w3.org>, <w3c-i18n-ig@w3.org>
Hi Mark,

Thanks for your comments. I should note that, while we did rewrite the
section from stem to stern, the actual algorithm is entirely unchanged, a
fact that is (I am given to understand) important to XMLP. Hence the
escaping mechanism and leading zeroes are preserved.

~Addison

> Item 4a is confusing. If this is meant to be: "If the substring from T[i]
> to T[n] begins with the string "xml"...", then it should say so. If not,
> then it should be outside of the loop that starts with item 4.

This should be outside the loop, since it applies only if i == 1 (as it
were), that is, if the string T starts with "xml".

> It is certainly ok to have this step, and reject such a string, but in
> practice it is not really necessary. Gigo.

That's true. I added this instruction during an earlier revision and didn't
remove it after clarifying the use of Unicode scalars elsewhere.

> A. Leading zeros are unnecessary, and make the string longer. They are not
> necessary since there is a terminator. I'd recommend omitting them. Thus
> one would have _x78_ instead of _x0078_, and _x10FFFE_ instead of
> _x0010FFFE_.
> B. If a rare character (or character sequence) is chosen instead of "_x",
> then it would be much less likely that the quoting mechanism in (d) would
> need to be invoked. If a single rare character is chosen instead of a
> sequence, it avoids look-ahead.
>
> Mark
> __________________________________
> http://www.macchiato.com
> ► “Eppur si muove” ◄
>
> ----- Original Message -----
> From: "Addison Phillips [wM]" <aphillips@webmethods.com>
> To: <xmlp-comments@w3.org>
> Cc: <public-i18n-ws@w3.org>; <w3c-i18n-ig@w3.org>
> Sent: Wednesday, December 11, 2002 16:08
> Subject: XMLP SOAP1.2 Issue 270: Further clarification...
>
> > Dear XMLP Editors,
> >
> > Some time ago I worked with webMethods' XMLP rep, Asir, to produce a
> > rewrite of SOAP 1.2 Part 2 Appendix B [1]. This is issue #270 [2] on
> > your issues list.
> > More recently I became the chair of the I18N-WG's Web Services task
> > force, which has been delegated the review of Web Services
> > recommendations, etc., by the I18N-WG Core group. At our recent
> > face-to-face we reviewed the various i18n issues in SOAP 1.2 to ensure
> > that we understood the actions taken by XMLP/SOAP WG and that we had no
> > further comments.
> >
> > One of the issues our task force reviewed was Appendix B. As I remarked
> > in our last exchange (see the bottom of this email for a refresher), the
> > text was pretty hard to read and very difficult to understand. The
> > I18N-WG resolved therefore to prepare an improved version, which is
> > included in this message between the ---- lines.
> >
> > This version is functionally identical to the previous version. However,
> > it should be easier to understand and therefore to correctly implement.
> > Would you please incorporate this version into SOAP 1.2 Part 2 in place
> > of the current version?
> >
> > Best Regards,
> >
> > Addison (on behalf of W3C-I18N-WG)
> >
> > Addison P. Phillips
> > Director, Globalization Architecture
> > webMethods, Inc.
> >
> > +1 408.962.5487 (phone) +1 408.210.3569 (mobile)
> > -------------------------------------------------
> > Internationalization is an architecture.
> > It is not a feature.
> >
> > Chair, W3C-I18N-WG Web Services Task Force
> > To participate see http://www.w3.org/International/ws
> >
> > [1] http://www.w3.org/2000/xp/Group/2/06/LC/soap12-part2.html#namemap
> > [2] http://www.w3.org/2000/xp/Group/xmlp-lc-issues.html#x270
> >
> > ----------------
> >
> > B. Mapping Application Defined Names to XML Names
> >
> > This appendix details an algorithm for taking an application defined
> > name, such as the name of a variable or field in a programming language,
> > and mapping it to the Unicode characters that are legal in the names of
> > XML elements and attributes as defined in [Namespaces in XML].
> >
> > Note: Application defined names are generally subject to the specific
> > restrictions of their underlying development environment. Ideally these
> > names should be restricted to the subset of Unicode characters in
> > accordance with the guidelines in Unicode Standard Annex #15, Annex 7
> > ("Programming Language Identifiers")[1]. Names that follow these
> > guidelines will generally also follow the guidelines in [Namespaces in
> > XML].
> >
> > Hex Digits
> >
> > [5] hexDigit ::= [0-9A-F]
> > * Note, only uppercase letters A-F are defined here.
> >
> > B.1 Rules for mapping application defined names to XML Names
> >
> > 1. An XML Name has two parts: Prefix and LocalPart. Let Prefix be
> > determined per the rules and constraints specified in Namespaces in XML
> > [3]. The LocalPart will be determined by transforming the application
> > name of the object as follows.
> >
> > 2. Let "T" be the application name. "T" must be represented by a
> > sequence of Unicode characters in Unicode Normalization Form C (NFC)
> > [2]. Let "M" be the output of this algorithm, also as a sequence of
> > Unicode characters.
> >
> > Note: If the name in the application is represented in some non-Unicode
> > character encoding, it must be converted to Unicode before starting the
> > name mapping process. Ideally any such conversion will use a reversible
> > conversion (so that the original byte sequence can be obtained from the
> > Unicode sequence), although this is not possible for some encodings.
> >
> > Note: Characters in the application name's original encoding that do not
> > have a mapping to Unicode should be handled in some reasonable,
> > application defined manner.
> >
> > Note: "A sequence of Unicode characters" means a sequence of code points
> > (sometimes called Unicode Scalar Values). This should not be taken to
> > imply that the characters are using any particular encoding of Unicode.
> > The conversion algorithm itself may use whatever Unicode encoding is
> > most convenient on that platform. However, it is important to note that
> > surrogate pair sequences (pairs of UTF-16 code units that represent
> > characters in Unicode above U+FFFF) must be handled as a single Unicode
> > code point (their Scalar Value, that is, the specific supplementary
> > character they represent), rather than as the individual bytes or
> > surrogate characters in UTF-16. Unpaired surrogates are not permitted.
> > For example:
> >
> > The UTF-16 sequence 0xD800 0xDC00 represents the Unicode character
> > U+10000. When performing the following steps, the value U+10000 is
> > considered to be a single Unicode character in the sequence T.
> >
> > 3. Let "i" be an integer representing the current position in "T", T(i),
> > with a starting value of 1, such that T(1) is the first character in T,
> > T(2) is the second, and so forth. Let "n" represent the last position in
> > T, T(n).
> >
> > 4. Starting with 1, iterate across T by increasing i by 1 and perform
> > the following evaluation on each character.
> >
> >    a. If T begins with the string "xml" (or any upper/lower case
> >    variation, such as "xmL", "XML", or "xMl"), encode T(1) using rule
> >    4.c.i (that is, output either "_x0078_" for "x" or "_x0058_" for "X")
> >    and increase "i" to 4.
> >
> >    Note: if the sequence starts with "xml"/"XML"/"xMl"/etc., and is
> >    followed by a Unicode combining character, the combining character is
> >    not considered part of the letter "l" for processing purposes. By
> >    contrast, precomposed characters such as U+013B (Latin capital letter
> >    L with cedilla) do not trigger this rule. That is, "xmĻ" is encoded
> >    as xmĻ, whereas xml(U+0300) is encoded as _x0078_ml_x0300_. [Note
> >    that Normalization Form C means that this special case will always
> >    produce the same resulting sequence.]
> >
> >    b. If T(i) falls in the range 0xD800 through 0xDFFF (that is, there
> >    is an unpaired surrogate character in the sequence), stop with an
> >    error.
> >
> >    c. Else if T(i) is not a valid XML NCName character (see [3]) or if
> >    i=1 and T(i) is not a valid first character of an XML NCName (see
> >    [3]) then:
> >
> >       i. If T(i) < U+10000 and T(i) is not in the range 0xD800 through
> >       0xDFFF, output to "M" the sequence "_x" followed by four hexDigits
> >       representing the Unicode Scalar Value followed by an underscore
> >       ("_") (for example, "x" (U+0078) would be encoded as "_x0078_").
> >
> >       ii. Else if T(i) > U+FFFF, output to "M" the sequence "_x"
> >       followed by eight hexDigits representing the Unicode Scalar Value
> >       followed by an underscore ("_"). For example, the character
> >       0x10FFFE would be encoded as "_x0010FFFE_".
> >
> >    d. Else if T(i) = "_" (lowline) and T(i+1) = "x" or "X", output
> >    "_x005F_" to M.
> >
> >    e. Else output T(i) to M.
> >
> > [1] http://www.unicode.org/unicode/reports/tr15/#Programming_Language_Identifiers
> > [2] http://www.unicode.org/unicode/reports/tr15/
> > [3] http://www.w3.org/TR/REC-xml-names/
> >
> > Examples:
> >
> > Hello world -> Hello_x0020_world // space not permitted
> > Hello_xorld -> Hello_x005F_xorld // _x rule
> > Helloworld_ -> Helloworld_
> >
> > x -> x
> > xml -> _x0078_ml // starts with xml
> > -xml -> _x002D_xml // starts with hyphen-minus
> > x-ml -> x-ml // not the string xml
> >
> > Ælfred -> Ælfred
> > άγνωστος -> άγνωστος
> > ᜉᜅᜎᜈ -> _x1709__x1705__x170E__x1708_ // the Tagalog block is newer and not permitted in XML 1.0
> > ᏙᏚᎥ -> _x13D9__x13DA__x13A5_ // The Cherokee block is newer and not permitted in XML 1.0
> >
> > xml̀moo -> _x0078_ml_x0300_moo // Starts with "xml". Note that combining character U+0300 is considered as separate from the "l" in "xml".
> >
> > Note to editor> The last example is the only one that I have changed (added).
> >
> > ----------------
> >
> > > -----Original Message-----
> > > From: Martin Gudgin [mailto:mgudgin@microsoft.com]
> > > Sent: Sunday, August 25, 2002 6:03 PM
> > > To: Addison Phillips [wM]; asirv@webmethods.com
> > > Cc: W3C Public Archive; Jean-Jacques Moreau; Marc Hadley; Nilo Mitra;
> > > Noah Mendelson; Henrik Frystyk Nielsen
> > > Subject: RE: Algorithm for mapping an application defined name to an
> > > XML name
> > >
> > > Addison,
> > >
> > > Thanks very much for your detailed comments. I've commented inline
> > >
> > > Martin
> > >
> > > > -----Original Message-----
> > > > From: Addison Phillips [wM] [mailto:aphillips@webmethods.com]
> > > > Sent: 24 August 2002 00:05
> > > > To: Martin Gudgin; asirv@webmethods.com
> > > > Cc: W3C Public Archive; Jean-Jacques Moreau; Marc Hadley;
> > > > Nilo Mitra; Noah Mendelson; Henrik Frystyk Nielsen
> > > > Subject: RE: Algorithm for mapping an application defined
> > > > name to an XML name
> > > >
> > > > Hi Martin,
> > > >
> > > > Thanks for the note. It's been awhile since I thought about this.
> > >
> > > Sorry it's taken us so long to incorporate your feedback.
> > >
> > > > My edits were done from the original proposal. Although I
> > > > modified the text to be more correct about various Unicode
> > > > issues, I didn't change the structure of the original at all.
> > > > (FWIW, I would have designed and written it differently. And
> > > > I hate standards that obfuscate what's going on as much as
> > > > this one does. It not being my document, I didn't rewrite it.
> > > > I just edited the text to be more correct.)
> > >
> > > If you have ( or have time to produce ) a more readable version, I'm
> > > sure the editorial team would be very grateful.
> > >
> > /* much more deleted... */
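
For readers who want to try the mapping rules in section B.1 quoted above, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the appendix's normative text. The names map_name, escape, is_name_start and is_name_char are hypothetical. The two character tests cover only ASCII and the Latin-1 letters, so they are a stand-in for the full NCName tables of [Namespaces in XML] and will escape many characters that XML actually permits. The input is assumed to be in NFC already, and the "xml" prefix check of step 4a is done once before the loop, as suggested in the reply at the top of this message.

```python
def _is_letter(ch: str) -> bool:
    # Partial stand-in table (assumption): ASCII letters plus Latin-1 letters.
    cp = ord(ch)
    return ("A" <= ch <= "Z" or "a" <= ch <= "z"
            or (0xC0 <= cp <= 0xFF and cp not in (0xD7, 0xF7)))


def is_name_start(ch: str) -> bool:
    # Hypothetical helper: may this character begin an NCName?
    return _is_letter(ch) or ch == "_"


def is_name_char(ch: str) -> bool:
    # Hypothetical helper: may this character appear anywhere in an NCName?
    return is_name_start(ch) or "0" <= ch <= "9" or ch in ".-"


def escape(ch: str) -> str:
    # Rule 4.c: "_x" + 4 (or 8) uppercase hex digits + "_".
    cp = ord(ch)
    return f"_x{cp:04X}_" if cp < 0x10000 else f"_x{cp:08X}_"


def map_name(t: str) -> str:
    """Map an application name T (assumed NFC) to an XML LocalPart M."""
    out = []
    i = 0  # zero-based here; the quoted text counts from 1
    # Step 4a, hoisted out of the loop: escape the leading "x" of "xml".
    if len(t) >= 3 and t[0] in "xX" and t[1] in "mM" and t[2] in "lL":
        out.append(escape(t[0]) + t[1:3])
        i = 3
    while i < len(t):
        ch = t[i]
        if 0xD800 <= ord(ch) <= 0xDFFF:
            # Step 4b: Python strings hold code points, so any surrogate
            # seen here is unpaired.
            raise ValueError("unpaired surrogate at position %d" % (i + 1))
        if not is_name_char(ch) or (i == 0 and not is_name_start(ch)):
            out.append(escape(ch))          # step 4c
        elif ch == "_" and i + 1 < len(t) and t[i + 1] in "xX":
            out.append("_x005F_")           # step 4d
        else:
            out.append(ch)                  # step 4e
        i += 1
    return "".join(out)


if __name__ == "__main__":
    for name in ["Hello world", "Hello_xorld", "xml", "-xml", "x-ml"]:
        print(name, "->", map_name(name))
```

Because Python iterates over strings by code point, a supplementary character such as U+10FFFE reaches the loop as a single item and takes the eight-digit branch of escape. A port to a UTF-16 based language would need to combine surrogate pairs first.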
Received on Wednesday, 11 December 2002 20:46:06 UTC