Re: XMLP SOAP1.2 Issue 270: Further clarification... from Mark Davis on 2002-12-12 (xmlp-comments@w3.org from December 2002)

From: Mark Davis <mark.davis@jtcsv.com>
Date: Wed, 11 Dec 2002 16:38:27 -0800
To: "Addison Phillips [wM]" <aphillips@webmethods.com>, <xmlp-comments@w3.org>
Cc: <public-i18n-ws@w3.org>, <w3c-i18n-ig@w3.org>
Message-ID: <002801c2a176$c9dde2d0$6ede2b09@DAVIS1>

A couple of comments.

> 4. Starting with 1, iterate across T by increasing i by 1 and perform the
> following evaluation on each character.
>
>    a. If T begins with the string "xml" (or any upper/lower case
>    variation, such as "xmL", "XML", or "xMl"), encode T(1) using rule 4.c.i
>    (that is, output either "_x0078_" for "x" or "_x0058_" for "X") and
>    increase "i" to 4.

Item 4a is confusing. If this is meant to be: "If the substring from T[i] to
T[n] begins with the string "xml"...", then it should say so. If not, then
it should be outside of the loop that starts with item 4.

>    b. If T(i) falls in the range 0xD800 through 0xDFFF (that is, there is
>    an unpaired surrogate character in the sequence), stop with an error.

It is certainly OK to have this step, and reject such a string, but in
practice it is not really necessary: garbage in, garbage out (GIGO).
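
For readers who want to see what the check in 4b amounts to, here is a
minimal Python sketch that turns UTF-16 code units into Unicode scalar values
and rejects unpaired surrogates. The function name scalar_values is
hypothetical and does not come from the appendix; the arithmetic is the
standard UTF-16 surrogate-pair formula.

```python
def scalar_values(units):
    """Convert a list of UTF-16 code units to Unicode scalar values.

    Raises ValueError on an unpaired surrogate, as step 4b requires.
    """
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: must be followed by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                low = units[i + 1]
                out.append(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00))
                i += 2
                continue
            raise ValueError("unpaired high surrogate at index %d" % i)
        if 0xDC00 <= u <= 0xDFFF:
            # Low surrogate with no preceding high surrogate.
            raise ValueError("unpaired low surrogate at index %d" % i)
        out.append(u)
        i += 1
    return out
```

With this sketch, the pair 0xD800 0xDC00 comes out as the single value
U+10000, which is how the proposed text asks T to be treated.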

>       i. If T(i) < U+10000 and T(i) is not in the range 0xD800 through
>       0xDFFF, output to "M" the sequence "_x" followed by four hexDigits
>       representing the Unicode Scalar Value followed by an underscore ("_")
>       (for example, "x" (U+0078) would be encoded as "_x0078_").
>
>       ii. Else if T(i) > U+FFFF, output to "M" the sequence "_x" followed
>       by eight hexDigits representing the Unicode Scalar Value. For example,
>       the character 0x10FFFE would be encoded as "_x0010FFFE_".
>
>    d. Else if T(i) = "_" (lowline) and T(i+1) = "x" or "X", output
>    "_x005F_" to M.

Two points here. They are *only* relevant if this section is a guideline,
and thus doesn't have compatibility issues.

A. Leading zeros are unnecessary, and make the string longer. They are not
necessary since there is a terminator. I'd recommend omitting them. Thus one
would have _x78_ instead of _x0078_, and _x10FFFE_ instead of _x0010FFFE_.
B. If a rare character (or character sequence) is chosen instead of "_x",
then it would be much less likely that the quoting mechanism in (d) would
need to be invoked. If a single rare character is chosen instead of a
sequence, it avoids look-ahead.
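
To illustrate point A: because every escape ends with "_", a decoder can read
hex digits up to the terminator, so padded and unpadded forms decode alike.
A rough Python sketch follows. The helpers escape_short and unescape are
hypothetical names for illustration only; the appendix does not define a
decoder, and this one assumes the input was produced by the encoder (so a
literal "_x" has already been quoted per rule d).

```python
import re

def escape_short(cp):
    """Escape a code point without leading zeros, e.g. 0x78 -> '_x78_'."""
    return "_x{:X}_".format(cp)

# One or more uppercase hex digits, ended by the "_" terminator.
_ESCAPE = re.compile(r"_x([0-9A-F]+)_")

def unescape(s):
    """Replace each _xHEX_ escape with the character it names."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), s)
```

Under these assumptions unescape("_x78_") and unescape("_x0078_") both give
"x", so dropping the zeros loses nothing as long as the terminator is kept.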

Mark
__________________________________
http://www.macchiato.com
►  “Eppur si muove” ◄

----- Original Message -----
From: "Addison Phillips [wM]" <aphillips@webmethods.com>
To: <xmlp-comments@w3.org>
Cc: <public-i18n-ws@w3.org>; <w3c-i18n-ig@w3.org>
Sent: Wednesday, December 11, 2002 16:08
Subject: XMLP SOAP1.2 Issue 270: Further clarification...


>
> Dear XMLP Editors,
>
> Some time ago I worked with webMethods' XMLP rep, Asir, to produce a
rewrite of SOAP 1.2 Part 2 Appendix B [1]. This is issue #270 [2] on your
issues list. More recently I became the chair of the I18N-WG's Web Services
task force, which has been delegated the review of Web Services
recommendations, etc., by the I18N-WG Core group. At our recent face-to-face
we reviewed the various i18n issues in SOAP 1.2 to ensure that we understood
the actions taken by XMLP/SOAP WG and that we had no further comments.
>
> One of the issues our task force reviewed was Appendix B. As I remarked in
our last exchange (see the bottom of this email for a refresher), the text
was pretty hard to read and very difficult to understand. The I18N-WG
resolved therefore to prepare an improved version, which is included in this
message between the ---- lines.
>
> This version is functionally identical to the previous version. However,
it should be easier to understand and therefore to correctly implement.
Would you please incorporate this version into SOAP 1.2 Part 2 in place of
the current version?
>
> Best Regards,
>
> Addison (on behalf of W3C-I18N-WG)
>
> Addison P. Phillips
> Director, Globalization Architecture
> webMethods, Inc.
>
> +1 408.962.5487 (phone)  +1 408.210.3569 (mobile)
> -------------------------------------------------
> Internationalization is an architecture.
> It is not a feature.
>
> Chair, W3C-I18N-WG Web Services Task Force
> To participate see http://www.w3.org/International/ws
>
> [1] http://www.w3.org/2000/xp/Group/2/06/LC/soap12-part2.html#namemap
> [2] http://www.w3.org/2000/xp/Group/xmlp-lc-issues.html#x270
>
> ----------------
>
> B. Mapping Application Defined Names to XML Names
>
> This appendix details an algorithm for taking an application defined name,
> such as the name of a variable or field in a programming language, and
> mapping it to the Unicode characters that are legal in the names of XML
> elements and attributes as defined in [Namespaces in XML].
>
> Note: Application defined names are generally subject to the specific
> restrictions of their underlying development environment. Ideally these
> names should be restricted to the subset of Unicode characters in accordance
> with the guidelines in Unicode Standard Annex #15, Annex 7 ("Programming
> Language Identifiers")[1]. Names that follow these guidelines will generally
> also follow the guidelines in [Namespaces in XML].
>
> Hex Digits
>
> [5]    hexDigit    ::=    [0-9A-F]
> * Note, only uppercase letters A-F are defined here.
>
> B.1 Rules for mapping application defined names to XML Names
>
> 1. An XML Name has two parts: Prefix and LocalPart. Let Prefix be
> determined per the rules and constraints specified in Namespaces in XML [3].
> The LocalPart will be determined by transforming the application name of the
> object as follows.
>
> 2. Let "T" be the application name. "T" must be represented by a sequence
> of Unicode characters in Unicode Normalization Form C (NFC) [2]. Let "M" be
> the output of this algorithm, also as a sequence of Unicode characters.
>
> Note: If the name in the application is represented in some non-Unicode
> character encoding, it must be converted to Unicode before starting the name
> mapping process. Ideally any such conversion will use a reversible
> conversion (so that the original byte sequence can be obtained from the
> Unicode sequence), although this is not possible for some encodings.
>
> Note: Characters in the application name's original encoding that do not
have a mapping to Unicode should be handled in some reasonable, application
defined manner.
>
> Note: "A sequence of Unicode characters" means a sequence of code points
> (sometimes called Unicode Scalar Values). This should not be taken to imply
> that the characters are using any particular encoding of Unicode. The
> conversion algorithm itself may use whatever Unicode encoding is most
> convenient on that platform. However, it is important to note that surrogate
> pair sequences (pairs of UTF-16 code units that represent characters in
> Unicode above U+FFFF) must be handled as a single Unicode code point (their
> Scalar Value, that is, the specific supplementary character they represent),
> rather than as the individual bytes or surrogate code units in UTF-16.
> Unpaired surrogates are not permitted. For example:
>
> The UTF-16 sequence 0xD800 0xDC00 represents the Unicode character
> U+10000. When performing the following steps, the value U+10000 is
> considered to be a single Unicode character in the sequence T.
>
> 3. Let "i" be an integer representing the current position in "T", T(i),
> with a starting value of 1, such that T(1) is the first character in T, T(2)
> is the second, and so forth. Let "n" represent the last position in T, T(n).
>
> 4. Starting with 1, iterate across T by increasing i by 1 and perform the
> following evaluation on each character.
>
>    a. If T begins with the string "xml" (or any upper/lower case
>    variation, such as "xmL", "XML", or "xMl"), encode T(1) using rule 4.c.i
>    (that is, output either "_x0078_" for "x" or "_x0058_" for "X") and
>    increase "i" to 4.
>
>    Note: if the sequence starts with "xml"/"XML"/"xMl"/etc., and is
>    followed by a Unicode combining character, the combining character is not
>    considered part of the letter "l" for processing purposes. By contrast,
>    precomposed characters such as U+013B (Latin letter capital L with cedilla)
>    do not trigger this rule. That is, "xmĻ" is encoded as xmĻ, whereas
>    xml(U+0300) is encoded as _x0078_ml_x0300_. [Note that Normalization Form C
>    means that this special case will always produce the same resulting
>    sequence.]
>
>    b. If T(i) falls in the range 0xD800 through 0xDFFF (that is, there is
>    an unpaired surrogate character in the sequence), stop with an error.
>
>    c. Else if T(i) is not a valid XML NCName character (see [3]) or if i=1
>    and T(i) is not a valid first character of an XML NCName (see [3]) then:
>
>       i. If T(i) < U+10000 and T(i) is not in the range 0xD800 through
>       0xDFFF, output to "M" the sequence "_x" followed by four hexDigits
>       representing the Unicode Scalar Value followed by an underscore ("_")
>       (for example, "x" (U+0078) would be encoded as "_x0078_").
>
>       ii. Else if T(i) > U+FFFF, output to "M" the sequence "_x" followed
>       by eight hexDigits representing the Unicode Scalar Value. For example,
>       the character 0x10FFFE would be encoded as "_x0010FFFE_".
>
>    d. Else if T(i) = "_" (lowline) and T(i+1) = "x" or "X", output
>    "_x005F_" to M.
>
>    e. Else output T(i) to M.
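
Steps 4a through 4e can be sketched in Python roughly as follows. This is an
illustration under stated assumptions, not text from the appendix. It treats
the "xml" test in 4a as a one-time check before the loop, and it assumes that
T(2) and T(3) are then copied unchanged, which is what the examples imply.
The names map_name, escape, is_name_start and is_name_char are hypothetical,
and the two predicates are ASCII-only stand-ins for the NCName productions
in [3], so this sketch would over-escape non-ASCII names.

```python
ASCII_LETTERS = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")

def is_name_start(ch):
    # Assumption: ASCII-only stand-in for the NCName start-character rule.
    return ch in ASCII_LETTERS or ch == "_"

def is_name_char(ch):
    # Assumption: ASCII-only stand-in for the NCName character rule.
    return is_name_start(ch) or ch in "0123456789-."

def escape(cp):
    """Rule 4.c: four hex digits below U+10000, eight otherwise."""
    width = 4 if cp < 0x10000 else 8
    return "_x{:0{w}X}_".format(cp, w=width)

def map_name(t):
    out = []
    i = 0  # zero-based here, unlike the one-based T(i) in the text
    # Rule 4.a, applied once before the loop.
    if len(t) >= 3 and t[0] in "xX" and t[1] in "mM" and t[2] in "lL":
        out.append(escape(ord(t[0])))
        out.append(t[1:3])
        i = 3
    while i < len(t):
        ch = t[i]
        cp = ord(ch)
        if 0xD800 <= cp <= 0xDFFF:
            # Rule 4.b: unpaired surrogate.
            raise ValueError("unpaired surrogate in name")
        elif not is_name_char(ch) or (i == 0 and not is_name_start(ch)):
            out.append(escape(cp))        # rule 4.c
        elif ch == "_" and t[i + 1:i + 2] in ("x", "X"):
            out.append("_x005F_")         # rule 4.d
        else:
            out.append(ch)                # rule 4.e
        i += 1
    return "".join(out)
```

A real implementation would replace the two predicates with the full
character classes from the XML specification.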
>
> [1] http://www.unicode.org/unicode/reports/tr15/#Programming_Language_Identifiers
> [2] http://www.unicode.org/unicode/reports/tr15/
> [3] http://www.w3.org/TR/REC-xml-names/
>
> Examples:
>
>
>
> Hello world -> Hello_x0020_world  // space not permitted
> Hello_xorld -> Hello_x005F_xorld  // _x rule
> Helloworld_ -> Helloworld_
>
>           x -> x
>         xml -> _x0078_ml   // starts with xml
>        -xml -> _x002D_xml  // starts with hyphen-minus
>        x-ml -> x-ml        // not the string xml
>
>      Ælfred -> Ælfred
>    άγνωστος -> άγνωστος
> ᜉᜅᜎᜈ        -> _x1709__x1705__x170E__x1708_   // the Tagalog block is newer and not permitted in XML 1.0
> ᏙᏚᎥ         -> _x13D9__x13DA__x13A5_  // the Cherokee block is newer and not permitted in XML 1.0
>
> xml̀moo -> _x0078_ml_x0300_moo  // Starts with "xml". Note that combining character U+0300 is considered as separate from the "l" in "xml".
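
The NFC requirement in step 2 is what makes the "xml" special case
deterministic. A small Python sketch, using the standard library's
unicodedata.normalize; the wrapper name to_nfc is hypothetical:

```python
import unicodedata

def to_nfc(name):
    """Normalize an application name to NFC before mapping it."""
    return unicodedata.normalize("NFC", name)

# "L" + combining cedilla (U+0327) composes to U+013B, so the name
# no longer starts with the three letters x, m, l.
composed = to_nfc("xmL\u0327")

# "l" + combining grave (U+0300) has no precomposed form, so the
# name still starts with "xml" and the combining mark stays separate.
uncomposed = to_nfc("xml\u0300")
```

Here composed is the two letters "xm" followed by U+013B, while uncomposed
keeps "xml" followed by U+0300, matching the distinction drawn in the note
under rule 4a.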
>
> Note to editor> The last example is the only one that I have changed
(added).
>
>
>
>
> ----------------
>
>
> > -----Original Message-----
> > From: Martin Gudgin [mailto:mgudgin@microsoft.com]
> > Sent: Sunday, August 25, 2002 6:03 PM
> > To: Addison Phillips [wM]; asirv@webmethods.com
> > Cc: W3C Public Archive; Jean-Jacques Moreau; Marc Hadley; Nilo Mitra;
> > Noah Mendelson; Henrik Frystyk Nielsen
> > Subject: RE: Algorithm for mapping an application defined name to an XML
> > name
> >
> >
> > Addison,
> >
> > Thanks very much for your detailed comments. I've commented inline
> >
> > Martin
> >
> > > -----Original Message-----
> > > From: Addison Phillips [wM] [mailto:aphillips@webmethods.com]
> > > Sent: 24 August 2002 00:05
> > > To: Martin Gudgin; asirv@webmethods.com
> > > Cc: W3C Public Archive; Jean-Jacques Moreau; Marc Hadley;
> > > Nilo Mitra; Noah Mendelson; Henrik Frystyk Nielsen
> > > Subject: RE: Algorithm for mapping an application defined
> > > name to an XML name
> > >
> > >
> > > Hi Martin,
> > >
> > > Thanks for the note. It's been a while since I thought about this.
> >
> > Sorry it's taken us so long to incorporate your feedback.
> >
> > >
> > > My edits were done from the original proposal. Although I
> > > modified the text to be more correct about various Unicode
> > > issues, I didn't change the structure of the original at all.
> > > (FWIW, I would have designed and written it differently. And
> > > I hate standards that obfuscate what's going on as much as
> > > this one does. It not being my document, I didn't rewrite it.
> > > I just edited the text to be more correct.)
> >
> > If you have ( or have time to produce ) a more readable version, I'm
> > sure the editorial team would be very grateful.
> >
> /* much more deleted... */
>
>
Received on Wednesday, 11 December 2002 19:57:15 UTC