Re: XMLP SOAP1.2 Issue 270: Further clarification... from Mark Davis on 2002-12-12 (xmlp-comments@w3.org from December 2002)

From: Mark Davis <mark.davis@jtcsv.com>
Date: Wed, 11 Dec 2002 18:10:22 -0800
To: "Addison Phillips [wM]" <aphillips@webmethods.com>, <xmlp-comments@w3.org>
Cc: <public-i18n-ws@w3.org>, <w3c-i18n-ig@w3.org>
Message-ID: <003e01c2a183$a11dd5f0$6ede2b09@DAVIS1>
I thought that might be the case (re the last two items), but included them
just in case.

Mark
__________________________________
http://www.macchiato.com
►  “Eppur si muove” ◄

----- Original Message -----
From: "Addison Phillips [wM]" <aphillips@webmethods.com>
To: "Mark Davis" <mark.davis@jtcsv.com>; <xmlp-comments@w3.org>
Cc: <public-i18n-ws@w3.org>; <w3c-i18n-ig@w3.org>
Sent: Wednesday, December 11, 2002 17:46
Subject: RE: XMLP SOAP1.2 Issue 270: Further clarification...


Hi Mark,

Thanks for your comments.

I should note that, while we did rewrite the section from stem to stern, the
actual algorithm is entirely unchanged, a fact that is (I am given to
understand) important to XMLP. Hence the escaping mechanism and leading
zeroes are preserved.

~Addison

>
> Item 4a is confusing. If this is meant to be: "If the substring
> from T[i] to
> T[n] begins with the string "xml"...", then it should say so. If not, then
> it should be outside of the loop that starts with item 4.

This should be outside the loop, since it applies only if i == 1 (as it
were), that is, if the string T starts with "xml".

>
> It is certainly ok to have this step, and reject such a string, but in
> practice it is not really necessary. Gigo.

That's true. I added this instruction during an earlier revision and didn't
remove it after clarifying the use of Unicode scalars elsewhere.

>
> A. Leading zeros are unnecessary, and make the string longer. They are not
> necessary since there is a terminator. I'd recommend omitting
> them. Thus one
> would have _x78_ instead of _x0078_, and _x10FFFE_ instead of _x0010FFFE_.
> B. If a rare character (or character sequence) is chosen instead of "_x",
> then it would be much less likely that the quoting mechanism in (d) would
> need to be invoked. If a single rare character is chosen instead of a
> sequence, it avoids look-ahead.
>
> Mark
> __________________________________
> http://www.macchiato.com
> ►  “Eppur si muove” ◄
>
> ----- Original Message -----
> From: "Addison Phillips [wM]" <aphillips@webmethods.com>
> To: <xmlp-comments@w3.org>
> Cc: <public-i18n-ws@w3.org>; <w3c-i18n-ig@w3.org>
> Sent: Wednesday, December 11, 2002 16:08
> Subject: XMLP SOAP1.2 Issue 270: Further clarification...
>
>
> >
> > Dear XMLP Editors,
> >
> > Some time ago I worked with webMethods' XMLP rep, Asir, to produce a
> rewrite of SOAP 1.2 Part 2 Appendix B [1]. This is issue #270 [2] on your
> issues list. More recently I became the chair of the I18N-WG's
> Web Services
> task force, which has been delegated the review of Web Services
> recommendations, etc., by the I18N-WG Core group. At our recent
> face-to-face
> we reviewed the various i18n issues in SOAP 1.2 to ensure that we
> understood
> the actions taken by XMLP/SOAP WG and that we had no further comments.
> >
> > One of the issues our task force reviewed was Appendix B. As I
> remarked in
> our last exchange (see the bottom of this email for a refresher), the text
> was pretty hard to read and very difficult to understand. The I18N-WG
> resolved therefore to prepare an improved version, which is
> included in this
> message between the ---- lines.
> >
> > This version is functionally identical to the previous version. However,
> it should be easier to understand and therefore to correctly implement.
> Would you please incorporate this version into SOAP 1.2 Part 2 in place of
> the current version?
> >
> > Best Regards,
> >
> > Addison (on behalf of W3C-I18N-WG)
> >
> > Addison P. Phillips
> > Director, Globalization Architecture
> > webMethods, Inc.
> >
> > +1 408.962.5487 (phone)  +1 408.210.3569 (mobile)
> > -------------------------------------------------
> > Internationalization is an architecture.
> > It is not a feature.
> >
> > Chair, W3C-I18N-WG Web Services Task Force
> > To participate see http://www.w3.org/International/ws
> >
> > [1] http://www.w3.org/2000/xp/Group/2/06/LC/soap12-part2.html#namemap
> > [2] http://www.w3.org/2000/xp/Group/xmlp-lc-issues.html#x270
> >
> > ----------------
> >
> > B. Mapping Application Defined Names to XML Names
> >
> > This appendix details an algorithm for taking an application
> defined name,
> such as the name of a variable or field in a programming language, and
> mapping it to the Unicode characters that are legal in the names of XML
> elements and attributes as defined in [Namespaces in XML].
> >
> > Note: Application defined names are generally subject to the specific
> restrictions of their underlying development environment. Ideally these
> names should restricted to the subset of Unicode characters in accordance
> with the guidelines in Unicode Standard Annex #15, Annex 7 ("Programming
> Language Identifiers")[1]. Names that follow these guidelines
> will generally
> also follow the guidelines in [Namespaces in XML].
> >
> > Hex Digits
> >
> > [5]    hexDigit    ::=    [0-9A-F]
> > * Note, only uppercase letters A-F are defined here.
> >
> > B.1 Rules for mapping application defined names to XML Names
> >
> > 1. An XML Name has two parts: Prefix and LocalPart. Let Prefix be
> determined per the rules and constraints specified in Namespaces
> in XML [3].
> The LocalPart will be determined by transforming the application
> name of the
> object as follows.
> >
> > 2. Let "T" be the application name. "T" must be represented by
> a sequence
> of Unicode characters in Unicode Normalization Form C (NFC) [2].
> Let "M" be
> the output of this algorithm, also as a sequence of Unicode characters.
> >
> > Note: If the name in the application is represented in some non-Unicode
> character encoding, it must be converted to Unicode before
> starting the name
> mapping process. Ideally any such conversion will use a reversable
> conversion (so that the original byte sequence can be obtained from the
> Unicode sequence), although this is not possible for some encodings.
> >
> > Note: Characters in the application name's original encoding that do not
> have a mapping to Unicode should be handled in some reasonable,
> application
> defined manner.
> >
> > Note: "A sequence of Unicode characters" means a sequence of code points
> (sometimes called Unicode Scalar Values). This should not be
> taken to imply
> that the characters are using any particular encoding of Unicode. The
> conversion algorithm itself may use whatever Unicode encoding is most
> convenient on that platform. However, it is important to note
> that surrogate
> pair sequences (pairs of UTF-16 code points that represent characters in
> Unicode above U+FFFF) must be handled as a single Unicode code
> point (their
> Scalar Value, that is the specific supplemental character they represent),
> rather than as the individual bytes or surrogate characters in UTF-16.
> Unpaired surrogates are not permitted. For example:
> >
> > The UTF-16 sequence 0xD800 0xDC00 represents the Unicode character
> U+10000. When performing the following steps, the value U+10000 is
> considered to be a single Unicode character in the sequence T.
> >
> > 3. Let "i" be an integer representing the current position in "T", T(i),
> with a starting value of 1, such that T(1) is the first character
> in T, T(2)
> is the second, and so forth. Let "n" represent the last position
> in T, T(n).
> >
> > 4. Starting with 1, iterate across T by increasing i by 1 and
> perform the
> following evaluation on each character.
> >
> >    a. If T begins with the string "xml" (or any upper/lower case
> variation, such as "xmL", "XML", or "xMl"), encode T(1) using rule 4.c.i
> (that is, output either "_x0078_" for "x" or "_x0058_" for "X")
> and increase
> "i" to 4.
> >
> >    Note: if the sequence starts with "xml"/"XML"/"xMl"/etc., and is
> followed by a Unicode combining character, the combining character is not
> considered part of the letter "l" for processing purposes. By contrast,
> precomposed characters such as U+013B (Latin letter capital L
> with cedilla)
> do not trigger this rule. That is "xmĻ" is encoded as xmĻ, whereas
> xml(U+0300) is encoded as _x0078_ml_x0300_. [Note that
> Normalization Form C
> means that this special case will always produce the same resulting
> sequence.]
> >
> >    b. If T(i) falls in the range 0xD800 through 0xDFFF (that
> is, there is
> an unpaired surrogate character in the sequence), stop with an error.
> >
> >    c. Else if T(i) is not a valid XML NCName character (see
> [3]) or if i=1
> and T(i) is not a valid first character of an XML NCName (see [3]) then:
> >
> >       i. If T(i) < U+10000 and T(i) is not in the range 0xD800 through
> 0xDFFF, output to "M" the sequence "_x" followed by four hexDigits
> representing the Unicode Scalar Value followed by an underscore ("_") (for
> example, "x" (U+0078) would be encoded as "_x0078_").
> >
> >       ii. Else if T(i) > U+FFFF, output to "M" the sequence
> "_x" followed
> by eight hexDigits representing the Unicode Scalar Value. For example, the
> character 0x10FFFE would be encoded as "_x0010FFFE_".
> >
> >    d. Else if T(i) = "_" (lowline) and T(i+1) = "x" or "X", output
> "_x005F_" to M.
> >
> >    e. Else output T(i) to M.
> >
> > [1]
> http://www.unicode.org/unicode/reports/tr15/#Programming_Language_
> Identifier
> s
> > [2] http://www.unicode.org/unicode/reports/tr15/
> > [3] http://www.w3.org/TR/REC-xml-names/
> >
> > Examples:
> >
> >
> >
> > Hello world -> Hello_x0020_world  // space not permitted
> > Hello_xorld -> Hello_x005F_xorld  // _x rule
> > Helloworld_ -> Helloworld_
> >
> >           x -> x
> >         xml -> _x0078_ml   // starts with xml
> >        -xml -> _x002D_xml  // starts with hyphen-minus
> >        x-ml -> x-ml        // not the string xml
> >
> >      Ælfred -> Ælfred
> >    άγνωστος -> άγνωστος
> > ᜉᜅᜎᜈ        -> _x1709__x1705__x170E__x1708_   // the Tagalog block is
> newer and not permitted in XML 1.0
> > ᏙᏚᎥ         -> _x13D9__x13DA__x13A5_  // The Cherokee block is newer and
> not permitted in XML 1.0
> >
> > xml̀moo -> _x0078_ml_x0300_moo  // Starts with "xml". Note that
> combining
> character U+0300 is considered as separate from the "l" in "xml".
> >
> > Note to editor> The last example is the only one that I have changed
> (added).
> >
> >
> >
> >
> > ----------------
> >
> >
> > > -----Original Message-----
> > > From: Martin Gudgin [mailto:mgudgin@microsoft.com]
> > > Sent: Sunday, August 25, 2002 6:03 PM
> > > To: Addison Phillips [wM]; asirv@webmethods.com
> > > Cc: W3C Public Archive; Jean-Jacques Moreau; Marc Hadley; Nilo Mitra;
> > > Noah Mendelson; Henrik Frystyk Nielsen
> > > Subject: RE: Algorithm for mapping an application defined
> name to an XML
> > > name
> > >
> > >
> > > Addison,
> > >
> > > Thanks very much for your detailed comments. I've commented inline
> > >
> > > Martin
> > >
> > > > -----Original Message-----
> > > > From: Addison Phillips [wM] [mailto:aphillips@webmethods.com]
> > > > Sent: 24 August 2002 00:05
> > > > To: Martin Gudgin; asirv@webmethods.com
> > > > Cc: W3C Public Archive; Jean-Jacques Moreau; Marc Hadley;
> > > > Nilo Mitra; Noah Mendelson; Henrik Frystyk Nielsen
> > > > Subject: RE: Algorithm for mapping an application defined
> > > > name to an XML name
> > > >
> > > >
> > > > Hi Martin,
> > > >
> > > > Thanks for the note. It's been awhile since I thought about this.
> > >
> > > Sorry it's taken us so long to incorporate your feedback.
> > >
> > > >
> > > > My edits were done from the original proposal. Although I
> > > > modified the text to be more correct about various Unicode
> > > > issues, I didn't change the structure of the original at all.
> > > > (FWIW, I would have designed and written it differently. And
> > > > I hate standards that obfuscate what's going on as much as
> > > > this one does. It not being my document, I didn't rewrite it.
> > > > I just edited the text to be more correct.)
> > >
> > > If you have ( or have time to produce ) a more readable version, I'm
> > > sure the editorial team would be very grateful.
> > >
> > /* much more deleted... */
> >
> >
>
>
Received on Wednesday, 11 December 2002 21:10:33 UTC