I18N-related comments to SOAP Version 1.2 (parts 0-2) from Martin Duerst on 2002-07-13 (xmlp-comments@w3.org from July 2002)

From: Martin Duerst <duerst@w3.org>
Date: Sat, 13 Jul 2002 21:10:33 +0900
To: xmlp-comments@w3.org
Cc: w3c-i18n-ig@w3.org
Message-Id: <4.2.0.58.J.20020713171030.00adb260@localhost>
Dear XML Protocol WG,

Please find below some I18N-related comments on your
Last Call versions of SOAP 1.2 Parts 0-2. Please note that
these comments have not yet been approved by the I18N WG,
but that they will most probably be discussed, and modified
and/or approved, at the WG's next teleconference next Tuesday.
I'm sending these comments now to give you as much time
as possible to reflect them in your specs.

Please copy the I18N IG on any discussion regarding these
comments.

I18N IG: Please note that discussions in the XML
Protocol WG are by default public.


General:

When printing on A4 paper, many of the examples get cut off
on the right. Examples should be re-edited so that they are
somewhat narrower and can be printed on paper around the
world without loss.


Part 0:

- The examples should be changed to be more international.
   People travel all around the world, to places that have
   names with characters outside US-ASCII, etc. Web Services
   can easily take care of this, and this should be shown.
   (please ask your chair, who knows how to do this from
    the XML Schema primer :-).

- Example 6: The use of xml:lang="en-US" is very good. A comment
   saying why xml:lang is important would be even better.

- Example 8b: <x:date>12-14-01</x:date>: This is not interoperable!
   Please use either XML Schema dates (<x:date>2001-12-14</x:date>),
   because this is machine-to-machine communication, or something
   like <x:date xml:lang='en-US'>December 14, 2001</x:date> if this
   is intended for human viewers.

- Example 11: charset="utf-8": This would be a good opportunity to briefly
   explain the rules for the charset parameter with application/soap+xml
   (because otherwise, the reader has to follow two references).
   The best recommendation is probably: Don't use a 'charset' parameter
   on 'Content-Type', because then the rules for freestanding
   XML (UTF-8 and UTF-16 (the latter always with BOM) as defaults,
   otherwise <?xml ... encoding='foo'...) apply.


Part 1:

- 4.2: "A binding, if using XML 1.0... MAY mandate that a particular
   character encoding or set of encodings be used.": This is good,
   but should be changed to say that in such a case, UTF-8/UTF-16
   should be chosen (in accordance with XML 1.0 and the Character
   Model).

- 5., last paragraph: This is written as if all white space is
   by default ignored. But it is probably meant to apply only
   to insignificant whitespace (e.g. between elements in element content).

- 5.4.2: <reason>: xml:lang is optional, but there should be a note
   saying that it is strongly recommended.

- 5.4.2: <reason>: xml:lang is said to have a namespace name of
   "http://www.w3.org/XML/1998/namespace". This alone does not
   guarantee that the prefix will be 'xml' in XML 1.0 serialization,
   because the Infoset spec doesn't say so (or at least I didn't
   find anything to that effect). This has to be nailed
   down here to avoid serializations such as
   <reason xmlns:foo='http://www.w3.org/XML/1998/namespace' foo:lang='...

- 5.4.2: <reason> is a human-readable string, but there is no way
   for the request side to indicate which language would be preferred.
   This is a serious problem. Solutions may include the definition
   of a SOAP feature (preferably a module) for this, or requirements/
   recommendations for bindings to make mechanisms they have available
   (e.g. Accept-Language for the HTTP binding).

- 5.4.2: In some cases, it can make sense to send <reason> in
   more than one language. Is this allowed? It may be a good idea.

- <reason> is currently the only place where human readable text is
   used. But despite Web Services being primarily machine-to-machine,
   we expect that quite a few applications will include data that is
   ultimately targeted at humans, or will have to make some part
   of their processing dependent on human language and culture.
   This seems to indicate that some more work will have to be done.

- 6. This section should make it clear that SOAP uses the XML Schema
   type 'anyURI', and that therefore characters outside US-ASCII are
   allowed (i.e. SOAP essentially uses IRIs), but have to be mapped
   to URIs via UTF-8 (refer to XML Schema or XLink for conversion
   details) if e.g. the underlying protocol doesn't
   support IRIs. There should also be a requirement to deal with this
   in the binding framework, and an explanation of how to deal with
   this in the HTTP binding, as well as some tests (I can help
   with the tests). Also, a note in the Primer would help.

- 7.3: feature conflict between SOAP features and features in the
   underlying protocol: This is a general issue, not only for
   security. It should be mentioned in the chapter on bindings.
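
The mapping mentioned for section 6 above (take the UTF-8 bytes of each
character that is not allowed in a URI and percent-escape them) can be
sketched roughly as follows. This is only an illustration in Python using
the standard urllib.parse.quote; the function name iri_to_uri and the
example.org address are made up here, and host names (which would need
IDNA rather than percent-escaping) are not handled.

```python
from urllib.parse import quote

# Characters left untouched: the URI reserved characters, plus "%" so that
# escapes already present in the input are not escaped a second time.
# (ASCII letters, digits and a few unreserved marks are never escaped by quote().)
_KEEP = ":/?#[]@!$&'()*+,;=%"


def iri_to_uri(iri: str) -> str:
    """Hypothetical helper: percent-escape the UTF-8 bytes of every
    character that may not appear literally in a URI."""
    return quote(iri, safe=_KEEP, encoding="utf-8")


# U+00E9 becomes %C3%A9; U+6771 U+4EAC become %E6%9D%B1 %E4%BA%AC.
print(iri_to_uri("http://example.org/r\u00e9sum\u00e9/\u6771\u4eac"))
# http://example.org/r%C3%A9sum%C3%A9/%E6%9D%B1%E4%BA%AC
```

A binding whose underlying protocol only accepts URIs could apply such a
step just before handing the address to the protocol layer.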


Part 2:

- 2: How is XML mixed content represented in this graph model?
   Can it be represented? If not, this is a serious problem for
   internationalization.

- 3.1.2: encoding simple values:
   There should be a note mentioning that most characters in the
   C0 range cannot be represented in XML.

- 5.1.1: Properties are restricted to simple datatypes. This may
   cause serious problems for internationalization.

- 7.5.1.3: response MAY be of content type other than application/soap+xml:
   add a note saying that care is needed because different
   content types may have different rules for the 'charset' parameter.

- Appendix A: This needs a major overhaul (Masahiro Sekiguchi already
   pointed out some problems quite a while ago).

   - Start with some introductory text explaining what's going on.

   - XML Name has two parts -> An XML Name ...

   - Let Prefix be computed: There is really no computation going on at all.

   - In order from left to right -> In order from first to last
     (otherwise, you get problems with bidirectionality)
     [but this will drop out anyway]

   - 2: change to: Let TAG be a name in an application, represented
     as a sequence of characters encoded in a particular character encoding.

   - 3: change to: Let UNI be the sequence of characters of TAG
     transcoded to Unicode with a normalizing transcoder (using NFC),
     and let M<sub>1</sub>, M<sub>2</sub>, ... , M<sub>N</sub> be the
     characters of UNI, in order from first to last.

   - Add a note: The number of characters in TAG is not necessarily
     the same as the number of characters in UNI, because transcoding
     may be one-to-many or many-to-one. The details of transcoding may
     be implementation-defined. There may be (very rarely) cases where
     there is no equivalent Unicode representation for TAG; such cases
     are not covered here.

   - Remove 4.

   - Change all T<sub>foo</sub> to M<sub>foo</sub> in the rest.

   - Remove 5.1, moving up 5.2,...

   - Say explicitly that hex digits always use upper case letters.

   - Add examples with non-ASCII characters, both in the BMP (not
     only Latin-1) and outside the BMP.
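
To illustrate the points above about NFC normalization, differing
character counts, and upper-case hex digits, here is a small sketch in
Python. It is not the Appendix A algorithm itself: the helper names
to_uni and hex_code and the sample strings are assumptions chosen for
illustration, and the hex width (4 digits inside the BMP, 8 outside)
is only a convention picked for this sketch.

```python
import unicodedata


def to_uni(tag: str) -> str:
    """Return TAG normalized to Unicode Normalization Form C (NFC)."""
    return unicodedata.normalize("NFC", tag)


def hex_code(ch: str) -> str:
    """Upper-case hex code point of one character (width is an assumption)."""
    cp = ord(ch)
    return f"{cp:04X}" if cp <= 0xFFFF else f"{cp:08X}"


# "e" followed by COMBINING ACUTE ACCENT: two characters before NFC, one after.
tag = "e\u0301"
uni = to_uni(tag)
print(len(tag), len(uni))            # 2 1
print([hex_code(c) for c in uni])    # ['00E9']

# One character in the BMP beyond Latin-1 (U+6771), one outside the BMP (U+20BB7).
print([hex_code(c) for c in "\u6771\U00020BB7"])  # ['6771', '00020BB7']
```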

Are we supposed to review the test cases, too?

Regards,     Martin.
Received on Saturday, 13 July 2002 08:11:12 UTC