RE: [Comment on WS-I18N WD] from Phillips, Addison on 2008-06-16 (www-international@w3.org from April to June 2008)

From: Phillips, Addison <addison@amazon.com>
Date: Mon, 16 Jun 2008 16:39:19 -0700
To: Dan Chiba <dan.chiba@oracle.com>
CC: "www-international@w3.org" <www-international@w3.org>
Message-ID: <4D25F22093241741BC1D0EEBC2DBB1DA013B11CA97@EX-SEA5-D.ant.amazon.com>
> > On the other hand, does it make sense to advertise that a Web
> service supports a locale that it has no messages for? If the
> service normally has no user interface ("formatDate", "addInts",
> "sortStrings"), then the list of available locales might very well
> match the complete set available in the API. At the other end of
> the spectrum are AJAX interactions that build the UI in real time.
> Then only the messages that you actually have available are useful
> to advertise.
> >
> I think it makes sense to advertise the set of supported locales.

That would tend to be the point of this work: we provide a way to say any of the following:

- this service is locale-neutral; you may specify a locale, but it doesn't do anything to the service
- this service has a specific default locale that it uses ("it is always in German"); the user can specify whatever they want, but the service always uses this one
- this service has some specific (and specified) list of available locales (and by inference some default); the user may specify the locale to use and the service will do its best to match it from the specified list
- this service is locale sensitive; the user may specify the locale to use and the service will do its best to match it, noting that a list is not provided

> It may
> be the list of available translation languages, formatting locales,
> those locales for which linguistic sorting behavior is supported,
> or something alike.

Yes, and we need to support the service implementer making the decision about which pattern to advertise and/or use. You and I might choose entirely different criteria for choosing how we advertise locale support for a given service.

> Because a service cannot determine the appropriate
> locale for the locale sensitive service operation, it needs to be
> made
> possible for the service consumer to discover what locale is
> supported,
> in order for the application to produce the desired UI behavior.

I agree, with a nit:

- sometimes it doesn't make sense to list everything that is available. Sometimes it is better (consumes less bandwidth, processing, etc.) to say: "I'll do my best to match your request". This can even make sense when the list is quite short.


> > My concern here is that many services fall into a sort of middle
> category: they can service many locales, but only have a limited
> set of localizations. Messages from the services are necessarily
> constrained to the smaller set, while the service might actually be
> useful for a larger set.
> >
> Yes that is why we think the translation locale and the locale for
> other purposes should be identified separately.

I understand that's your intent. However, I think this will confuse the vast preponderance of developers, who have only a very rough idea of what a "locale" or a "language" is. There are also different ways that services can be provisioned. It may not be possible to enumerate one list or the other easily. Having two things that do roughly the same thing doesn't seem that useful to me. How often do you actually set LC_MESSAGES separately from LC_ALL?

> > My tendency is still to think that this is "locale" and not
> "language". It looks like a bug to get a message like: "There were
> « 1 234 » entries sorted on 14 juin." Where the locale was clearly
> one thing and the messages in another language.
> >
> Having both #1 locale and #3 language does not mean that would
> produce
> the odd message. If using the same locale for the message
> formatting is
> a requirement, the component can use #3 language alone to make the
> message locale consistent.


But this is inconsistent with the design of WS-I18N, where "locale" is the "big knob". I tend to think that relatively few people would know how to write an application like this.

A better solution might be: if we provide a list of available locales, we can provide an additional attribute to indicate which ones have been provisioned with messages. For example:

<i18n:locale>
  <i18n:option default="true" localized="true">en-GB</i18n:option>
  <i18n:option localized="true">de</i18n:option>
  <i18n:option>fr</i18n:option>
</i18n:locale>

Here the default locale is "en-GB". German ("de") is also available, with localizations, as is French ("fr"), sans localization.
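
As a rough illustration of how a consumer might read such an advertisement, here is a minimal Python sketch. The namespace URI "urn:example:ws-i18n" is a placeholder (the message does not give one), and the parse_options helper and its dictionary layout are assumptions made only for this example. The "localized" attribute is the proposal above, not part of the current draft.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace: the real WS-I18N namespace URI is not given here.
I18N_NS = "urn:example:ws-i18n"

ADVERTISEMENT = f"""
<i18n:locale xmlns:i18n="{I18N_NS}">
  <i18n:option default="true" localized="true">en-GB</i18n:option>
  <i18n:option localized="true">de</i18n:option>
  <i18n:option>fr</i18n:option>
</i18n:locale>
""".strip()


def parse_options(xml_text):
    """Return one dict per <i18n:option>: its tag plus the two boolean flags."""
    root = ET.fromstring(xml_text)
    options = []
    for element in root.findall(f"{{{I18N_NS}}}option"):
        options.append({
            "tag": (element.text or "").strip(),
            # A missing attribute is treated as "false".
            "default": element.get("default") == "true",
            "localized": element.get("localized") == "true",
        })
    return options


if __name__ == "__main__":
    for option in parse_options(ADVERTISEMENT):
        print(option)
```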

A request could come in as something like:

<i18n:locale>en-US,de-CH-1994,fr</i18n:locale> <!-- in this case, it matches "de" -->

Or perhaps:

<i18n:locale>en,zh-yue,ja-JP</i18n:locale> <!-- in this case you get en-GB as the default -->

And finally:

<i18n:locale>fr-FR</i18n:locale> <!-- you get French locale behavior, but probably en-GB messages; "fr" is available but has no localized messages -->
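
The three requests above can be checked with a small Python sketch of the RFC 4647 Lookup algorithm (progressively truncating each range in the priority list until something matches, else falling back to the default). The OPTIONS table mirrors the example advertisement. The negotiate helper, and the idea of running Lookup a second time against only the localized entries to pick the message language, are assumptions for illustration: one plausible reading of the "localized" attribute, not something WS-I18N defines.

```python
def lookup(priority_list, available, default=None):
    """RFC 4647 Lookup: first available tag matched by truncating each range."""
    by_lower = {tag.lower(): tag for tag in available}
    for language_range in priority_list:
        if language_range == "*":
            continue  # Lookup skips the wildcard range
        subtags = language_range.lower().split("-")
        while subtags:
            candidate = "-".join(subtags)
            if candidate in by_lower:
                return by_lower[candidate]
            subtags.pop()  # drop the last subtag and try again
            # A trailing single-character subtag is removed as well.
            if subtags and len(subtags[-1]) == 1:
                subtags.pop()
    return default


# (tag, localized) pairs from the example advertisement; "en-GB" is the default.
OPTIONS = [("en-GB", True), ("de", True), ("fr", False)]
DEFAULT = "en-GB"


def negotiate(request):
    """Return (locale, message_language) for a comma-separated request."""
    ranges = [item.strip() for item in request.split(",") if item.strip()]
    all_tags = [tag for tag, _ in OPTIONS]
    localized_tags = [tag for tag, localized in OPTIONS if localized]
    locale = lookup(ranges, all_tags, DEFAULT)
    messages = lookup(ranges, localized_tags, DEFAULT)
    return locale, messages


if __name__ == "__main__":
    for request in ("en-US,de-CH-1994,fr", "en,zh-yue,ja-JP", "fr-FR"):
        print(request, "->", negotiate(request))
```

Note that under Lookup a request for "en" or "en-US" does not match "en-GB": truncation only ever shortens the requested range, so those requests reach "en-GB" only through the default.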


> > What is missing in the current version is that we don't provide:
> >
> > - a way to enumerate the available items
> > - a way to specify the complete set of preferences
> > - a reference to RFC 4647 Lookup (that is, locale-based resource
> negotiation)
> >
> I agree. Again my understanding is that these are to be provided as
> a separate document or a future revision.

Note that WS-I18N in its current incarnation is exactly the second draft. W3C's first version (2005-09-14) was taken from a trial balloon I wrote. At that time there was no Lookup algorithm, no LTLI (okay, there still isn't an LTLI, but that's something to fix), not much in the way of LDML, and RFC 4646 was still an Internet-Draft (with several to follow). With these items available to us, we should do the work to get WS-I18N right (it's actually a fairly minor set of revisions required, IMO).

> > I don't say that Unicode is forced upon people (although using
> SOAP is mighty close to forcing UTF-8). What I'm saying is that, as
> a parameter, it usually doesn't make a lot of sense. The data often
> has to be transcoded for the benefit of (for example) the XML
> processor anyway. The fact that data exists as some legacy encoding
> affects the results or operation of the service itself (you still
> can't store Japanese character data in a WE8ISO8859P1 database even
> if the Web service layer permits you to send it some). But it's not
> necessarily something that one can usefully specify at the service
> layer.
> >
> > Anyway, I don't want to sound completely absolutist here. I know
> what kinds of cases you're thinking of and think they have merit.
> >
> I do agree character set is generally not so useful as other
> elements
> and not encouraged to use. I just think a character set is
> considered as
> one of the elements of a locale and some people may find it useful
> if WS-I18N defines how to indicate it.

Character set is considered one of the elements of *some* locale systems. The question is: what does this parameter do or mean? If I have a <i18n:charset>ISO8859_1</i18n:charset> in my service's WS-Policy, does that mean I should transcode my SOAP request to Latin-1? Am I limited to Latin-1 characters in my request? Will I only receive Latin-1 characters in the response? The charset limitation may occur on several different levels of the system or it may simply be an assertion about the data.

Since most developers wouldn't know what an encoding was if it grew legs and bit them, that makes me wary. If nothing else, we need to put a big Health Warning sticker on it :-).

>
> If a requester is only interested in getting responses in a
> specific
> native character set (e.g. the response will be processed in a
> component
> that can only process a native encoding, or it will be stored in a
> database that can only store a specific native character set), the
> service could filter the response based on this information.
>

Degrading the data early is usually a bad option :-). Converting the data from the UTF-8 used in the transport layer to the local encoding is usually an effective enough filter---and YOUR code did it, not my beautiful, pristine service <g>. This, for example, is true when you find out that "ISO 8859-1" sometimes means "windows-1252"... but sometimes it doesn't.
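
To make the point concrete, here is a small Python sketch of the consumer doing the lossy conversion itself. The to_local_encoding helper and the choice of "?" replacement are assumptions for the example. It also shows why the label matters: the same text and the same bytes behave differently under ISO 8859-1 and windows-1252.

```python
def to_local_encoding(text, encoding):
    """Encode text for a legacy consumer; unencodable characters become '?'."""
    return text.encode(encoding, errors="replace")


if __name__ == "__main__":
    sample = "10 \u20ac \u65e5\u672c"  # "10 €" followed by two Japanese characters

    # The euro sign exists in windows-1252 (byte 0x80) but not in ISO 8859-1.
    print(to_local_encoding(sample, "iso-8859-1"))
    print(to_local_encoding(sample, "windows-1252"))

    # The same byte decodes to different characters depending on the label:
    # a C1 control character in ISO 8859-1, the euro sign in windows-1252.
    print(repr(b"\x80".decode("iso-8859-1")))
    print(repr(b"\x80".decode("windows-1252")))
```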

Anyway, I digress. We can probably find a way to accommodate 'charset'. All I'm saying is: how we do it is important.

Addison
Received on Monday, 16 June 2008 23:39:58 UTC