RE: [WSTF] My view... [long] from Paul Deuter on 2003-01-31 (public-i18n-ws@w3.org from January 2003)

From: Paul Deuter <PaulD@plumtree.com>
Date: Fri, 31 Jan 2003 10:18:05 -0800
To: "Addison Phillips [wM]" <aphillips@webmethods.com>
Cc: "Mark Davis" <mark.davis@jtcsv.com>, "Martin Duerst" <duerst@w3.org>, <public-i18n-ws@w3.org>, <debasish@us.ibm.com>
Message-ID: <C7F00D7948B8E4468BB330152C6BA4E003191B34@cstaex03.usiplumtree.ad>
Well, you are right that some systems have charset as a kind of
extension to a locale specifier.  I am thinking of Oracle:

NLS_LANG=Japanese_Japan.JA16SJIS

It could be handy for a webservice that was interacting with
an Oracle database to be able to have charset as part of the
locale tag.
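
For illustration, a value of the shape shown above can be split into its
three parts. This is only a minimal sketch: the name parse_nls_lang is made
up for this example, and it handles just the full
language_territory.charset form, not every variation Oracle may accept.

```python
from typing import NamedTuple


class NlsLang(NamedTuple):
    language: str
    territory: str
    charset: str


def parse_nls_lang(value: str) -> NlsLang:
    """Split a full 'language_territory.charset' value into its parts.

    Hypothetical helper: only the complete three-part form is handled.
    """
    locale_part, dot, charset = value.partition(".")
    if not dot or not charset:
        raise ValueError(f"no charset component in {value!r}")
    language, underscore, territory = locale_part.partition("_")
    if not underscore or not language or not territory:
        raise ValueError(f"expected language_territory in {value!r}")
    return NlsLang(language, territory, charset)


parsed = parse_nls_lang("Japanese_Japan.JA16SJIS")
print(parsed.language, parsed.territory, parsed.charset)
```

The point of the sketch is that the charset travels as a third field next
to language and territory, which is what makes it look like part of the
locale on such systems.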

Still, I have always viewed charset as orthogonal to locale. I
suppose the reason is that we often need charset to deal
with legacy locale-insensitive data.

-Paul


-----Original Message-----
From: Addison Phillips [wM] [mailto:aphillips@webmethods.com]
Sent: Friday, January 31, 2003 9:46 AM
To: Paul Deuter
Cc: Mark Davis; Martin Duerst; public-i18n-ws@w3.org; debasish@us.ibm.com
Subject: Re: [WSTF] My view... [long]


Hi Paul,

Thanks for de-lurking. Blech! Here I am defending something odious. ;-)

Nonetheless...

I agree that Web services themselves will or should be entirely Unicode 
based, since, at the very least, they do all of their data exchange via 
XML files and the majority of implementations deal only with UTF-8 as an 
encoding.

However... my distinction here is pretty fine. It's between the concrete 
locale implementation (the actual locale thingy or thingies) that you 
and Mark appear to be referring to and communicating preferences between 
systems (the "locale tag" or the "i18n context", to use Deb's term)... 
the question is: given a client and a server running on two different 
operating environments, how does the client communicate its "i18n 
context" to the server.

The easy case is to use RFC3066 style language plus region. But this is 
not the edge case. The edge cases are generally the ones in which 
variants come into play. Traditional locales have flaws in this regard. 
This discussion of charset comes out of my enumeration of the different 
"variations" that various locale models use. Imagine that we send all of 
the settings I enumerated in a "locale tag":

lang = zh
region = CN
script = simplified
collation = binary
currency = CNY
orthography = standard?
charset = GBK

Okay, so my Java program instantiates the locale "zh-CN"... was anyone 
hurt by this? The charset isn't actually a locale member in Java and has 
no meaning in Java. It might be used to affect a byte-oriented 
interaction (probably in a very negative way).
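
A rough sketch of that situation, assuming a hypothetical I18nContext
container (the class and its method names are invented here, not taken
from any proposal): a consumer that only understands language plus region
builds its tag and simply ignores the rest.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class I18nContext:
    """Hypothetical container for the fields of the example tag."""
    lang: str
    region: Optional[str] = None
    script: Optional[str] = None
    collation: Optional[str] = None
    currency: Optional[str] = None
    orthography: Optional[str] = None
    charset: Optional[str] = None

    def language_region_tag(self) -> str:
        """Return only the language-region part, e.g. 'zh-CN'."""
        return f"{self.lang}-{self.region}" if self.region else self.lang

    def ignored_fields(self) -> Dict[str, str]:
        """Fields a language+region-only consumer never looks at."""
        extras = {
            "script": self.script,
            "collation": self.collation,
            "currency": self.currency,
            "orthography": self.orthography,
            "charset": self.charset,
        }
        return {name: value for name, value in extras.items() if value is not None}


ctx = I18nContext(lang="zh", region="CN", script="simplified",
                  collation="binary", currency="CNY", charset="GBK")
print(ctx.language_region_tag())
print(sorted(ctx.ignored_fields()))
```

Nothing breaks for the consumer that takes only the language-region tag;
the open question is whether the ignored fields mattered to the sender.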

What about a POSIX system? It's running "zh_CN.GBK@pinyin". Does it hurt 
to omit some of that information when invoking remote Web services? What 
happens if that invoked Web service is (a wrapper around) another POSIX 
program? Does the loss of information affect the outcome?

I'm not sure, but it might if the collation is "binary" and the charset 
is changed from GBK to UTF-8.
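
To make the concern concrete, here is a small sketch of "binary"
collation, modeled as sorting strings by their encoded bytes. The four
sample characters are chosen for illustration only; the sketch assumes
that binary collation means plain byte-wise comparison.

```python
# Sample strings chosen only for illustration.
words = ["中", "阿", "一", "啊"]


def binary_sort(items, encoding):
    """Sort strings by the raw bytes of their encoded form."""
    return sorted(items, key=lambda s: s.encode(encoding))


by_gbk = binary_sort(words, "gbk")
by_utf8 = binary_sort(words, "utf-8")

print("GBK byte order:  ", by_gbk)
print("UTF-8 byte order:", by_utf8)
print("same order?", by_gbk == by_utf8)
```

The same data sorted "binary" comes out in a different order once the
charset changes, so dropping the charset can change the visible result.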

I'm not arguing that charset has a place in locales. Only that 
pretending charset hasn't been associated with locales will make life 
slightly more difficult for certain platforms. That may not be a bad 
price to pay. I don't, for example, think that charset should be 
exchanged by default. I see it as a potential optional element in an 
i18n-context exchange, because I don't really understand what the 
ramifications of not having it are yet.
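
One way to picture charset as an optional element, sketched under
assumptions: the function posix_locale_name below is hypothetical, and it
assumes the receiving side names its locales in the usual
language_TERRITORY.codeset@modifier style.

```python
from typing import Optional


def posix_locale_name(lang: str,
                      region: Optional[str] = None,
                      charset: Optional[str] = None,
                      modifier: Optional[str] = None) -> str:
    """Build a POSIX-style locale name from whatever fields were sent.

    Hypothetical helper: charset and modifier are appended only when the
    sender chose to include them.
    """
    name = lang
    if region:
        name += f"_{region}"
    if charset:
        name += f".{charset}"
    if modifier:
        name += f"@{modifier}"
    return name


# Default exchange: language and region only.
print(posix_locale_name("zh", "CN"))
# Optional elements supplied by a sender that needs them.
print(posix_locale_name("zh", "CN", charset="GBK", modifier="pinyin"))
```

With only language and region, the receiver still has to pick one of the
several locales that share that prefix; the optional fields remove the
guesswork when the sender knows them.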

Thanks,

Addison

Paul Deuter wrote:
> I agree - leave charset out of locale.
> I have encountered difficult-to-diagnose bugs whose cause ended
> up being locale-sensitive code switching the encoding.  For example,
> setting a locale to Japanese and then getting the encoding switched
> to Shift-JIS.
> Besides, we all know that there isn't a one-to-one mapping of locales
> to charsets.
> And finally, yes we are making progress toward Unicode everywhere.
> A design of future I18N services should assume that some encoding of
> Unicode will be used in most circumstances.
> 
> My 2 cents.
> -Paul
> 
> -----Original Message-----
> From: Addison Phillips [wM] [mailto:aphillips@webmethods.com]
> Sent: Thursday, January 30, 2003 9:25 PM
> To: Mark Davis
> Cc: Martin Duerst; public-i18n-ws@w3.org; debasish@us.ibm.com
> Subject: Re: [WSTF] My view... [long]
> 
> 
> 
> I don't think charset has a place in locale.
> 
> What we're talking about here aren't locales, though, but locale tags or 
> preferences. The problem we're describing is how to equate locales between 
> operating environments. The problems center, generally, on the areas 
> away from language/region, that is, on variants and the other things 
> that I've tried to categorize and highlight. Instantiating the right 
> locale can mean the difference for legacy systems. Traditional UNIX 
> systems are (sadly) tied to their encodings.
> 
> Any tagging system will have to make hard choices about what to define 
> explicitly and what (if anything) to make platform defined. My email was 
> an attempt to enumerate these possibilities as I've experienced them and 
> try to sound out what the objections might be for including/dumping them.
> 
> I would rather not have encodings mixed up with locales.
> 
> I raised the UNIX case and am trying to think hard about it to ensure 
> that we don't miss anything.
> 
> In the last year one of the things I used locale tags for was 
> negotiating the collation sequence between Java and UNIX C-based 
> programs. Although ultimately I was able to work around the encoding 
> problem using a solution similar to that which you and Deb have 
> suggested, the experience did leave me thinking about how objections 
> might be raised and dealt with---and what happens when you cannot simply 
> replace your legacy implementation (the way I was able to)?
> 
> I don't claim to like it: I am rooting for an all-Unicode world, 
> personally. But maybe we shouldn't outlaw charset in a tagging 
> structure. We should at least consider the plausible reasons for it.
> 
> Thanks,
> 
> Addison
> 
> Mark Davis wrote:
> 
>>(I'm catching up on some mail here).
>>
>>I'd like to find out more about why you think the charset needs to be
>>included in the locale specification. I believe that it is orthogonal to the
>>whole notion. It may be necessary in the short term to have some kind of
>>mechanism on a POSIX-style system to look up the "best" locale + charset
>>combination given a locale, but that does not seem worth burdening the
>>general concept of a locale with it.
>>
>>Mark
>>________
>>mark.davis@jtcsv.com
>>IBM, MS 50-2/B11, 5600 Cottle Rd, SJ CA 95193
>>(408) 256-3148
>>fax: (408) 256-0799
>>
>>----- Original Message -----
>>From: "Addison Phillips [wM]" <aphillips@webmethods.com>
>>To: "Martin Duerst" <duerst@w3.org>; <public-i18n-ws@w3.org>
>>Cc: <debasish@us.ibm.com>
>>Sent: Wednesday, January 15, 2003 17:58
>>Subject: RE: [WSTF] My view... [long]
>>
>>
>>
>>
>>>Hi Martin,
>>>
>>>Thanks for the comments. A few notes below, which probably need more
>>>thought.
>>>
>>>Addison
>>>
>>>
>>>
>>>>-----Original Message-----
>>>>From: public-i18n-ws-request@w3.org
>>>>[mailto:public-i18n-ws-request@w3.org]On Behalf Of Martin Duerst
>>>>Sent: Wednesday, January 15, 2003 2:33 PM
>>>>To: Addison Phillips [wM]; public-i18n-ws@w3.org
>>>>Cc: debasish@us.ibm.com
>>>>Subject: Re: [WSTF] My view... [long]
>>>>
>>>>
>>>>
>>>>>I consider "locale" to be an environment setting or settings that govern
>>>>>how data is (to be) processed. Language preference may either be an aspect
>>>>>of "locale" or may be a separate environment setting.
>>>>
>>>>And 'locale' also has several different aspects. That lets me
>>>>generally favor a hierarchy of
>>>>
>>>>  i18n context
>>>>     - language
>>>>     - date format
>>>>     - number format
>>>>     - sort order
>>>>     - ...
>>>>
>>>>rather than
>>>>
>>>>  language
>>>>  locale
>>>>    - date format
>>>>    - number format
>>>>    - sort order
>>>>    - ...
>>>>
>>>
>>>Yes, exactly so. But I have some concerns with "dissolving" locales into
>>>"i18n context" soup, as it were. The foremost is that most operating
>>>environments (the language the actual service is written in, as opposed to
>>>the container) want a native locale object to enable those various settings.
>>>So those items are "subsettings" of a locale object, and not necessarily a
>>>full-fledged item themselves.
>>>
>>>I would put language as a sub-item of locale, but for the fact that there
>>>are existing systems that deal with language. The question is whether to
>>>coexist, thus:
>>>
>>> i18n context
>>>   - language
>>>   - locale
>>>       - language
>>>       - region etc.
>>>
>>>Or even:
>>>
>>> language
>>> i18n-context
>>>    - locale
>>>       - numberfmt, etc.
>>>
>>>
>>>
>>>>>In traditional software, the developer doesn't need to explicitly obtain a
>>>>>Locale object or setting in order to write internationalized code. By
>>>>>default, locale-aware functions obtain and use the locale from
>>>>>the system environment. Yes, the developer must call the correct functions
>>>>>or methods, but s/he need only obtain a locale when the desire is to
>>>>>override the user's preference (the default).
>>>>
>>>>I guess one important point may be that developers should not have to
>>>>care about language/locale issues when these are marginal (e.g. language
>>>>of error messages), but that there may be an advantage to force them
>>>>to have to think about such issues explicitly if they are central
>>>>to the application (e.g. currency conversions in some cases).
>>>
>>>
>>>Yes, exactly. It also means that well-written code running on multi-locale
>>>capable Web service containers gets multi-locale operation for free. This
>>>may include a great deal of code that is merely well-written, not explicitly
>>>internationalized. This is, to my mind, the best indicator "we've done
>>>something right".
>>>
>>>
>>>>>---
>>>>>Aside: These last two scenarios Deb described in an off-list email last
>>>>>week. These are "runAsClient", "runAsSpecified" and "runAsServer"
>>>>>scenarios, Deb.
>>>>>---
>>>>
>>>>Deb or Addison, could you send these scenarios to the list or to Kentaro
>>>>for inclusion in the document?
>>>
>>>The email thread is a bit chaotic. We'd probably have to forward the whole
>>>series.
>>>
>>>
>>>>>I'm on record as favoring tags over an XML structure, but I'm willing to
>>>>>be persuaded (and try persuasion ;-)).
>>>>
>>>>I think the question of whether to use a tag or a (simple) structure
>>>>is a syntax question, and thus could work both ways. A more important
>>>>question in my view is whether we want the solution to effectively
>>>>have a limited number of values (e.g. language x country (x variant)),
>>>>or whether we want to be open-ended, and how the tag/simple structure
>>>>gets associated with the actual data for the language/locale.
>>>>
>>>
>>>"Tag vs. structure" might then better be described as "tag" vs. "data". If
>>>the specific setting transfer is not our goal, then whether we use a URN or
>>>an XML structure is merely syntax, as you say.
>>>
>>>Having played with ULocale tags for the better part of a year, I have a
>>>better feel for where limited vs. open work than I did originally. ULocale
>>>tags are "open-ended" insofar as they allow an arbitrary number of
>>>additional, optional, or vendor-defined fields. The problem is getting
>>>agreement on what those fields and their values mean. So, in practice, they
>>>are closed.
>>>
>>>What I've found is that there is a distinct set of "values" that have
>>>meaning and may actually be necessary for total interoperability. These are:
>>>
>>>- language
>>>- region
>>>- charset (UNIX processes need it, can't get away from it)
>>>- collation
>>>- script (writing system, as with Trad and Simplified, or Japanese kana-only
>>>  locale)
>>>- orthography (as Bokmal/Nynorsk)
>>>- currency (euro?)
>>>- other variant (all one-time-events, e.g. EURO)
>>>
>>>The historical problem with variant is that it is a catch-all field that
>>>does too many (different) things. In my own work, I've balled the last three
>>>up into "variant" and used script only on occasion. The remaining four
>>>(lang, region, charset, collation) maintain their importance to me, even in
>>>an all Unicode environment.
>>>
>>>For me it's because I sometimes need to do Java->POSIX. Although, as Deb
>>>points out, you probably wouldn't write an ML Web services container in
>>>XPG4-style C, you might use a Java container to invoke services or connect
>>>to resources (via JCA for example) that are written in such a style. Since
>>>the service invocation may be within the context of a native interface exec
>>>call, calling setlocale() on the service itself isn't quite the nasty
>>>problem that it would otherwise be and a complete POSIX locale is needed to
>>>invoke the service in a locale sensitive way.
>>>
>>>That is, my Solaris box has a locale called "ja_JP" and it is quite
>>>different than "ja_JP.UTF-8" or "ja_JP.eucJP" or even "ja_JP@kana". "ja-JP"
>>>isn't enough information to know which locale to initialize and I can't even
>>>enumerate the locales in some environments (like XPG4) to do it via
>>>inspection. A C program that is smart might try to guess its way through the
>>>locales: this is like the plug-in idea that Deb suggested.
>>>
>>>In any case, in modern programming this is less of an issue. But I still
>>>think that the variations I've listed are important
>>>in diverse cases, too many to just ignore them.
>>>
>>>
>>>>>I have use cases galore from webMethods that relate to the various aspects
>>>>>of the above.
>>>>
>>>>Please send them in so that we can integrate them in our doc.
>>>
>>>As I get to them. The problem is always time....::sigh::
>>>
>>>
>>>>>Some things to think about:
>>>>>
>>>>>--RFC3066 language tags (that is xx-YY) are sufficient for identifying
>>>>>Java locales and possibly C# CultureInfos, but not for POSIX, C/C++, Mac,
>>>>>and other native programming environments.
>>>>
>>>>Can you tell us what's missing in the latter cases? For Posix/C/C++,
>>>>clearly the encoding is missing, but with XML, this falls out of the
>>>>equation. Anything else?
>>>
>>>POSIX/C++: charset/encoding. Have to have them even though XML files are
>>>Unicode.
>>>Mac (non OS/X): script or script code
>>>Win32: collation (you can live without, of course)
>>>Databases: encoding or collation?
>>>Host systems: CCSID or code page?
>>>// ... // too many more to really list
>>>Note that 3066 tags don't cover Java variants (for example). As long as you
>>>can live with "loss-of-precision" like that, it's okay, but I don't think we
>>>can live with it.
>>>
>>>
>>>
>>>>Regards,     Martin.
>>>>
>>>>
>>>
>>>
> 
> 


-- 
Addison P. Phillips
Director, Globalization Architecture
webMethods, Inc.

+1 408.962.5487  mailto:aphillips@webmethods.com
-------------------------------------------
Internationalization is an architecture. It is not a feature.

Chair, W3C I18N WG Web Services Task Force
http://www.w3.org/International/ws
Received on Friday, 31 January 2003 13:19:13 UTC