RE: [WSTF] My view... [long] from Paul Deuter on 2003-01-31 (public-i18n-ws@w3.org from January 2003)

From: Paul Deuter <PaulD@plumtree.com>
Date: Fri, 31 Jan 2003 09:21:08 -0800
To: "Addison Phillips [wM]" <aphillips@webmethods.com>, "Mark Davis" <mark.davis@jtcsv.com>
Cc: "Martin Duerst" <duerst@w3.org>, <public-i18n-ws@w3.org>, <debasish@us.ibm.com>
Message-ID: <C7F00D7948B8E4468BB330152C6BA4E003191B31@cstaex03.usiplumtree.ad>
I agree - leave charset out of locale.
I have encountered difficult to diagnose bugs whose cause ended
up being locale sensitive code switching the encoding.  For example,
setting a locale to Japanese and then getting the encoding switched
to Shift-Jis.
Besides we all know that there isn't a one to one mapping of locales
to charsets.
And finally, yes we are making progress toward Unicode everywhere.
A design of future I18N services should assume that some encoding of
Unicode will be used in most circumstances.

My 2 cents.
-Paul

-----Original Message-----
From: Addison Phillips [wM] [mailto:aphillips@webmethods.com]
Sent: Thursday, January 30, 2003 9:25 PM
To: Mark Davis
Cc: Martin Duerst; public-i18n-ws@w3.org; debasish@us.ibm.com
Subject: Re: [WSTF] My view... [long]



I don't think charset has a place in locale.

What we're talking about here aren't locales, though, but locale tags or 
preferences. The problem we describing is how to equate locales between 
operating environments. The problems center, generally, on the areas 
away from language/region, that is, on variants and the other things 
that I've tried to categorize and highlight. Instantiating the right 
locale can mean the difference for legacy systems. Traditional UNIX 
systems are (sadly) tied to their encodings.

Any tagging system will have to make hard choices about what to define 
explicitly and what (if anything) to make platform defined. My email was 
an attempt to enumerate these possibilities as I've experienced them and 
try to sound out what the objections might be for including/dumping them.

I would rather not have encodings mixed up with locales.

I raised the UNIX case and am trying to think hard about it to ensure 
that we don't miss anything.

In the last year one of the things I used locale tags for was 
negotiating the collation sequence between Java and UNIX C-based 
programs. Although ultimately I was able to work around the encoding 
problem using a solution similar to that which you and Deb have 
suggested, the experience did leave me thinking about how objections 
might be raised and dealt with---and what happens when you cannot simply 
replace your legacy implementation (the way I was able to)?

I don't claim to like it: I am rooting for an all-Unicode world, 
personally. But maybe we shouldn't outlaw charset in a tagging 
structure. We should at least consider the plausible reasons for it.

Thanks,

Addison

Mark Davis wrote:
> (I'm catching up on some mail here).
> 
> I'd like to find out more about why you think the charset needs to be
> included in the locale specification. I believe that it is orthogonal to the
> whole notion. It may be necessary in the short term to have some kind of
> mechanism on a POSIX-style system to look up the "best" locale + charset
> combination given a locale, but that does not seem worth burdening the
> general concept of a locale with it.
> 
> Mark
> ________
> mark.davis@jtcsv.com
> IBM, MS 50-2/B11, 5600 Cottle Rd, SJ CA 95193
> (408) 256-3148
> fax: (408) 256-0799
> 
> ----- Original Message -----
> From: "Addison Phillips [wM]" <aphillips@webmethods.com>
> To: "Martin Duerst" <duerst@w3.org>; <public-i18n-ws@w3.org>
> Cc: <debasish@us.ibm.com>
> Sent: Wednesday, January 15, 2003 17:58
> Subject: RE: [WSTF] My view... [long]
> 
> 
> 
>>Hi Martin,
>>
>>Thanks for the comments. A few notes below, which probably need more
>>thought.
>>
>>Addison
>>
>>
>>>-----Original Message-----
>>>From: public-i18n-ws-request@w3.org
>>>[mailto:public-i18n-ws-request@w3.org]On Behalf Of Martin Duerst
>>>Sent: Wednesday, January 15, 2003 2:33 PM
>>>To: Addison Phillips [wM]; public-i18n-ws@w3.org
>>>Cc: debasish@us.ibm.com
>>>Subject: Re: [WSTF] My view... [long]
>>>
>>>
>>>>I consider "locale" to be an environment setting or settings that
>>>
> govern
> 
>>>>how data is (to be) processed. Language preference may either be
>>>
>>>an aspect
>>>
>>>>of "locale" or may be a separate environment setting.
>>>
>>>And 'locale' also has several different aspects. That lets me
>>>generally favor a hierarchy of
>>>
>>>   i18n context
>>>      - language
>>>      - date format
>>>      - number format
>>>      - sort order
>>>      - ...
>>>
>>>rather than
>>>
>>>   language
>>>   locale
>>>     - date format
>>>     - number format
>>>     - sort order
>>>     - ...
>>>
>>
>>Yes, exactly so. But I have some concerns with "dissolving" locales into
>>"i18n context" soup, as it were. The foremost is that most operating
>>environments (the language the actual service is written in, as opposed to
>>the container) want a native locale object to enable those various
> 
> settings.
> 
>>So those items are "subsettings" of a locale object, and not necessarily a
>>full-fledged item themselves.
>>
>>I would put language as a sub-item of locale, but for the fact that there
>>are existing systems that deal with language. The question is whether to
>>coexist, thus:
>>
>>  i18n context
>>    - language
>>    - locale
>>        -language
>>        -region etc.
>>
>>Or even:
>>
>>  language
>>  i18n-context
>>     - locale
>>        - numberfmt, etc.
>>
>>
>>>>In traditional software, the developer doesn't need to
>>>
>>>explicitly obtain a
>>>
>>>>Locale object or setting in order to write internationalized code. The
>>>>default behavior of locale-aware functions obtain and use the
>>>
>>>locale from
>>>
>>>>the system environment. Yes, the developer must call the correct
>>>
>>>functions
>>>
>>>>or methods, but s/he must only obtain a locale when the desire is to
>>>>override the user's preference (the default).
>>>
>>>I guess one important point may be that developers should not have to
>>>care about language/locale issues when these are marginal (e.g. language
>>>of error messages), but that there may be an advantage to force them
>>>to have to think about such issues explicitly if they are central
>>>to the application (e.g. currency conversions in some cases).
>>
>>
>>Yes, exactly. It also means that well-written code running on multi-locale
>>capable Web service containers gets multi-locale operation for free. This
>>may include a great deal of code that is merely well-written, not
> 
> explicitly
> 
>>internationalized. This is, to my mind, the best indicator "we've done
>>something right".
>>
>>>
>>>>---
>>>>Aside: These last two scenarios Deb described in an off-list email last
>>>>week. These are "runAsClient", "runAsSpecified" and "runAsServer"
>>>>scenarios, Deb.
>>>>---
>>>
>>>Deb or Addison, could you send these scenarios to the list or to Kentaro
>>>for inclusion in the document?
>>
>>The email thread is a bit chaotic. We'd probably have to forward the whole
>>series.
>>
>>>
>>>>I'm on record as favoring tags over an XML structure, but I'm willing
>>>
> to
> 
>>>>be persuaded (and try persuasion ;-)).
>>>
>>>I think the question of whether to use a tag or a (simple) structure
>>>is a syntax question, and thus could work both ways. A more important
>>>question in my view is whether we want the solution to effectively
>>>have a limited number of values (e.g. language x country (x variant)),
>>>or whether we want to be open-ended, and how the tag/simple structure
>>>gets associated with the actual data for the language/locale.
>>>
>>
>>"Tag vs. structure" might then better be described as "tag" vs. "data". If
>>the specific setting transfer is not our goal, then whether we use a URN
> 
> or
> 
>>an XML structure is merely syntax, as you say.
>>
>>Having played with ULocale tags for the better part of a year, I have a
>>better feel for where limited vs. open work than I did originally. ULocale
>>tags are "open-ended" insofar as they allow an arbitrary number of
>>additional, optional, or vendor-defined fields. The problem is getting
>>agreement on what those fields and their values mean. So, in practice,
> 
> they
> 
>>are closed.
>>
>>What I've found is that there is a distinct set of "values" that have
>>meaning and may actually be necessary for total interoperability. These
> 
> are:
> 
>>- language
>>- region
>>- charset (UNIX processes need it, can't get away from it)
>>- collation
>>- script (writing system, as with Trad and Simplified, or Japanese
> 
> kana-only
> 
>>locale)
>>- orthography (as Bokmal/Nynorsk)
>>- currency (euro?)
>>- other variant (all one-time-events, e.g. EURO)
>>
>>This historical problem with variant is that it is a catch-all field that
>>does too many (different) things. In my own work, I've balled the last
> 
> three
> 
>>up into "variant" and used script only on occasion. The remaining four
>>(lang, region, charset, collation) maintain their importance to me, even
> 
> in
> 
>>an all Unicode environment.
>>
>>For me it's because I sometimes need to do Java->POSIX. Although, as Deb
>>points out, you probably wouldn't write an ML Web services container in
>>XPG4-style C, you might use a Java container to invoke services or connect
>>to resources (via JCA for example) that are written in such a style. Since
>>the service invocation may be within the context of a native interface
> 
> exec
> 
>>call, calling setlocale() on the service itself isn't quite the nasty
>>problem that it would otherwise be and a complete POSIX locale is needed
> 
> to
> 
>>invoke the service in a locale sensitive way.
>>
>>That is, my Solaris box has a locale called "ja_JP" and it is quite
>>different than "ja_JP.UTF-8" or "ja_JP.eucJP" or even "ja_JP@kana".
> 
> "ja-JP"
> 
>>isn't enough information to know which locale to initialize and I can't
> 
> even
> 
>>enumerate the locales in some environments (like XPG4) to do it via
>>inspection. A C program that is smart might try to guess its way through
> 
> the
> 
>>locales: this is like the plug-in idea that Deb suggested.
>>
>>In any case, in modern programming this is less of an issue. But I still
>>think that the variations I've listed are important
>>in diverse cases, too many to just ignore them.
>>
>>>>I have use cases galore from webMethods that relate to the
>>>
>>>various aspects
>>>
>>>>of the above.
>>>
>>>Please send them in so that we can integrate them in our doc.
>>
>>As I get to them. The problem is always time....::sigh::
>>
>>>
>>>>Some things to think about:
>>>>
>>>>--RFC3066 language tags (that is xx-YY) are sufficient for identifying
>>>>Java locales and possibly C# CultureInfos, but not for POSIX,
>>>
>>>C/C++, Mac,
>>>
>>>>and other native programming environments.
>>>
>>>Can you tell us what's missing in the later cases? For Posix/C/C++,
>>>clearly the encoding is missing, but with XML, this falls out of the
>>>equation. Anything else?
>>
>>POSIX/C++: charset/encoding. Have to have them even though XML files are
>>Unicode.
>>Mac (non OS/X): script or script code
>>Win32: collation (you can live without, of course)
>>Databases: encoding or collation?
>>Host systems: CCSID or code page?
>>// ... // too many more to really list
>>Note that 3066 tags don't cover Java variants (for example). As long as
> 
> you
> 
>>can live with "loss-of-precision" like that, it's okay, but I don't think
> 
> we
> 
>>can live with it.
>>
>>
>>>
>>>Regards,     Martin.
>>>
>>>
>>
>>
> 


-- 
Addison P. Phillips
Director, Globalization Architecture
webMethods, Inc.

+1 408.962.5487  mailto:aphillips@webmethods.com
-------------------------------------------
Internationalization is an architecture. It is not a feature.

Chair, W3C I18N WG Web Services Task Force
http://www.w3.org/International/ws
Received on Friday, 31 January 2003 12:38:17 UTC