Re: HTML5 Issue 11 (encoding detection): I18N WG response... from Martin J. Dürst on 2009-10-12 (public-i18n-core@w3.org from October to December 2009)

From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Date: Mon, 12 Oct 2009 14:33:24 +0900
To: Andrew Cunningham <andrewc@vicnet.net.au>
CC: Larry Masinter <masinter@adobe.com>, Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>, Ian Hickson <ian@hixie.ch>, "Phillips, Addison" <addison@amazon.com>, Richard Ishida <ishida@w3.org>, "public-html@w3.org" <public-html@w3.org>, "public-i18n-core@w3.org" <public-i18n-core@w3.org>
Message-ID: <4AD2BFA4.10504@it.aoyama.ac.jp>
This is a thought that occurred to me today, too.

When a browser enters a certain market, it has to mobilize a lot of 
resources to do localization (by volunteers or paid staff). The 
localization team may indeed be in the best position to decide which 
default encoding to use.

My understanding, from what I learned from Ian, is that HTML5 tries to 
make it easier to write a browser, without doing reverse-engineering. 
But the problem with the current "western demographic" wording is that 
browser implementers will have to reverse-engineer that term. As Leif 
explains in quite some detail, the definition is indeed quite circular: 
iso-8859-1 was designed for Iron Curtain-period Western Europe (with 
some limitations), and windows-1252 follows that. But the term 
"Western" has many meanings and is used in a much more differentiated 
way these days, and languages completely unrelated to Western Europe 
(Kurdish, Swahili) use iso-8859-1 just because they fit in (and quite 
a few more languages fit windows-1252).
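
To make the "they fit in" point concrete, here is a rough sketch that 
asks Python's codecs which letters of an alphabet a given encoding cannot 
represent. The helper name uncovered_letters and the sample strings are 
assumptions made up for this illustration; they are not taken from the 
HTML5 draft or from any browser.

```python
from typing import List


def uncovered_letters(alphabet: str, encoding: str = "windows-1252") -> List[str]:
    """Return the characters of `alphabet` that `encoding` cannot encode."""
    missing = []
    for ch in alphabet:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            # The codec has no byte for this character.
            missing.append(ch)
    return missing


# Illustrative samples (assumptions for this sketch, not exhaustive alphabets):
# the non-ASCII lowercase letters of the Croatian alphabet, and the
# non-ASCII letters listed in Leif's mail quoted below.
CROATIAN_EXTRA = "čćđšž"
LEIF_LETTERS = "ŠŒŽÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÚÛÜÝÞß"
BASIC_LATIN = "abcdefghijklmnopqrstuvwxyz"

if __name__ == "__main__":
    print("Croatian vs windows-1252:", uncovered_letters(CROATIAN_EXTRA))
    print("Croatian vs iso-8859-1:  ", uncovered_letters(CROATIAN_EXTRA, "iso-8859-1"))
    print("Leif's list vs windows-1252:", uncovered_letters(LEIF_LETTERS))
    print("Basic Latin vs iso-8859-1:", uncovered_letters(BASIC_LATIN, "iso-8859-1"))
```

A check like this only says whether an alphabet fits an encoding. It says 
nothing about which default a browser should ship with in a given market, 
which is the part that still needs local knowledge.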

Regards,   Martin.

On 2009/10/12 10:59, Andrew Cunningham wrote:
> *shrugs*
>
>
> as far as i can tell it's something that shouldn't be defined by the
> developers, but rather defined by the localisation teams who choose a
> suitable default encoding for the particular UI locale they are developing.
>
>
> Larry Masinter wrote:
>
>> Can someone please explain, again, why the discussion of default
>> configurations of a particular category of user agent in various
>> regions belongs in the definition of the HyperText Markup Language?
>>
>> What benefit can any author of a web page derive, please, from
>> knowing what the default settings of various browsers are in products
>> sold into various language environments?
>>
>> What benefits to the Internet, the Web, to anyone else, is there
>> in specifying what the default configuration should be for various
>> "demographics", independent of the actual user's language and
>> preference? Does it help a Kenyan who brings a laptop for use
>> by his Egyptian wife living in Finland?
>>
>> What is going on here?
>>
>> Thanks,
>>
>> Larry
>> --
>> http://larry.masinter.net
>>
>>
>> -----Original Message-----
>> From: public-html-request@w3.org [mailto:public-html-request@w3.org]
>> On Behalf Of Leif Halvard Silli
>> Sent: Sunday, October 11, 2009 4:57 PM
>> To: Ian Hickson
>> Cc: "Martin J. Dürst"; Phillips, Addison; Andrew Cunningham; Richard
>> Ishida; public-html@w3.org; public-i18n-core@w3.org
>> Subject: Re: HTML5 Issue 11 (encoding detection): I18N WG response...
>>
>> Ian Hickson On 09-10-11 21.23:
>>
>>> On Sun, 11 Oct 2009, Leif Halvard Silli wrote (reordered):
>>>> The choice of character set - alphabet - for instance, has always
>>>> been a
>>>> political matter, and still is.
>>> Ok, then it seems sensible to use a political way of speaking to
>>> refer to the choice of alphabet.
>>
>>
>> We do not choose an alphabet every day. Day to day, the right to use the
>> alphabet that your language requires is what matters. And ditto
>> language is required to express that.
>>
>>>> "Western this-and-that" is predominantly a political way of speaking.
>>> Good, then it is appropriate terminology.
>>
>>
>> Appropriate for what? Diplomatic language is political and accurate,
>> yet tries to avoid contested political phrasings.
>>
>> "Western European Language [environments]" as Addison suggested is a
>> reasonable neutral term, btw, despite use of "Western". It also gives
>> the reader many more hints about the politics involved ...
>>
>> Western demographics, OTOH ... You mentioned Africa: Egypt was a
>> colony once. So was Kenya. Why does Kenya have a Western demographic,
>> but Egypt not?
>>
>>>> Therefore it is wrong to use a wording that causes readers to think in
>>>> political terms.
>>> But you agree that it _is_ a political matter.
>>
>>
>> Which "it" are you referring to now?
>>
>>>> It is wrong to nourish the thought that if some population changes
>>>> to use an alphabet which is covered by Win1252, that they then will
>>>> start to belong to the "Western demographics".
>>> It doesn't matter if a population _changes_ to use an alphabet which
>>> is covered by 1252, because that will only affect future pages, not
>>> legacy pages, and it is only legacy pages we are concerned about.
>>
>> I see the logic, but I wonder how you can take any outcome for granted. I
>> don't know what is default in Azerbaijan today ...
>>
>>> What phrase best approximates the areas of the world where _today_
>>> UAs are shipping with a 1252 default encoding?
>>
>>
>> "Western demographics" is a term that leaves the job of finding out
>> which those areas are to the reader, anyhow.
>>
>> If you want to give better hints, then you could speak about "the
>> British Commonwealth, predominantly English, French, Spanish and
>> Portuguese speaking demographics, demographics that were alphabetized
>> as Western colonies (earlier colonies of France, Belgium, England,
>> Spain, Portugal)" - etc. You should of course add that "the list is not
>> exhaustive".
>>
>> You could also say "demographics using the Latin alphabet covered by
>> ASCII plus the letters ŠŒŽÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÚÛÜÝÞß". You may say
>> that this is circular. But at least it can help implementors find the
>> answer.
>>
>> You could also list the names of the different Latin alphabets that
>> are considered covered by Win1252: the ASCII alphabet, German
>> alphabet(s), Scandinavian, etc. See Wikipedia:
>>
>> http://en.wikipedia.org/wiki/Latin-derived_alphabet
>> http://en.wikipedia.org/wiki/Basic_modern_Latin_alphabet
>>
>> You could also say "demographics covered by the Latin alphabet, except
>> the following and other countries, which use letters that are not
>> covered by Win1252: Turkey, Croatia, Azerbaijan etc etc"
>>
>>>> Does Croatia belong to "Western demographics", for instance? Why? And
>>>> why not? The Croatian alphabet is not covered by Win1252. What about
>>>> Serbia? Serbia uses both Cyrillic and Latin side by side.
>>> What default encodings do browsers use in those areas?
>>
>>
>> I don't know. I just know that Win1252 doesn't cover the Croatian
>> alphabet. And I have also gotten the impression that it is a problem
>> that - if using one's own alphabet is seen as the normal thing -
>> software may not default to a charset using the local alphabet.
>>
>>>> As you can see, "Western demographics" is a wording that - depending
>>>> on how you define "Western" - covers both less and more than
>>>> e.g. "writing systems covered by Win1252".
>>> Is there a better term that would more accurately refer to the areas
>>> of the world where a UA needs to ship with a Win1252 default encoding?
>>
>>
>> See above. And below.
>>
>>>> For example you could say "For demographics that are covered by what
>>>> in user agents and e-mail applications are typically known as
>>>> "Western" or "West European" encodings, then Win1252 is the best
>>>> default".
>>> That's circular logic ("Use Win1252 as a default for demographics
>>> where Win1252 is the default").
>>
>>
>> To say that "Win1252" is the default for those areas which are covered
>> by what is referred to as "Western encodings", is not a circular
>> argument.
>>
>> But your focus appears to be *areas*. And from that point of view I
>> can see why you think it is circular.
>>
>> But I thought that it was more relevant for implementors to know that
>> Win1252 is considered the default for wherever "Western Encodings" are
>> useful, than it is for them to know that there apparently exists a
>> secret Union of Windows 1252 Countries ...
>>
>> However, I just now looked in Firefox to see what it meant by Western,
>> and found, under "West European", both Greek and "Western" encodings ...
>>
>> I suppose that Win1252 isn't the default encoding in Greece?
>>
>> Proves that "Western" is a very imprecise term.
>>
>>> The point is to be able to give implementation advice that is useful
>>> independent of the implementor performing any reverse engineering,
>>> studying of other user agents, etc.
>>
>> It doesn't require "reverse engineering" to find out the language of a
>> population, does it? What's really needed, if you want to do a good
>> job, is to visit that country and observe and judge.
>>
>> The issue of reverse engineering is, however, connected to what I said
>> above about "Win1252" being the default for areas covered by
>> "Western encodings".
>

-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp
Received on Monday, 12 October 2009 05:34:20 UTC