
RE: HTML5 Issue 11 (encoding detection): I18N WG response...

From: Phillips, Addison <addison@amazon.com>
Date: Mon, 12 Oct 2009 00:40:52 -0400
To: Larry Masinter <masinter@adobe.com>, Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>, Ian Hickson <ian@hixie.ch>
CC: "Martin J. Dürst" <duerst@it.aoyama.ac.jp>, Andrew Cunningham <andrewc@vicnet.net.au>, Richard Ishida <ishida@w3.org>, "public-html@w3.org" <public-html@w3.org>, "public-i18n-core@w3.org" <public-i18n-core@w3.org>
Message-ID: <C7A5719F1E562149BA9171F58BEE2CA412980D6906@EX-IAD6-B.ant.amazon.com>

Hello Larry,

Behind the odd use of the word "demographics", what's happening here is actually quite simple. This section contains the rules for a browser to determine the character encoding of a page. This is a necessary and important part of processing a page.

After a browser has failed to determine the encoding of a document via any other means (including, optionally, groping the individual bytes), *some* character encoding must be applied before attempting to display the document. Historically, browser vendors have localized which character encoding this is. Most Western European languages use the same legacy encoding (it's Windows code page 1252), while other languages often use something different.

Furthermore, most browsers also allow the user to choose what encoding is used, overriding the localized default.

This text is supposed to enable both of these features in browsers. 
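
The fallback described above can be sketched roughly as follows. This is only an illustration, not text from the draft: the function name, the parameter names, the order of precedence and the locale table are all assumptions made for the example.

```python
from typing import Optional

# Hypothetical locale-to-default table. The entries are illustrative only
# and are not copied from any browser or from the specification.
LOCALE_DEFAULTS = {
    "en": "windows-1252",
    "fr": "windows-1252",
    "ru": "windows-1251",
    "ja": "shift_jis",
}

# Assumed last-resort value for locales missing from the table.
FALLBACK = "windows-1252"


def choose_encoding(declared: Optional[str],
                    sniffed: Optional[str],
                    user_override: Optional[str],
                    ui_locale: str) -> str:
    """Pick an encoding for a page (assumed order of precedence)."""
    if user_override:      # the user explicitly chose an encoding
        return user_override
    if declared:           # the page or the server identified the encoding
        return declared
    if sniffed:            # optional examination of the bytes found something
        return sniffed
    # Everything else failed: apply the localized default.
    language = ui_locale.split("-")[0].lower()
    return LOCALE_DEFAULTS.get(language, FALLBACK)


print(choose_encoding(None, None, None, "ru-RU"))
print(choose_encoding("utf-8", None, None, "ru-RU"))
```

Only the last step depends on where the browser was localized, and a page that declares its encoding never reaches it.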

It benefits page authors not at all to know this. They would do better to follow the advice that applies even today: always *always* identify the character encoding of each page... and, if possible, use UTF-8 as that encoding.
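
As a small illustration of why the label matters (the sample word is an arbitrary choice for the example): the same UTF-8 bytes come out wrong when a browser has to fall back to windows-1252, and right when the encoding is identified.

```python
text = "caf\u00e9"                     # the word "cafe" with an acute accent on the e
utf8_bytes = text.encode("utf-8")      # the accented letter becomes two bytes, C3 A9

# Unlabelled page, browser falls back to windows-1252 (cp1252 in Python):
# each of the two bytes is shown as a separate character.
misread = utf8_bytes.decode("cp1252")

# Page that identifies itself as UTF-8: the original text comes back.
correct = utf8_bytes.decode("utf-8")

print(repr(misread), len(misread))
print(repr(correct), len(correct))
```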

Addison

Addison Phillips
Globalization Architect -- Lab126

Internationalization is not a feature.
It is an architecture.


> -----Original Message-----
> From: Larry Masinter [mailto:masinter@adobe.com]
> Sent: Sunday, October 11, 2009 6:40 PM
> To: Leif Halvard Silli; Ian Hickson
> Cc: "Martin J. Dürst"; Phillips, Addison; Andrew Cunningham;
> Richard Ishida; public-html@w3.org; public-i18n-core@w3.org
> Subject: RE: HTML5 Issue 11 (encoding detection): I18N WG
> response...
> 
> Can someone please explain, again, why the discussion of default
> configurations of a particular category of user agent in various
> regions belongs in the definition of the HyperText Markup Language?
> 
> What benefit can any author of a web page derive, please, from
> knowing what the default settings of various browsers are in products
> sold into various language environments?
> 
> What benefit to the Internet, the Web, to anyone else, is there
> in specifying what the default configuration should be for various
> "demographics", independent of the actual user's language and
> preference? Does it help a Kenyan who brings a laptop for use
> by his Egyptian wife living in Finland?
> 
> What is going on here?
> 
> Thanks,
> 
> Larry
> --
> http://larry.masinter.net
> 
> 
> -----Original Message-----
> From: public-html-request@w3.org [mailto:public-html-request@w3.org]
> On Behalf Of Leif Halvard Silli
> Sent: Sunday, October 11, 2009 4:57 PM
> To: Ian Hickson
> Cc: "Martin J. Dürst"; Phillips, Addison; Andrew Cunningham;
> Richard Ishida; public-html@w3.org; public-i18n-core@w3.org
> Subject: Re: HTML5 Issue 11 (encoding detection): I18N WG
> response...
> 
> Ian Hickson On 09-10-11 21.23:
> 
> > On Sun, 11 Oct 2009, Leif Halvard Silli wrote (reordered):
> >> The choice of character set - alphabet - for instance, has always been a
> >> political matter, and still is.
> >
> > Ok, then it seems sensible to use a political way of speaking to refer to
> > the choice of alphabet.
> 
> 
> We do not choose alphabet every day. Day to day, the right to use
> the alphabet that your language requires is what matters. And
> ditto language is required to express that.
> 
> >> "Western this-and-that" is predominantly a political way of speaking.
> >
> > Good, then it is appropriate terminology.
> 
> 
> Appropriate for what? Diplomatic language is political and
> accurate, yet tries to avoid contested political phrasings.
> 
> "Western European Language [environments]" as Addison suggested is
> a reasonable neutral term, btw, despite use of "Western". It also
> gives the reader many more hints about what politics are
> involved ...
> 
> Western demographics, OTOH ... You mentioned Africa: Egypt was a
> colony once. So was Kenya. Why does Kenya have a Western
> demographic, but Egypt not?
> 
> >> Therefore it is wrong to use a wording that causes readers to think in
> >> political terms.
> >
> > But you agree that it _is_ a political matter.
> 
> 
> Which "it" are you referring to now?
> 
> >> It is wrong to nourish the thought that if some population changes to
> >> use an alphabet which is covered by Win1252, that they then will start
> >> to belong to the "Western demographics".
> >
> > It doesn't matter if a population _changes_ to use an alphabet which is
> > covered by 1252, because that will only affect future pages, not legacy
> > pages, and it is only legacy pages we are concerned about.
> 
> I see the logic, but I wonder how you can take any outcome for granted.
> I don't know what is default in Azerbaijan today ...
> 
> > What phrase best approximates the areas of the world where _today_ UAs are
> > shipping with a 1252 default encoding?
> 
> 
> "Western demographics" is a term that leaves the job of finding
> out which those areas are to the reader, anyhow.
> 
> If you want to give better hints, then you could speak about "the
> British Commonwealth, predominantly English, French, Spanish and
> Portuguese speaking demographics, demographics that were
> alphabetized as Western colonies (earlier colonies of France,
> Belgium, England, Spain, Portugal)" - etc. You should of course add
> that "the list is not exhaustive".
> 
> You could also say "demographics using the Latin alphabet covered
> by ASCII plus the letters ŠŒŽÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÚÛÜÝÞß". You
> may say that this is circular. But at least it can help
> implementors find the answer.
> 
> You could also list the names of the different Latin alphabets
> that are considered covered by Win1252: the ASCII alphabet, German
> alphabet(s), Scandinavian, etc. See Wikipedia:
> 
> http://en.wikipedia.org/wiki/Latin-derived_alphabet
> http://en.wikipedia.org/wiki/Basic_modern_Latin_alphabet
> 
> You could also say "demographics covered by the Latin alphabet,
> except the following and other countries, which use letters that
> are not covered by Win1252: Turkey, Croatia, Azerbaijan etc etc"
> 
> >> Does Croatia belong to "Western demographics", for instance? Why? And why
> >> not? The Croatian alphabet is not covered by Win1252. What about Serbia?
> >> Serbia uses both Cyrillic and Latin side by side.
> >
> > What default encodings do browsers use in those areas?
> 
> 
> I don't know. I just know that Win1252 doesn't cover the Croatian
> alphabet. And I have also gotten the impression that it is a
> problem that - if using one's own alphabet is seen as the normal
> thing - software may not default to a charset using the local
> alphabet.
> 
> >> As you can see, "Western demographics" is a wording that - depending on
> >> how you define "Western" - covers both narrower and wider than e.g.
> >> "writing systems covered by Win1252".
> >
> > Is there a better term that would more accurately refer to the areas of
> > the world where a UA needs to ship with a Win1252 default encoding?
> 
> 
> See above. And below.
> 
> >> For example you could say "For demographics that are covered by what in
> >> user agents and e-mail applications are typically known as "Western" or
> >> "West European" encodings, then Win1252 is the best default".
> >
> > That's circular logic ("Use Win1252 as a default for demographics where
> > Win1252 is the default").
> 
> 
> To say that "Win1252" is the default for those areas which are
> covered by what is referred to as "Western encodings", is not a
> circular argument.
> 
> But your focus appears to be *areas*. And from that point of view
> I can see why you think it is circular.
> 
> But I thought that it was more relevant for implementors to know
> that Win1252 is considered the default for wherever "Western
> Encodings" are useful, than it is for them to know that there
> apparently exists a secret Union of Windows 1252 Countries ...
> 
> However, I just now looked in Firefox to see what it meant by
> Western, and found, under "West European", both Greek and
> "Western" encodings ...
> 
> I suppose that Win1252 isn't the default encoding in Greece?
> 
> Proves that "Western" is a very imprecise term.
> 
> > The point is to be able to give implementation
> > advice that is useful independent of the implementor performing any
> > reverse engineering, studying of other user agents, etc.
> 
> It doesn't require "reverse engineering" to find out the language
> of a population, does it? What's really needed, if you want to do
> a good job, is to visit that country and observe and judge.
> 
> The issue of reverse engineering is, however, connected to what I
> said above about "Win1252" being the default for areas
> covered by "Western encodings".
> --
> leif halvard silli

Received on Monday, 12 October 2009 04:41:35 GMT
