Re: #428 Accept-Language ordering for identical qvalues

On 23/01/2013 2:53 a.m., Julian Reschke wrote:
> On 2013-01-22 14:40, Nicholas Shanks wrote:
>> On 17 January 2013 09:14, Julian Reschke wrote:
>>> On 2013-01-17 09:59, Roy T. Fielding wrote:
>>>> than there are servers that implement language negotiation and
>>>> actually want to resolve ties at random.
>>>
>>> They do not "want" to resolve at random; they do so because they have
>>> implemented what the spec says. There's no reason to create an 
>>> ordered list
>>> structure when the spec says that an unordered list is sufficient.
>>
>> I think no implication of randomness should be permitted by the 
>> specifications.
>> They should instead require that a deterministic process be used, and
>> that, other than requests to services which explicitly exist to
>> provide random results (e.g. Wikipedia's "Random Page" link), the same
>> request should generate the same result providing nothing pertinent to
>> the resource has changed on the server.
>>
>> Someone, I don't recall who, gave the example of a home page loading
>> blog posts via AJAX, where the blog posts are available in two
>> languages. Random selection between the variants, where (q * qs)
>> values are equal for both languages, or are being ignored, would

That would be me. Take a note of the Androids below...

>
> Can you please give an example of clients sending these kind of header 
> field values?
>
> Clients that care can provide different qvalues, and as a matter of 
> fact, they do.

Uhm. Let's see... where shall I start?
  I think an overview of what happens when agents "care" enough to send 
q-values.
  Followed by a small sample of the 513 agents I have on record with no 
q-values at all.
  Judge for yourself which ones are interpreted better as sorted lists.

For starters, to be completely fair, I would like to say that the 
majority of agents that I have on record (~54% of unique language:agent 
pair entries) *do* send q-values properly in accordance with the 
specification - and that same 54% of unique agent entries is all 
'voting' for the list to be ordered. I am presenting this sub-set to 
show the kinds of complexity and confusion we introduce when we 
rely solely on q-values to provide ordering semantics in the list.


WebKit ...

cs, en-us; 0.9, de-de; 0.8, ru-ru; 0.7
  - Mozilla/5.0 (X11; U; Linux; cs-CZ) AppleWebKit/532.4 (KHTML, like 
Gecko) Arora/0.10.1 Safari/532.4
  + do we consider that a list with q-values or not?
  + notice also how it is a much more "up to date" version than the 
following...

en;q=1.0, en;q=0.5, zh-cn, zh;q=0.5, en;q=0.5
  - Mozilla/5.0 (SymbianOS/9.2; U; Series60/3.1 NokiaE71-1/300.21.012; 
Profile/MIDP-2.0 Configuration/CLDC-1.1 ) AppleWebKit/413 (KHTML, like 
Gecko) Safari/413
  + Nokia Symbian and SonyEricsson WebKit/ 4XX-532 derived agents across 
the board seem to have 1 primary language set at q=1.0 followed by a 
list of others all sharing q=0.5 or no q-value at all as seen above.

cs-CZ, en-US
  - Mozilla/5.0 (Linux; U; Android 2.2; cs-cz; HTC Legend Build/FRF91) 
AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1
  + Starting with WebKit/533 all the mobiles seem to have moved to this 
2-language model of some language followed by "en-US"

da-DK, en-US
  - Mozilla/5.0 (Linux; U; Android 4.0.4; da-dk; GT-P5110 Build/IMM76D) 
AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Safari/534.30

en-us,en
  - Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; Valve Steam Client; 
) AppleWebKit/534.1 (KHTML, like Gecko) Chrome/6.0.444.0 Safari/534.1

th-TH, en-US
  - Mozilla/5.0 (Linux; U; Android 4.0.3; th-th; A1 Build/IML74K) 
AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30


... and then we have iTunes. A massive "WTF?" going out to the iTunes 
developers if anyone is reading.

en;q=1.0,fr;q=1.0,de;q=0.9,ja;q=0.9,nl;q=0.9,it;q=0.9,es;q=0.8,pt;q=0.8,pt-PT;q=0.8,da;q=0.7,fi;q=0.7,nb;q=0.7,sv;q=0.7,ko;q=0.6,zh-Hans;q=0.6,zh-Hant;q=0.6,ru;q=0.5,pl;q=0.5,tr;q=0.5,uk;q=0.5,ar;q=0.4,hr;q=0.4,cs;q=0.4,el;q=0.3,he;q=0.3,ro;q=0.3,sk;q=0.3,th;q=0.2,id;q=0.2,ms;q=0.2,en-GB;q=0.1,ca;q=0.1,hu;q=0.1,vi;q=0.1
  -  iTunes-iPad/5.1.1 (2; 32GB; dt:74)

en;q=1.0,fr;q=1.0,de;q=0.9,ja;q=0.9,nl;q=0.9,it;q=0.9,es;q=0.8,pt;q=0.8,pt-PT;q=0.8,da;q=0.7,fi;q=0.7,nb;q=0.7,sv;q=0.7,ko;q=0.6,zh-Hans;q=0.6,zh-Hant;q=0.6,ru;q=0.5,pl;q=0.5,tr;q=0.5,uk;q=0.5,ar;q=0.4,hr;q=0.4,cs;q=0.4,el;q=0.3,he;q=0.3,ro;q=0.3,sk;q=0.3,th;q=0.2,id;q=0.2,ms;q=0.2,en-GB;q=0.1,ca;q=0.1,hu;q=0.1,vi;q=0.1
  - iTunes-iPhone/5.0 (4; 16GB)

en;q=1.0,fr;q=1.0,de;q=0.9,ja;q=0.9,nl;q=0.9,it;q=0.9,es;q=0.8,pt;q=0.8,pt-PT;q=0.8,da;q=0.7,fi;q=0.7,nb;q=0.7,sv;q=0.7,ko;q=0.6,zh-Hans;q=0.6,zh-Hant;q=0.6,ru;q=0.5,pl;q=0.5,tr;q=0.5,uk;q=0.5,ar;q=0.4,hr;q=0.4,cs;q=0.4,el;q=0.3,he;q=0.3,ro;q=0.3,sk;q=0.3,th;q=0.2,id;q=0.2,ms;q=0.2,en-GB;q=0.1,ca;q=0.1,hu;q=0.1,vi;q=0.1
  - iTunes-iPhone/4.3.5 (3; 16GB)


... spiders are mostly doing a remarkably good job. At least it looks 
that way until the q-values get involved.

ja-JP,ja
  - Baiduspider+(+http://www.baidu.jp/spider/)

ja,en
  - Mozilla/5.0 (compatible; Steeler/3.5; 
http://www.tkl.iis.u-tokyo.ac.jp/~crawler/)

  ru, uk;q=0.8, be;q=0.8, en;q=0.7, *;q=0.01
  - Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
  + q=0.8 - Ukrainian or Belarusian ?

  en-us,en-gb,en;q=0.99,*;q=0.01
  - TosCrawler/Nutch-1.5.1 
(http://www.toshiba.co.jp/rdc/about/crawl_info.htm; <dc-crawler at ml 
dot toshiba dot co dot jp>)
  + q=1.0 - English US or British? (not so much trouble for humans but 
for a search engine it might cause indexing trouble).

Don't know whether you would call some of the major search engine bots 
"popular", or even a "fixable problem"?


I host a translation server so it is likely that these below are from 
actual users working on text translation. You know, the kind of person 
who *really* objects to getting a randomly-wrong language displayed. 
Also these people are highly knowledgeable about language codes and what 
they mean, so if they entered these manually it was for a specific 
reason according to how they or their tool's author interpreted the 
Accept-Language specs.

Note how the first entries have no q-value and are *sorted* as if they 
were q=1.0, which is what the spec says to do when no q-value is 
supplied remember ... Treat it as q=1.0.
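
To make the "treat it as q=1.0" rule concrete, here is a minimal Python sketch of an ordered interpretation. The function name parse_accept_language is hypothetical, and two choices are assumptions made for illustration rather than anything the spec mandates: a parameter that does not start with "q=" (as in "en-us; 0.9") is ignored, and ties are kept in the order the agent sent them by relying on a stable sort.

```python
def parse_accept_language(header):
    """Return [(tag, q)] in preference order; ties keep header order."""
    entries = []
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        tag, _, params = item.partition(";")
        params = params.strip()
        q = 1.0  # a missing q-value is treated as q=1.0
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 1.0  # assumption: fall back to the default on a bad value
        # assumption: a parameter without "q=" (e.g. "; 0.9") is ignored
        entries.append((tag.strip(), q))
    # sorted() is stable, so equal q-values stay in the order they were sent
    return sorted(entries, key=lambda entry: -entry[1])
```

Fed the YandexBot header from earlier, this keeps "uk" ahead of "be" at q=0.8 on every request, where an unordered interpretation is free to pick either.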

  ca,ca-ES,es-es;q=0.9,es;q=0.9,en-US;q=0.9,en;q=0.9,es-419;q=0.8,ca-AD;q=0.8,en-gb;q=0.8,de-de;q=0.7,de;q=0.7,ca-CA;q=0.7,cs-CZ;q=0.6,cs;q=0.6,it-it;q=0.6,it;q=0.6,es-CL;q=0.5,en-au;q=0.5,fr-FR;q=0.5,fr;q=0.4,ru-ru;q=0.4,ru;q=0.4,es-x-mtfrom-en;q=0.4,es-ar;q=0.3,ja-JP;q=0.3,ja;q=0.3,pt-PT;q=0.2,pt;q=0.2,do-es;q=0.2,do;q=0.1,es-x-mtfrom-it;q=0.1,nl-nl;q=0.1,nl;q=0.1,en-en;q=0.0
  - Mozilla/5.0 (X11; Linux x86_64; rv:10.0.6) Gecko/20100101 
Firefox/10.0.6 Iceweasel/10.0.6
  + q=1.0 - Catalan Valencian or Spanish Catalan?
  + q=0.9 - Spanish or English? Generic or nationalized grammar?
  + q=0.8 - Spanish or Catalan Andorran or English or German or Catalan 
Valencian?
  + q=0.6 - want to try again with German or Catalan Generic?
  + q=0.5 - Spanish or Australian English or French?
  + q=0.4 - what about French or Russian?
  + q=0.3 - Argentine Spanish or Japanese?
  + q=0.1 - Spanish or Dutch?

de,de-DE,en-US;q=0.9,en;q=0.9,nl-nl;q=0.8,nl;q=0.8,en-gb;q=0.8,ro-RO;q=0.7,ro;q=0.7,fr-FR;q=0.6,fr;q=0.6,de-DE-1901;q=0.5,tr-TR;q=0.5,tr;q=0.5,pl-PL;q=0.4,pl;q=0.4,nl-NL;q=0.3,de-de;q=0.3,de-at;q=0.3,en-us;q=0.2,pl-pl;q=0.2,de;q=0.1,en-us;q=0.1,en;q=0.0
  - Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.15) 
Gecko/20110303 Firefox/3.6.15
  + q=0.9 - English Generic or US-centric ?
  + q=0.8 - Dutch or English?
  + q=0.5 - German or Turkish?
  + q=0.3 - Dutch or German?
  + q=0.2 - English or Polish?
  + q=0.1 - German or English?
  + q=0.1 - oops Cancel that q=0.9 US English option.
  + q=0.0 - oops Cancel that q=0.9 generic English option.

  + I skip q=1.0 (none), q=0.7, q=0.6 and q=0.4 because these, while 
being alternatives sharing a q-value, are in the ISO definitions 
semantically equivalent aliases for the same language. So any selection 
algorithm other than if-it-exists is a waste of CPU cycles but not a 
user problem.
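
The per-q-value questions above can be generated mechanically. The following Python sketch is illustrative only: the helper names _pairs, ties and conflicts are hypothetical, tags are compared case-insensitively, and a range without a "q=" parameter is assumed to mean q=1.0. ties() lists the tags sharing each q-value, and conflicts() lists tags that appear more than once with different q-values (the "oops, cancel that" cases).

```python
from collections import defaultdict

def _pairs(header):
    """Split a header into (lowercased tag, q) pairs, in header order."""
    pairs = []
    for item in header.split(","):
        tag, _, param = item.strip().partition(";")
        tag, param = tag.strip().lower(), param.strip()
        # assumption: anything other than a "q=" parameter means q=1.0
        q = float(param[2:]) if param.startswith("q=") else 1.0
        if tag:
            pairs.append((tag, q))
    return pairs

def ties(header):
    """Map each q-value shared by two or more entries to those tags."""
    groups = defaultdict(list)
    for tag, q in _pairs(header):
        groups[q].append(tag)
    return {q: tags for q, tags in groups.items() if len(tags) > 1}

def conflicts(header):
    """Map each tag listed with more than one distinct q-value to them."""
    seen = defaultdict(list)
    for tag, q in _pairs(header):
        if q not in seen[tag]:
            seen[tag].append(q)
    return {tag: qs for tag, qs in seen.items() if len(qs) > 1}
```

This sketch does not try to decide which tied tags are equivalent aliases; it only reports the groups, leaving that judgment to the reader.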


We have only a few agents sending "q=1.0". By my interpretation of RFC 
2616 these few are the "correct" users of q-values when q=1:

  en;q=1.0
  - w3m/0.5.2
  also the YoudaoBot spider with a mix of language codes. It seems to be 
trying to fetch different translations specifically for some reason.

en-us;q=1.0, es-ve;q=0.5
  - Mozilla/4.1 (U; BREW 3.1.5; en-US; Teleca/Q05A/INT)
  - NetFront/3.5.1 (BREW 5.0.1.2; U; en-us; LG; NetFront/3.5.1/AMB) 
Sprint LN510 MMP/2.0 Profile/MIDP-2.1 Configuration/CLDC-1.1
  there are a few other variations of this "NetFront/" framework from 
Samsung and LG mobile devices.

The rest (~50 unique agent:language pairs) using q=1.0 somewhere in the 
A-L header are all WebKit derived agents. We already covered how well 
they handle q-values.


Still a fair few browser agents around with no q-values.

zh-cn,zh-tw
  - Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.0.1) 
Gecko/2008070208 Firefox/3.0.1

zh-cn,zh-tw
  - Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.2.3) 
Gecko/20100401 Firefox/3.6.3

en,zh,fr,de,it
  - Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.20) Gecko/20081217 
Firefox/2.0.0.20 Novarra-Vision/8.0

ru, en-US, en
  - Mozilla/5.0 (compatible; Konqueror/4.4; Linux) KHTML/4.4.5 (like Gecko)

ru, uk, en-US, en
  - Mozilla/5.0 (compatible; Konqueror/4.4; FreeBSD) KHTML/4.4.3 (like Gecko)


HTH
Amos

Received on Thursday, 24 January 2013 08:38:08 UTC