Re: #428 Accept-Language ordering for identical qvalues

I have a 111 MB access log, beginning in 2004, that records the UA, AL & AC headers, and I don't mind posting it publicly. I have not analysed it, so I do not know what fraction of the log is bots, and it comes only from an obscure 'test your accept language/charset headers' page on my personal site, so it could be full of weird junk.

"wc -l" says there are 1.8 million lines in there, and the head of the file looks fine, but when I run tail I see "n" where my script should have written a newline, so I probably edited the script at some point and missed a backslash. That will need cleaning up.


\ Nicholas.
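
P.S. Once the log is cleaned up, something like the following could bucket the A-L values roughly along the lines Amos describes below. It is only a sketch under my own assumptions: the function names and category labels are made up for illustration, a missing q-value is treated as q=1.0, a malformed q-value is treated as missing, and "sorted" means non-increasing, so entries with equal q-values still count as sorted.

```python
def parse_accept_language(value):
    """Split an Accept-Language field-value into (language_range, q) pairs.

    q is None when the entry carries no (parsable) q parameter.
    Entries are returned in the order they were sent.
    """
    entries = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        parts = [p.strip() for p in item.split(";")]
        q = None
        for param in parts[1:]:
            name, _, raw = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(raw.strip())
                except ValueError:
                    q = None  # assumption: malformed q-value counts as absent
        entries.append((parts[0], q))
    return entries


def classify(value):
    """Return 'empty', 'no-qvalues', 'sorted' or 'unsorted' (hypothetical labels)."""
    entries = parse_accept_language(value)
    if not entries:
        return "empty"
    qs = [q for _, q in entries]
    if all(q is None for q in qs):
        return "no-qvalues"
    # Assumption: an entry without a q-value behaves as q=1.0.
    effective = [1.0 if q is None else q for q in qs]
    in_order = all(a >= b for a, b in zip(effective, effective[1:]))
    return "sorted" if in_order else "unsorted"


def uses_q_zero(value):
    """True if any entry explicitly sends a q-value of zero."""
    return any(q == 0.0 for _, q in parse_accept_language(value))
```

With these rules, Amos's example "en-us,en;q=0.5,x-ns1qHkbtrt8Nhv,x-ns2E1e0Nnym7b6" comes out as 'unsorted', "en-GB,en" as 'no-qvalues', and a q-ordered list followed by a bare "*" also comes out as 'unsorted'.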

On 21 Jan 2013, at 02:06, Mark Nottingham <mnot@mnot.net> wrote:

> That's interesting, thanks. 
> 
> One thing to add; even if the client includes a q=0, the server can still ignore it. 
> 
> Cheers,
> 
> P.S. If you are able (considering privacy issues, etc.) and want to dump such data in a useable format, feel free to ask for a repository on the github account.
> 
> 
> 
> On 21/01/2013, at 12:56 PM, Amos Jeffries <squid3@treenet.co.nz> wrote:
> 
>> My collection of 2 years worth of language headers says no.
>> 
>> Of 2018 unique Accept-Language header field-values;
>> 1532 are using q-values in a strictly sorted list
>> 491 are not using q-values
>> 14 are using "q=0.0".
>> 5 are using q-values and non-qvalues without ordering the sent list (1 looks otherwise normal, the others are using puny-codes)
>> 
>> The 14 are also unique in being very long and having multiple entries with equal q-values. They are still, without exception, strictly ordered, with the entries having no q-value listed first (as if q=1.0 was used for the sort but omitted when sending). They also contain a number of oddities, such as multiple entries for the same language code with differing q-values.
>> 
>> NP: Of those 14 odd A-L headers noted above I have UA details on 8 of them. All claim to be Firefox but the Gecko dates do not line up with other info on those versions (the 11.0 was released some years before 3.5.9 on the same OS) so the whole input is a bit suspect.
>> 
>> 
>> The 5 unordered-list cases have puny-code values with no q-value listed after an otherwise normal series of languages. Like so:
>> "en-us,en;q=0.5,x-ns1qHkbtrt8Nhv,x-ns2E1e0Nnym7b6"
>> 
>> I have a few cases of a q-value-ordered list followed by a wildcard "*" with no q-value. The sender is obviously assuming the list is ordered.
>> 
>> 
>> 
>> Broken down by UA, which I started doing ~6 months ago at Julien's suggestion, I have 54289 distinct UAs visiting, of which;
>> 21756 are not sending A-L header at all
>> 19621 unique UA are using a single language code with no q-value
>> 12495 unique UA are using q-values as above.
>> 8 are sending only wildcard "*" or "*/*"
>> 
>> The remainder, ~400, roughly match up with the 491 A-L field-values not using q-values. They are older agents (Windows 98, NT and 2k stand out), agents sending the same language multiple times (VoilaBot variants and Safari there), or agents sending sub-language variants with the generic form last, e.g. "en-GB,en", "en-US,en", "en-US,en,*" (tablets and Mobile Safari mostly). All are obviously assuming sorted lists, even back to the Windows 98 ones.
>> 
>> There are also a few bots sending exactly 2 puny-code entries.
>> 
>> 
>> Amos
> 
> --
> Mark Nottingham   http://www.mnot.net/
> 
> 
> 
> 

Received on Monday, 21 January 2013 14:23:22 UTC