- From: Philip Taylor <pjt47@cam.ac.uk>
- Date: Mon, 30 Jun 2008 17:58:32 -0400
- To: Julian Reschke <julian.reschke@gmx.de>
- Cc: Ian Hickson <ian@hixie.ch>, uri@w3.org, HTML WG <public-html@w3.org>
Julian Reschke wrote:
>
> Julian Reschke wrote:
>> ...
>> Here I was asking for a different number (the amount of pages that use
>> non-ASCII characters in queries, which are *also* not included in the
>> document's encoding).
>> ...
>
> Sorry, confused two cases.
>
> The number above would be interesting for the issue how to encode query
> characters that aren't compatible with the document encoding.

I can't find that, since I can't reliably detect the document encoding (when it's not specified by HTTP/<meta>) and so I get lots of false positives when scanning through the data, and I can't test most of the matched pages in a web browser since they've changed in the months since I first downloaded them all.

The only cases (out of ~130K pages listed on dmoz.org) where I noticed there was a real problem, and where the page with the problem still exists (so I could see that Opera translated the broken characters into "?"), were:

http://www.jiraiya.com/pc/ - Shift_JIS -
  <a href="http://www.amazon.co.jp/exec/obidos/external-search?search-type=ss&tag=sendaijiraiya-22&keyword=螟丞ュ舌・驟偵€€&index=books-jp" target="_blank">あの有名な漫画「夏子の酒」</a>の作者尾瀬あきら先生のご推 薦選ばれました!!

http://www.a-travel.sk/ - windows-1250 -
  <p><a href="?atravel=Ͳsko&id=43"><font color="#ff6600">Írsko</font></a> Dublin</p>

(I haven't checked how other browsers handle these cases.)

(This seems too little data (and too imprecise) to be able to draw any conclusions at all.)

> The number that I actually was looking for is the amount of pages
>
> - contain (unescaped) non-ASCII character in queries, *and*
> - use a document encoding other than UTF-8 (*).

If I grep the undecoded byte streams of 130K pages for

  /(?i)href\s*=\s*"[^"]*\?[^"]*[^\x00-\x7f]/

(i.e. non-ASCII bytes in queries in double-quoted href attributes, not counting &-escaped non-ASCII characters, unless I got something wrong), then skip all "mailto:" links, then count the number of pages per charset (determined by HTTP and <meta>), I get

  big5              6
  euc-jp            1
  euc-kr            6
  euc_kr            1
  gb2312          106
  gbk               1
  iso-8859-1      196
  iso-8859-15       4
  iso-8859-2       23
  iso-8859-9        8
  none              1
  pt-iso-8859-1     1
  shift_jis        24
  utf-8            67
  windows-1250     15
  windows-1251     29
  windows-1252     17
  windows-1254     22
  windows-1255     12
  windows-1255;     1
  windows-1256      3
  windows-874       1
  windows-932       1
  x-sjis            3

(The pages' charset distribution is as in http://philip.html5.org/data/charsets.html - most significantly, this is about one in eight of all the gb2312 pages.)

Is that kind of what you were looking for?

-- 
Philip Taylor
pjt47@cam.ac.uk
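
For readers who want to reproduce the grep-and-count step described in the message, a minimal Python sketch follows. It is not the script that produced the numbers above. The function names, the `(charset, raw_bytes)` input shape, the capture group added around the attribute value, and the lowercasing of charset labels are all assumptions made for illustration. Only the regular expression itself comes from the message.

```python
import re
from collections import Counter

# The pattern quoted in the message, applied to raw (undecoded) bytes.
# Assumption: a capture group is added around the attribute value so that
# "mailto:" links can be skipped afterwards.
HREF_NON_ASCII_QUERY = re.compile(
    rb'(?i)href\s*=\s*"([^"]*\?[^"]*[^\x00-\x7f])'
)


def has_non_ascii_query(page_bytes):
    """Return True if some double-quoted href (other than a mailto: link)
    contains a raw non-ASCII byte somewhere after a '?'."""
    for match in HREF_NON_ASCII_QUERY.finditer(page_bytes):
        if not match.group(1).lower().startswith(b"mailto:"):
            return True
    return False


def count_pages_per_charset(pages):
    """Count matching pages per declared charset.

    `pages` is a hypothetical iterable of (charset, raw_bytes) pairs, where
    charset is the label taken from HTTP or <meta>, or None if undeclared.
    """
    counts = Counter()
    for charset, page_bytes in pages:
        if has_non_ascii_query(page_bytes):
            # Assumption: labels are lowercased and otherwise left untouched.
            counts[(charset or "none").lower()] += 1
    return counts


# Made-up sample pages, not taken from the crawl.
sample_pages = [
    ("windows-1250", b'<a href="?country=\xcdrsko&id=43">x</a>'),
    ("utf-8", b'<a href="mailto:a@example.org?subject=\xc3\xa9">mail</a>'),
    (None, b'<a href="/search?q=\xe5">y</a>'),
    ("UTF-8", b'<a href="/search?q=abc">z</a>'),
]
charset_counts = count_pages_per_charset(sample_pages)
```

Because the pattern runs on bytes, a character written as an &-escape (for example `&#205;`) is plain ASCII in the byte stream and is not counted, which matches the caveat in the message.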
Received on Monday, 30 June 2008 21:59:07 UTC