- From: Philip Taylor <pjt47@cam.ac.uk>
- Date: Mon, 30 Jun 2008 15:22:33 +0100
- To: Julian Reschke <julian.reschke@gmx.de>
- CC: Ian Hickson <ian@hixie.ch>, uri@w3.org, HTML WG <public-html@w3.org>
Julian Reschke wrote:
>
> Julian Reschke wrote:
>> ...
>> Here I was asking for a different number (the amount of pages that use
>> non-ASCII characters in queries, which are *also* not included in the
>> document's encoding).
>> ...
>
> Sorry, confused two cases.
>
> The number above would be interesting for the issue how to encode query
> characters that aren't compatible with the document encoding.
I can't find that number. I can't reliably detect the document encoding
when it's not specified by HTTP or <meta>, so I get lots of false
positives when scanning through the data. I also can't test most of the
matched pages in a web browser, since they've changed in the months
since I first downloaded them all.
The only cases (out of ~130K pages listed on dmoz.org) where I noticed
there was a real problem, and where the page with the problem still
exists (so I could see that Opera translated the broken characters into
"?"), were:
http://www.jiraiya.com/pc/ - Shift_JIS - <a
href="http://www.amazon.co.jp/exec/obidos/external-search?search-type=ss&tag=sendaijiraiya-22&keyword=螟丞ュ舌・驟偵€€&index=books-jp"
target="_blank">あの有名な漫画「夏子の酒」</a>の作者尾瀬あきら先生のご推
薦選ばれました!!
http://www.a-travel.sk/ - windows-1250 - <p><a
href="?atravel=Ͳsko&id=43"><font
color="#ff6600">Írsko</font></a> Dublin</p>
(I haven't checked how other browsers handle these cases.)
(This seems to be too little data, and too imprecise, to draw any
conclusions at all.)
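
For illustration only, here is a minimal Python sketch of what can happen
when a query value is percent-encoded using the document's encoding and a
character is not representable in it. It is not a claim about what Opera
or any other browser actually does. The helper name encode_query_value
and the choice of error handlers are my assumptions.

```python
from urllib.parse import quote


def encode_query_value(value: str, doc_encoding: str, errors: str = "replace") -> str:
    """Percent-encode a query value using the document's encoding.

    Hypothetical helper: 'errors' selects what happens to characters
    that doc_encoding cannot represent.
    """
    return quote(value, safe="", encoding=doc_encoding, errors=errors)


# Representable in windows-1250: encoded as a single byte.
print(encode_query_value("Írsko", "windows-1250"))
# Not representable: the "replace" handler substitutes "?" (then %3F).
print(encode_query_value("夏", "windows-1250"))
# Alternative handler: a numeric character reference, percent-encoded.
print(encode_query_value("夏", "windows-1250", errors="xmlcharrefreplace"))
# Encoding as UTF-8 regardless of the document encoding.
print(encode_query_value("夏", "utf-8"))
```

The three handlers give three different URLs for the same link, which is
why the choice matters for pages in non-UTF-8 encodings.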
> The number that I actually was looking for is the number of pages that
>
> - contain (unescaped) non-ASCII characters in queries, *and*
> - use a document encoding other than UTF-8 (*).
If I grep the undecoded byte streams of 130K pages for
/(?i)href\s*=\s*"[^"]*\?[^"]*[^\x00-\x7f]/ (i.e. non-ASCII bytes in
queries in double-quoted href attributes, not counting &-escaped
non-ASCII characters, unless I got something wrong), then skip all
"mailto:" links, then count the number of pages per charset (determined
by HTTP and <meta>), I get
  big5             6
  euc-jp           1
  euc-kr           6
  euc_kr           1
  gb2312         106
  gbk              1
  iso-8859-1     196
  iso-8859-15      4
  iso-8859-2      23
  iso-8859-9       8
  none             1
  pt-iso-8859-1    1
  shift_jis       24
  utf-8           67
  windows-1250    15
  windows-1251    29
  windows-1252    17
  windows-1254    22
  windows-1255    12
  windows-1255;    1
  windows-1256     3
  windows-874      1
  windows-932      1
  x-sjis           3
(The pages' charset distribution is as in
http://philip.html5.org/data/charsets.html - most significantly, the 106
gb2312 pages counted here are about one in eight of all the gb2312
pages.)
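
In case anyone wants to repeat the count, the procedure above can be
sketched roughly as follows in Python. This is a reconstruction, not the
script I ran: the function names are made up, and it assumes you already
have each page as raw bytes plus the charset label taken from HTTP or
<meta> (or None when neither gives one).

```python
import re
from collections import Counter

# Same pattern as above, applied to undecoded bytes. The capture group
# holds the start of the href value so mailto: links can be skipped.
HREF_QUERY_NON_ASCII = re.compile(
    rb'(?i)href\s*=\s*"([^"]*\?[^"]*[^\x00-\x7f])'
)


def has_non_ascii_query(raw: bytes) -> bool:
    """True if any double-quoted href has a non-ASCII byte after a '?'."""
    for match in HREF_QUERY_NON_ASCII.finditer(raw):
        if not match.group(1).lstrip().lower().startswith(b"mailto:"):
            return True
    return False


def count_pages_by_charset(pages):
    """Count matching pages per charset label.

    'pages' is an iterable of (charset_label_or_None, raw_bytes) pairs.
    """
    counts = Counter()
    for charset, raw in pages:
        if has_non_ascii_query(raw):
            counts[(charset or "none").lower()] += 1
    return counts


sample = [
    ("windows-1250", '<a href="?atravel=Írsko&id=43">'.encode("windows-1250")),
    ("utf-8", b'<a href="?q=%C3%8D">'),  # already percent-escaped, not counted
    (None, '<a href="mailto:a@b.c?subject=é">'.encode("utf-8")),  # skipped
]
print(count_pages_by_charset(sample))
```

Each page is counted at most once, however many matching links it has.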
Is that kind of what you were looking for?
--
Philip Taylor
pjt47@cam.ac.uk
Received on Monday, 30 June 2008 14:23:24 UTC