Re: URIs in HTML5 and issues arising

Julian Reschke wrote:
> 
> Julian Reschke wrote:
>> ...
>> Here I was asking for a different number (the amount of pages that use 
>> non-ASCII characters in queries, which are *also* not included in the 
>> document's encoding).
>> ...
> 
> Sorry, confused two cases.
> 
> The number above would be interesting for the issue how to encode query 
> characters that aren't compatible with the document encoding.

I can't find that number. I can't reliably detect the document encoding 
(when it's not specified by HTTP/<meta>), so I get lots of false 
positives when scanning through the data. I also can't test most of the 
matched pages in a web browser, since they've changed in the months 
since I first downloaded them all.

The only cases (out of ~130K pages listed on dmoz.org) where I noticed 
there was a real problem, and where the page with the problem still 
exists (so I could see that Opera translated the broken characters into 
"?"), were:

http://www.jiraiya.com/pc/ - Shift_JIS - <a 
href="http://www.amazon.co.jp/exec/obidos/external-search?search-type=ss&tag=sendaijiraiya-22&keyword=&#34719;&#19998;&#65389;&#33292;&#12539;&#39519;&#20597;&#128;&#128;&index=books-jp" 
target="_blank">あの有名な漫画「夏子の酒」</a>の作者尾瀬あきら先生のご推 
薦選ばれました!!

http://www.a-travel.sk/ - windows-1250 - <p><a 
href="?atravel=&#882;sko&amp;id=43"><font 
color="#ff6600">&Iacute;rsko</font></a>&nbsp; Dublin</p>

(I haven't checked how other browsers handle these cases.)

(This seems like too little data, and too imprecise, to draw any 
conclusions at all.)
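
To illustrate the failure mode in the a-travel.sk case, here is a minimal 
Python sketch. It assumes the browser encodes the query in the document's 
charset and substitutes "?" for any character that charset cannot 
represent, which matches what I saw in Opera but is not checked against 
other browsers. The function name encode_query_value is made up for this 
example.

```python
from urllib.parse import quote_from_bytes


def encode_query_value(value: str, charset: str) -> str:
    """Encode a query value in the document charset, then percent-encode it.

    Assumption: unencodable characters become "?" (Python's "replace"
    error handler), mimicking the behaviour observed in Opera.
    """
    raw = value.encode(charset, errors="replace")
    # Keep "?" literal so the substitution stays visible in the result.
    return quote_from_bytes(raw, safe="?")


# &#882; is U+0372, which windows-1250 cannot represent.
print(encode_query_value("\u0372sko", "windows-1250"))
# The intended "Írsko" is representable, as a single byte.
print(encode_query_value("\u00cdrsko", "windows-1250"))
```

With a UTF-8 document the same call loses nothing, since every character 
is representable there.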

> The number that I actually was looking for is the amount of pages
> 
> - contain (unescaped) non-ASCII character in queries, *and*
> - use a document encoding other than UTF-8 (*).

If I grep the undecoded byte streams of 130K pages for 
/(?i)href\s*=\s*"[^"]*\?[^"]*[^\x00-\x7f]/ (i.e. non-ASCII bytes in 
queries in double-quoted href attributes, not counting &-escaped 
non-ASCII characters, unless I got something wrong), then skip all 
"mailto:" links, then count the number of pages per charset (determined 
by HTTP and <meta>), I get

            big5 6
          euc-jp 1
          euc-kr 6
          euc_kr 1
          gb2312 106
             gbk 1
      iso-8859-1 196
     iso-8859-15 4
      iso-8859-2 23
      iso-8859-9 8
            none 1
   pt-iso-8859-1 1
       shift_jis 24
           utf-8 67
    windows-1250 15
    windows-1251 29
    windows-1252 17
    windows-1254 22
    windows-1255 12
   windows-1255; 1
    windows-1256 3
     windows-874 1
     windows-932 1
          x-sjis 3

(The pages' overall charset distribution is as in 
http://philip.html5.org/data/charsets.html - most significantly, the 106 
gb2312 pages counted above are about one in eight of all the gb2312 
pages.)
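
The scan described above can be sketched roughly as follows in Python. 
This is an approximation and not the script I actually ran: it assumes 
each page is already available as a (charset, raw bytes) pair, with the 
charset taken from HTTP/<meta> or None when unspecified, and the names 
has_non_ascii_query and count_by_charset are made up for this example.

```python
import re
from collections import Counter

# Non-ASCII bytes after a "?" inside a double-quoted href attribute.
# Matching on undecoded bytes means &-escaped characters are not counted.
HREF_QUERY_NON_ASCII = re.compile(
    rb'(?i)href\s*=\s*"([^"]*\?[^"]*[^\x00-\x7f])'
)


def has_non_ascii_query(body: bytes) -> bool:
    """True if any non-mailto href on the page has raw non-ASCII in its query."""
    for match in HREF_QUERY_NON_ASCII.finditer(body):
        if not match.group(1).lstrip().lower().startswith(b"mailto:"):
            return True
    return False


def count_by_charset(pages):
    """Count matching pages per declared charset ("none" if undeclared)."""
    counts = Counter()
    for charset, body in pages:
        if has_non_ascii_query(body):
            counts[(charset or "none").lower()] += 1
    return counts


sample_pages = [
    ("utf-8", b'<a href="/s?q=caf\xc3\xa9">x</a>'),
    ("iso-8859-1", b'<a href="/s?q=caf&#233;">x</a>'),   # escaped: skipped
    ("gb2312", b'<a HREF = "?k=\xd6\xd0">x</a>'),
    ("gb2312", b'<a href="mailto:a@b.c?subject=\xd6\xd0">x</a>'),  # mailto
    (None, b'<a href="/a?b=\xff">x</a>'),
]
for charset, n in sorted(count_by_charset(sample_pages).items()):
    print(f"{charset:>16} {n}")
```

Each page is counted at most once, however many matching links it has, so 
the totals are pages per charset and not links per charset.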

Is that kind of what you were looking for?

-- 
Philip Taylor
pjt47@cam.ac.uk

Received on Monday, 30 June 2008 14:23:24 UTC