
Re: URIs in HTML5 and issues arising

From: Philip Taylor <pjt47@cam.ac.uk>
Date: Mon, 30 Jun 2008 15:22:33 +0100
Message-ID: <4868EC29.2000206@cam.ac.uk>
To: Julian Reschke <julian.reschke@gmx.de>
CC: Ian Hickson <ian@hixie.ch>, uri@w3.org, HTML WG <public-html@w3.org>

Julian Reschke wrote:
> Julian Reschke wrote:
>> ...
>> Here I was asking for a different number (the amount of pages that use 
>> non-ASCII characters in queries, which are *also* not included in the 
>> document's encoding).
>> ...
> Sorry, confused two cases.
> The number above would be interesting for the issue how to encode query 
> characters that aren't compatible with the document encoding.

I can't find that. I can't reliably detect the document encoding 
(when it's not specified by HTTP/<meta>), so I get lots of false 
positives when scanning through the data. I also can't test most of the 
matched pages in a web browser, since they've changed in the months since 
I first downloaded them all.

The only cases (out of ~130K pages listed on dmoz.org) where I noticed 
there was a real problem, and where the page with the problem still 
exists (so I could see that Opera translated the broken characters into 
"?"), were:

http://www.jiraiya.com/pc/ - Shift_JIS - <a 

http://www.a-travel.sk/ - windows-1250 - <p><a 
color="#ff6600">&Iacute;rsko</font></a>&nbsp; Dublin</p>

(I haven't checked how other browsers handle these cases.)

(This seems too little data, and too imprecise, to draw any 
conclusions at all.)
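
To illustrate the kind of breakage described above, here is a minimal 
Python sketch of what happens when a query character has no 
representation in the document encoding and is replaced by "?". It only 
mimics the "?" substitution I saw; it is not any browser's actual code, 
and the function name encode_query_lossy is made up for this example.

```python
from urllib.parse import quote


def encode_query_lossy(query: str, doc_encoding: str) -> str:
    """Encode a query string in the document encoding, then percent-escape it.

    Characters the encoding cannot represent become "?" (Python's
    errors="replace" behaviour), which is the symptom described above.
    """
    raw = query.encode(doc_encoding, errors="replace")
    # Keep "=", "&" and "?" literal so the substituted "?" stays visible.
    return quote(raw, safe="=&?")


if __name__ == "__main__":
    # U+20AC (euro sign) does not exist in iso-8859-1, so it is lost.
    print(encode_query_lossy("q=\u20ac", "iso-8859-1"))
    # U+00E9 does exist in iso-8859-1 and survives as a single byte.
    print(encode_query_lossy("q=\u00e9", "iso-8859-1"))
```

Once the character has been replaced, the server has no way to recover 
what the author originally wrote.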

> The number that I actually was looking for is the amount of pages
> - contain (unescaped) non-ASCII character in queries, *and*
> - use a document encoding other than UTF-8 (*).

If I grep the undecoded byte streams of 130K pages for 
/(?i)href\s*=\s*"[^"]*\?[^"]*[^\x00-\x7f]/ (i.e. non-ASCII bytes in 
queries in double-quoted href attributes, not counting &-escaped 
non-ASCII characters, unless I got something wrong), then skip all 
"mailto:" links, then count the number of pages per charset (determined 
by HTTP and <meta>), I get

            big5 6
          euc-jp 1
          euc-kr 6
          euc_kr 1
          gb2312 106
             gbk 1
      iso-8859-1 196
     iso-8859-15 4
      iso-8859-2 23
      iso-8859-9 8
            none 1
   pt-iso-8859-1 1
       shift_jis 24
           utf-8 67
    windows-1250 15
    windows-1251 29
    windows-1252 17
    windows-1254 22
    windows-1255 12
   windows-1255; 1
    windows-1256 3
     windows-874 1
     windows-932 1
          x-sjis 3

(The pages' charset distribution is as in 
http://philip.html5.org/data/charsets.html - most significantly, the 106 
gb2312 pages above are about one in eight of all the gb2312 pages.)
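
For anyone who wants to repeat the count, the steps above could be 
sketched roughly like this in Python. This is not the script I actually 
ran: the function names and the (charset, raw_bytes) input format are 
assumptions made for the example, and I added a capture group to the 
regex so the "mailto:" links can be skipped. Charset labels are counted 
as declared, only lowercased, which is why variants like euc-kr and 
euc_kr stay separate.

```python
import re
from collections import Counter

# Byte-level form of the regex quoted above, with a capture group around
# the attribute value (up to the last non-ASCII byte) so mailto: links
# can be filtered out afterwards.
HREF_QUERY_NON_ASCII = re.compile(
    rb'(?i)href\s*=\s*"([^"]*\?[^"]*[^\x00-\x7f])'
)


def has_non_ascii_query_href(raw: bytes) -> bool:
    """True if the undecoded page has a double-quoted href whose query
    part contains a raw non-ASCII byte, ignoring mailto: links."""
    for match in HREF_QUERY_NON_ASCII.finditer(raw):
        value = match.group(1).lstrip().lower()
        if not value.startswith(b"mailto:"):
            return True
    return False


def count_pages_by_charset(pages):
    """pages: iterable of (declared_charset_or_None, raw_bytes) pairs.

    Returns a Counter mapping the declared charset (lowercased, or
    "none" when neither HTTP nor <meta> gave one) to the number of
    matching pages.
    """
    counts = Counter()
    for charset, raw in pages:
        if has_non_ascii_query_href(raw):
            counts[(charset or "none").lower()] += 1
    return counts


if __name__ == "__main__":
    sample = [
        ("gb2312", b'<a href="/s?q=\xd6\xd0">x</a>'),
        ("utf-8", b'<a href="/s?q=abc">x</a>'),
        (None, b'<a HREF = "mailto:a@b.c?subject=\xe9">m</a>'),
    ]
    for charset, n in sorted(count_pages_by_charset(sample).items()):
        print(f"{charset:>16} {n}")
```

Note that &-escaped characters such as &eacute; are plain ASCII in the 
byte stream, so they are not counted, matching the description above.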

Is that kind of what you were looking for?

Philip Taylor
Received on Monday, 30 June 2008 14:23:24 UTC
