- From: Erik van der Poel <erikv@google.com>
- Date: Fri, 22 Aug 2008 08:30:50 -0700
- To: "Martin Duerst" <duerst@it.aoyama.ac.jp>
- Cc: janssen@parc.xerox.com, www-international@w3.org
Hello Martin,
On Thu, Aug 21, 2008 at 9:21 PM, Martin Duerst <duerst@it.aoyama.ac.jp> wrote:
> Just in case you happen to have this data around, or you happen
> to run another count, in a previous discussion (not sure it was
> on this mailing list), the question came up about the percentage
> of 'illegal' URIs in HTML, including such things as incomplete
> %-encodings and the like. Any data available?
No, sorry, I don't have data on incomplete %-encodings. I'm not sure
what you include in "the like". If I were to write such a program,
we'd have to agree on those details first. :-)
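For illustration only, one candidate definition of an "incomplete %-encoding" could be sketched as below. This is an assumption rather than an agreed definition, and it is not the program behind any count in this thread; the name `has_incomplete_escape` is hypothetical.

```python
import re

# Assumed definition: a "%" that is NOT followed by two hexadecimal digits.
# Other definitions are possible and would need to be agreed on first.
_INCOMPLETE_ESCAPE = re.compile(r"%(?![0-9A-Fa-f]{2})")


def has_incomplete_escape(url: str) -> bool:
    """Return True if the URL contains a "%" without two hex digits after it."""
    return _INCOMPLETE_ESCAPE.search(url) is not None
```

Under this definition "/a%20b" passes, while "/a%2", "/100%" and "/%zz" are flagged.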
> At 05:26 08/08/13, Erik van der Poel wrote:
>>URLs "travel over the Web" in a number of different directions and
>>contexts, and the proportion of URLs that contain %-escaped non-UTF-8
>>depends on the context. Around May 2007, in the context of HTML
>>attribute values that normally carry URLs (e.g. "href" in the "a"
>>tag), we found the following proportions in a sample of Google's index
>>("raw" means not %-escaped):
>>
>>1.2% non-ascii query
>>0.74% escaped non-ascii query
>>0.44% escaped non-utf-8 query
>>0.48% raw non-ascii query
>>0.44% raw non-utf-8 query
>>
>>1.1% non-ascii path
>>0.9% escaped non-ascii path
>>0.18% escaped non-utf-8 path
>
> So 80% of non-ASCII paths use UTF-8 for escaping, i.e.
> can be converted to IRIs.
Yes.
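The 80% figure follows from the escaped-path numbers quoted above: (0.9 - 0.18) / 0.9 = 0.8. A minimal sketch of how a path or query string might be sorted into such categories is given below. It is not the code used for the 2007 count; the function name `classify` and the category labels are hypothetical, and it does not try to tell raw UTF-8 from raw non-UTF-8, since that depends on the encoding of the containing page.

```python
from urllib.parse import unquote_to_bytes


def classify(component: str) -> str:
    """Classify a URL path or query string (hypothetical categories)."""
    # Any character above U+007F means the text was not %-escaped ("raw").
    if any(ord(ch) > 127 for ch in component):
        return "raw non-ascii"
    # Undo the %-escapes and look at the resulting bytes.
    raw = unquote_to_bytes(component)
    if all(b < 0x80 for b in raw):
        return "ascii"
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return "escaped non-utf-8"
    return "escaped utf-8"


# Share of escaped non-ASCII paths that are valid UTF-8, from the quoted data.
escaped_non_ascii_path = 0.9   # percent of sampled URLs
escaped_non_utf8_path = 0.18   # percent of sampled URLs
utf8_share = (escaped_non_ascii_path - escaped_non_utf8_path) / escaped_non_ascii_path
```

For instance, `classify("/caf%C3%A9")` gives "escaped utf-8", while `classify("/caf%E9")` gives "escaped non-utf-8" because the lone byte 0xE9 is not valid UTF-8.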
>>0.2% raw non-ascii path
>>0.099% raw non-utf-8 path
>
> I'm not sure I understand this. Is this encoding derived
> from the containing page?
Yes.
> In that case, it's irrelevant,
> because even raw non-utf-8 web addresses can be totally
> fine IRIs in the right context.
That data may not be relevant to the subject of this thread, but I
generated it for other reasons at the time. I agree that raw non-UTF-8
paths are not particularly interesting now that Firefox 3 converts
them to escaped UTF-8 (as MSIE 6 and 7 do, apart from localized
versions that may have a different setting). However, raw
non-UTF-8 query parts are very important to us because the major
browsers convert them to Unicode and then back to the non-UTF-8
encoding.
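The difference between the two cases can be sketched roughly as follows. This is a simplified illustration, not any browser's actual code: the function name `escape_like_browser` is hypothetical, the caller is assumed to supply the page encoding, and the %-escaping of the re-encoded query bytes is an assumption of the sketch.

```python
from urllib.parse import quote


def escape_like_browser(path: str, query: str, page_encoding: str) -> str:
    """Escape an already-decoded (Unicode) path and query, per the text above."""
    # Path: escaped as UTF-8, whatever the encoding of the containing page.
    escaped_path = quote(path, safe="/", encoding="utf-8")
    # Query: converted back to the page's own encoding before escaping.
    escaped_query = quote(query, safe="=&", encoding=page_encoding)
    return escaped_path + "?" + escaped_query
```

With a page in ISO-8859-1, `escape_like_browser("/café", "q=café", "iso-8859-1")` yields "/caf%C3%A9?q=caf%E9": the same character ends up as two different byte sequences in the path and in the query.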
> Or is the encoding derived
> by explicit lookup (meaning that 5% of raw non-ASCII paths
> have to be looked up in the (non-UTF-8) page encoding to
> retrieve something, whereas they result in a 404 when
> converted to UTF-8)?
No, none of these URLs were looked up in this count.
Erik
Received on Friday, 22 August 2008 15:31:37 UTC