- From: Erik van der Poel <erikv@google.com>
- Date: Fri, 22 Aug 2008 08:30:50 -0700
- To: "Martin Duerst" <duerst@it.aoyama.ac.jp>
- Cc: janssen@parc.xerox.com, www-international@w3.org
Hello Martin,

On Thu, Aug 21, 2008 at 9:21 PM, Martin Duerst <duerst@it.aoyama.ac.jp> wrote:
> Just in case you happen to have this data around, or you happen
> to run another count, in a previous discussion (not sure it was
> on this mailing list), the question came up about the percentage
> of 'illegal' URIs in HTML, including such things as incomplete
> %-encodings and the like. Any data available?

No, sorry, I don't have data on incomplete %-encodings. I'm not sure
what you include in "the like". If I were to write such a program,
we'd have to agree on those details first. :-)

> At 05:26 08/08/13, Erik van der Poel wrote:
>>URLs "travel over the Web" in a number of different directions and
>>contexts, and the proportion of URLs that contain %-escaped non-UTF-8
>>depends on the context. Around May 2007, in the context of HTML
>>attribute values that normally carry URLs (e.g. "href" in the "a"
>>tag), we found the following proportions in a sample of Google's index
>>("raw" means not %-escaped):
>>
>>1.2% non-ascii query
>>0.74% escaped non-ascii query
>>0.44% escaped non-utf-8 query
>>0.48% raw non-ascii query
>>0.44% raw non-utf-8 query
>>
>>1.1% non-ascii path
>>0.9% escaped non-ascii path
>>0.18% escaped non-utf-8 path
>
> So 80% of non-ASCII paths use UTF-8 for escaping, i.e.
> can be converted to IRIs.

Yes.

>>0.2% raw non-ascii path
>>0.099% raw non-utf-8 path
>
> I'm not sure I understand this. Is this encoding derived
> from the containing page?

Yes.

> In that case, it's irrelevant,
> because even raw non-utf-8 web addresses can be totally
> fine IRIs in the right context.

That data may not be relevant to the subject of this thread, but I
generated it for other reasons at the time. I agree that raw non-UTF-8
paths are not particularly interesting now that Firefox 3 is
converting those to escaped UTF-8 (just like MSIE 6 and 7 modulo
localized versions that may have a different setting). However, raw
non-UTF-8 query parts are very important to us because the major
browsers convert them to Unicode and then back to the non-UTF-8
encoding.

> Or is the encoding derived
> by explicit lookup (meaning that 5% of raw non-ASCII paths
> have to be looked up in the (non-UTF-8) page encoding to
> retrieve something, whereas they result in a 404 when
> converted to UTF-8)?

No, none of these URLs were looked up in this count.

Erik
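
For illustration only, the categories used in the counts above (raw vs. %-escaped, non-ASCII vs. non-UTF-8, plus the incomplete %-encodings Martin asks about) could be checked roughly as sketched below. This is not the program used for the 2007 count. The function name `classify` and the label strings are made up here, and the sketch assumes the path or query is available as bytes, before any decoding with the page encoding. It also treats each run of consecutive %-escapes on its own, so a UTF-8 sequence split between an escape and a raw byte would be reported as non-UTF-8.

```python
import re
from urllib.parse import unquote_to_bytes

# One or more consecutive, well-formed %XX escapes.
_ESCAPE_RUN = re.compile(rb"(?:%[0-9A-Fa-f]{2})+")
# A '%' that is not followed by two hex digits (an incomplete %-encoding).
_BAD_ESCAPE = re.compile(rb"%(?![0-9A-Fa-f]{2})")


def is_utf8(data: bytes) -> bool:
    """Return True if the bytes form a valid UTF-8 sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def classify(component: bytes) -> set:
    """Label a URL path or query (hypothetical labels, for illustration)."""
    labels = set()

    # Raw: bytes >= 0x80 that appear without %-escaping.
    if any(b >= 0x80 for b in component):
        labels.add("raw non-ascii")
        # ASCII bytes are always valid UTF-8, so validity of the whole
        # component depends only on the raw high bytes.
        if not is_utf8(component):
            labels.add("raw non-utf-8")

    # Escaped: decode each run of %XX escapes and inspect the bytes.
    for match in _ESCAPE_RUN.finditer(component):
        decoded = unquote_to_bytes(match.group(0))
        if any(b >= 0x80 for b in decoded):
            labels.add("escaped non-ascii")
            if not is_utf8(decoded):
                labels.add("escaped non-utf-8")

    if _BAD_ESCAPE.search(component):
        labels.add("incomplete escape")

    return labels
```

With these assumptions, `classify(b"caf%C3%A9")` is labeled only as escaped non-ASCII, while `classify(b"caf%E9")` (Latin-1 bytes, escaped) is additionally labeled as escaped non-UTF-8.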
Received on Friday, 22 August 2008 15:31:37 UTC