- From: Martin Duerst <duerst@it.aoyama.ac.jp>
- Date: Fri, 22 Aug 2008 13:21:06 +0900
- To: "Erik van der Poel" <erikv@google.com>, janssen@parc.xerox.com
- Cc: www-international@w3.org
Hello Erik,

Great to see your data. In case you still have this data around, or
happen to run another count: in a previous discussion (not sure it
was on this mailing list), the question came up about the percentage of
'illegal' URIs in HTML, including such things as incomplete %-encodings
and the like. Any data available?

At 05:26 08/08/13, Erik van der Poel wrote:
>
>Hi Bill,
>
>URLs "travel over the Web" in a number of different directions and
>contexts, and the proportion of URLs that contain %-escaped non-UTF-8
>depends on the context. Around May 2007, in the context of HTML
>attribute values that normally carry URLs (e.g. "href" in the "a"
>tag), we found the following proportions in a sample of Google's index
>("raw" means not %-escaped):
>
>1.2% non-ascii query
>0.74% escaped non-ascii query
>0.44% escaped non-utf-8 query
>0.48% raw non-ascii query
>0.44% raw non-utf-8 query
>
>1.1% non-ascii path
>0.9% escaped non-ascii path
>0.18% escaped non-utf-8 path

So 80% of escaped non-ASCII paths use UTF-8 for escaping, i.e. can be
converted to IRIs.

>0.2% raw non-ascii path
>0.099% raw non-utf-8 path

I'm not sure I understand this. Is this encoding derived from the
containing page? In that case, it's irrelevant, because even raw
non-utf-8 web addresses can be totally fine IRIs in the right context.
Or is the encoding derived by explicit lookup (meaning that 5% of raw
non-ASCII paths have to be looked up in the (non-UTF-8) page encoding
to retrieve something, whereas they result in a 404 when converted
to UTF-8)?

[similar questions apply to the other parts of a URI]

Regards, Martin.

#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp
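
For readers who want to reproduce counts of this kind, here is a minimal sketch of how a single path or query component might be sorted into the categories discussed above. It is not the method used for the quoted figures: the function name `classify_component` and the category labels are hypothetical, and it only looks at the component itself, not at the encoding of the containing page.

```python
import re
from urllib.parse import unquote_to_bytes


def classify_component(component: str) -> str:
    """Roughly classify one URL path or query component (illustrative only)."""
    # A '%' not followed by two hex digits is an incomplete %-encoding.
    if re.search(r"%(?![0-9A-Fa-f]{2})", component):
        return "malformed escape"
    # Any character above U+007F appearing directly counts as "raw".
    if any(ord(ch) > 0x7F for ch in component):
        return "raw non-ascii"
    # Undo the %-escapes and inspect the resulting bytes.
    raw = unquote_to_bytes(component)
    if all(b < 0x80 for b in raw):
        return "ascii"
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return "escaped non-utf-8"
    return "escaped utf-8"


for sample in ["/a%20b", "/%E3%81%82", "/caf%E9", "/caf\u00e9", "/bad%2"]:
    print(sample, "->", classify_component(sample))
```

Note that a byte sequence in a legacy encoding can occasionally also be valid UTF-8, so a check like this can only give a lower bound on the non-UTF-8 share.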
Received on Friday, 22 August 2008 08:00:18 UTC