Re: are there lots of URLs that have non-UTF-8 percent-encoded octets in them?

Hello Erik,

Great to see your data.

In case you have this data around, or you happen to run another
count: in a previous discussion (I'm not sure it was on this
mailing list), the question came up of what percentage of URIs
in HTML are 'illegal', including such things as incomplete
%-encodings and the like. Is any data available?
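
To make "incomplete %-encodings" concrete, here is a minimal sketch of
one way such a check could be done. It assumes "incomplete" means a "%"
that is not followed by two hex digits; the function name is
hypothetical and this is not necessarily how any existing count was
produced.

```python
import re

# A "%" is well-formed only when it is followed by two hex digits.
_BAD_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")

def has_incomplete_escape(url: str) -> bool:
    """Return True if the URL contains a '%' not followed by two hex digits."""
    return _BAD_PERCENT.search(url) is not None
```

With this definition, "http://example.com/a%2" and "http://example.com/100%"
would be counted as illegal, while "http://example.com/a%2Fb" would not.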

At 05:26 08/08/13, Erik van der Poel wrote:
>Hi Bill,
>URLs "travel over the Web" in a number of different directions and
>contexts, and the proportion of URLs that contain %-escaped non-UTF-8
>depends on the context. Around May 2007, in the context of HTML
>attribute values that normally carry URLs (e.g. "href" in the "a"
>tag), we found the following proportions in a sample of Google's index
>("raw" means not %-escaped):
>1.2% non-ascii query
>0.74% escaped non-ascii query
>0.44% escaped non-utf-8 query
>0.48% raw non-ascii query
>0.44% raw non-utf-8 query
>1.1% non-ascii path
>0.9% escaped non-ascii path
>0.18% escaped non-utf-8 path

So 80% of escaped non-ASCII paths use UTF-8 for escaping
(0.18% out of 0.9% do not), i.e. they can be converted to IRIs.
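
The arithmetic, and the kind of test that could separate UTF-8 from
non-UTF-8 escapes, can be sketched as follows. This is only an
illustration under my own assumptions (a path counts as UTF-8 when its
unescaped octets decode as UTF-8); the helper name is hypothetical and
the actual classification used for the sample may differ.

```python
from urllib.parse import unquote_to_bytes

def escaped_path_is_utf8(path: str) -> bool:
    """Return True if the %-unescaped octets of the path decode as UTF-8.

    Note that a pure-ASCII path also returns True.
    """
    raw = unquote_to_bytes(path)
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

# Figures quoted above, in percent of all sampled URLs.
escaped_non_ascii_path = 0.9
escaped_non_utf8_path = 0.18

# Share of escaped non-ASCII paths whose escapes are UTF-8.
utf8_share = 1 - escaped_non_utf8_path / escaped_non_ascii_path
```

Here "/caf%C3%A9" passes the test and "/caf%E9" (a Latin-1 escape) does
not, and utf8_share comes out at about 0.8.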

>0.2% raw non-ascii path
>0.099% raw non-utf-8 path

I'm not sure I understand this. Is the encoding derived
from the containing page? In that case, the figure is irrelevant,
because even raw non-UTF-8 web addresses can be totally
fine IRIs in the right context. Or is the encoding derived
by explicit lookup (meaning that about 50% of raw non-ASCII paths,
0.099% out of 0.2%, have to be looked up in the (non-UTF-8) page
encoding to retrieve something, whereas they result in a 404 when
converted to UTF-8)?
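
To illustrate the two lookups I have in mind, here is a small sketch.
The example path and the assumed page encoding (ISO-8859-1) are made
up for illustration, and the helper name is hypothetical.

```python
from urllib.parse import quote

def candidate_request_paths(raw_path: str, page_encoding: str) -> dict:
    """Return the %-escaped forms of a raw path under two encodings."""
    return {
        # Octets taken from the encoding of the containing page.
        "page": quote(raw_path, encoding=page_encoding),
        # Octets taken from UTF-8, as in IRI-to-URI conversion.
        "utf-8": quote(raw_path, encoding="utf-8"),
    }

paths = candidate_request_paths("/café", "iso-8859-1")
# paths["page"]  is "/caf%E9"
# paths["utf-8"] is "/caf%C3%A9"
```

A server that stores its resources under Latin-1 names would answer the
first request and return a 404 for the second, which is the case I am
asking about.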

[similar questions apply to the other parts of a URI]

Regards,   Martin.

#-#-#  Martin J. Dürst, Assoc. Professor, Aoyama Gakuin University

Received on Friday, 22 August 2008 08:00:18 UTC