Re: are there lots of URLs that have non-UTF-8 percent-encoded octets in them?

Hello Martin,

On Thu, Aug 21, 2008 at 9:21 PM, Martin Duerst <duerst@it.aoyama.ac.jp> wrote:
> Just in case you happen to have this data around, or you happen
> to run another count, in a previous discussion (not sure it was
> on this mailing list), the question came up about the percentage
> of 'illegal' URIs in HTML, including such things as incomplete
> %-encodings and the like. Any data available?

No, sorry, I don't have data on incomplete %-encodings. I'm not sure
what you include in "the like". Before I could write a program to
count such URIs, we'd have to agree on those details first. :-)
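
To make the "incomplete %-encodings" case concrete, here is a minimal sketch of one possible check. It is not the program used for any of the counts in this thread. It assumes that "incomplete" means a "%" that is not followed by exactly two hex digits; the function name is made up for illustration.

```python
import re

# Assumption: an escape is "incomplete" when '%' is not followed by
# two hexadecimal digits (e.g. "%2", "%zz", or a trailing "%").
_INCOMPLETE_ESCAPE = re.compile(r"%(?![0-9A-Fa-f]{2})")


def has_incomplete_escape(url: str) -> bool:
    """Return True if the URL contains at least one malformed %-escape."""
    return _INCOMPLETE_ESCAPE.search(url) is not None


if __name__ == "__main__":
    for candidate in ["/a%20b", "/a%2", "/100%", "/a%zz", "/caf%C3%A9"]:
        print(candidate, has_incomplete_escape(candidate))
```

Other kinds of "illegal" URIs (spaces, control characters, and so on) would need their own rules, which is exactly the part that would have to be agreed on.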

> At 05:26 08/08/13, Erik van der Poel wrote:
>>URLs "travel over the Web" in a number of different directions and
>>contexts, and the proportion of URLs that contain %-escaped non-UTF-8
>>depends on the context. Around May 2007, in the context of HTML
>>attribute values that normally carry URLs (e.g. "href" in the "a"
>>tag), we found the following proportions in a sample of Google's index
>>("raw" means not %-escaped):
>>
>>1.2% non-ascii query
>>0.74% escaped non-ascii query
>>0.44% escaped non-utf-8 query
>>0.48% raw non-ascii query
>>0.44% raw non-utf-8 query
>>
>>1.1% non-ascii path
>>0.9% escaped non-ascii path
>>0.18% escaped non-utf-8 path
>
> So 80% of non-ASCII paths use UTF-8 for escaping, i.e.
> can be converted to IRIs.

Yes.
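
For reference, the 80% follows from the two path figures quoted above: 0.18% of the 0.9% escaped non-ASCII paths are non-UTF-8, so 1 - 0.18/0.9 = 0.8. Below is a rough sketch of how one might sort %-escaped components into those buckets. It is an illustration under my own assumptions, not the code behind the numbers: the function name is hypothetical, and the input is assumed to contain only ASCII characters plus %-escapes.

```python
from urllib.parse import unquote_to_bytes


def classify_escaped(component: str) -> str:
    """Classify the octets behind the %-escapes of an ASCII-only component.

    Returns 'ascii', 'utf-8' or 'non-utf-8'.
    """
    octets = unquote_to_bytes(component)
    if all(b < 0x80 for b in octets):
        return "ascii"
    try:
        octets.decode("utf-8")
    except UnicodeDecodeError:
        return "non-utf-8"
    return "utf-8"


# Share of escaped non-ASCII paths whose escapes decode as UTF-8,
# using the percentages quoted in this thread.
escaped_non_ascii_path = 0.9
escaped_non_utf8_path = 0.18
utf8_share = 1 - escaped_non_utf8_path / escaped_non_ascii_path

if __name__ == "__main__":
    print(classify_escaped("/caf%C3%A9"))  # UTF-8 octets for U+00E9
    print(classify_escaped("/caf%E9"))     # a lone 0xE9 is not valid UTF-8
    print(round(utf8_share, 2))
```

Note that "decodes as UTF-8" is only a heuristic: a short byte sequence in another encoding can occasionally be valid UTF-8 by accident.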

>>0.2% raw non-ascii path
>>0.099% raw non-utf-8 path
>
> I'm not sure I understand this. Is this encoding derived
> from the containing page?

Yes.

> In that case, it's irrelevant,
> because even raw non-utf-8 web addresses can be totally
> fine IRIs in the right context.

That data may not be relevant to the subject of this thread, but I
generated it for other reasons at the time. I agree that raw non-UTF-8
paths are not particularly interesting now that Firefox 3 is
converting those to escaped UTF-8 (as MSIE 6 and 7 do, apart from
localized versions that may have a different setting). However, raw
non-UTF-8 query parts are very important to us because the major
browsers convert them to Unicode and then back to the non-UTF-8
encoding.
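
To illustrate the difference between the two parts, here is a small sketch of the behavior described above: the path is escaped as UTF-8, while the query is encoded back into the page's encoding. This is a simplified model, not any browser's actual code. The function name and the "safe" character sets are my assumptions, as is the %-escaping of the resulting query octets, and characters that the page encoding cannot represent simply raise an error here.

```python
from urllib.parse import quote


def escape_like_browser(path: str, query: str, page_encoding: str) -> str:
    """Escape a raw non-ASCII path as UTF-8 and the query in the page encoding."""
    # Path: always converted to escaped UTF-8.
    escaped_path = quote(path, safe="/", encoding="utf-8")
    # Query: converted from Unicode back to the page's own encoding.
    escaped_query = quote(query, safe="=&", encoding=page_encoding)
    return f"{escaped_path}?{escaped_query}"


if __name__ == "__main__":
    # The same character ends up as different octets in path and query.
    print(escape_like_browser("/café", "q=café", "iso-8859-1"))
    # → /caf%C3%A9?q=caf%E9
```

This is why a server receiving the query cannot assume UTF-8 there, even when the path is reliably UTF-8.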

> Or is the encoding derived
> by explicit lookup (meaning that 5% of raw non-ASCII paths
> have to be looked up in the (non-UTF-8) page encoding to
> retrieve something, whereas they result in a 404 when
> converted to UTF-8)?

No, none of these URLs were looked up in this count.

Erik

Received on Friday, 22 August 2008 15:31:37 UTC