Re: URIs in HTML5 and issues arising

Ian Hickson wrote:
>> With UTF-8/percent-escaping, the page may very well work as desired, 
>> because the server happens to understand that encoding
> 
> There is no question that always using UTF-8 would be better than the 
> current mess.

Yes.

> 
>> (see Google case cited in Webkit bug report).
> 
> Do you mean the case that gets converted to &#...;? That's not UTF-8.

Nope. I meant UTF-8 encoded, then percent-escaped.
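
To make "UTF-8 encoded, then percent-escaped" concrete, here is a minimal 
sketch in Python. The sample character is my own choice for illustration and 
does not come from the bug report:

```python
from urllib.parse import quote

# Step 1: encode the character as UTF-8 bytes.
ch = "\u00e4"  # LATIN SMALL LETTER A WITH DIAERESIS, chosen for illustration
utf8_bytes = ch.encode("utf-8")

# Step 2: percent-escape each of those bytes.
escaped = quote(utf8_bytes)

print(escaped)  # %C3%A4
```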

> (If you mean something else, could you provide a link?)

<https://bugs.webkit.org/show_bug.cgi?id=15119#c1>:

"When the character is not representable, Firefox falls back on UTF-8, 
which amusingly gives the "correct" answer for Google. IE and Safari 
both substitute a literal question mark for the invalid character."
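
The two behaviors described in that comment can be sketched roughly as 
follows. This is only an illustration, not any browser's actual code: the 
function name, the `fallback` parameter, and the single-character scope are 
assumptions of mine.

```python
from urllib.parse import quote


def escape_query_char(ch: str, doc_encoding: str, fallback: str) -> str:
    """Percent-escape one query character for a page in doc_encoding.

    fallback="utf-8" models the Firefox behavior quoted above;
    fallback="question-mark" models the IE/Safari behavior.
    Both names are hypothetical labels for this sketch.
    """
    try:
        raw = ch.encode(doc_encoding)
    except UnicodeEncodeError:
        # The character is not representable in the document encoding.
        if fallback == "utf-8":
            raw = ch.encode("utf-8")
        else:
            raw = b"?"
    # Keep a substituted "?" literal instead of escaping it to %3F.
    return quote(raw, safe="?")


sample = "\u4e2d"  # a CJK character, not representable in ISO-8859-1
print(escape_query_char(sample, "iso-8859-1", "utf-8"))          # %E4%B8%AD
print(escape_query_char(sample, "iso-8859-1", "question-mark"))  # ?
```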

>> Finally, if you copy & paste the URL, you wouldn't see the replacement 
>> characters anyway, right? In which case the default handling (using 
>> UTF-8) would apply, which is all the more reason to consider making this 
>> mandatory (because otherwise following the link inside the document and 
>> the copy/paste case yield different results).
> 
> Having the encoding be essentially random is far worse than converting the 
> character to a question mark, IMHO.

It wouldn't be random.

> Anyway, the whole issue is easily avoided by authors by just using UTF-8. 
> This entire problem can only be reached in invalid documents anyway.

The problem I'd like to solve is that you can't use unescaped non-ASCII 
query parameters in non-UTF-8 pages, and that even if you use UTF-8, the 
page may be broken when it is recoded.
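
As a rough illustration of why recoding matters (a sketch that assumes the 
query is escaped using the document encoding, with a sample character of my 
own choosing), the same unescaped character produces different bytes on the 
wire depending on the page's encoding:

```python
from urllib.parse import quote

ch = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE, chosen for illustration

# Same character, escaped according to two different document encodings.
as_latin1 = quote(ch, encoding="iso-8859-1")
as_utf8 = quote(ch, encoding="utf-8")

print(as_latin1)  # %E9
print(as_utf8)    # %C3%A9
```

A server expecting one of these forms will not recognize the other, so 
recoding the page silently changes what the link identifies.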

>>>> I care because I'd like to see documents using non-ASCII characters 
>>>> in query parts become compliant no matter what encoding they are in.
>>> Unless we change the definition of HTML5's URLs to be conforming even 
>>> when those URLs would not be treated as IRIs, I don't see any way to 
>>> get there from here.
>> We could break the affected pages and/or add a mechanism through which 
>> pages can opt in to the sane UTF-8-based behavior.
> 
> Breaking the pages isn't an option, and an opt-in is already available: 
> use UTF-8. This issue is not even remotely important enough on the grand 
> scale of things to deserve special syntax or options or whatnot.

I think it is a very big issue if the interpretation of an identifier is 
subject to the document encoding; and yes, even if that is the case for 
"invalid pages".

The affected pages are already sort of broken, because the "URLs" they 
contain do not survive copy/paste (into email), and probably do not 
survive bookmarking either. We really should consider breaking them 
totally to get out of the mess.

 > ...

BR, Julian

Received on Monday, 30 June 2008 07:11:31 UTC