Re: URIs in HTML5 and issues arising

On Mon, 30 Jun 2008, Julian Reschke wrote:
>
> <https://bugs.webkit.org/show_bug.cgi?id=15119#c1>:
> 
> "When the character is not representable, Firefox falls back on UTF-8, 
> which amusingly gives the "correct" answer for Google. IE and Safari 
> both substitute a literal question mark for the invalid character."

That test case is bogus, as it doesn't include the ie= parameter to 
Google. Google supports a multitude of encodings in the URL, but the 
encoding has to be specified using the ie= parameter. If it's omitted, 
UTF-8 is assumed. If the test had set ie=big5, it would work fine:

   http://www.google.com/search?ie=big5&q=%EB%A5
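
A minimal sketch of the difference, in Python. The character is an 
arbitrary example (not the one from the test case), and server_decode 
is a hypothetical stand-in for the server side, not Google's code:

```python
from urllib.parse import quote, unquote

ch = "\u4e2d"  # arbitrary example character, not taken from the test case

# The same character yields different escapes under different encodings.
big5_q = quote(ch, encoding="big5")    # "%A4%A4"
utf8_q = quote(ch, encoding="utf-8")   # "%E4%B8%AD"


def search_url(text, ie):
    """Build a search URL that declares the encoding of q= via ie=."""
    return "http://www.google.com/search?ie=%s&q=%s" % (
        ie, quote(text, encoding=ie))


def server_decode(q, ie="utf-8"):
    """Hypothetical server side: decode q= using the declared encoding,
    assuming UTF-8 when ie= is omitted."""
    return unquote(q, encoding=ie, errors="replace")
```

With ie=big5 declared, server_decode(big5_q, "big5") recovers the 
original character. Without it, the same bytes are read as UTF-8 and 
come out as garbage.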
   

> > Having the encoding be essentially random is far worse than converting 
> > the character to a question mark, IMHO.
> 
> It wouldn't be random.

It might as well be, for all the server can tell. Unless you do something 
like what Google does (either explicitly include the encoding in the URL 
or define a server-expected encoding) the server has no way to know what 
encoding was used.

After all, removing that ambiguity is the whole point of IRIs and UTF-8.
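
To make the ambiguity concrete, here is a small Python sketch. The 
escaped byte and the list of candidate encodings are illustrative 
assumptions, not taken from any real server:

```python
from urllib.parse import unquote

# One escaped byte, as a server would see it in a query string.
escaped = "%E9"

# Encodings the sending page might plausibly have used (assumed list).
candidates = ["iso-8859-1", "windows-1251", "utf-8"]

# Each candidate gives a different reading of the same byte.
readings = {
    enc: unquote(escaped, encoding=enc, errors="replace")
    for enc in candidates
}

for enc, text in readings.items():
    print(enc, repr(text))
```

Nothing in the URL itself says which reading is the intended one.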


> > Anyway, the whole issue is easily avoided by authors by just using 
> > UTF-8. This entire problem can only be reached in invalid documents 
> > anyway.
> 
> The problem I'd like to solve is that you can't use unescaped non-ASCII 
> query parameters in non-UTF8 pages

You can; just don't use characters outside the repertoire of the 
character encoding that you are using. (As the author, you presumably have 
control over what characters you use.) One way to guarantee that is to not 
use HTML character references, since any character typed literally into 
the source is by definition representable in the document's encoding.
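
A rough Python sketch of that check. The document encoding and the two 
sample strings are assumptions chosen for illustration, and the two 
fallbacks only mimic the browser behaviours described in the WebKit bug 
comment quoted above:

```python
import html
from urllib.parse import quote


def representable(text, encoding):
    """Return True if every character of text can be encoded."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


doc_encoding = "iso-8859-1"   # assumed encoding of the page

literal = "caf\u00e9"                        # typed directly into the page
via_reference = html.unescape("&#x4E2D;")    # only reachable via a reference

# Fallback 1: substitute a question mark for the unencodable character.
question_mark = via_reference.encode(doc_encoding, errors="replace")

# Fallback 2: give up on the document encoding and use UTF-8 instead.
utf8_fallback = quote(via_reference, encoding="utf-8")
```

Here representable(literal, doc_encoding) is true, while the character 
introduced through a character reference is not representable and 
triggers one of the two fallbacks.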


> and also, if you use UTF8 the page may be broken by recoding.

Yes, character encoding conversion needs to be HTML-aware, and should 
convert non-ASCII URLs to ASCII URIs in the process. (Sadly this is 
non-trivial -- indeed, provably impossible in the general case -- for URLs 
in scripts.)
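
For URLs that appear statically in markup, the conversion could look 
something like the following Python sketch. It is not a complete 
implementation: to_ascii_uri is a hypothetical helper, the host is 
assumed to be ASCII already, existing %XX escapes are left alone, and 
the choice of UTF-8 for the path and fragment and the document encoding 
for the query is an assumption about how the page's links are resolved:

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Delimiters and existing escapes that must pass through unchanged.
SAFE = "%/:@!$&'()*+,;=?-._~"


def to_ascii_uri(url, doc_encoding):
    """Escape non-ASCII characters before the page is transcoded, so the
    URL no longer depends on the document's encoding."""
    parts = urlsplit(url)
    path = quote(parts.path, safe=SAFE, encoding="utf-8")
    # The query is escaped using the ORIGINAL document encoding.
    query = quote(parts.query, safe=SAFE, encoding=doc_encoding)
    fragment = quote(parts.fragment, safe=SAFE, encoding="utf-8")
    return urlunsplit(
        (parts.scheme, parts.netloc, path, query, fragment))
```

Once escaped this way, the link means the same thing whatever encoding 
the page is later converted to. A URL assembled at run time by a script 
never passes through such a step, which is where the approach stops 
working.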


> > > We could break the affected pages and/or add a mechanism through 
> > > which pages can opt-in into the sane UTF-8 based behavior.
> > 
> > Breaking the pages isn't an option, and an opt-in is already 
> > available: use UTF-8. This issue is not even remotely important enough 
> > on the grand scale of things to deserve special syntax or options or 
> > whatnot.
> 
> I think it is a very big issue if the interpretation of an identifier is 
> subject to the document encoding; and yes, even if that is the case for 
> "invalid pages".

It's a big issue when you look at it from the point of view of the URI 
spec, but when you look at the big picture, it's really not a big deal.


> The affected pages are already sort of broken, because the "URLs" they 
> contain do not survive copy/paste (into email), and probably also not 
> bookmarking. We really should consider breaking them totally to get out 
> of the mess.

Conforming validators will flag all such instances. That's as much 
attention as we've given any number of similar issues in the spec, and I 
don't see that this one is any more important.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Monday, 30 June 2008 08:03:46 UTC