- From: Ian Hickson <ian@hixie.ch>
- Date: Mon, 30 Jun 2008 08:03:08 +0000 (UTC)
- To: Julian Reschke <julian.reschke@gmx.de>
- Cc: uri@w3.org, HTML WG <public-html@w3.org>
On Mon, 30 Jun 2008, Julian Reschke wrote:
>
> <https://bugs.webkit.org/show_bug.cgi?id=15119#c1>:
>
> "When the character is not representable, Firefox falls back on UTF-8,
> which amusingly gives the "correct" answer for Google. IE and Safari
> both substitute a literal question mark for the invalid character."

That test case is bogus, as it doesn't include the ie= parameter to
Google. Google supports a multitude of encodings in the URL, but the
encoding has to be specified using the ie= parameter. If it's omitted,
UTF-8 is assumed. If it was set to ie=big5, then it would work fine:

   http://www.google.com/search?ie=big5&q=%EB%A5

> > Having the encoding be essentially random is far worse than converting
> > the character to a question mark, IMHO.
>
> It wouldn't be random.

It might as well be, for all the server can tell. Unless you do something
like what Google does (either explicitly include the encoding in the URL
or define a server-expected encoding) the server has no way to know what
encoding was used. After all, that's the whole point of IRIs and UTF-8.

> > Anyway, the whole issue is easily avoided by authors by just using
> > UTF-8. This entire problem can only be reached in invalid documents
> > anyway.
>
> The problem I'd like to solve is that you can't use unescaped non-ASCII
> query parameters in non-UTF8 pages

You can, just don't use characters outside of the character set of the
character encoding that you are using. (As the author, you presumably have
control over what characters you use.) One way to guarantee that is to not
use HTML character references.

> and also, if you use UTF8 the page may be broken by recoding.

Yes, character encoding conversion needs to be HTML-aware, and should
convert non-ASCII URLs to ASCII URIs in the process. (Sadly this is
non-trivial -- indeed, provably impossible in the general case -- for URLs
in scripts.)
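The point about the server being unable to tell which encoding was used can be illustrated with a minimal Python sketch. It is not taken from the thread: the helper names `encode_query_value` and `decode_query_value` are hypothetical, and Python's `urllib.parse` stands in for a browser here. One difference to keep in mind is that Python percent-encodes the substituted question mark as `%3F`, whereas the browsers quoted above are reported to emit a literal `?`.

```python
from urllib.parse import quote, unquote


def encode_query_value(text: str, page_encoding: str) -> str:
    """Percent-encode text using the page's character encoding.

    Hypothetical helper. Characters the encoding cannot represent are
    replaced with '?' before escaping (errors="replace").
    """
    return quote(text, safe="", encoding=page_encoding, errors="replace")


def decode_query_value(escaped: str, assumed_encoding: str) -> str:
    """Decode a percent-encoded value the way a server might, by guessing
    an encoding. Hypothetical helper."""
    return unquote(escaped, encoding=assumed_encoding, errors="replace")


# The same character yields different bytes depending on the page encoding.
utf8_form = encode_query_value("é", "utf-8")          # '%C3%A9'
latin1_form = encode_query_value("é", "iso-8859-1")   # '%E9'

# A character outside the page encoding is lost entirely.
lossy_form = encode_query_value("€", "iso-8859-1")    # '%3F'

# A server that guesses wrong gets the wrong text back.
wrong_guess = decode_query_value(utf8_form, "iso-8859-1")   # 'Ã©'
right_guess = decode_query_value(utf8_form, "utf-8")        # 'é'
```

Nothing in `%E9` or `%C3%A9` says which encoding produced it, which is why an explicit parameter such as Google's ie= or a fixed server-side convention is needed.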
> > > We could break the affected pages and/or add a mechanism through
> > > which pages can opt-in into the sane UTF-8 based behavior.
> >
> > Breaking the pages isn't an option, and an opt-in is already
> > available: use UTF-8. This issue is not even remotely important enough
> > on the grand scale of things to deserve special syntax or options or
> > whatnot.
>
> I think it is a very big issue if the interpretation of an identifier is
> subject to the document encoding; and yes, even if that is the case for
> "invalid pages".

It's a big issue when you look at it from the point of view of the URI
spec, but when you look at the big picture, it's really not a big deal.

> The affected pages are already sort of broken, because the "URLs" they
> contain do not survive copy/paste (into email), and probably also not
> bookmarking. We really should consider breaking them totally to get out
> of the mess.

Conforming validators will flag all such instances. That's as much as
we've given any number of similar issues in the spec, I don't see that
this one is any more important.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Monday, 30 June 2008 08:03:48 UTC