- From: Julian Reschke <julian.reschke@gmx.de>
- Date: Mon, 30 Jun 2008 09:10:35 +0200
- To: Ian Hickson <ian@hixie.ch>
- CC: uri@w3.org, HTML WG <public-html@w3.org>
Ian Hickson wrote:
>> With UTF-8/percent-escaping, the page may very well work as desired,
>> because the server happens to understand that encoding
>
> There is no question that always using UTF-8 would be better than the
> current mess.

Yes.

>> (see Google case cited in Webkit bug report).
>
> Do you mean the case that gets converted to &#...;? That's not UTF-8.

Nope. I meant UTF-8 encoded, then percent-escaped.

> (If you mean something else, could you provide a link?)

<https://bugs.webkit.org/show_bug.cgi?id=15119#c1>:

"When the character is not representable, Firefox falls back on UTF-8,
which amusingly gives the "correct" answer for Google. IE and Safari
both substitute a literal question mark for the invalid character."

>> Finally, if you copy & paste the URL, you wouldn't see the replacement
>> characters anyway, right? In which case the default handling (using
>> UTF-8) would apply; which even more is a reason to consider making this
>> mandatory (because otherwise following the link inside the document and
>> the copy/paste case yield different results).
>
> Having the encoding be essentially random is far worse than converting the
> character to a question mark, IMHO.

It wouldn't be random.

> Anyway, the whole issue is easily avoided by authors by just using UTF-8.
> This entire problem can only be reached in invalid documents anyway.

The problem I'd like to solve is that you can't use unescaped non-ASCII
query parameters in non-UTF8 pages, and also, if you use UTF8 the page
may be broken by recoding.

>>>> I care because I'd like to see documents using non-ASCII characters
>>>> in query parts become compliant no matter what encoding they are in.
>>> Unless we change the definition of HTML5's URLs to be conforming even
>>> when those URLs would not be treated as IRIs, I don't see any way to
>>> get there from here.
>> We could break the affected pages and/or add a mechanism through which
>> pages can opt-in into the sane UTF-8 based behavior.
>
> Breaking the pages isn't an option, and an opt-in is already available:
> use UTF-8. This issue is not even remotely important enough on the grand
> scale of things to deserve special syntax or options or whatnot.

I think it is a very big issue if the interpretation of an identifier is
subject to the document encoding; and yes, even if that is the case for
"invalid pages".

The affected pages are already sort of broken, because the "URLs" they
contain do not survive copy/paste (into email), and probably also not
bookmarking. We really should consider breaking them totally to get out
of the mess.

> ...

BR, Julian
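
To make the behaviours discussed above concrete, here is a minimal sketch, not a description of any browser's actual code. It assumes Python's urllib.parse.quote as a stand-in for a browser's percent-escaping; the function name encode_query_value and the fallback labels "utf-8" and "question-mark" are hypothetical names chosen for illustration. The UTF-8 fallback is modelled as re-encoding the whole value, which is also an assumption.

```python
from urllib.parse import quote


def encode_query_value(value: str, page_encoding: str, fallback: str) -> str:
    """Percent-escape a query value using the document's encoding.

    fallback (hypothetical labels) selects what happens when a character
    is not representable in page_encoding:
      "utf-8"         -> re-encode the whole value as UTF-8 (assumption)
      "question-mark" -> substitute a literal '?' for the character
    """
    try:
        raw = value.encode(page_encoding)
    except UnicodeEncodeError:
        if fallback == "utf-8":
            raw = value.encode("utf-8")
        else:
            # Python's "replace" handler emits b"?" for each bad character.
            raw = value.encode(page_encoding, errors="replace")
    # Keep '?' literal so the substitution stays visible in the result.
    return quote(raw, safe="?")


# The same character yields different URLs depending on the page encoding:
print(encode_query_value("é", "utf-8", "utf-8"))        # %C3%A9
print(encode_query_value("é", "iso-8859-1", "utf-8"))   # %E9

# A character that ISO-8859-1 cannot represent:
print(encode_query_value("日", "iso-8859-1", "utf-8"))          # %E6%97%A5
print(encode_query_value("日", "iso-8859-1", "question-mark"))  # ?
```

The first two calls show the dependency on the document encoding: the identical link text produces %C3%A9 in a UTF-8 page and %E9 in an ISO-8859-1 page, so recoding the page changes what the server receives.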
Received on Monday, 30 June 2008 07:11:31 UTC