- From: Ian Hickson <ian@hixie.ch>
- Date: Tue, 24 Jun 2008 19:43:08 +0000 (UTC)
- To: Julian Reschke <julian.reschke@gmx.de>
- Cc: uri@w3.org
On Tue, 24 Jun 2008, Julian Reschke wrote:
> Ian Hickson wrote:
> > > Could you please be more specific? Any URI is a IRI, so a query component
> > > based on an encoding other than UTF-8 still is a legal IRI.
> >
> > The IRI spec would have the query component always encoded as UTF-8, as I
> > understand it.
>
> IRIs consist of Unicode characters. UTF-8 only enters the picture when
> an IRI is converted to a URI.
>
> If you start with a URI *or* IRI, and then append query parameters, you
> always have the choice not to use non-ASCII characters, and to decide
> yourself what character encoding to use before percent-escaping.

Sure. But in this document:

   <!DOCTYPE HTML>
   <title>Test</title>
   <meta charset="ISO-8859-13">
   <a href="results.cgi/Ž?Ž">Link</a>

...what is the link? It's not a URI, as it contains non-ASCII characters.
It could be an IRI, but compatibility with Web content requires that it
not be treated per the IRI spec. Safari, for instance, will fetch the
following URI (assuming the base URL is http://example.com/):

   http://example.com/results.cgi/%C5%BD?%DE

It yields a valid URI, but somewhere we have to define the processing that
led to two characters in the same URL being encoded using two different
character encodings. Right now, URI and IRI don't define this, so I'm
defining it in the HTML5 spec. This is unfortunately requiring very
intimate interaction with the parsing rules of URIs, which is far less
orthogonality than I would like.

IE actually sends

   http://example.com/results.cgi/%C5%BD?*

where "*" is the ISO-8859-13-encoded 8-bit byte for that character. If
you target an iframe, IE uses the encoding of the iframe.

You can see some of these behaviours if you look at these tests:

   http://hixie.ch/tests/adhoc/uri/encoding/

> Now, that being said, is there anything HTML5 could do so we can get
> closer to a strict UTF-8 world in the future? Such as allowing servers
> to serve document in an encoding != UTF-8, but still get query
> parameters to be consistently encoded in UTF-8?

There might be, but I don't see any way to get there at the moment. Any
suggestions would be very welcome.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
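
The Safari behaviour described in the message (path percent-encoded as
UTF-8, query percent-encoded in the document's character encoding) can be
illustrated with a rough sketch. This is not the HTML5 algorithm and not
any browser's actual code; the function name resolve_like_safari and its
parameters are hypothetical, and the base URL is simply concatenated
rather than resolved properly.

```python
from urllib.parse import quote


def resolve_like_safari(base, path, query, doc_encoding):
    """Hypothetical helper: mimic the mixed-encoding result from the message.

    Assumption: the path is always percent-encoded from UTF-8 bytes, while
    the query is percent-encoded from bytes in the document's encoding.
    """
    # Path component: encode as UTF-8, then percent-escape the bytes.
    encoded_path = quote(path, safe="/", encoding="utf-8")
    # Query component: encode in the document encoding, then percent-escape.
    encoded_query = quote(query, safe="=&", encoding=doc_encoding)
    # Naive concatenation; real resolution against a base URL is more involved.
    return f"{base}{encoded_path}?{encoded_query}"


url = resolve_like_safari(
    "http://example.com/", "results.cgi/Ž", "Ž", "iso-8859-13"
)
print(url)  # http://example.com/results.cgi/%C5%BD?%DE
```

The same character ends up as %C5%BD in the path and %DE in the query,
which is the two-encodings-in-one-URL situation the message says neither
the URI nor the IRI spec currently defines.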
Received on Tuesday, 24 June 2008 19:43:43 UTC