- From: Julian Reschke <julian.reschke@gmx.de>
- Date: Tue, 24 Jun 2008 21:57:22 +0200
- To: Ian Hickson <ian@hixie.ch>
- CC: uri@w3.org
Ian Hickson wrote:
> On Tue, 24 Jun 2008, Julian Reschke wrote:
>> Ian Hickson wrote:
>>>> Could you please be more specific? Any URI is an IRI, so a query component
>>>> based on an encoding other than UTF-8 still is a legal IRI.
>>> The IRI spec would have the query component always encoded as UTF-8, as I
>>> understand it.
>> IRIs consist of Unicode characters. UTF-8 only enters the picture when
>> an IRI is converted to a URI.
>>
>> If you start with a URI *or* IRI, and then append query parameters, you
>> always have the choice not to use non-ASCII characters, and to decide
>> yourself what character encoding to use before percent-escaping.
>
> Sure. But in this document:
>
>    <!DOCTYPE HTML>
>    <title>Test</title>
>    <meta charset="ISO-8859-13">
>    <a href="results.cgi/Ž?Ž">Link</a>
>
> ...what is the link? It's not a URI, as it contains non-ASCII characters.

Correct.

> It could be an IRI, but compatibility with Web content requires that it

It *is* an IRI.

> not be treated per the IRI spec. Safari, for instance, will fetch the

...not be converted to a URI per the IRI spec...

> following URI (assuming the base URL is http://example.com/):
>
>    http://example.com/results.cgi/%C5%BD?%DE
>
> It yields a valid URI, but somewhere we have to define the processing that
> led to two characters in the same URL being encoded using two different
> character encodings. Right now, URI and IRI don't define this, so I'm
> defining it in the HTML5 spec. This is unfortunately requiring very
> intimate interaction with the parsing rules of URIs, which is far less
> orthogonality than I would like.

You could change the algorithm for getting to the IRI in the first place,
such as making it equivalent to:

   <a href="results.cgi/Ž?Þ">

...in which case the standard IRI->URI conversion would yield the
expected result.

> IE actually sends http://example.com/results.cgi/%C5%BD?* where "*" is
> the ISO-8859-13-encoded 8-bit byte for that character. If you target an

Now that suggests to me that there is no interop between IE and Safari,
and thus whatever you specify *may* break something.

> iframe, IE uses the encoding of the iframe. You can see some of these
> behaviours if you look at these tests:
>
>    http://hixie.ch/tests/adhoc/uri/encoding/
>
>
>> Now, that being said, is there anything HTML5 could do so we can get
>> closer to a strict UTF-8 world in the future? Such as allowing servers
>> to serve documents in an encoding != UTF-8, but still get query
>> parameters to be consistently encoded in UTF-8?
>
> There might be, but I don't see any way to get there at the moment. Any
> suggestions would be very welcome.

A form attribute through which the site can state: "I want
UTF-8-encoding-then-percent-escaping, no matter what the document
encoding was"?

Or potentially, in a more distant future, some way of specifying URI
templates (*)?

BR, Julian

(*) Yes, when they are ready...
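
The mixed-encoding result quoted in the message above (path percent-escaped from UTF-8, query percent-escaped from the document encoding) can be reproduced with a small sketch. This is only an illustration of the byte-level outcome, not the algorithm of any browser or of the HTML5 draft; the helper names `mixed_encoding_uri` and `utf8_only_uri` are hypothetical, and the sketch assumes Python's standard `urllib.parse` and its built-in ISO-8859-13 codec.

```python
from urllib.parse import quote, urljoin


def mixed_encoding_uri(base, path, query, doc_encoding):
    """Hypothetical helper: escape the path as UTF-8 and the query in the
    document's own encoding, mimicking the Safari result quoted above."""
    escaped_path = quote(path, safe="/", encoding="utf-8")
    escaped_query = quote(query, safe="=&", encoding=doc_encoding)
    return urljoin(base, escaped_path + "?" + escaped_query)


def utf8_only_uri(base, path, query):
    """Hypothetical helper: escape both components as UTF-8, which is how
    the IRI-to-URI conversion treats non-ASCII characters."""
    escaped_path = quote(path, safe="/", encoding="utf-8")
    escaped_query = quote(query, safe="=&", encoding="utf-8")
    return urljoin(base, escaped_path + "?" + escaped_query)


base = "http://example.com/"

# Path from UTF-8, query from ISO-8859-13 (the document encoding).
print(mixed_encoding_uri(base, "results.cgi/Ž", "Ž", "iso-8859-13"))
# → http://example.com/results.cgi/%C5%BD?%DE

# Both components from UTF-8.
print(utf8_only_uri(base, "results.cgi/Ž", "Ž"))
# → http://example.com/results.cgi/%C5%BD?%C5%BD
```

The same character ends up as `%C5%BD` in one component and `%DE` in the other, which is the two-encodings-in-one-URL situation under discussion.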
Received on Tuesday, 24 June 2008 19:58:13 UTC