- From: Frank Ellermann <nobody@xyzzy.claranet.de>
- Date: Wed, 30 Apr 2008 01:30:55 +0200
- To: www-international@w3.org
Erik van der Poel wrote:

> RFC 3987 does mention related issues.

The complete section 7 is informative. I fear I never read it, because the syntax and prose in chapters 1..3 satisfied all my "now that is really KISS" desires. For chapter 4 (BiDi) I couldn't judge it anyway.

[Digression... I'm not completely convinced that numbers are really written LTR in RTL languages, or if they just have a "little endian" concept where RTL languages use "big endian"]

>| when a new Web form is set up using UTF-8 as the character
>| encoding of the form page, the returned query URIs will
>| use UTF-8 as the character encoding (unless the user, for
>| whatever reason, changes the character encoding) and will
>| therefore be compatible with IRIs."

A reason to change it is a browser not supporting UTF-8, but I'm confident that the number of Netscape 2.02 users sharply declined by 100% from one to zero worldwide last year.

Of course using UTF-8 is the most robust solution: queries in a URI can't say which other percent-encoded charset they might use. FWIW it can be no charset at all, e.g. the percent-encoded raw octets of an ICO or similar. IRIs are only one example of that general issue; it affects all queries as soon as any part of them is non-ASCII.

>| Second, it may include URIs constructed based on character
>| encodings other than UTF-8. These URIs may be produced by
>| user agents that do not conform to this specification and
>| that use legacy character encodings to convert non-ASCII
>| characters to URIs.

It could be an ftp URI talking about file names on a server using a legacy charset, or similar cases for other schemes. RFC 3987 merely repeats what RFC 2277 before it and RFC 5198 after it say: use UTF-8 over the wire, anything else requires a way to indicate the charset. And for HTTP GET forms the resulting URI can't say what it is. Clients trying to state that a URI is not UTF-8 are doomed:

* A percent-encoded ICO is not UTF-8, nor any other charset.
* A URI by definition is US-ASCII following STD 66 syntax, otherwise it is broken and potentially dangerous.
* And RFC 3987 quietly adds the concept "as far as URIs use percent-encoded octets it is either some binary gibberish, or percent-encoded UTF-8".

The last point is the real magic in RFC 3987: it deprecates the whole zoo of legacy charsets (again) without mentioning the fact. That holds for definitions of "legacy" starting with UTF-16, UTF-32, UTF-7, UTF-1, and then covering anything that is not UTF-8 or its proper subset US-ASCII.

> I'm not sure whether we are communicating here. I'm talking
> about URIs that are sent from the client to a server

We are on the same track. If a user clicks on a "raw" IRI in the href on the KOI8-R test page, it cannot work with almost all clients (minus popular browsers), because HTTP supports only URIs, not "raw" IRIs. Besides, old clients have no way to figure out the server in these two IRIs (one KOI8-R IRI for the Cyrillic test TLD Wiki, one Unicode IRI given with NCRs for the Greek test TLD Wiki).

> Currently, HTML browsers convert from Unicode to the document
> encoding when an HTML form is submitted or an href with a
> non-ASCII query part is clicked.

That sounds strange. For the <ihost> part I found that FF2 converts it from legacy (KOI8-R) or Unicode (NCRs) to the corresponding IDNA A-labels, otherwise the links don't work. For <ipath> Martin's test suite showed that FF2 didn't get this right for legacy non-UTF-8 charsets (JFTR also not for iso-8859-1).

Fixing that should be straightforward: treat any (X)HTML document internally as Unicode (RFC 2070 and later), if in doubt use UTF-8 for Unicode (RFC 2277 and 5198), and finally percent-encode UTF-8 (RFC 3986 and 3987). Where does the bit of doing something *else* for a <query> or <iquery> enter the picture? What is the point of doing something else, i.e. different from a <path> or <ipath>?
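
To illustrate the "Unicode internally, UTF-8, then percent-encode" recipe and why a legacy charset is a problem for the receiver, here is a minimal Python sketch. It is only an illustration under stated assumptions: the sample word and the helper names `percent_encode_utf8` and `looks_like_utf8` are hypothetical and come from neither RFC 3987 nor any browser.

```python
from urllib.parse import quote, unquote_to_bytes


def percent_encode_utf8(text: str) -> str:
    """Percent-encode a Unicode string via its UTF-8 octets.

    Hypothetical helper; safe="" means every non-unreserved
    character is escaped, which suits a single path segment
    or query value.
    """
    return quote(text, safe="", encoding="utf-8")


def looks_like_utf8(encoded: str) -> bool:
    """Return True if the percent-decoded octets are valid UTF-8.

    This is only a heuristic: the URI itself carries no charset
    label, and some legacy byte sequences happen to be valid UTF-8.
    """
    try:
        unquote_to_bytes(encoded).decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


word = "тест"  # assumed sample text, already held as Unicode

utf8_form = percent_encode_utf8(word)
# Same text encoded the legacy way, as a KOI8-R page might submit it.
koi8_form = quote(word, safe="", encoding="koi8-r")

print(utf8_form)                   # %D1%82%D0%B5%D1%81%D1%82
print(koi8_form)                   # %D4%C5%D3%D4
print(looks_like_utf8(utf8_form))  # True
print(looks_like_utf8(koi8_form))  # False
```

Both results are plain US-ASCII URIs, and nothing in either says which charset produced the octets. The server can only guess, which is the case for always using UTF-8.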

> I'm saying that it would be more consistent if the browsers
> used NCRs for both forms *and* hrefs, since the server
> doesn't know which one the user was dealing with.

Do you want to send NCRs in URI query parts over the wire? Including sending "&" as "&amp;" etc.? I don't see how that can be a good idea. Servers would then be faced with the question of how often they need to decode NCRs in the URI, *plus* the known issues of decoding %25hh, %25%25hh, etc.

> Maybe HTML forms and hrefs with query parts can be
> specified in HTML 5 instead of IRIbis.

HTML5 already tries to reinvent the complete Internet, but as far as HTML5 is a manual for browser implementors, yes, HTML5 might need to talk about these issues. And maybe it does already; I have only read the "diff" draft carefully, and am now waiting for the next round of "official" HTML5 drafts.

Frank
Received on Tuesday, 29 April 2008 23:59:07 UTC