- From: Erik van der Poel <erikv@google.com>
- Date: Wed, 30 Apr 2008 13:25:32 -0700
- To: "Frank Ellermann" <hmdmhdfmhdjmzdtjmzdtzktdkztdjz@gmail.com>
- Cc: www-international@w3.org
On Tue, Apr 29, 2008 at 4:30 PM, Frank Ellermann <nobody@xyzzy.claranet.de> wrote:

> Of course using UTF-8 is the most robust solution, queries
> in an URI can't say which other percent-encoded charset they
> might use

Within the context of HTML and HTTP, queries don't have to say which charset they are using, because there is already an agreement in place: the major browsers and servers use the charset of the HTML.

> Where does the bit of doing something *else* for a <query>
> or <iquery> enter the picture ? What is the point of doing
> something else, i.e. different from a <path> or <ipath> ?

I suspect that the major browser developers decided that there are too many servers out there that expect the query part to arrive in the encoding of the original HTML.

I also believe that there are historical reasons for this situation. I'm sure someone will correct me if I get this wrong. In 1995, UTF-8 was not really in the picture yet (in HTML/HTTP). We (Netscape and others) were trying to support the major markets first, i.e. Japan, etc. As you know, Japan has 3 major encodings: Shift_JIS, EUC-JP and ISO-2022-JP. Windows and Mac use Shift_JIS, while several Unixes used EUC-JP. However, the servers would get confused when their HTML was in Shift_JIS and a Unix client submitted the form in EUC-JP. So we decided to use whatever charset was being used in the original HTML, since that was presumably what the server was expecting.

Hrefs with query parts are of course related to HTML forms, so it was decided to use the same encoding there.

For the path part, I'm just guessing, but I suspect there was less inertia there, so that it was possible for MSIE to push for escaped UTF-8 when the original path was *raw* (not %-escaped). Now that even Firefox has agreed to do this, I suspect it will stay this way.

> Do you want to send NCRs in URI query parts over the wire ?

It's not a question of what I want. It's a question of what the browsers already do.
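To make the charset mismatch described above concrete, here is a minimal sketch, assuming Python's standard `urllib.parse` as a stand-in for what a browser and server do. The form value and variable names are hypothetical, chosen only for illustration; this is not taken from any browser's actual code.

```python
from urllib.parse import quote, unquote

# Hypothetical form value, used only for illustration.
value = "日本"

# Percent-encode the same text under each charset mentioned above.
encoded = {cs: quote(value, encoding=cs) for cs in ("shift_jis", "euc_jp", "utf-8")}
for charset, text in encoded.items():
    print(f"{charset:10} ?q={text}")

# A server that assumes its own page charset reads its own form data correctly...
same = unquote(encoded["shift_jis"], encoding="shift_jis")

# ...but misreads a query that the client encoded in a different charset.
mismatch = unquote(encoded["euc_jp"], encoding="shift_jis", errors="replace")
print(same == value, mismatch == value)
```

The three query strings differ byte for byte, and nothing in the URI itself says which charset produced them, which is why reusing the charset of the original HTML was the practical agreement.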
> Including sending "&" as "&amp;" etc. ? I don't see how
> that can be a good idea, servers would then be faced with
> questions of how often they need to decode NCRs in the URI

No, the & of the NCR is sent as %26. The # and ; are also %-escaped. Yes, this means that users cannot type a literal "&#12345;", but content providers can switch to UTF-8 if that is important to them.

The alternatives have their drawbacks too: MSIE uses question marks, which obviously loses info. Firefox uses UTF-8, even if the server is expecting something else, which is quite obviously bad too. The whatwg spec suggests using fallbacks, which also lose info:

http://www.whatwg.org/specs/web-forms/current-work/#unacceptableCharacters

All 3 of these alternatives also differ from HTML form handling, so that is another drawback (from the server's point of view, since the server cannot tell the difference between a request URI from an HTML form and a request URI from an href).

> HTML5 already tries to reinvent the complete Internet

Just curious, but which main parts of HTML5 are you referring to here?

Erik
Received on Wednesday, 30 April 2008 20:26:18 UTC