- From: Frank Ellermann <nobody@xyzzy.claranet.de>
- Date: Thu, 1 May 2008 10:21:00 +0200
- To: www-international@w3.org
Erik van der Poel wrote:

> Within the context of HTML and HTTP, queries don't have to
> say which charset they are using, because there is already
> an agreement in place: the major browsers and servers use
> the charset of the HTML.

Ugh. I clearly missed your point, assuming that RFC 3987 is
KISS and that nobody would do anything else after 2005.

> I also believe that there are historical reasons for this
> situation. I'm sure someone will correct me if I get this
> wrong. In 1995, UTF-8 was not really in the picture yet

Yes, RFC 2277 came later, UTF-7 was for mail, and Unicode was
not very popular...

> As you know, Japan has 3 major encodings: Shift_JIS,
> EUC-JP and ISO-2022-JP.

...I know next to nothing about JP, only that my MUA sticks to
ISO-2022-JP in replies instead of switching to UTF-8 or
ISO-2022-JP-2 on the fly when I try to use Latin-1 in the reply.

But hopefully I now get the idea of how things were done in the
pre-3987 era: because the user decided to use a form on a Web
page using legacy charset X, the user agent apparently supported
this legacy charset X. Therefore it could make sense to assume
that the user agent will also send its query using legacy
charset X. When the user tries to enter characters not supported
in legacy charset X, they can still be encoded as NCRs, which
have been unambiguously Unicode since about 1997 (RFC 2070). And
percent-encoded octets are either binary or octets of the legacy
charset X. So far it's clear.

But there's a twist: precisely the same form can exist on more
than one page, and the other pages can use other charsets. How
does the server figure this out? Using a Referer header can't be
a good idea. AFAIK GET requests have no Content-Type. Query URLs
are supposed to be "portable", e.g., noted in text/plain mail.
I'm lost.

> I'm just guessing, but I suspect there was less inertia
> there, so that it was possible for MSIE to push for escaped
> UTF-8 when the original path was *raw* (not %-escaped).

OT: It is only fair if MS sometimes sees the light first :-)
NLSFUNC in a CONFIG.SYS for DOS was odd at its time, but not as
odd as mkkernel before locales were introduced. They also picked
up Unicode early, and I'm delighted that I can simply say CHCP
858 in W2K and get a difference from CHCP 850. IMO some things
in OS/2 were better, but only two codepages, with 850 implicitly
meaning 858, was not one of those better features.

Back to CHCP 1004, or rather 1252, and the wonders of STD 66:

> the & of the NCR is sent as %26.

Tricky. If I understand this correctly, the HTTP server is
supposed to decode all %hh first (once); at this point it knows
the legacy charset X by some magic not yet clear to me, in
theory allowing it to transform X into, say, UTF-8 or UTF-16,
whatever the server prefers; after that all NCRs can also be
resolved... Looking at it again, no, I must be missing a clue:
what about %hh for a binary ICO? Those octets were never in a
legacy charset X. Is the decoding done later, per query
parameter, where a form application should know what is text
and what is binary?

> this means that users cannot type a literal "&#x3039;"

This could be bad for forms with text areas, where users have
to write &lt; when they mean "<", etc. But admittedly that's a
POST form; the interesting cases are URIs and GET. We get quite
a lot of shaky assumptions here, and the RFC 3987 approach
would be simpler.
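To check that I got the pre-3987 mechanism right, here is a
rough Python sketch of the round trip; the choice of Shift_JIS
as "legacy charset X", the function names, and the per-value
decode order are just my guesses for illustration, nothing
normative:

  # Rough sketch of the pre-RFC-3987 round trip described above;
  # Shift_JIS stands in for the "legacy charset X" of the page.
  from urllib.parse import quote, unquote_to_bytes
  import html

  PAGE_CHARSET = 'shift_jis'

  def encode_query_value(text: str) -> str:
      # Browser side: characters outside charset X fall back to
      # NCRs (&#...;), then all octets are percent-encoded.
      octets = text.encode(PAGE_CHARSET, errors='xmlcharrefreplace')
      return quote(octets, safe='')

  def decode_query_value(escaped: str) -> str:
      # Server side: undo the %hh once, decode as charset X
      # (known by "some magic"), then resolve the NCRs.
      octets = unquote_to_bytes(escaped)
      return html.unescape(octets.decode(PAGE_CHARSET))

  euro  = encode_query_value('\u20ac')   # € is not in Shift_JIS
  typed = encode_query_value('&#8364;')  # user typed the NCR text
  print(euro == typed)                   # True: %26%238364%3B twice
  print(decode_query_value(euro))        # both come back as '€'

If that is roughly right, it at least reproduces the ambiguity
you mention: a literal "&#8364;" typed by the user and the NCR
fallback for "€" arrive as identical octets, so the server
cannot tell them apart.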
Normally I defend old browsers when something like IRIs has a
backwards-compatible form, or when <br clear="all"> works for
any browser unlike <br style="clear: both">. But I'd draw the
line when the old concept never really worked; then it is
justified to ditch it and try something better.

> Firefox uses UTF-8, even if the server is expecting
> something else, which is quite obviously bad too.

IMO bad on the side of the server; the only "something else" it
is entitled to expect could be ISO-8859-1, unless I miss one or
more clues as noted above. [I didn't understand the whatwg draft
by looking at it for one minute; what you have written here was
clearer.]
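For contrast, a minimal sketch of what "uses UTF-8" amounts to
on the sending side, if I read RFC 3987 correctly; the function
name is mine:

  # RFC 3987 style mapping: the query is always encoded as
  # UTF-8 and percent-escaped, independent of the page charset.
  from urllib.parse import quote

  def iri_query_to_uri(text: str) -> str:
      return quote(text.encode('utf-8'), safe='')

  print(iri_query_to_uri('\u65e5\u672c\u8a9e'))
  # %E6%97%A5%E6%9C%AC%E8%AA%9E, i.e., UTF-8 octets

The server can then decode the query without guessing a page
charset, which is why the 3987 approach looks simpler to me.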
> Just curious, but which main parts of HTML5 are you
> referring to here?

- It introduces various new elements.
- It deprecates various elements; some, like <tt>, are no
  nonsense.
- The diff draft didn't bother to mention that it introduces
  IRIs.
- It adds javascript, an unregistered and highly controversial
  URI scheme.
- It deprecates various charsets with no clear rationale apart
  from overruling CharMod.
- It uses a Wiki page as the normative reference for rel=
  relations.
- It introduces the ugly ping= concept.
- It silently deprecates XHTML 1 without saying so (I'm not
  always against doing things "between the lines", as in the
  RFC 3987 case).
- It invents a new parse mode instead of SGML (I'm not saying
  that this is wrong, I just note that HTML5 apparently
  reinvents the Internet).

Of course I only read 10% of the draft, but what I found is
enough to think that HTML5 cannot fly "as is". It is far too
ambitious; in IETF terms it is not a task for a single WG, it
is a job for a complete area. The part that I like (so far) is
its aspect of a manual for browser implementors.

Frank

Received on Thursday, 1 May 2008 08:19:22 UTC