Re: BiDi IRI deployment?

On Tue, Apr 29, 2008 at 4:30 PM, Frank Ellermann
<nobody@xyzzy.claranet.de> wrote:
>  Of course using UTF-8 is the most robust solution, queries
>  in an URI can't say which other percent-encoded charset they
>  might use

Within the context of HTML and HTTP, queries don't have to say which
charset they are using, because there is already a de facto agreement
in place: the major browsers encode the query in the charset of the
HTML page it came from, and the servers expect exactly that.

>  Where does the bit of doing something *else* for a <query>
>  or <iquery> enter the picture ?  What is the point of doing
>  something else, i.e. different from a <path> or <ipath> ?

I suspect that the major browser developers decided that there are too
many servers out there that expect the query part to arrive in the
encoding of the original HTML. I also believe that there are
historical reasons for this situation. I'm sure someone will correct
me if I get this wrong. In 1995, UTF-8 was not really in the picture
yet (in HTML/HTTP). We (Netscape and others) were trying to support
the major markets first, e.g. Japan. As you know, Japan has 3
major encodings: Shift_JIS, EUC-JP and ISO-2022-JP. Windows and Mac
used Shift_JIS, while several Unixes used EUC-JP. However, the servers
would get confused when their HTML was in Shift_JIS and a Unix client
submitted the form in EUC-JP. So we decided to use whatever charset
was being used in the original HTML, since that was presumably what
the server was expecting. Hrefs with query parts are of course related
to HTML forms, so it was decided to use the same encoding there.

For the path part, I'm just guessing, but I suspect there was less
inertia there, so that it was possible for MSIE to push for escaped
UTF-8 when the original path was *raw* (not %-escaped). Now that even
Firefox has agreed to do this, I suspect it will stay this way.
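
As a rough sketch of that split (escaped UTF-8 for a raw path, page
charset for the query), something like the following Python would
reproduce the behaviour described. The function name, the "q"
parameter and the sample path are assumptions made for the example,
and the sketch only handles raw input: text that is already %-escaped
would be escaped a second time here.

```python
from urllib.parse import quote

def build_request_uri(raw_path: str, raw_query: str, page_charset: str) -> str:
    """Escape a raw path as UTF-8 and a raw query in the page charset.

    Hypothetical helper; assumes neither part is already %-escaped.
    """
    path = quote(raw_path, safe="/", encoding="utf-8")
    query = quote(raw_query, safe="=&", encoding=page_charset)
    return path + "?" + query

# Same text in both parts, link taken from a Shift_JIS page.
uri = build_request_uri("/日本", "q=日本", "shift_jis")
# uri == "/%E6%97%A5%E6%9C%AC?q=%93%FA%96%7B"
```

The two halves of the resulting URI carry the same characters as
different byte sequences, which is exactly the path/query difference
being asked about.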

>  Do you want to send NCRs in URI query parts over the wire ?

It's not a question of what I want. It's a question of what the
browsers already do.

>  Including sending "&" as "&amp;" etc. ?  I don't see how
>  that can be a good idea, servers would then be faced with
>  questions of how often they need to decode NCRs in the URI

No, the & of the NCR is sent as %26. The # and ; are also %-escaped.
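
To illustrate, here is a minimal Python sketch of that behaviour,
assuming a page in ISO-8859-1 and a character (U+3039, decimal 12345)
that the page charset cannot represent. The helper name is invented
for the example, and real form submission also has details (such as
"+" for spaces) that are left out.

```python
from urllib.parse import quote

def encode_with_ncr_fallback(value: str, page_charset: str) -> str:
    """Encode in the page charset, turning unencodable characters into
    decimal NCRs, then %-escape everything that is not unreserved.

    Hypothetical helper for illustration only.
    """
    raw = value.encode(page_charset, errors="xmlcharrefreplace")
    return quote(raw, safe="")

# U+3039 is not in ISO-8859-1, so it goes out as an escaped NCR.
from_char = encode_with_ncr_fallback(chr(12345), "iso-8859-1")
# from_char == "%26%2312345%3B"

# A user who types the eight literal characters gets the same bytes.
from_literal = encode_with_ncr_fallback("&#12345;", "iso-8859-1")
# from_literal == "%26%2312345%3B"
```

Because the "&" arrives as %26, the server never sees it as a parameter
separator, and it only has to decode the NCR once.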

Yes, this means that users cannot type a literal "&#12345;", but
content providers can switch to UTF-8 if that is important to them.
The alternatives have their drawbacks too:

MSIE uses question marks, which obviously loses info.

Firefox uses UTF-8, even if the server is expecting something else,
which is quite obviously bad too.

The whatwg spec suggests using fallbacks, which also lose info:

http://www.whatwg.org/specs/web-forms/current-work/#unacceptableCharacters

All 3 of these alternatives also differ from HTML form handling, so
that is another drawback (from the server's point of view, since the
server cannot tell the difference between a request URI from an HTML
form and a request URI from an href).
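
The first two alternatives can be sketched in Python as below. The
function names are invented for the example and only mimic the
behaviours described above; they are not taken from either browser.
The whatwg fallbacks are left out because the details are in the
linked spec.

```python
from urllib.parse import quote, unquote

def question_mark_fallback(value: str, page_charset: str) -> str:
    """Replace characters the page charset cannot encode with "?"."""
    return quote(value.encode(page_charset, errors="replace"), safe="")

def always_utf8(value: str) -> str:
    """Ignore the page charset and send escaped UTF-8."""
    return quote(value, safe="", encoding="utf-8")

# Question marks: U+3039 and a real "?" become indistinguishable.
lost = question_mark_fallback(chr(12345), "iso-8859-1")   # "%3F"
real = question_mark_fallback("?", "iso-8859-1")          # "%3F"

# UTF-8 regardless: a server expecting ISO-8859-1 misreads the bytes.
sent = always_utf8("\u00e9")                              # "%C3%A9"
seen_by_server = unquote(sent, encoding="iso-8859-1")     # "Ã©"
```

In the first case the original character cannot be recovered at all; in
the second it can, but only by a server that knows to try UTF-8.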

>  HTML5 already tries to reinvent the complete Internet

Just curious, but which main parts of HTML5 are you referring to here?

Erik

Received on Wednesday, 30 April 2008 20:26:18 UTC