Re: BiDi IRI deployment? from Frank Ellermann on 2008-05-01 (www-international@w3.org from April to June 2008)

From: Frank Ellermann <nobody@xyzzy.claranet.de>
Date: Thu, 1 May 2008 10:21:00 +0200
To: www-international@w3.org
Message-ID: <fvbucc$66b$1@ger.gmane.org>
Erik van der Poel wrote:

> Within the context of HTML and HTTP, queries don't have to
> say which charset they are using, because there is already
> an agreement in place: the major browsers and servers use
> the charset of the HTML.

Ugh.  I clearly missed your point, assuming that RFC 3987 is
KISS and nobody would do something else after 2005.

> I also believe that there are historical reasons for this
> situation. I'm sure someone will correct me if I get this
> wrong. In 1995, UTF-8 was not really in the picture yet

Yes, RFC 2277 came later, UTF-7 was for mail, and Unicode
was not very popular...

> As you know, Japan has 3 major encodings: Shift_JIS,
> EUC-JP and ISO-2022-JP.

...I know next to nothing about JP, only that my MUA sticks
to ISO-2022-JP in replies, instead of switching to UTF-8 or
ISO-2022-JP-2 on the fly when I try to use Latin-1 in the
reply.  But hopefully I now got the idea how things were
done in the pre-3987 era:

Because the user decided to use a form on a Web page using
legacy charset X the user agent apparently supported this
legacy charset X.  Therefore it could make sense to assume
that the user agent will also send its query using legacy
charset X.  

When the user tries to enter characters not supported in
legacy charset X they could still be encoded as NCRs, which
have been unambiguously Unicode since about 1997 (RFC 2070).
And percent-encoded octets are either binary or octets of
the legacy charset X.
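
A minimal sketch of that client-side behaviour, assuming the user agent
falls back to decimal NCRs for anything the page charset cannot encode and
then percent-encodes every octet outside the unreserved set.  The helper
name encode_query_value is hypothetical, and real browsers differ in
details such as sending "+" for a space:

```python
from urllib.parse import quote


def encode_query_value(text: str, page_charset: str) -> str:
    """Hypothetical helper: encode one form value for a GET query."""
    # Characters that the legacy charset cannot represent become
    # decimal NCRs such as "&#8364;" (Python's xmlcharrefreplace handler).
    octets = text.encode(page_charset, errors="xmlcharrefreplace")
    # Percent-encode everything outside the unreserved set, so the
    # "&", "#" and ";" of an NCR are sent as %26, %23 and %3B.
    return quote(octets, safe="")


# U+00E9 exists in ISO-8859-1, U+20AC (euro sign) does not.
print(encode_query_value("caf\u00e9\u20ac", "iso-8859-1"))
# prints caf%E9%26%238364%3B
```

The single octet %E9 is charset X, while %26%238364%3B is an NCR whose
digits always mean a Unicode code point.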

So far it's clear.  But there's a twist: precisely the same
form can exist on more than one page, and the other pages
can use other charsets.  How does the server figure this
out?  Using a Referer header can't be a good idea.  AFAIK
GET requests have no Content-Type.  Query URLs are supposed
to be "portable", e.g., noted in text/plain mail.  I'm lost.
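
To illustrate why the missing charset label matters, here is a small
sketch that decodes the same percent-encoded octets under several
candidate charsets.  The function name interpretations and the sample
value are assumptions made up for this example:

```python
from urllib.parse import unquote_to_bytes


def interpretations(query_value: str, charsets: list) -> dict:
    """Hypothetical helper: decode one query value under several charsets."""
    octets = unquote_to_bytes(query_value)
    result = {}
    for charset in charsets:
        try:
            result[charset] = octets.decode(charset)
        except UnicodeDecodeError:
            # The octets are not valid in this charset at all.
            result[charset] = None
    return result


for charset, text in interpretations(
    "caf%E9", ["iso-8859-1", "iso-8859-7", "utf-8"]
).items():
    print(charset, repr(text))
```

The octet 0xE9 is a different letter in ISO-8859-1 and ISO-8859-7 and is
not valid UTF-8 on its own, so the URL alone does not say what was typed.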

> I'm just guessing, but I suspect there was less inertia 
> there, so that it was possible for MSIE to push for escaped
> UTF-8 when the original path was *raw* (not %-escaped).

OT:  It is only fair if MS sometimes sees the light first :-)
NLSFUNC in a CONFIG.SYS for DOS was odd for its time, but not
as odd as mkkernel before locales were introduced.  They also
picked up Unicode early, and I'm delighted that I can simply
say CHCP 858 in W2K, and get a difference from CHCP 850.  IMO
some things in OS/2 were better, but only two codepages with
850 implicitly meaning 858 was not one of those better features.

Back to CHCP 1004 or rather 1252 and the wonders of STD 66:

> the & of the NCR is sent as %26.

Tricky.  If I understand this correctly the HTTP server is
supposed to decode all %hh first (once); at this time it
knows the legacy charset X by some magic not yet clear to
me, in theory allowing it to transform X to say UTF-8 or
UTF-16, whatever the server prefers.  After that all NCRs
can also be resolved...
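
The three steps above can be sketched as follows, assuming the server has
somehow learned charset X already.  The name decode_query_value and the
NCR pattern are illustrative choices, not taken from any server:

```python
import re
from urllib.parse import unquote_to_bytes

# Decimal (&#8364;) and hexadecimal (&#x20AC;) numeric character references.
_NCR = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def _resolve(match):
    code = int(match.group(1), 16) if match.group(1) else int(match.group(2))
    # Leave references beyond the Unicode range untouched.
    return chr(code) if code <= 0x10FFFF else match.group(0)


def decode_query_value(value: str, charset: str) -> str:
    """Hypothetical helper: recover Unicode text from one query value."""
    octets = unquote_to_bytes(value)   # 1. undo %hh, exactly once
    text = octets.decode(charset)      # 2. legacy charset X -> Unicode
    return _NCR.sub(_resolve, text)    # 3. resolve the NCRs


print(decode_query_value("caf%E9%26%238364%3B", "iso-8859-1"))
```

Step 2 is the one that needs the "magic" knowledge of X; steps 1 and 3
work without it.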

Looking at it again, no, I must be missing a clue: what about
%hh for a binary ICO?  Those octets are not in a legacy charset
X.  Is the decoding done later, per query parameter, where a
form application should know what is text and what is binary?
 
> this means that users cannot type a literal "&#12345;"

This could be bad for forms with text areas, when users have
to write &lt; when they mean "<" etc., but admittedly that's
a POST form, the interesting cases are URIs and GET.  But we
get quite a lot of shaky assumptions here, and the RFC 3987
approach would be simpler.  
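
The ambiguity in the quoted remark can be shown with a short sketch,
assuming the same NCR fallback as described earlier in this message.  The
helper name encode is hypothetical:

```python
from urllib.parse import quote


def encode(text: str, charset: str) -> str:
    """Hypothetical helper: NCR fallback, then percent-encoding."""
    return quote(text.encode(charset, errors="xmlcharrefreplace"), safe="")


# The user types the eight ASCII characters "&#12345;" literally ...
typed_literally = encode("&#12345;", "iso-8859-1")
# ... or types the single character U+3039, which ISO-8859-1 lacks.
typed_character = encode("\u3039", "iso-8859-1")

print(typed_literally)
print(typed_literally == typed_character)  # prints True
```

Both inputs produce the same query string, so a server that resolves NCRs
cannot tell them apart.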

Normally I defend old browsers when something like IRIs has
a backwards compatible form, or when <br clear="all"> works
for any browser unlike <br style="clear: both">.  But I'd
draw the line when the old concept never really worked;
then it is justified to ditch it and try something better.

> Firefox uses UTF-8, even if the server is expecting 
> something else, which is quite obviously bad too.

IMO bad on the side of the server, the only "something else"
it's entitled to expect could be ISO-8859-1, unless I am
missing one or more clues as noted above.  [I didn't understand
the whatwg draft by looking at it for one minute; what you have
written here was clearer.]

> Just curious, but which main parts of HTML5 are you
> referring to here?

It introduces various new elements.  It deprecates various
elements, some like <tt> are no nonsense.  The diff draft
didn't bother to mention that it introduces IRIs.  It adds
an unregistered highly controversial URI scheme javascript.
It deprecates various charsets following no clear rationale
apart from overruling CharMod.  It uses a Wiki page as the
normative reference for rel= relations.  It introduces the
ugly ping= concept.  It silently deprecates XHTML 1 without
saying so (I'm not always against doing things "between the
lines", as in the RFC 3987 case).  It invents a new parse
mode instead of SGML (I'm not saying that this is wrong,
I just note that HTML5 apparently reinvents the Internet).

Of course I only read 10% of the draft, but what I found is
enough to think that HTML5 cannot fly "as is".  It is far
too ambitious, in IETF terms it is not a task for a single
WG, it is a job for a complete area.  The part that I like
(so far) is its aspect of a manual for browser implementors.

 Frank
Received on Thursday, 1 May 2008 08:19:22 UTC