Re: BiDi IRI deployment? from Erik van der Poel on 2008-05-01 (www-international@w3.org from April to June 2008)

From: Erik van der Poel <erikv@google.com>
Date: Thu, 1 May 2008 06:51:07 -0700
To: "Frank Ellermann" <hmdmhdfmhdjmzdtjmzdtzktdkztdjz@gmail.com>
Cc: www-international@w3.org
Message-ID: <c07a32650805010651o45889be8x513acfdd2d8d31cb@mail.gmail.com>

On Thu, May 1, 2008 at 1:21 AM, Frank Ellermann
<nobody@xyzzy.claranet.de> wrote:
>  Erik van der Poel wrote:
>  > Within the context of HTML and HTTP, queries don't have to
>  > say which charset they are using, because there is already
>  > an agreement in place: the major browsers and servers use
>  > the charset of the HTML.
>
>  Ugh.  I clearly missed your point, assuming that RFC 3987 is
>  KISS and nobody would do something else after 2005.

Developers don't change their implementations just because a spec that
follows the KISS principle came out. There are various backward
compatibility considerations, and other reasons not to implement new
specs, such as laziness, lack of programmer resources, lack of
awareness, a perception that the spec is not so important, or simply
NIH (Not Invented Here) syndrome.

>  So far it's clear.  But there's a twist, precisely the same
>  form can exist on more than one page, and the other pages
>  can use other charsets.  How does the server figure this
>  out ?  Using a Referer header can't be a good idea.

I agree.

>  AFAIK
>  GET requests have no Content-Type.

AFAIK, most browsers do not send a Content-Type header with a GET
request, and most servers do not expect that either.

A fairly widely used mechanism ("hack?") is <input type=hidden
name=enc value=euc-jp>. In Google's case, the name is "ie" (input
encoding). We also have an output encoding (oe). Yes, this means that
you cannot use the exact same HTML form on pages that are in different
charsets. It also means that you must be careful to put the same
charset name in the hidden input parameter and in the HTTP/HTML META
charset(s).
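
As a rough sketch of how a server-side app might honor such a hidden
parameter (Python; the function name decode_query and the UTF-8
fallback are assumptions for illustration, not a description of any
real server):

```python
import codecs
from urllib.parse import parse_qsl

def decode_query(raw_query: str, default_charset: str = "utf-8") -> dict:
    """Decode a query string using the charset named in its "ie" parameter."""
    # First pass: latin-1 maps every byte to a code point, so it cannot fail.
    # Only the ASCII-valued "ie" parameter is trusted from this pass.
    first_pass = dict(parse_qsl(raw_query, encoding="latin-1"))
    charset = first_pass.get("ie", default_charset)
    try:
        codecs.lookup(charset)      # reject charset names Python does not know
    except LookupError:
        charset = default_charset
    # Second pass: decode every %hh sequence with the declared charset.
    return dict(parse_qsl(raw_query, encoding=charset, errors="replace"))

params = decode_query("ie=euc-jp&q=%C6%FC%CB%DC")
print(params["q"])
```

The two-pass approach works only because the charset name itself is
plain ASCII, so it reads the same whatever the rest of the query is
encoded in.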

>  Query URLs are supposed
>  to be "portable", e.g., noted in text/plain mail.  I'm lost.

Yes, there are restrictions. You would have to convert to URI format
(%-escaped) before putting it in the email. (And the server would have
to accept escaped query parts. MSIE does not escape them when they are
raw (not escaped) in hrefs, and some servers depend on this behavior.)
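
For illustration, the conversion to %-escaped form could look like this
in Python (example.org, the parameter names, and the helper name
escape_query_value are made up for the sketch):

```python
from urllib.parse import quote

def escape_query_value(value: str, charset: str) -> str:
    """Encode value in the given charset, then %-escape every reserved byte."""
    # safe="" means even "/" is escaped; only unreserved ASCII stays literal.
    return quote(value, safe="", encoding=charset)

url = ("http://example.org/search?ie=euc-jp&q="
       + escape_query_value("\u65e5\u672c", "euc-jp"))
print(url)  # pure ASCII, so it survives a text/plain mail
```

The escaped URL still says nothing about its charset on its own, which
is why the sketch carries the hidden "ie" parameter along.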

>  > the & of the NCR is sent as %26.
>
>  Tricky, when I understand this correctly the HTTP server is
>  supposed to undecode all %hh first (once), at this time it
>  knows the legacy charset X by some magic not yet clear for
>  me, in theory allowing to transform X to say UTF-8 or UTF-16,
>  whatever the server prefers, after that all NCRs can be also
>  resolved...

Yes.
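
A minimal sketch of that order of operations, assuming the charset is
already known and only decimal NCRs need resolving (the helper names
are hypothetical):

```python
import re
from urllib.parse import unquote_to_bytes

_DECIMAL_NCR = re.compile(r"&#([0-9]+);")

def _resolve_ncr(match: re.Match) -> str:
    code_point = int(match.group(1))
    # Leave out-of-range references untouched instead of raising.
    if code_point > 0x10FFFF:
        return match.group(0)
    return chr(code_point)

def decode_text_param(escaped: str, charset: str) -> str:
    raw = unquote_to_bytes(escaped)               # 1. undo %hh exactly once
    text = raw.decode(charset, errors="replace")  # 2. legacy charset -> Unicode
    return _DECIMAL_NCR.sub(_resolve_ncr, text)   # 3. resolve decimal NCRs

# "%26%231590%3B" is the escaped form of "&#1590;" (U+0636).
print(decode_text_param("%C6%FC%26%231590%3B", "euc-jp"))
```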

>  Looking at it again, no, I must miss a clue, what about %hh
>  for a binary ICO, they were not in a legacy charset X.  Is
>  the decoding done later, per query parameter, where a form
>  application should know what is text and what is binary ?

Yes, the server-side app is expected to know the type of each parameter.
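
One way to picture this is a per-parameter table consulted after the
%hh step (a sketch only; the PARAM_TYPES table and the parameter names
"q" and "icon" are invented for the example):

```python
from urllib.parse import unquote_to_bytes

# Hypothetical schema: the application declares which parameters are binary.
PARAM_TYPES = {"q": "text", "icon": "binary"}

def parse_typed_query(raw_query: str, charset: str) -> dict:
    result = {}
    for pair in raw_query.split("&"):
        if not pair:
            continue
        name, _, value = pair.partition("=")
        # Undo form encoding: "+" means space, %hh means a raw byte.
        data = unquote_to_bytes(value.replace("+", " "))
        if PARAM_TYPES.get(name) == "binary":
            result[name] = data                  # keep the raw bytes as-is
        else:
            result[name] = data.decode(charset, errors="replace")
    return result

print(parse_typed_query("q=%C6%FC&icon=%00%00%01%00", "euc-jp"))
```

Only the text parameters ever pass through the charset decoder; the
binary one is never interpreted as characters.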

>  Normally I defend old browsers when something like IRIs has
>  a backwards compatible form, or when <br clear="all"> works
>  for any browser unlike <br style="clear: both">.  But I'd
>  draw the line when the old concept never really worked,
>  then it is justified to ditch it and try something better.

I agree that we should encourage HTML form authors to use charsets
that can encode all of Unicode. UTF-8 is the best choice. GB18030 is
another possibility. The others have other problems that I won't get
into here.

Having said that, it still would be nice if the major browsers would
all do the same thing when faced with legacy charsets, and my
suggestion is to use decimal NCRs, whether the user started with an
HTML form or an href. I may have to take this suggestion to the HTML
and/or HTML5 mailing list.
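
To make the suggestion concrete, here is a sketch of the client-side
behavior I have in mind, written in Python rather than browser code
(the function name is an assumption; Python's "xmlcharrefreplace" error
handler happens to produce decimal NCRs):

```python
from urllib.parse import quote

def encode_for_legacy_charset(value: str, charset: str) -> str:
    """Encode a query value in a legacy charset with decimal-NCR fallback."""
    # Characters the charset cannot represent become &#NNNN; (decimal).
    encoded = value.encode(charset, errors="xmlcharrefreplace")
    # %-escape the bytes; the "&", "#" and ";" of each NCR get escaped too,
    # so the "&" goes out as %26 and cannot be mistaken for a separator.
    return quote(encoded, safe="")

# U+65E5 exists in EUC-JP; U+0636 (an Arabic letter) does not.
print(encode_for_legacy_charset("\u65e5\u0636", "euc-jp"))
```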

>  > Firefox uses UTF-8, even if the server is expecting
>  > something else, which is quite obviously bad too.
>
>  IMO bad on the side of the server, the only "something else"
>  it's entitled to expect could be ISO-8859-1, unless I miss
>  one or more clues as noted above.

Yes, the specs said and say various things about ISO-8859-1 and UTF-8,
but, as I said, we already have an agreement in place (to use the
charset of the HTML).

>  [I didn't understand the
>  whatwg draft by looking at it for one minute, what you have
>  written here was clearer]

Well, you may not want to spend much time trying to understand that
spec. It says some pretty strange things, such as talking about
readability for users when the string is simply being passed from
machine to machine (client to server).

>  > Just curious, but which main parts of HTML5 are you
>  > referring to here?
>
>  It introduces various new elements.  It deprecates various
>  elements, some like <tt> are no nonsense.  The diff draft
>  didn't bother to mention that it introduces IRIs.  It adds
>  an unregistered highly controversial URI scheme javascript.
>  It deprecates various charsets following no clear rationale
>  apart from overruling CharMod.  It uses a Wiki page as the
>  normative reference for rel= relations.  It introduces the
>  ugly ping= concept.  It silently deprecates XHTML 1 without
>  saying so (I'm not always against doing things "between the
>  lines", as in the RFC 3987 case).  It invents a new parse
>  mode instead of SGML (I'm not saying that this is wrong,
>  I just note that HTML5 apparently reinvents the Internet).
>
>  Of course I only read 10% of the draft, but what I found is
>  enough to think that HTML5 cannot fly "as is".  It is far
>  too ambitious, in IETF terms it is not a task for a single
>  WG, it is a job for a complete area.  The part that I like
>  (so far) is its aspect of a manual for browser implementors.

Interesting. Thanks for taking the time to write down your impressions.

Erik
Received on Thursday, 1 May 2008 13:51:49 UTC