Re: revised "generic syntax" internet draft

Roy T. Fielding (fielding@kiwi.ICS.UCI.EDU)
Mon, 14 Apr 1997 19:32:32 -0700


To: uri@bunyip.com
Cc: Harald.T.Alvestrand@uninett.no
Subject: Re: revised "generic syntax" internet draft 
In-Reply-To: Your message of "Mon, 14 Apr 1997 10:40:43 PDT."
             <v0300781baf78109a22c7@[206.245.192.43]> 
Date: Mon, 14 Apr 1997 19:32:32 -0700
From: "Roy T. Fielding" <fielding@kiwi.ICS.UCI.EDU>
Message-Id:  <9704141932.aa24523@paris.ics.uci.edu>

I am going to try this once more, and then end this discussion.
I have already repeated these arguments several times, on several
lists, over the past two years, and their substance has been repeatedly
ignored by Martin and Francois.  That is why I get pissed off when
Martin sends out statements of consensus that are not true, whether or
not he realizes they are false.  The html-i18n RFC is a travesty
because actual solutions to the problems were IGNORED in favor of
promoting a single charset standard (Unicode).  I personally would
approve of systems using Unicode, but I will not standardize solutions
which are fundamentally incompatible with existing practice.  In my world,
it is more important to have a deployable standard than a perfect standard.

PROBLEM 1:  Users in network environments where non-ASCII characters
            are the norm would prefer to use language-specific characters
            in their URLs, rather than ASCII translations.

Proposal 1a: Do not allow such characters, since the URL is an address
             and not a user-friendly string.  Obviously, this solution
             causes non-Latin character users to suffer more than people
             who normally use Latin characters, but it is known to
             interoperate on all Internet systems.

Proposal 1b: Allow such characters, provided that they are encoded using
             a charset which is a superset of ASCII.  Clients may display
             such URLs in the same charset as their retrieval context,
             in the data-entry charset of a user's dialog, as %xx encoded
             bytes, or in the charset defined for a particular
             URL scheme (if the scheme defines one).  Authors must be aware that
             their URL will not be widely accessible, and may not be safely
             transportable via 7-bit protocols, but that is a reasonable
             trade-off that only the author can decide.
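As a concrete illustration (mine, not part of the original message), here is
how the %xx encoding under Proposal 1b depends on the charset in play -- the
same character produces different URL bytes under Latin-1 and UTF-8, which is
exactly why the charset of the retrieval context matters:

```python
# Illustrative sketch: percent-encoding the same character under two
# different ASCII-superset charsets, as Proposal 1b allows.  The byte
# sequences -- and hence the URLs -- differ.
from urllib.parse import quote

char = "é"  # U+00E9, LATIN SMALL LETTER E WITH ACUTE

latin1_url = quote(char.encode("latin-1"))  # charset of an ISO-8859-1 page
utf8_url   = quote(char.encode("utf-8"))    # charset of a UTF-8 page

print(latin1_url)  # %E9
print(utf8_url)    # %C3%A9
```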

Proposal 1c: Allow such characters, but only when encoded as UTF-8.
             Clients may only display such characters if they have a
             UTF-8 font or a translation table.  Servers are required to
             filter all generated URLs through a translation table, even
             when none of their URLs use non-Latin characters.  Browsers
             are required to translate all FORM-based GET request data
             to UTF-8, even when the browser is incapable of using UTF-8
             for data entry.  Authors must be aware that their
             URL will not be widely accessible, and may not be safely
             transportable via 7-bit protocols, but that is a reasonable
             trade-off that only the author can decide.  Implementers
             must also be aware that no current browsers or servers
             work in this manner (for obvious reasons of efficiency),
             and thus recipients of a message would need to maintain two
             possible translations for every non-ASCII URL accessed.
             In addition, all GET-based CGI scripts would need to be
             rewritten to perform charset translation on data entry, since
             the server is incapable of knowing what charset (if any)
             is expected by the CGI.  Likewise for all other forms of
             server-side API.
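To see the scale of the filtering requirement, here is a minimal sketch (my
own, assuming a Latin-1 filesystem for the example) of the per-URL
translation step that Proposal 1c would impose on servers:

```python
# Hypothetical sketch of the filter Proposal 1c would require: every URL the
# server generates -- here assumed to come from a Latin-1 filesystem -- must
# be transcoded to UTF-8 before %xx encoding, even when the path is pure
# ASCII and the translation is a no-op.
from urllib.parse import quote

def utf8_filtered_url(path_bytes: bytes, native_charset: str = "latin-1") -> str:
    chars = path_bytes.decode(native_charset)  # bytes -> characters
    return quote(chars.encode("utf-8"))        # characters -> UTF-8 -> %xx

print(utf8_filtered_url(b"/caf\xe9"))  # /caf%C3%A9
print(utf8_filtered_url(b"/plain"))    # /plain (ASCII round-trips unchanged)
```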

Proposal 1a is represented in the current RFCs and the draft,
since it was the only one that had broad agreement among the
implementations of URLs.  I proposed Proposal 1b as a means
to satisfy Martin's original requests without breaking all
existing systems, but he rejected it in favor of Proposal 1c.
I still claim that Proposal 1c cannot be deployed and will
not be implemented, for the reasons given above.  The only
advantage of Proposal 1c is that it represents the
Unicode-uber-alles method of software standardization.

Proposal 1b achieves the same result, but without requiring
changes to systems that have never used Unicode in the past.
If Unicode becomes the accepted charset on all systems, then
Unicode will be the most likely choice of all systems, for the
same reason that systems currently use whatever charset is present
on their own system.

Martin is mistaken when he claims that Proposal 1c can be implemented
on the server-side by a few simple changes, or even a module in Apache.
It would require a Unicode translation table for all cases where the
server generates URLs, including those parts of the server which are
not controlled by the Apache Group (CGIs, optional modules, etc.).
We cannot simply translate URLs upon receipt, since the server has no
way of knowing whether the characters correspond to "language" or
raw bits.  The server would be required to interpret all URL characters
as characters, rather than the current situation in which the server's
namespace is distributed amongst its interpreting components, each of which
may have its own charset (or no charset).  Even if we were to make such
a change, it would be a disaster, since we would have to find a way to
distinguish between clients that send UTF-8 encoded URLs and all of the
clients currently in existence, which send the same charset as the HTML
(or other media type) page from which the FORM was obtained and filled in
by the user.
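The "language versus raw bits" problem can be shown in a few lines (my
illustration, not code from any server): given the bytes of a received URL,
both decodings are plausible, and nothing in the request says which is right:

```python
# Sketch of the ambiguity: the server sees only bytes.  Were they meant as
# UTF-8 characters, or as bytes in the charset of the page that produced
# them?  Both decodings succeed, so the URL alone cannot tell us.
from urllib.parse import unquote_to_bytes

raw = unquote_to_bytes("%C3%A9")
print(raw.decode("utf-8"))    # é   -- if the client encoded UTF-8
print(raw.decode("latin-1"))  # Ã©  -- if the client encoded Latin-1
```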

The compromise that Martin proposed was not to require UTF-8, but
merely recommend it on such systems.  But that doesn't solve the problem,
so why bother?


PROBLEM 2:  When a browser uses the HTTP GET method to submit HTML
            form data, it url-encodes the data within the query part
            of the requested URL as ASCII and/or %xx encoded bytes.
            However, it does not include any indication of the charset
            in which the data was entered by the user, leading to the
            potential ambiguity as to what "characters" are represented
            by the bytes in any text-entry fields of that FORM.

Proposal 2a:  Let the form interpreter decide what the charset is, based
              on what it knows about its users.  Obviously, this leads to
              problems when non-Latin charset users encounter a form script
              developed by an internationally-challenged programmer.

Proposal 2b:  Assume that the form includes fields for selecting the data
              entry charset, which are passed to the interpreter, thus
              removing any possible ambiguity.  The only problem is that
              users don't want to select charsets manually.

Proposal 2c:  Require that the browser submit the form data in the same
              charset as that used by the HTML form.  Since the form
              includes the interpreter resource's URL, this removes all
              ambiguity without changing current practice.  In fact,
              this should already be current practice.  Forms cannot allow
              data entry in multiple charsets, but that isn't needed if the
              form uses a reasonably complete charset like UTF-8.

Proposal 2d:  Require that the browser include a <name>-charset entry
              along with any field that uses a charset other than the one
              used by the HTML form.  This is a mix of 2b and 2c, but
              isn't necessary given the comprehensive solution of 2c
              unless there is some need for multi-charset forms.

Proposal 2e:  Require all form data to be UTF-8.  That removes the
              ambiguity for new systems, but does nothing for existing
              systems, since no browsers do this.  Of course
              they don't, because the CGI and module scripts that interpret
              existing form data DO NOT USE UTF-8, and therefore would
              break if browsers were required to use UTF-8 in URLs.

The real question here is whether or not the problem is real.  Proposal 2c
solves the ambiguity in a way that satisfies both current and any future
practice, and is in fact the way browsers are supposed to be designed.
Yes, early browsers were broken in this regard, but we can't fix early
browsers by standardizing a proposal that breaks ALL browsers.  The only
reasonable solution is provided by Proposal 2c and matches what non-broken
browsers do in current practice.  Furthermore, IT HAS NOTHING TO DO WITH
THE GENERIC URL SYNTAX, which is what we are supposed to be discussing.
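For what it's worth, the rule that form data be submitted in the charset of
the page carrying the form can be sketched as follows (a hypothetical
illustration using Python's standard library, not browser code):

```python
# Sketch: a browser following the charset-of-the-form rule encodes the query
# in whatever charset the HTML page used, so the interpreter named in the
# form's URL knows exactly how to decode it.  The function name and fields
# here are hypothetical.
from urllib.parse import urlencode

def submit_query(fields: dict, page_charset: str) -> str:
    return urlencode(fields, encoding=page_charset)

print(submit_query({"q": "café"}, "latin-1"))  # q=caf%E9
print(submit_query({"q": "café"}, "utf-8"))    # q=caf%C3%A9
```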

The above are the only two problems that I've heard expressed by Martin
and Francois -- if you have others to mention, do so now.  Neither of the
above problems is solved by requiring the use of UTF-8, which is what
Larry was saying over-and-over-and-over again, apparently to deaf ears.

I think both of us are tired of the accusations of being ASCII-bigots
simply because we don't agree with your non-solutions.  Either agree to
a solution that works in practice, and thus is supported by actual
implementations, or we will stick with the status quo, which at least
prevents us from breaking things that are not already broken.

BTW, I am not the editor -- I am one of the authors.  Larry is the only
editor of the document right now.

 ...Roy T. Fielding
    Department of Information & Computer Science    (fielding@ics.uci.edu)
    University of California, Irvine, CA 92697-3425    fax:+1(714)824-1715
    http://www.ics.uci.edu/~fielding/