Re: revised "generic syntax" internet draft

Gary Adams - Sun Microsystems Labs BOS (Gary.Adams@east.sun.com)
Tue, 15 Apr 1997 11:53:56 -0400 (EDT)


Date: Tue, 15 Apr 1997 11:53:56 -0400 (EDT)
From: Gary Adams - Sun Microsystems Labs BOS <Gary.Adams@east.sun.com>
Subject: Re: revised "generic syntax" internet draft 
To: uri@bunyip.com, fielding@kiwi.ICS.UCI.EDU
Cc: Harald.T.Alvestrand@uninett.no
Message-Id: <libSDtMail.9704151153.13046.gra@zeppo>


> From: "Roy T. Fielding" <fielding@kiwi.ICS.UCI.EDU>
> 
> I am going to try this once more, and then end this discussion.
> I have already repeated these arguments several times, on several
> lists, over the past two years, and their substance has been repeatedly

This is a very good summary of the key issues that were discussed.
The arguments have been passionate at times, because a fair number
of misunderstandings have taken place along the way.

> ignored by Martin and Francois.  That is why I get pissed-off when
> Martin sends out statements of consensus which are not true, whether he
> realizes them to be false or not.  The html-i18n RFC is a travesty

From my perspective, it's a travesty that the W3C HTML 3.2 DTD did not
include the LANG attribute and that the numeric character references
were left as ISO 8859-1. This has allowed a tremendous number of HTML
pages to be authored with an inferior font-centric view of characters.

Whatever the final resolution is for I18N URIs, it should include an
appropriate means of labeling the data in an unambiguous fashion. Most 
servers are not delivering character set labeled documents today, and 
most browsers are still guessing at character encodings or requiring the
user to manually select them.


> because actual solutions to the problems were IGNORED in favor of
> promoting a single charset standard (Unicode).  I personally would
> approve of systems using Unicode, but I will not standardize solutions
> which are fundamentally incompatible with existing practice.  In my world,
> it is more important to have a deployable standard than a perfect standard.
> 
> PROBLEM 1:  Users in network environments where non-ASCII characters
>             are the norm would prefer to use language-specific characters
>             in their URLs, rather than ASCII translations.

URLs are often used to access information in filesystems, but they are
equally useful as keys into database systems. The way you've stated the
problem, it is merely a "user preference". From my perspective the documents
already live in a non-ASCII name system and need to be made unambiguously
visible via the URL mechanism.  On a Unix system this could be as simple as
having the server return inodes instead of pathnames. The question is
whether the native characters can be encoded in the URL safely and
unambiguously. Today the %HH escape mechanism allows any character
encoding to be used safely, but the result is ambiguous.
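As an illustration of that ambiguity (a modern Python sketch, not something available in 1997): the same %HH-escaped octets name different characters depending on which unlabeled charset the recipient happens to assume.

```python
# The %HH mechanism is well-defined only down to octets; the character
# reading depends entirely on an assumed, unlabeled charset.
from urllib.parse import unquote_to_bytes

escaped = "%C3%A9"                    # two raw octets: 0xC3 0xA9
octets = unquote_to_bytes(escaped)

# Interpreted as ISO 8859-1, the octets are two characters...
latin1_reading = octets.decode("iso-8859-1")   # "Ã©"
# ...but interpreted as UTF-8 they are a single character.
utf8_reading = octets.decode("utf-8")          # "é"
print(latin1_reading, utf8_reading)
```

Both readings are "safe" in the transport sense; nothing in the URL itself says which one the author meant.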

Using the HotJava browser yesterday to view

   http://www.alis.com:8085/~yergeau/url_utf8.htm
   
I was able to manually select "View" -> "Character Set" -> "Other" -> UTF8
and see the accented characters in the document text as well as in the
presentation of the URL. This worked for the raw 8-bit UTF-8 bytes, but was
not implemented for the %HH escaped characters. That would be a very
useful feature to support in an I18N browser.

> 
> Proposal 1a: Do not allow such characters, since the URL is an address
>              and not a user-friendly string.  Obviously, this solution
>              causes non-Latin character users to suffer more than people
>              who normally use Latin characters, but is known to interoperate
>              on all Internet systems.

This is essentially the status quo argument that was used in the old
7-bit vs. 8-bit clean protocol debates. The same argument was made back when
I got my first uppercase/lowercase chip for a dumb terminal around 1979. A
reasonable assumption for today's target platform has got to be at least a
bitmap display and virtual keyboard, or a localized keyboard with
locale-specific fonts for rendering.

> 
> Proposal 1b: Allow such characters, provided that they are encoded using
>              a charset which is a superset of ASCII.  Clients may display
>              such URLs in the same charset of their retrieval context,
>              in the data-entry charset of a user's dialog, as %xx encoded
>              bytes, or in the specific charset defined for a particular
>              URL scheme (if that is the case).  Authors must be aware that
>              their URL will not be widely accessible, and may not be safely
>              transportable via 7-bit protocols, but that is a reasonable
>              trade-off that only the author can decide.

From a practical point of view, the author must make a charset choice for the
content of the document and for the name of the document in its persistent
storage, e.g. an SJIS document saved to an EUC-JP filesystem. In addition,
a web administrator must ensure that the appropriate headers are presented
on the wire, such as the HTML META elements describing the encoding or
perhaps the Apache variant mechanism for Content-* HTTP headers.

> 
> Proposal 1c: Allow such characters, but only when encoded as UTF-8.
>              Clients may only display such characters if they have a
>              UTF-8 font or a translation table.  Servers are required to
>              filter all generated URLs through a translation table, even
>              when none of their URLs use non-Latin characters.  Browsers
>              are required to translate all FORM-based GET request data
>              to UTF-8, even when the browser is incapable of using UTF-8
>              for data entry.  Authors must be aware that their
>              URL will not be widely accessible, and may not be safely
>              transportable via 7-bit protocols, but that is a reasonable
>              trade-off that only the author can decide.  Implementers
>              must also be aware that no current browsers and servers
>              work in this manner (for obvious reasons of efficiency),
>              and thus recipients of a message would need to maintain two
>              possible translations for every non-ASCII URL accessed.
>              In addition, all GET-based CGI scripts would need to be
>              rewritten to perform charset translation on data entry, since
>              the server is incapable of knowing what charset (if any)
>              is expected by the CGI.  Likewise for all other forms of
>              server-side API.

I don't see this proposal the same way. Today 8-bit characters are
not defined as safe in URLs, and the non-ASCII %HH escaped bytes are not
labeled. If a CGI FORM has specified the character set to be returned, then
it has a claim on what should come back. In the case of ISINDEX GET
requests, where it is not possible to specify the returned encoding, it is
up to the browser to select an appropriate representation.

My first choice would be some form of self-describing notation that would
allow URLs to be character-set labeled, so any character set could be used.
In practice this would never work. My second choice would be for a universal
character set to be allowed, so that most of the world's languages could
be represented. That is what I thought the UTF-8 %HH escaped notation for
non-ASCII characters was intended to accomplish, in a safe and unambiguous
fashion.
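The UTF-8 %HH convention can be sketched in a few lines of modern Python (purely illustrative; the sample string is mine): characters are first encoded as UTF-8, then each non-ASCII octet is escaped, and because both ends agree on the charset the round trip is unambiguous.

```python
# UTF-8 %HH notation: encode characters as UTF-8 octets, then %HH-escape.
from urllib.parse import quote, unquote

name = "r\u00e9sum\u00e9"                   # "résumé"
escaped = quote(name, encoding="utf-8")     # "r%C3%A9sum%C3%A9"
print(escaped)

# With the charset fixed by convention, decoding is deterministic.
round_trip = unquote(escaped, encoding="utf-8")
assert round_trip == name
```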

> 
> Proposal 1a is represented in the current RFCs and the draft,
> since it was the only one that had broad agreement among the
> implementations of URLs.  I proposed Proposal 1b as a means
> to satisfy Martin's original requests without breaking all
> existing systems, but he rejected it in favor of Proposal 1c.
> I still claim that Proposal 1c cannot be deployed and will
> not be implemented, for the reasons given above.  The only
> advantage of Proposal 1c is that it represents the
> Unicode-uber-alles method of software standardization.

This sounds like a highly charged comment. The alternative seems to
be restricting URLs to ASCII characters and ISO 8859-1 %HH escaped URLs.
Eventually another mechanism will be needed to indirectly address
the localized resources I mentioned above.

> 
> Proposal 1b achieves the same result, but without requiring
> changes to systems that have never used Unicode in the past.
> If Unicode becomes the accepted charset on all systems, then
> Unicode will be the most likely choice of all systems, for the
> same reason that systems currently use whatever charset is present
> on their own system.

Speaking from a Sun perspective, the Java platform (including browsers and
servers) has adopted Unicode and I am lobbying hard to get Unicode into XML,
DSig/PICS, webnfs, etc.

> 
> Martin is mistaken when he claims that Proposal 1c can be implemented
> on the server-side by a few simple changes, or even a module in Apache.
> It would require a Unicode translation table for all cases where the
> server generates URLs, including those parts of the server which are
> not controlled by the Apache Group (CGIs, optional modules, etc.).

Another way to state this consideration is that an application running
in a non-Latin-1 environment has two choices when communicating over
a network. It can blindly transmit unlabeled information on the wire
and hope for a common agreement with external clients, or it must provide
a canonical representation of the information. For example, I might choose
to transmit a UTF-8 %HH escaped URL for a file in a local EUC-JP filesystem
and still include a "Content-Encoding: SJIS" header to disambiguate
the transaction, so a localized browser could correctly present both
the name and the contents of the document.
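The "canonical representation" step above can be sketched as follows (a hypothetical modern Python illustration; the Japanese sample name is mine): the server transcodes the local EUC-JP name to UTF-8 before %HH-escaping it, so the wire form no longer depends on the local filesystem's encoding.

```python
# Transcode a locally-encoded name to the canonical charset (UTF-8)
# before %HH-escaping it into a URL.
from urllib.parse import quote

# As the name would be stored on a local EUC-JP filesystem:
local_bytes = "\u65e5\u672c\u8a9e".encode("euc-jp")   # "日本語"

# Decode with the *local* charset, re-encode with the *canonical* one,
# then escape the resulting octets.
canonical = quote(local_bytes.decode("euc-jp").encode("utf-8"))
print(canonical)    # %E6%97%A5%E6%9C%AC%E8%AA%9E
```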

> We cannot simply translate URLs upon receipt, since the server has no
> way of knowing whether the characters correspond to "language" or
> raw bits.  The server would be required to interpret all URL characters
> as characters, rather than the current situation in which the server's
> namespace is distributed amongst its interpreting components, each of which
> may have its own charset (or no charset).  Even if we were to make such
> a change, it would be a disaster since we would have to find a way to
> distinguish between clients that send UTF-8 encoded URLs and all of those
> currently in existence that send the same charset as is used by the HTML
> (or other media type) page in which the FORM was obtained and entered
> by the user.

I agree that it would be disastrous to allow a mixed namespace without 
requiring appropriately labeled transactions. If the TV Guide is 
only available in SJIS/EUC encodings I may be out of luck trying
to use the local search engine.

> 
> The compromise that Martin proposed was not to require UTF-8, but
> merely recommend it on such systems.  But that doesn't solve the problem,
> so why bother?

I may be wrong, but the only long-term viable solutions I can see are the
ones that require labeling or the ones that require a canonical
representation.

> 
> 
> PROBLEM 2:  When a browser uses the HTTP GET method to submit HTML
>             form data, it url-encodes the data within the query part
>             of the requested URL as ASCII and/or %xx encoded bytes.
>             However, it does not include any indication of the charset
>             in which the data was entered by the user, leading to the
>             potential ambiguity as to what "characters" are represented
>             by the bytes in any text-entry fields of that FORM.

This eventually boils down to: the client decides, the server decides,
or there is a mutually agreed-upon representation.
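The mutually-agreed case can be illustrated with a modern Python sketch (not 1997 tooling; the query string is a made-up example): once the charset of the submitted query is fixed by agreement, the server decodes the form data deterministically, and a different assumption yields a different reading of the same bytes.

```python
# Decoding submitted GET form data requires knowing the charset; the
# query string itself carries only octets.
from urllib.parse import parse_qs

query = "q=r%C3%A9sum%C3%A9"

# With an agreed charset (UTF-8 here), the bytes have one reading...
utf8_reading = parse_qs(query, encoding="utf-8")        # {'q': ['résumé']}
# ...under a different assumption, the same bytes read differently.
latin1_reading = parse_qs(query, encoding="iso-8859-1") # {'q': ['rÃ©sumÃ©']}
print(utf8_reading, latin1_reading)
```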

> 
> Proposal 1a:  Let the form interpreter decide what the charset is, based
>               on what it knows about its users.  Obviously, this leads to
>               problems when non-Latin charset users encounter a form script
>               developed by an internationally-challenged programmer.

It may be helpful to consider two distinct things that happen on
the client side of the transaction. Clearly the localized user input
will be done using the appropriate fonts and devices for the
end user, e.g. a Braille reader/writer, Japanese input methods, etc.
The second step of taking the inputs and encoding them so they are safe
and unambiguous for the server is a separate operation.

If the form specified a return encoding, that is what the browser should
use. This places the burden on the browser to implement a wide range of
character converters for all the servers it could communicate with.

> 
> Proposal 1b:  Assume that the form includes fields for selecting the data
>               entry charset, which is passed to the interpreter, and thus
>               removing any possible ambiguity.  The only problem is that
>               users don't want to manually select charsets.

User-selected operations are prone to error and would not be necessary if
the protocol contained enough of the proper labels.

> 
> Proposal 1c:  Require that the browser submit the form data in the same
>               charset as that used by the HTML form.  Since the form
>               includes the interpreter resource's URL, this removes all
>               ambiguity without changing current practice.  In fact,
>               this should already be current practice.  Forms cannot allow
>               data entry in multiple charsets, but that isn't needed if the
>               form uses a reasonably complete charset like UTF-8.

This batch model for FORMs has always seemed a bit crippled to me. Eventually,
I want to send a Java applet from a server to a client browser, have it
dialog with the local user to obtain and validate some information, and then
return to the server, where it will dialog with a backend application to
store and process the transaction.

> 
> Proposal 1d:  Require that the browser include a <name>-charset entry
>               along with any field that uses a charset other than the one
>               used by the HTML form.  This is a mix of 1b and 1c, but
>               isn't necessary given the comprehensive solution of 1c
>               unless there is some need for multi-charset forms.

Multi-charset forms might only be required in a multilingual setting.
I work with people developing bilingual translation dictionaries.
Presenting a form with a Russian word and a list of Thai translations
could require more than one character set.

> 
> Proposal 1e:  Require all form data to be UTF-8.  That removes the
>               ambiguity for new systems, but does nothing for existing
>               systems since there are no browsers that do this.  Of course
>               they don't, because the CGI and module scripts that interpret
>               existing form data DO NOT USE UTF-8, and therefore would
>               break if browsers were required to use UTF-8 in URLs.
>

There is a chicken-and-egg problem with introducing any new
web client-server standard. Since the installed base does not
have a viable I18N standard, the only mechanisms in use are
functional solely in local settings, e.g. Mosaic-L10N with ISO 2022
character support.

The argument about breaking existing applications is clearly a versioning
issue. If HTML 4.0 and HTTP 1.2 were defined to use UTF-8 %HH escaped
URLs, then those URLs would have a safe and unambiguous interpretation that
would be globally reliable.

> The real question here is whether or not the problem is real.  Proposal 1c
> solves the ambiguity in a way which satisfies both current and any future
> practice, and is in fact the way browsers are supposed to be designed.
> Yes, early browsers were broken in this regard, but we can't fix early
> browsers by standardizing a proposal that breaks ALL browsers.  The only
> reasonable solution is provided by Proposal 1c and matches what non-broken
> browsers do in current practice.  Furthermore, IT HAS NOTHING TO DO WITH
> THE GENERIC URL SYNTAX, which is what we are supposed to be discussing.
>

The generic URL syntax supports the notion of encoding unsafe 8-bit octets
using the %HH escape mechanism. If the URL specification does not include
an interpretation of those octets, then a higher-level application protocol
is needed to assign a specific interpretation to the bytes. Today this is
unspecified and has been abused in ambiguous ways.
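The consequence of leaving the interpretation to a higher level can be shown with a small modern Python sketch (illustrative only; the octet sequence is a made-up example): unescaping %HH to octets is well-defined, but turning those octets into characters requires information the URL does not carry, so a receiver can only guess.

```python
# %HH unescaping yields octets deterministically; the character
# interpretation of those octets is entirely charset-dependent.
from urllib.parse import unquote_to_bytes

octets = unquote_to_bytes("%A4%CB%A4%DB%A4%F3")   # unlabeled octets
for charset in ("utf-8", "euc-jp", "shift_jis"):
    try:
        print(charset, "->", octets.decode(charset))
    except UnicodeDecodeError:
        print(charset, "-> not a valid byte sequence in this charset")
```

Here the same octets decode as "にほん" in EUC-JP, as halfwidth katakana and punctuation in Shift_JIS, and fail outright as UTF-8.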

 
> The above are the only two problems that I've heard expressed by Martin
> and Francois -- if you have others to mention, do so now.  Neither of the
> above problems are solved by requiring the use of UTF-8, which is what
> Larry was saying over-and-over-and-over again, apparently to deaf ears.

My understanding of the I18N problem is end-to-end communication, from the
end user to the backend database, in a safe and unambiguous (reliable)
manner. The objections to the UTF-8 %HH escape proposal have sounded
like a narrower interpretation of the role of URLs as simply the
on-the-wire transaction between a web user agent and a web document
server. There is a lot of disagreement about whether the current
implementations are in fact broken. Personally, the lack of labeling
leads me to believe that systems currently work only by chance, and that
there is no widely deployed interoperability between current ad hoc I18N
web interfaces.

> 
> I think both of us are tired of the accusations of being ASCII-bigots
> simply because we don't agree with your non-solutions.  Either agree to
> a solution that works in practice, and thus is supported by actual
> implementations, or we will stick with the status quo, which at least
> prevents us from breaking things that are not already broken.
>

I don't want to reopen any resolved debates, but I'd like to see if
there is a possibility for consensus before the discussion is closed
prematurely. I.e., are there any "facts" still in need of investigation,
or are the only unresolved issues questions of "opinion"? (My opinion
is that the current system is already broken; if this could be
substantiated, would that invalidate the "status quo" as a viable
alternative?)
 
> BTW, I am not the editor -- I am one of the authors.  Larry is the only
> editor of the document right now.
> 
>  ...Roy T. Fielding
>     Department of Information & Computer Science    (fielding@ics.uci.edu)
>     University of California, Irvine, CA 92697-3425    fax:+1(714)824-1715
>     http://www.ics.uci.edu/~fielding/

______________________________________________________________________
Gary R. Adams				Email: Gary.Adams@East.Sun.COM
Sun Microsystems Laboratories   	Tel: (508) 442-0416
Two Elizabeth Drive			Fax: (508) 250-5067
Chelmsford MA 01824-4195 USA		(Sun mail stop: UCHL03-207)
SWAN URL:				http://labboot.East/~gra/
WWW URL:		http://www.sunlabs.com/people/gary.adams/
______________________________________________________________________