Re: revised "generic syntax" internet draft from Martin J. Duerst on 1997-04-15 (uri@w3.org from April 1997)

From: Martin J. Duerst <mduerst@ifi.unizh.ch>
Date: Tue, 15 Apr 1997 16:31:51 +0200 (MET DST)
To: "Roy T. Fielding" <fielding@kiwi.ICS.UCI.EDU>
Cc: uri@bunyip.com, Harald.T.Alvestrand@uninett.no
Message-Id: <Pine.SUN.3.96.970415120519.708A-100000@enoshima>

On Mon, 14 Apr 1997, Roy T. Fielding wrote:

> I am going to try this once more, and then end this discussion.
> I have already repeated these arguments several times, on several
> lists, over the past two years, and their substance has been repeatedly
> ignored by Martin and Francois.

Thanks for entering into serious discussion. It is true that it was
about two years ago that I first contacted the uri group
and asked about internationalization and URLs. I quickly saw at
that time that there were rather fixed opinions about what an URL
had to be (kind of like a telephone number) and that typability
on ASCII keyboards seemed more important than anything else. Also,
I didn't have any idea of what a solution would have to look like.

But I remained with the impression that denying the benefits of
natural-language URLs to people outside the basic Latin world
was neither fair nor technically necessary. Also, I found that
there were many other people that were interested in a solution,
for various reasons, and often with much more direct needs.
I had many occasions to discuss the issue with them, and with others
who raised doubts or questions. I repeatedly made presentations of
the state of the discussion, the alternatives available, the
issues involved. I discovered that some of the great concerns
that some people raised were not really that important, and
that it was possible to explain this rather easily. I and many
others did a lot of homework, and a lot of work in other groups.
Also, within the past two years, technology has changed a lot.
Java was barely visible two years ago. Unicode was one solution
of many, with almost no applications available.

After these two years, I had finally come to the conclusion that
I had a working solution, an upgrade path, a lot of good arguments,
and a lot of other people that also cared. It was only then that
I started to push strongly for what I and others have come
to understand is the right way to go.


> That is why I get pissed-off when
> Martin sends out statements of consensus which are not true, whether he
> realizes them to be false or not.

I get pissed-off when we need about two months of piling argument
on argument to finally have a clear response from you, and when
you seem to ignore all the changes that have been going on in
the past two years.


> The html-i18n RFC is a travesty
> because actual solutions to the problems were IGNORED in favor of
> promoting a single charset standard (Unicode).

What "actual solutions"? If you think you could have done that work
better, why didn't you do it? Why is everybody in the non-English
community happy with RFC 2070, and the solutions are adopted by
the IAB (as Larry has told me), the W3C and ISO (and of course the
browser makers)? For the most crucial part in RFC 2070, namely the
definition of ISO 10646 as the document character set, which was
already announced in RFC 1866, why does a developer from a major
software company say in a public workshop that his company did
things differently originally, but that now they know better,
and they are doing it as we proposed?


> I personally would
> approve of systems using Unicode, but I will not standardize solutions
> which are fundamentally incompatible with existing practice.

What "fundamental incompatibility"? Is a recommendation suggesting
the use of a particularly well suited character encoding a
"fundamental incompatibility" when at present we don't know
the character encoding anyway?


I very much like the division into PROBLEM 1 and PROBLEM 2
below. PROBLEM 1 is URLs in general, such as domain names,
paths, and resource names. PROBLEM 2 is FORMs. They are
indeed different, because for PROBLEM 1, we have very sparse
namespaces and not very much beyond ASCII yet, whereas
for PROBLEM 2, we have very dense namespaces and already
a lot of use (and chaos). Of course, there are interactions,
because an URL with a # or ? part can also be used as a
primary entry point.

> PROBLEM 1:  Users in network environments where non-ASCII characters
>             are the norm would prefer to use language-specific characters
>             in their URLs, rather than ASCII translations.
> 
> Proposal 1a: Do not allow such characters, since the URL is an address
>              and not a user-friendly string.  Obviously, this solution
>              causes non-Latin character users to suffer more than people
>              who normally use Latin characters, but is known to interoperate
>              on all Internet systems.

The URL may have been designed as a non-user-friendly address,
but to say that it IS not a user-friendly string is ignoring
actual practice. I have just had a look at your web page, and
you use meaningful URLs, like everybody else.


> Proposal 1b: Allow such characters, provided that they are encoded using
>              a charset which is a superset of ASCII.  Clients may display
>              such URLs in the same charset of their retrieval context,
>              in the data-entry charset of a user's dialog, as %xx encoded
>              bytes, or in the specific charset defined for a particular
>              URL scheme (if that is the case).  Authors must be aware that
>              their URL will not be widely accessible, and may not be safely
>              transportable via 7-bit protocols, but that is a reasonable
>              trade-off that only the author can decide.

The problem here is that it's not display or 7-bit channels or whatever
that makes this proposal fail. It is that URLs are transferred from
paper to the computer and back, and that may happen many times.
In a recent message to Francois, you seem to have completely ignored
this fact; you spoke about an URL only being an URL after it is
input into the browser. Yet the draft says:

                                 A URL may be represented in a variety
   of ways: e.g., ink on paper, pixels on a screen, or a sequence of
   octets in a coded character set.

Let's take an example. Assume somebody is constructing an URL
using KOI-8, one of the more popular encodings for Cyrillic.
She is writing down that URL (in Cyrillic) on paper, and
passing it to a friend. The friend types it in, but has no
idea (and wouldn't want to care) what encoding it was.
Maybe he happens to be on a machine that uses iso-8859-5.
He won't be able to find the URL. Obviously, things only
work for those URLs for which we have a defined mapping.
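To make the mismatch concrete, here is a small Python sketch. It is
only an illustration: the Cyrillic word and the helper name
percent_encode are assumptions chosen for the example, not anything
taken from the draft. It shows that the same characters written on
paper turn into different %HH octets depending on the encoding of the
machine on which they are typed in.

```python
from urllib.parse import quote

# Hypothetical path segment: a Cyrillic word as it might appear on paper.
segment = "новости"


def percent_encode(text: str, charset: str) -> str:
    """Encode text in the given charset, then %HH-escape every octet."""
    return quote(text.encode(charset), safe="")


# What the author's KOI8-R machine puts into the URL.
koi8 = percent_encode(segment, "koi8-r")
# What the friend's iso-8859-5 machine produces from the same characters.
latin_cyrillic = percent_encode(segment, "iso-8859-5")
# What both machines would produce with one agreed mapping (UTF-8).
utf8 = percent_encode(segment, "utf-8")

print(koi8)
print(latin_cyrillic)
print(utf8)

# A server that stored KOI8-R octets sees different characters if it
# reads them as iso-8859-5.
misread = segment.encode("koi8-r").decode("iso-8859-5")
print(misread == segment)
```

The first two printed strings differ, so a server that only knows the
KOI8-R octets cannot match the request typed on the iso-8859-5 machine.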

We can define that mapping in several ways. Currently it is
undefined. Possible solutions include a global definition
(or at least a recommendation), a solution per protocol or
per server, or some kind of tagging (like RFC 1522/2047).
Obviously, all but the first solution are very clumsy.
Would you like to have an URL such as
	http://[us-ascii]www.ics.uci.edu/~fielding
or so? Probably not. Also, making encoding of URLs depend
on protocols and schemes will make generic URL software
very difficult and nonextensible. Having to contact a
server every time a transition from a binary form (%HH,...)
to a visible form with the actual characters (or backwards)
is done would be a true waste of connection bandwidth.


> Proposal 1c: Allow such characters, but only when encoded as UTF-8.
>              Clients may only display such characters if they have a
>              UTF-8 font or a translation table.

There are no UTF-8 fonts. And the new browsers actually have such
translation tables already, and know how to deal with the fonts
they have on their system. And those that don't won't be any worse
off than they are now.

>	Servers are required to
>              filter all generated URLs through a translation table, even
>              when none of their URLs use non-Latin characters.

Servers don't really generate URLs. They accept URLs in requests
and try to match them with the resources they have. The URLs get
created, implicitly, by the users who name resources and enter data.

Anyway, it is extremely easy and simple for a server to test
whether an URL contains only ASCII, and in that case not do any
kind of transcoding. And this test will be extremely efficient
and cheap.
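As a rough sketch of how cheap that test is, consider the following
Python fragment. The function name is_pure_ascii is an assumption made
for illustration; a real server would do the same check on its own
request structures.

```python
from urllib.parse import unquote_to_bytes


def is_pure_ascii(url_path: str) -> bool:
    """Return True if the path, after undoing %HH escapes, has no octet above 0x7F."""
    octets = unquote_to_bytes(url_path)
    return all(octet < 0x80 for octet in octets)


# An ordinary ASCII URL path: no transcoding step is needed at all.
print(is_pure_ascii("/~fielding/index.html"))
# A path with escaped octets beyond ASCII: only this kind needs more work.
print(is_pure_ascii("/%CE%CF%D7"))
```

The check is a single pass over the octets, so servers whose URLs are
all ASCII pay almost nothing for it.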

>	Browsers
>              are required to translate all FORM-based GET request data
>              to UTF-8, even when the browser is incapable of using UTF-8
>              for data entry.

Let's come back to forms later. They are a special case.

>	Authors must be aware that their
>              URL will not be widely accessible, and may not be safely
>              transportable via 7-bit protocols, but that is a reasonable
>              trade-off that only the author can decide.

What 7-bit protocols? The internet is 8-bit throughout. Mail is
7-bit, and that might be your concern. If you are worrying about
something else, please tell us. Now let's have a look at mail.
Assume a user finds a cute URL in a Japanese web page, and this
web page is written in EUC (used on Unix boxes) and comes down
to the browser in that encoding. Now let's assume the user
is on a Mac. When he copies the URL into the clipboard (or maybe
earlier), this URL is transcoded to Shift-JIS, because the Mac
internally uses Shift-JIS for Japanese. The last time I checked
this was quite some time ago, probably with Netscape 2.
The user then might copy this Japanese URL into a mail he writes
to a Japanese friend. When the mail is sent off, it is translated
to JIS, because that's the way Japanese mails are sent around.
The user might not have set his mail software correctly, and
then this might not be done, but then he won't be able to send
Japanese mail at all. Now JIS, in MIME called iso-2022-jp, is
7-bit, and that's why these characters will pass through a 7-bit
channel nicely. [The PC also uses Shift-JIS internally, and
everything should be pretty much the same, but I don't have
a box here to test it.]
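The chain of transcodings in this example can be sketched in Python as
follows. The sample text is an assumption chosen for illustration; the
codec names are the ones Python uses for EUC, Shift-JIS and
iso-2022-jp.

```python
# Hypothetical Japanese text standing in for part of an URL.
text = "日本語"

# As served by the Unix web server.
euc = text.encode("euc_jp")
# As held on the Mac clipboard.
sjis = text.encode("shift_jis")
# As sent in the mail (JIS, called iso-2022-jp in MIME).
jis = text.encode("iso2022_jp")

# The octets differ at every step ...
print(euc != sjis, sjis != jis)
# ... but every octet of the mail form fits in 7 bits ...
print(all(octet < 0x80 for octet in jis))
# ... and the characters are the same after each step.
print(euc.decode("euc_jp") == sjis.decode("shift_jis") == jis.decode("iso2022_jp"))
```

What survives the trip is the sequence of characters, not the sequence
of octets.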

>	Implementers
>              must also be aware that no current browsers and servers
>              work in this manner (for obvious reasons of efficiency),
>              and thus recipients of a message would need to maintain two
>              possible translations for every non-ASCII URL accessed.

With the exception of very dense namespaces such as with FORMs, it is
much easier to do transcoding on the server. This keeps upgrading
in one spot (i.e. a server can decide to switch on transcoding and
other things if its authors are giving out beyond-ASCII URLs).

>              In addition, all GET-based CGI scripts would need to be
>              rewritten to perform charset translation on data entry, since
>              the server is incapable of knowing what charset (if any)
>              is expected by the CGI.  Likewise for all other forms of
>              server-side API.

Again, this is forms. See later.


> Proposal 1a is represented in the current RFCs and the draft,
> since it was the only one that had broad agreement among the
> implementations of URLs.  I proposed Proposal 1b as a means
> to satisfy Martin's original requests without breaking all
> existing systems, but he rejected it in favor of Proposal 1c.

I showed above why Proposal 1b doesn't work. The computer->
paper->computer roundtrip that is so crucial for URLs is
completely broken.


My proposal is not identical to Proposal 1c. It leaves
everybody the freedom to create URLs with arbitrary
octets. It's a recommendation.


> I still claim that Proposal 1c cannot be deployed and will
> not be implemented, for the reasons given above.  The only
> advantage of Proposal 1c is that it represents the
> Unicode-uber-alles method of software standardization.

For document content, it is no problem to add a header
as in Email or HTTP. And you can even let the user guess
the encoding. For URLs, as said above, none of that works.


> Proposal 1b achieves the same result, but without requiring
> changes to systems that have never used Unicode in the past.
> If Unicode becomes the accepted charset on all systems, then
> Unicode will be the most likely choice of all systems, for the
> same reason that systems currently use whatever charset is present
> on their own system.

There is a strong need for interoperability. We don't want to
force anybody to cooperate, but there are quite a few users
who want to use their local encoding on their boxes, but still
want to be able to exchange URLs the way English speakers do
with basic ASCII. The only way to do this is to specify some
character encoding for interoperability, and the only such
available encoding is Unicode/ISO 10646/JIS X 0221/KS C 5700/...
It's not "just another character standard", it's THE international
standard to which all other character standards are aligned.


> Martin is mistaken when he claims that Proposal 1c can be implemented
> on the server-side by a few simple changes, or even a module in Apache.
> It would require a Unicode translation table for all cases where the
> server generates URLs, including those parts of the server which are
> not controlled by the Apache Group (CGIs, optional modules, etc.).

A CGI script can tell the server which encoding it wants, in the
same way that the filenames in a particular directory can be
identified as being in a certain encoding. Anyway,
for FORM/query part, please see below.


> We cannot simply translate URLs upon receipt, since the server has no
> way of knowing whether the characters correspond to "language" or
> raw bits.  The server would be required to interpret all URL characters
> as characters, rather than the current situation in which the server's
> namespace is distributed amongst its interpreting components, each of which
> may have its own charset (or no charset).

There is indeed the possibility that there is some raw data in an
URL. But I have to admit that I have never yet come across one. The
data: URL by Larry actually translates raw data to BASE64 for reasons
of efficiency and readability.
And if you study the Japanese example above, you will also see very
well that assuming that some "raw bits" get conserved is a silly idea.
Both HTML and paper, as the main carriers of URLs, don't conserve
bit identity; they conserve character identity. That's why the
draft says:

                                     The interpretation of a URL
   depends only on the characters used and not how those characters
   are represented on the wire.

This doesn't just magically stop at 0x7F!


> Even if we were to make such
> a change, it would be a disaster since we would have to find a way to
> distinguish between clients that send UTF-8 encoded URLs and all of those
> currently in existence that send the same charset as is used by the HTML
> (or other media type) page in which the FORM was obtained and entered
> by the user.

I have shown how this can work easily for sparse namespaces. The solution
is to test both raw and after conversion from UTF-8 to the legacy encoding.
This won't need many more accesses to the file system, because if a
string looks like correct UTF-8, it's extremely rare that it is something
else, and if it doesn't look like correct UTF-8, there is no need to
transcode.
For dense namespaces such as forms, see below.
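A minimal Python sketch of this double lookup for a sparse namespace is
given below. Everything in it is an assumption made for illustration:
the function name resolve, the use of a set of stored names in place of
a file system, and KOI8-R as the legacy encoding. It is not how any
existing server works.

```python
from typing import Optional, Set
from urllib.parse import quote, unquote_to_bytes


def resolve(url_path: str, stored_names: Set[bytes], legacy_charset: str) -> Optional[bytes]:
    """Look a request up raw first, then as UTF-8 converted to the legacy charset."""
    raw = unquote_to_bytes(url_path)
    if raw in stored_names:
        # Old-style client: the octets are already in the legacy encoding.
        return raw
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Does not look like correct UTF-8, so there is nothing to transcode.
        return None
    try:
        candidate = text.encode(legacy_charset)
    except UnicodeEncodeError:
        # Characters that the legacy encoding cannot represent.
        return None
    return candidate if candidate in stored_names else None


# Hypothetical server whose resource names are stored as KOI8-R octets.
name = "новости"
stored = {name.encode("koi8-r")}

via_legacy = resolve(quote(name.encode("koi8-r")), stored, "koi8-r")
via_utf8 = resolve(quote(name.encode("utf-8")), stored, "koi8-r")
print(via_legacy == via_utf8)
```

Both requests end up at the same stored name, and a request that is
neither a stored name nor correct UTF-8 costs only the one raw lookup.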


> The compromise that Martin proposed was not to require UTF-8, but
> merely recommend it on such systems.  But that doesn't solve the problem,
> so why bother?

There are several reasons for merely recommending UTF-8:

- Don't want to force anybody to use Unicode/ISO 10646.
- Don't want to make old URLs illegal, just have a smooth
	transition strategy.
- Fits better into a Draft Standard.



> PROBLEM 2:  When a browser uses the HTTP GET method to submit HTML
>             form data, it url-encodes the data within the query part
>             of the requested URL as ASCII and/or %xx encoded bytes.
>             However, it does not include any indication of the charset
>             in which the data was entered by the user, leading to the
>             potential ambiguity as to what "characters" are represented
>             by the bytes in any text-entry fields of that FORM.

This is indeed the FORMs/Query part problem. It's harder because
there is more actual use of beyond-ASCII characters, and because
the namespace is dense. But it is also easier because it is mainly
a problem between server and browser, with a full roundtrip to paper
and email only in rare cases.

I like the list of proposals that Roy has made below, because we
came up with a very similar list at the last Unicode conference
in Mainz.

> Proposal 1a:  Let the form interpreter decide what the charset is, based
>               on what it knows about its users.  Obviously, this leads to
>               problems when non-Latin charset users encounter a form script
>               developed by an internationally-challenged programmer.
> 
> Proposal 1b:  Assume that the form includes fields for selecting the data
>               entry charset, which is passed to the interpreter, and thus
>               removing any possible ambiguity.  The only problem is that
>               users don't want to manually select charsets.
> 
> Proposal 1c:  Require that the browser submit the form data in the same
>               charset as that used by the HTML form.  Since the form
>               includes the interpreter resource's URL, this removes all
>               ambiguity without changing current practice.  In fact,
>               this should already be current practice.  Forms cannot allow
>               data entry in multiple charsets, but that isn't needed if the
>               form uses a reasonably complete charset like UTF-8.

This is mostly current practice, and it is definitely a practice
that should be pushed. At the moment, it should work rather well,
but the problems appear with transcoding servers and proxies.
For transcoding servers (there are a few out there already),
the transcoding logic or whatever has to add some field (usually
a hidden field in the FORM) that indicates which encoding was
sent out. This requires a close interaction of the transcoding
part and the CGI logic, and may not fit well into a clean
server architecture. For a transcoding proxy (none out there
yet to my knowledge, but perfectly possible with HTTP 1.1),
the problem gets even worse.


> Proposal 1d:  Require that the browser include a <name>-charset entry
>               along with any field that uses a charset other than the one
>               used by the HTML form.  This is a mix of 1b and 1c, but
>               isn't necessary given the comprehensive solution of 1c
>               unless there is some need for multi-charset forms.
> 
> Proposal 1e:  Require all form data to be UTF-8.  That removes the
>               ambiguity for new systems, but does nothing for existing
>               systems since there are no browsers that do this.  Of course
>               they don't, because the CGI and module scripts that interpret
>               existing form data DO NOT USE UTF-8, and therefore would
>               break if browsers were required to use UTF-8 in URLs.
> 
> The real question here is whether or not the problem is real.  Proposal 1c
> solves the ambiguity in a way which satisfies both current and any future
> practice, and is in fact the way browsers are supposed to be designed.

Your analysis is close to perfect, but you forget transcoding
proxies. As it is damn clumsy to have all these tables in all
browsers and so on, we might very soon see browsers that request
UTF-8 only and that rely on transcoding proxies to deal with
servers with legacy encodings. As long as conversion happens
downstream, but URLs are treated as raw data upstream,
that proposal breaks down.

When we discussed this FORMS/Query part in Mainz in March, we
also seemed to get stuck in this problem. But we found a way
out. It works as follows:

We adopt Proposal 2c and an upgrade path to UTF-8. The upgrade
consists of a token of information from the FORM to the client,
and a token of information from the client back.

For the information from the server to the client, we could
recycle the proposed ACCEPT-CHARSET *attribute* on INPUT
and such from RFC 2070. The idea of this attribute was to
be able to indicate for each input field the character encodings
that would be acceptable to the server. As Roy correctly says,
this is unnecessary overkill. A study cited by Peter Edberg
from Apple in his Mainz Unicode conference paper did not show
a single use of this attribute. But the study is from November
1995, so it may be dated.
Anyway, recycling the ACCEPT-CHARSET attribute would mean that
it is only used as ACCEPT-CHARSET="UTF-8", and applies to
all relevant fields of a FORM when it appears.
Because it is rather long and may already be in use somewhere,
an alternative is to define another attribute or a HTTP header
for the same purpose.

To send information back from the browser, a HTTP header would
probably also be the best solution. Ideally, it would be
	FORM-UTF-8: Yes
This has the advantage of being short, and of being easily
implementable in its opposite, namely
	FORM-UTF-8: No
so that after a few years of transition, we can phase out this
header. It would be sent back only if UTF-8 is requested, of
course. The result would be Proposal 2c and UTF-8 if both
server and client can handle it. On the server side, especially
if we make the information from the server to the client an HTTP
header, it can be handled without bothering the CGI script
(other than that it has to tell us which encoding it wants the
data in).
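On the server side, the handling could look roughly like the Python
sketch below. Note that FORM-UTF-8 is only the header proposed in this
message, not an existing HTTP header, and that the function name
decode_field and the plain dictionary of headers are assumptions made
for illustration.

```python
from typing import Dict
from urllib.parse import unquote_to_bytes


def decode_field(raw_value: str, headers: Dict[str, str], form_charset: str) -> str:
    """Decode one url-encoded form field into characters.

    If the client sent the proposed (hypothetical) "FORM-UTF-8: Yes"
    header, the octets are UTF-8. Otherwise they are assumed to be in the
    charset in which the FORM was sent out.
    """
    octets = unquote_to_bytes(raw_value.replace("+", " "))
    if headers.get("FORM-UTF-8", "").strip().lower() == "yes":
        return octets.decode("utf-8")
    return octets.decode(form_charset)


# Upgraded browser: it announces UTF-8 and sends UTF-8 octets.
new_style = decode_field("%E6%97%A5", {"FORM-UTF-8": "Yes"}, "iso-8859-1")
# Old browser: no header, octets in the charset of the form.
old_style = decode_field("caf%E9", {}, "iso-8859-1")
print(new_style, old_style)
```

The CGI script only ever sees characters (or octets in the one encoding
it asked for), whichever kind of browser sent the request.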

> Yes, early browsers were broken in this regard, but we can't fix early
> browsers by standardizing a proposal that breaks ALL browsers.

Current practice is broken. We have to fix it. And we can do it
without breaking anything.


> The only
> reasonable solution is provided by Proposal 1c and matches what non-broken
> browsers do in current practice.  Furthermore, IT HAS NOTHING TO DO WITH
> THE GENERIC URL SYNTAX, which is what we are supposed to be discussing.

Generic URL syntax assumes that URLs are handled as characters for
the ASCII range and as octets for the rest. This doesn't work,
and can be fixed.


> The above are the only two problems that I've heard expressed by Martin
> and Francois -- if you have others to mention, do so now.

You got that correct in principle. But what you propose as solutions
doesn't work the way you think it would.


> Neither of the
> above problems are solved by requiring the use of UTF-8, which is what
> Larry was saying over-and-over-and-over again, apparently to deaf ears.

The problems can be solved by recommending the use of UTF-8, and
taking care of the details. And for a truly WORLD Wide Web, the
problems should be solved. It is clear that Latin and English have
a bit of a head start when it comes to international communication.
But there is no serious technological obstacle anymore to letting
people use their own language and script.


> I think both of us are tired of the accusations of being ASCII-bigots
> simply because we don't agree with your non-solutions.

Neither you nor Larry said exactly where you saw the problems.
Your behaviour was very difficult for me to explain, and this led
to certain suspicions and accusations. I hope you will study what
I wrote above in detail, and see where your ideas for solutions
don't work, and where our proposals may work better than you
thought. I am looking forward to discussing the details.


> Either agree to
> a solution that works in practice, and thus is supported by actual
> implementations, or we will stick with the status quo, which at least
> prevents us from breaking things that are not already broken.

Well, I repeat my offer. If you help me get started with Apache,
or tell me whom to contact, I would like to implement what I have
described above. The deadline for submitting abstracts to the
Unicode conference in San Jose in September is at the end of this
week. I wouldn't mind submitting an abstract there with a title
such as "UTF-8 URLs and their implementation in Apache".


Regards,	Martin.
Received on Tuesday, 15 April 1997 10:34:53 UTC