Re: I18N Consensus - Generic Syntax Document

Martin J. Duerst (mduerst@ifi.unizh.ch)
Fri, 7 Mar 1997 14:50:36 +0100 (MET)


Date: Fri, 7 Mar 1997 14:50:36 +0100 (MET)
From: "Martin J. Duerst" <mduerst@ifi.unizh.ch>
To: "Roy T. Fielding" <fielding@kiwi.ICS.UCI.EDU>
Cc: URI List <uri@bunyip.com>
Subject: Re: I18N Consensus - Generic Syntax Document
In-Reply-To: <9703070137.aa29868@paris.ics.uci.edu>
Message-Id: <Pine.SUN.3.95q.970307134328.245D-100000@enoshima>

Hello Roy,

Many thanks for voicing your concerns and giving me a chance
to answer them.

On Fri, 7 Mar 1997, you wrote:

> >+ It is recommended that UTF-8 [RFC 2044] be used to represent characters
> >+ with octets in URLs, wherever possible.
> >
> >+ For schemes where no single character->octet encoding is specified,
> >+ a gradual transition to UTF-8 can be made by servers make resources
> >+ available with UTF-8 names on their own, on a per-server or a
> >+ per-resource basis. Schemes and mechanisms that use a well-
> >+ defined character->octet encoding which is however not UTF-8 should
> >+ define the mapping between this encoding and UTF-8, because generic
> >+ URL software is unlikely to be aware of and to be able to handle
> >+ such specific conventions.
> 
> Here is where you lose me.

Don't worry. I hope we will have you back soon again :-).

> I have no desire to add a UTF-8 character
> mapping table to our server.

There is no need to do so. The above is only a *recommendation*.
For your server, if you:

- Don't have, and don't expect to have, anything other
	than ASCII
- Think the URLs used on your server aren't related
	to characters
- Think that it's too difficult to find out which resource
	name is in which encoding
- Think that you need a really small server and the tables
	you would need would be too large
- Think that people using your URLs don't care about knowing
	what characters are behind them anyway
- Think that English (or Swahili) is enough to serve the
	world's needs, and everybody should learn English
	for common communication or to make life easier
	for software engineers
- Are just too lazy, have other priorities, don't have
	the necessary expertise, and so on

In all those cases, and probably quite a few more, you don't
need to add UTF-8 character mapping facilities to your server.
Of course, it is still nice if you try to do it.

In addition, for systems that are already Unicode-based,
such as Plan 9, the Newton, Windows NT, Java, and so on, you
don't need any tables, just some really short piece of code.



> An HTTP server doesn't need one -- its URLs are
> either composed by computation (in which case knowing the charset is not
> possible) or by derivation from the filesystem (in which case it will use
> whatever charset the filesystem uses, and in any case has no way of
> determining whether or not that charset is UTF-8).

It's not the HTTP server that causes the need to have characters
encoded in some defined way. It's the users that want to know what's
behind an URL, a facility which is so obviously useful to English
users that they might not even notice it, but which is not consistently
available to others.

Anyway, "computation" covers very many things, and quite a few
of them involve character manipulation; in those cases, you
usually know (explicitly or implicitly) what character encoding
you are dealing with. If you don't, there are not many useful
computations you can make with characters. Perhaps you can be
more specific about what you mean by computation.

For filesystems, there are quite different kinds. You are probably
assuming a UNIX-like file system, where the interpretation of
filenames in terms of characters depends on the font settings in
your xterm or the font in your glass tty ROM. There are other systems
where the interpretation of filenames in terms of characters is
very clearly defined (see above), so your argument is not general.
Anyway, there are various ways to determine the character encoding
of a filename on a UNIX-like system. The problem is quite similar
to determining the character encoding of the resources themselves:
we know that it's not easy, but we know that it's the right thing
to do, and that we have to find means to make this easier on such
systems.

Also, there are probably servers available on IBM hosts,
where filenames are in EBCDIC. What do those servers do?
Do they accept URLs based on octet identity of filenames?
Or do they do conversion, so that users get what they
expect, namely character identity?
For an ASCII URL such as:
	http://www.ibmmain.com/Fielding.html
Do you expect this to look as above, or do you think
it is (or should be)
	http://www.ibmmain.com/%C6%89%85%93%84%89%95%87K%88%A3%94%93
because the server is too lazy to convert from/to EBCDIC?
Or would you like, as a data provider, to calculate the
names in EBCDIC so that you don't know what they mean, but
they appear as meaningful URLs to outside users?

I guess the only sensible answer here, even for you, is that
the server does conversion. What we are proposing with UTF-8
is just that not only English/Latin users get this nice and
natural service, but that there is at least the *possibility*
that others can establish it, too, even if you yourself don't
want to get involved in it.

Of course, there is some danger that after some time, with
enough UTF-8 servers and clients around, users will just
expect that it works on all servers and clients, and that
you might get forced by your user base to do some implementation.
But that's probably a long time ahead, and would just be
the ultimate proof of the desirability of a consistent
character encoding in URLs, and not an argument to
try to avoid it.


> The server doesn't care
> and should not care.

The users care, and that's why the servers probably should care.
Or do you just serve random data, because anyway the server
doesn't care whether the users get something reasonable?


> It is therefore inappropriate to suggest that it should
> add such a table when doing so would only bloat the server and slow-down
> the URL<->resource mapping process.

There is no suggestion to add a table. Implementation is up to you.
A table-based mapping can be extremely fast. It won't slow down
the process if it's done correctly. Depending on the character
encodings you have on your server, tables don't have to be very
large; there are numerous very efficient techniques for sparsely
populated tables. Also, there is no need to have the conversion
inside the server. For example, if your server is file-based,
you can have a small program running once a day that for every
filename in your legacy encoding creates a link using UTF-8. Here,
the disadvantage of UNIX-like file systems that don't have a defined
character encoding for filenames turns into an advantage. This
makes the resources available under both the legacy-encoded URLs
and the UTF-8 encoded URLs, which is a nice way to provide a smooth
upgrade path (as discussed in my original outline).



> >>    Data corresponding to excluded characters must be escaped in order
> >>    to be properly represented within a URL.  However, there do exist
> >>    some systems that allow characters from the "unwise" and "national"
> >>    sets to be used in URL references (section 3); a robust
> >>    implementation should be prepared to handle those characters when
> >>    it is possible to do so.
> >
> >Change to:
> >
> >There exist some systems that allow characters/octets from the
> >"unwise" and "others" sets to be used in URL references (section 3).
> >Until a uniform representation for characters within URLs is firmly
> >established, such practice is not stable with respect to transcoding
> >and therefore should be avoided.
> >However, robust implementations should be prepared to handle those
> >octet values when it is possible to do so.
> 
> No thanks -- the existing paragraph is far better.  Transcoding is
> not an issue unless they are already violating the specification,
> in which case they are prepared to suffer the consequences.
> The purpose of the paragraph is to prevent an implementer from
> interpreting the spec too literally and crashing on a non-urlc
> character.

The problem is that a lot of them are currently prepared to
"suffer" the consequences because it just works; there are
no visible consequences. And as long as it works, people
will continue to use it because it provides some very convenient
features to them; just disallowing it officially won't keep
them from using it. Telling them where and why it will stop
working will hopefully let some of them understand and will
have them (for the time being, at least) discontinue this
practice.


Regards,	Martin.