Re: revised "generic syntax" internet draft from Martin J. Duerst on 1997-04-23 (uri@w3.org from April 1997)

From: Martin J. Duerst <mduerst@ifi.unizh.ch>
Date: Wed, 23 Apr 1997 18:06:07 +0200 (MET DST)
To: "Roy T. Fielding" <fielding@kiwi.ICS.UCI.EDU>
Cc: uri@bunyip.com
Message-Id: <Pine.SUN.3.96.970420170754.245C-100000@enoshima>

The message I am answering here dates back a while, but
it raises some important issues, and I don't want to
leave it unanswered (and I am very interested in getting
many of the points below answered). If necessary, please split
your messages up by topic.

On Tue, 15 Apr 1997, Roy T. Fielding wrote:

> >What "fundamental incompatibility"? Is a recommendation suggesting
> >the use of a particularly well suited character encoding a
> >"fundamental incompatibility" when at present we don't know
> >the character encoding anyway?
> 
> Yes, because at present we don't tell the client to transcode the URL.
> Any transcoding is guaranteed to fail on some systems, because the
> URL namespace has always been private to the generator (the server in
> "http" or "ftp" or "gopher" URLs, the filesystem in "file" URLs, etc.).

As long as current pages contain only URLs with %HH for "dangerous"
octets, there is no transcoding (except for ASCII<->EBCDIC and the
like, which we will ignore here). And this is currently the only
legal use.

After we have firmly established UTF-8 as a recommendation for
URLs, we can then go on and allow URLs in native encoding.
These will be transcoded wherever transcoding of the carrying
document happens, and will finally be transcoded into UTF-8
(and converted to %HH if necessary) before being sent to
the server. This covers all currently legal "moved around"
URLs.

For the currently non-legal "moved around" URLs and for the
URLs generated at the browser (FORMs), the solution works
as follows:

For non-legal "moved around" URLs (please note that according
to Roy's attitude to standards, we wouldn't be required to
take care of them, but if we can, why shouldn't we), after
trying with transcoding to UTF-8 as described above, we try
without transcoding. This covers the case that we have received
the document from its original source without any intermediate
transcoding (which cannot be guaranteed, but should be fairly
common at present). We have a second network round-trip, but
as this is only to recover an illegal case, it's not too bad.
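
A minimal sketch in Python of this "try UTF-8 first, then raw" order.
It is an illustration, not part of the proposal text. The name
candidate_request_paths and its parameters are hypothetical, and it
assumes the browser still has the URL path as the octets that appeared
in the document, plus the charset the document was received in.

```python
from urllib.parse import quote


def candidate_request_paths(raw_path: bytes, doc_charset: str) -> list:
    """Return the request paths to try, in order, for one URL path.

    raw_path    -- the path exactly as its octets appeared in the document
    doc_charset -- the charset the document was received in (assumption)
    """
    candidates = []
    try:
        # First choice: read the octets as characters, re-encode as UTF-8.
        text = raw_path.decode(doc_charset)
        candidates.append(quote(text.encode("utf-8"), safe="/%"))
    except (UnicodeDecodeError, LookupError):
        pass  # octets do not fit the claimed charset; only the raw form is left
    # Fallback: the original octets, untouched apart from %HH escaping.
    raw = quote(raw_path, safe="/%")
    if raw not in candidates:
        candidates.append(raw)
    return candidates
```

For an ASCII-only path both forms coincide, so only one request is made;
the second round-trip happens only for the currently illegal case.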

For URLs generated at the browser (FORMs), we have to
exchange some information between server and browser
(FORM-UTF8: Yes). This again covers two cases, namely
the case that the server can handle UTF-8 and the case
that the server and the browser use the same charset.


> What is more likely: the client
> knows how to transcode from the data-entry dialog charset to UTF-8,
> or the user is using the same charset as the server?  On my system, the
> latter is more likely.  I suspect that this will remain an interoperability
> problem for some time, regardless of what the URL standard says.

What kind of system are you using?
And what kind of characters on that system?

Anyway, let's have a look at some cases. For Western Europe,
the Mac, DOS, Windows, and Unix boxes all use their own code pages.
Unix is mostly Latin-1 now, but there are some legacy systems
(you can have the old HP encoding on a HP box,...). Windows CP 1252
is almost, but not quite, equal to 8859-1. For Eastern Europe and
for Cyrillic, the situation is worse. For Japanese, you have EUC
on Unix and SJIS on PC/Mac. I could go on and on. And this won't
improve very quickly in the next few years. To deploy UTF-8
conversion capabilities where they are not yet available is
much easier.

The situation is actually a little bit better because of the
fact that Latin-1 is very well established for the Web (thanks
TBL for this one!). For Western Europe, things therefore work
despite having different charsets on different machines, because
somebody realized that characters, and not octets, have to be
preserved. The only thing we have to do now is to realize
that this was actually a very good idea, and to just apply
it to the whole world, while guaranteeing smooth transition
from the mess we have now.


> Proposal 1b allows cooperating systems to have localized URLs that work
> (at least locally) on systems deployed today.

The web never was local, and will never be. Something that only
works locally (where locally means a given language/script *and*
a given kind of computer) is a dead end for the web. What we need
is technology that can be made to work wherever there
are users who want to use it.



> >> Proposal 1c: Allow such characters, but only when encoded as UTF-8.
> >>              Clients may only display such characters if they have a
> >>              UTF-8 font or a translation table.
> >
> >There are no UTF-8 fonts. And the new browsers actually have such
> >translation tables already, and know how to deal with the fonts
> >they have on their system. And those that don't, they wont be worse
> >off than up to now.
> 
> Unless they are currently using iso-8859-1 characters in URLs, on pages
> encoded using iso-8859-1, which are also displayed correctly by far more
> browsers than just the ones you refer to.  Likewise for EUC URLs on
> EUC-encoded pages, and iso-2022-kr URLs on iso-2022-kr-encoded pages.
> The fact is, these browsers treat the URL as part of the HTML data
> stream and, for the most part, display it according to that charset
> and not any universal charset.

It is very clear to all of us that such URLs should be treated in
the charset of the document *as long as they are part of the document*.
The discussion with Keld just recently confirmed this.
Anything else would give big headaches everywhere.

The question is what happens when the URLs are passed from the HTML
document to the URL machinery in the browser. If they are interpreted
"as is", i.e. they are seen as octets, that works as long as the
HTML document came straight from the server and was set up carefully.
However, if there is a transcoding proxy, or transcoding happened
already on the server, or the URL took some other steps from the
"point of generation" (the original filename or whatever) to the
HTML page, for example by having been cut-and-pasted or by having
been transcribed on paper or by email, then nothing is guaranteed.
And that's why these URLs are currently illegal :-).


> >>	Servers are required to
> >>              filter all generated URLs through a translation table, even
> >>              when none of their URLs use non-Latin characters.
> >
> >Servers don't really generate URLs. They accept URLs in requests
> >and try to match them with the resources they have. The URLs get
> >created, implicitly, by the users who name resources and enter data.
> 
> The Apache source code is readily available and includes, as distributed,
> five different mechanisms that generate URLs: directory listings,
> configuration files (<Location>), request rewrite modules
> (redirect/alias/rewrite), request handling modules (imap), and CGI scripts.
> The first two are part of the server core and related to the filesystem,
> and thus could be mapped to a specific charset and thereby to a translation
> table per filesystem, with some overhead. The next two (modules) can be
> plugged-in or out based on user preference and there is no means for the
> server to discover whether or not they are generating UTF-8 encoded URLs,
> so we would have to assume that all modules would be upgraded as well.
> CGI scripts suffer the same problem, but exacerbated by the fact that CGI
> authors don't read protocol specs.

Many thanks for these details. One can indeed say that in these
cases, the server is generating URLs. Another way to see it is
that the server sends out URLs that already exist somewhere.
But this is a theoretical discussion.

For the implementation, I think there are three important aspects
we have to consider. One is the question of "what can we do
between the original point where we get the URL (or whatever
we call it) and the point where it is sent to the browser?".
This is discussed above nicely by Roy. The second question is
"Where do these URLs actually come from?". The third question
is "where do these URLs point to?".

For the second question, we have several possibilities. In the
case of directory listings, it's the file system itself that
gives us the data; in the case of rewrite and handling modules,
the data comes from files on the server. In both cases, we can
assume that we know (per server, per directory, or per file)
what encoding is used, so that we can make the necessary
transformations. In the case of CGI scripts, we can make the
assumption that each CGI script likewise has its "charset".
Scripts that deal with several "charset"s (such as those
in Japan that have to guess what they get sent in a query
part) are not that much of a problem, because their authors
know the issues, and will be happy if their job gets easier.

The question of "where do these URLs point to" is important
because, for example, if we get a URL in Latin-2 from a file,
we could convert that to UTF-8 if we know that it points to
our server, because we know that we accept UTF-8. On the other
hand, if it points to another server, we should be careful with
converting it, because we don't know whether that other server
accepts UTF-8 or not. This distinction is not too difficult
if we do a full parse of an HTML document, but that is probably
too much work to do, and for some other kinds of documents, it
may not be that easy (and we don't want, on a server, to know
all document types).
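
A small Python sketch of this distinction, assuming the URL has already
been decoded from the file's encoding (e.g. Latin-2) into characters.
The function name rewrite_outgoing_url and the own_hosts parameter are
hypothetical; nothing like this exists in Apache as far as this message
is concerned.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def rewrite_outgoing_url(url: str, own_hosts: set) -> str:
    """Convert the path to UTF-8 + %HH only if the URL points to our server.

    own_hosts is the set of host names this server answers for (assumption).
    Relative URLs (no host part) are treated as pointing to our own server.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if parts.netloc and host not in own_hosts:
        # Another server: we do not know whether it accepts UTF-8,
        # so the URL is passed along unchanged.
        return url
    # Our own server: we know it accepts UTF-8, so convert.
    # "%" stays in the safe set so existing %HH escapes are not touched.
    path = quote(parts.path, safe="/%", encoding="utf-8")
    return urlunsplit(parts._replace(path=path))
```

This only handles a single URL that has already been extracted; finding
the URLs inside a document is exactly the parsing work described above
as probably too expensive.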

One important point is of course how the server currently
handles these things. For example, are non-ASCII URLs converted
to %HH, are non-ASCII URLs checked for and complained about,
or what is being done? In the case of redirects, the URL is
escaped; in the case of mod_imap, I failed to find the
place where this is done.


Anyway, what can we do? As Roy has said, for the internal
stuff, we can definitely do more than for CGIs. For CGIs,
and this is the last line of defense also for the rest,
we can probably do two things:

- Either leave the CGI alone, in its own world, with its
	own "charset", and not do any transcoding.
	We may call this the raw mode.
- Have the CGI "register" with some Apache settings to
	tell us what "charset" it is working in, so that
	we can do the appropriate translations. This
	registration might be done separately for incoming
	(query part) and outgoing (result) stuff.

We also have to distinguish between old and new CGIs. Old
CGIs should be left alone (whatever URLs they generate
for our server will be cared for by our backwards compatibility
measures), new CGIs should be correctly registered and should
only work with the new URLs.


Another thing that the Apache group should try to think about
(I would definitely be willing to help) is how to write out,
in different languages, all those pages that contain redirects
and the like and that in some cases get seen by the user.
This would be a great service to the various
users, and could probably be implemented with very few
additional utility functions. This would then also determine
the "charset" of the outgoing message, and this in turn
would help to know what exactly to do with URLs for which
we know what characters they represent.


> I am not talking theoretically here -- the above describes
> approximately 42% of the installed base of publically accessible
> Internet HTTP servers.  It would be nice to have a standard that
> doesn't make them non-compliant.

We definitely agree here. And as there is no requirement,
just a *recommendation*, that won't happen. The exercise
in backwards compatibility and upgrading strategy we are
doing above is a very good thing to do in my eyes, but
it is not necessary for the UTF-8 for URL recommendation
to be workable. That a server stays exactly the way it is
now is something that is completely accounted for in our
proposal. And for new server installations, we don't have
to worry about CGI scripts making assumptions that were
never guaranteed by the standard.


> >>	Implementers
> >>              must also be aware that no current browsers and servers
> >>              work in this manner (for obvious reasons of efficiency),
> >>              and thus recipients of a message would need to maintain two
> >>              possible translations for every non-ASCII URL accessed.
> >
> >With exception of very dense namespaces such as with FORMs, it is
> >much easier to do transcoding on the server. This keeps upgrading
> >in one spot (i.e. a server can decide to switch on transcoding and
> >other things if its authors are giving out beyond-ASCII URLs).
> 
> As a server implementer, I say that claim is bogus.  It isn't even
> possible to do that in Apache.  Show me the code before trying to
> standardize it.

Well, I don't have the code ready. But our basic problem is the
following: We have a file system in encoding X, and we want to be
prepared to receive URLs in encoding X (for backwards compatibility
issues) and in UTF-8. We can easily do this with a module that combines
rewrite and subrequests. Here is the sketch of an algorithm that
returns a (possibly) valid local URL:

	input: url as it came in.
	if (url is ASCII-only)
		return url;
	if (url is valid UTF-8) {
		url2 = convert-to-X(url);
		if (subrequest with url2 successful)
			return url2;
		else	/* may be encoding X that happens to look like UTF-8 */
			return url;
	}
	else	/* url can only be in encoding X, or wrong */
		return url;
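
The same algorithm as a runnable Python sketch. It assumes the request
path has already been %HH-unescaped into octets. The names
resolve_local_path and legacy_charset, and the exists callable that
stands in for the subrequest, are hypothetical and are not Apache APIs.

```python
def is_valid_utf8(octets: bytes) -> bool:
    """True if the octets form a well-formed UTF-8 sequence."""
    try:
        octets.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def resolve_local_path(path: bytes, legacy_charset: str, exists) -> bytes:
    """Map incoming request octets to a (possibly) valid local path.

    legacy_charset -- encoding X of the file system (assumption)
    exists         -- callable(bytes) -> bool, stand-in for the subrequest
    """
    if path.isascii():
        return path
    if is_valid_utf8(path):
        try:
            candidate = path.decode("utf-8").encode(legacy_charset)
        except UnicodeEncodeError:
            return path  # characters that encoding X cannot represent
        if exists(candidate):
            return candidate
        return path  # may be encoding X that happens to look like UTF-8
    return path  # can only be in encoding X, or wrong
```

With a file system holding b"/caf\xe9" in Latin-1, both the UTF-8 request
b"/caf\xc3\xa9" and the legacy request b"/caf\xe9" resolve to the same file.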

> >I showed above why Proposal 1b doesn't work. The computer->
> >paper->computer roundtrip that is so crucial for URLs is
> >completely broken.
> 
> Not completely.  It only breaks down when you transmit localized
> URLs outside the local environment.  That is the price you pay.

We are not ready to pay this price. It strongly discriminates
against non-English and non-Latin users. It is not worthy of a
real WORLD-wide web. If we have to pay it, there is something
wrong with the design of URLs. We know that the solution might
not just happen magically, but that's why we are working on it.


> >My proposal is not identical to Proposal 1c. It leaves
> >everybody the freedom to create URLs with arbitrary
> >octets. It's a recommendation.
> 
> Then it doesn't solve the problem.

Well, it solves the problem for those who want the problem
solved. That's what counts. Currently, even those that
want the problem solved, and that know how this can be done,
can't solve it.


> If you want interoperability, you must use US-ASCII.  Getting 
> interoperability via Unicode won't be possible until all systems
> support Unicode, which is not the case today.  Setting out goals
> for eventual standardization is a completely different task than
> actually writing what IS the standard, which is why Larry asked
> that it be in a separate specification.  If you are convinced that
> Unicode is the only acceptable solution for FUTURE URLs, then
> write a proposed standard for FUTURE URLs.  To make things easier,
> I have proposed changing the generic syntax to allow more octets
> in the urlc set, and thus not even requiring the %xx encoding.
> In that way, if UTF-8 is someday accepted as the only standard,
> then it won't conflict with the existing URL standard.

If the standard gets changed so that %HH is no longer necessary,
this has to be done in a very careful way. If it is done carelessly,
it will do more harm than benefit.

Actually, if we took the position that octets beyond 0x7F
without %HH-encoding are illegal and therefore don't need
backwards compatibility, and if we changed the standard as you
propose to allow more characters (not octets), then we could
deal nicely with everything, and wouldn't even need all those
backwards compatibility things we were speaking about.
We just declare the following:
	- Whatever is in %HH should be handled as octets and
		passed along as such in %HH form.
	- Whatever is characters (outside ASCII) should be
		treated as characters and passed along as
		such. When it is submitted to a server, the
		characters should be encoded as UTF-8.
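
These two rules can be sketched in a few lines of Python. The function
name to_wire_form is hypothetical, and the sketch makes one simplifying
assumption: every "%" in the input is taken to start an existing %HH
escape and is passed along untouched.

```python
from urllib.parse import quote


def to_wire_form(url_path: str) -> str:
    """Apply the two rules to a URL path held as characters.

    - existing %HH escapes are octets and pass through unchanged
      (because "%" is in the safe set);
    - characters outside ASCII are encoded as UTF-8 and written as %HH.
    """
    return quote(url_path, safe="/%", encoding="utf-8")
```

For example, to_wire_form("/data/%E9/caf\u00e9") leaves the %E9 octet as
it is and turns the final character into %C3%A9.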


> >> We cannot simply translate URLs upon receipt, since the server has no
> >> way of knowing whether the characters correspond to "language" or
> >> raw bits.  The server would be required to interpret all URL characters
> >> as characters, rather than the current situation in which the server's
> >> namespace is distributed amongst its interpreting components, each of which
> >> may have its own charset (or no charset).
> >
> >There is indeed the possibility that there is some raw data in an
> >URL. But I have to admit that I never yet came across one. The data:
> >URL by Larry actually translates raw data to BASE64 for efficiency
> >and readability reasons.
> >And if you study the Japanese example above, you will also very well
> >see that assuming that some "raw bits" get conserved is a silly idea.
> >Both HTML and paper, as the main carriers of URLs, don't conserve
> >bit identity; they converve character identity. That's why the
> >draft says:
> >
> >                                     The interpretation of a URL
> >   depends only on the characters used and not how those characters
> >   are represented on the wire.
> >
> >This doesn't just magically stop at 0x7F!
> 
> Transcoding a URL changes the bits.  If those bits were not characters,
> how do you transcode them?  Moreover, how does the transcoder differentiate
> between %xx as data and %xx as character needing to be transcoded to UTF-8?
> It can't, so requiring UTF-8 breaks in practice.

No. If something is %HH, it is always octets, and doesn't have to be
transcoded (it would be rather difficult to change current transcoders
to do that). This is part of our original proposal, and it seems that
you didn't understand this until now. What is transcoded, however, is
the URLs that are (currently illegally) sent as actual characters.
The assumption is of course that for those URLs that are actual data,
the %HH escaping is always used, and only for those that represent
characters (the big majority), actual characters outside ASCII are
used.



> >> Even if we were to make such
> >> a change, it would be a disaster since we would have to find a way to
> >> distinguish between clients that send UTF-8 encoded URLs and all of those
> >> currently in existence that send the same charset as is used by the HTML
> >> (or other media type) page in which the FORM was obtained and entered
> >> by the user.
> >
> >I have shown how this can work easily for sparce namespaces. The solution
> >is to test both raw and after conversion from UTF-8 to the legacy encoding.
> >This won't need many more accesses to the file system, because if a
> >string looks like correct UTF-8, it's extremely rare that it is something
> >else, and if doesn't look like correct UTF-8, there is no need to
> >transcode.
> 
> You keep waving around this "easily" remark without understanding the
> internals of a server.  If I thought it was easy to do those things,
> they would have been implemented two years ago.

Well, the whole thing was discussed at length in ftp-wg, and it was
decided that accepting filenames both in a legacy encoding and in UTF-8,
and figuring out which of the two was used, is definitely implementable
for an ftp server. Please read the newest ftp internationalization draft
or the archives of the group. Of course, HTTP and FTP are not the same,
but in many ways, they are similar.


> What do you do if you have two legacy resources, one of which has the
> same octet encoding as the UTF-8 transcoding of the other?

This can indeed happen, but it is extremely rare. For some legacy
encodings (e.g. KOI-8 which is very popular in Russia), the
probability is zero. For Latin-1, the situation is the following:
The legal Latin-1 sequences that are also legal UTF-8 sequences
and that when interpreted as UTF-8 only contain characters from
the Latin-1 character set are two-letter sequences of the following
form:

	The first letter is an A-^ (A circumflex, 0xC2) or an
	A-~ (A tilde, 0xC3).
	The second letter is 0xAx or 0xBx, i.e. not a letter
	but a symbol such as inverted question mark, 1/4, +-,
	cent sign, and so on.
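
The Latin-1 overlap can be checked mechanically. The following Python
sketch (function names are made up for illustration) enumerates every
pair of printable non-ASCII Latin-1 octets (0xA0 to 0xFF) that is also
well-formed UTF-8 and decodes to a character inside Latin-1.

```python
def is_valid_utf8(octets: bytes) -> bool:
    """True if the octets form a well-formed UTF-8 sequence."""
    try:
        octets.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def latin1_pairs_valid_as_utf8() -> list:
    """Two-octet Latin-1 sequences that are also UTF-8 for a Latin-1 character."""
    hits = []
    for first in range(0xA0, 0x100):
        for second in range(0xA0, 0x100):
            pair = bytes([first, second])
            if not is_valid_utf8(pair):
                continue
            decoded = pair.decode("utf-8")
            # Keep only results that stay inside the Latin-1 repertoire.
            if all(ord(ch) <= 0xFF for ch in decoded):
                hits.append(pair)
    return hits


hits = latin1_pairs_valid_as_utf8()
lead_octets = sorted({pair[0] for pair in hits})    # first octets that occur
trail_octets = sorted({pair[1] for pair in hits})   # second octets that occur
```

Only 0xC2 and 0xC3 occur as first octets, and the second octets all lie
in 0xA0 to 0xBF. A lone accented letter, as in is_valid_utf8(b"caf\xe9"),
is rejected.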

Please note that any single isolated beyond-ASCII Latin-1
character (which is the most frequent way these letters appear
in Latin-1) makes it impossible for the string to be UTF-8. Even
in cases where something can theoretically be UTF-8, there are a lot
of things that in practice will make the probability of
clashes EXTREMELY low. For example, I asked a friend
of mine maintaining a large Japanese<->English dictionary
to cull all the Japanese entries in his dictionary that when
encoded as EUC or SJIS could possibly be UTF-8. For EUC,
he found some 2.7% or about 3000 entries. For SJIS, he found
just two entries. Having a look at the EUC entries in a
UTF-8 editor, I find a lot of ASCII characters (wrongly
coded as two or three bytes but not culled), a lot of
accented characters and characters from all kinds of
alphabets including Greek, Arabic, and Hebrew, and a lot
of undefined codepoints. In the whole file, I found one
single Kanji (a Chinese simplified one that would never
appear in Japanese) and no Hiragana or Katakana at all.

Of course, this sample may not be representative of
Japanese filenames, but if the chance that something is
legal UTF-8 is so low, and the chance that something
is reasonable UTF-8 is even much much lower, how low
do you think is the chance that somebody will have just
exactly those two filenames that produce a clash?

If it weren't that this thing could be tweaked and hacked,
I would very much volunteer to include a check in Apache
for such a case (*this* would come at significant file
system access costs) and send the first person that
encounters this something really nice for $100 or even more :-).
And I bet I could keep these $100 for quite a long time.


> How do you
> justify the increased time to access a resource due to the failed
> (usually filesystem) access on every request?  Why should a busy server
> do these things when they are not necessary to support existing web
> services?

There is no increased time. It's either legal UTF-8 (in which
case the chances that it indeed is UTF-8 are 99% or higher)
or it is not UTF-8 (in which case the chances are 100% that
it is the legacy encoding). This point has been mentioned several
times by now, and I hope you finally understand it.


> After all, there are at least a hundred other problems that the draft
> *does* solve, and you are holding it up.

We don't hold it up. We have a clear proposal (the currently
officially proposed wording was drafted by you), and you just
have to include it or propose something else that meets the
same intentions. Many people in this group have clearly stated
that the draft as it currently stands is not sufficient.


> >> Proposal 1c:  Require that the browser submit the form data in the same
> >>               charset as that used by the HTML form.  Since the form
> >>               includes the interpreter resource's URL, this removes all
> >>               ambiguity without changing current practice.  In fact,
> >>               this should already be current practice.  Forms cannot allow
> >>               data entry in multiple charsets, but that isn't needed if the
> >>               form uses a reasonably complete charset like UTF-8.
> >
> >This is mostly current practice, and it is definitely a practice
> >that should be pushed. At the moment, it should work rather well,
> >but the problems appear with transcoding servers and proxies.
> >For transcoding servers (there are a few out there already),
> >the transcoding logic or whatever has to add some field (usually
> >a hidden field in the FORM) that indicates which encoding was
> >sent out. This requires a close interaction of the transcoding
> >part and the CGI logic, and may not fit well into a clean
> >server architecture.
> 
> Isn't that what I've been saying?  Requiring UTF-8 would require all
> servers to be transcoding servers, which is a bad idea.  I am certainly
> not going to implement one, which should at least be a cause for concern
> amongst those proposing to make it the required solution.

Just a moment. Transcoding happens at different points. What I mean
by a transcoding server is a server that transcodes the documents
it serves. For example, it would get a request with 
	Accept-Charset: iso-8859-2, iso-8859-5, iso-8859-1;q=0.0
for a German document it keeps in iso-8859-1. It would take that
document, transcode it to iso-8859-2, and serve it. This is a
functionality that would be rather good to have for a server,
in order to keep clients small.
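
A rough Python sketch of such a transcoding server's core step. The
function names are hypothetical, and the sketch simplifies the HTTP
rules: a charset not listed in Accept-Charset is treated as not
acceptable, and ties in q are broken by header order.

```python
def parse_accept_charset(header: str) -> dict:
    """Parse e.g. 'iso-8859-2, iso-8859-5, iso-8859-1;q=0.0' into {charset: q}."""
    prefs = {}
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        if not parts[0]:
            continue
        q = 1.0  # default quality when no q parameter is given
        for param in parts[1:]:
            if param.lower().startswith("q="):
                q = float(param[2:])
        prefs[parts[0].lower()] = q
    return prefs


def serve_transcoded(body: bytes, stored_charset: str, accept_charset: str):
    """Return (octets, charset) for the response, or None if nothing fits."""
    prefs = parse_accept_charset(accept_charset)
    if prefs.get(stored_charset.lower(), 0.0) > 0.0:
        return body, stored_charset  # client accepts the stored form
    text = body.decode(stored_charset)
    # Try the acceptable charsets, highest q first (sorted() is stable).
    for charset, q in sorted(prefs.items(), key=lambda kv: -kv[1]):
        if q <= 0.0:
            continue
        try:
            return text.encode(charset), charset
        except (UnicodeEncodeError, LookupError):
            continue  # text not representable, or codec unknown
    return None
```

With the header above and a German document stored in iso-8859-1, this
picks iso-8859-2, which contains the German letters.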


> >For a transcoding proxy (none out there
> >yet as of my knowledge, but perfectly possible with HTTP 1.1),
> >the problem gets even worse.
> 
> Whoa!  A transcoding proxy is non-compliant with HTTP/1.1.
> See the last part of section 5.1.2 in RFC 2068.

The last part of 5.1.2 prohibits changing URLs that are sent upstream
(from client via proxy to server). There is nothing
that prohibits a proxy that gets a request with
	Accept-Charset: iso-8859-2, iso-8859-5, iso-8859-1;q=0.0
from a client and that retrieves a resource tagged charset=iso-8859-1
from doing the necessary translations. Indeed, the design of the whole
architecture suggests that such translations and similar ones
are one of the main jobs proxies are made for (besides security
and caching).

The fact that changing URLs upstream is explicitly prohibited
very clearly shows that treating URLs as octets and relying
on your proposal 2c (send the URL back in the same encoding
you got the FORM page) is going to break sooner or later.


> >Well, I repeat my offer. If you help me along getting into Apache,
> >or tell me whom to contact, I would like to implement what I have
> >described above. The deadline for submitting abstracts to the
> >Unicode conference in San Jose in September is at the end of this
> >week. I wouldn't mind submitting an abstract there with a title
> >such as "UTF-8 URLs and their implementation in Apache".
> 
> Sorry, I'd like to finish my Ph.D. sometime this century.
> How can I help you do something that is already known to be impossible?

By studying what I and others write and finding out that it
is not as impossible as you thought :-).

Regards,	Martin.
Received on Wednesday, 23 April 1997 12:08:21 UTC