Re: I18N Concensus - Generic Syntax Document from Roy T. Fielding on 1997-03-07 (uri@w3.org from March 1997)

From: Roy T. Fielding <fielding@kiwi.ICS.UCI.EDU>
Date: Fri, 07 Mar 1997 01:37:25 -0800
To: "Martin J. Duerst" <mduerst@ifi.unizh.ch>
Cc: URI List <uri@bunyip.com>
Message-Id: <9703070137.aa29868@paris.ics.uci.edu>

>+ It is recommended that UTF-8 [RFC 2044] be used to represent characters
>+ with octets in URLs, wherever possible.
>
>+ For schemes where no single character->octet encoding is specified,
>+ a gradual transition to UTF-8 can be made by servers making resources
>+ available with UTF-8 names on their own, on a per-server or a
>+ per-resource basis. Schemes and mechanisms that use a well-
>+ defined character->octet encoding which is however not UTF-8 should
>+ define the mapping between this encoding and UTF-8, because generic
>+ URL software is unlikely to be aware of and to be able to handle
>+ such specific conventions.

Here is where you lose me.  I have no desire to add a UTF-8 character
mapping table to our server.  An HTTP server doesn't need one -- its URLs are
either composed by computation (in which case knowing the charset is not
possible) or by derivation from the filesystem (in which case it will use
whatever charset the filesystem uses, and in any case has no way of
determining whether or not that charset is UTF-8).  The server doesn't care
and should not care.  It is therefore inappropriate to suggest that it should
add such a table when doing so would only bloat the server and slow down
the URL<->resource mapping process.
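
A minimal sketch of the point above, assuming a server that derives URL
paths directly from filesystem names. The function names are hypothetical;
only the urllib.parse calls are real. The mapping works on raw octets, so
it needs no character table and never learns which charset the filesystem
used:

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes


def name_to_url_path(raw_name: bytes) -> str:
    """Escape filesystem octets for use in a URL path (hypothetical helper).

    Every octet outside the always-safe ASCII set (and "/") becomes %XX.
    No charset is consulted, so the result is the same whatever encoding
    produced the bytes.
    """
    return quote_from_bytes(raw_name, safe="/")


def url_path_to_name(url_path: str) -> bytes:
    """Reverse the mapping: turn %XX escapes back into the original octets."""
    return unquote_to_bytes(url_path)


# The same visible name stored under two different filesystem charsets.
latin1_name = "café.html".encode("latin-1")
utf8_name = "café.html".encode("utf-8")

latin1_url = name_to_url_path(latin1_name)  # "caf%E9.html"
utf8_url = name_to_url_path(utf8_name)      # "caf%C3%A9.html"
```

Both URLs round-trip to the octets the filesystem actually holds, which is
all the URL<->resource mapping requires.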

>>    Data corresponding to excluded characters must be escaped in order
>>    to be properly represented within a URL.  However, there do exist
>>    some systems that allow characters from the "unwise" and "national"
>>    sets to be used in URL references (section 3); a robust
>>    implementation should be prepared to handle those characters when
>>    it is possible to do so.
>
>Change to:
>
>There exist some systems that allow characters/octets from the
>"unwise" and "others" sets to be used in URL references (section 3).
>Until a uniform representation for characters within URLs is firmly
>established, such practice is not stable with respect to transcoding
>and therefore should be avoided.
>However, robust implementations should be prepared to handle those
>octet values when it is possible to do so.

No thanks -- the existing paragraph is far better.  Transcoding is
not an issue unless they are already violating the specification,
in which case they are prepared to suffer the consequences.
The purpose of the paragraph is to prevent an implementer from
interpreting the spec too literally and crashing on a non-urlc
character.
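
As an illustration of that purpose, here is a hedged sketch of a tolerant
reader. The function name and the ALLOWED set are assumptions made for
this example (roughly unreserved and reserved characters plus "%" and "#"),
not the draft's exact urlc production. Instead of rejecting a reference
that contains an excluded octet, it escapes that octet and carries on:

```python
import string

# Assumed set of octets passed through untouched; an approximation for
# illustration, not the grammar from the draft.
ALLOWED = frozenset(
    (string.ascii_letters + string.digits + "-_.!~*'()" + ";/?:@&=+$,#%")
    .encode("ascii")
)


def tolerate_reference(raw: bytes) -> str:
    """Return an ASCII reference, escaping any octet outside ALLOWED.

    Existing %XX escapes survive because "%" is in ALLOWED. Octets above
    0x7F are escaped as-is; no charset is assumed for them.
    """
    return "".join(chr(b) if b in ALLOWED else f"%{b:02X}" for b in raw)


# A reference containing a space, "unwise" characters, and a high octet.
escaped = tolerate_reference(b"/a b/{x}|\xe9")
```

Here escaped is "/a%20b/%7Bx%7D%7C%E9", and a reference that is already
well-formed comes back unchanged.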

.....Roy

Received on Friday, 7 March 1997 04:41:59 UTC