Re: revised "generic syntax" internet draft

Roy T. Fielding (fielding@kiwi.ICS.UCI.EDU)
Sun, 13 Apr 1997 22:55:24 -0700


To: Francois Yergeau <yergeau@alis.com>
Cc: uri@bunyip.com
Subject: Re: revised "generic syntax" internet draft 
In-Reply-To: Your message of "Sun, 13 Apr 1997 23:54:47 EDT."
             <3.0.1.32.19970413235447.006e2e48@genstar.alis.com> 
Date: Sun, 13 Apr 1997 22:55:24 -0700
From: "Roy T. Fielding" <fielding@kiwi.ICS.UCI.EDU>
Message-Id:  <9704132255.aa02069@paris.ics.uci.edu>

>>The only question that matters is whether or not the draft as it
>>currently exists is a valid representation of what the existing
>>practice is
>
>The current spec doesn't do that.  Non-ASCII characters are routinely
>rolled into URLs, yet the spec doesn't define the mapping.  IMHO, the spec
>is not worthy of becoming a Draft Standard, in fact it doesn't even meet
>one of the requirements for Proposed Standard (from RFC 2026):
>
>   A Proposed Standard should have no known technical omissions
>   with respect to the requirements placed upon it.

Draft 04 does not have any such omission.  Non-ASCII characters are not
allowed in URLs, period.  Any application that transmits a URL in
non-ASCII characters is declared non-compliant.  There is no grey area.

However, do not confuse HTML href attributes (CDATA) with a URL.
That data does not become a URL until after it is processed by the
HTML/SGML parser, which includes its own charset mapping requirements
that are outside the scope of this specification.  Likewise, entering
a URL into a browser's entry dialog will also involve a charset
mapping which must take place before the data can be considered a URL.
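To illustrate the distinction: the URL spec defines only how octets are escaped, not how characters become octets. A minimal sketch (in Python, with the charset choice as an explicit assumption, since the spec leaves that mapping to the application):

```python
# Sketch: turning non-ASCII data into a legal URL component.
# Step 1 (charset mapping, characters -> octets) is OUTSIDE the URL spec;
# "latin-1" below is an illustrative application choice, not a mandate.
# Step 2 (percent-encoding the octets) is what the spec does define.
from urllib.parse import quote

def to_url_segment(segment: str, charset: str = "latin-1") -> str:
    octets = segment.encode(charset)      # charset mapping, app-specific
    return quote(octets, safe="")         # percent-encode every octet

print(to_url_segment("café"))             # caf%E9  (under latin-1)
print(to_url_segment("café", "utf-8"))    # caf%C3%A9  (under utf-8)
```

Note that the same character string yields two different, equally "legal" URLs depending on the charset chosen in step 1, which is exactly why the spec cannot treat raw non-ASCII characters as a URL.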

>> and what the vendor community agrees is needed in the
>>future to support interoperability.
>
>I'm not aware that the Internet standards process excludes non-vendors.

It doesn't exclude anybody.  However, it isn't possible to claim
"rough consensus" for any feature that nobody wants to implement.
In the Internet standards process, "vendors" means people and organizations
intending to ship implementations of the specified protocol.

>>Since it is my opinion that it is NEVER desirable
>>to show a URL in the unencoded form given in Francois' examples,
>>you cannot claim to hold anything even remotely like consensus.
>
>A bit preposterous, isn't it?  *Your* opinion alone is enough to break any
>consensus?

Yes, it is.  That is the difference between "consensus" (what Martin
was claiming) and "rough consensus".

>I also happen to disagree with this particular opinion.  ASCII characters
>are not the only ones worth displaying.  User-friendliness should not be
>the exclusive apanage of ASCII users.

As it states quite clearly in the draft,

   These design concerns are not always in alignment.  For example, it
   is often the case that the most meaningful name for a URL component
   would require characters which cannot be typed on most keyboards.
   The ability to transcribe the resource location from one medium to
   another was considered more important than having its URL consist
   of the most meaningful of components.  In local and regional
   contexts and with improving technology, users might benefit from
   being able to use a wider range of characters.  However, such use
   is not guaranteed to work, and should therefore be avoided.

Your comments have done nothing to change the conclusions already
represented within the draft.

>>IF you can persuade the creators of URLs to always use UTF-8, which
>>is definitely not the case today (Apache, NCSA, and CERN servers all
>>use whatever charset is used by the underlying filesystem, which on
>>most Unix-based systems is iso-8859-1 or iso-2022-*), ...
>
>It is interesting that you should use this argument.  Yes, Apache, NCSA and
>CERN all use the platform's charset for mapping filenames to URLs (which
>can be remedied by a simple script, BTW).
>
>But these three also transmit documents in the charset that is found in the
>document (transparency, no transcoding), yet you claimed loudly in the HTTP
>WG that they somehow defaulted to ISO 8859-1, and insisted strongly that
>this fictitious default charset remain in the HTTP/1.1 spec.

They do default to ISO-8859-1.  Look at the Apache code.  Look at the
NCSA code.  Look at the CERN code.  As one of the Apache developers,
I can say unequivocally that any text/* response lacking an explicit
charset means that the server intends the file to be treated as
charset=iso-8859-1.  For example, that is how Apache determines which
variant is best when negotiating on Accept-Charset.
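The defaulting rule being described can be sketched as follows (a hypothetical helper, not actual Apache code; header values are illustrative):

```python
# Sketch of the rule above: a text/* response whose Content-Type carries
# no explicit charset parameter is treated as iso-8859-1.
from email.message import Message  # stdlib MIME header parser

def effective_charset(content_type: str) -> str:
    msg = Message()
    msg["Content-Type"] = content_type
    if msg.get_content_maintype() == "text":
        # second argument is returned when the parameter is absent
        return msg.get_param("charset", "iso-8859-1")
    return msg.get_param("charset", "")

print(effective_charset("text/html"))                 # iso-8859-1
print(effective_charset("text/html; charset=utf-8"))  # utf-8
```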

It is certainly true that some users, by mistake or by following incorrect
advice from others, have failed to assign the correct mime type to the
files which they distribute.  It is also true that some implementations
have failed to implement the standard correctly.  However, it is also
true that systems that have implemented the standard correctly will
interoperate correctly, and that is the criterion that matters.

.....Roy