embedding the charset in a URI

Roy T. Fielding (fielding@kiwi.ics.uci.edu)
Fri, 04 Sep 1998 16:54:03 -0700


To: Sam Sun <ssun@CNRI.Reston.VA.US>
cc: URI distribution list <uri@Bunyip.Com>
In-reply-to: Your message of "Fri, 04 Sep 1998 17:24:47 EDT."
             <063d01bdd84a$71a944a0$1c1e1b0a@ssun.CNRI.Reston.Va.US> 
Date: Fri, 04 Sep 1998 16:54:03 -0700
From: "Roy T. Fielding" <fielding@kiwi.ics.uci.edu>
Message-ID:  <9809041654.aa15403@paris.ics.uci.edu>
Subject: embedding the charset in a URI

>We are in the middle of drafting the URL Syntax for the Handle System, which
>is the syntax for Handles referenced in HTML documents, not the syntax for
>Handles transmitted over the wire (the latter uses UTF-8).

Any references in an HTML document will be CDATA in the character
set of the document, not in any specific syntax that you might want to define.
Define the syntax according to the value of the reference attribute
after it has been processed by the HTML parser.

>Because not every
>customer of the Handle System uses a UTF-8 capable platform, we have to define
>the syntax so that they can enter the Handles in terms of their native
>encoding. (Customers will not enter hex encoded name because one, they won't
>know the encoding, two, they don't look right.) The Handle System Resolver
>will have to do the job of translating the native encoding to the protocol
>encoding. For the Resolver to do this, the native encoding has to be
>specified with the Handle reference. Otherwise, if we simply default the
>encoding of the Handle reference to be the encoding of the surrounding
>context of the HTML document, the Resolver will have to parse the document.

Sorry, but that is a poor design.  You are centralizing the aspects
of the system that are inherently decentralized -- forcing the resolver
to be capable of handling all current and future potential encodings,
rather than requiring the application to know both the native and UTF-8
translation.  Your rationale just doesn't make sense.

Furthermore, even if you did follow that design, there is no need
for you to wedge the encoding identifier inside the URI.  You can
just as easily (and far more reliably given current parsers) include
it as a separate parameter of the resolution process which is provided
automatically by the viewer of the document, since that is the only
application that really does need to know the charset.
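
A minimal sketch of that alternative in Python, under stated assumptions:
the helper name to_wire_form, the "hdl:" prefix, and the sample names are
hypothetical and are not part of any Handle System specification. The
viewer already knows the document charset, so it decodes the reference
with that charset and re-encodes it as percent-escaped UTF-8. The charset
travels as a separate argument and never appears inside the URI.

```python
from urllib.parse import quote


def to_wire_form(raw_reference: bytes, document_charset: str) -> str:
    """Translate a reference from the document's charset to escaped UTF-8.

    raw_reference: the attribute value as bytes in the document's encoding.
    document_charset: supplied by the viewer, separately from the reference.
    """
    # Decode with the charset the viewer already knows for this document.
    text = raw_reference.decode(document_charset)
    # Re-encode as UTF-8 and percent-escape every non-ASCII octet.
    return quote(text.encode("utf-8"), safe="/:")


# The same name typed on two platforms with different native encodings
# yields the same wire form, so the resolver only ever sees UTF-8.
from_latin1 = to_wire_form("hdl:4263537/café".encode("iso-8859-1"), "iso-8859-1")
from_sjis = to_wire_form("hdl:4263537/日本".encode("shift_jis"), "shift_jis")
from_eucjp = to_wire_form("hdl:4263537/日本".encode("euc_jp"), "euc_jp")
```

With this split, supporting a new native encoding is a change to the
viewer that uses it, not to the resolver.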

>Also, it might break if user copy and paste the reference from one browser
>window to another.

If the cut and paste is not capable of character translation between
buffers with different character encodings, then the only characters
that will survive the cut-n-paste are the ones with the same encoding
for both.  The others will suddenly "look different" to the user,
at which point the user will go in and replace them without changing the
now incorrect character encoding symbol wedged into the URI.
OTOH, if the cut and paste is capable of character translation between
buffers, then the characters will be automatically corrected without
changing the now incorrect character encoding symbol wedged into the URI.
In other words, your design fails both scenarios.
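
The second scenario can be sketched in Python. This is an illustration
only: resolve_with_embedded_tag is a hypothetical stand-in for a resolver
that trusts a charset label carried with the reference, and the sample
name is invented. The paste translates the characters into the
destination buffer's encoding, but the label still names the old one.

```python
def resolve_with_embedded_tag(charset_tag: str, name_bytes: bytes) -> str:
    """Hypothetical resolver step: decode the name using the embedded tag."""
    return name_bytes.decode(charset_tag)


name = "café"

# In the source window the tag and the bytes agree.
before_paste = resolve_with_embedded_tag("iso-8859-1", name.encode("iso-8859-1"))

# After a translating paste into a UTF-8 buffer, the bytes have changed
# but the tag has not, so the resolver decodes them with the wrong charset.
after_paste = resolve_with_embedded_tag("iso-8859-1", name.encode("utf-8"))
```

Here before_paste is the original name, while after_paste is a different
string, even though the user saw identical characters in both windows.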

....Roy