Re: Using UTF-8 for non-ASCII Characters in URLs

> > > <A HREF="this-is-the-URL">this-is-what-the-user-sees</A>
> > > 
> > > The URL in the 'this-is-the-URL' part should use hex-encoded-UTF8,
> > > no matter what the user sees.
> > > 
> > 
> > If you use hex-encoding, yes. But NOT if you use the native character set
> > of the document. In that case, the 'this-is-the-URL' part must
> > use the same character set as the rest of the html document. Raw UTF-8
> > may only be used in a UTF-8 encoded html document, not in an iso 8859-1
> > encoded document.
> 
> The document character set for HTML 2.0 and 3.2 was iso 8859-1.
> The document character set for HTML 4.0 and XML will be iso 10646.
As iso 8859-1 is a true subset of iso 10646, I assume that html 4.0
will also handle iso 8859-1 encoded documents; otherwise it would break
a lot of today's html pages and software.
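
(To make the subset relationship concrete, here is a small sketch in
Python; it is my own illustration, not from any of the quoted mails.
The first 256 code points of iso 10646 are identical to iso 8859-1, so
a Latin-1 byte value and its iso 10646 code point coincide.)

    text = "café"
    latin1_bytes = text.encode("iso-8859-1")   # b'caf\xe9'
    code_points  = [ord(c) for c in text]      # [99, 97, 102, 233]
    assert list(latin1_bytes) == code_points   # 0xE9 == 233 in both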

> > A large number of html documents are hand-written in a text editor. A user
> > can not be expected to use a different encoding when typing the URLs
> > in a document.
> 
> But they might have to use a different encoding when saving the file
> to disk. And the document itself might be converted as it is saved
> to disk. These are common functions in a multibyte plain text editor,
> just as intelligent cut and paste functions are needed in a shared 
> desktop environment.
> 
> I think your point about "authoring URLs" within HTML documents with
> a "plain text editor" is that the user will have a local input 
> method for entering native characters (e.g., compose key sequences,
> virtual keyboard, radical composition, etc.) which will be operating
> in the same manner for document text and for URL characters. Since the
> authoring tools did not offer a means of recording the character encoding
> information, it is not possible for a web server to make a distinction
> when a document is transmitted on the wire.

From another mail:
>> 
>> In general, URLs used without a context that defines the characters used,
>> should be encoded using UTF-8. URLs used within a context where the
>> meaning of the characters is defined should use the character encoding
>> of the context.
>
>I'm not sure that it is a good idea to tie the URL encoding
>interpretation to its immediate context. If I attempt to
>"Save" a document from the browser (or a spider agent is gathering
>documents automatically from the web), then the characters of the URL
>are often used to form a local file system name for the fetched object.
>So I fetch a SJIS named file via an http server and save it in my
>~/public_html EUC-jp file system.  Using UTF8 on the wire (if
>prearranged) allows both sites to use meaningful names for their local
>resources and to safely share the public handles for the information.

Maybe I was unclear. Text that is handled on a system normally has a
defined character set. If I cut or copy text, the copied text has a
known character set and will be converted into a new one if it is
pasted into a document of a different character set (on a system that
handles different character sets at the same time). If an editor edits
the characters in ISO 10646, it can save them in a totally different
character set by converting the characters to that character set.

When I edit an html document with a text editor, it is just text. URLs
embedded in the text are written using the same character set that
all other text is in. If I paste a filename from a file listing in
another tool, the filename will end up in the same character set as
all other characters in the text. URLs I write will contain 8-bit
characters using the same character set as the rest of the text.

When I use a web browser it will fetch html documents containing URLs.
If I click on a link, the browser needs to extract the URL from the text,
translate it into UTF-8 and send it to a web server.
If I "Save" a document I fetched, the filename proposed will be in the
character set of my filesystem. All this works if the browser is
international UTF-8 URL aware; otherwise only %XX encoded URLs will
work for sure.
UTF-8 should be used on the wire when the protocol says: here is a URL.
If the protocol says: here is an html document, the document need not
be in UTF-8; it may be in iso 8859-1, UCS-2 or UCS-4, and embedded URLs
will be in the same character set. It is a simple matter for a web browser
to extract the embedded URLs and translate them into UTF-8 for the wire;
it would be a very heavy burden for a web server to parse every html
document and translate the embedded URLs into UTF-8.
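
As a rough sketch of those browser-side steps (my own illustration in
Python; the example document and URL are made up): the html arrives as
iso 8859-1 bytes, the URL is extracted as plain text, re-encoded as
UTF-8, and %XX-escaped for the wire.

    import urllib.parse

    html_bytes = b'<A HREF="/caf\xe9.html">caf\xe9</A>'          # iso 8859-1 bytes
    url_text   = html_bytes.decode("iso-8859-1").split('"')[1]   # '/café.html'
    wire_url   = urllib.parse.quote(url_text.encode("utf-8"), safe="/")
    print(wire_url)                                               # /caf%C3%A9.html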

I think it is important that the document text and the URL "text" of
URLs embedded in the document are in the same character set. If you have
a system with SJIS encoded documents and EUC-jp file names, I assume
that editors on that system know that when you save a document to a
file they will use EUC-jp for the filename, and that if you copy a piece
of the SJIS text into the file name dialog field, they will convert the
text from SJIS to EUC-jp. There is no problem then with extracting text
from a document and using it for something with a different character set.
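
A sketch of that copy/paste conversion (again my own illustration in
Python; the file name is made up): text copied out of an SJIS document
is re-encoded as EUC-jp before it is used as a file name, so the
characters stay the same and only the byte encoding changes.

    document_text = "日本語.html"                        # one sequence of characters
    sjis_bytes    = document_text.encode("shift_jis")    # bytes inside the SJIS document
    filename      = sjis_bytes.decode("shift_jis").encode("euc_jp")   # bytes for the file system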

Is it clear now that URLs (and file names) typed in a document need to
be in the same character set as the document?

   Dan

Received on Friday, 2 May 1997 05:53:19 UTC