Transcribing non-ascii URLs [was: revised "generic syntax" internet draft]

Dan Connolly (connolly@w3.org)
Mon, 14 Apr 1997 12:54:41 -0500


Message-Id: <33526F61.622A4B12@w3.org>
Date: Mon, 14 Apr 1997 12:54:41 -0500
From: Dan Connolly <connolly@w3.org>
To: Francois Yergeau <yergeau@alis.com>
Cc: uri@bunyip.com, bert@w3.org
Subject: Transcribing non-ascii URLs [was: revised "generic syntax" internet draft]

Francois Yergeau wrote:
> >  Any application that transmits a URL in
> >non-ASCII characters is declared non-compliant.
> 
> You are confusing characters and bytes.  While you may want to restrict the
> transmitted bytes to 7 bits (but again, why?), you cannot restrict the
> range of characters.  Hence a full mapping is required, not ASCII-only.
> The current spec omits that mapping.

I have been shooting from the hip on this I18N/URL stuff for a while,
but some folks at WWW6 wanted the full weight of W3C behind it, so
I've been trying to think more carefully.

And this issue of transcribing non-ascii URLs particularly concerns me.

On the one hand, it makes a lot of sense that if a user creates
a file, gives it a Hebrew or Arabic or CJK name, and then exports
the file via an HTTP server, the Address: field in a web
browser should show the Hebrew or Arabic or ... characters faithfully.

On the other hand, suppose that address is to be printed and put
in an advertisement or a magazine article. Should it print the
Hebrew/Arabic/CJK characters using those glyphs?
Or should it print ASCII glyphs corresponding to the characters
of the %xx encoding of the original characters?

If the former, then reliability suffers: the odds that a random
person on the globe can faithfully key in a Hebrew/Arabic/CJK
name seem considerably lower than the odds that they can key
in an ASCII name. (Though the odds of correctly transcribing
a long sequence of %xx codes are vanishingly small too...)

(I'm not saying that everybody knows English, but rather
that a person using a computer connected to the internet
has a fairly high probability of being able to match
the 'a' character on a piece of paper to the 'a' character
on the keyboard.)

If the latter, then the system is very much biased to
the *American* Standard Code for Information Interchange.
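
To make the two printed forms concrete, here is a minimal Python
sketch. It assumes UTF-8 as the byte encoding behind the %xx escapes,
which is an assumption on my part and not something the current spec
says; the file name is a made-up example.

```python
from urllib.parse import quote, unquote

# Hypothetical file name: four Hebrew letters plus an ASCII extension.
# Written with \u escapes so the source itself stays ASCII.
name = "\u05e9\u05dc\u05d5\u05dd.html"

# ASSUMPTION: the bytes behind the %xx escapes are UTF-8.
encoded = quote(name, encoding="utf-8")

print(encoded)                   # %D7%A9%D7%9C%D7%95%D7%9D.html
print(len(name), len(encoded))   # 9 29

# The escaped form decodes back to the original characters.
assert unquote(encoded, encoding="utf-8") == name
```

Under that assumption the 9-character name grows to 29 characters
once escaped, which is the transcription burden mentioned above.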

It seems to me that the minimally constraining
thing to do is to specify both
and allow folks to choose: specify how Unicode strings
fit into URLs, and then advise folks to use a small
subset of Unicode if their audience is international
(and at the same time, add a few more notes: perhaps advise folks that
mixing upper and lowercase increases the risk of
transcription errors).
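
As a rough illustration of what such advice could look like in a
tool, here is a small Python sketch. The function name and the three
checks are hypothetical, chosen only to mirror the risks discussed
in this message; they are not taken from any spec.

```python
def transcription_risks(url):
    """Return a list of hypothetical transcription-risk notes for a URL.

    The checks are illustrative only: non-ASCII characters, %xx escapes,
    and mixed upper/lower case letters.
    """
    risks = []
    # Characters outside 7-bit ASCII may be hard to key in for some readers.
    if any(ord(ch) > 127 for ch in url):
        risks.append("non-ASCII characters")
    # Long runs of %xx escapes are error-prone to copy by hand.
    if "%" in url:
        risks.append("%xx escapes")
    # Mixing case adds another way to mistype the URL.
    letters = [ch for ch in url if ch.isalpha()]
    if any(ch.isupper() for ch in letters) and any(ch.islower() for ch in letters):
        risks.append("mixed upper and lower case")
    return risks


print(transcription_risks("http://example.org/reports/q1.html"))
print(transcription_risks("http://example.org/Reports/Q1.html"))
```

The first call returns an empty list and the second flags only the
mixed case. A real checker would need a carefully agreed subset; this
one just shows the shape of the idea.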

What's the conventional wisdom among the DNS folks? Surely
they face the same issue.

Regarding process, it seems clear (based on Larry M and John K's
input) that specifying how Unicode
strings fit into URLs is not the sort of thing one adds to
a proposed standard to make it a draft standard.

But I'm not terribly interested in a draft standard that doesn't
address this issue -- even if only to say "we thought about encoding
Unicode in URLs, but decided against it for the following reasons... ."

In either case, a separate internet draft on the subject seems
like a perfectly good idea. I don't think the risk of "incompatible
standards" is unmanageable.

Larry has asked for implementation experience. Such experience
seems to be growing. None of the implementors has reported
any problems (as far as I can see).

Regarding Jigsaw and Amaya... Support in Jigsaw should be easy.
I'll look into it. Anybody want to do it for me? Should
be a quick hack.

Support in Amaya would be more work. I don't think we've
crossed the hurdle of getting non-western fonts working
in Amaya, not to mention internationalized input.

-- 
Dan Connolly, W3C Architecture Domain Lead
<connolly@w3.org> +1 512 310-2971
http://www.w3.org/People/Connolly/
PGP:EDF8 A8E4 F3BB 0F3C FD1B 7BE0 716C FF21