Re: Transcribing non-ascii URLs [was: revised "generic syntax" internet draft]

On Mon, 14 Apr 1997, Dan Connolly wrote:

> I have been shooting from the hip on this I18N/URL stuff for a while,
> but some folks at WWW6 wanted the full weight of W3C behind it, so
> I've been trying to think more carefully.

We have all tried to do that, but we are glad for any help
we can get.


> And this issue of transcribing non-ascii URLs particularly concerns me.
> 
> On the one hand, it makes a lot of sense that if a user creates
> a file and gives it a hebrew or arabic or CJK name, and then exports
> the file via an HTTP server, that the Address: field in a web
> browser should show the hebrew or arabic or ... characters faithfully.

Yes, definitely.


> On the other hand, suppose that address is to be printed and put
> in an advertisement or a magazine article. Should it print the
> hebrew/arabic/CJK characters using those glyphs?

If it's a hebrew/arabic/CJK magazine, then definitely yes.


> Or should it print ASCII glyphs corresponding to the characters
> of the %xx encoding of the original characters?

That would be the fallback for an English publication (or for
one in any other script unrelated to the original).
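
To make that fallback concrete, here is a minimal sketch (in
present-day Python) of how a CJK name would come out in %xx form.
It assumes UTF-8 as the character<->octet mapping, which is exactly
the point the drafts still have to settle, so take the choice of
encoding as an assumption, not as given:

    # Percent-encode the octets of a non-ASCII name (assuming UTF-8
    # as the character<->octet mapping) and decode them again.
    import urllib.parse

    name = "日本語"                                  # hypothetical CJK file name
    ascii_form = urllib.parse.quote(name, safe="")   # UTF-8 octets -> %xx
    print(ascii_form)                                # %E6%97%A5%E6%9C%AC%E8%AA%9E
    print(urllib.parse.unquote(ascii_form))          # round-trips to 日本語

The round trip only works if both sides agree on the same octet
encoding, which is why the mapping has to be nailed down somewhere.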


> If the former, then reliability suffers: the odds that a random
> person on the globe can faithfully key in a hebrew/arabic/CJK
> name seem considerably lower than the odds that they can key
> in an ASCII name. (though the odds of correctly transcribing
> a long sequence of %xx codes is vanishingly small too...)

Quite true. But the chances that a random person keys in
a hebrew/arabic/CJK URL are rather small. I don't know whether
you speak Japanese, but assuming you don't: how often have you
typed in a URL for a Japanese page that you found in a Japanese
publication? If you take the weighted average over the users of
each kind of URL, overall reliability increases dramatically,
because there is a very strong correlation between the script of
a URL and the readers who will actually have to type it.


> (I'm not saying that everybody knows english, but rather
> that a person using a computer connected to the internet
> has a fairly high probability of being able to match
> the 'a' character on a piece of paper to the 'a' character
> on the keyboard.)

The average non-Latin-native user definitely has a higher
probability of matching a character from the Latin alphabet
than of matching a character from a randomly chosen foreign
alphabet. But matching a character in one's native alphabet
should always be easier.


> If the latter, then the system is very much biased to
> the *American* Standard Code of Information Interchange.

It's not so much the fact that the Americans made that
standard that is the problem. Unicode also has a very
strong American influence.


> It seems to me that the minimally constraining
> thing to do is to specify both
> and allow folks to choose:

Exactly.

> specify how Unicode strings
> fit into URLs, and then advise folks to use a small
> subset of Unicode if their audience is international
> (and at the same time, add a few more notes: perhaps advise folks that
> mixing upper and lowercase increases the risk of
> transcription errors).

There are more things you can add to the notes, for example
that you shouldn't use URLs like 0O0O0o0o.html. Some of
these things can get a little tricky if you have lots of
characters, but even for ASCII, nobody up to now has cared
to actually write such notes. It's probably more a problem
of computer literacy than of standards specs.
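
As an illustration of the kind of note I mean, here is a small
Python sketch that flags the two problems mentioned above
(confusable characters, and the mixed case Dan pointed out). The
character groups are just examples I picked, not from any spec:

    # Warn about names that are hard to transcribe from paper:
    # mixed confusable characters (0/O/o, 1/l/I) and mixed case.
    CONFUSABLE_GROUPS = [set("0Oo"), set("1lI")]

    def transcription_warnings(path):
        warnings = []
        for group in CONFUSABLE_GROUPS:
            hits = group & set(path)
            if len(hits) > 1:
                warnings.append("mixes confusable characters %s" % sorted(hits))
        if path != path.lower() and path != path.upper():
            warnings.append("mixes upper and lower case")
        return warnings

    print(transcription_warnings("0O0O0o0o.html"))   # flags both problems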


> What's the conventional wisdom among the DNS folks? Surely
> they face the same issue.

No. Or let's say not yet. DNS is strictly case-folded ASCII.
But see draft-duerst-dns-i18n-00.txt for an idea for a way out.


> Regarding process, it seems clear (based on Larry M and John K's
> input) that specifying how Unicode
> strings fit into URLs is not the sort of thing one adds to
> a proposed standard to make it a draft standard.
> 
> But I'm not terribly interested in a draft standard that doesn't
> address this issue -- even if only to say "we thought about encoding
> Unicode in URLs, but decided against it for the following reasons... ."
> 
> In either case, a separate internet draft on the subject seems
> like a perfectly good idea. I don't think the risk of "incompatible
> standards" is unmanageable.

I don't see that problem either. But having two documents that
are not connected is also not a good idea. What I think would
be the best way to go, and what Dan Connolly has a lot of
experience with and can certainly advise us on, would be
a solution similar to the one we had with RFC1866 (HTML 2.0)
and RFC2070 (HTML I18N) and the issue of ISO 10646 as the
document character set. That was about two years ago, and it
has turned out to work very well.

This would mean adding a clear hint to the current draft
about the basic issue of the character<->octet mapping, showing
where we are going, without having to rewrite a now very
well-done document. Roy's text can serve as a base, but we
can change it to fit your needs.

The more extended aspects of beyond-ASCII URLs would then
be discussed in a separate draft. We already have a lot
of text from this discussion, and from Francois' web
page. I already volunteered as an author/editor.


> Larry has asked for implementation experience. Such experience
> seems to be growing. None of the implementors has reported
> any problems (as far as I can see).
> 
> Regarding Jigsaw and Amaya... Support in Jigsaw should be easy.
> I'll look into it. Anybody want to do it for me? Should
> be a quick hack.

If you need some help, please tell me (I don't volunteer
for the full job, though).


> Support in Amaya would be more work. I don't think we've
> crossed the hurdle of getting non-western fonts working
> in Amaya, not to mention internationalized input.

Amaya, as far as I remember, is based on Motif and X11.
If that's the case, I would definitely advise you not
to spend your time on it. Many browser vendors are
doing this work anyway, as we have seen.

Regards,	Martin.

Received on Monday, 14 April 1997 16:35:38 UTC