- From: Aaron Irvine <airvine@corp.phone.com>
- Date: Thu, 02 Mar 2000 14:25:33 +0000
- To: Larry Masinter <LM@att.com>
- CC: Benedict Wee Tee Wei <benewee@ida.gov.sg>, "Rogers, Paul" <progers@vignette.com>, uri@w3.org, idn@ops.ietf.org, duerst@w3.org
> > > * hex-encoded characters in URLs. I just tried surfing to
> > > www.%79%61%68%6f%6f.com, and on IE5, it takes me to www.yahoo.com, but
> > > Netscape Navigator 4.6 can't find the server.
>
> It's interesting that it works! The question is whether it should.
>
> Larry
> --
> http://larry.masinter.net
Hi all,
Yes I believe it should work.
I think:
that human-visible (typing into browsers, adverts on radio, etc.; maybe in hrefs
too) escaped Unicode should be consistent with URI path escaped Unicode (i.e.
%hh-escaped UTF-8),
and that URI-authorities like www.%79%61%68%6f%6f.com [works in IE5] and
schemes like k%C3%A1va [RFC2324] are IMHO the correct way to _present URI's_
to end users
however within the net we have to _encode URI's_:
scheme = alpha *( alpha | digit | "+" | "-" | "." ) ;[RFC 2396]
domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum ;[RFC 2396]
labels 63 'septets' max each, dns 255 'septets' max,
possibly a desire not to change (immediately) the dns infrastructure,
and I also note:
hyphen hyphen and hyphen hyphen hyphen are allowed but rarely (never?) used in
practice, hence free for our use...
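A rough way to check that claim against the grammar quoted above is a minimal Python sketch. The function names are hypothetical, the regular expression is only a transcription of the domainlabel rule, and it ignores the separate toplabel rule in RFC 2396:

```python
import re

# domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum  [RFC 2396]
DOMAINLABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?\Z")


def is_valid_label(label: str) -> bool:
    """True if the label fits the domainlabel rule and the 63-character limit."""
    return len(label) <= 63 and DOMAINLABEL.match(label) is not None


def is_valid_hostname(name: str) -> bool:
    """True if every dot-separated label is valid and the whole name fits in 255."""
    return len(name) <= 255 and all(is_valid_label(l) for l in name.split("."))


# Runs of hyphens inside a label are legal; leading or trailing hyphens are not.
print(is_valid_label("a--b"))    # True
print(is_valid_label("a---b"))   # True
print(is_valid_label("ab-"))     # False
```

So "--" and "---" inside a label pass the existing syntax rules unchanged, which is what makes them usable as markers.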
So at the very top of the stack, use %hh-escaped UTF-8. But deeper, somehow
utilise the hyphen to encode characters above ASCII. One possibility I suggest
here is:
* triple-hyphened UTF-5 for when a scheme/username/domainlabel contains one or
more characters above Latin extended B
* double-hyphened UTF-8 otherwise
where:
* triple-hyphened UTF-5 means convert to UTF-5, then insert "---" after the
first letter
* double-hyphened UTF-8 means convert %XY to "X--Y"
* and note a bare (trailing) hyphen never occurs in these
* if, in the unlikely event, the original contains -- (or ---), then this is
encoded as "----2" (or "----3")
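The rules above can be sketched in Python roughly as follows. This is only an illustration under assumptions: the function names are made up, "above Latin Extended B" is read as "above U+024F", pure-ASCII labels are passed through untouched, labels are non-empty, and the "----2"/"----3" escape for literal hyphen runs is left out.

```python
UTF5_LEAD = "GHIJKLMNOPQRSTUV"   # first nibble of each character: 0..F -> G..V
LATIN_EXTENDED_B_END = 0x024F    # assumed cut-off between the two methods


def utf5(label: str) -> str:
    """UTF-5: hex digits of each code point, with the first digit mapped to G..V."""
    out = []
    for ch in label:
        digits = format(ord(ch), "X")
        out.append(UTF5_LEAD[int(digits[0], 16)] + digits[1:])
    return "".join(out)


def triple_hyphen_utf5(label: str) -> str:
    """Convert to UTF-5, then insert "---" after the first letter."""
    enc = utf5(label)
    return enc[0] + "---" + enc[1:]


def double_hyphen_utf8(label: str) -> str:
    """Write each non-ASCII UTF-8 byte XY (i.e. %XY) as "X--Y"; keep ASCII as is."""
    out = []
    for ch in label:
        if ord(ch) < 0x80:
            out.append(ch)
        else:
            for byte in ch.encode("utf-8"):
                hx = format(byte, "02X")
                out.append(hx[0] + "--" + hx[1])
    return "".join(out)


def encode_label(label: str) -> str:
    """Pick the method per label, as in the proposal (ASCII pass-through assumed)."""
    if any(ord(ch) > LATIN_EXTENDED_B_END for ch in label):
        return triple_hyphen_utf5(label)
    if all(ord(ch) < 0x80 for ch in label):
        return label
    return double_hyphen_utf8(label)


def encode_host(host: str) -> str:
    """Translate a dotted name one label at a time."""
    return ".".join(encode_label(l) for l in host.split("."))


print(encode_host("www.\u03b1\u03b2.gr"))   # www.J---B1JB2.gr
print(encode_host("\u0153uf.fr"))           # C--59--3uf.fr
```

A matching decoder would have to recognise the "---" and "--" markers; it is not shown here.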
Examples:
nihongo.jp
M---5E5M72COA9E.jp (is in triple-hyphened UTF-5; note translation done
on per label basis)
www.{alpha=\u03B1}{beta=\u03B2}.gr
www.J---B1JB2.gr
{oe=\u0153}uf.fr
For universal typing: %C5%93uf.fr
For the network itself: C--59--3uf.fr (rather than H---53N5M6.fr)
feli{^c=\u0109}ulo
For universal typing: feli%C4%89ulo (even %66%65%6C%69%C4%89%75%6C%6F is also
allowed)
For the network itself: feliC--48--9ulo (rather than the longer
M---6M5MCM9H09N5MCMF)
ridanta-feli{^c=\u0109}ulo@{oe=\u0153}uf.fr
ridanta-feliC--48--9ulo@C--59--3uf.fr
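The "universal typing" forms in these examples are ordinary %hh-escaped UTF-8, so they can be reproduced with Python's standard `urllib.parse`. A small sketch; `typed_form` is a hypothetical helper name, and keeping "@" unescaped is a choice made here for readability:

```python
from urllib.parse import quote, unquote


def typed_form(text: str) -> str:
    """%hh-escape every non-ASCII character as UTF-8, leaving "@" readable."""
    return quote(text, safe="@")


print(typed_form("\u0153uf.fr"))                          # %C5%93uf.fr
print(typed_form("ridanta-feli\u0109ulo@\u0153uf.fr"))    # ridanta-feli%C4%89ulo@%C5%93uf.fr

# Escaping plain ASCII is redundant but decodes to the same name.
print(unquote("www.%79%61%68%6f%6f.com"))                 # www.yahoo.com
print(unquote("%66%65%6C%69%C4%89%75%6C%6F") == unquote("feli%C4%89ulo"))  # True
```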
(BTW, will toplabel ever need Unicode? If .store .web etc then yes)
(BTW, rather than these two methods could we just use double-hyphened UTF-5 or
would this not be compact enough for Latin languages?)
Comments welcome please. Regards,
Aaron Irvine
(Belfast, Northern Ireland)
--
-----------------------------------------------------
Aaron Irvine
mailto:airvine@corp.phone.com
-----------------------------------------------------
Received on Thursday, 2 March 2000 09:26:12 UTC