RE: 255 character limit in reg-name

From: Martin Duerst <duerst@w3.org>
Date: Sat, 17 Jul 2004 11:08:01 +0900
Message-Id: <4.2.0.58.J.20040717095032.05782718@localhost>
To: Larry Masinter <LMM@acm.org>, "'Dave McAlpin'" <Dave.McAlpin@epok.net>, "'Roy T. Fielding'" <fielding@gbiv.com>
Cc: uri@w3.org

[I wrote all this before I saw the additional exchange between Dave
and Roy.]

There is one specific reason why we may need to remove this limit:
Internationalized Domain Names (IDNs). With IDNs, the resulting
punycode that is sent to a DNS server of course cannot be longer
than 255 octets/US-ASCII characters. However, because of the
compression properties of punycode, it is easy to construct, for
example, a domain name in a script that uses three octets per
character in UTF-8 but has a relatively small repertoire, so that
punycode may compress it to one or two US-ASCII characters per
input character.
There are a lot of scripts like these, starting with the series
of Indic Scripts (Devanagari,...), Sinhala, Thai, Lao, Tibetan,
Myanmar, Georgian, Ethiopic, Cherokee, Khmer, Mongolian, also
Japanese Katakana or Hiragana-only domain names.

Here is an example (just one label, some silly text saying
"areallylongsillydomainnamewehavetomakeitevenlongerotherwiseitisnotenough"
all in hiragana, chosen simply because both my mailer and I can
handle it :-):
ほんとうにながいわけのわからないどめいんめいのらべるまだながくしないとたりない.com
This label contains 39 hiragana characters. Converted to UTF-8 and
percent-escaped, this gives
%E3%81%BB%E3%82%93%E3%81%A8%E3%81%86%E3%81%AB%E3%81%AA%E3%81%8C%E3%81%84%E3%82%8F%E3%81%91%E3%81%AE%E3%82%8F%E3%81%8B%E3%82%89%E3%81%AA%E3%81%84%E3%81%A9%E3%82%81%E3%81%84%E3%82%93%E3%82%81%E3%81%84%E3%81%AE%E3%82%89%E3%81%B9%E3%82%8B%E3%81%BE%E3%81%A0%E3%81%AA%E3%81%8C%E3%81%8F%E3%81%97%E3%81%AA%E3%81%84%E3%81%A8%E3%81%9F%E3%82%8A%E3%81%AA%E3%81%84.com
The label contains 39*3 = 117 pct-escaped constructs (note that the
ABNF counts pct-escaped constructs, not actual characters, of which
there are 39*9 = 351 in this case).
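
The counts above can be reproduced with a short sketch. It assumes
Python 3 and the standard-library urllib.parse.quote, which
percent-encodes the UTF-8 octets of its input; the variable names are
illustrative only.

```python
from urllib.parse import quote

# The 39-character hiragana label from the example (without ".com").
label = "ほんとうにながいわけのわからないどめいんめいのらべるまだながくしないとたりない"

escaped = quote(label)          # UTF-8 octets, each written as %XX
triplets = escaped.count("%")   # one pct-escaped construct per octet

print(len(label))    # number of hiragana characters
print(triplets)      # number of pct-escaped constructs
print(len(escaped))  # number of US-ASCII characters in the escaped form
```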

Converted to punycode, this reads:
xn--n8jaaaaai5bhf7as8fsfk3jnknefdde3fg11amb5gzdb4wi9bya3kc6lra.com
The label is 62 characters long. This means that even including the
xn-- prefix, fewer than 2 US-ASCII characters (62/39, about 1.6) are
used per hiragana character in the input. Punycode at work!
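
As a rough check of the compression claim, the following sketch uses
Python's built-in "punycode" codec and prepends the xn-- prefix by
hand. That is an assumption made for illustration: it skips the
nameprep step of a full IDNA ToASCII conversion, which should not
matter for a hiragana-only label.

```python
# The 39-character hiragana label from the example (without ".com").
label = "ほんとうにながいわけのわからないどめいんめいのらべるまだながくしないとたりない"

# Punycode the label and add the ACE prefix manually (nameprep omitted).
ace = "xn--" + label.encode("punycode").decode("ascii")

print(ace)
print(len(ace), len(ace) / len(label))  # total length, ASCII chars per hiragana

# Decoding the part after the prefix should recover the original label.
round_trip = ace[4:].encode("ascii").decode("punycode")
print(round_trip == label)
```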

Using a few such labels, we can easily construct a case where the
reg-name is more than 255 'pct-escaped' constructs long, but still
refers to a totally legal IDN.
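
A sketch of such a construction, assuming Python 3 and simply
repeating the example label three times (a hypothetical name, not a
registered domain). It counts reg-name length the way the ABNF does,
with each pct-escaped triplet counting once, and compares that with
the length of the punycode form. The helper names to_ace and
abnf_units are made up for this illustration.

```python
from urllib.parse import quote

label = "ほんとうにながいわけのわからないどめいんめいのらべるまだながくしないとたりない"
labels = [label, label, label, "com"]  # hypothetical four-label name

def to_ace(lbl: str) -> str:
    """Punycode a non-ASCII label (nameprep omitted for brevity)."""
    if lbl.isascii():
        return lbl
    return "xn--" + lbl.encode("punycode").decode("ascii")

def abnf_units(lbl: str) -> int:
    """Count reg-name constructs: each %XX triplet counts once."""
    escaped = quote(lbl)
    return len(escaped) - 2 * escaped.count("%")

dots = len(labels) - 1
reg_name_units = sum(abnf_units(l) for l in labels) + dots
ace_length = sum(len(to_ace(l)) for l in labels) + dots

print(reg_name_units, reg_name_units > 255)  # pct-escaped form vs. the limit
print(ace_length, ace_length <= 255)         # punycode form vs. the DNS limit
print(max(len(to_ace(l)) for l in labels))   # each label must also fit in 63
```

If the 255 limit were enforced on the pct-escaped form, this name
would be rejected even though its punycode form stays within the DNS
limit.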

At 09:00 04/07/16 -0700, Larry Masinter wrote:

>Those who want to increase or remove the limit need
>to demonstrate that the widely deployed URI software
>does not assume the limit in order to function
>properly.

My guess is that browsers that implement IDN and pct-escaped would
check the limit after the conversion to punycode, not before. But
this is currently only a guess. I could try to get something set up
for testing. But it may take time, because we have just started a
long weekend in Japan.

Regards,    Martin.
Received on Friday, 16 July 2004 22:08:29 GMT