RE: 255 character limit in reg-name

From: Martin Duerst <duerst@w3.org>
Date: Wed, 21 Jul 2004 20:30:54 +0900
Message-Id: <4.2.0.58.J.20040721201514.05dbe380@localhost>
To: Larry Masinter <LMM@acm.org>, "'Dave McAlpin'" <Dave.McAlpin@epok.net>, "'Roy T. Fielding'" <fielding@gbiv.com>
Cc: uri@w3.org, public-iri@w3.org

A colleague of mine has set up some example names, and I have added
some tests for very long labels and very long overall domain names.
These are tests 121 and 122 at
http://www.w3.org/2004/04/uri-rel-test.html#L121.
I have only done a very limited amount of testing; Opera 7.5 passes
the first test (121), but not the second (122).
The label lengths and the overall length in test 122 do not
fully exercise the limits the standards allow, but they come
quite close.

Regards,   Martin.

At 11:08 04/07/17 +0900, Martin Duerst wrote:

>[I wrote all this before I saw the additional exchange between Dave
>and Roy.]
>
>There is one specific reason why we may need to remove this limit:
>Internationalized Domain Names (IDNs). With IDNs, the resulting
>punycode that is sent to a DNS server of course cannot be longer
>than 255 octets/US-ASCII characters. However, because of the
>compression properties of punycode, it is easy to e.g. construct
>a domain name from a script that uses three octets per character
>in UTF-8, but is relatively small so that punycode may compress
>it to one or two US-ASCII characters per input character.
>There are a lot of scripts like these, starting with the series
>of Indic Scripts (Devanagari,...), Sinhala, Thai, Lao, Tibetan,
>Myanmar, Georgian, Ethiopic, Cherokee, Khmer, Mongolian, also
>Japanese Katakana or Hiragana-only domain names.
>
>Here is an example (just one label, some silly text saying
>"areallylongsillydomainnamewehavetomakeitevenlongerotherwiseitisnotenough"
>all in hiragana (choosing that simply because both me and my mailer can
>handle it :-):
>ほんとうにながいわけのわからないどめいんめいのらべるまだながくしないとたりない.com
>This label contains 39 hiragana characters. Converted to UTF-8 and
>percent-escaped, this gives
>%E3%81%BB%E3%82%93%E3%81%A8%E3%81%86%E3%81%AB%E3%81%AA%E3%81%8C%E3%81%84%E3%82%8F%E3%81%91%E3%81%AE%E3%82%8F%E3%81%8B%E3%82%89%E3%81%AA%E3%81%84%E3%81%A9%E3%82%81%E3%81%84%E3%82%93%E3%82%81%E3%81%84%E3%81%AE%E3%82%89%E3%81%B9%E3%82%8B%E3%81%BE%E3%81%A0%E3%81%AA%E3%81%8C%E3%81%8F%E3%81%97%E3%81%AA%E3%81%84%E3%81%A8%E3%81%9F%E3%82%8A%E3%81%AA%E3%81%84.com
>The label contains 39*3 = 117 pct-escaped constructs (note that the
>ABNF counts pct-escaped constructs, not actual characters; the
>number of actual characters in this case is 39*9 = 351).
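
The counts above can be reproduced with a short Python sketch. It is only
an illustration: the variable names are made up here, and it assumes that
urllib.parse.quote (which encodes to UTF-8 by default) is an acceptable
stand-in for the percent-escaping described in the message.

```python
from urllib.parse import quote

# The hiragana label from the example, without the ".com" suffix.
label = "ほんとうにながいわけのわからないどめいんめいのらべるまだながくしないとたりない"

utf8_octets = label.encode("utf-8")   # three octets per hiragana character
pct_encoded = quote(label, safe="")   # one %XX triplet per octet

print(len(label))        # hiragana characters: 39
print(len(utf8_octets))  # pct-escaped constructs: 117
print(len(pct_encoded))  # US-ASCII characters in the escaped form: 351
```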
>
>Converted to punycode, this reads:
>xn--n8jaaaaai5bhf7as8fsfk3jnknefdde3fg11amb5gzdb4wi9bya3kc6lra.com
>The label is 62 characters long. This means that even including the
>xn-- prefix, less than 2 US-ASCII characters are used per hiragana
>character in the input. Punycode at work!
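
A rough way to check the conversion, sketched in Python. This uses the
standard library's "punycode" codec and prepends the xn-- prefix by hand;
it skips the Nameprep step of IDNA, on the assumption that Nameprep leaves
this all-hiragana label unchanged. The variable names are illustrative only.

```python
# The hiragana label from the example, without the ".com" suffix.
label = "ほんとうにながいわけのわからないどめいんめいのらべるまだながくしないとたりない"

# Raw punycode for the label, then the ACE form with the IDNA prefix.
punycode = label.encode("punycode").decode("ascii")
ace_label = "xn--" + punycode

# Decoding the punycode again should give back the original label.
round_trip = punycode.encode("ascii").decode("punycode")

print(ace_label)
print(len(ace_label))               # the message reports 62 for this label
print(len(ace_label) / len(label))  # US-ASCII characters per hiragana character
print(round_trip == label)
```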
>
>Using some such labels, we can easily construct a case where the
>reg-name is more than 255 'pct-escaped' long, but still refers to
>a totally legal IDN.
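
One such case, sketched in Python under some assumptions: the host name
below (three copies of the example label plus ".com") is hypothetical, the
reg-name limit is taken to count each %XX triplet or unescaped character as
one element, Nameprep is skipped, and the DNS limits are taken as 63 octets
per label and 253 characters in dotted form (the usual textual equivalent
of the 255-octet limit). The helper names are made up for this sketch.

```python
from urllib.parse import quote, unquote

label = "ほんとうにながいわけのわからないどめいんめいのらべるまだながくしないとたりない"

# Hypothetical IDN: three copies of the example label under .com.
host = ".".join([label] * 3) + ".com"

# Percent-escaped reg-name; dots and ASCII letters stay literal.
pct_reg_name = quote(host, safe=".")
pct_triplets = pct_reg_name.count("%")
# Elements as counted by the ABNF: triplets plus unescaped characters.
abnf_elements = pct_triplets + (len(pct_reg_name) - 3 * pct_triplets)


def to_ace(name):
    """Convert each non-ASCII label to its xn-- form (sketch, no Nameprep)."""
    out = []
    for lab in name.split("."):
        if lab.isascii():
            out.append(lab)
        else:
            out.append("xn--" + lab.encode("punycode").decode("ascii"))
    return ".".join(out)


def within_dns_limits(ascii_name):
    """Check the per-label and overall length limits on the ASCII form."""
    labels = ascii_name.split(".")
    return len(ascii_name) <= 253 and all(0 < len(l) <= 63 for l in labels)


# Check the limit after conversion to punycode, not before.
ace_name = to_ace(unquote(pct_reg_name))

print(abnf_elements, abnf_elements > 255)
print(len(ace_name), within_dns_limits(ace_name))
```

Checking the length only on the converted name is the order of operations
guessed at later in this message; a check on the escaped form would reject
this name.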
>
>At 09:00 04/07/16 -0700, Larry Masinter wrote:
>
>>Those who want to increase or remove the limit need
>>to demonstrate that the widely deployed URI software
>>does not assume the limit in order to function
>>properly.
>
>My guess is that browsers that implement IDN and pct-escaped would
>check the limit after the conversion to punycode, not before. But
>this is currently only a guess. I could try to get something set up
>for testing. But it may take time, because we have just started a
>long weekend in Japan.
>
>Regards,    Martin.
Received on Wednesday, 21 July 2004 07:30:52 GMT
