- From: Martin Duerst <duerst@w3.org>
- Date: Sat, 17 Jul 2004 11:08:01 +0900
- To: Larry Masinter <LMM@acm.org>, "'Dave McAlpin'" <Dave.McAlpin@epok.net>, "'Roy T. Fielding'" <fielding@gbiv.com>
- Cc: uri@w3.org
[I wrote all this before I saw the additional exchange between Dave and Roy.]

There is one specific reason why we may need to remove this limit: Internationalized Domain Names (IDNs). With IDNs, the resulting punycode that is sent to a DNS server of course cannot be longer than 255 octets/US-ASCII characters. However, because of the compression properties of punycode, it is easy to construct, for example, a domain name from a script that uses three octets per character in UTF-8 but is relatively small, so that punycode may compress it to one or two US-ASCII characters per input character. There are a lot of scripts like these, starting with the series of Indic scripts (Devanagari, ...), Sinhala, Thai, Lao, Tibetan, Myanmar, Georgian, Ethiopic, Cherokee, Khmer, Mongolian, and also Japanese Katakana-only or Hiragana-only domain names.

Here is an example with just one label, some silly text saying "areallylongsillydomainnamewehavetomakeitevenlongerotherwiseitisnotenough", all in hiragana (choosing that simply because both my mailer and I can handle it :-):

ほんとうにながいわけのわからないどめいんめいのらべるまだながくしないとたりない.com

This label contains 39 hiragana characters. Converted to UTF-8 and percent-escaped, this gives

%E3%81%BB%E3%82%93%E3%81%A8%E3%81%86%E3%81%AB%E3%81%AA%E3%81%8C%E3%81%84%E3%82%8F%E3%81%91%E3%81%AE%E3%82%8F%E3%81%8B%E3%82%89%E3%81%AA%E3%81%84%E3%81%A9%E3%82%81%E3%81%84%E3%82%93%E3%82%81%E3%81%84%E3%81%AE%E3%82%89%E3%81%B9%E3%82%8B%E3%81%BE%E3%81%A0%E3%81%AA%E3%81%8C%E3%81%8F%E3%81%97%E3%81%AA%E3%81%84%E3%81%A8%E3%81%9F%E3%82%8A%E3%81%AA%E3%81%84.com

The label contains 39*3 = 117 pct-escaped constructs (note that the ABNF indicates the number of pct-escaped constructs, not the number of actual characters, which in this case is 39*9 = 351 characters).

Converted to punycode, this reads:

xn--n8jaaaaai5bhf7as8fsfk3jnknefdde3fg11amb5gzdb4wi9bya3kc6lra.com

The label is 62 characters long.
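The counts for this label can be checked with a short script. What follows is only a minimal Python sketch, assuming the standard library's `punycode` codec and `urllib.parse.quote`; the variable names and the three-label host at the end are made up for illustration and are not part of the original message. It skips the nameprep step of IDNA, which should not change plain hiragana.

```python
from urllib.parse import quote

# The hiragana label from the message (without ".com").
label = "ほんとうにながいわけのわからないどめいんめいのらべるまだながくしないとたりない"

utf8 = label.encode("utf-8")      # three octets per hiragana character
pct = quote(label, safe="")       # percent-escape every UTF-8 octet
# ACE form: "xn--" prefix plus the raw punycode of the label
# (assumption: nameprep is a no-op for this input, so it is skipped here).
ace = "xn--" + label.encode("punycode").decode("ascii")

print("hiragana characters:   ", len(label))
print("pct-escaped constructs:", len(utf8))
print("pct-escaped characters:", len(pct))
print("punycode label length: ", len(ace))
print(ace)

# Hypothetical host built from three copies of the label, to compare the
# number of pct-escaped constructs with the length of the punycode form.
host = ".".join([label] * 3) + ".com"
pct_count = quote(host, safe=".").count("%")
ace_name = ".".join([ace] * 3) + ".com"

print("host pct-escaped constructs:", pct_count)
print("host punycode length:       ", len(ace_name))
```

If the codec agrees with the conversion quoted above, the last label line should match the xn-- string given in the message; the hypothetical three-label host then has more than 255 pct-escaped constructs while its punycode form stays below 255 characters.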
This means that even including the xn-- prefix, fewer than 2 US-ASCII characters are used per hiragana character in the input. Punycode at work!

Using a few such labels, we can easily construct a case where the reg-name is more than 255 'pct-escaped' long, but still refers to a totally legal IDN.

At 09:00 04/07/16 -0700, Larry Masinter wrote:
>Those who want to increase or remove the limit need
>to demonstrate that the widely deployed URI software
>does not assume the limit in order to function
>properly.

My guess is that browsers that implement IDN and pct-escaped would check the limit after the conversion to punycode, not before. But this is currently only a guess. I could try to get something set up for testing. But it may take time, because we have just started a long weekend in Japan.

Regards, Martin.
Received on Friday, 16 July 2004 22:08:29 UTC