RE: IDN problem.... :( from Martin Duerst on 2005-02-17 (www-international@w3.org from January to March 2005)

From: Martin Duerst <duerst@w3.org>
Date: Thu, 17 Feb 2005 18:26:36 +0900
To: "Kane, Pat" <pkane@verisign.com>, "'Erik van der Poel'" <erik@vanderpoel.org>
Cc: www-international@w3.org
Message-Id: <6.0.0.20.2.20050217082315.08cab430@localhost>

Hello Pat,

There is no good way to create *one* table for a language that has
been written with three scripts over the years. But there is a very
easy solution: Create three tables. You can see the registration
for a similar case, Azerbaijani, at
http://www.iana.org/assignments/language-tags. There is also
work underway to allow generative inclusion of scripts
(see http://www.ietf.org/internet-drafts/draft-phillips-langtags-10.txt
and http://www.ietf.org/internet-drafts/draft-phillips-langmatching-00.txt).

It is very important to realize that there are several different
ways in which a language can use more than one script:

- Mixing as part of basic text functionality: e.g. Japanese

- Use of different scripts to write the same language at different
   times, in different places, etc.: e.g. Azerbaijani

- Borrowing of a few characters from another script to use as additional
   characters in a script: examples here are Henqeminem, Heiltsuk,
   Sliammon, Penobscot, and others (for details, follow the discussion
   on the Unicode mailing list).

There are probably other classes, and finer distinctions can be made,
but it's important to understand that in terms of IDN, these classes
have to be treated differently.
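
As a rough illustration of why these classes need different handling,
here is a minimal Python sketch that lists the scripts used in a label
and flags only the mixes that are not on an allow-list. Everything in
it is an assumption made for illustration: Python's unicodedata module
has no Script property, so the first word of the character name stands
in for the script, and NORMAL_MIXES and needs_review are hypothetical
names, not part of any registry's actual policy.

```python
import unicodedata


def scripts_in(label: str) -> set[str]:
    """Return rough script names for the letters in a label.

    Heuristic only: the first word of the Unicode character name
    (e.g. LATIN, CYRILLIC, HIRAGANA) is used as the script name.
    """
    found = set()
    for ch in label:
        # Only letters are considered; digits and hyphens are skipped.
        if unicodedata.category(ch).startswith("L"):
            found.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return found


# Hypothetical allow-list: script combinations that are part of
# ordinary text in some language (the "Japanese" class above).
NORMAL_MIXES = [{"CJK", "HIRAGANA", "KATAKANA", "LATIN"}]


def needs_review(label: str) -> bool:
    """Flag labels that mix scripts outside the allow-list."""
    scripts = scripts_in(label)
    if len(scripts) <= 1:
        return False
    return not any(scripts <= allowed for allowed in NORMAL_MIXES)


print(scripts_in("p\u0430ypal"))   # Latin letters plus Cyrillic U+0430
print(needs_review("p\u0430ypal"))
print(needs_review("\u65e5\u672c\u8a9e\u306e\u30c6\u30ad\u30b9\u30c8"))
```

A language written in different scripts at different times would get
one single-script entry per table instead of one mixed entry, and the
"borrowed characters" class would need a per-language list of the few
extra characters, which this sketch does not attempt.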

Regards,    Martin.


At 22:27 05/02/16, Kane, Pat wrote:
 >
 >Erik,
 >
 >We are not "catering" to anybody, we just recognize that there is no good
 >way to create a table for a language that has gone from Arabic to Latin to
 >Cyrillic and back to Latin in about 100 years.  It is a stretch comparison,
 >but look at the rules for mapping Simplified Chinese and Traditional
 >Chinese.  I do sympathize, that is why we monitor script mixing and actual
 >registrations that contain multiple scripts.  What I would like to see is a
 >language table for all languages that we permit from a tag standpoint.
 >
 >
 >Pat
 >
 >-----Original Message-----
 >From: Erik van der Poel [mailto:erik@vanderpoel.org]
 >Sent: Wednesday, February 16, 2005 2:29 AM
 >To: Kane, Pat
 >Cc: www-international@w3.org
 >Subject: Re: IDN problem.... :(
 >
 >Hello Pat,
 >
 >Thank you for chiming in. This is exactly the kind of info that I need,
 >to understand the problem a bit better. I think it's great that you seem
 >to want to cater to these former Soviet Union nations with special
 >needs, but I wonder, do you also sympathize with those in other parts of
 >the world that might be duped by spoofed characters? And, if so, do you
 >have a proposed solution?
 >
 >I'm beginning to think that maybe the only way out of this mess is to
 >fold the homographs into single codes (as has been proposed by several
 >people). I.e. you may start with Latin small letter 'a' and Cyrillic
 >small letter 'a' (with distinct codes), but after nameprepping they
 >would no longer be distinct and would have the same code (probably the
 >code for Latin small letter 'a' in order to be compatible with legacy
 >DNS names).
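
The folding idea quoted above can be sketched in a few lines of
Python. This is only a toy under stated assumptions: the FOLD table
and the betterprep_fold name are hypothetical, the table covers just
five Cyrillic letters chosen for illustration, and Python's built-in
"idna" codec (which implements the 2003 IDNA/Nameprep rules) is used
only to show what happens to the encoded name before and after folding.

```python
# Toy homograph table (hypothetical, far from complete): Cyrillic
# small letters mapped to the Latin letters they resemble.
FOLD = str.maketrans({
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
})


def betterprep_fold(label: str) -> str:
    """Lowercase a label and fold listed homographs to Latin."""
    return label.lower().translate(FOLD)


spoof = "p\u0430ypal"   # second letter is Cyrillic U+0430
plain = "paypal"

# With today's Nameprep the two labels stay distinct on the wire:
print(spoof.encode("idna"))   # an xn-- label
print(plain.encode("idna"))   # plain ASCII

# After folding they collapse to the same legacy-compatible name:
print(betterprep_fold(spoof) == plain)
```

Note that a blanket fold like this would also rewrite legitimate
all-Cyrillic names, which is one reason the table would be hard to
agree on.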
 >
 >Of course, this wouldn't be Nameprep as it is currently defined. It
 >would be a different prep, e.g. BetterPrep.
 >
 >And of course, it would be really difficult to come up with a spec for
 >this betterprep that satisfies everyone. A pair of glyphs that look
 >similar to one person might look different to another.
 >
 >But perhaps a good starting point is the "cmap" of a popular font, say,
 >Arial. If Arial's cmap has the same glyph index for a pair of
 >characters, then we enter this pair into the betterprep spec draft. Then
 >we look at the other glyphs and decide whether or not they ought to be
 >considered homographs (homoglyphs? :-)
 >
 >At the end of this laborious and controversial process, we have the next
 >Maturity Level of the IDN RFCs, i.e. Draft Standard, and they are given
 >a new prefix (i.e. something other than xn--). At this point, the
 >registries go through their existing xn-- names, decode them, run them
 >through betterprep, and resolve any conflicts in the same way that
 >trademark grievances are addressed.
 >
 >After that, we have one final iteration for, you guessed it, the
 >Standard Maturity Level, with yet another prefix and a prep called
 >BestPrep, the final version. It may be that BestPrep is almost the same
 >as BetterPrep, or even identical, in which case we don't need a new prefix.
 >
 >Of course, we will have to give the applications some time to prepare
 >for the new prefixes, so there would be a certain amount of time between
 >publication of the RFCs and recoding of the registries.
 >
 >Thoughts?
 >
 >Erik
 >
 >Kane, Pat (by way of Martin Duerst <duerst@w3.org>) wrote:
 >>
 >> Commingling of scripts certainly is the issue here but they must be
 >> permitted for certain communities as their languages utilize multiple
 >> characters from multiple scripts.  During the development of ICANN's IDN
 >> guidelines I presented the details about the script mixing within com
 >> and net.  There were very few issues around the mixing of Latin and
 >> other scripts with the exception of Cyrillic and Greek.  However, there
 >> are several former Soviet Union nations that originally used Latin
 >> characters then converted to Cyrillic characters and who are now
 >> returning to Latin that need just this commingling.  Yes, there are very
 >> few registrations that come from Tajikistan, but these are the types of
 >> communities that IDNs were developed for.
Received on Thursday, 17 February 2005 09:46:30 UTC