Re: draft-newman-i18n-collation-09.txt just posted from Mark Davis on 2006-05-16 (public-ietf-collation@w3.org from May 2006)

From: Mark Davis <mark.davis@icu-project.org>
Date: Tue, 16 May 2006 12:37:54 -0700
To: "Arnt Gulbrandsen" <arnt@gulbrandsen.priv.no>
Cc: ietf-imapext@imc.org, ietf-mta-filters@imc.org, public-ietf-collation@w3.org
Message-ID: <30b660a20605161237x7bc5ba51v68f527a6e149c64b@mail.gmail.com>
Thanks. I'm at the UTC meeting right now, and we were just talking about
this. In general the committee is quite happy with the changes and
direction.

The remaining serious issue is the combinatorics. You have a question on the
open issue asking if the new DTD solves the issue, but as far as I can tell
it does not.

The combinatorics are very large for CLDR. (See
http://www.unicode.org/draft/reports/tr35/tr35.html#Locale_IDs). Here is a
back-of-the-envelope calculation:
- There are a couple of hundred locales and growing (
http://unicode.org/cldr/apps/survey)
- Many have variants; so all of the German ones can have "phonebook" for
example. That probably adds another 50 combinations, so call it around 300.
- All can be parameterized, with the following numbers of settings:
 5 strength, 2 alternates, 2 directions, 2 normalizations, 2 caseLevels, 3
caseFirsts, 2 hiragana, 2 numeric, for a total of 960 combinations.
 Since these are parameters, they can each be combined with all the locales,
so that gives about 288,000 (300 x 960) different registrations.
- This doesn't account for the variableTop parameter, which takes a string
and is, at least theoretically, unbounded.

I'm sure what we don't want to see is hundreds of thousands of entries in the
form discussed in "7.5. Example Initial Registry Summary".
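
For anyone who wants to redo the back-of-the-envelope calculation above, here is a minimal Python sketch. The per-parameter counts are the ones listed in the message; the dictionary keys are just labels, and the figure of 300 locales-plus-variants is the rough estimate given above, not a measured number.

```python
from math import prod

# Number of settings per parameter, as listed in the message.
# The keys are illustrative labels only.
SETTINGS = {
    "strength": 5,
    "alternate": 2,
    "direction": 2,
    "normalization": 2,
    "caseLevel": 2,
    "caseFirst": 3,
    "hiragana": 2,
    "numeric": 2,
}

# Rough estimate from the message: a couple of hundred locales plus variants.
LOCALES_WITH_VARIANTS = 300


def parameter_combinations(settings=SETTINGS):
    """Multiply the per-parameter counts to get the number of combinations."""
    return prod(settings.values())


def total_registrations(locales=LOCALES_WITH_VARIANTS, settings=SETTINGS):
    """Every parameter combination can be paired with every locale."""
    return locales * parameter_combinations(settings)


if __name__ == "__main__":
    print(parameter_combinations())
    print(total_registrations())
```

variableTop is left out on purpose, since it takes a string and has no fixed count.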

What I suggest be done instead is that a set of parameters be registered.
Thus we could have the equivalent of:

CLDR-Parameters:
  locale=aa, aa_DJ, aa_ER, aa_ER_SAAHO, aa_ET, ..., zu_ZA
  collation=phonebook, ..., gb2312han
  colStrength=primary, secondary, tertiary, quaternary, identical
  ...
  variableTop=(u<unicodeCodePoint>)+

Then the corresponding line in 7.5 could be:

     i;basic;uca=5.0.0;uv=5.0.0;CLDR-Parameters    e, o, s   i18n
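
To illustrate why registering parameters scales better than registering every combination, here is a small Python sketch that checks a requested set of parameters against a registered parameter table. Everything in it is hypothetical: the `CLDR_PARAMETERS` table, the `is_registered` function, and the assumption that a code point in variableTop is written as 4 to 6 hex digits are mine, not part of the draft. Values elided with "..." above are left out rather than guessed.

```python
import re

# Hypothetical in-memory form of the "CLDR-Parameters" entry. Only values
# spelled out in the message are included; elided ones are not guessed.
CLDR_PARAMETERS = {
    "locale": {"aa", "aa_DJ", "aa_ER", "aa_ER_SAAHO", "aa_ET", "zu_ZA"},
    "collation": {"phonebook", "gb2312han"},
    "colStrength": {"primary", "secondary", "tertiary", "quaternary",
                    "identical"},
}

# variableTop=(u<unicodeCodePoint>)+
# Assumption: each code point is written as 4 to 6 hex digits.
VARIABLE_TOP = re.compile(r"(?:u[0-9A-Fa-f]{4,6})+")


def is_registered(params):
    """Return True if every name=value pair is covered by the registry."""
    for name, value in params.items():
        if name == "variableTop":
            # Open-ended parameter: validate by pattern, not by enumeration.
            if not VARIABLE_TOP.fullmatch(value):
                return False
        elif value not in CLDR_PARAMETERS.get(name, ()):
            return False
    return True


if __name__ == "__main__":
    print(is_registered({"locale": "aa_DJ", "colStrength": "primary"}))
    print(is_registered({"variableTop": "u0020u3000"}))
    print(is_registered({"locale": "xx"}))
```

The registry grows with the sum of the value lists rather than their product, and an unbounded parameter such as variableTop is handled by a pattern.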

BTW, I'm going to be gone until June 7, so won't be able to respond until
then.

There are a few other areas where I have remarks on the language. In
particular, the handling of the error strings needs some further fixes. For
example, the following statement is false:

4.2.3.  Ordering
...
It MUST be transitive and trichotomous.

The way to handle this is to say that a collation is defined over a domain
of strings. Any string outside that domain is called "invalid", and
returns an error value when used with any operation. Then you can truly say:

For all strings in its domain, it MUST be transitive and trichotomous.
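
A toy Python sketch of the idea, under stated assumptions: the domain here is simply "ASCII-only strings" and the ordering is case-insensitive code point order, both chosen only for illustration. The names `UNDEFINED`, `in_domain`, `compare`, and `ordering_holds_on_domain` are hypothetical and do not come from the draft.

```python
from itertools import product

UNDEFINED = "undefined"  # hypothetical marker for out-of-domain input


def in_domain(s):
    # Toy domain: ASCII-only strings. A real collation defines its own domain.
    return s.isascii()


def compare(a, b):
    """Return -1, 0, or 1 for strings in the domain, else UNDEFINED."""
    if not (in_domain(a) and in_domain(b)):
        return UNDEFINED
    ka, kb = a.lower(), b.lower()
    return (ka > kb) - (ka < kb)


def ordering_holds_on_domain(strings):
    """Check the ordering properties only for strings inside the domain."""
    valid = [s for s in strings if in_domain(s)]
    # Swapping the arguments must flip the sign, so exactly one of
    # less-than, equal, greater-than holds for each pair.
    for a, b in product(valid, repeat=2):
        if compare(a, b) != -compare(b, a):
            return False
    # Transitivity: a <= b and b <= c implies a <= c.
    for a, b, c in product(valid, repeat=3):
        if compare(a, b) <= 0 and compare(b, c) <= 0 and compare(a, c) > 0:
            return False
    return True


if __name__ == "__main__":
    print(compare("a", "B"))
    print(compare("\u00e9", "a"))
    print(ordering_holds_on_domain(["a", "B", "c", "\u00e9", "ab"]))
```

The MUST is only ever tested against valid strings; an invalid string yields the undefined value from every operation and so never gets a chance to break transitivity.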

Mark


On 5/16/06, Arnt Gulbrandsen <arnt@gulbrandsen.priv.no> wrote:
>
> Arnt Gulbrandsen writes:
> > Mark Davis writes:
> >> ...
> >> At a quick glance, it appears that a number of comments have been
> >> incorporated.
> >
> > It is possible that some of my changes don't satisfy you. I had
> > conflicting requests for many things. Feel free to repeat, rephrase
> > or add arguments.
>
> In -10 (which I'll send off once I finish work this evening) I've made
> another few changes.
>
> >>>       > 2.4 Sort Keys
> >>>
> >>> The use of the term "collation canonicalization" to refer to sort
> >>> keys is very misleading. ...
> >
> > Changed; the text now speaks of sort keys. I'm afraid there still are
> > instances of the old term around, I found one today.
>
> In -10, all should be dead.
>
> >>> The term 'error' is also problematic, since what is really at issue
> >>> is a question of domain. For all those strings in the domain,
> >>> either 'equal' or 'not_equal' should be returned from the equality
> >>> function. For any string not in the domain, 'undefined' should be
> >>> returned.
> >
> > Not changed. Back in February, I agreed that "error" was not ideal,
> > but did not see "undefined" as better, and could not find a really
> > apt term. The collations were a little too well-defined in the
> > "undefined" cases then.
> >
> > However, in -10, I think they really will be undefined outside their
> > domain, so I'll change to using "undefined" instead of "error". (I'm
> > removing the bits that fall back to i;octet.)
>
> Changed. The fallback to i;octet is now in the server, if the protocol
> requires it.
>
> This means that if a server can escape implementing i;octet, it can keep
> all its strings in UCS-2 or UCS-4 internally, even as it implements
> collations which are defined in terms of octets.
>
> Arnt
>
Received on Tuesday, 16 May 2006 20:43:40 UTC