Re: draft-newman-i18n-collation-09.txt just posted

From: Arnt Gulbrandsen <arnt@gulbrandsen.priv.no>
Date: Wed, 10 May 2006 15:08:53 +0200
Message-Id: <zDbwsk8gz+oQmEpZqtQpeQ.md5@libertango.oryx.com>
To: ietf-imapext@imc.org, ietf-mta-filters@imc.org
Cc: Mark Davis <mark.davis@icu-project.org>, public-ietf-collation@w3.org

Mark Davis writes:
> The release of this is timely (we didn't get notified of a 07 or 08 
> draft), since the Unicode Technical Committee is meeting next week, 
> and can discuss it.
>
> Could you indicate which of the items raised in the email of 
> 2006-02-21 from the Unicode Technical Committee have been addressed 
> in this release (and if not accepted then why)? That would help 
> greatly with the review. (I couldn't find any archive for discussion 
> of draft-newman-i18n-comparator where that email could be publicly 
> linked from, so I am appending it at the end of this message.) At a 
> quick glance, it appears that a number of comments have been 
> incorporated.

Lots. Some not. See below.

It is possible that some of my changes don't satisfy you. I had 
conflicting requests for many things. Feel free to repeat, rephrase or 
add arguments.

> Mark
>
> BTW, despite the subject of the message, the document is at 
> http://www.ietf.org/internet-drafts/draft-newman-i18n-comparator-09.txt. 
> It helps to send out a link, especially if the name (comparator vs 
> collation) is wrong ;-)

Mea culpa. My apologies.

...
>> To:   Network Working Group
>> Re:   draft-newman-i18n-comparator
>> Date:         2006-02-21
>> From:         Unicode Technical Committee
>>
>> The Unicode Technical Committee has reviewed the document 
>> http://www.ietf.org/internet-drafts/draft-newman-i18n-comparator-06.txt. 
>> While UTC is in favor of the goal, there are a number of problems 
>> with the document. The main problems are outlined below. Once these 
>> are addressed, then further review can continue.
>>
>>     Details
>>
>>       > 2.1 Definitions
>>
>>         Content
>>
>> The document needs to include the definitions of the technical terms 
>> used in the document,  including all those that may not be familiar 
>> to implementers, such as "trichotomous" and "collation identifiers". 
>> In particular, the notion of a substring is /prima facie/ quite 
>> simple, but there are complications that require a clear definition. 
>> The text in the document does not make clear that there may be more 
>> than one match for a substring in a string, and that the matches can 
>> overlap. It says "the starting offset", for example, when there may 
>> be multiple ones.

Changed.
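
To illustrate the point about multiple and overlapping matches, here is a 
minimal sketch. It assumes plain code-point matching (roughly what an 
octet-style collation would do) and is not taken from the draft; the 
function name all_match_offsets is made up for illustration.

```python
def all_match_offsets(haystack: str, needle: str) -> list[int]:
    """Return every starting offset at which needle occurs, overlaps included."""
    offsets = []
    start = haystack.find(needle)
    while start != -1:
        offsets.append(start)
        # Advance by one position, not by len(needle), so that
        # overlapping matches are reported as well.
        start = haystack.find(needle, start + 1)
    return offsets


# "ana" occurs twice in "banana", and the two occurrences overlap.
print(all_match_offsets("banana", "ana"))  # -> [1, 3]
```

So "the starting offset" is ambiguous here: a specification has to say 
whether it means the first match, the last, or all of them.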

>> Moreover, language sensitive matches have additional complications 
>> which need to be called out. For more information, see 
>> http://www.unicode.org/reports/tr10/#Searching

Not really changed. As I recall, I added a little bit of text.

>>         Format
>>
>> If there is a "Definitions" section, readers have a reasonable 
>> expectation that that section should contain all the required 
>> definitions. However, a number of definitions are scattered within 
>> the text. One of two approaches should be taken
>>
>>    1. Move all the definitions into this section.
>>    2. Remove the definitions section, but clearly call out in the text
>>       the definitions of each term on its own line.
>>
>> Mixing these two styles is needlessly confusing for readers.

Not changed; I'm going by what confuses reviewers.

>>       > 2.4 Sort Keys
>>
>> The use of the term "collation canonicalization" to refer to sort 
>> keys is very misleading. ...

Changed; the text now speaks of sort keys. I'm afraid there still are 
instances of the old term around; I found one today.

>> > 3.2
>>
>> This specifies that clients that support disconnected operation 
>> should not use wildcards while clients that provide collation 
>> operations only when connected to the server may use wildcards.

This restriction has been lifted.

>> The EBNF syntax shown in section 3.2 says that the collation-wild 
>> must not exceed 255 characters total while the section 3.1 specifies 
>> that the collation name must not exceed 254 characters.

Brought into sync.

>> It seems having the same maximum possible length for both collation 
>> name and wildcard string would be desirable for actual 
>> implementations.

I picked 254, not 255, but I confess I cannot remember why.

>>       > 4.2.1 Equality
>>
>> It needs to be made clear that the return values are not physically 
>> the strings "match", etc. but enumerated values such as /equal/ and  
>> /not_equal/.

Changed. Also other similar changes.

>> One extremely important point is that for a given comparator, the 
>> equality function must be synchronized with the ordering function.

I've done this and all the other equivalences/connections/implications I 
could see.
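
One way to keep the two functions synchronized is to derive equality 
from ordering, sketched below. This is only an illustration: the toy 
caseless collation, the enum names and the function names are my 
assumptions, not text from the draft. The results are enumerated values 
rather than strings, as requested above.

```python
from enum import Enum


class Order(Enum):
    LESS = "less"
    EQUAL = "equal"
    GREATER = "greater"


class Equality(Enum):
    EQUAL = "equal"
    NOT_EQUAL = "not_equal"


def sort_key(s: str) -> str:
    # Toy collation for illustration only: caseless comparison via casefold().
    return s.casefold()


def order(a: str, b: str) -> Order:
    """Ordering function: compares the sort keys of the two operands."""
    ka, kb = sort_key(a), sort_key(b)
    if ka < kb:
        return Order.LESS
    if ka > kb:
        return Order.GREATER
    return Order.EQUAL


def equality(a: str, b: str) -> Equality:
    """Equality is defined through order(), so the two cannot disagree."""
    return Equality.EQUAL if order(a, b) is Order.EQUAL else Equality.NOT_EQUAL


print(equality("Mark", "mark"))  # -> Equality.EQUAL
print("Mark" == "mark")          # -> False
```

The last two lines show strings that are equal under the collation 
without being identical.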

>> The term 'error' is also problematic, since what is really at issue 
>> is a question of domain. For all those strings in the domain, either 
>> 'equal' or 'not_equal' should be returned from the equality 
>> function. For any string not in the domain, 'undefined' should be 
>> returned.

Not changed. Back in February, I agreed that "error" was not ideal, but 
did not see "undefined" as better, and could not find a really apt 
term. The collations were a little too well-defined in the "undefined" 
cases then.

However, in -10, I think they really will be undefined outside their 
domain, so I'll change to using "undefined" instead of "error". (I'm 
removing the bits that fall back to i;octet.)

>> There is a typo at the 4'th line of the second paragraph of the 
>> section 4.2 saying "... For example, an collation" which should be 
>> changed to "... For example, a collation" instead.

Fixed.

>>       > 4.2.2 Substring
>>
>> Prefix and suffix matching are not fully spelled out.

I think they are now.

>> The operations and their results must be clarified. And as noted 
>> before, it is very important to precisely define the substring 
>> operations, especially the starting offset and ending offset. It 
>> also must be clarified whether what is being asked for is the first 
>> possible matching location in the string, the last, or the nth one.

Partly changed. I didn't do the bits you ask for in the last sentence. I 
can add an open issue.

>>       > 4.3.3 Ordering
>>
>> > It MUST be transitive and trichotomous.
>>
>> As above, these should be defined.

I did not, since I think this document is the wrong place to define 
these terms.

>> The exposition in this section would be simpler if you also defined 
>> "reversible", whereby f(a,b) = less iff f(b,a) = greater.

The exposition changed enough as a result of other comments that I 
disregarded this comment.

>> An 'undefined' value can be allowed if, as per equality above, it 
>> means that at least one of the operands is outside of the domain. 
>> The function then imposes a total order on all strings in the 
>> domain; moreover, a wrapper can easily convert the function to a 
>> total order over all strings by putting all items outside the domain 
>> either below or above the ones in the domain -- or even excluding 
>> them, at its choice.

I'm doing something like this in -10. (Removing the fallback to i;octet.)
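
A rough sketch of the wrapper idea, under stated assumptions: the domain 
is ASCII-only strings, the collation is a toy caseless one, and strings 
outside the domain sort below those inside it. Ordering out-of-domain 
strings among themselves by code point is my own choice to make the 
order total; none of the names below come from the draft.

```python
from enum import Enum
from functools import cmp_to_key


class Result(Enum):
    LESS = "less"
    EQUAL = "equal"
    GREATER = "greater"
    UNDEFINED = "undefined"


def in_domain(s: str) -> bool:
    # Hypothetical domain for this sketch: ASCII-only strings.
    return s.isascii()


def compare(a: str, b: str) -> Result:
    """Toy collation: undefined if either operand is outside the domain."""
    if not (in_domain(a) and in_domain(b)):
        return Result.UNDEFINED
    ka, kb = a.lower(), b.lower()
    if ka < kb:
        return Result.LESS
    if ka > kb:
        return Result.GREATER
    return Result.EQUAL


def total_compare(a: str, b: str) -> int:
    """Wrapper that extends compare() to a total order over all strings."""
    a_in, b_in = in_domain(a), in_domain(b)
    if a_in and b_in:
        return {Result.LESS: -1, Result.EQUAL: 0, Result.GREATER: 1}[compare(a, b)]
    if a_in != b_in:
        # Exactly one operand is outside the domain; it sorts below the other.
        return 1 if a_in else -1
    # Both outside the domain: code point order (an assumption of this sketch).
    return (a > b) - (a < b)


words = ["pear", "Apple", "\u00e9clair", "banana"]
print(sorted(words, key=cmp_to_key(total_compare)))
# -> ['éclair', 'Apple', 'banana', 'pear']
```

Within the domain the function is also "reversible" in the sense quoted 
earlier: compare("a", "b") is LESS exactly when compare("b", "a") is 
GREATER.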

>> [Note: it is very important to avoid the confusion between 
>> "identical" and "equal". According to a caseless compare, "Mark" and 
>> "mark" are equal; however, the strings are not identical.]

Changed all over the place.

>> [Either 'ordering function' or 'comparison function' should be used 
>> consistently, not sometimes 'collations'].

Changed.

>>       > 4.3.  Internal Canonicalization Algorithm
>>
>> This section is difficult to understand.

Changed; I hope the new text is better.

>>       > 4.4.  Use of Lookup Tables
>>
>> It is not at all clear what is meant by "customizable lookup tables".

Clarified and partly removed.

>>       > 4.5.  Multi-Value Attributes
>>
>> This is very unclear.

Deleted.

>> This is a very important feature that needs to be spelled out in 
>> detail, and clearly reflected in the template for registration. In 
>> particular, the template should have provision for multiple 
>> attributes, with the ability to specify the acceptable operands for 
>> that attribute. (See below). The specification of the operands could 
>> be either a list of values, or a regular expression (with the former 
>> preferred). Suggested regular expression syntax would be Perl or XML 
>> Schema.

I asked Martin Dürst and you to provide a new DTD. Martin said okay, I 
don't remember whether you answered. I think the DTD should come before 
this.

>>       > 5.1 Character Encoding
>>
>>    The protocol specification has to make sure that it is clear on which
>>    characters (rather than just octets) the collations are used.  This
>>    can be done by specifying the protocol itself in terms of characters
>>    (e.g. in the case of a query language), by specifying a single
>>    character encoding for the protocol (e.g.  UTF-8 [3]), or by
>>    carefully describing the relevant issues of character encoding
>>    labeling and conversion.  In the latter case, details to consider
>>    include how to handle unknown charsets, any charsets which are
>>    mandatory-to-implement, any issues with byte-order that might apply,
>>    and any transfer encodings which need to be supported.
>>
>> If a collation is able to advertise itself as being able to handle, 
>> say, SJIS and UTF-8, then there should be a required description of a 
>> protocol for indicating that and for communicating which encodings 
>> are handled, and how it handles error conditions (such as a charset 
>> outside of those it can handle). Otherwise, it is difficult to 
>> understand how this paragraph would be applied in practice.
>>
>>       > 5.3
>>
>> The section 5.3 specifies:
>>
>>     The protocol MUST specify how comparisons behave in the absence of
>>     explicit collation negotiation or when a collation of "*" is
>>     requested. The protocol MAY specify that the default collation
>>     used in such circumstances is sensitive to server configuration.
>>
>> and the section 3.2 specifies:
>>
>>     ... If the wildcard string matches multiple collations, the server
>>     SHOULD select the collation with the broadest scope (preferably
>>     international scope), the most recent table versions and the
>>     greatest number of supported operations. A single wildcard
>>     character ("*") refers to the application protocol collation
>>     behavior that would occur if no explicit negotiation were used.
>>
>> These appear inconsistent.

Changed.

>>       7.5.  Example Initial Registry Summary
>>
>> The sample registry would suffer a combinatorial explosion if 
>> parameters are not handled differently.
...

This is the DTD issue.

>> > 11.  Security Considerations
>>
>> This is insufficient. It should at least point to the problems 
>> related in UCA and in 
>> http://www.unicode.org/reports/tr36/tr36-4.html (note that that 
>> document has been approved by the UTC and will be posted as an 
>> approved version soon.)

It now refers to them.

>>     General
>>
>> One of the real problems with the IANA character registry is that the 
>> entries are underspecified. It quite often occurs that two vendors 
>> implement the same IANA charset conversion different ways, leading 
>> to significant interoperability problems and text corruption. See, 
>> for example, 
>> http://www.w3.org/Submission/japanese-xml/#ambiguity_of_yen.
>>
>> We have the real concern that this registry could lead down the same path.

Noted.

>> > collation, it has to say so
>>
>> There are places where the text should be clarified, as to whether a 
>> MUST or SHOULD is implied; this is just an example.
>>
>> > "comparator" vs "collator"
>>
>> Either one term or the other should be used consistently.

Collator, now.

>> > Unicode 3.2
>>
>> Unicode 3.2 is obsolete; the reference versions for the Collation 
>> Registry should be Unicode 5.0 and UCA 5.0, since those will be 
>> approved and published by the time the Internet Application Protocol 
>> Collation Registry has completed its review and been approved.

I'll update to the then-current versions immediately before submitting 
the final draft as an RFC.

>> Because of the use of NamePrep, it is probably the case that Unicode 
>> 3.2 also needs to be included, but strongly recommended for usage 
>> only by protocols or systems dependent on NamePrep. Note that as of 
>> UCA 4.0 and beyond, the version number of UCA is guaranteed to be 
>> identical with the version number of Unicode that it is defined for.
>>
>> > Versioning
>>
>> This is tricky, and should be clarified. In many instances, it is 
>> sufficient to use an unversioned collator, such as simply "UCA". In 
>> other cases, there are requirements to use a specific version, or a 
>> version of at least X. This needs to be described.

IETF documents should have only immutable references. Thus, I can 
reference "UCAv14", but not "UCA", because the latter moves to v15, v16 
and onwards.

Arnt
Received on Wednesday, 10 May 2006 13:05:46 GMT
