- From: Arnt Gulbrandsen <arnt@gulbrandsen.priv.no>
- Date: Wed, 10 May 2006 15:08:53 +0200
- To: ietf-imapext@imc.org, ietf-mta-filters@imc.org
- Cc: Mark Davis <mark.davis@icu-project.org>, public-ietf-collation@w3.org
Mark Davis writes:
> The release of this is timely (we didn't get notified of a 07 or 08
> draft), since the Unicode Technical Committee is meeting next week,
> and can discuss it.
>
> Could you indicate which of the items raised in the email of
> 2006-02-21 from the Unicode Technical Committee have been addressed
> in this release (and if not accepted then why)? That would help
> greatly with the review. (I couldn't find any archive for discussion
> of draft-newman-i18n-comparator where that email could be publicly
> linked from, so I am appending it at the end of this message.) At a
> quick glance, it appears that a number of comments have been
> incorporated.
Lots. Some not. See below.
It is possible that some of my changes don't satisfy you. I had
conflicting requests for many things. Feel free to repeat, rephrase or
add arguments.
> Mark
>
> BTW, despite the subject of the message, the document is at
> http://www.ietf.org/internet-drafts/draft-newman-i18n-comparator-09.txt.
> It helps to send out a link, especially if the name (comparator vs
> collation) is wrong ;-)
Mea culpa. My apologies.
...
>> To: Network Working Group
>> Re: draft-newman-i18n-comparator
>> Date: 2006-02-21
>> From: Unicode Technical Committee
>>
>> The Unicode Technical Committee has reviewed the document
>> http://www.ietf.org/internet-drafts/draft-newman-i18n-comparator-06.txt.
>> While UTC is in favor of the goal, there are a number of problems
>> with the document. The main problems are outlined below. Once these
>> are addressed, then further review can continue.
>>
>> Details
>>
>> > 2.1 Definitions
>>
>> Content
>>
>> The document needs to include the definitions of the technical terms
>> used in the document, including all those that may not be familiar
>> to implementers, such as "trichotomous" and "collation identifiers".
>> In particular, the notion of a substring is /prima facie/ quite
>> simple, but there are complications that require a clear definition.
>> The text in the document does not make clear that there may be more
>> than one match for a substring in a string, and that the matches can
>> overlap. It says "the starting offset", for example, when there may
>> be multiple ones.
Changed.
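To illustrate the point about multiple and overlapping matches, here is a minimal sketch. It assumes plain code point comparison (no language-sensitive matching), and the function name all_match_offsets is hypothetical, not taken from the draft.

```python
from typing import List


def all_match_offsets(haystack: str, needle: str) -> List[int]:
    """Return every starting offset at which needle occurs, overlaps included."""
    if not needle:
        return []
    offsets = []
    start = haystack.find(needle)
    while start != -1:
        offsets.append(start)
        # Advance by one position, not by len(needle), so that
        # overlapping matches are reported as well.
        start = haystack.find(needle, start + 1)
    return offsets


print(all_match_offsets("aaaa", "aa"))  # [0, 1, 2]
```

A phrase like "the starting offset" would have to pick one of these three offsets, which is why the definition needs to say which.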
>> Moreover, language sensitive matches have additional complications
>> which need to be called out. For more information, see
>> http://www.unicode.org/reports/tr10/#Searching
Not really changed. As I recall, I added a little bit of text.
>> Format
>>
>> If there is a "Definitions" section, readers have a reasonable
>> expectation that that section should contain all the required
>> definitions. However, a number of definitions are scattered within
>> the text. One of two approaches should be taken:
>>
>> 1. Move all the definitions into this section.
>> 2. Remove the definitions section, but clearly call out in the text
>> the definitions of each term on its own line.
>>
>> Mixing these two styles is needlessly confusing for readers.
Not changed; I'm going by what confuses reviewers.
>> > 2.4 Sort Keys
>>
>> The use of the term "collation canonicalization" to refer to sort
>> keys is very misleading. ...
Changed; the text now speaks of sort keys. I'm afraid there are still
instances of the old term around; I found one today.
>> > 3.2
>>
>> This specifies that clients that support disconnected operation
>> should not use wildcards while clients that provide collation
>> operations only when connected to the server may use wildcards.
This restriction has been lifted.
>> The EBNF syntax shown in section 3.2 says that the collation-wild
>> must not exceed 255 characters total while the section 3.1 specifies
>> that the collation name must not exceed 254 characters.
Brought into sync.
>> It seems having the same maximum possible length for both collation
>> name and wildcard string would be desirable for actual
>> implementations.
I picked 254, not 255, but I confess I cannot remember why.
>> > 4.2.1 Equality
>>
>> It needs to be made clear that the return values are not physically
>> the strings "match", etc. but enumerated values such as /equal/ and
>> /not_equal/.
Changed. Also other similar changes.
>> One extremely important point is that for a given comparator, the
>> equality function must be synchronized with the ordering function.
I've done this and all the other equivalences/connections/implications I
could see.
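One way to keep equality synchronized with ordering is to derive the former from the latter. The sketch below assumes a toy caseless collator built on str.casefold(); the enum and function names are hypothetical and are not the draft's. It also shows results as enumerated values rather than strings.

```python
from enum import Enum


class Order(Enum):
    LESS = "less"
    EQUAL = "equal"
    GREATER = "greater"


class Equality(Enum):
    EQUAL = "equal"
    NOT_EQUAL = "not_equal"


def caseless_order(a: str, b: str) -> Order:
    """Toy ordering function: compare case-folded strings by code point."""
    ka, kb = a.casefold(), b.casefold()
    if ka < kb:
        return Order.LESS
    if ka > kb:
        return Order.GREATER
    return Order.EQUAL


def caseless_equality(a: str, b: str) -> Equality:
    """Equality derived from the ordering, so the two cannot disagree."""
    if caseless_order(a, b) is Order.EQUAL:
        return Equality.EQUAL
    return Equality.NOT_EQUAL


# "Mark" and "mark" are equal under this collator but not identical.
print(caseless_equality("Mark", "mark"), "Mark" == "mark")
```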
>> The term 'error' is also problematic, since what is really at issue
>> is a question of domain. For all those strings in the domain, either
>> 'equal' or 'not_equal' should be returned from the equality
>> function. For any string not in the domain, 'undefined' should be
>> returned.
Not changed. Back in February, I agreed that "error" was not ideal, but
did not see "undefined" as better, and could not find a really apt
term. The collations were a little too well-defined in the "undefined"
cases then.
However, in -10, I think they really will be undefined outside their
domain, so I'll change to using "undefined" instead of "error". (I'm
removing the bits that fall back to i;octet.)
>> There is a typo at the 4'th line of the second paragraph of the
>> section 4.2 saying "... For example, an collation" which should be
>> changed to "... For example, a collation" instead.
Fixed.
>> > 4.2.2 Substring
>>
>> Prefix and suffix matching are not fully spelled out.
I think they are now.
>> The operations and their results must be clarified. And as noted
>> before, it is very important to precisely define the substring
>> operations, especially the starting offset and ending offset. It
>> also must be clarified whether what is being asked for is the first
>> possible matching location in the string, the last, or the nth one.
Partly changed. I didn't do the bits you ask for in the last sentence. I
can add an open issue.
>> > 4.3.3 Ordering
>>
>> > It MUST be transitive and trichotomous.
>>
>> As above, these should be defined.
I did not, since I think this document is the wrong place to define
these terms.
>> The exposition in this section would be simpler if you also defined
>> "reversible", whereby f(a,b) = less iff f(b,a) = greater.
The exposition changed enough as a result of other comments that I
disregarded this comment.
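For reference, the "reversible" property can be checked mechanically over a sample of strings. This is only a sketch: the toy caseless ordering and the helper name is_reversible are assumptions made for illustration, not anything in the draft.

```python
from itertools import product
from typing import Callable, Iterable


def toy_order(a: str, b: str) -> str:
    """Toy caseless ordering returning 'less', 'equal' or 'greater'."""
    ka, kb = a.casefold(), b.casefold()
    if ka < kb:
        return "less"
    if ka > kb:
        return "greater"
    return "equal"


def is_reversible(f: Callable[[str, str], str], samples: Iterable[str]) -> bool:
    """Check f(a, b) == 'less' iff f(b, a) == 'greater' for every sample pair."""
    items = list(samples)
    return all(
        (f(a, b) == "less") == (f(b, a) == "greater")
        for a, b in product(items, repeat=2)
    )


print(is_reversible(toy_order, ["Mark", "mark", "arnt", "Zed"]))
```

An ordering that always answers "less" fails the check, since swapping the operands never yields "greater".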
>> An 'undefined' value can be allowed if, as per equality above, it
>> means that at least one of the operands is outside of the domain.
>> The function then imposes a total order on all strings in the
>> domain; moreover, a wrapper can easily convert the function to a
>> total order over all strings by putting all items outside the domain
>> either below or above the ones in the domain -- or even excluding
>> them, at its choice.
I'm doing something like this in -10. (Removing the fallback to i;octet.)
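The wrapper the UTC describes might look roughly like the sketch below. Everything in it is assumed for illustration: the toy collator whose domain is ASCII strings, the use of None for "undefined", the choice to place out-of-domain strings above in-domain ones, and the code point order used among out-of-domain strings.

```python
from functools import cmp_to_key
from typing import Optional


def in_domain(s: str) -> bool:
    """Assumed domain of the toy collator: ASCII-only strings."""
    return s.isascii()


def ascii_caseless_cmp(a: str, b: str) -> Optional[int]:
    """Return -1, 0 or 1 inside the domain, None ('undefined') outside it."""
    if not (in_domain(a) and in_domain(b)):
        return None
    ka, kb = a.lower(), b.lower()
    return (ka > kb) - (ka < kb)


def total_cmp(a: str, b: str) -> int:
    """Wrapper: a total order over all strings, out-of-domain strings last."""
    result = ascii_caseless_cmp(a, b)
    if result is not None:
        return result
    a_in, b_in = in_domain(a), in_domain(b)
    if a_in != b_in:
        # Exactly one operand is in the domain; it sorts first.
        return -1 if a_in else 1
    # Both are outside the domain: an arbitrary but consistent order
    # (code point order) keeps the overall result total.
    return (a > b) - (a < b)


print(sorted(["b", "\u00e9", "A"], key=cmp_to_key(total_cmp)))
```

The collator itself stays undefined outside its domain; the policy for such strings lives entirely in the wrapper, so a caller could equally place them below or drop them.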
>> [Note: it is very important to avoid the confusion between
>> "identical" and "equal". According to a caseless compare, "Mark" and
>> "mark" are equal; however, the strings are not identical.]
Changed all over the place.
>> [Either 'ordering function' or 'comparison function' should be used
>> consistently, not sometimes 'collations'].
Changed.
>> > 4.3. Internal Canonicalization Algorithm
>>
>> This section is difficult to understand.
Changed; I hope the new text is better.
>> > 4.4. Use of Lookup Tables
>>
>> It is not at all clear what is meant by "customizable lookup tables".
Clarified and partly removed.
>> > 4.5. Multi-Value Attributes
>>
>> This is very unclear.
Deleted.
>> This is a very important feature that needs to be spelled out in
>> detail, and clearly reflected in the template for registration. In
>> particular, the template should have provision for multiple
>> attributes, with the ability to specify the acceptable operands for
>> that attribute. (See below). The specification of the operands could
>> be either a list of values, or a regular expression (with the former
>> preferred). Suggested regular expression syntax would be Perl or XML
>> Schema.
I asked Martin Dürst and you to provide a new DTD. Martin said okay; I
don't remember whether you answered. I think the DTD should come before
this.
>> > 5.1 Character Encoding
>>
>> The protocol specification has to make sure that it is clear on which
>> characters (rather than just octets) the collations are used. This
>> can be done by specifying the protocol itself in terms of characters
>> (e.g. in the case of a query language), by specifying a single
>> character encoding for the protocol (e.g. UTF-8 [3]), or by
>> carefully describing the relevant issues of character encoding
>> labeling and conversion. In the latter case, details to consider
>> include how to handle unknown charsets, any charsets which are
>> mandatory-to-implement, any issues with byte-order that might apply,
>> and any transfer encodings which need to be supported.
>>
>> If a collation is able to advertise itself as being able to handle,
>> say, SJIS and UTF-8, then there should be a required description of a
>> protocol for indicating that and for communicating which encodings
>> are handled, and how it handles error conditions (such as a charset
>> outside of those it can handle). Otherwise, it is difficult to
>> understand how this paragraph would be applied in practice.
>>
>> > 5.3
>>
>> The section 5.3 specifies:
>>
>> The protocol MUST specify how comparisons behave in the absence of
>> explicit collation negotiation or when a collation of "*" is
>> requested. The protocol MAY specify that the default collation
>> used in such circumstances is sensitive to server configuration.
>>
>> and the section 3.2 specifies:
>>
>> ... If the wildcard string matches multiple collations, the server
>> SHOULD select the collation with the broadest scope (preferably
>> international scope), the most recent table versions and the
>> greatest number of supported operations. A single wildcard
>> character ("*") refers to the application protocol collation
>> behavior that would occur if no explicit negotiation were used.
>>
>> These appear inconsistent.
Changed.
>> 7.5. Example Initial Registry Summary
>>
>> The sample registry would suffer a combinatorial explosion if
>> parameters are not handled differently.
...
This is the DTD issue.
>> > 11. Security Considerations
>>
>> This is insufficient. It should at least point to the problems
>> related in UCA and in
>> http://www.unicode.org/reports/tr36/tr36-4.html (note that that
>> document has been approved by the UTC and will be posted as an
>> approved version soon.)
It now refers to them.
>> General
>>
>> One of the real problems with the IANA character registry is that the
>> entries are underspecified. It quite often occurs that two vendors
>> implement the same IANA charset conversion in different ways, leading
>> to significant interoperability problems and text corruption. See,
>> for example,
>> http://www.w3.org/Submission/japanese-xml/#ambiguity_of_yen.
>>
>> We have the real concern that this registry could lead down the same path.
Noted.
>> > collation, it has to say so
>>
>> There are places where the text should be clarified, as to whether a
>> MUST or SHOULD is implied; this is just an example.
>>
>> > "comparator" vs "collator"
>>
>> Either one term or the other should be used consistently.
Collator, now.
>> > Unicode 3.2
>>
>> Unicode 3.2 is obsolete; the reference versions for the Collation
>> Registry should be Unicode 5.0 and UCA 5.0, since those will be
>> approved and published by the time the Internet Application Protocol
>> Collation Registry has completed its review and been approved.
I'll update to the then-current versions immediately before submitting
the final draft as an RFC.
>> Because of the use of NamePrep, it is probably the case that Unicode
>> 3.2 also needs to be included, but strongly recommended for usage
>> only by protocols or systems dependent on NamePrep. Note that as of
>> UCA 4.0 and beyond, the version number of UCA is guaranteed to be
>> identical with the version number of Unicode that it is defined for.
>>
>> > Versioning
>>
>> This is tricky, and should be clarified. In many instances, it is
>> sufficient to use an unversioned collator, such as simply "UCA". In
>> other cases, there are requirements to use a specific version, or a
>> version of at least X. This needs to be described.
IETF documents should have only immutable references. Thus, I can
reference "UCAv14", but not "UCA", because the latter moves to v15, v16
and onwards.
Arnt
Received on Wednesday, 10 May 2006 13:05:46 UTC