Re: sec 4.11 from Tex Texin on 2004-05-11 (public-i18n-ws@w3.org from May 2004)

From: Tex Texin <tex@xencraft.com>
Date: Tue, 11 May 2004 01:20:10 -0400
To: Mark Davis <mark.davis@jtcsv.com>
Cc: "Addison Phillips [wM]" <aphillips@webmethods.com>, Web Services <public-i18n-ws@w3.org>
Message-ID: <40A0628A.D48B6322@xencraft.com>
Hi Mark and thanks,

1) ok on "suggestive". I had in mind that someone might guess if the locale was
Spain vs. Latin America, whether to use Modern or Trad., so it was indicative
of collation in the same "loose" way PRC vs ROC is indicative of Simplified vs
Trad. Chinese. It is not only that it is not specified but it can't (or
shouldn't) be inferred.

But your suggestion is fine and we can address the issue of bad practices due
to language identifiers being inadequate some other time and place.

2)  ok on "unfortunately". As it is sometimes a good thing, I might qualify it
as "unintentional or unexpected" differences, but it's ok for now.

3) ok on "strength". I had considered "strength" and thought it might be too
specific to UCA and/or ICU and not generally recognized, whereas case and
accent sensitivity are concepts most people have seen. But I did a search on
"collation strength" and the term is used by Java and XML, thanks to their
adoption of Unicode and so it has become more generally known. Addison, maybe
we can add a definition of strength in a footnote. I like Mark's parenthetical
use of "(inverse of sensitivity)". I think we should also add the reference to
tn9- http://www.unicode.org/notes/tn9/.

Mark, You make an interesting distinction that strength makes a feature more or
less important not necessarily ignorable.
There probably should be a separate name for that dimension of collation
(ignoring characters, or indeterminacy) to distinguish it from strength.
Most of the sorts I have worked with don't implement strength fully and leave
the sort indeterminate.

ok on replacing "text elements" too.

4) on merging fields, we already reference UCA, but its ok to highlight the
coverage of this point in the UCA.

5) on abbreviations, I am not sure how often expansion is used, I do know it is
done. If by limited environment you mean with respect to certain fields rather
than applied to all fields, yes of course.

on abbreviations in a multilingual context: well yes, collation is language
sensitive and I was looking to indicate there are problems due to using narrow
language-sensitive operations in a multilingual space. Maybe you can suggest
some better examples?
I think this is an area where Web Services may get tripped up.

6) ok on searching and loose matching.

7) on using more sensitive collation for large number of records, I did not
mean to dynamically adjust the collation.

But in weighing the performance cost etc., decisions about strength or ignoring
characters and leaving the sort indeterminate, are factors in selecting the
collation algorithm to begin with. I would use a different routine to sort a
drop-down list which is always small and where some indeterminacy is tolerable,
compared to sorting database records and where indeterminacy might force users
to scan several screenfuls of data to look for a value and greater determinacy
would narrow that scan.

8) ok on "preordained". And me an atheist... ;-)

===========
Addison, I am easy about who makes the changes for Mark's suggestions. I can't
work on this until at least tomorrow night though.
If someone can get to it sooner its fine with me.
I am happy to work on it further, I am just crunched right now and can't do it
immediately. What's our sched for publishing now?

tex




Mark Davis wrote:
> 
> Nice document. Some comments.
> 
> > A language identifier may be suggestive as to whether a requester expects a
> Traditional or a Modern sort ordering (in Spanish for example) but it may not be
> definitive.
> 
> I don't think this is the right way to put it. Language (or locale) identifiers
> typically aren't 'suggestive' or not. Rather:
> 
> A language or locale identifier might not specify the particular variant of
> collation (e.g. traditional or modern sort ordering in Spanish, for example)
> that the requestor would expect.
> 
> > Different components or subsystems which are used by a software process may
> employ different sort orderings.
> 
> This gives the wrong impression going into the paragraph. Sounds like it's
> supposed to be a good thing! Start with "Unfortunately, ..."
> 
> > Knowing the language of the requester does not prescribe how sensitive the
> collation should be. Should text elements that are different by case or accent
> be treated as distinct? Should certain characters be ignored? For example,
> hyphens are often ignored so that "e-mail" and "email" sort together.
> 
> "Sensitive" is an awkward word, both here and below. And "text elements" is a
> jargon term. The strength (inverse of sensitivity) usually specifies how
> important a given feature is, not whether it is ignored completely or not. For
> example, even when ignore-punctuation is ON, you still normally get a reasonable
> sort order: email < e-mail < emails < e-mails. But see also
> http://www.unicode.org/notes/tn9/.
> 
> >This can be improved by treating the fields as merged and then sorting.
> 
> This is true (and an excellent point!), but there are some gotchas people should
> be aware of before trying this at home. See UCA.
> 
> >In addition, some collations map abbreviations to their expanded form. "St.
> Botolph St." (in Boston) would expand to "Saint Botolph Street". (Well on a good
> day, given the ambiguity of "St.".)
> 
> In practice, only in very limited environments do people expand abbreviations.
> Too ambiguous.
> 
> >> Of course, in a multilingual environment, many of the above decisions become
> more complex, and an approach that works for one language may interfere with
> others. For example, expansion of abbreviations.
> 
> Not really relevant. Collation is inherently language-sensitive, so of course if
> you mix the abbreviation expansion for Swedish with a German collation, you'll
> get trouble.
> 
> >Searching for text in a document may use less sensitive collations for ease of
> use and so the requester doesn't have to make repeated queries with alternative
> spellings.
> 
> You need to make the point that collation is strongly related to searching
> first. Then use the term 'loose matching' instead of 'sensitive collation'.
> 
> > An application sorting large numbers of records which are similar may use a
> more sensitive collation to order the records determinately.
> 
> Nobody ever does this, to my knowledge. Or if they do, they shouldn't.
> 
> > Often the performance of an application is impacted by collation.
> Good point. Actually the worst case is doing a SELECT against data that doesn't
> have the same sort order.
> 
> > Requests that can use the preordained collation of the index have a
> significant performance advantage over requests that either cannot use indexes
> or must resort results.
> 
> Probably want to stay away from the religious language (remember Bush and
> "Crusade" ;-)
> 
> Mark
Received on Tuesday, 11 May 2004 01:21:06 UTC