Re: sec 4.11 from Mark Davis on 2004-05-11 (public-i18n-ws@w3.org from May 2004)

From: Mark Davis <mark.davis@jtcsv.com>
Date: Tue, 11 May 2004 16:41:54 -0700
To: "Tex Texin" <tex@xencraft.com>
Cc: "Addison Phillips [wM]" <aphillips@webmethods.com>, "Web Services" <public-i18n-ws@w3.org>
Message-ID: <05ba01c437b1$8ace6e20$74d52b09@DAVIS1>
More comments below.

Mark
__________________________________
http://www.macchiato.com
► शिष्यादिच्छेत्पराजयम् ◄

----- Original Message ----- 
From: "Tex Texin" <tex@xencraft.com>
To: "Mark Davis" <mark.davis@jtcsv.com>
Cc: "Addison Phillips [wM]" <aphillips@webmethods.com>; "Web Services"
<public-i18n-ws@w3.org>
Sent: Mon, 2004 May 10 22:20
Subject: Re: sec 4.11


>
> Hi Mark and thanks,
>
> 1) ok on "suggestive". I had in mind that someone might guess if the locale
> was Spain vs. Latin America, whether to use Modern or Trad., so it was
> indicative of collation in the same "loose" way PRC vs ROC is indicative of
> Simplified vs Trad. Chinese. It is not only that it is not specified but it
> can't (or shouldn't) be inferred.

hmmm. Still uneasy. What happens in fact is that whenever we (or anyone else)
get an underspecified request, we *have* to do something. So for example, if we
get a naked 'zh' on a resource lookup, then we have to pick either 'zh-Hant' or
'zh-Hans' (logically, if not physically). If we get es-BO on a collation
request, then we have to choose *some* sorting. And the default sort may be
different for different locales, that is true. But "suggestive" is not
indicative of the true process.
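
A minimal sketch of that "have to do something" step, assuming a small
hand-written defaults table. The table contents and both function names are
hypothetical and chosen only for illustration; a real system would take its
defaults from locale data.

```python
# Hypothetical defaults, for illustration only.
DEFAULT_SCRIPT = {"zh": "zh-Hans"}             # assumption: naked 'zh' -> Simplified
COLLATION_VARIANTS = {"es": ("modern", "traditional")}  # assumption: first is default


def resolve_resource_locale(tag):
    """Fill in a script for a naked language tag; other tags pass through."""
    return DEFAULT_SCRIPT.get(tag, tag)


def resolve_collation(tag):
    """Return (tag, variant), choosing a default variant when none was requested.

    The variant is None for languages with no variants in the table.
    """
    language = tag.split("-")[0]
    variants = COLLATION_VARIANTS.get(language)
    return (tag, variants[0] if variants else None)


print(resolve_resource_locale("zh"))   # the request said nothing about script
print(resolve_collation("es-BO"))      # the request said nothing about collation
```

The point is only that the choice comes from the implementation's defaults,
not from anything the identifier "suggests".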

Of course, where there is a handshake, one can communicate back: "Don't know
what you mean by 'es-BO': pick either 'es-BO-x-collationtraditional' or
'es-BO-x-collationmodern'." But normally that option does not exist.

>
> But your suggestion is fine and we can address the issue of bad practices due
> to language identifiers being inadequate some other time and place.
>
> 2)  ok on "unfortunately". As it is sometimes a good thing, I might qualify it
> as "unintentional or unexpected" differences, but it's ok for now.
>
> 3) ok on "strength". I had considered "strength" and thought it might be too
> specific to UCA and/or ICU and not generally recognized, whereas case and
> accent sensitivity are concepts most people have seen. But I did a search on
> "collation strength" and the term is used by Java and XML, thanks to their
> adoption of Unicode and so it has become more generally known. Addison, maybe
> we can add a definition of strength in a footnote. I like Mark's parenthetical
> use of "(inverse of sensitivity)". I think we should also add the reference to
> tn9- http://www.unicode.org/notes/tn9/.
>
> Mark, you make an interesting distinction that strength makes a feature more
> or less important, not necessarily ignorable.
> There probably should be a separate name for that dimension of collation
> (ignoring characters, or indeterminacy) to distinguish it from strength.
> Most of the sorts I have worked with don't implement strength fully and leave
> the sort indeterminate.

And most should. A fully determinate sort is not generally worth the cost (see
tn9). However, when sorting, at least the case and accent levels, plus
punctuation (if applicable), should not be ignored, although they may be
ignored in searching.

The "ignore punctuation" option really means "make punctuation a really weak
strength".
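
A rough sketch of what "weak strength" means, assuming a toy four-level sort
key. This is not the UCA and not any library's API; the function names
sort_key and loosely_equal are made up for this example. Letters form the
strongest level, then accents, then case, and punctuation only breaks ties
that the other levels leave open.

```python
import unicodedata


def sort_key(text):
    """Toy multi-level key: (letters, accents, case, punctuation)."""
    primary, accents, case, punctuation = [], [], [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            # Accent level: remember which base letter the mark follows.
            accents.append((len(primary), ch))
        elif ch.isalnum():
            primary.append(ch.casefold())
            case.append(ch.isupper())
        else:
            # Weakest level: punctuation (and spaces) with their position.
            punctuation.append((len(primary), ch))
    return ("".join(primary), tuple(accents), tuple(case), tuple(punctuation))


def loosely_equal(a, b):
    """Loose matching for search: compare only the strongest level."""
    return sort_key(a)[0] == sort_key(b)[0]


words = ["e-mails", "emails", "e-mail", "email"]
print(sorted(words, key=sort_key))
print(loosely_equal("E-Mail", "email"))
```

Sorting uses every level, so the hyphen still decides between "email" and
"e-mail"; searching stops at the first level, so the two simply match.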

>
> ok on replacing "text elements" too.
>
> 4) on merging fields, we already reference UCA, but it's ok to highlight the
> coverage of this point in the UCA.
>
> 5) on abbreviations, I am not sure how often expansion is used, I do know it
> is done. If by limited environment you mean with respect to certain fields
> rather than applied to all fields, yes of course.

It is done, but what I mean by "limited environment" is that "replace all
acronyms" makes no sense unless you have a very limited vocabulary, in a very
restricted domain, like "Dental Supplies". How are you supposed to sort "ICU"
when it could have the following expansions???

http://www.acronymfinder.com/af-query.asp?String=exact&Acronym=ICU
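
To make the ambiguity concrete, here is a small sketch that assumes a
hand-built expansion table. The table entries and the function name
expand_for_sorting are hypothetical. An abbreviation is expanded only when the
vocabulary offers exactly one candidate; anything else is left alone.

```python
# Hypothetical expansion table, for illustration only.
EXPANSIONS = {
    "ICU": ["Intensive Care Unit", "International Components for Unicode"],
    "St.": ["Saint", "Street"],
    "DDS": ["Doctor of Dental Surgery"],
}


def expand_for_sorting(token):
    """Expand a token only if it has exactly one known expansion."""
    candidates = EXPANSIONS.get(token, [])
    if len(candidates) == 1:
        return candidates[0]
    # Unknown or ambiguous: sort on the token as written.
    return token


for token in ["DDS", "ICU", "St."]:
    print(token, "->", expand_for_sorting(token))
```

In an open vocabulary nearly every entry ends up with more than one candidate,
which is why this only pays off in a narrow domain.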

>
> on abbreviations in a multilingual context: well yes, collation is language
> sensitive and I was looking to indicate there are problems due to using narrow
> language-sensitive operations in a multilingual space. Maybe you can suggest
> some better examples?
> I think this is an area where Web Services may get tripped up.

I think it just needs to be removed. Mixing operations from different languages
will always get you into trouble; this has nothing really to do with the example
you want to make.

>
> 6) ok on searching and loose matching.
>
> 7) on using more sensitive collation for large number of records, I did not
> mean to dynamically adjust the collation.
>
> But in weighing the performance cost etc., decisions about strength or
> ignoring characters and leaving the sort indeterminate, are factors in
> selecting the collation algorithm to begin with. I would use a different
> routine to sort a drop-down list which is always small and where some
> indeterminacy is tolerable, compared to sorting database records and where
> indeterminacy might force users to scan several screenfuls of data to look
> for a value and greater determinacy would narrow that scan.

Well, you're the only one ;-). On a short list, the performance overhead of
being precise is minuscule -- I would never recommend being sloppy just for
that!!

>
> 8) ok on "preordained". And me an atheist... ;-)

Me too. Same denomination, perhaps.

>
> ===========
> Addison, I am easy about who makes the changes for Mark's suggestions. I can't
> work on this until at least tomorrow night though.
> If someone can get to it sooner it's fine with me.
> I am happy to work on it further, I am just crunched right now and can't do it
> immediately. What's our sched for publishing now?
>
> tex
>
>
>
>
> Mark Davis wrote:
> >
> > Nice document. Some comments.
> >
> > > A language identifier may be suggestive as to whether a requester expects
> > > a Traditional or a Modern sort ordering (in Spanish for example) but it
> > > may not be definitive.
> >
> > I don't think this is the right way to put it. Language (or locale)
> > identifiers typically aren't 'suggestive' or not. Rather:
> >
> > A language or locale identifier might not specify the particular variant of
> > collation (e.g. traditional or modern sort ordering in Spanish) that the
> > requester would expect.
> >
> > > Different components or subsystems which are used by a software process
> > > may employ different sort orderings.
> >
> > This gives the wrong impression going into the paragraph. Sounds like it's
> > supposed to be a good thing! Start with "Unfortunately, ..."
> >
> > > Knowing the language of the requester does not prescribe how sensitive the
> > > collation should be. Should text elements that are different by case or
> > > accent be treated as distinct? Should certain characters be ignored? For
> > > example, hyphens are often ignored so that "e-mail" and "email" sort
> > > together.
> >
> > "Sensitive" is an awkward word, both here and below. And "text elements" is
> > a jargon term. The strength (inverse of sensitivity) usually specifies how
> > important a given feature is, not whether it is ignored completely or not.
> > For example, even when ignore-punctuation is ON, you still normally get a
> > reasonable sort order: email < e-mail < emails < e-mails. But see also
> > http://www.unicode.org/notes/tn9/.
> >
> > >This can be improved by treating the fields as merged and then sorting.
> >
> > This is true (and an excellent point!), but there are some gotchas people
> > should be aware of before trying this at home. See UCA.
> >
> > >In addition, some collations map abbreviations to their expanded form.
> > > "St. Botolph St." (in Boston) would expand to "Saint Botolph Street".
> > > (Well on a good day, given the ambiguity of "St.".)
> >
> > In practice, only in very limited environments do people expand
> > abbreviations. Too ambiguous.
> >
> > > Of course, in a multilingual environment, many of the above decisions
> > > become more complex, and an approach that works for one language may
> > > interfere with others. For example, expansion of abbreviations.
> >
> > Not really relevant. Collation is inherently language-sensitive, so of
> > course if you mix the abbreviation expansion for Swedish with a German
> > collation, you'll get trouble.
> >
> > >Searching for text in a document may use less sensitive collations for
> > > ease of use and so the requester doesn't have to make repeated queries
> > > with alternative spellings.
> >
> > You need to make the point that collation is strongly related to searching
> > first. Then use the term 'loose matching' instead of 'sensitive collation'.
> >
> > > An application sorting large numbers of records which are similar may use
> > > a more sensitive collation to order the records determinately.
> >
> > Nobody ever does this, to my knowledge. Or if they do, they shouldn't.
> >
> > > Often the performance of an application is impacted by collation.
> >
> > Good point. Actually the worst case is doing a SELECT against data that
> > doesn't have the same sort order.
> >
> > > Requests that can use the preordained collation of the index have a
> > > significant performance advantage over requests that either cannot use
> > > indexes or must resort results.
> >
> > Probably want to stay away from the religious language (remember Bush and
> > "Crusade" ;-)
> >
> > Mark
>
>
Received on Tuesday, 11 May 2004 19:42:01 UTC