Re: sec 4.11 from Mark Davis on 2004-05-11 (public-i18n-ws@w3.org from May 2004)

From: Mark Davis <mark.davis@jtcsv.com>
Date: Mon, 10 May 2004 19:28:45 -0700
To: "Tex Texin" <tex@xencraft.com>, "Addison Phillips [wM]" <aphillips@webmethods.com>, "Web Services" <public-i18n-ws@w3.org>
Message-ID: <004601c436ff$af456ea0$74d52b09@DAVIS1>
Nice document. Some comments.

> A language identifier may be suggestive as to whether a requester expects a
Traditional or a Modern sort ordering (in Spanish for example) but it may not be
definitive.

I don't think this is the right way to put it. Language (or locale) identifiers
typically aren't 'suggestive' or not. Rather:

A language or locale identifier might not specify the particular variant of
collation (e.g. traditional or modern sort ordering in Spanish, for example)
that the requestor would expect.

> Different components or subsystems which are used by a software process may
employ different sort orderings.

This gives the wrong impression going into the paragraph. Sounds like it's
supposed to be a good thing! Start with "Unfortunately, ..."

> Knowing the language of the requester does not prescribe how sensitive the
collation should be. Should text elements that are different by case or accent
be treated as distinct? Should certain characters be ignored? For example,
hyphens are often ignored so that "e-mail" and "email" sort together.

"Sensitive" is an awkward word, both here and below. And "text elements" is a
jargon term. The strength (inverse of sensitivity) usually specifies how
important a given feature is, not whether it is ignored completely or not. For
example, even when ignore-punctuation is ON, you still normally get a reasonable
sort order: email < e-mail < emails < e-mails. But see also
http://www.unicode.org/notes/tn9/.

>This can be improved by treating the fields as merged and then sorting.

This is true (and an excellent point!), but there are some gotchas people should
be aware of before trying this at home. See UCA.

>In addition, some collations map abbreviations to their expanded form. "St.
Botolph St." (in Boston) would expand to "Saint Botolph Street". (Well on a good
day, given the ambiguity of "St.".)

In practice, only in very limited environments do people expand abbreviations.
Too ambiguous.

>> Of course, in a multilingual environment, many of the above decisions become
more complex, and an approach that works for one language may interfere with
others. For example, expansion of abbreviations.

Not really relevant. Collation is inherently language-sensitive, so of course if
you mix the abbreviation expansion for Swedish with a German collation, you'll
get trouble.

>Searching for text in a document may use less sensitive collations for ease of
use and so the requester doesn't have to make repeated queries with alternative
spellings.

You need to make the point that collation is strongly related to searching
first. Then use the term 'loose matching' instead of 'sensitive collation'.

> An application sorting large numbers of records which are similar may use a
more sensitive collation to order the records determinately.

Nobody ever does this, to my knowledge. Or if they do, they shouldn't.

> Often the performance of an application is impacted by collation.
Good point. Actually the worst case is doing a SELECT against data that doesn't
have the same sort order.

> Requests that can use the preordained collation of the index have a
significant performance advantage over requests that either cannot use indexes
or must resort results.

Probably want to stay away from the religious language (remember Bush and
"Crusade" ;-)

Mark
__________________________________
http://www.macchiato.com
► शिष्यादिच्छेत्पराजयम् ◄

----- Original Message ----- 
From: "Tex Texin" <tex@xencraft.com>
To: "Addison Phillips [wM]" <aphillips@webmethods.com>; "Web Services"
<public-i18n-ws@w3.org>
Sent: Mon, 2004 May 10 16:44
Subject: sec 4.11


> attached
>
> -- 
> -------------------------------------------------------------
> Tex Texin   cell: +1 781 789 1898   mailto:Tex@XenCraft.com
> Xen Master                          http://www.i18nGuy.com
>
> XenCraft             http://www.XenCraft.com
> Making e-Business Work Around the World
> -------------------------------------------------------------


--------------------------------------------------------------------------------


> 4.11 Ordering, Grouping, and Collation
>
> The ordering of textual data items is in general a significant concern for
internationalization of software. The problem is exacerbated when the data can
be multilingual in nature. For Web Services, in scenarios where the ordering of
textual data is critical to its correct utilization, it can be extremely
difficult to identify the appropriate collation rules to use with sufficient
precision and to insure those rules are either followed by any services that
operate on the data or that appropriate action is taken to compensate for any
services that do not use the desired collation rules. (For example, re-sorting
the data downstream).
>
> A brief list of these collation issues are described here. More details on
internationalization of collation can be found in many of the books on
internationalization. An important reference is UTR10
> http://www.unicode.org/reports/tr10/
>
> Some types of internationally sensitive processing cannot be inferred solely
from a language identifier or a locale. For example, the identifier does not
indicate which sort ordering should be used. Types of sort orderings include:
telephone, dictionary, phonetic, binary, stroke-radical or radical-stroke. In
the latter two cases, the reference for stroke count may also need to be cited.
A language identifier may be suggestive as to whether a requester expects a
Traditional or a Modern sort ordering (in Spanish for example) but it may not be
definitive.
>
> Different components or subsystems which are used by a software process may
employ different sort orderings. For example, a User Agent may provide a
drop-down list which sorts the elements of the list at run-time differently from
the other components of the UA. Information retrieved from a database may be
ordered by an index which has no correlation with the requester's requirements.
When different components or subsystems or Web Services use different collation
rules, then errors can occur. They are not always hard errors (i.e. generate
faults) but the resulting data, operations, or events, may be incorrect.
>
> In the case of services that might use a binary collation (ordering by the
code points of text data) there can be differences in ordering introduced by
different components using UTF-8 vs. UTF-16 internally.
>
> Knowing the language of the requester does not prescribe how sensitive the
collation should be. Should text elements that are different by case or accent
be treated as distinct? Should certain characters be ignored? For example,
hyphens are often ignored so that "e-mail" and "email" sort together.
>
> Where case is considered distinct, it may be important to describe whether all
lowercase characters precede all uppercase characters, vice versa, or whether
they should be intermixed.
>
> Other considerations include:
> Some languages treat two letters as a single unit for sorting. Spanish for
example does so with "CH" and "LL". In addition, some collations map
abbreviations to their expanded form. "St. Botolph St." (in Boston) would expand
to "Saint Botolph Street". (Well on a good day, given the ambiguity of "St.".)
>
> Similarly, strings with digits may be collated by their textual representation
(e.g. 1, 12, 2, A, AB, B) or they may be collated by their numeric value
> (e.g. 1, 2, 12, A, AB, B).
>
> If the textual data is coming from different fields, that may also need to be
accounted for. Consider for example, a lastname followed by a firstname field.
If sorting is performed sequentially field by field,  then significant
differences in the firstname field are swamped by minute differences in the
lastname field. This can be improved by treating the fields as merged and then
sorting.
>
> Of course, in a multilingual environment, many of the above decisions become
more complex, and an approach that works for one language may interfere with
others. For example, expansion of abbreviations.
>
> In a  multilingual environment, the relative ordering of scripts must be
decided.
>
> The choice of rules for collation may also be application dependent. Looking
up customers that access a service by voice may require a phonetic collation.
Searching customer names entered by keyboard would use a different collation.
Searching for text in a document may use less sensitive collations for ease of
use and so the requester doesn't have to make repeated queries with alternative
spellings.
>
> An application sorting large numbers of records which are similar may use a
more sensitive collation to order the records determinately.
>
> Often the performance of an application is impacted by collation. For example,
if a service returns results in an unknown ordering, the requester may have to
sort the results set to its collation. This can consume resources and delay the
further use of the results until the entire set can be collated. Alternatively,
if results are returned in the order needed by the requester, then the requester
can begin to process the first records returned without waiting for the
remaining records to arrive.
>
> Of course, collation can be performed at different stages of data processing
and timing can be an important consideration. Database indexes are updated as
the data is added to the database, not at the time a request arrives. Requests
that can use the preordained collation of the index have a significant
performance advantage over requests that either cannot use indexes or must
resort results.
>
> Section 4.2.3 provides an example where collation makes a difference to web
services.
Received on Monday, 10 May 2004 22:30:38 UTC