Re: sec 4.11 from Martin Duerst on 2004-05-11 (public-i18n-ws@w3.org from May 2004)

From: Martin Duerst <duerst@w3.org>
Date: Tue, 11 May 2004 14:32:51 +0900
To: "Mark Davis" <mark.davis@jtcsv.com>, "Tex Texin" <tex@xencraft.com>, "Addison Phillips [wM]" <aphillips@webmethods.com>, "Web Services" <public-i18n-ws@w3.org>
Message-Id: <4.2.0.58.J.20040511143201.04405758@localhost>

Hello Mark,

I'm afraid we won't be able to consider your comments in the
imminent publication, but we will certainly consider them later.

Regards,    Martin.

At 19:28 04/05/10 -0700, Mark Davis wrote:

>Nice document. Some comments.
>
> > A language identifier may be suggestive as to whether a requester expects a
>Traditional or a Modern sort ordering (in Spanish for example) but it may not
>be definitive.
>
>I don't think this is the right way to put it. Language (or locale)
>identifiers typically aren't 'suggestive' or not. Rather:
>
>A language or locale identifier might not specify the particular variant of
>collation (e.g. traditional or modern sort ordering in Spanish) that the
>requestor would expect.
>
> > Different components or subsystems which are used by a software process may
>employ different sort orderings.
>
>This gives the wrong impression going into the paragraph. Sounds like it's
>supposed to be a good thing! Start with "Unfortunately, ..."
>
> > Knowing the language of the requester does not prescribe how sensitive the
>collation should be. Should text elements that are different by case or accent
>be treated as distinct? Should certain characters be ignored? For example,
>hyphens are often ignored so that "e-mail" and "email" sort together.
>
>"Sensitive" is an awkward word, both here and below. And "text elements" is a
>jargon term. The strength (inverse of sensitivity) usually specifies how
>important a given feature is, not whether it is ignored completely or not. For
>example, even when ignore-punctuation is ON, you still normally get a
>reasonable sort order: email < e-mail < emails < e-mails. But see also
>http://www.unicode.org/notes/tn9/.
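
The point about strength can be illustrated with a minimal Python sketch. It
is not the Unicode Collation Algorithm: the function name loose_key and its
tie-break rule (fewer hyphens first) are assumptions chosen only to reproduce
the order quoted above. The hyphen is ignored for the main comparison but
still decides between otherwise equal strings, so it is "less important"
rather than dropped.

```python
# Hypothetical tie-break rule, chosen only to reproduce the quoted order.
def loose_key(word: str) -> tuple:
    base = word.replace("-", "")  # main comparison: hyphens are ignored
    # Ties on the base text are broken by hyphen count, then by the raw text.
    return (base, word.count("-"), word)

words = ["e-mails", "emails", "e-mail", "email"]
ordered = sorted(words, key=loose_key)
print(" < ".join(ordered))  # email < e-mail < emails < e-mails
```

A real collator expresses the same idea through weight levels instead of an
ad hoc tuple.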
>
> >This can be improved by treating the fields as merged and then sorting.
>
>This is true (and an excellent point!), but there are some gotchas people
>should be aware of before trying this at home. See UCA.
>
> >In addition, some collations map abbreviations to their expanded form. "St.
>Botolph St." (in Boston) would expand to "Saint Botolph Street". (Well on a
>good day, given the ambiguity of "St.".)
>
>In practice, only in very limited environments do people expand abbreviations.
>Too ambiguous.
>
> > Of course, in a multilingual environment, many of the above decisions become
>more complex, and an approach that works for one language may interfere with
>others. For example, expansion of abbreviations.
>
>Not really relevant. Collation is inherently language-sensitive, so of course
>if you mix the abbreviation expansion for Swedish with a German collation,
>you'll get trouble.
>
> >Searching for text in a document may use less sensitive collations for ease of
>use and so the requester doesn't have to make repeated queries with alternative
>spellings.
>
>You need to make the point that collation is strongly related to searching
>first. Then use the term 'loose matching' instead of 'sensitive collation'.
>
> > An application sorting large numbers of records which are similar may use a
>more sensitive collation to order the records determinately.
>
>Nobody ever does this, to my knowledge. Or if they do, they shouldn't.
>
> > Often the performance of an application is impacted by collation.
>
>Good point. Actually the worst case is doing a SELECT against data that doesn't
>have the same sort order.
>
> > Requests that can use the preordained collation of the index have a
>significant performance advantage over requests that either cannot use indexes
>or must resort results.
>
>Probably want to stay away from the religious language (remember Bush and
>"Crusade" ;-)
>
>Mark
>__________________________________
>http://www.macchiato.com
>► शिष्यादिच्छेत्पराजयम् ◄
>----- Original Message -----
>From: "Tex Texin" <tex@xencraft.com>
>To: "Addison Phillips [wM]" <aphillips@webmethods.com>; "Web Services"
><public-i18n-ws@w3.org>
>Sent: Mon, 2004 May 10 16:44
>Subject: sec 4.11
>
>
> > attached
> >
> > --
> > -------------------------------------------------------------
> > Tex Texin   cell: +1 781 789 1898   mailto:Tex@XenCraft.com
> > Xen Master                          http://www.i18nGuy.com
> >
> > XenCraft             http://www.XenCraft.com
> > Making e-Business Work Around the World
> > -------------------------------------------------------------
>
>
>--------------------------------------------------------------------------------
>
>
> > 4.11 Ordering, Grouping, and Collation
> >
> > The ordering of textual data items is in general a significant concern for
>internationalization of software. The problem is exacerbated when the data can
>be multilingual in nature. For Web Services, in scenarios where the ordering of
>textual data is critical to its correct utilization, it can be extremely
>difficult to identify the appropriate collation rules to use with sufficient
>precision and to ensure either that those rules are followed by any services
>that operate on the data or that appropriate action is taken to compensate for
>any services that do not use the desired collation rules (for example,
>re-sorting the data downstream).
> >
> > A brief list of these collation issues is given here. More details on
>internationalization of collation can be found in many of the books on
>internationalization. An important reference is UTR10
> > http://www.unicode.org/reports/tr10/
> >
> > Some types of internationally sensitive processing cannot be inferred solely
>from a language identifier or a locale. For example, the identifier does not
>indicate which sort ordering should be used. Types of sort orderings include:
>telephone, dictionary, phonetic, binary, stroke-radical or radical-stroke. In
>the latter two cases, the reference for stroke count may also need to be cited.
>A language identifier may be suggestive as to whether a requester expects a
>Traditional or a Modern sort ordering (in Spanish for example) but it may not
>be definitive.
> >
> > Different components or subsystems which are used by a software process may
>employ different sort orderings. For example, a User Agent may provide a
>drop-down list which sorts the elements of the list at run-time differently
>from the other components of the UA. Information retrieved from a database may
>be ordered by an index which has no correlation with the requester's
>requirements. When different components or subsystems or Web Services use
>different collation rules, then errors can occur. They are not always hard
>errors (i.e. generate faults) but the resulting data, operations, or events,
>may be incorrect.
> >
> > In the case of services that might use a binary collation (ordering by the
>code points of text data) there can be differences in ordering introduced by
>different components using UTF-8 vs. UTF-16 internally.
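
The UTF-8 vs. UTF-16 difference can be shown with a minimal Python sketch. The
two sample characters and the names utf8_key and utf16_key are chosen here for
illustration. Supplementary characters are stored in UTF-16 as surrogate code
units (0xD800 to 0xDFFF), which sort below BMP characters in the range U+E000
to U+FFFF, while in UTF-8 they sort above them.

```python
bmp_char = "\uff5e"                # U+FF5E, a BMP character above the surrogates
supplementary_char = "\U00010000"  # U+10000, the first supplementary character

def utf8_key(s: str) -> bytes:
    # Binary order of the UTF-8 bytes (same as code point order).
    return s.encode("utf-8")

def utf16_key(s: str) -> bytes:
    # Big-endian bytes compare the same way as the 16-bit code units do.
    return s.encode("utf-16-be")

items = [supplementary_char, bmp_char]
by_utf8 = sorted(items, key=utf8_key)    # U+FF5E first, then U+10000
by_utf16 = sorted(items, key=utf16_key)  # U+10000 first, then U+FF5E

print([hex(ord(c)) for c in by_utf8])
print([hex(ord(c)) for c in by_utf16])
```

Two components that both claim "binary order" can therefore disagree as soon
as the data contains characters outside the BMP.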
> >
> > Knowing the language of the requester does not prescribe how sensitive the
>collation should be. Should text elements that are different by case or accent
>be treated as distinct? Should certain characters be ignored? For example,
>hyphens are often ignored so that "e-mail" and "email" sort together.
> >
> > Where case is considered distinct, it may be important to describe whether
>all lowercase characters precede all uppercase characters, vice versa, or
>whether they should be intermixed.
> >
> > Other considerations include:
> > Some languages treat two letters as a single unit for sorting. Spanish for
>example does so with "CH" and "LL". In addition, some collations map
>abbreviations to their expanded form. "St. Botolph St." (in Boston) would
>expand to "Saint Botolph Street". (Well on a good day, given the ambiguity of
>"St.".)
> >
> > Similarly, strings with digits may be collated by their textual
>representation (e.g. 1, 12, 2, A, AB, B) or they may be collated by their
>numeric value (e.g. 1, 2, 12, A, AB, B).
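
The two orderings above can be reproduced with a minimal Python sketch. The
helper name numeric_key is an assumption for illustration. It splits each
string into digit and non-digit runs and compares digit runs by value. Plain
code point order stands in for a real textual collation.

```python
import re

def numeric_key(s: str) -> list:
    # Split into runs of decimal digits and runs of everything else.
    parts = re.findall(r"\d+|\D+", s)
    # Digit runs compare by numeric value and sort before text runs.
    return [(0, int(p), "") if p.isdecimal() else (1, 0, p) for p in parts]

items = ["12", "A", "2", "B", "1", "AB"]

textual = sorted(items)                   # 1, 12, 2, A, AB, B
numeric = sorted(items, key=numeric_key)  # 1, 2, 12, A, AB, B

print(", ".join(textual))
print(", ".join(numeric))
```

Which of the two a service applies is one more thing a language identifier
does not say.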
> >
> > If the textual data is coming from different fields, that may also need to
>be accounted for. Consider, for example, a lastname field followed by a
>firstname field. If sorting is performed sequentially field by field, then
>significant differences in the firstname field are swamped by minute
>differences in the lastname field. This can be improved by treating the fields
>as merged and then sorting.
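
The effect can be sketched in Python under simplifying assumptions:
str.casefold() stands in for a primary (base letter) comparison and the raw
string for the case-level tie-break. A real implementation would use
collation weights, and the key function names here are hypothetical. In the
field-by-field key, a case difference in the lastname decides the order
before the firstname is looked at. In the merged key, the base letters of
both fields are compared first and case only breaks remaining ties.

```python
records = [("smith", "Zoe"), ("Smith", "Adam"), ("Smith", "Zoe"), ("smith", "Adam")]

def field_by_field_key(rec: tuple) -> tuple:
    last, first = rec
    # Each field is compared completely (letters, then case) before the next.
    return ((last.casefold(), last), (first.casefold(), first))

def merged_key(rec: tuple) -> tuple:
    last, first = rec
    # Letters of all fields first; case differences only as a final tie-break.
    return ((last.casefold(), first.casefold()), (last, first))

by_field = sorted(records, key=field_by_field_key)
merged = sorted(records, key=merged_key)

print(by_field)  # both "Smith" rows, then both "smith" rows
print(merged)    # both "Adam" rows, then both "Zoe" rows
```

The sketch sidesteps the gotchas mentioned earlier in the thread, such as how
field boundaries are weighted when the fields are merged.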
> >
> > Of course, in a multilingual environment, many of the above decisions become
>more complex, and an approach that works for one language may interfere with
>others. For example, expansion of abbreviations.
> >
> > In a multilingual environment, the relative ordering of scripts must be
>decided.
> >
> > The choice of rules for collation may also be application dependent. Looking
>up customers that access a service by voice may require a phonetic collation.
>Searching customer names entered by keyboard would use a different collation.
>Searching for text in a document may use less sensitive collations for ease of
>use and so the requester doesn't have to make repeated queries with alternative
>spellings.
> >
> > An application sorting large numbers of records which are similar may use a
>more sensitive collation to order the records determinately.
> >
> > Often the performance of an application is impacted by collation. For
>example, if a service returns results in an unknown ordering, the requester may
>have to sort the result set according to its own collation. This can consume
>resources and delay the further use of the results until the entire set can be
>collated. Alternatively, if results are returned in the order needed by the
>requester, then the requester can begin to process the first records returned
>without waiting for the remaining records to arrive.
> >
> > Of course, collation can be performed at different stages of data processing
>and timing can be an important consideration. Database indexes are updated as
>the data is added to the database, not at the time a request arrives. Requests
>that can use the preordained collation of the index have a significant
>performance advantage over requests that either cannot use indexes or must
>resort results.
> >
> > Section 4.2.3 provides an example where collation makes a difference to web
>services.
Received on Tuesday, 11 May 2004 01:38:23 UTC