- From: Martin Duerst <duerst@w3.org>
- Date: Tue, 11 May 2004 14:32:51 +0900
- To: "Mark Davis" <mark.davis@jtcsv.com>, "Tex Texin" <tex@xencraft.com>, "Addison Phillips [wM]" <aphillips@webmethods.com>, "Web Services" <public-i18n-ws@w3.org>
Hello Mark, I'm affraid we won't be able to consider your comments in the imminent publication, but we will certainly consider them later. Regards, Martin. At 19:28 04/05/10 -0700, Mark Davis wrote: >Nice document. Some comments. > > > A language identifier may be suggestive as to whether a requester expects a >Traditional or a Modern sort ordering (in Spanish for example) but it may >not be >definitive. > >I don't think this is the right way to put it. Language (or locale) >identifiers >typically aren't 'suggestive' or not. Rather: > >A language or locale identifier might not specify the particular variant of >collation (e.g. traditional or modern sort ordering in Spanish, for example) >that the requestor would expect. > > > Different components or subsystems which are used by a software process may >employ different sort orderings. > >This gives the wrong impression going into the paragraph. Sounds like it's >supposed to be a good thing! Start with "Unfortunately, ..." > > > Knowing the language of the requester does not prescribe how sensitive the >collation should be. Should text elements that are different by case or accent >be treated as distinct? Should certain characters be ignored? For example, >hyphens are often ignored so that "e-mail" and "email" sort together. > >"Sensitive" is an awkward word, both here and below. And "text elements" is a >jargon term. The strength (inverse of sensitivity) usually specifies how >important a given feature is, not whether it is ignored completely or not. For >example, even when ignore-punctuation is ON, you still normally get a >reasonable >sort order: email < e-mail < emails < e-mails. But see also >http://www.unicode.org/notes/tn9/. > > >This can be improved by treating the fields as merged and then sorting. > >This is true (and an excellent point!), but there are some gotchas people >should >be aware of before trying this at home. See UCA. > > >In addition, some collations map abbreviations to their expanded form. "St. >Botolph St." (in Boston) would expand to "Saint Botolph Street". (Well on >a good >day, given the ambiguity of "St.".) > >In practice, only in very limited environments do people expand abbreviations. >Too ambiguous. > > >> Of course, in a multilingual environment, many of the above decisions > become >more complex, and an approach that works for one language may interfere with >others. For example, expansion of abbreviations. > >Not really relevant. Collation is inherently language-sensitive, so of >course if >you mix the abbreviation expansion for Swedish with a German collation, you'll >get trouble. > > >Searching for text in a document may use less sensitive collations for > ease of >use and so the requester doesn't have to make repeated queries with >alternative >spellings. > >You need to make the point that collation is strongly related to searching >first. Then use the term 'loose matching' instead of 'sensitive collation'. > > > An application sorting large numbers of records which are similar may use a >more sensitive collation to order the records determinately. > >Nobody ever does this, to my knowledge. Or if they do, they shouldn't. > > > Often the performance of an application is impacted by collation. >Good point. Actually the worst case is doing a SELECT against data that >doesn't >have the same sort order. > > > Requests that can use the preordained collation of the index have a >significant performance advantage over requests that either cannot use indexes >or must resort results. > >Probably want to stay away from the religious language (remember Bush and >"Crusade" ;-) > >Mark >__________________________________ >http://www.macchiato.com >$Bcv%3(B $B`&%+`&%=`&%-`':b!"%C`&%;`&%r`&%=`&Tb!&:b!"Vb!&.b!"!"`':b!"%'_!"!<`&%;`&Xb!"%C`&%g`'9cw'î(B >----- Original Message ----- >From: "Tex Texin" <tex@xencraft.com> >To: "Addison Phillips [wM]" <aphillips@webmethods.com>; "Web Services" ><public-i18n-ws@w3.org> >Sent: Mon, 2004 May 10 16:44 >Subject: sec 4.11 > > > > attached > > > > -- > > ------------------------------------------------------------- > > Tex Texin cell: +1 781 789 1898 mailto:Tex@XenCraft.com > > Xen Master http://www.i18nGuy.com > > > > XenCraft http://www.XenCraft.com > > Making e-Business Work Around the World > > ------------------------------------------------------------- > > >--------------------------------------------------------------------------- >----- > > > > 4.11 Ordering, Grouping, and Collation > > > > The ordering of textual data items is in general a significant concern for >internationalization of software. The problem is exacerbated when the data can >be multilingual in nature. For Web Services, in scenarios where the >ordering of >textual data is critical to its correct utilization, it can be extremely >difficult to identify the appropriate collation rules to use with sufficient >precision and to insure those rules are either followed by any services that >operate on the data or that appropriate action is taken to compensate for any >services that do not use the desired collation rules. (For example, re-sorting >the data downstream). > > > > A brief list of these collation issues are described here. More details on >internationalization of collation can be found in many of the books on >internationalization. An important reference is UTR10 > > http://www.unicode.org/reports/tr10/ > > > > Some types of internationally sensitive processing cannot be inferred > solely >from a language identifier or a locale. For example, the identifier does not >indicate which sort ordering should be used. Types of sort orderings include: >telephone, dictionary, phonetic, binary, stroke-radical or radical-stroke. In >the latter two cases, the reference for stroke count may also need to be >cited. >A language identifier may be suggestive as to whether a requester expects a >Traditional or a Modern sort ordering (in Spanish for example) but it may >not be >definitive. > > > > Different components or subsystems which are used by a software process may >employ different sort orderings. For example, a User Agent may provide a >drop-down list which sorts the elements of the list at run-time >differently from >the other components of the UA. Information retrieved from a database may be >ordered by an index which has no correlation with the requester's >requirements. >When different components or subsystems or Web Services use different >collation >rules, then errors can occur. They are not always hard errors (i.e. generate >faults) but the resulting data, operations, or events, may be incorrect. > > > > In the case of services that might use a binary collation (ordering by the >code points of text data) there can be differences in ordering introduced by >different components using UTF-8 vs. UTF-16 internally. > > > > Knowing the language of the requester does not prescribe how sensitive the >collation should be. Should text elements that are different by case or accent >be treated as distinct? Should certain characters be ignored? For example, >hyphens are often ignored so that "e-mail" and "email" sort together. > > > > Where case is considered distinct, it may be important to describe > whether all >lowercase characters precede all uppercase characters, vice versa, or whether >they should be intermixed. > > > > Other considerations include: > > Some languages treat two letters as a single unit for sorting. Spanish for >example does so with "CH" and "LL". In addition, some collations map >abbreviations to their expanded form. "St. Botolph St." (in Boston) would >expand >to "Saint Botolph Street". (Well on a good day, given the ambiguity of "St.".) > > > > Similarly, strings with digits may be collated by their textual > representation >(e.g. 1, 12, 2, A, AB, B) or they may be collated by their numeric value > > (e.g. 1, 2, 12, A, AB, B). > > > > If the textual data is coming from different fields, that may also need > to be >accounted for. Consider for example, a lastname followed by a firstname field. >If sorting is performed sequentially field by field, then significant >differences in the firstname field are swamped by minute differences in the >lastname field. This can be improved by treating the fields as merged and then >sorting. > > > > Of course, in a multilingual environment, many of the above decisions > become >more complex, and an approach that works for one language may interfere with >others. For example, expansion of abbreviations. > > > > In a multilingual environment, the relative ordering of scripts must be >decided. > > > > The choice of rules for collation may also be application dependent. > Looking >up customers that access a service by voice may require a phonetic collation. >Searching customer names entered by keyboard would use a different collation. >Searching for text in a document may use less sensitive collations for ease of >use and so the requester doesn't have to make repeated queries with >alternative >spellings. > > > > An application sorting large numbers of records which are similar may use a >more sensitive collation to order the records determinately. > > > > Often the performance of an application is impacted by collation. For > example, >if a service returns results in an unknown ordering, the requester may have to >sort the results set to its collation. This can consume resources and >delay the >further use of the results until the entire set can be collated. >Alternatively, >if results are returned in the order needed by the requester, then the >requester >can begin to process the first records returned without waiting for the >remaining records to arrive. > > > > Of course, collation can be performed at different stages of data > processing >and timing can be an important consideration. Database indexes are updated as >the data is added to the database, not at the time a request arrives. Requests >that can use the preordained collation of the index have a significant >performance advantage over requests that either cannot use indexes or must >resort results. > > > > Section 4.2.3 provides an example where collation makes a difference to web >services.
Received on Tuesday, 11 May 2004 01:38:23 UTC