- From: Igor Hersht <igorh@ca.ibm.com>
- Date: Wed, 9 Jun 2004 16:33:01 -0400
- To: "Michael Kay" <mhk@mhk.me.uk>
- Cc: ashokmalhotra@alum.mit.edu, public-qt-comments@w3.org, Stephen.Buxton@oracle.com
> Our collations aren't restricted to conform to the >Unicode Collation Algorithm. This is a much more >abstract definition of what a collation is. I agree that a collation definition should not be restricted to the Unicode collation (there is no such restriction in my proposal). In my opinion the definition should also: 1. Include the Unicode collation. 2. Be implementable within a reasonable time limits. 3. Have reasonable performance. >I don't think there's any question that any conceivable ordering >of strings can be achieved by mapping the strings to sequences of integers >(indeed, single integers would do perfectly well) Of course, The only problem here is performance. One would need to recalculate all sequence elements on any string added. We wouldn?t need such recalculations if we used sorting keys (which is a byte array). > The question that arises is whether this also supports >contains. I would argue that it does, by construction. That is where I disagree. The only mapping from a string to a sequence of integers (which is defined in the Unicode Collation Algorithm and implemented in the Java JDK and the ICU) is mapping from this sting to the collation elements. Implementing our own mapping would mean rewriting the Unicode collation functionality and cannot be implemented within a reasonable time limits. Using collation elements (not sorting key) for comprising and string matching has fundamental problems I mentioned in my previous note => cannot be implemented within a reasonable time limits. Just anoter example from the Unicode specs which theoretically cannot be implemented (for contains or any other collation function) using just collation elements. ?Characters may also have different case mappings, depending on the context. For example, U+03A3 capital sigma lowercases to U+03C3 small sigma if it is followed by another letter, but lowercases to U+03C2 small final sigma if it is not.? I assume here Greek collation with case-order ignored. >These are presumably not the existing functions >called fn:compare and fn:match? Are these standard functions that we >define,or are they functions peculiar to each collation? I think it could be good idea to have such functions exposed publicly. But it is not that important. This is just 2 functions with 3 parameters. > I can't infer from your definition any constraints on what a collation >produces, e.g. any relationship between equals and contains. contains is always true if equals is true. Igor Hersht XSLT Development IBM Canada Ltd., 8200 Warden Avenue, Markham, Ontario L6G 1C7 Office D2-260, Phone (905)413-3240 ; FAX (905)413-4839 "Michael Kay" <mhk@mhk.me.uk> To 06/08/2004 07:57 Igor Hersht/Toronto/IBM@IBMCA PM cc <public-qt-comments@w3.org>, <ashokmalhotra@alum.mit.edu>, <Stephen.Buxton@oracle.com> Subject RE: PLease define 'collation' > > >A collation is a mapping from strings to sequences of integers, > >referred to as collation units. This mapping can be described as a > >function C(xs:string)->xs:integer*. > > >Two strings are considered equal if they map to the same > >sequence of collation units. > > Problems. > > The definition does not correspond the Unicode Collation > Algorithm (see > http://www.unicode.org/unicode/reports/tr10/#Main_Algorithm). > The algorithm consist of 3 steps > Step 1. Produce a normalized form of each input string. > Step 2. The collation element array is built > Step 3. The sort key is formed. > String comprising and matching based on the calculated values of > the sort key. Our collations aren't restricted to conform to the Unicode Collation Algorithm. This is a much more abstract definition of what a collation is. > > As far as I understand the term ?collation unit? should corresponds > the Unicode term ?collation element?. The step 3 is missing in the > collation > definition . Doing comprising based only on the collation > element is incorrect. The main problem is that collation elements > represent just part of the string. The sort key represents > whole string. I think this is partly because UCA is a parameterisable algorithm, whereas our concept of a collation has all parameters already bound. I don't think there's any question that any conceivable ordering of strings can be achieved by mapping the strings to sequences of integers (indeed, single integers would do perfectly well) and then sorting the numeric representations. The question that arises is whether this also supports contains. I would argue that it does, by construction. > > Some of the problems (e.g. Contextual Sensitivity see 1.3 Contextual > Sensitivity) > would require very serious rework of the existing collation > implementations > (the rework could be more time consuming than the rest of > XSLT2.0 and XPath > functionality). Just an example from this chapter (which > works fine with > the > ICU Unicode Collation Algorithm based implementation). > ?In French and a few other languages, however, it is the last accent > difference > that determines the order? > Normal Accent Ordering {'c','o','t', 0x00EA} < {'c',0x00F4,'t', 'e'} > French Accent Ordering {'c','o','t', 0x00EA} > {'c',0x00F4,'t', 'e'} I think you are looking at this too much from an implementation perspective. I was trying to find an abstract definition of what a collation is. As I said, for equality and ordering it would be quite enough to say that a collation is a mapping from strings to integers (the position in the final sequence). I was hoping that by saying it is a mapping to a sequence of integers then we can also imply some properties of functions like contains(), for example that contains(A,B) is true if A=B is true, and that startswith(A, B) implies A <= B. The question emerged recently of whether a collation is guaranteed to implement equality in a way that is transitive. Under my definition, it is. We haven't previously had a definition that imposed this constraint. > > There is another simple solution which is to create sort key from the > collation > elements and map the key to an integer sequence. I don?t > think that the > collation > definition make sense here. > > > Solution > > 2. Common collation definitions > A collation is a parameter for 2 functions fn:compare and fn:match. I'm not following you. These are presumably not the existing functions called fn:compare and fn:match? Are these standard functions that we define, or are they functions peculiar to each collation? I can't infer from your definition any constraints on what a collation produces, e.g. any relationship between equals and contains. Michael Kay > > fn:compare( $arg1 as xs:string?, $arg2 as xs:string?, > $collation as xs:string) as xs:integer > Two strings (arg1 and arg2) are considered equal if > fn:compare returns 0. > A string arg1 is considered greater than a string arg2 if fn:compare() > return is more than 0. > > fn:match( $arg1 as xs:string?, $arg2 as xs:string?, $collation as > xs:string) as xs:integer* > fn:match returns 2 integer sequence (S(start,end)) or an > empty ESsequence. > > Definitions of fn:contains, fn:starts-with, fn:ends-with, > fn:substring-before and fn:substring-after are quiet obvious in terms > of fn:match. > > fn:contains return true if fn:match returns not empty sequence. > fn:starts-with return true if fn:match returns not empty sequence > and start = 0. > fn:ends-with return true if fn:match returns not empty sequence > and end = arg1.length() -1. > fn:substring-before an empty string if fn:match returns an > empty sequence. > Otherwise fn:substring-before returns arg1.substring(0,start). > fn:substring-afret an empty string if fn:match returns an > empty sequence. > Otherwise fn:substring-before returns > arg1.substring(end, arg1.length()). > > 2.1 Specific collations > Specific collations should be defined as definitions for > fn:compare and fn:match functions. > > > 2.1.1 Unicode collation > fn:compare and fn:match for the Unicode collation. > defined in http://www.unicode.org/unicode/reports/tr10. > The Unicode collation define different kinds of matching. > I think we can use an ambiguous -minimal one the(see 8 Searching and > Matching). > It could be good idea to define the maximal one (seems to > corresponds a > common > sense). The match is maximal if for all positive i and j, > there is no match > at Q[s-i,e+j]. > > 2.1.2 The Unicode codepoint collation > fn:compare and fn:match is obvious here and corresponds and > corresponds to > the > Michael Kay's one. > > 2.1.3 Other collations. > Other collations are implementation defined. > > > > Igor Hersht > XSLT Development > IBM Canada Ltd., 8200 Warden Avenue, Markham, Ontario L6G 1C7 > Office D2-260, Phone (905)413-3240 ; FAX (905)413-4839 >
Received on Wednesday, 9 June 2004 16:33:43 UTC