- From: Ashok Malhotra <ashokma@microsoft.com>
- Date: Mon, 6 Oct 2003 15:57:21 -0700
- To: "Martin Duerst" <duerst@w3.org>, <public-qt-comments@w3.org>
- Cc: <w3c-i18n-ig@w3.org>
Thank you for your comment [54] on Characters and Collation Units. We have
clarified the wording and moved all the functions that use this special
feature of collations into a special section. We have added an optional
error message that a system can invoke if it finds that a collation is
unsuitable for substring matching.

All the best, Ashok

> -----Original Message-----
> From: public-qt-comments-request@w3.org
> [mailto:public-qt-comments-request@w3.org] On Behalf Of Martin Duerst
> Sent: Monday, July 07, 2003 2:06 PM
> To: public-qt-comments@w3.org
> Cc: w3c-i18n-ig@w3.org
> Subject: I18N last call comments on XQuery/XPath Fun/Op (first part)
>
> Dear XML Query WG and XSL WG,
>
> Below please find the I18N WG's comments on your last call document
> "XQuery 1.0 and XPath 2.0 Functions and Operators"
> (http://www.w3.org/TR/2003/WD-xpath-functions-20030502/).
>
> Please note the following:
> - Please address all replies to these comments to the I18N IG mailing
>   list (w3c-i18n-ig@w3.org), not just to me.
> - All i18n-relevant comments are marked with ***. There are also general
>   comments on the spec which we hope you will find useful.
> - We have not yet reviewed the other documents, such as XQuery 1.0
>   or XSLT 2.0, and so we might be unaware of i18n issues that appear
>   in these specs but may have to be traced back to functions and
>   operators.
>   There are also cases where we have identified an i18n issue here,
>   but we are not sure exactly what the best solution will be, and which
>   document it will have to be addressed in. Also, there are issues that
>   have been raised in comments to you about a different document but
>   that apply to this document, too. Sometimes, this is mentioned below,
>   but not always.
> - Our comments are numbered in square brackets [nn].
>
> We look forward to further discussion with you to find the best
> solution on these issues.
>
> [1] Status of this document: The 'last call' information should be
> close to the start, not the end, of this section.
>
> [2] 1.1 implementation defined:
> "Possibly differing between implementations, but specified by the
> implementor for each particular implementation."
> better: "Possibly differing between implementations, but specified and
> documented by the implementor for each particular implementation."
>
> [3] 1.2: *** re. anyAtomicType/untypedAtomic, see data model comments
>
> [4] 1.2: ur types -> urtypes
>
> [5] 1.2: "Diagram courtesy Asir Vedamuthu, webMethods and Jim Melton,
> Oracle"
> Without any disrespect to these two hard-working gentlemen, this is the
> first time we have seen such an in-place acknowledgement in a W3C spec,
> it does not seem appropriate, in particular because the image is mostly
> a copy from XML Schema. If necessary, an ack in the ack section
> should do.
>
> [6] 1.2: ***yearMonthDuration and dayTimeDuration:
> glad to see that you are fixing a well-known XML Schema problem.
>
> [7] 1.3.2: ***the motivation for untypedAtomic seems unclear. There may
> be many other cases where one wants to indicate that subtypes/derived
> types are not acceptable. If this is considered important, then there
> should be a general solution. See also our other comments on this type.
>
> [8] 1.4: ***Regarding associating Z with dates/times without timezone,
> see our comments on data model.
>
> [9] 1.4: Please provide a pointer to the normative spec for the mappings
> (that there are not enough pointers, and it's often not clear what
> is the normative specification of something, is a general issue
> with the current set of specs)
>
> [10] 1.6: This speaks about constructors. The issues list for the data
> model says constructors are gone. This is confusing.
>
> [11] 1.6: "For most functions there is a paragraph describing what the
> function does followed by semantic rules.
> These rules are meant to be followed in the order that they appear in
> this document."
> This seems to imply that some details in an earlier (in the spec)
> function definition can influence some details in a later
> (in the spec) definition. Please clarify.
>
> [12] 1.6: QName is defined in XML namespaces, not XML 1.0.
>
> [13] 1.6, end: It would be good to add a small explanation about multiple
> parameters and sequences as parameters. (and that this is or is
> not working the same way as in Perl)
>
> [14] 1.7: "Are in the XML Schema namespace": Please always give the
> namespace URI in such cases.
>
> [15] 1.7: "datatypes described in this document": Are there any? Where
> are the descriptions?
>
> [16] 2.3: What is the difference between fn:string() and fn:string(as
> item?)?
> If there is no item, then that's the same as the first variant, or not?
>
> [17] 2.3: *** here and elsewhere: anyURI is not exactly a URI!
>
> [18] 2.4: Why does this function work on sequences, but others don't?
> This is a general issue.
>
> [19] 2.5: What about the case that the base URI is found in the node's
> grandparent, or even higher up?
>
> [20] 2.5: What is 'static context'?
>
> [21] 3. ***Error function: How can error messages be localized?
>
> [22] 4. .implementation defined. : These dots look strange.
>
> [23] 4. Ordering implementation defined: This should be that the ordering
> is according to execution order, which may be implementation defined.
> Anything less diminishes the value of the error mechanism considerably.
>
> [24] 5.1: xs:TYP: shouldn't this be xs:TYPE?
>
> [25] 5.1: ***The return type of anyURI as QName seems strange. 17.14 does
> not seem to explain why this special constructor appears in 5.1.
>
> [26] 5.2: Rules are defined in the same way: What rules exactly?
>
> [27] 6.2: Subtype substitution and type promotion should be described
> normatively in this spec.
>
> [28] 6.2.6: (a/b)*b+(a mod b) -> (a idiv b) * b + (a mod b)
>
> [29] 6.2.7: Why do we need unary plus? If there is a need for a 'noop'
> function, it may be better to have a general one.
>
> [30] 6.4: ***Rounding functions: We are not sure that floor, ceiling,
> round, and 'round-half-to-even' cover the rounding conventions around
> the world in a balanced way. We need to look into this in more
> detail. (for some examples, please see
> http://www.xencraft.com/resources/multi-currency.html#rounding).
> Also, what is 'round-half-to-even' used for?
>
> [31] 7.1: ***String types: It is very important to integrate
> anyType/anySimpleType (and if kept, untypedAtomic) and text nodes here.
>
> [32] *** 7.1: "except for the range #xD800 to #xDFFF inclusive, which is
> the range reserved for surrogate pairs"
> "surrogate pairs" -> "surrogates"
>
> [33] *** 7.1: What is the difference between 'code point' and 'character'?
> There are a few code points that are not characters (e.g. #xFFFE and
> #xFFFF), but that can hardly be the point of having two definitions.
>
> [34] *** 7.2.1 and 7.2.2: Please make sure that there are tests for
> these functions that include surrogate pairs/codepoints above #xFFFF.
>
> [35] 7.3: It would be good to have a dedicated subsection 7.3 about
> collations,..., and then section 7.4 with the actual functions.
> (or any other structure that gives collation its own subsection)
>
> [36] *** 7.3: "the comparisons are inherently performed according to some
> collation (even if that collation is defined entirely on code point
> values or on the binary representations of the characters of the string)"
> are "code point values" and "binary representation" alternatives or
> are they supposed to be the same? (the latter would not be true in
> the case of UTF-16, which should be pointed out)
>
> [37] *** 7.3: "Strings can be compared character-by-character or in a
> logical manner, as defined by the collation."
> This seems to imply that 'character-by-character' is not logical.
> better change 'logical' to 'linguistically adequate' or so.
>
> [38] *** 7.3 "For alignment with the [Character Model for the World Wide
> Web 1.0], applications may choose collations that treat unnormalized
> strings as though they were normalized (that is, that implicitly
> normalize the strings)."
> This is somewhat misleading, in that early uniform normalization
> should avoid having to compare strings that differ only in
> normalization.
> This should be reworded carefully. Please also point out that using
> highest collation strength does not imply string normalization.
>
> [39] *** 7.3: Using anyURIs to identify collations is the right thing to
> do, but there are several problems in how this is done in detail.
>
> [40] *** 7.3: Having an anyURI to identify a collation by codepoint order
> (http://www.w3.org/2003/05/xpath-functions/collation/codepoint) is
> good. But there should at the minimum also be a predefined anyURI
> for identifying the Unicode collation algorithm (without any special
> tailoring). Please note that this does not necessarily mean that
> implementations have to support this algorithm (which we definitely
> would not object to), but it at least means that different
> implementations that all implement that algorithm can interoperate.
>
> [41] 7.3: "The XQuery/XPath static context includes A provision"
>
> [42] *** 7.3: Rules for what collation to use: The current rules, which
> allow only one collation to be specified, raise an error if the collation
> is not supported, and use anyURIs to identify collations without any
> mechanism for giving anyURIs to well-known collations, are bound to
> lead to interoperability problems. Collations should not be the major
> source of interoperability problems. With the current design, even
> vendors who want to be interoperable have no chance of doing so.
> It will often be the case that e.g.
> a user wants just 'a French collation'. How can this be indicated?
> [we will send a separate mail about this issue to several lists
> because we think that there is a potential for useful coordination]
>
> [43] *** 7.3: Context defines a single collation. But it may often be
> desirable to have two 'default collations', in two different senses:
> a) different collations for matching (less precision) and sorting
>    (best precision possible)
> b) different collations for internal operations and user-oriented
>    operations
>
> [44] *** 7.3: "There might be several ways in which a collation might be
> specified in the XQuery/XPath static context. For example, XQuery might
> provide syntax that specifies a default collation as part of the query
> prolog."
> good ideas!
>
> *** 7.3: The current proposal groups three different things into a
> new concept called 'collation', which is different from what is usually
> thought of as a collation. These are:
> 1) A collation in the traditional sense: every binary difference leads
>    to a non-equal match.
> 2) The use of different 'collations' to identify different strengths
>    of the same collation (i.e. case-insensitive, accent-insensitive,...)
> 3) The use of 'collations' to identify character combinations that serve
>    as single units in some respect in some languages (e.g. 'll' and 'ch'
>    in traditional Spanish)
> This has several problems:
> [45] - The difference between 1) and 2), and the general use of highest
>        collation strength for sorting (to have deterministic sorting)
>        and potentially lower for matching should be pointed out
>        carefully.
> [46] - Including 3) may make coordination with other efforts somewhat
>        difficult
> [47] - Including 3) may make implementations somewhat difficult.
>        Although a given system (e.g. OS) may provide a range of
>        collations (in the 1) or 2) sense), the functionality in 3)
>        may not be available (e.g.
>        via an API)
> [48] - Including 3) may make specifications somewhat difficult.
>        There may be cases where it is unclear what the clusters
>        should be. For example, for French sorting with accents in the
>        reverse, are the clusters the base letters followed by the
>        accents? Or is the reverse consideration of accents ignored for
>        this purpose?
> [49] - This is way too important to be relegated to a note.
>
> [50] *** 7.3: Using multiple-letter units for functions such as
> 'starts-with' (independent of its definition via a collation), while
> appropriate in some cases, may not be that well established and tested.
> It should be carefully considered whether this functionality may not be
> too 'bleeding-edge', and may not confuse users more than help, because
> there are too many cases where it is wrong, i.e. where character
> combinations are taken as single letters even if they are not, e.g. in
> foreign loanwords,...
> For example, a possible solution may be to use codepoint matching when
> no collation is specified in the function rather than using the default
> collation.
>
> [51] *** 7.3: "Some data management environments allow collations to be
> associated with the definition of string items (that is, with the
> metadata that describes items whose type is string). While such
> association may be appropriate for use in environments in which data is
> held in a repository tightly bound to its descriptive metadata, it is not
> appropriate in the XML environment in which different documents being
> processed by a single query may be described by differing schemas."
> This is a very good point, but should also mention the fact that a
> data-based collation is not adequate for user needs.
>
> [52] *** 7.3: "Some data management environments allow collations to be
> associated with the definition of string items (that is, with the
> metadata that describes items whose type is string).
> While such association may be appropriate for use in environments in
> which data is held in a repository tightly bound to its descriptive
> metadata, it is not appropriate in the XML environment in which different
> documents being processed by a single query may be described by differing
> schemas."
> This very much applies to sorting, but it looks somewhat out of place
> for 'character clusters' such as the traditional Spanish 'll' and 'ch',
> because that seems to imply a linguistic unit, which means that it
> would be wrong to claim that 'el' is not an initial substring of 'elle'
> if the latter is not indeed Spanish.
>
> [53] *** 7.3: Is there any case where e.g. element or attribute names can
> be treated as strings? How would they be sorted?
>
> [54] *** 7.3: "It is possible to define collations that do not have this
> property, for example a collation that attempts to sort "ISO 8859" before
> "ISO 10646", or "January" before "February". Such collations may fail, or
> give unexpected results, when used with functions such as fn:contains()."
> "this property": What property? Why 'fail'? That 'January' does not
> contain 'ary' would be a consequence of the definition, not a failure.
> Maybe it would be better to say that for fn:contains and friends, using
> the codepoint collation in most cases produces more predictable results
> than using a specific collation.
>
> [55] *** 7.3.1.1, last example: While in the preceding example, 'equates'
> is the right term, this is not necessary here. The only thing that is
> necessary is that the collation treats differences between 'ss' and
> sharp-s with less strength than differences in base letters (such as
> the final n). (there is also the case where the 'ss' <-> sharp-s
> difference is at least as strong as the final 'n', and sharp-s < 'ss').
>
> [56] *** 7.4 "The following functions are defined on these string types."
> which string types?
> As many as possible (up to anySimple and text node, hopefully).
>
> [57] *** 7.4.1: Is there a concat operation that includes normalization
> splicing at the contact point? This would be very helpful, and ideally
> should be the default, because this may be the most efficient way to
> maintain a certain normalization. The same applies to string-join
> and potentially other operations.
>
> [57] *** 7.4.12/13: What about other transforms, such as
> katakana<->hiragana,...
>
> [58] *** 7.4.6: "The first character of a string is located
> at position 1, not position 0."
> It is a pity that XQuery was not able to fix this problem. It would
> be good if there were a special section at the start of 7.4, or
> somewhere else in an adequate place, that clearly explained the
> issue of 1 origin for strings. Also, rather than writing things
> such as "beginning at position", it should always be "beginning at
> character position", to make it easier for users to understand
> that e.g. substrings are not identified by indicating boundary
> positions between characters.
>
> [59] 7.4.8/9: For continuity, fn:substring-before($string, "") should
> return the empty string, and fn:substring-after($string, "")
> should return $string. Rationale: The shorter a string is, the
> earlier it generally matches. Thus, the empty string matches at
> the start of a string.
>
> [60] *** 7.4.11: There should be a reference to Unicode Standard Annex #15
> for the various normalization forms.
>
> [61] *** 7.4.11: In the current Character Model, W3C normalization can be
> tested, but is not defined as a function. This probably can be fixed
> by specifying that the W3C normalization function first uses NFC,
> and then prepends a space if the result is not yet in W3C
> normalization form.
>
> [62] *** 7.4.11: In the light of the above, W3C normalization should also
> be made required to support.
>
> [63] *** 7.4.11: What other normalization forms might be supported?
> How would they be identified?
> How will the space of identifiers be managed, to avoid conflicts?
>
> [64] *** 7.4.12/13: The Unicode case mappings are now superseded by
> Unicode 4, see http://www.unicode.org/unicode/reports/tr21/.
>
> [65] *** 7.4.12/13: It should be pointed out that mappings can change
> the length of a string, and that lower(upper(s)) == s is not
> guaranteed, nor is upper(lower(s)) == s, and that these functions
> may not always be linguistically appropriate (e.g. Turkish i
> without dot) or appropriate for the application (e.g. titlecase).
> It should be said that in cases such as Turkish, a simple translation
> should be used first. It would be much better to have a uniform
> approach for all languages, rather than having to special-case a few
> languages.
>
> [66] 7.4.15: string-pad seems to be the wrong name, because it suggests
> padding to a certain length, e.g. string-pad("XMLQuery", 15) would
> return "XMLQueryXMLQuer", i.e. 15 characters. Something like
> 'multiply' seems more adequate.
>
> [67] *** 7.4.16: It is unclear what these functions are supposed to do
> with anyURIs/IRIs.
>
> [68] *** 7.4.16: It is unclear what the case of $escape-reserved == false
> is supposed to do. The example shows this as the identity function,
> which is probably not the point. Please give us a better example,
> so that we can understand what this is used for. It may be better
> ultimately to separate this into two functions, and remove the
> $escape-reserved argument.
>
> [69] 7.4.17: "The % character itself is escaped only if it is not followed
> by two hexadecimal digits": This seems dangerous. If "%20" is a plain
> payload, then the '%' has to be escaped, like everything else.
>
> [70] 7.4.16: Gopher examples are very outdated. We suggest replacement
> with something more people are familiar with.
>
> [71] 7.5.1: 'reluctant' quantifiers: It may be better to use the Perl
> term 'minimal'.
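
The distinction between reluctant (minimal) and greedy (maximal) quantifiers raised in comment [71] can be sketched as follows. This is only an illustration: it assumes Python's `re` module as a stand-in for the regular expression dialect of the Functions and Operators draft, and the helper `first_match` and the sample string are hypothetical names introduced here.

```python
import re


def first_match(pattern: str, text: str) -> str:
    """Return the first substring of text matched by pattern, or ''."""
    match = re.search(pattern, text)
    return match.group(0) if match else ""


sample = "<a><b>"

# Greedy (maximal) quantifier: '*' consumes as much as it can.
greedy_result = first_match(r"<.*>", sample)

# Reluctant (minimal) quantifier: '*?' consumes as little as it can.
reluctant_result = first_match(r"<.*?>", sample)

print(greedy_result)     # <a><b>
print(reluctant_result)  # <a>
```

Both patterns contain a quantifier; only the choice between `*` and `*?` decides whether the longest or the shortest candidate is returned.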
>
> [72] 7.5.1: "In the absence of these quantifiers, the regular expression
> matches the longest possible substring."
> no, not in the absence of these quantifiers, but if the other (maximal)
> set of quantifiers is used ('*?' is a quantifier, but in '*?', '?' is
> not a quantifier)
>
> [73] *** 7.5.2: "Regular expression matching is defined on the basis of
> Unicode code-points; it takes no account of collations."
> It seems somewhat inadequate that fn:contains and friends use collations,
> but regular expressions don't. But we are not sure which way the right
> solution is.
>
> [74] 7.5.3: "An error is raised ("Invalid replacement string") if the
> value of $replacement contains a "$" character that is not immediately
> followed by a digit 1-9 and not immediately preceded by a "/"."
> "/" -> "\"
>
> [75] *** 7.5.3: We have been told that there is a provision to replace
> character strings with markup, but this is not discussed here.
> We want to make sure this is available, because it is relevant
> to get people away from using the PUA.
>
> [76] 7.5.3: Why does fn:replace not provide for single replacements?
>
> [77] 8.2.2/3: Boolean less-than and greater-than seem strange. On the
> other hand, 'or' and 'and' seem to be badly missing.
>
>
> This concludes the first part of these comments. We will send the
> rest of our comments tomorrow (planned around 3pm EDT July 8th).
>
>
> Regards, Martin.
>
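
Comment [65] in the quoted message observes that case mappings can change the length of a string and need not round-trip. A minimal sketch of that behaviour, assuming Python's built-in, locale-independent `str.upper()` and `str.lower()` as a stand-in for the upper-case and lower-case functions under review; the helper `survives_round_trip` is a hypothetical name introduced for illustration.

```python
def survives_round_trip(s: str) -> bool:
    """True if upper-casing and then lower-casing gives back s."""
    return s.upper().lower() == s


german = "stra\u00dfe"         # "straße", 6 characters
german_upper = german.upper()  # sharp s has no single uppercase letter here

dotless_i = "\u0131"           # Turkish lowercase dotless i

print(len(german), len(german_upper))  # 6 7
print(survives_round_trip(german))     # False
print(survives_round_trip(dotless_i))  # False
print("I".lower())                     # i
```

The last line shows the Turkish problem from the comment: a language-neutral mapping lowercases "I" to a dotted "i", so the dotless i cannot be recovered without language-specific handling.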
Received on Monday, 6 October 2003 18:57:56 UTC