I18N last call comments on XQuery/XPath Fun/Op (first part) from Martin Duerst on 2003-07-07 (public-qt-comments@w3.org from July 2003)

From: Martin Duerst <duerst@w3.org>
Date: Mon, 07 Jul 2003 17:06:04 -0400
To: public-qt-comments@w3.org
Cc: w3c-i18n-ig@w3.org
Message-Id: <4.2.0.58.J.20030707141941.04c804d8@localhost>
Dear XML Query WG and XSL WG,

Below please find the I18N WGs comments on your last call document
"XQuery 1.0 and XPath 2.0 Functions and Operators"
(http://www.w3.org/TR/2003/WD-xpath-functions-20030502/).

Please note the following:
- Please address all replies to these comments to the I18N IG mailing
   list (w3c-i18n-ig@w3.org), not just to me.
- All i18n-relevant comments are marked with ***. There are also general
   comments on the spec which we hope you will find useful.
- We have not yet reviewed the other documents, such as XQuery 1.0
   or XSLT 2.0, and so we might be unaware of i18n issues that appear
   in these specs but may have to be traced back to functions and operators.
   There are also cases where we have identified an i18n issue here,
   but we are not sure exactly what the best solution will be, and which
   document it will have to be addressed in. Also, there are issues that
   have been raised in comments to you about a different document but
   that apply to this document, too. Sometimes, this is mentioned below,
   but not always.
- Our comments are numbered in square brackets [nn].

We look forward to further discussion with you to find the best
solution on these issues.


[1] Status of this document: The 'last call' information should be
     close to the start, not the end, of this section.

[2] 1.1 implementation defined:
     "Possibly differing between implementations, but specified by the 
implementor for each particular implementation."
     better: "Possibly differing between implementations, but specified and 
documented by the implementor for each particular implementation."

[3] 1.2: *** re. anyAtomicType/untypedAtomic, see data model comments

[4] 1.2: ur types -> urtypes

[5] 1.2: "Diagram courtesy Asir Vedamuthu, webMethods and Jim Melton, Oracle"
     Without any disrespect to these two hard-working gentlemen, this is the
     first time we have seen such an in-place acknowledgement in a W3C spec,
     it does not seem appropriate, in particular because the image is mostly
     a copy from XML Schema. If necessary, an ack in the ack section should do.

[6] 1.2: ***yearMonthDuration and dayTimeDuration:
     glad to see that you are fixing
     a well-known XML Schema problem.

[7] 1.3.2: ***the motivation for untypedAtomic seems unclear. There may be many
     other cases where one wants to indicate that subtypes/derived types
     are not acceptable. If this is considered important, then there should
     be a general solution. See also our other comments on this type.

[8] 1.4: ***Regarding associating Z with dates/times without timezone, see
    our comments on data model.

[9] 1.4: Please provide a pointer to the normative spec for the mappings
    (that there are not enough pointers, and it's often not clear what
     is the normative specification of something, is a general issue
     with the current set of specs)

[10] 1.6: This speaks about constructors. The issues list for the data model
    says constructors are gone. This is confusing.

[11] 1.6: "For most functions there is a paragraph describing what the 
function does followed by semantic rules. These rules are meant to be 
followed in the order that they appear in this document."
     This seems to imply that some details in an earlier (in the spec)
     function definition can influence some details in a later
     (in the spec) definition. Please clarify.

[12] 1.6: Qname is defined in XML namespaces, not XML 1.0.

[13] 1.6, end: It would be good to add a small explanation about multiple
    parameters and sequences as parameters. (and that this is or is
    not working the same way as in Perl)

[14] 1.7: "Are in the XML Schema namespace": Please always give the
    namespace URI in such cases.

[15] 1.7: "datatypes described in this document": Are there any? Where are
      the descriptions?

[16] 2.3: What is the difference between fn:string() and fn:string(as item?)?
     If there is no item, then that's the same as the first variant, or not?

[17] 2.3: *** here and elsewhere: anyURI is not exactly an URI!

[18] 2.4: Why does this function work on sequences, but others don't?
      This is a general issue.

[19] 2.5: What about the case that the base URI is found in the node's
    grandparent, or even higher up?

[20] 2.5: What is 'static context'?

[21] 3. ***Error function: How can error messages be localized?

[22] 4. .implementation defined. : These dots look strange.

[23] 4. Ordering implementation defined: This should be that the ordering is
    according to execution order, which may be implementation defined.
    Anything less diminishes the value of the error mechanism considerably.

[24] 5.1: xs:TYP: shouldn't this be xs:TYPE?

[25] 5.1: ***The return type of anyURI as QName seems strange. 17.14 does not
      seem to explain why this special constructor appears in 5.1.

[26] 5.2: Rules are defined in the same way: What rules exactly?

[27] 6.2: Subtype substitution and type promotion should be described
    normatively in this spec.

[28] 6.2.6: (a/b)*b+(a mod b) -> (a idiv b) * b + (a mod b)

[29] 6.2.7: Why do we need unary plus? If there is a need for a 'noop'
    function, it may be better to have a general one.

[30] 6.4: ***Rounding functions: We are not sure that floor, ceiling, round,
     and 'round-half-to-even' cover the rounding conventions around
     the world in a balanced way. We need to look into this in more
     detail. (for some examples, please see
     http://www.xencraft.com/resources/multi-currency.html#rounding).
     Also, what is 'round-half-to-even' used for?

[31] 7.1: ***String types: It is very important to integrate 
anyType/anySimpleType
    (and if kept, untypedAtomic) and text nodes here.


[32] *** 7.1: "except for the range #xD800 to #xDFFF inclusive, which is 
the range reserved for surrogate pairs"
     "surrogate pairs" -> "surrogates"

[33] *** 7.1: What is the difference between 'code point' and 'character'?
     There are a few code points that are not characters (e.g. #xFFFE and
     #xFFFF), but that can hardly be the point of having two definitions.

[34] *** 7.2.1 and 7.2.2: Please make sure that there are tests for
     these functions that include surrogate pairs/codepoints above #xFFFF.

[35] 7.3: It would be good to have a dedicated subsection 7.3 about
     collations,..., and then section 7.4 with the actual functions.
     (or any other structure that gives collation it's own subsection)

[36] *** 7.3: "the comparisons are inherently performed according to some 
collation (even if that collation is defined entirely on code point values 
or on the binary representations of the characters of the string)"
     are "code point values" and "binary representation" alternatives or
     are they supposed to be the same? (the later would not be true in
     the case of UTF-16, which should be pointed out)

[37] *** 7.3: "Strings can be compared character-by-character or in a 
logical manner, as defined by the collation."
     This seems to imply that 'character-by-character' is not logical.
     better change 'logical' to 'linguistically adequate' or so.

[38] *** 7.3 "For alignment with the [Character Model for the World Wide 
Web 1.0], applications may choose collations that treat unnormalized 
strings as though they were normalized (that is, that implicitly normalize 
the strings)."
     This is somewhat misleading, in that early uniform normalization
     should avoid having to compare strings that differ only in normalization.
     This should be reworded carefully. Please also point out that using
     highest collation strength does not imply string normalization.

[39] *** 7.3: Using anyURIs to identify collations is the right thing to do.
      but there are several problems in how this is done in detail.

[40] *** 7.3: Having an anyURI to identify a collation by codepoint order
     (http://www.w3.org/2003/05/xpath-functions/collation/codepoint) is
     good. But there should at the minimum also be a predefined anyURI
     for identifying the Unicode collation algorithm (without any special
     tailoring). Please note that this not necessarily means that 
implementations
     have to support this algorithm (which we definitely would not object to),
     but it at least means that different implementations that all implement
     that algorithm can interoperate.

[41] 7.3: "The XQuery/XPath static context includes A provision"

[42] *** 7.3: Rules for what collation to use: The current rules, which
     allow only one collation to be specified, raise an error if the collation
     is not supported, and use anyURIs to identify collations without any
     mechanism for giving anyURIs to well-known collations, are bound to
     lead to interoperability problems. Collations should not be the major
     source of interoperability problems. With the current design, even
     vendors who want to be interoperable have no chance of doing so.
     It will often be the case that e.g. a user wants just 'a French
     collation'. How can this be indicated.
     [we will send a separate mail about this issue to several lists
      because we think that there is a potential for useful coordination]

[43] *** 7.3: Context defines a single collation. But it may often be
     desirable to have two 'default collations', in two different senses:
     a) different collations for matching (less precision) and sorting
        (best precision possible)
     b) different collations for internal operations and user-oriented 
operations

[44] *** 7.3: "There might be several ways in which a collation might be 
specified in the XQuery/XPath static context. For example, XQuery might 
provide syntax that specifies a default collation as part of the query prolog."
     good ideas!

*** 7.3: The current proposal groups three different things into a
     new concept called 'collation', which is different from what is usually
     thought of as a collation. These are:
     1) A collation in the traditional sense: every binary difference leads
       to a non-equal match.
     2) The use of different 'collations' to identify different strengths
       of the same collation (i.e. case-insensitive, accent-insensitive,...)
     3) The use of 'collations' to identify character combinations that serve
       as single units in some respect in some languages (e.g. 'll' and 'ch'
       in traditional Spanish)
     This has several problems:
     [45] - The difference between 1) and 2), and the general use of highest
            collation strength for sorting (to have deterministic sorting)
            and potentially lower for matching should be pointed out carefully.
     [46] - Including 3) may make coordination with other efforts somewhat 
difficult
     [47] - Including 3) may make implementations somewhat difficult.
            Although a given system (e.g. OS) may provide a range of
            collations (in the 1) or 2) sense), the functionality in 3)
            may not be available (e.g. via an API)
     [48] - Including 3) may make specifications somewhat difficult.
            There may be cases where it is unclear what the clusters should be.
            For example, for French sorting with accents in the reverse, are
            the clusters the base letters followed by the accents? Or is the
            reverse consideration of accents ignored for this purpose?
     [49] - This is way too important to be relegated to a note.

[50] *** 7.3: Using multiple-letter units for functions such as 'starts-with'
     (independent of its definition via a collation), while appropriate in
     some cases, may not be that well established and tested. It should be
     carefully considered whether this functionality may not be too 
'bleeding-edge',
     and may not confuse users more than help, because there are too many
     cases where it is wrong, i.e. where character combinations are taken
     as single letters even if they are not, e.g. in foreign loanwords,...
     For example, a possible solution may be to use codepoint matching when
     no collation is specified in the function rather than using the default
     collation.


[51] *** 7.3: "Some data management environments allow collations to be 
associated with the definition of string items (that is, with the metadata 
that describes items whose type is string). While such association may be 
appropriate for use in environments in which data is held in a repository 
tightly bound to its descriptive metadata, it is not appropriate in the XML 
environment in which different documents being processed by a single query 
may be described by differing schemas."
     This is a very good point, but should also mention the fact that a 
data-based
     collation is not adequate for user needs.


[52] *** 7.3: "Some data management environments allow collations to be 
associated with the definition of string items (that is, with the metadata 
that describes items whose type is string). While such association may be 
appropriate for use in environments in which data is held in a repository 
tightly bound to its descriptive metadata, it is not appropriate in the XML 
environment in which different documents being processed by a single query 
may be described by differing schemas."
     This very much applies to sorting, but it looks somewhat out of place
     for 'character clusters' such as the traditional Spanish 'll' and 'ch',
     because that seems to imply a linguistic unit, which means that it
     would be wrong to claim that 'el' is not an initial substring of 'elle'
     if the later is not indeed Spanish.

[53] *** 7.3: Is there any case where e.g. element or attribute names can
     be treated as strings? How would they be sorted?

[54] *** 7.3: "It is possible to define collations that do not have this 
property, for example a collation that attempts to sort "ISO 8859" before 
"ISO 10646", or "January" before "February". Such collations may fail, or 
give unexpected results, when used with functions such as fn:contains()."
     "this property": What property? Why 'fail'? That 'January' does not 
contain
     'ary' would be a consequence of the definition, not a failure. Maybe it
     would be better to say that for fn:contains and friends, using the
     codepoint collation in most cases produces more predictible results than
     using a specific collation.

[55] *** 7.3.1.1, last example: While in the preceeding example, 'equates'
     is the right term, this is not necessary here. The only thing that is
     necessary is that the collation treats differences between 'ss' and
     sharp-s with less strength than differences in base letters (such as
     the final n). (there is also the case where the 'ss' <-> sharp-s
     difference is at least as strong as the final 'n', and sharp-s < 'ss').

[56] *** 7.4 "The following functions are defined on these string types."
     which string types? As many as possible (up to anySimple and text node,
     hopefully).

[57] *** 7.4.1: Is there a concat operation that includes normalization
     splicing at the contact point? This would be very helpful, and ideally
     should be the default, because this may be the most efficient way to
     maintain a certain normalization. The same applies to string-join
     and potentially other operations.

[57] *** 7.4.12/13: What about other transforms, such as 
katakana<->hiragana,...

[58] *** 7.4.6: "The first character of a string is located
     at position 1, not position 0."
     It is a pity that XQuery was not able to fix this problem. It would
     be good if there were a special section at the start of 7.4, or
     somewhere else in an adequate place, that clearly explained the
     issue of 1 origin for strings. Also, rather than writing things
     such "beginning at position", it should always be "beginning at
     character position", to make it easier for users to understand
     that e.g. substrings are not identified by indicating boundary
     positions between characters.

[59] 7.4.8/9: For continuity, fn:substring-before($string, "") should
     return the empty string, and fn:substring-after($string, "")
     should return $string. Rationale: The shorter a string is, the
     earlier it generally matches. Thus, the empty string matches at
     the start of a string.

[60] *** 7.4.11: There should be a reference to Unicode Standard Annex #15
     for the various normalization forms.

[61] *** 7.4.11: In the current Character Model, W3C normalization can be
     tested, but is not defined as a function. This probably can be fixed
     by specifying that the W3C normalization function first uses NFC,
     and then prepends a space if the result is not yet in W3C 
normalization form.

[62] *** 7.4.11: In the light of the above, W3C normalization should also
     be made required to support.

[63] *** 7.4.11: What other normalization forms might be supported?
     How would they be identified? How will the space of identifiers
     be managed, to avoid conflicts?

[64] *** 7.4.12/13: The Unicode case mappings are now superseeded by
     Unicode 4, see http://www.unicode.org/unicode/reports/tr21/.

[65] *** 7.4.12/13: It should be pointed out that mappings can change
     the length of a string, and that lower(upper(s)) == s is not
     guaranteed, nor is upper(lower(s)) == s, and that these functions
     may not always be linguistically appropriate (e.g. Turkish i
     without dot) or appropriate for the application (e.g. titlecase).
     It should be said that in cases such as Turkish, a simple translation
     should be used first. It would be much better to have an uniform
     approach for all languages, rather than having to special-case a few
     languages.

[66] 7.4.15: string-pad seems to be the wrong name, because it suggests
     padding to a certain length, e.g. string-pad("XMLQuery", 15) would
     return "XMLQueryXMLQuer", i.e. 15 characters. Something like
     'multiply' seems more adequate.

[67] *** 7.4.16: It is unclear what these functions are supposed to do
     with anyURIs/IRIs.

[68] *** 7.4.16: It is unclear what the case of $escape-reserved == false
      is supposed to do. The example shows this as the identity function,
      which is probably not the point. Please give us a better example,
      so that we can understand what this is used for. It may be better
      ultimately to separate this into two functions, and remove the
      $escape-reserved argument.

[69] 7.4.17: "The % character itself is escaped only if it is not followed
     by two hexadecimal digits": This seems dangerous. If "%20" is a plain
     payload, then the '%' has to be escaped, like everything else.

[70] 7.4.16: Gopher examples are very outdated. We suggest replacement
     with something more people are familiar with.

[71] 7.5.1: 'reluctant' quantifiers: It may be better to use the Perl
     term 'minimal'.

[72] 7.5.1: "In the absence of these quantifiers, the regular expression 
matches the longest possible substring."
     no, not in the absence of these quantifiers, but if the other (maximal)
     set of quantifiers is used ('*?' is a quantifier, but in '*?', '?' is
     not a quantifer)

[73] *** 7.5.2: "Regular expression matching is defined on the basis of 
Unicode code-points; it takes no account of collations."
     It seems somewhat inadequate that fn:contains and friends use collations,
     but regular expressions don't. But we are not sure which way the right
     solutions is.

[74] 7.5.3: "An error is raised ("Invalid replacement string") if the value 
of $replacement contains a "$" character that is not immediately followed 
by a digit 1-9 and not immediately preceded by a "/"."
      "/" -> "\"

[75] *** 7.5.3: We have been told that there is a provision to replace
     character strings with markup, but this is not discussed here.
     We want to make sure this is available, because it is relevant
     to get people away from using the PUA.

[76] 7.5.3: Why does fn:replace not provide for single replacements?

[77] 8.2.2/3: Boolean less-than and greater-than seem strange. On the other
     hand, 'or' and 'and' seem to be badly missing.



This concludes the first part of these comments. We will send the
rest of our comments tomorrow (planned around 3pm EDT July 8th).


Regards,    Martin.
Received on Monday, 7 July 2003 17:06:30 UTC