I18N last call comments on XQuery/XPath Fun/Op (2nd part)

Dear XML Query WG and XSL WG,

Below please find the second (and final) part of the I18N WG's comments on
your last call document "XQuery 1.0 and XPath 2.0 Functions and Operators"
(http://www.w3.org/TR/2003/WD-xpath-functions-20030502/).

Please note the following:
- Please address all replies to these comments to the I18N IG mailing
   list (w3c-i18n-ig@w3.org), not just to me.
- All i18n-relevant comments are marked with ***. There are also general
   comments on the spec which we hope you will find useful.
- We have not yet reviewed the other documents, such as XQuery 1.0
   or XSLT 2.0, and so we might be unaware of i18n issues that appear
   in these specs but may have to be traced back to functions and operators.
   There are also cases where we have identified an i18n issue here,
   but we are not sure exactly what the best solution will be, and which
   document it will have to be addressed in. Also, there are issues that
   have been raised in comments to you about a different document but
   that apply to this document, too. Sometimes, this is mentioned below,
   but not always.
- Our comments are numbered in square brackets [nn]; the numbering
   continues from the first part.
- Please note that this mail contains a few additional comments on
   sections already commented on in our first part.
- We again apologize for our delay.

We look forward to further discussion with you to find the best
solutions to these issues.


[78] 1.7 namespace prefix: op:xxx backs up operators and is not directly
user accessible:
     shouldn't it be the choice of the language using Functions and
     Operators whether or not to expose these operators?
     (XQuery and XSLT have decided in the negative, but
      there might be other languages.)

[79] 2.3 "cast as xs:string": there should be a forward reference to this
     notation.

[80] 7.1 (this may partly supersede our issue [33]):
    "This document uses the term "code point" as a synonym for "Unicode
    scalar value". [The Unicode Standard] sometimes spells this term
    "codepoint". Code points range from #x0000 to #x10FFFF inclusive,
    except for the range #xD800 to #xDFFF inclusive, which is the range
    reserved for surrogate pairs. The use of the word 'character' in this
    document is in the sense of production [2] of [XML 1.0 Recommendation
    (Second Edition)]."

     The relationship between code point and scalar value was fuzzy in the
     past. Unicode 4.0 makes it clear that code points range from #x0000 to
     #x10FFFF inclusive, and that scalar values are the subset #x0000 to
     #xD7FF plus #xE000 to #x10FFFF inclusive. XML can't represent all code
     points anyway (example: #x0000), so it is probably best to just use
     'code point'. A minor wording issue is that #xD800 to #xDFFF is the
     range reserved for surrogate code points, which are used in surrogate
     pairs. So, the suggested wording is:

     "This document uses the term "code point" as defined in [The Unicode
     Standard], ranging from #x0000 to #x10FFFF inclusive. The use of the
     word 'character' in this document is in the sense of production [2] of
     [XML 1.0 Recommendation (Second Edition)], so it may include code
     points which have not yet been assigned to characters."

     The spec should also be checked so that it no longer uses the word
     'code point' in places where surrogate code points are meant to be
     excluded.

[81] 7.4.11 normalize-unicode: As of
     http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-FullyNormalized,
     what is called 'W3C normalized' here has been renamed to
     'fully normalized' in the character model.

[82] 7.4.11 normalize-unicode: 'full normalization' needs a definition of
     the relevant constructs. For strings, the string itself is most
     conveniently the relevant construct, but this should be said
     explicitly.

[83] 7.4.11 normalize-unicode: Maybe not as a function, but in any case,
     normalization checking on input and normalization on output should
     somehow be available in both XQuery and XSLT, on full XML constructs
     (with the relevant definitions from XML 1.1).

[84] 7.4.13/14: maybe there should be an argument for Turkish case mapping
     (dotted/dotless i)

[85] 7.4.13/14: an example should use German sharp-s
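     A sketch of the kind of example we have in mind (results assume the
     Unicode full case mappings):

        fn:upper-case("Straße")    (: would return "STRASSE" :)
        fn:lower-case("STRASSE")   (: returns "strasse", i.e. the mapping
                                      does not round-trip :)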

[86] 7.4.14: two paragraphs are the same (1st and 5th)

[87] 7.4.16: there should be a flag for whether or not to escape non-ASCII
     characters (the default should be not to escape them)

[88] 7.4.12 "Otherwise, returns the value of $srcval after translating 
every lower-case letter to its upper-case correspondent. Every lower-case 
letter that does not have an upper-case correspondent, and every character 
that is not a lower-case letter, is included in the returned value in its 
original form. A "lower-case letter" is a character whose Unicode General 
Category class includes "Ll". The corresponding upper-case letter is 
determined using [Unicode Case Mappings]."
     There is a problem here: the set of characters that have upper-case
     correspondents and the set of characters in category Ll are not
     identical. This should instead read:

     "Otherwise, returns the value of $srcval after translating every
     character to its upper-case correspondent. Every character that does
     not have an upper-case correspondent is included in the returned value
     in its original form. The precise mapping is determined using [Unicode
     Case Mappings]."

     and mutatis mutandis for 7.4.13

[89] 7.4 There should also be an fn:title-case function, because titlecasing
     (also called initial caps) is *not* the same as uppercasing the first
     letter in a word.
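     As an illustration (fn:title-case is a hypothetical name; code points
     are given for clarity):

        (: U+01C9 'ǉ' has the titlecase form U+01C8 'ǈ', but the
           uppercase form U+01C7 'Ǉ' :)
        fn:title-case("ǉubljana")                      (: would return "ǈubljana" :)
        fn:upper-case(fn:substring("ǉubljana", 1, 1))  (: returns "Ǉ", which is
                                                          not the titlecase form :)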

[90] 9.1 *** There should be a note about the inadvisability of using the
     types with a 'g' prefix.

[91] 9.2: *** "Note: The W3C XML Query Working Group has requested the W3C 
XML Schema Working Group that these two subtypes of xs:duration be included 
in the built-in datatypes described in [XML Schema Part 2: Datatypes]."
     We support this request. This is very much needed!

[92] 9.2.1: <xs:pattern value="[\-]?P[0-9]+(Y([0-9]+M)?|M)"/>
     Why not simply <xs:pattern value="-?P[0-9]+(Y([0-9]+M)?|M)"/>?
     This is legal in Perl; is it not legal in XQuery/XPath?

[93] 9.2.2: Is white space allowed in regular expressions?

[94] 9.2.2: "The designator 'T' ?must? be absent if and only if all of the 
time items are absent."
     This seems to conflict with examples -P35.89S and P4D251M,
     which are said to be allowed.

[95] 9.2.2.3: *** Durations cannot allow leap seconds in their
     canonical representation.

[96] 9.3: For many types, there is op:foo-equal, op:foo-less-than, and
    op:foo-greater-than. As these are only defined for backing up
    operators, it would be much better to define only a single
    comparison function (similar to string), or at least only
    op:foo-equal and op:foo-less-than. Backup then works easily as follows:
    a eq b <==> foo-equal(a,b)
    a ne b <==> !foo-equal(a,b)
    a lt b <==> foo-less-than(a,b)
    a ge b <==> !foo-less-than(a,b)
    a gt b <==> foo-less-than(b,a)
    a le b <==> !foo-less-than(b,a)

[97] 9.3: *** "If either operand to a comparison function on date or time 
values does not have an explicit timezone then, for the purpose of the 
operation, an implicit timezone, provided by the evaluation context, is 
assumed to be present as part of the value."
     This is used here and in many other places, but we think that it
     is completely inadequate and dangerous. Reasonable use of these types
     should be either using timezoned data only, or data without a timezone
     (or actually with a user-managed, separate indication of the timezone)
     only. The best way to achieve this would be to separate the relevant
     types into with-timezone and without-timezone. Anything else will
     cause more confusion than it will help.
     At a minimum, there should be very clear warnings everywhere
     a 'default timezone' is mentioned.
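     As a small sketch of the danger ('eq' here is backed by
     op:dateTime-equal):

        xs:dateTime("2003-07-08T12:00:00") eq xs:dateTime("2003-07-08T12:00:00Z")
        (: true if the implicit timezone of the evaluation context happens
           to be Z, false otherwise; the result depends on the context,
           not on the data :)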

[98] 9.4: Having separate functions for extraction from dateTime, date, and
     time seems unnecessarily tedious. Also, the descriptions of the
     individual functions should be shortened.

[99] 9.4.10: ***
"fn:get-hours-from-dateTime(xs:dateTime("1999-12-31T12:00:00")) returns 17"
     This result is just plain weird, and won't help anybody.
     12 is what the user will expect, and what she should get.

[100] 9.4.8 and other time divisions: What is the expected precision?
     Some systems may be able to provide very high precision, but should they?

[101] 9.6: *** "For purposes of timezone adjustment, an xs:date is treated 
as an xs:dateTime with time 00:00:00."
     It is unclear what this means, and it doesn't seem to have been
     'implemented' in the actual function definitions.

[102] 9.6: ***There should be a clear explanation of what 'adjustment' means,
     namely (in general, at least) to change the time notation so that
     it still denotes the same physical time, but uses a different
     timezone to do so.

[103] 9.6: *** The treatment of implicit timezones leads to very strange
    discontinuities for these timezone adjustments. In particular,
    while physical time is kept constant for adjustments of values with
    timezones, it is not when the timezone is missing. This is very
    dangerous. Also, there are differences in behavior between adjustments
    of dateTime and date.

[104] 9.6: ***In connection with daylight saving time adjustment, it is often
     necessary to shift times by keeping the same nominal value but
     changing the time zone, in effect shifting the physical time
     with the timezone shift. But there is no operation to do this
     easily.
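     A sketch of how this might be expressed with the current functions,
     assuming that adjusting a value that carries no timezone simply
     attaches the new timezone without changing the nominal value:

        (: keep the nominal value 23:00:00 but move it from +09:00 to +02:00 :)
        fn:adjust-dateTime-to-timezone(
          fn:adjust-dateTime-to-timezone(
            xs:dateTime("2003-07-08T23:00:00+09:00"), ()),
          xdt:dayTimeDuration("PT2H"))
        (: would return 2003-07-08T23:00:00+02:00 :)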

[105] 9.6: *** "op:subtract-yearMonthDuration-from-dateTime"
    The order of the operands is the wrong way round in the function
    name. This will cause problems for non-native English speakers.
    There should at least be a warning, but ideally a fix.

[106] 9.7.1: This should say that the duration is always rounded down
    to full months. There should be an example with more rounding
    (i.e. a difference of almost a full month).
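    For instance (the exact function name and the truncation rule are our
    assumptions here):

       fn:subtract-dateTimes-yielding-yearMonthDuration(
         xs:dateTime("2003-08-28T00:00:00Z"),
         xs:dateTime("2003-07-30T00:00:00Z"))
       (: would return P0M, although the difference is only two days short
          of a full month :)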

[107] 9.7.2: *** Examples where some of the operands have implicit timezones
    may be important to document the current design, but they are very bad
    usage examples.

[108] 9.7.13 and similar: "This value is added to the normalized value of 
$srcval1 and the result returned."
     It seems that it is important to normalize after the calculation.

[109] 9.7.13: The slack available due to time zones is not used.
     e.g. it might be possible to say that 23:00:00+09:00 + PT5H
     is 23:00:00+04:00 or some such.

[110] 10: *** The general comment about anyURIs and URIs applies here again.

[111] 10.1.1: What about allowing other nodes (e.g. attribute nodes) in the
     second position for fn:resolve-QName?

[112] 11.1: fn:resolve-uri: This terminology should be cross-checked with
     the new terminology in RFC2396bis.

[113] 11.1: It may be helpful to have fn:resolve-uri(string, node),
     i.e. get the base implicitly.
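     A minimal sketch of the intended convenience, assuming fn:base-uri (or
     equivalent access to the base URI property) is available;
     local:resolve-uri is just an illustrative name:

        declare function local:resolve-uri($relative as xs:string,
                                           $node as node()) as xs:anyURI?
        {
          fn:resolve-uri($relative, fn:string(fn:base-uri($node)))
        };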

[114] 11.1: "The second form of this function expects $base to be an 
absolute URI and $relative to be a relative URI."
     The second part of the sentence is misleading, because it can also
     be absolute.

[115] 11.1: *** The 'how to compare URIs' reference is outdated. In the
     final version, it should point to the relevant section of the IRI spec.

[116] 12.1.1/2: This is virtually useless. At least a function to compare
     hex with base64 should be available. This would cover the current
     two functions and provide more functionality.

[117] 14. It would be good to have the example doc in actual XML, rather
     than just described.

[118] 14.1: This subsection seems totally pointless. There may be others
     like this.

[119] 14.1.4: *** casting to numeric types: This would cast
     <a>1<b>2</b>3</a> to 123, yes? There should be functionality
     that allows one to e.g. ignore/remove the <b> element together with
     its content, or to convert each text node, etc.
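     For example, to convert each text node separately (a sketch; $a is
     assumed to be bound to the <a> element above):

        for $t in $a//text() return xs:integer($t)
        (: for <a>1<b>2</b>3</a> this would yield the sequence (1, 2, 3)
           rather than 123 :)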

[120] 14.1.5: *** fn:lang: There should be a function providing the result of
      (ancestor-or-self::*/@xml:lang)[last()]
     This is a step towards better support of language tagging, but
     we think that other steps will be needed.
     Ideally, this function should be called lang, and the current
     function should be called lang-match, but that may break
     backwards compatibility.
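     A minimal sketch of the suggested function (local:lang is just an
     illustrative name):

        declare function local:lang($node as node()) as xs:string?
        {
          for $l in ($node/ancestor-or-self::*/@xml:lang)[last()]
          return fn:string($l)
        };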

[121] 14.1.5: *** fn:lang should return true also if $testlang is ""
     (i.e. matching for any language)

[122] 14.1.5: *** there are only four examples, not five. There should
     be some examples with false results.
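     For instance, assuming a context node whose nearest xml:lang value is
     "en-US":

        fn:lang("fr")     (: false :)
        fn:lang("en-GB")  (: false :)
        fn:lang("en")     (: true, by the prefix rule, for contrast :)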

[123] 14.1.5: please say explicitly that xml:lang can be taken from an ancestor

[124] 14.1.7/8: again, only one of node-before and node-after is needed
     for backup.

[125] 15.1.1/2/3: The names of these functions should express the testing
     (rather than constructive) nature of these functions.

[126] 15.1.4: There seems to be some hiccup in: "The singleton xs:string 
value "". (the zero-length string). The expression cast as xs:boolean 
($srcval) returns false if $srcval is "0" and true if $srcval is "1"."

[126] ***15.1.7 and others: It would be a good idea to list, in the collation
    section, all the functions that potentially take a collation or are
    affected by collations.

[127] 15.1.12: Changing this from insert-before to insert-after will at least
    bring this function in line with usual indexing practice (i.e. the
    position before the first item is 0, after the first is 1, and so on).
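     That is (fn:insert-after is a hypothetical name):

        fn:insert-after(("a", "b", "c"), 0, "x")   (: ("x", "a", "b", "c") :)
        fn:insert-after(("a", "b", "c"), 1, "x")   (: ("a", "x", "b", "c") :)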

[128] 15.1.15: "This function takes a sequence or more typically, an 
expression, that evaluates to a sequence, and indicates that the result 
sequence may be returned in any order."
    This should explicitly say that the same sequence (except for order)
    as the argument is returned.

[129] 15.2: 'union', 'intersect', and 'except' are badly aligned grammatically
     (a noun, a verb, and a preposition)

[130] 15.2.1 *** is there a default collation for deep-equal?

[131] 15.2.1: *** "If the type of the items in $parameter1 and $parameter2 
is not xs:string  and $collationLiteral is specified, the collation is 
ignored."
     what about text nodes?

[132] 15.2.1.1: *** Why are namespace nodes compared with a collation?
     namespaces should be compared codepoint-by-codepoint.

[133] 15.2.1.1: "Note: The result of fn:deep-equal(1, current-dateTime()) 
is false; it does not raise an error."
     What does this want to say? That even the weirdest type combination
     is not an error?

[134] 15.2.1.1, code segments: These code segments use stuff that is not
     defined in this spec, such as 'eq'. There should be a pointer
     to a definition of 'eq'. This is one instance of a general problem
     already pointed out.

[135] 15.3.3 and others: *** the examples seem to imply that collation is
     an optional argument, but the signature shows it as mandatory.

[136] 15.3: rather than the very few aggregation functions provided here,
     it seems crucial to have adequate and easy-to-use second-order
     functions.

[137] 15.3.4/5: *** for strings, collations are used. What about subtypes
     of strings? what about anySimple and anyAtomic?

[138] 15.4.2: what 'substitution'?

[139] 15.4.2/3: *** that collations are not used for ID/IDREF is good

[140] 15.4.4: *** the text speaks about URIs, but this should be anyURIs.

[141] 15.4.4: guaranteeing 'doc("foo.xml") is doc("foo.xml")' may lead
     to problems for queries or transformations that run for a very
     long time (e.g. days) [not that they necessarily take that
     much time to compute, but they may e.g. be tuned to return
     a series of elements or documents at a certain pace].

[142] 15.4.4: "If two calls on this function supply different absolute 
URIs, the same document node may be returned if the implementation can 
determine that the two URIs refer to the same resource."
     A short explanation of how an implementation would do that may help.

[143] 15.4.5: fn:collection: How can a single URI return a collection of
     documents? Is this e.g. the result of a multiple-choice reply? Or what?
     Again, a short explanation listing a few possibilities would help.

[144] 15.4.6: What is the 'input sequence'?

[145] dates/times in general: The examples should vary the default time
     zone, not always use the same one, so that people become more aware
     of the arbitrariness of the calculations.

[146] 16.4.1/5.1: The example should be more realistic, with seconds and
     fractional parts.

[147] 16.7: *** Would it not be better to return the codepoint collation
     if no default is set?

[148] 17: In the first three lines of this section, five different terms 
are used:
     "casting function", "cast function", "cast operator", "constructor
     function", "cast expression". Please clean up terminology and clearly
     explain the terms that you use.

[149] 17: Why are there two syntaxes for casts, one being a substring of 
the other?
     One syntax should be enough (probably the shorter one)

[150] 17.1: "and "M" indicates that a conversion from values of the type to 
which the row applies to the type to which the column applies *may* be 
supported, subject to restrictions discussed in this section."
     Does the 'may' mean 'implementations *may* support this kind of casting'?
     Or 'this cast *will* work for a subset of the values of the source type,
     in all implementations'?
     Please clarify.

[151] 17.1: abbreviations: We suggest using them only for the columns, and
     labelling the rows with both the full name and the abbreviation.
     That way, everything is contained in a single printout.

[152] 17.1: *** Any type that starts with a 'g' for 'Gregorian' should keep
     this 'g' in its abbreviation.

[153] 17.1: *** Why is there an 'M' for anySimple to untypedAtomic? If this is
     because untypedAtomic cannot contain spaces, then the 'M' would also
     apply to str->untypedAtomic, because strings obviously can contain spaces.

[154] 17.1: "In the following table, the notation "S\T" indicates that the 
source ("S") of the conversion is indicated in the column below the 
notation and that the target ("T") is indicated in the row to the right of 
the notation."
     This sounds utterly helpless. Better change to Source\Target and be
     done with it, or use special long-range row and column outside the current
     table to indicate source and target. (by separating words into letters,
     vertically elongated table cells can easily be filled with text; newer
     browsers may even support adequate styling properties for vertical text.

[155] 17.4: Again a monetary type restricted to two digits after the 
decimal point.
     Here, please add a warning that this won't cover all currencies.

[156] 17.7: *** Casting from string to anyURI: Why is space replaced by %20?
     Please note that the newest IRI draft allows neither the space
     character nor the other ASCII characters that are not allowed in URIs.

[157] 17.7: *** "To cast to xs:anySimpleType or xdt:untypedAtomic the value 
is cast to xs:string, as described above, and the type annotation changed 
to xs:anySimpleType or xdt:untypedAtomic, respectively."
     These types are so extremely close that we think actual casts
     should not actually be needed (i.e. wherever a string goes, so goes
     an anySimple or an untypedAtomic, and vice versa.

[158] 17.7: *** casting from strings should include casting from text nodes.

[159] 17.8: "the xs:float value TV" -> "the xs:float TV"

[160] 17.8: "if SV is 1 or true": The value is only one of these, there just
     (unfortunately) happen to be two notations. Same for "0 or false".

[161] 17.9: The semi-formal description in terms of castings to strings and
     back is difficult to follow. An informal description in terms of
     components, followed maybe by a fully formulaic description of
     each conversion in a single formula, would be clearer.

[162] 17.9.5-9: The instructions for both dateTime and date are exactly
     the same. Please just say "If ST is dateTime or date, then..."

[163] 17.13: *** Again, please make sure you use anyURI, not URI.

[164] 17.15: Is xs:NOTATION($notation) allowed, or not? The table seems
     to suggest yes, the text seems to suggest no.

[165] References: "The Unicode Standard" should be:
     "The Unicode Standard
        The Unicode Consortium. The Unicode Standard, Version 4.0
     (Reading, MA, Addison-Wesley, 2003. ISBN 0-321-18578-1)

[166] References: [Unicode Case Mappings] should be:
     "Defined in The Unicode Standard, Section 3.13."

[167] References: You may want to add a reference to the ISO equivalent
     of the Unicode Collation Algorithm, ISO 14651. As in the case
     of Unicode and ISO/IEC 10646, the UCA is an extension of ISO 14651 --
     but a very substantial extension. This should be said in an
     explanatory note.

[168] C: It should be possible to have an XSLT implementation use functions
     defined with XQuery and vice versa, or the WGs should provide at least
     some proof-of-concept-quality software that can do the conversion.
     Also, the fact that there need to be two different ways to define
     these suggests that some follow-up work on an extensive function
     library may be highly appropriate.

[169] F.2: This should be called Functions and Operators Index.


Regards,    Martin.
