RE: I18N last call comments on XQuery/XPath Fun/Op (first part) [54] from Ashok Malhotra on 2003-10-06 (public-qt-comments@w3.org from October 2003)

From: Ashok Malhotra <ashokma@microsoft.com>
Date: Mon, 6 Oct 2003 15:57:21 -0700
To: "Martin Duerst" <duerst@w3.org>, <public-qt-comments@w3.org>
Cc: <w3c-i18n-ig@w3.org>
Message-ID: <EDB607C8AC991F40BE646533A1A673E846BF95@RED-MSG-42.redmond.corp.microsoft.com>
Thank you for your comment [54] on Characters and Collation Units.

We have clarified the wording and moved all the functions that use this
special feature of collations into a special section.  We have added an
optional error message that a system can invoke if it finds that a
collation is unsuitable for substring matching.

All the best, Ashok

> -----Original Message-----
> From: public-qt-comments-request@w3.org
>   [mailto:public-qt-comments-request@w3.org] On Behalf Of Martin Duerst
> Sent: Monday, July 07, 2003 2:06 PM
> To: public-qt-comments@w3.org
> Cc: w3c-i18n-ig@w3.org
> Subject: I18N last call comments on XQuery/XPath Fun/Op (first part)
> 
> 
> Dear XML Query WG and XSL WG,
> 
> Below please find the I18N WG's comments on your last call document
> "XQuery 1.0 and XPath 2.0 Functions and Operators"
> (http://www.w3.org/TR/2003/WD-xpath-functions-20030502/).
> 
> Please note the following:
> - Please address all replies to these comments to the I18N IG mailing
>    list (w3c-i18n-ig@w3.org), not just to me.
> - All i18n-relevant comments are marked with ***. There are also
>    general comments on the spec which we hope you will find useful.
> - We have not yet reviewed the other documents, such as XQuery 1.0
>    or XSLT 2.0, and so we might be unaware of i18n issues that appear
>    in these specs but may have to be traced back to functions and
>    operators.
>    There are also cases where we have identified an i18n issue here,
>    but we are not sure exactly what the best solution will be, and
>    which document it will have to be addressed in. Also, there are
>    issues that have been raised in comments to you about a different
>    document but that apply to this document, too. Sometimes, this is
>    mentioned below, but not always.
> - Our comments are numbered in square brackets [nn].
> 
> We look forward to further discussion with you to find the best
> solution on these issues.
> 
> 
> [1] Status of this document: The 'last call' information should be
>      close to the start, not the end, of this section.
> 
> [2] 1.1 implementation defined:
>      "Possibly differing between implementations, but specified by the
>      implementor for each particular implementation."
>      better: "Possibly differing between implementations, but
>      specified and documented by the implementor for each particular
>      implementation."
> 
> [3] 1.2: *** re. anyAtomicType/untypedAtomic, see data model comments
> 
> [4] 1.2: ur types -> urtypes
> 
> [5] 1.2: "Diagram courtesy Asir Vedamuthu, webMethods and Jim Melton,
>      Oracle"
>      Without any disrespect to these two hard-working gentlemen, this
>      is the first time we have seen such an in-place acknowledgement
>      in a W3C spec. It does not seem appropriate, in particular
>      because the image is mostly a copy from XML Schema. If necessary,
>      an ack in the ack section should do.
> 
> [6] 1.2: ***yearMonthDuration and dayTimeDuration:
>      glad to see that you are fixing
>      a well-known XML Schema problem.
> 
> [7] 1.3.2: ***the motivation for untypedAtomic seems unclear. There
>      may be many other cases where one wants to indicate that
>      subtypes/derived types are not acceptable. If this is considered
>      important, then there should be a general solution. See also our
>      other comments on this type.
> 
> [8] 1.4: ***Regarding associating Z with dates/times without timezone,
>     see our comments on data model.
> 
> [9] 1.4: Please provide a pointer to the normative spec for the
>     mappings (that there are not enough pointers, and it's often not
>     clear what is the normative specification of something, is a
>     general issue with the current set of specs)
> 
> [10] 1.6: This speaks about constructors. The issues list for the data
>     model says constructors are gone. This is confusing.
> 
> [11] 1.6: "For most functions there is a paragraph describing what the
> function does followed by semantic rules. These rules are meant to be
> followed in the order that they appear in this document."
>      This seems to imply that some details in an earlier (in the spec)
>      function definition can influence some details in a later
>      (in the spec) definition. Please clarify.
> 
> [12] 1.6: QName is defined in Namespaces in XML, not XML 1.0.
> 
> [13] 1.6, end: It would be good to add a small explanation about
>     multiple parameters and sequences as parameters. (and that this
>     is or is not working the same way as in Perl)
> 
> [14] 1.7: "Are in the XML Schema namespace": Please always give the
>     namespace URI in such cases.
> 
> [15] 1.7: "datatypes described in this document": Are there any? Where
>       are the descriptions?
> 
> [16] 2.3: What is the difference between fn:string() and
>      fn:string(as item?)?
>      If there is no item, then that's the same as the first variant,
>      or not?
> 
> [17] 2.3: *** here and elsewhere: anyURI is not exactly a URI!
> 
> [18] 2.4: Why does this function work on sequences, but others don't?
>       This is a general issue.
> 
> [19] 2.5: What about the case that the base URI is found in the node's
>     grandparent, or even higher up?
> 
> [20] 2.5: What is 'static context'?
> 
> [21] 3. ***Error function: How can error messages be localized?
> 
> [22] 4. .implementation defined. : These dots look strange.
> 
> [23] 4. Ordering implementation defined: This should be that the
>     ordering is according to execution order, which may be
>     implementation defined. Anything less diminishes the value of the
>     error mechanism considerably.
> 
> [24] 5.1: xs:TYP: shouldn't this be xs:TYPE?
> 
> [25] 5.1: ***The return type of anyURI as QName seems strange. 17.14
>       does not seem to explain why this special constructor appears
>       in 5.1.
> 
> [26] 5.2: Rules are defined in the same way: What rules exactly?
> 
> [27] 6.2: Subtype substitution and type promotion should be described
>     normatively in this spec.
> 
> [28] 6.2.6: (a/b)*b+(a mod b) -> (a idiv b) * b + (a mod b)
> 
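A small sketch of why the correction in [28] matters, written in Python
purely for illustration. It assumes that idiv truncates toward zero and
that mod takes the sign of the dividend; the helper names xp_idiv and
xp_mod are hypothetical and do not come from the spec.

```python
import math


def xp_idiv(a: int, b: int) -> int:
    """Integer division truncating toward zero (assumed idiv semantics)."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q


def xp_mod(a: int, b: int) -> int:
    """Remainder carrying the sign of the dividend (assumed mod semantics)."""
    return int(math.fmod(a, b))


def identity_holds(a: int, b: int) -> bool:
    # Corrected identity: (a idiv b) * b + (a mod b) = a
    return xp_idiv(a, b) * b + xp_mod(a, b) == a


def draft_identity_holds(a: int, b: int) -> bool:
    # Draft wording, with true division where idiv was meant
    return (a / b) * b + xp_mod(a, b) == a
```

With true division the sum overshoots whenever b does not divide a
evenly, for example a = 7 and b = 2, while the idiv form holds for every
sign combination of those operands.
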
> [29] 6.2.7: Why do we need unary plus? If there is a need for a 'noop'
>     function, it may be better to have a general one.
> 
> [30] 6.4: ***Rounding functions: We are not sure that floor, ceiling,
>      round, and 'round-half-to-even' cover the rounding conventions
>      around the world in a balanced way. We need to look into this in
>      more detail. (for some examples, please see
>      http://www.xencraft.com/resources/multi-currency.html#rounding).
>      Also, what is 'round-half-to-even' used for?
> 
> [31] 7.1: ***String types: It is very important to integrate
> anyType/anySimpleType
>     (and if kept, untypedAtomic) and text nodes here.
> 
> 
> [32] *** 7.1: "except for the range #xD800 to #xDFFF inclusive, which
>      is the range reserved for surrogate pairs"
>      "surrogate pairs" -> "surrogates"
> 
> [33] *** 7.1: What is the difference between 'code point' and
>      'character'? There are a few code points that are not characters
>      (e.g. #xFFFE and #xFFFF), but that can hardly be the point of
>      having two definitions.
> 
> [34] *** 7.2.1 and 7.2.2: Please make sure that there are tests for
>      these functions that include surrogate pairs/codepoints above
>      #xFFFF.
> 
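A minimal sketch of the kind of test case [34] asks for, in Python for
illustration only. The helper names are hypothetical; to_codepoints is
just a rough analogue of a string-to-codepoints function, not the spec's
definition.

```python
def to_codepoints(s: str) -> list[int]:
    """Rough analogue of a string-to-codepoints function."""
    return [ord(ch) for ch in s]


def utf16_code_units(s: str) -> int:
    """Number of 16-bit code units needed to encode s in UTF-16."""
    return len(s.encode("utf-16-le")) // 2


# 'a' followed by U+10400, a character above #xFFFF
sample = "a\U00010400"
```

The sample holds two code points but needs three UTF-16 code units, so
an implementation that counts code units instead of characters gives a
different answer on it.
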
> [35] 7.3: It would be good to have a dedicated subsection 7.3 about
>      collations,..., and then section 7.4 with the actual functions.
>      (or any other structure that gives collation its own subsection)
> 
> [36] *** 7.3: "the comparisons are inherently performed according to
>      some collation (even if that collation is defined entirely on
>      code point values or on the binary representations of the
>      characters of the string)"
>      are "code point values" and "binary representation" alternatives
>      or are they supposed to be the same? (the latter would not be
>      true in the case of UTF-16, which should be pointed out)
> 
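The UTF-16 point in [36] can be sketched as follows, in Python for
illustration. The two characters are chosen here only as an example: one
lies in the BMP above the surrogate range, the other is the first
supplementary character.

```python
bmp_char = "\uE000"        # UTF-16: the single code unit E000
supp_char = "\U00010000"   # UTF-16: the surrogate pair D800 DC00

# Python compares strings by code point, so this is code point order.
by_codepoint = sorted([supp_char, bmp_char])

# Ordering by the binary representation in UTF-16 (big-endian code units).
by_utf16_units = sorted([supp_char, bmp_char],
                        key=lambda s: s.encode("utf-16-be"))

# Ordering by the binary representation in UTF-8.
by_utf8_bytes = sorted([supp_char, bmp_char],
                       key=lambda s: s.encode("utf-8"))
```

Code point order and UTF-8 byte order agree on this pair, while UTF-16
code unit order reverses it, because the lead surrogate D800 sorts below
E000.
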
> [37] *** 7.3: "Strings can be compared character-by-character or in a
> logical manner, as defined by the collation."
>      This seems to imply that 'character-by-character' is not logical.
>      better change 'logical' to 'linguistically adequate' or so.
> 
> [38] *** 7.3 "For alignment with the [Character Model for the World
>      Wide Web 1.0], applications may choose collations that treat
>      unnormalized strings as though they were normalized (that is,
>      that implicitly normalize the strings)."
>      This is somewhat misleading, in that early uniform normalization
>      should avoid having to compare strings that differ only in
>      normalization.
>      This should be reworded carefully. Please also point out that
>      using highest collation strength does not imply string
>      normalization.
> 
> [39] *** 7.3: Using anyURIs to identify collations is the right thing
>       to do, but there are several problems in how this is done in
>       detail.
> 
> [40] *** 7.3: Having an anyURI to identify a collation by codepoint
>      order
>      (http://www.w3.org/2003/05/xpath-functions/collation/codepoint)
>      is good. But there should at the minimum also be a predefined
>      anyURI for identifying the Unicode collation algorithm (without
>      any special tailoring). Please note that this does not
>      necessarily mean that implementations have to support this
>      algorithm (which we definitely would not object to), but it at
>      least means that different implementations that all implement
>      that algorithm can interoperate.
> 
> [41] 7.3: "The XQuery/XPath static context includes A provision"
> 
> [42] *** 7.3: Rules for what collation to use: The current rules,
>      which allow only one collation to be specified, raise an error
>      if the collation is not supported, and use anyURIs to identify
>      collations without any mechanism for giving anyURIs to
>      well-known collations, are bound to lead to interoperability
>      problems. Collations should not be the major source of
>      interoperability problems. With the current design, even
>      vendors who want to be interoperable have no chance of doing so.
>      It will often be the case that e.g. a user wants just 'a French
>      collation'. How can this be indicated?
>      [we will send a separate mail about this issue to several lists
>       because we think that there is a potential for useful
>       coordination]
> 
> [43] *** 7.3: Context defines a single collation. But it may often be
>      desirable to have two 'default collations', in two different
>      senses:
>      a) different collations for matching (less precision) and sorting
>         (best precision possible)
>      b) different collations for internal operations and user-oriented
>         operations
> 
> [44] *** 7.3: "There might be several ways in which a collation might
>      be specified in the XQuery/XPath static context. For example,
>      XQuery might provide syntax that specifies a default collation
>      as part of the query prolog."
>      good ideas!
> 
> *** 7.3: The current proposal groups three different things into a
>      new concept called 'collation', which is different from what is
>      usually thought of as a collation. These are:
>      1) A collation in the traditional sense: every binary difference
>        leads to a non-equal match.
>      2) The use of different 'collations' to identify different
>        strengths of the same collation (i.e. case-insensitive,
>        accent-insensitive,...)
>      3) The use of 'collations' to identify character combinations
>        that serve as single units in some respect in some languages
>        (e.g. 'll' and 'ch' in traditional Spanish)
>      This has several problems:
>      [45] - The difference between 1) and 2), and the general use of
>             highest collation strength for sorting (to have
>             deterministic sorting) and potentially lower for matching
>             should be pointed out carefully.
>      [46] - Including 3) may make coordination with other efforts
>             somewhat difficult
>      [47] - Including 3) may make implementations somewhat difficult.
>             Although a given system (e.g. OS) may provide a range of
>             collations (in the 1) or 2) sense), the functionality in
>             3) may not be available (e.g. via an API)
>      [48] - Including 3) may make specifications somewhat difficult.
>             There may be cases where it is unclear what the clusters
>             should be. For example, for French sorting with accents
>             in the reverse, are the clusters the base letters
>             followed by the accents? Or is the reverse consideration
>             of accents ignored for this purpose?
>      [49] - This is way too important to be relegated to a note.
> 
> [50] *** 7.3: Using multiple-letter units for functions such as
>      'starts-with' (independent of its definition via a collation),
>      while appropriate in some cases, may not be that well
>      established and tested. It should be carefully considered
>      whether this functionality may not be too 'bleeding-edge', and
>      may not confuse users more than help, because there are too many
>      cases where it is wrong, i.e. where character combinations are
>      taken as single letters even if they are not, e.g. in foreign
>      loanwords,...
>      For example, a possible solution may be to use codepoint matching
>      when no collation is specified in the function rather than using
>      the default collation.
> 
> 
> [51] *** 7.3: "Some data management environments allow collations to
>      be associated with the definition of string items (that is, with
>      the metadata that describes items whose type is string). While
>      such association may be appropriate for use in environments in
>      which data is held in a repository tightly bound to its
>      descriptive metadata, it is not appropriate in the XML
>      environment in which different documents being processed by a
>      single query may be described by differing schemas."
>      This is a very good point, but should also mention the fact that
>      a data-based collation is not adequate for user needs.
> 
> 
> [52] *** 7.3: "Some data management environments allow collations to
>      be associated with the definition of string items (that is, with
>      the metadata that describes items whose type is string). While
>      such association may be appropriate for use in environments in
>      which data is held in a repository tightly bound to its
>      descriptive metadata, it is not appropriate in the XML
>      environment in which different documents being processed by a
>      single query may be described by differing schemas."
>      This very much applies to sorting, but it looks somewhat out of
>      place for 'character clusters' such as the traditional Spanish
>      'll' and 'ch', because that seems to imply a linguistic unit,
>      which means that it would be wrong to claim that 'el' is not an
>      initial substring of 'elle' if the latter is not indeed Spanish.
> 
> [53] *** 7.3: Is there any case where e.g. element or attribute names
>      can be treated as strings? How would they be sorted?
> 
> [54] *** 7.3: "It is possible to define collations that do not have
>      this property, for example a collation that attempts to sort
>      "ISO 8859" before "ISO 10646", or "January" before "February".
>      Such collations may fail, or give unexpected results, when used
>      with functions such as fn:contains()."
>      "this property": What property? Why 'fail'? That 'January' does
>      not contain 'ary' would be a consequence of the definition, not
>      a failure. Maybe it would be better to say that for fn:contains
>      and friends, using the codepoint collation in most cases
>      produces more predictable results than using a specific
>      collation.
> 
> [55] *** 7.3.1.1, last example: While in the preceding example,
>      'equates' is the right term, this is not necessary here. The
>      only thing that is necessary is that the collation treats
>      differences between 'ss' and sharp-s with less strength than
>      differences in base letters (such as the final n). (there is
>      also the case where the 'ss' <-> sharp-s difference is at least
>      as strong as the final 'n', and sharp-s < 'ss').
> 
> [56] *** 7.4 "The following functions are defined on these string
>      types."
>      which string types? As many as possible (up to anySimple and text
>      node, hopefully).
> 
> [57] *** 7.4.1: Is there a concat operation that includes
>      normalization splicing at the contact point? This would be very
>      helpful, and ideally should be the default, because this may be
>      the most efficient way to maintain a certain normalization. The
>      same applies to string-join and potentially other operations.
> 
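To illustrate the concern in [57] about the contact point: two strings
that are each normalized can concatenate into a string that is not. The
sketch below uses Python's unicodedata module. The helper names are
hypothetical, and concat_nfc simply renormalizes the whole result, which
is a cruder stand-in for true splicing at the contact point only.

```python
import unicodedata


def is_nfc(s: str) -> bool:
    """True if s is unchanged by NFC normalization."""
    return unicodedata.normalize("NFC", s) == s


def concat_nfc(a: str, b: str) -> str:
    """Concatenate, then renormalize the result to NFC."""
    return unicodedata.normalize("NFC", a + b)


left = "e"           # already NFC
right = "\u0301x"    # COMBINING ACUTE ACCENT followed by 'x', also NFC
plain = left + right
spliced = concat_nfc(left, right)
```

Plain concatenation leaves 'e' followed by a combining accent, whereas
the normalizing variant yields the precomposed U+00E9 followed by 'x'.
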
> [57] *** 7.4.12/13: What about other transforms, such as
> katakana<->hiragana,...
> 
> [58] *** 7.4.6: "The first character of a string is located
>      at position 1, not position 0."
>      It is a pity that XQuery was not able to fix this problem. It
would
>      be good if there were a special section at the start of 7.4, or
>      somewhere else in an adequate place, that clearly explained the
>      issue of 1 origin for strings. Also, rather than writing things
>      such as "beginning at position", it should always be "beginning at
>      character position", to make it easier for users to understand
>      that e.g. substrings are not identified by indicating boundary
>      positions between characters.
> 
> [59] 7.4.8/9: For continuity, fn:substring-before($string, "") should
>      return the empty string, and fn:substring-after($string, "")
>      should return $string. Rationale: The shorter a string is, the
>      earlier it generally matches. Thus, the empty string matches at
>      the start of a string.
> 
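The behaviour proposed in [59] can be sketched as follows, in Python for
illustration. The function names mirror the spec's, but these are
hypothetical helpers built on str.find, which already reports an empty
search string as matching at offset 0; they ignore collations entirely.

```python
def substring_before(s: str, sub: str) -> str:
    """Part of s before the first match of sub, or "" if no match."""
    i = s.find(sub)
    return s[:i] if i >= 0 else ""


def substring_after(s: str, sub: str) -> str:
    """Part of s after the first match of sub, or "" if no match."""
    i = s.find(sub)
    return s[i + len(sub):] if i >= 0 else ""
```

Because the empty string matches at the very start, the first helper
returns the empty string and the second returns its input unchanged,
which is the continuity argument made above.
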
> [60] *** 7.4.11: There should be a reference to Unicode Standard Annex
>      #15 for the various normalization forms.
> 
> [61] *** 7.4.11: In the current Character Model, W3C normalization can
>      be tested, but is not defined as a function. This probably can
>      be fixed by specifying that the W3C normalization function first
>      uses NFC, and then prepends a space if the result is not yet in
>      W3C normalization form.
> 
> [62] *** 7.4.11: In the light of the above, W3C normalization should
>      also be made required to support.
> 
> [63] *** 7.4.11: What other normalization forms might be supported?
>      How would they be identified? How will the space of identifiers
>      be managed, to avoid conflicts?
> 
> [64] *** 7.4.12/13: The Unicode case mappings are now superseded by
>      Unicode 4, see http://www.unicode.org/unicode/reports/tr21/.
> 
> [65] *** 7.4.12/13: It should be pointed out that mappings can change
>      the length of a string, and that lower(upper(s)) == s is not
>      guaranteed, nor is upper(lower(s)) == s, and that these functions
>      may not always be linguistically appropriate (e.g. Turkish i
>      without dot) or appropriate for the application (e.g. titlecase).
>      It should be said that in cases such as Turkish, a simple
>      translation should be used first. It would be much better to
>      have a uniform approach for all languages, rather than having to
>      special-case a few languages.
> 
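The caveats in [65] are easy to reproduce with any implementation of the
default Unicode case mappings. The sketch below uses Python's str.upper
and str.lower, which apply the language-independent mappings; the helper
name is hypothetical.

```python
def survives_upper_then_lower(s: str) -> bool:
    """True if lower(upper(s)) gives back s."""
    return s.upper().lower() == s


sharp_s = "\u00DF"        # LATIN SMALL LETTER SHARP S
dotless_i = "\u0131"      # LATIN SMALL LETTER DOTLESS I (Turkish)
dotted_cap_i = "\u0130"   # LATIN CAPITAL LETTER I WITH DOT ABOVE

upper_sharp_s = sharp_s.upper()            # the string grows to "SS"
lower_dotted_cap_i = dotted_cap_i.lower()  # 'i' plus COMBINING DOT ABOVE
```

Sharp s lengthens when uppercased and does not come back when
lowercased, and the dotless i round-trips to an ordinary dotted i, which
is wrong for Turkish text.
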
> [66] 7.4.15: string-pad seems to be the wrong name, because it
>      suggests padding to a certain length, e.g.
>      string-pad("XMLQuery", 15) would return "XMLQueryXMLQuer", i.e.
>      15 characters. Something like 'multiply' seems more adequate.
> 
> [67] *** 7.4.16: It is unclear what these functions are supposed to do
>      with anyURIs/IRIs.
> 
> [68] *** 7.4.16: It is unclear what the case of $escape-reserved ==
>       false is supposed to do. The example shows this as the identity
>       function, which is probably not the point. Please give us a
>       better example, so that we can understand what this is used
>       for. It may be better ultimately to separate this into two
>       functions, and remove the $escape-reserved argument.
> 
> [69] 7.4.17: "The % character itself is escaped only if it is not
>      followed by two hexadecimal digits": This seems dangerous. If
>      "%20" is a plain payload, then the '%' has to be escaped, like
>      everything else.
> 
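The danger described in [69] can be sketched in Python, for illustration
only. lenient_escape is a hypothetical model of the quoted rule (escape
'%' only when it is not followed by two hexadecimal digits), not the
spec's function; strict_escape escapes every '%'. Both rely on
urllib.parse.quote for the remaining characters.

```python
import re
from urllib.parse import quote, unquote


def lenient_escape(s: str) -> str:
    """Escape '%' only when it is not followed by two hex digits."""
    protected = re.sub(r"%(?![0-9A-Fa-f]{2})", "%25", s)
    # safe="%" keeps the percent signs decided above untouched
    return quote(protected, safe="%")


def strict_escape(s: str) -> str:
    """Escape every '%' along with all other non-unreserved characters."""
    return quote(s, safe="")


payload = "100%20 sure"   # "%20" here is literal text, not an escape
```

Under the lenient rule the literal "%20" in the payload is left alone,
so unescaping turns it into a space and the original text is lost; the
strict rule round-trips.
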
> [70] 7.4.16: Gopher examples are very outdated. We suggest replacement
>      with something more people are familiar with.
> 
> [71] 7.5.1: 'reluctant' quantifiers: It may be better to use the Perl
>      term 'minimal'.
> 
> [72] 7.5.1: "In the absence of these quantifiers, the regular
>      expression matches the longest possible substring."
>      no, not in the absence of these quantifiers, but if the other
>      (maximal) set of quantifiers is used ('*?' is a quantifier, but
>      in '*?', '?' is not a quantifier)
> 
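The distinction in [71] and [72] between maximal and minimal
('reluctant') quantifiers, sketched with Python's re module as a
stand-in. Python's regex flavour is not the one defined in the spec, but
it treats '*' and '*?' in the same way for this example.

```python
import re

text = "aXbXb"

# '*' is the maximal (greedy) quantifier: it takes the longest match.
greedy = re.match(r"a.*b", text).group()

# '*?' is the minimal (reluctant) quantifier: it takes the shortest match.
reluctant = re.match(r"a.*?b", text).group()
```

Both patterns contain a quantifier; only the choice between the maximal
and the minimal form decides whether the match runs to the last 'b' or
stops at the first.
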
> [73] *** 7.5.2: "Regular expression matching is defined on the basis
>      of Unicode code-points; it takes no account of collations."
>      It seems somewhat inadequate that fn:contains and friends use
>      collations, but regular expressions don't. But we are not sure
>      which way the right solution is.
> 
> [74] 7.5.3: "An error is raised ("Invalid replacement string") if the
>      value of $replacement contains a "$" character that is not
>      immediately followed by a digit 1-9 and not immediately preceded
>      by a "/"."
>       "/" -> "\"
> 
> [75] *** 7.5.3: We have been told that there is a provision to replace
>      character strings with markup, but this is not discussed here.
>      We want to make sure this is available, because it is relevant
>      to get people away from using the PUA.
> 
> [76] 7.5.3: Why does fn:replace not provide for single replacements?
> 
> [77] 8.2.2/3: Boolean less-than and greater-than seem strange. On the
>      other hand, 'or' and 'and' seem to be badly missing.
> 
> 
> 
> This concludes the first part of these comments. We will send the
> rest of our comments tomorrow (planned around 3pm EDT July 8th).
> 
> 
> Regards,    Martin.
>
Received on Monday, 6 October 2003 18:57:56 UTC