RE: I18N last call comments on XQuery/XPath Fun/Op (first part) from Ashok Malhotra on 2003-09-29 (public-qt-comments@w3.org from September 2003)

From: Ashok Malhotra <ashokma@microsoft.com>
Date: Mon, 29 Sep 2003 15:40:18 -0700
To: "Martin Duerst" <duerst@w3.org>, <public-qt-comments@w3.org>
Cc: <w3c-i18n-ig@w3.org>
Message-ID: <E5B814702B65CB4DA51644580E4853FB0AE7762E@red-msg-12.redmond.corp.microsoft.com>
Thank you for your comments.  The WGs discussed [77] on 9/18/2003 and
decided to add a note to clarify why boolean values were considered to
be ordered.  'and' and 'or' are not available as functions but are
available in the XQuery and XPath 2.0 languages.   

All the best, Ashok

> -----Original Message-----
> From: public-qt-comments-request@w3.org [mailto:public-qt-comments-
> request@w3.org] On Behalf Of Martin Duerst
> Sent: Monday, July 07, 2003 2:06 PM
> To: public-qt-comments@w3.org
> Cc: w3c-i18n-ig@w3.org
> Subject: I18N last call comments on XQuery/XPath Fun/Op (first part)
> 
> 
> Dear XML Query WG and XSL WG,
> 
> Below please find the I18N WGs comments on your last call document
> "XQuery 1.0 and XPath 2.0 Functions and Operators"
> (http://www.w3.org/TR/2003/WD-xpath-functions-20030502/).
> 
> Please note the following:
> - Please address all replies to these comments to the I18N IG mailing
>    list (w3c-i18n-ig@w3.org), not just to me.
> - All i18n-relevant comments are marked with ***. There are also
general
>    comments on the spec which we hope you will find useful.
> - We have not yet reviewed the other documents, such as XQuery 1.0
>    or XSLT 2.0, and so we might be unaware of i18n issues that appear
>    in these specs but may have to be traced back to functions and
> operators.
>    There are also cases where we have identified an i18n issue here,
>    but we are not sure exactly what the best solution will be, and
which
>    document it will have to be addressed in. Also, there are issues
that
>    have been raised in comments to you about a different document but
>    that apply to this document, too. Sometimes, this is mentioned
below,
>    but not always.
> - Our comments are numbered in square brackets [nn].
> 
> We look forward to further discussion with you to find the best
> solution on these issues.
> 
> 
> [1] Status of this document: The 'last call' information should be
>      close to the start, not the end, of this section.
> 
> [2] 1.1 implementation defined:
>      "Possibly differing between implementations, but specified by the
> implementor for each particular implementation."
>      better: "Possibly differing between implementations, but
specified
> and
> documented by the implementor for each particular implementation."
> 
> [3] 1.2: *** re. anyAtomicType/untypedAtomic, see data model comments
> 
> [4] 1.2: ur types -> urtypes
> 
> [5] 1.2: "Diagram courtesy Asir Vedamuthu, webMethods and Jim Melton,
> Oracle"
>      Without any disrespect to these two hard-working gentlemen, this
is
> the
>      first time we have seen such an in-place acknowledgement in a W3C
> spec,
>      it does not seem appropriate, in particular because the image is
> mostly
>      a copy from XML Schema. If necessary, an ack in the ack section
> should do.
> 
> [6] 1.2: ***yearMonthDuration and dayTimeDuration:
>      glad to see that you are fixing
>      a well-known XML Schema problem.
> 
> [7] 1.3.2: ***the motivation for untypedAtomic seems unclear. There
may be
> many
>      other cases where one wants to indicate that subtypes/derived
types
>      are not acceptable. If this is considered important, then there
> should
>      be a general solution. See also our other comments on this type.
> 
> [8] 1.4: ***Regarding associating Z with dates/times without timezone,
see
>     our comments on data model.
> 
> [9] 1.4: Please provide a pointer to the normative spec for the
mappings
>     (that there are not enough pointers, and it's often not clear what
>      is the normative specification of something, is a general issue
>      with the current set of specs)
> 
> [10] 1.6: This speaks about constructors. The issues list for the data
> model
>     says constructors are gone. This is confusing.
> 
> [11] 1.6: "For most functions there is a paragraph describing what the
> function does followed by semantic rules. These rules are meant to be
> followed in the order that they appear in this document."
>      This seems to imply that some details in an earlier (in the spec)
>      function definition can influence some details in a later
>      (in the spec) definition. Please clarify.
> 
> [12] 1.6: Qname is defined in XML namespaces, not XML 1.0.
> 
> [13] 1.6, end: It would be good to add a small explanation about
multiple
>     parameters and sequences as parameters. (and that this is or is
>     not working the same way as in Perl)
> 
> [14] 1.7: "Are in the XML Schema namespace": Please always give the
>     namespace URI in such cases.
> 
> [15] 1.7: "datatypes described in this document": Are there any? Where
are
>       the descriptions?
> 
> [16] 2.3: What is the difference between fn:string() and fn:string(as
> item?)?
>      If there is no item, then that's the same as the first variant,
or
> not?
> 
> [17] 2.3: *** here and elsewhere: anyURI is not exactly an URI!
> 
> [18] 2.4: Why does this function work on sequences, but others don't?
>       This is a general issue.
> 
> [19] 2.5: What about the case that the base URI is found in the node's
>     grandparent, or even higher up?
> 
> [20] 2.5: What is 'static context'?
> 
> [21] 3. ***Error function: How can error messages be localized?
> 
> [22] 4. .implementation defined. : These dots look strange.
> 
> [23] 4. Ordering implementation defined: This should be that the
ordering
> is
>     according to execution order, which may be implementation defined.
>     Anything less diminishes the value of the error mechanism
> considerably.
> 
> [24] 5.1: xs:TYP: shouldn't this be xs:TYPE?
> 
> [25] 5.1: ***The return type of anyURI as QName seems strange. 17.14
does
> not
>       seem to explain why this special constructor appears in 5.1.
> 
> [26] 5.2: Rules are defined in the same way: What rules exactly?
> 
> [27] 6.2: Subtype substitution and type promotion should be described
>     normatively in this spec.
> 
> [28] 6.2.6: (a/b)*b+(a mod b) -> (a idiv b) * b + (a mod b)
> 
> [29] 6.2.7: Why do we need unary plus? If there is a need for a 'noop'
>     function, it may be better to have a general one.
> 
> [30] 6.4: ***Rounding functions: We are not sure that floor, ceiling,
> round,
>      and 'round-half-to-even' cover the rounding conventions around
>      the world in a balanced way. We need to look into this in more
>      detail. (for some examples, please see
>      http://www.xencraft.com/resources/multi-currency.html#rounding).
>      Also, what is 'round-half-to-even' used for?
> 
> [31] 7.1: ***String types: It is very important to integrate
> anyType/anySimpleType
>     (and if kept, untypedAtomic) and text nodes here.
> 
> 
> [32] *** 7.1: "except for the range #xD800 to #xDFFF inclusive, which
is
> the range reserved for surrogate pairs"
>      "surrogate pairs" -> "surrogates"
> 
> [33] *** 7.1: What is the difference between 'code point' and
'character'?
>      There are a few code points that are not characters (e.g. #xFFFE
and
>      #xFFFF), but that can hardly be the point of having two
definitions.
> 
> [34] *** 7.2.1 and 7.2.2: Please make sure that there are tests for
>      these functions that include surrogate pairs/codepoints above
#xFFFF.
> 
> [35] 7.3: It would be good to have a dedicated subsection 7.3 about
>      collations,..., and then section 7.4 with the actual functions.
>      (or any other structure that gives collation it's own subsection)
> 
> [36] *** 7.3: "the comparisons are inherently performed according to
some
> collation (even if that collation is defined entirely on code point
values
> or on the binary representations of the characters of the string)"
>      are "code point values" and "binary representation" alternatives
or
>      are they supposed to be the same? (the later would not be true in
>      the case of UTF-16, which should be pointed out)
> 
> [37] *** 7.3: "Strings can be compared character-by-character or in a
> logical manner, as defined by the collation."
>      This seems to imply that 'character-by-character' is not logical.
>      better change 'logical' to 'linguistically adequate' or so.
> 
> [38] *** 7.3 "For alignment with the [Character Model for the World
Wide
> Web 1.0], applications may choose collations that treat unnormalized
> strings as though they were normalized (that is, that implicitly
normalize
> the strings)."
>      This is somewhat misleading, in that early uniform normalization
>      should avoid having to compare strings that differ only in
> normalization.
>      This should be reworded carefully. Please also point out that
using
>      highest collation strength does not imply string normalization.
> 
> [39] *** 7.3: Using anyURIs to identify collations is the right thing
to
> do.
>       but there are several problems in how this is done in detail.
> 
> [40] *** 7.3: Having an anyURI to identify a collation by codepoint
order
>      (http://www.w3.org/2003/05/xpath-functions/collation/codepoint)
is
>      good. But there should at the minimum also be a predefined anyURI
>      for identifying the Unicode collation algorithm (without any
special
>      tailoring). Please note that this not necessarily means that
> implementations
>      have to support this algorithm (which we definitely would not
object
> to),
>      but it at least means that different implementations that all
> implement
>      that algorithm can interoperate.
> 
> [41] 7.3: "The XQuery/XPath static context includes A provision"
> 
> [42] *** 7.3: Rules for what collation to use: The current rules,
which
>      allow only one collation to be specified, raise an error if the
> collation
>      is not supported, and use anyURIs to identify collations without
any
>      mechanism for giving anyURIs to well-known collations, are bound
to
>      lead to interoperability problems. Collations should not be the
major
>      source of interoperability problems. With the current design,
even
>      vendors who want to be interoperable have no chance of doing so.
>      It will often be the case that e.g. a user wants just 'a French
>      collation'. How can this be indicated.
>      [we will send a separate mail about this issue to several lists
>       because we think that there is a potential for useful
coordination]
> 
> [43] *** 7.3: Context defines a single collation. But it may often be
>      desirable to have two 'default collations', in two different
senses:
>      a) different collations for matching (less precision) and sorting
>         (best precision possible)
>      b) different collations for internal operations and user-oriented
> operations
> 
> [44] *** 7.3: "There might be several ways in which a collation might
be
> specified in the XQuery/XPath static context. For example, XQuery
might
> provide syntax that specifies a default collation as part of the query
> prolog."
>      good ideas!
> 
> *** 7.3: The current proposal groups three different things into a
>      new concept called 'collation', which is different from what is
> usually
>      thought of as a collation. These are:
>      1) A collation in the traditional sense: every binary difference
> leads
>        to a non-equal match.
>      2) The use of different 'collations' to identify different
strengths
>        of the same collation (i.e. case-insensitive, accent-
> insensitive,...)
>      3) The use of 'collations' to identify character combinations
that
> serve
>        as single units in some respect in some languages (e.g. 'll'
and
> 'ch'
>        in traditional Spanish)
>      This has several problems:
>      [45] - The difference between 1) and 2), and the general use of
> highest
>             collation strength for sorting (to have deterministic
sorting)
>             and potentially lower for matching should be pointed out
> carefully.
>      [46] - Including 3) may make coordination with other efforts
somewhat
> difficult
>      [47] - Including 3) may make implementations somewhat difficult.
>             Although a given system (e.g. OS) may provide a range of
>             collations (in the 1) or 2) sense), the functionality in
3)
>             may not be available (e.g. via an API)
>      [48] - Including 3) may make specifications somewhat difficult.
>             There may be cases where it is unclear what the clusters
> should be.
>             For example, for French sorting with accents in the
reverse,
> are
>             the clusters the base letters followed by the accents? Or
is
> the
>             reverse consideration of accents ignored for this purpose?
>      [49] - This is way too important to be relegated to a note.
> 
> [50] *** 7.3: Using multiple-letter units for functions such as
'starts-
> with'
>      (independent of its definition via a collation), while
appropriate in
>      some cases, may not be that well established and tested. It
should be
>      carefully considered whether this functionality may not be too
> 'bleeding-edge',
>      and may not confuse users more than help, because there are too
many
>      cases where it is wrong, i.e. where character combinations are
taken
>      as single letters even if they are not, e.g. in foreign
loanwords,...
>      For example, a possible solution may be to use codepoint matching
> when
>      no collation is specified in the function rather than using the
> default
>      collation.
> 
> 
> [51] *** 7.3: "Some data management environments allow collations to
be
> associated with the definition of string items (that is, with the
metadata
> that describes items whose type is string). While such association may
be
> appropriate for use in environments in which data is held in a
repository
> tightly bound to its descriptive metadata, it is not appropriate in
the
> XML
> environment in which different documents being processed by a single
query
> may be described by differing schemas."
>      This is a very good point, but should also mention the fact that
a
> data-based
>      collation is not adequate for user needs.
> 
> 
> [52] *** 7.3: "Some data management environments allow collations to
be
> associated with the definition of string items (that is, with the
metadata
> that describes items whose type is string). While such association may
be
> appropriate for use in environments in which data is held in a
repository
> tightly bound to its descriptive metadata, it is not appropriate in
the
> XML
> environment in which different documents being processed by a single
query
> may be described by differing schemas."
>      This very much applies to sorting, but it looks somewhat out of
place
>      for 'character clusters' such as the traditional Spanish 'll' and
> 'ch',
>      because that seems to imply a linguistic unit, which means that
it
>      would be wrong to claim that 'el' is not an initial substring of
> 'elle'
>      if the later is not indeed Spanish.
> 
> [53] *** 7.3: Is there any case where e.g. element or attribute names
can
>      be treated as strings? How would they be sorted?
> 
> [54] *** 7.3: "It is possible to define collations that do not have
this
> property, for example a collation that attempts to sort "ISO 8859"
before
> "ISO 10646", or "January" before "February". Such collations may fail,
or
> give unexpected results, when used with functions such as
fn:contains()."
>      "this property": What property? Why 'fail'? That 'January' does
not
> contain
>      'ary' would be a consequence of the definition, not a failure.
Maybe
> it
>      would be better to say that for fn:contains and friends, using
the
>      codepoint collation in most cases produces more predictible
results
> than
>      using a specific collation.
> 
> [55] *** 7.3.1.1, last example: While in the preceeding example,
'equates'
>      is the right term, this is not necessary here. The only thing
that is
>      necessary is that the collation treats differences between 'ss'
and
>      sharp-s with less strength than differences in base letters (such
as
>      the final n). (there is also the case where the 'ss' <-> sharp-s
>      difference is at least as strong as the final 'n', and sharp-s <
> 'ss').
> 
> [56] *** 7.4 "The following functions are defined on these string
types."
>      which string types? As many as possible (up to anySimple and text
> node,
>      hopefully).
> 
> [57] *** 7.4.1: Is there a concat operation that includes
normalization
>      splicing at the contact point? This would be very helpful, and
> ideally
>      should be the default, because this may be the most efficient way
to
>      maintain a certain normalization. The same applies to string-join
>      and potentially other operations.
> 
> [57] *** 7.4.12/13: What about other transforms, such as
> katakana<->hiragana,...
> 
> [58] *** 7.4.6: "The first character of a string is located
>      at position 1, not position 0."
>      It is a pity that XQuery was not able to fix this problem. It
would
>      be good if there were a special section at the start of 7.4, or
>      somewhere else in an adequate place, that clearly explained the
>      issue of 1 origin for strings. Also, rather than writing things
>      such "beginning at position", it should always be "beginning at
>      character position", to make it easier for users to understand
>      that e.g. substrings are not identified by indicating boundary
>      positions between characters.
> 
> [59] 7.4.8/9: For continuity, fn:substring-before($string, "") should
>      return the empty string, and fn:substring-after($string, "")
>      should return $string. Rationale: The shorter a string is, the
>      earlier it generally matches. Thus, the empty string matches at
>      the start of a string.
> 
> [60] *** 7.4.11: There should be a reference to Unicode Standard Annex
#15
>      for the various normalization forms.
> 
> [61] *** 7.4.11: In the current Character Model, W3C normalization can
be
>      tested, but is not defined as a function. This probably can be
fixed
>      by specifying that the W3C normalization function first uses NFC,
>      and then prepends a space if the result is not yet in W3C
> normalization form.
> 
> [62] *** 7.4.11: In the light of the above, W3C normalization should
also
>      be made required to support.
> 
> [63] *** 7.4.11: What other normalization forms might be supported?
>      How would they be identified? How will the space of identifiers
>      be managed, to avoid conflicts?
> 
> [64] *** 7.4.12/13: The Unicode case mappings are now superseeded by
>      Unicode 4, see http://www.unicode.org/unicode/reports/tr21/.
> 
> [65] *** 7.4.12/13: It should be pointed out that mappings can change
>      the length of a string, and that lower(upper(s)) == s is not
>      guaranteed, nor is upper(lower(s)) == s, and that these functions
>      may not always be linguistically appropriate (e.g. Turkish i
>      without dot) or appropriate for the application (e.g. titlecase).
>      It should be said that in cases such as Turkish, a simple
translation
>      should be used first. It would be much better to have an uniform
>      approach for all languages, rather than having to special-case a
few
>      languages.
> 
> [66] 7.4.15: string-pad seems to be the wrong name, because it
suggests
>      padding to a certain length, e.g. string-pad("XMLQuery", 15)
would
>      return "XMLQueryXMLQuer", i.e. 15 characters. Something like
>      'multiply' seems more adequate.
> 
> [67] *** 7.4.16: It is unclear what these functions are supposed to do
>      with anyURIs/IRIs.
> 
> [68] *** 7.4.16: It is unclear what the case of $escape-reserved ==
false
>       is supposed to do. The example shows this as the identity
function,
>       which is probably not the point. Please give us a better
example,
>       so that we can understand what this is used for. It may be
better
>       ultimately to separate this into two functions, and remove the
>       $escape-reserved argument.
> 
> [69] 7.4.17: "The % character itself is escaped only if it is not
followed
>      by two hexadecimal digits": This seems dangerous. If "%20" is a
plain
>      payload, then the '%' has to be escaped, like everything else.
> 
> [70] 7.4.16: Gopher examples are very outdated. We suggest replacement
>      with something more people are familiar with.
> 
> [71] 7.5.1: 'reluctant' quantifiers: It may be better to use the Perl
>      term 'minimal'.
> 
> [72] 7.5.1: "In the absence of these quantifiers, the regular
expression
> matches the longest possible substring."
>      no, not in the absence of these quantifiers, but if the other
> (maximal)
>      set of quantifiers is used ('*?' is a quantifier, but in '*?',
'?' is
>      not a quantifer)
> 
> [73] *** 7.5.2: "Regular expression matching is defined on the basis
of
> Unicode code-points; it takes no account of collations."
>      It seems somewhat inadequate that fn:contains and friends use
> collations,
>      but regular expressions don't. But we are not sure which way the
> right
>      solutions is.
> 
> [74] 7.5.3: "An error is raised ("Invalid replacement string") if the
> value
> of $replacement contains a "$" character that is not immediately
followed
> by a digit 1-9 and not immediately preceded by a "/"."
>       "/" -> "\"
> 
> [75] *** 7.5.3: We have been told that there is a provision to replace
>      character strings with markup, but this is not discussed here.
>      We want to make sure this is available, because it is relevant
>      to get people away from using the PUA.
> 
> [76] 7.5.3: Why does fn:replace not provide for single replacements?
> 
> [77] 8.2.2/3: Boolean less-than and greater-than seem strange. On the
> other
>      hand, 'or' and 'and' seem to be badly missing.
> 
> 
> 
> This concludes the first part of these comments. We will send the
> rest of our comments tomorrow (planned around 3pm EDT July 8th).
> 
> 
> Regards,    Martin.
>
Received on Monday, 29 September 2003 18:40:22 UTC