RE: I18N last call comments on XQuery/XPath Fun/Op (first part) from Ashok Malhotra on 2003-09-02 (public-qt-comments@w3.org from September 2003)

From: Ashok Malhotra <ashokma@microsoft.com>
Date: Tue, 2 Sep 2003 04:55:57 -0700
To: "Martin Duerst" <duerst@w3.org>, <public-qt-comments@w3.org>
Cc: <w3c-i18n-ig@w3.org>
Message-ID: <E5B814702B65CB4DA51644580E4853FB0A700A67@red-msg-12.redmond.corp.microsoft.com>
Great comments!  I have responded to some by clarifying wording and
others by creating issues for the XML Query WG or the F&O taskforce to
discuss.

On two issues I have started discussion threads.  See comments inline.

 

This is not a formal response from the Query WG.  Feel free to comment
if you think something has not been addressed adequately.

 

All the best, Ashok

 

> -----Original Message-----

> From: public-qt-comments-request@w3.org [mailto:public-qt-comments-request@w3.org]

> On Behalf Of Martin Duerst

> Sent: Monday, July 07, 2003 2:06 PM

> To: public-qt-comments@w3.org

> Cc: w3c-i18n-ig@w3.org

> Subject: I18N last call comments on XQuery/XPath Fun/Op (first part)

> 

> 

> Dear XML Query WG and XSL WG,

> 

> Below please find the I18N WG's comments on your last call document

> "XQuery 1.0 and XPath 2.0 Functions and Operators"

> (http://www.w3.org/TR/2003/WD-xpath-functions-20030502/).

> 

> Please note the following:

> - Please address all replies to these comments to the I18N IG mailing

>    list (w3c-i18n-ig@w3.org), not just to me.

> - All i18n-relevant comments are marked with ***. There are also general

>    comments on the spec which we hope you will find useful.

> - We have not yet reviewed the other documents, such as XQuery 1.0

>    or XSLT 2.0, and so we might be unaware of i18n issues that appear

>    in these specs but may have to be traced back to functions and operators.

>    There are also cases where we have identified an i18n issue here,

>    but we are not sure exactly what the best solution will be, and which

>    document it will have to be addressed in. Also, there are issues that

>    have been raised in comments to you about a different document but

>    that apply to this document, too. Sometimes, this is mentioned below,

>    but not always.

> - Our comments are numbered in square brackets [nn].

> 

> We look forward to further discussion with you to find the best

> solution on these issues.

> 

> 

> [1] Status of this document: The 'last call' information should be

>      close to the start, not the end, of this section.

[AM]  OK.  Will fix in next version.

> 

> [2] 1.1 implementation defined:

>      "Possibly differing between implementations, but specified by the

> implementor for each particular implementation."

>      better: "Possibly differing between implementations, but specified and

>      documented by the implementor for each particular implementation."

[AM] Fixed.

> 

> [3] 1.2: *** re. anyAtomicType/untypedAtomic, see data model comments

[AM] In process with the XML Schema WG.  Ongoing.

> 

> [4] 1.2: ur types -> urtypes

[AM]  Wording has been changed.

> 

> [5] 1.2: "Diagram courtesy Asir Vedamuthu, webMethods and Jim Melton,

> Oracle"

>      Without any disrespect to these two hard-working gentlemen, this is the

>      first time we have seen such an in-place acknowledgement in a W3C spec.

>      It does not seem appropriate, in particular because the image is mostly

>      a copy from XML Schema. If necessary, an ack in the ack section should do.

[AM] Diagram changed.  No acknowledgement.

> 

> [6] 1.2: ***yearMonthDuration and dayTimeDuration:

>      glad to see that you are fixing

>      a well-known XML Schema problem.

[AM] Thanks!

> 

> [7] 1.3.2: ***the motivation for untypedAtomic seems unclear. There may be many

>      other cases where one wants to indicate that subtypes/derived types

>      are not acceptable. If this is considered important, then there should

>      be a general solution. See also our other comments on this type.

[AM] Ongoing discussion with the Schema WG.

> 

> [8] 1.4: ***Regarding associating Z with dates/times without timezone,
see

>     our comments on data model.

[AM] OK.

> 

> [9] 1.4: Please provide a pointer to the normative spec for the
mappings

>     (that there are not enough pointers, and it's often not clear what

>      is the normative specification of something, is a general issue

>      with the current set of specs)

[AM]  This, and the referenced text in the data model, is the normative
text.  Is this not formal enough?

> 

> [10] 1.6: This speaks about constructors. The issues list for the data

> model

>     says constructors are gone. This is confusing.

[AM]  I will send a note to Norm to fix.

> 

> [11] 1.6: "For most functions there is a paragraph describing what the

> function does followed by semantic rules. These rules are meant to be

> followed in the order that they appear in this document."

>      This seems to imply that some details in an earlier (in the spec)

>      function definition can influence some details in a later

>      (in the spec) definition. Please clarify.

[AM] Yes.  For example, the early rules often tell you what to do if
the argument is an empty sequence.  The later rules then assume that the
argument is not an empty sequence.

> 

> [12] 1.6: QName is defined in XML namespaces, not XML 1.0.

[AM] Fixed.

> 

> [13] 1.6, end: It would be good to add a small explanation about
multiple

>     parameters and sequences as parameters. (and that this is or is

>     not working the same way as in Perl)

[AM] Added issue.

> 

> [14] 1.7: "Are in the XML Schema namespace": Please always give the

>     namespace URI in such cases.

[AM] Done.

> 

> [15] 1.7: "datatypes described in this document": Are there any? Where
are

>       the descriptions?

[AM] Added references.

> 

> [16] 2.3: What is the difference between fn:string() and fn:string(as

> item?)?

>      If there is no item, then that's the same as the first variant,
or

> not?

[AM]  Added note to clarify the difference.

> 

> [17] 2.3: *** here and elsewhere: anyURI is not exactly a URI!

[AM] Fixed.

> 

> [18] 2.4: Why does this function work on sequences, but others don't?

>       This is a general issue.

[AM] There was a great deal of discussion about the semantics of
fn:data.  I can try and resurrect that if you wish.

> 

> [19] 2.5: What about the case that the base URI is found in the node's

>     grandparent, or even higher up?

[AM] I read the text as allowing recursive ascent.  I can clarify if you
think necessary.

> 

> [20] 2.5: What is 'static context'?

[AM] Added reference.

> 

> [21] 3. ***Error function: How can error messages be localized?

[AM]  Added issue.

> 

> [22] 4. .implementation defined. : These dots look strange.

[AM] Formatting.  Same convention used by XML Schema.

> 

> [23] 4. Ordering implementation defined: This should be that the
ordering

> is

>     according to execution order, which may be implementation defined.

>     Anything less diminishes the value of the error mechanism

> considerably.

[AM] Created an issue.

> 

> [24] 5.1: xs:TYP: shouldn't this be xs:TYPE?

[AM] Fixed

> 

> [25] 5.1: ***The return type of anyURI as QName seems strange. 17.14
does

> not

>       seem to explain why this special constructor appears in 5.1.

[AM]  That was a typo.  Fixed.  Sorry for the confusion.

> 

> [26] 5.2: Rules are defined in the same way: What rules exactly?

[AM] Clarified wording.

> 

> [27] 6.2: Subtype substitution and type promotion should be described

>     normatively in this spec.

[AM] This is a coordination item.  Sent mail to query-editors.

> 

> [28] 6.2.6: (a/b)*b+(a mod b) -> (a idiv b) * b + (a mod b)

[AM] Fixed
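
The corrected identity in [28] can be checked with a small sketch. This is Python rather than XQuery, and it assumes that idiv truncates toward zero and that mod takes the sign of the dividend; the helper names idiv and mod are illustrative only.

```python
import math


def idiv(a: int, b: int) -> int:
    """Integer division truncating toward zero (assumed idiv semantics)."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q


def mod(a: int, b: int) -> int:
    """Remainder with the sign of the dividend (assumed mod semantics)."""
    return int(math.fmod(a, b))


# The corrected form holds for every sign combination.
for a, b in [(7, 2), (-7, 2), (7, -2), (-7, -2)]:
    assert idiv(a, b) * b + mod(a, b) == a

# The original form with true division does not reproduce a.
not_a = (7 / 2) * 2 + mod(7, 2)
```

With a = 7 and b = 2 the uncorrected expression evaluates to 8.0 instead of 7, which is why the comment asks for idiv.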

> 

> [29] 6.2.7: Why do we need unary plus? If there is a need for a 'noop'

>     function, it may be better to have a general one.

[AM] Since there is a unary plus operator in the language, we need an
op: function to define its semantics, which turn out to be a no-op
in this case.

> 

> [30] 6.4: ***Rounding functions: We are not sure that floor, ceiling,

> round,

>      and 'round-half-to-even' cover the rounding conventions around

>      the world in a balanced way. We need to look into this in more

>      detail. (for some examples, please see

>      http://www.xencraft.com/resources/multi-currency.html#rounding).

>      Also, what is 'round-half-to-even' used for?

[AM] Added issue.  round-half-to-even is used in business because with
it the sum of a series of rounded numbers is closer to the sum of the
same series unrounded.
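
A small sketch of the summation argument, using Python's decimal module as a stand-in for the XPath rounding functions. The sample amounts are made up for illustration.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

# Hypothetical amounts that all sit exactly on a rounding boundary.
amounts = [Decimal("0.5"), Decimal("1.5"), Decimal("2.5"), Decimal("3.5")]


def rounded_sum(values, mode):
    """Round each value to an integer with the given mode, then add them up."""
    return sum(v.quantize(Decimal("1"), rounding=mode) for v in values)


exact = sum(amounts)                               # sum of the unrounded values
half_up = rounded_sum(amounts, ROUND_HALF_UP)      # 1 + 2 + 3 + 4
half_even = rounded_sum(amounts, ROUND_HALF_EVEN)  # 0 + 2 + 2 + 4
```

Here half-up rounding drifts upward on every tie, while half-to-even rounds ties alternately down and up, so its total matches the unrounded sum for this data.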


> 

> [31] 7.1: ***String types: It is very important to integrate

> anyType/anySimpleType

>     (and if kept, untypedAtomic) and text nodes here.

[AM] Ongoing discussion with Schema WG.

> 

> 

> [32] *** 7.1: "except for the range #xD800 to #xDFFF inclusive, which
is

> the range reserved for surrogate pairs"

>      "surrogate pairs" -> "surrogates"

[AM] Done!

> 

> [33] *** 7.1: What is the difference between 'code point' and
'character'?

>      There are a few code points that are not characters (e.g. #xFFFE
and

>      #xFFFF), but that can hardly be the point of having two
definitions.

[AM]  Both terms are used in the documents and so need to be defined.
The Schema WG feels we should not use 'character' because it means
different things to different people.

> 

> [34] *** 7.2.1 and 7.2.2: Please make sure that there are tests for

>      these functions that include surrogate pairs/codepoints above
#xFFFF.

[AM] Sent note to the NIST folks who are creating the tests.
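
A test case of the kind asked for in [34] might look like the following sketch. It is Python, not XQuery, and only illustrates the expected behaviour of a string/code point conversion when the string contains a character above #xFFFF.

```python
# U+10400 lies above #xFFFF and needs a surrogate pair in UTF-16.
s = "a\U00010400b"

# Analogue of converting a string to a sequence of code points.
codepoints = [ord(c) for c in s]

# Analogue of converting code points back into a string.
round_trip = "".join(chr(cp) for cp in codepoints)

# Number of 16-bit code units the same string occupies in UTF-16.
utf16_units = len(s.encode("utf-16-be")) // 2
```

The string has three code points but four UTF-16 code units, so an implementation that counts code units internally must not expose the two surrogates as separate items.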

> 

> [35] 7.3: It would be good to have a dedicated subsection 7.3 about

>      collations,..., and then section 7.4 with the actual functions.

>      (or any other structure that gives collation its own subsection)

[AM] Done!

> 

> [36] *** 7.3: "the comparisons are inherently performed according to
some

> collation (even if that collation is defined entirely on code point
values

> or on the binary representations of the characters of the string)"

>      are "code point values" and "binary representation" alternatives or

>      are they supposed to be the same? (the latter would not be true in

>      the case of UTF-16, which should be pointed out)

[AM] Wording changed.  'binary representation' removed. 
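
The UTF-16 point in [36] can be shown with a short sketch (Python, for illustration only): ordering by code point and ordering by UTF-16 binary representation disagree as soon as a supplementary character meets a high BMP character.

```python
a = "\uFF5E"      # U+FF5E, a BMP character above the surrogate range
b = "\U00010000"  # U+10000, the first supplementary code point

# Python compares strings code point by code point.
by_codepoint = sorted([b, a])

# Comparing the UTF-16 (big-endian) bytes instead: U+10000 becomes
# D800 DC00, which sorts before FF5E.
by_utf16 = sorted([a, b], key=lambda s: s.encode("utf-16-be"))
```

The two sort results are in opposite order, so "code point values" and "binary representation" are not interchangeable descriptions.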

> 

> [37] *** 7.3: "Strings can be compared character-by-character or in a

> logical manner, as defined by the collation."

>      This seems to imply that 'character-by-character' is not logical.

>      better change 'logical' to 'linguistically adequate' or so.

[AM] Changed to 'linguistically appropriate' manner.

> 

> [38] *** 7.3 "For alignment with the [Character Model for the World
Wide

> Web 1.0], applications may choose collations that treat unnormalized

> strings as though they were normalized (that is, that implicitly
normalize

> the strings)."

>      This is somewhat misleading, in that early uniform normalization

>      should avoid having to compare strings that differ only in

> normalization.

>      This should be reworded carefully. Please also point out that
using

>      highest collation strength does not imply string normalization.

[AM] Requested clarification

> 

> [39] *** 7.3: Using anyURIs to identify collations is the right thing
to

> do.

>       but there are several problems in how this is done in detail.

[AM] OK.

> 

> [40] *** 7.3: Having an anyURI to identify a collation by codepoint
order

>      (http://www.w3.org/2003/05/xpath-functions/collation/codepoint)
is

>      good. But there should at the minimum also be a predefined anyURI

>      for identifying the Unicode collation algorithm (without any
special

>      tailoring). Please note that this does not necessarily mean that implementations

>      have to support this algorithm (which we definitely would not object to),

>      but it at least means that different implementations that all

> implement

>      that algorithm can interoperate.

[AM] Added issue.

> 

> [41] 7.3: "The XQuery/XPath static context includes A provision"

[AM] Fixed

> 

> [42] *** 7.3: Rules for what collation to use: The current rules,
which

>      allow only one collation to be specified, raise an error if the

> collation

>      is not supported, and use anyURIs to identify collations without
any

>      mechanism for giving anyURIs to well-known collations, are bound
to

>      lead to interoperability problems. Collations should not be the
major

>      source of interoperability problems. With the current design,
even

>      vendors who want to be interoperable have no chance of doing so.

>      It will often be the case that e.g. a user wants just 'a French

>      collation'. How can this be indicated?

>      [we will send a separate mail about this issue to several lists

>       because we think that there is a potential for useful
coordination]

[AM] Created issue.

> 

> [43] *** 7.3: Context defines a single collation. But it may often be

>      desirable to have two 'default collations', in two different
senses:

>      a) different collations for matching (less precision) and sorting

>         (best precision possible)

>      b) different collations for internal operations and user-oriented

> operations

[AM] Created issue.

> 

> [44] *** 7.3: "There might be several ways in which a collation might
be

> specified in the XQuery/XPath static context. For example, XQuery
might

> provide syntax that specifies a default collation as part of the query

> prolog."

>      good ideas!

> 

> *** 7.3: The current proposal groups three different things into a

>      new concept called 'collation', which is different from what is

> usually

>      thought of as a collation. These are:

>      1) A collation in the traditional sense: every binary difference

> leads

>        to a non-equal match.

>      2) The use of different 'collations' to identify different
strengths

>        of the same collation (i.e. case-insensitive, accent-

> insensitive,...)

>      3) The use of 'collations' to identify character combinations
that

> serve

>        as single units in some respect in some languages (e.g. 'll'
and

> 'ch'

>        in traditional Spanish)

>      This has several problems:

>      [45] - The difference between 1) and 2), and the general use of

> highest

>             collation strength for sorting (to have deterministic
sorting)

>             and potentially lower for matching should be pointed out

> carefully.

>      [46] - Including 3) may make coordination with other efforts
somewhat

> difficult

>      [47] - Including 3) may make implementations somewhat difficult.

>             Although a given system (e.g. OS) may provide a range of

>             collations (in the 1) or 2) sense), the functionality in
3)

>             may not be available (e.g. via an API)

>      [48] - Including 3) may make specifications somewhat difficult.

>             There may be cases where it is unclear what the clusters

> should be.

>             For example, for French sorting with accents in the
reverse,

> are

>             the clusters the base letters followed by the accents? Or
is

> the

>             reverse consideration of accents ignored for this purpose?

>      [49] - This is way too important to be relegated to a note.

[AM] Created issue for items 45 to 49.

> 

> [50] *** 7.3: Using multiple-letter units for functions such as
'starts-

> with'

>      (independent of its definition via a collation), while
appropriate in

>      some cases, may not be that well established and tested. It
should be

>      carefully considered whether this functionality may not be too

> 'bleeding-edge',

>      and may not confuse users more than help, because there are too
many

>      cases where it is wrong, i.e. where character combinations are
taken

>      as single letters even if they are not, e.g. in foreign
loanwords,...

>      For example, a possible solution may be to use codepoint matching

> when

>      no collation is specified in the function rather than using the

> default

>      collation.

[AM] Created issue.

> 

> 

> [51] *** 7.3: "Some data management environments allow collations to
be

> associated with the definition of string items (that is, with the
metadata

> that describes items whose type is string). While such association may
be

> appropriate for use in environments in which data is held in a
repository

> tightly bound to its descriptive metadata, it is not appropriate in
the

> XML

> environment in which different documents being processed by a single
query

> may be described by differing schemas."

>      This is a very good point, but should also mention the fact that
a

> data-based

>      collation is not adequate for user needs.

> 

> 

> [52] *** 7.3: "Some data management environments allow collations to
be

> associated with the definition of string items (that is, with the
metadata

> that describes items whose type is string). While such association may
be

> appropriate for use in environments in which data is held in a
repository

> tightly bound to its descriptive metadata, it is not appropriate in
the

> XML

> environment in which different documents being processed by a single
query

> may be described by differing schemas."

>      This very much applies to sorting, but it looks somewhat out of
place

>      for 'character clusters' such as the traditional Spanish 'll' and

> 'ch',

>      because that seems to imply a linguistic unit, which means that it

>      would be wrong to claim that 'el' is not an initial substring of 'elle'

>      if the latter is not indeed Spanish.

> 

> [53] *** 7.3: Is there any case where e.g. element or attribute names
can

>      be treated as strings? How would they be sorted?

[MHK] No differently from any other strings. The way we choose a
collation is not sensitive to the origin of the strings.

> 

> [54] *** 7.3: "It is possible to define collations that do not have
this

> property, for example a collation that attempts to sort "ISO 8859"
before

> "ISO 10646", or "January" before "February". Such collations may fail,
or

> give unexpected results, when used with functions such as
fn:contains()."

>      "this property": What property? Why 'fail'? That 'January' does
not

> contain

>      'ary' would be a consequence of the definition, not a failure.
Maybe

> it

>      would be better to say that for fn:contains and friends, using
the

>      codepoint collation in most cases produces more predictable
results

> than

>      using a specific collation.

[AM] Issue.  Should system reject such a collation?

> 

> [55] *** 7.3.1.1, last example: While in the preceding example,
'equates'

>      is the right term, this is not necessary here. The only thing
that is

>      necessary is that the collation treats differences between 'ss'
and

>      sharp-s with less strength than differences in base letters (such
as

>      the final n). (there is also the case where the 'ss' <-> sharp-s

>      difference is at least as strong as the final 'n', and sharp-s <

> 'ss').

[AM] Done!

> 

> [56] *** 7.4 "The following functions are defined on these string
types."

>      which string types? As many as possible (up to anySimple and text

> node,

>      hopefully).

[AM] Under discussion.

> 

> [57] *** 7.4.1: Is there a concat operation that includes
normalization

>      splicing at the contact point? This would be very helpful, and

> ideally

>      should be the default, because this may be the most efficient way
to

>      maintain a certain normalization. The same applies to string-join

>      and potentially other operations.

[AM]  We decided not to do this, but instead to provide a function to
normalize the strings after the function has been applied.
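
To illustrate the issue behind [57], here is a sketch in Python using the standard unicodedata module. It shows that concatenating two NFC strings need not give an NFC string, and that normalizing after the operation (the approach described above) repairs it.

```python
import unicodedata

left = "e"
right = "\u0301x"  # begins with COMBINING ACUTE ACCENT

# Each operand is already in NFC on its own.
left_is_nfc = unicodedata.normalize("NFC", left) == left
right_is_nfc = unicodedata.normalize("NFC", right) == right

# Plain concatenation leaves "e" + combining accent at the contact point.
joined = left + right
joined_is_nfc = unicodedata.normalize("NFC", joined) == joined

# Normalizing after the concatenation composes the pair into U+00E9.
repaired = unicodedata.normalize("NFC", joined)
```

A concat that spliced normalization at the contact point would only need to look at the boundary, whereas normalizing afterwards rescans the whole result; that is the efficiency argument in the comment.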

> 

> [57] *** 7.4.12/13: What about other transforms, such as

> katakana<->hiragana,...

[AM] There are a large number of language-specific functions that we could
provide.  It's best to leave these to third-party libraries.

> 

> [58] *** 7.4.6: "The first character of a string is located

>      at position 1, not position 0."

>      It is a pity that XQuery was not able to fix this problem. It
would

>      be good if there were a special section at the start of 7.4, or

>      somewhere else in an adequate place, that clearly explained the

>      issue of 1 origin for strings. Also, rather than writing things

>      such as "beginning at position", it should always be "beginning at

>      character position", to make it easier for users to understand

>      that e.g. substrings are not identified by indicating boundary

>      positions between characters.

[AM] Inherited from XPath.

> 

> [59] 7.4.8/9: For continuity, fn:substring-before($string, "") should

>      return the empty string, and fn:substring-after($string, "")

>      should return $string. Rationale: The shorter a string is, the

>      earlier it generally matches. Thus, the empty string matches at

>      the start of a string.

[AM] Added issue.
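
The behaviour proposed in [59] can be sketched as follows. This is a Python model of the proposal, not the specified semantics of fn:substring-before or fn:substring-after; the function names are illustrative, and returning the empty string when there is no match is an assumption of this sketch.

```python
def substring_before(s: str, sub: str) -> str:
    """Part of s before the first occurrence of sub ('' if no match)."""
    i = s.find(sub)  # the empty string is found at index 0
    return s[:i] if i >= 0 else ""


def substring_after(s: str, sub: str) -> str:
    """Part of s after the first occurrence of sub ('' if no match)."""
    i = s.find(sub)
    return s[i + len(sub):] if i >= 0 else ""


before_empty = substring_before("tattoo", "")  # empty string matches at start
after_empty = substring_after("tattoo", "")    # so everything comes after it
```

Treating the empty string as matching at position 0 gives the continuity the comment asks for: before is empty and after is the whole string.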

> 

> [60] *** 7.4.11: There should be a reference to Unicode Standard Annex
#15

>      for the various normalization forms.

[AM] Added issue on whether to reference Unicode 4.0.

> 

> [61] *** 7.4.11: In the current Character Model, W3C normalization can
be

>      tested, but is not defined as a function. This probably can be
fixed

>      by specifying that the W3C normalization function first uses NFC,

>      and then prepends a space if the result is not yet in W3C

> normalization form.

> 

> [62] *** 7.4.11: In the light of the above, W3C normalization should
also

>      be made required to support.

[AM] Added issue.

> 

> [63] *** 7.4.11: What other normalization forms might be supported?

>      How would they be identified? How will the space of identifiers

>      be managed, to avoid conflicts?

[AM] We are not planning on supporting any other normalization forms.
If you feel strongly, please make suggestions.

> 

> [64] *** 7.4.12/13: The Unicode case mappings are now superseded by

>      Unicode 4, see http://www.unicode.org/unicode/reports/tr21/.

[AM] Added issue on whether to reference Unicode 4.0.

> 

> [65] *** 7.4.12/13: It should be pointed out that mappings can change

>      the length of a string, and that lower(upper(s)) == s is not

>      guaranteed, nor is upper(lower(s)) == s, and that these functions

>      may not always be linguistically appropriate (e.g. Turkish i

>      without dot) or appropriate for the application (e.g. titlecase).

>      It should be said that in cases such as Turkish, a simple translation

>      should be used first. It would be much better to have a uniform

>      approach for all languages, rather than having to special-case a few

>      languages.

[AM] Added note.
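
The points in [65] can be demonstrated with Python's built-in, locale-independent case mappings. The turkish_lower helper is hypothetical and only sketches the "simple translation first" idea from the comment.

```python
s = "Straße"
upper = s.upper()           # sharp s maps to "SS", so the string grows
round_trip = upper.lower()  # lower(upper(s)) does not restore the sharp s

# Default mapping of capital I gives dotted i, which is wrong for Turkish.
default_lower = "ISIK".lower()

# Hypothetical pre-translation for Turkish, applied before the default mapping.
TURKISH_LOWER = str.maketrans({"I": "\u0131", "\u0130": "i"})


def turkish_lower(text: str) -> str:
    """Translate the Turkish I/İ pair first, then apply the default mapping."""
    return text.translate(TURKISH_LOWER).lower()


turkish = turkish_lower("ISIK")
```

The length change and the failed round trip come straight from the Unicode mappings; the Turkish case shows why a language-neutral default cannot be linguistically right everywhere.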

> 

> [66] 7.4.15: string-pad seems to be the wrong name, because it
suggests

>      padding to a certain length, e.g. string-pad("XMLQuery", 15)
would

>      return "XMLQueryXMLQuer", i.e. 15 characters. Something like

>      'multiply' seems more adequate.

[AM] We recommend that this function be removed and moved to Appendix D.
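
The naming concern in [66] contrasts two readings, sketched here in Python with illustrative function names. The "repeat" reading (the count is a number of copies) is an assumption about the draft function; the "pad" reading is the one the comment says the name suggests.

```python
def repeat(s: str, count: int) -> str:
    """'Multiply' reading: concatenate count copies of s."""
    return s * count


def pad_to_length(s: str, length: int) -> str:
    """'Pad' reading: repeat s as needed and cut at exactly length characters."""
    if not s:
        return ""
    copies = -(-length // len(s))  # ceiling division
    return (s * copies)[:length]


as_repeat = repeat("XMLQuery", 2)
as_pad = pad_to_length("XMLQuery", 15)
```

The second result is the 15-character string quoted in the comment, which is what a reader would expect from a function called string-pad.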

> 

> [67] *** 7.4.16: It is unclear what these functions are supposed to do

>      with anyURIs/IRIs.

> 

> [68] *** 7.4.16: It is unclear what the case of $escape-reserved ==
false

>       is supposed to do. The example shows this as the identity
function,

>       which is probably not the point. Please give us a better
example,

>       so that we can understand what this is used for. It may be
better

>       ultimately to separate this into two functions, and remove the

>       $escape-reserved argument.

> 

> [69] 7.4.17: "The % character itself is escaped only if it is not
followed

>      by two hexadecimal digits": This seems dangerous. If "%20" is a
plain

>      payload, then the '%' has to be escaped, like everything else.

[AM] Added [67] - [69] to issue created by comment from Schema WG re.
escaping algorithm.
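
The danger described in [69] can be seen in a simplified model. The sketch below uses Python's urllib.parse; escape_conditionally is a hypothetical helper that mimics only the quoted rule ("% is escaped only if it is not followed by two hexadecimal digits") and is not the draft's full escaping algorithm.

```python
import re
from urllib.parse import quote, unquote

payload = "100%20 sure"  # plain text that happens to contain "%20"


def escape_conditionally(s: str) -> str:
    """Leave '%' alone when two hex digits follow; escape everything else."""
    out = []
    for i, ch in enumerate(s):
        if ch == "%" and re.match(r"[0-9A-Fa-f]{2}", s[i + 1:i + 3]):
            out.append(ch)
        else:
            out.append(quote(ch, safe=""))
    return "".join(out)


always = quote(payload, safe="")             # '%' is always escaped
conditional = escape_conditionally(payload)  # '%20' passes through untouched

always_back = unquote(always)
conditional_back = unquote(conditional)
```

Escaping every "%" round-trips the payload, while the conditional rule turns the literal "%20" into a space on decoding, so the original text is lost.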

> 

> [70] 7.4.16: Gopher examples are very outdated. We suggest replacement

>      with something more people are familiar with.

[AM] Changed examples.

> 

> [71] 7.5.1: 'reluctant' quantifiers: It may be better to use the Perl

>      term 'minimal'.

[AM] Created issue.

> 

> [72] 7.5.1: "In the absence of these quantifiers, the regular
expression

> matches the longest possible substring."

>      no, not in the absence of these quantifiers, but if the other

> (maximal)

>      set of quantifiers is used ('*?' is a quantifier, but in '*?',
'?' is

>      not a quantifier)

[AM]  This has caused some confusion.  Wording changed.
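
The distinction in [72] between the maximal quantifiers and their reluctant counterparts, shown with Python's re module as a stand-in for the XPath regex functions:

```python
import re

text = "<a><b>"

# '*' is a maximal (greedy) quantifier: it takes the longest possible match.
greedy = re.search(r"<.*>", text).group()

# '*?' is a single reluctant quantifier: it takes the shortest possible match.
reluctant = re.search(r"<.*?>", text).group()
```

Both patterns contain a quantifier; what differs is which set is used, which is the wording problem the comment points out.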

> 

> [73] *** 7.5.2: "Regular expression matching is defined on the basis
of

> Unicode code-points; it takes no account of collations."

>      It seems somewhat inadequate that fn:contains and friends use

> collations,

>      but regular expressions don't. But we are not sure which way the right

>      solution is.

[AM] We followed the XML Schema regex design which is collation
independent.  Mike Kay says "We are not aware of any implementation of
regular expressions that is collation-sensitive, and we believe that the
two ideas are probably irreconcilable since they treat strings at
different levels of abstraction."

> 

> [74] 7.5.3: "An error is raised ("Invalid replacement string") if the

> value

> of $replacement contains a "$" character that is not immediately
followed

> by a digit 1-9 and not immediately preceded by a "/"."

>       "/" -> "\"

[AM] Done!

> 

> [75] *** 7.5.3: We have been told that there is a provision to replace

>      character strings with markup, but this is not discussed here.

>      We want to make sure this is available, because it is relevant

>      to get people away from using the PUA.

[MHK] There are mechanisms in XSLT 2.0 (xsl:analyze-string) to achieve
this, but we decided that it requires mechanisms beyond what can be
achieved in a single function call.

> 

> [76] 7.5.3: Why does fn:replace not provide for single replacements?

[MHK] You mean, just replace the first occurrence? I think you can do
this by supplying a pattern that matches the whole string, and then
using variables:

replace($s, "^(.*?)A(.*)$", "$1B$2") 

replaces the first A with a B. Perhaps we should add this example. 

[AM] Added example. 
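
The same pattern can be tried outside XPath. The sketch below ports it to Python's re module, where group references in the replacement are written \1 and \2 instead of $1 and $2; the function name is illustrative.

```python
import re


def replace_first_a(s: str) -> str:
    """Replace only the first 'A' by matching the whole string around it."""
    return re.sub(r"^(.*?)A(.*)$", r"\1B\2", s)


result = replace_first_a("BANANA")
unchanged = replace_first_a("XYZ")  # no 'A', so the pattern does not match
```

The reluctant (.*?) keeps the prefix as short as possible, so the A it stops at is the first one; the greedy (.*) then carries the rest of the string through unchanged.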

 

> 

> [77] 8.2.2/3: Boolean less-than and greater-than seem strange. On the

> other

>      hand, 'or' and 'and' seem to be badly missing.

[AM] Created issue.

> 

> 

> 

> This concludes the first part of these comments. We will send the

> rest of our comments tomorrow (planned around 3pm EDT July 8th).

> 

> 

> Regards,    Martin.

>
Received on Tuesday, 2 September 2003 07:56:14 UTC