- From: Ashok Malhotra <ashokma@microsoft.com>
- Date: Mon, 6 Oct 2003 15:57:21 -0700
- To: "Martin Duerst" <duerst@w3.org>, <public-qt-comments@w3.org>
- Cc: <w3c-i18n-ig@w3.org>
Thank you for your comment [54] on Characters and Collation Units.
We have clarified the wording and moved all the functions that use this
special feature of collations into a special section. We had added an
optional error message that a system can invoke if it finds that a
collation is unsuitable for substring matching.
All the best, Ashok
> -----Original Message-----
> From: public-qt-comments-request@w3.org [mailto:public-qt-comments-
> request@w3.org] On Behalf Of Martin Duerst
> Sent: Monday, July 07, 2003 2:06 PM
> To: public-qt-comments@w3.org
> Cc: w3c-i18n-ig@w3.org
> Subject: I18N last call comments on XQuery/XPath Fun/Op (first part)
>
>
> Dear XML Query WG and XSL WG,
>
> Below please find the I18N WGs comments on your last call document
> "XQuery 1.0 and XPath 2.0 Functions and Operators"
> (http://www.w3.org/TR/2003/WD-xpath-functions-20030502/).
>
> Please note the following:
> - Please address all replies to these comments to the I18N IG mailing
> list (w3c-i18n-ig@w3.org), not just to me.
> - All i18n-relevant comments are marked with ***. There are also
general
> comments on the spec which we hope you will find useful.
> - We have not yet reviewed the other documents, such as XQuery 1.0
> or XSLT 2.0, and so we might be unaware of i18n issues that appear
> in these specs but may have to be traced back to functions and
> operators.
> There are also cases where we have identified an i18n issue here,
> but we are not sure exactly what the best solution will be, and
which
> document it will have to be addressed in. Also, there are issues
that
> have been raised in comments to you about a different document but
> that apply to this document, too. Sometimes, this is mentioned
below,
> but not always.
> - Our comments are numbered in square brackets [nn].
>
> We look forward to further discussion with you to find the best
> solution on these issues.
>
>
> [1] Status of this document: The 'last call' information should be
> close to the start, not the end, of this section.
>
> [2] 1.1 implementation defined:
> "Possibly differing between implementations, but specified by the
> implementor for each particular implementation."
> better: "Possibly differing between implementations, but
specified
> and
> documented by the implementor for each particular implementation."
>
> [3] 1.2: *** re. anyAtomicType/untypedAtomic, see data model comments
>
> [4] 1.2: ur types -> urtypes
>
> [5] 1.2: "Diagram courtesy Asir Vedamuthu, webMethods and Jim Melton,
> Oracle"
> Without any disrespect to these two hard-working gentlemen, this
is
> the
> first time we have seen such an in-place acknowledgement in a W3C
> spec,
> it does not seem appropriate, in particular because the image is
> mostly
> a copy from XML Schema. If necessary, an ack in the ack section
> should do.
>
> [6] 1.2: ***yearMonthDuration and dayTimeDuration:
> glad to see that you are fixing
> a well-known XML Schema problem.
>
> [7] 1.3.2: ***the motivation for untypedAtomic seems unclear. There
may be
> many
> other cases where one wants to indicate that subtypes/derived
types
> are not acceptable. If this is considered important, then there
> should
> be a general solution. See also our other comments on this type.
>
> [8] 1.4: ***Regarding associating Z with dates/times without timezone,
see
> our comments on data model.
>
> [9] 1.4: Please provide a pointer to the normative spec for the
mappings
> (that there are not enough pointers, and it's often not clear what
> is the normative specification of something, is a general issue
> with the current set of specs)
>
> [10] 1.6: This speaks about constructors. The issues list for the data
> model
> says constructors are gone. This is confusing.
>
> [11] 1.6: "For most functions there is a paragraph describing what the
> function does followed by semantic rules. These rules are meant to be
> followed in the order that they appear in this document."
> This seems to imply that some details in an earlier (in the spec)
> function definition can influence some details in a later
> (in the spec) definition. Please clarify.
>
> [12] 1.6: Qname is defined in XML namespaces, not XML 1.0.
>
> [13] 1.6, end: It would be good to add a small explanation about
multiple
> parameters and sequences as parameters. (and that this is or is
> not working the same way as in Perl)
>
> [14] 1.7: "Are in the XML Schema namespace": Please always give the
> namespace URI in such cases.
>
> [15] 1.7: "datatypes described in this document": Are there any? Where
are
> the descriptions?
>
> [16] 2.3: What is the difference between fn:string() and fn:string(as
> item?)?
> If there is no item, then that's the same as the first variant,
or
> not?
>
> [17] 2.3: *** here and elsewhere: anyURI is not exactly an URI!
>
> [18] 2.4: Why does this function work on sequences, but others don't?
> This is a general issue.
>
> [19] 2.5: What about the case that the base URI is found in the node's
> grandparent, or even higher up?
>
> [20] 2.5: What is 'static context'?
>
> [21] 3. ***Error function: How can error messages be localized?
>
> [22] 4. .implementation defined. : These dots look strange.
>
> [23] 4. Ordering implementation defined: This should be that the
ordering
> is
> according to execution order, which may be implementation defined.
> Anything less diminishes the value of the error mechanism
> considerably.
>
> [24] 5.1: xs:TYP: shouldn't this be xs:TYPE?
>
> [25] 5.1: ***The return type of anyURI as QName seems strange. 17.14
does
> not
> seem to explain why this special constructor appears in 5.1.
>
> [26] 5.2: Rules are defined in the same way: What rules exactly?
>
> [27] 6.2: Subtype substitution and type promotion should be described
> normatively in this spec.
>
> [28] 6.2.6: (a/b)*b+(a mod b) -> (a idiv b) * b + (a mod b)
>
> [29] 6.2.7: Why do we need unary plus? If there is a need for a 'noop'
> function, it may be better to have a general one.
>
> [30] 6.4: ***Rounding functions: We are not sure that floor, ceiling,
> round,
> and 'round-half-to-even' cover the rounding conventions around
> the world in a balanced way. We need to look into this in more
> detail. (for some examples, please see
> http://www.xencraft.com/resources/multi-currency.html#rounding).
> Also, what is 'round-half-to-even' used for?
>
> [31] 7.1: ***String types: It is very important to integrate
> anyType/anySimpleType
> (and if kept, untypedAtomic) and text nodes here.
>
>
> [32] *** 7.1: "except for the range #xD800 to #xDFFF inclusive, which
is
> the range reserved for surrogate pairs"
> "surrogate pairs" -> "surrogates"
>
> [33] *** 7.1: What is the difference between 'code point' and
'character'?
> There are a few code points that are not characters (e.g. #xFFFE
and
> #xFFFF), but that can hardly be the point of having two
definitions.
>
> [34] *** 7.2.1 and 7.2.2: Please make sure that there are tests for
> these functions that include surrogate pairs/codepoints above
#xFFFF.
>
> [35] 7.3: It would be good to have a dedicated subsection 7.3 about
> collations,..., and then section 7.4 with the actual functions.
> (or any other structure that gives collation it's own subsection)
>
> [36] *** 7.3: "the comparisons are inherently performed according to
some
> collation (even if that collation is defined entirely on code point
values
> or on the binary representations of the characters of the string)"
> are "code point values" and "binary representation" alternatives
or
> are they supposed to be the same? (the later would not be true in
> the case of UTF-16, which should be pointed out)
>
> [37] *** 7.3: "Strings can be compared character-by-character or in a
> logical manner, as defined by the collation."
> This seems to imply that 'character-by-character' is not logical.
> better change 'logical' to 'linguistically adequate' or so.
>
> [38] *** 7.3 "For alignment with the [Character Model for the World
Wide
> Web 1.0], applications may choose collations that treat unnormalized
> strings as though they were normalized (that is, that implicitly
normalize
> the strings)."
> This is somewhat misleading, in that early uniform normalization
> should avoid having to compare strings that differ only in
> normalization.
> This should be reworded carefully. Please also point out that
using
> highest collation strength does not imply string normalization.
>
> [39] *** 7.3: Using anyURIs to identify collations is the right thing
to
> do.
> but there are several problems in how this is done in detail.
>
> [40] *** 7.3: Having an anyURI to identify a collation by codepoint
order
> (http://www.w3.org/2003/05/xpath-functions/collation/codepoint)
is
> good. But there should at the minimum also be a predefined anyURI
> for identifying the Unicode collation algorithm (without any
special
> tailoring). Please note that this not necessarily means that
> implementations
> have to support this algorithm (which we definitely would not
object
> to),
> but it at least means that different implementations that all
> implement
> that algorithm can interoperate.
>
> [41] 7.3: "The XQuery/XPath static context includes A provision"
>
> [42] *** 7.3: Rules for what collation to use: The current rules,
which
> allow only one collation to be specified, raise an error if the
> collation
> is not supported, and use anyURIs to identify collations without
any
> mechanism for giving anyURIs to well-known collations, are bound
to
> lead to interoperability problems. Collations should not be the
major
> source of interoperability problems. With the current design,
even
> vendors who want to be interoperable have no chance of doing so.
> It will often be the case that e.g. a user wants just 'a French
> collation'. How can this be indicated.
> [we will send a separate mail about this issue to several lists
> because we think that there is a potential for useful
coordination]
>
> [43] *** 7.3: Context defines a single collation. But it may often be
> desirable to have two 'default collations', in two different
senses:
> a) different collations for matching (less precision) and sorting
> (best precision possible)
> b) different collations for internal operations and user-oriented
> operations
>
> [44] *** 7.3: "There might be several ways in which a collation might
be
> specified in the XQuery/XPath static context. For example, XQuery
might
> provide syntax that specifies a default collation as part of the query
> prolog."
> good ideas!
>
> *** 7.3: The current proposal groups three different things into a
> new concept called 'collation', which is different from what is
> usually
> thought of as a collation. These are:
> 1) A collation in the traditional sense: every binary difference
> leads
> to a non-equal match.
> 2) The use of different 'collations' to identify different
strengths
> of the same collation (i.e. case-insensitive, accent-
> insensitive,...)
> 3) The use of 'collations' to identify character combinations
that
> serve
> as single units in some respect in some languages (e.g. 'll'
and
> 'ch'
> in traditional Spanish)
> This has several problems:
> [45] - The difference between 1) and 2), and the general use of
> highest
> collation strength for sorting (to have deterministic
sorting)
> and potentially lower for matching should be pointed out
> carefully.
> [46] - Including 3) may make coordination with other efforts
somewhat
> difficult
> [47] - Including 3) may make implementations somewhat difficult.
> Although a given system (e.g. OS) may provide a range of
> collations (in the 1) or 2) sense), the functionality in
3)
> may not be available (e.g. via an API)
> [48] - Including 3) may make specifications somewhat difficult.
> There may be cases where it is unclear what the clusters
> should be.
> For example, for French sorting with accents in the
reverse,
> are
> the clusters the base letters followed by the accents? Or
is
> the
> reverse consideration of accents ignored for this purpose?
> [49] - This is way too important to be relegated to a note.
>
> [50] *** 7.3: Using multiple-letter units for functions such as
'starts-
> with'
> (independent of its definition via a collation), while
appropriate in
> some cases, may not be that well established and tested. It
should be
> carefully considered whether this functionality may not be too
> 'bleeding-edge',
> and may not confuse users more than help, because there are too
many
> cases where it is wrong, i.e. where character combinations are
taken
> as single letters even if they are not, e.g. in foreign
loanwords,...
> For example, a possible solution may be to use codepoint matching
> when
> no collation is specified in the function rather than using the
> default
> collation.
>
>
> [51] *** 7.3: "Some data management environments allow collations to
be
> associated with the definition of string items (that is, with the
metadata
> that describes items whose type is string). While such association may
be
> appropriate for use in environments in which data is held in a
repository
> tightly bound to its descriptive metadata, it is not appropriate in
the
> XML
> environment in which different documents being processed by a single
query
> may be described by differing schemas."
> This is a very good point, but should also mention the fact that
a
> data-based
> collation is not adequate for user needs.
>
>
> [52] *** 7.3: "Some data management environments allow collations to
be
> associated with the definition of string items (that is, with the
metadata
> that describes items whose type is string). While such association may
be
> appropriate for use in environments in which data is held in a
repository
> tightly bound to its descriptive metadata, it is not appropriate in
the
> XML
> environment in which different documents being processed by a single
query
> may be described by differing schemas."
> This very much applies to sorting, but it looks somewhat out of
place
> for 'character clusters' such as the traditional Spanish 'll' and
> 'ch',
> because that seems to imply a linguistic unit, which means that
it
> would be wrong to claim that 'el' is not an initial substring of
> 'elle'
> if the later is not indeed Spanish.
>
> [53] *** 7.3: Is there any case where e.g. element or attribute names
can
> be treated as strings? How would they be sorted?
>
> [54] *** 7.3: "It is possible to define collations that do not have
this
> property, for example a collation that attempts to sort "ISO 8859"
before
> "ISO 10646", or "January" before "February". Such collations may fail,
or
> give unexpected results, when used with functions such as
fn:contains()."
> "this property": What property? Why 'fail'? That 'January' does
not
> contain
> 'ary' would be a consequence of the definition, not a failure.
Maybe
> it
> would be better to say that for fn:contains and friends, using
the
> codepoint collation in most cases produces more predictible
results
> than
> using a specific collation.
>
> [55] *** 7.3.1.1, last example: While in the preceeding example,
'equates'
> is the right term, this is not necessary here. The only thing
that is
> necessary is that the collation treats differences between 'ss'
and
> sharp-s with less strength than differences in base letters (such
as
> the final n). (there is also the case where the 'ss' <-> sharp-s
> difference is at least as strong as the final 'n', and sharp-s <
> 'ss').
>
> [56] *** 7.4 "The following functions are defined on these string
types."
> which string types? As many as possible (up to anySimple and text
> node,
> hopefully).
>
> [57] *** 7.4.1: Is there a concat operation that includes
normalization
> splicing at the contact point? This would be very helpful, and
> ideally
> should be the default, because this may be the most efficient way
to
> maintain a certain normalization. The same applies to string-join
> and potentially other operations.
>
> [57] *** 7.4.12/13: What about other transforms, such as
> katakana<->hiragana,...
>
> [58] *** 7.4.6: "The first character of a string is located
> at position 1, not position 0."
> It is a pity that XQuery was not able to fix this problem. It
would
> be good if there were a special section at the start of 7.4, or
> somewhere else in an adequate place, that clearly explained the
> issue of 1 origin for strings. Also, rather than writing things
> such "beginning at position", it should always be "beginning at
> character position", to make it easier for users to understand
> that e.g. substrings are not identified by indicating boundary
> positions between characters.
>
> [59] 7.4.8/9: For continuity, fn:substring-before($string, "") should
> return the empty string, and fn:substring-after($string, "")
> should return $string. Rationale: The shorter a string is, the
> earlier it generally matches. Thus, the empty string matches at
> the start of a string.
>
> [60] *** 7.4.11: There should be a reference to Unicode Standard Annex
#15
> for the various normalization forms.
>
> [61] *** 7.4.11: In the current Character Model, W3C normalization can
be
> tested, but is not defined as a function. This probably can be
fixed
> by specifying that the W3C normalization function first uses NFC,
> and then prepends a space if the result is not yet in W3C
> normalization form.
>
> [62] *** 7.4.11: In the light of the above, W3C normalization should
also
> be made required to support.
>
> [63] *** 7.4.11: What other normalization forms might be supported?
> How would they be identified? How will the space of identifiers
> be managed, to avoid conflicts?
>
> [64] *** 7.4.12/13: The Unicode case mappings are now superseeded by
> Unicode 4, see http://www.unicode.org/unicode/reports/tr21/.
>
> [65] *** 7.4.12/13: It should be pointed out that mappings can change
> the length of a string, and that lower(upper(s)) == s is not
> guaranteed, nor is upper(lower(s)) == s, and that these functions
> may not always be linguistically appropriate (e.g. Turkish i
> without dot) or appropriate for the application (e.g. titlecase).
> It should be said that in cases such as Turkish, a simple
translation
> should be used first. It would be much better to have an uniform
> approach for all languages, rather than having to special-case a
few
> languages.
>
> [66] 7.4.15: string-pad seems to be the wrong name, because it
suggests
> padding to a certain length, e.g. string-pad("XMLQuery", 15)
would
> return "XMLQueryXMLQuer", i.e. 15 characters. Something like
> 'multiply' seems more adequate.
>
> [67] *** 7.4.16: It is unclear what these functions are supposed to do
> with anyURIs/IRIs.
>
> [68] *** 7.4.16: It is unclear what the case of $escape-reserved ==
false
> is supposed to do. The example shows this as the identity
function,
> which is probably not the point. Please give us a better
example,
> so that we can understand what this is used for. It may be
better
> ultimately to separate this into two functions, and remove the
> $escape-reserved argument.
>
> [69] 7.4.17: "The % character itself is escaped only if it is not
followed
> by two hexadecimal digits": This seems dangerous. If "%20" is a
plain
> payload, then the '%' has to be escaped, like everything else.
>
> [70] 7.4.16: Gopher examples are very outdated. We suggest replacement
> with something more people are familiar with.
>
> [71] 7.5.1: 'reluctant' quantifiers: It may be better to use the Perl
> term 'minimal'.
>
> [72] 7.5.1: "In the absence of these quantifiers, the regular
expression
> matches the longest possible substring."
> no, not in the absence of these quantifiers, but if the other
> (maximal)
> set of quantifiers is used ('*?' is a quantifier, but in '*?',
'?' is
> not a quantifer)
>
> [73] *** 7.5.2: "Regular expression matching is defined on the basis
of
> Unicode code-points; it takes no account of collations."
> It seems somewhat inadequate that fn:contains and friends use
> collations,
> but regular expressions don't. But we are not sure which way the
> right
> solutions is.
>
> [74] 7.5.3: "An error is raised ("Invalid replacement string") if the
> value
> of $replacement contains a "$" character that is not immediately
followed
> by a digit 1-9 and not immediately preceded by a "/"."
> "/" -> "\"
>
> [75] *** 7.5.3: We have been told that there is a provision to replace
> character strings with markup, but this is not discussed here.
> We want to make sure this is available, because it is relevant
> to get people away from using the PUA.
>
> [76] 7.5.3: Why does fn:replace not provide for single replacements?
>
> [77] 8.2.2/3: Boolean less-than and greater-than seem strange. On the
> other
> hand, 'or' and 'and' seem to be badly missing.
>
>
>
> This concludes the first part of these comments. We will send the
> rest of our comments tomorrow (planned around 3pm EDT July 8th).
>
>
> Regards, Martin.
>
Received on Monday, 6 October 2003 18:57:56 UTC