- From: C. M. Sperberg-McQueen <cmsmcq@acm.org>
- Date: Fri, 01 Aug 2003 20:54:59 -0600
- To: public-qt-comments@w3.org
- Cc: W3C XML Schema IG <w3c-xml-schema-ig@w3.org>
Dear colleagues: The XML Schema Working Group congratulates the XML Query and XSL Working Groups on their progress, and in particular on the Last Call draft of "XQuery 1.0 and XPath 2.0 Functions and Operators". We have not been able to review the last call draft in as much detail as we would have liked, but for what they are worth our comments are at http://www.w3.org/XML/Group/2003/07/xmlschema-fo-comments.html (an ASCII version is reproduced below for the convenience of those with access to their email but not to the Web). We apologize for the tardy arrival of these notes. -C. M. Sperberg-McQueen, for the W3C XML Schema WG ................................................................ [1]W3C [2]Architecture Domain [3]XML | [4]XML Schema | [5]Member Events | [6]Member-Confidential! W3C XML Schema WG Notes on XQuery 1.0 and XPath 2.0 Functions and Operators 1 August 2003 _________________________________________________________________ * 1. [7]Schema-related issues + 1.1. [8]Alignment of date/time values + 1.2. [9]The type anyAtomicType + 1.3. [10]The type untypedAtomic + 1.4. [11]Alignment on strings and URIs + 1.5. [12]Whitespace handling and lexical forms + 1.6. [13]Negative zero + 1.7. [14]Totally ordered Booleans * 2. [15]Other technical issues + 2.1. [16]The fn:base-uri property + 2.2. [17]Alignment of references + 2.3. [18]Characters and collation units + 2.4. [19]Surrogate pairs and Unicode scalar values + 2.5. [20]Definition of whitespace + 2.6. [21]Required normalization functionality + 2.7. [22]Case folding + 2.8. [23]Escaping URIs + 2.9. [24]The binary types + 2.10. [25]Minor items o 2.10.1. [26]User control of collations o 2.10.2. [27]Section 7.3.1.1 Examples * 3. [28]Editorial notes _________________________________________________________________ This document contains comments on the Last Call draft of 2 May 2003 of XQuery 1.0 and XPath 2.0 Functions and Operators transmitted to the XML Query and XSL Working Groups on behalf of the XML Schema Working Group. These draft comments have not been reviewed by the XML Schema Working Group and do not necessarily command consensus within the group; because we will not meet again until 28 August, the Working Group directed at its meeting today that these notes should be transmitted to the XML Query and XSL Working Groups without awaiting review. In addition to the comments below, please note that several of the [29]general comments sent on 14 July relate to the functions and operators and data model specifications. Some of those comments sent earlier overlap with some comments below. 1. Schema-related issues The comments in this section relate to the use of XML Schema in the F/O specification and thus to the particular area of responsibility borne by the XML Schema WG. 1.1. Alignment of date/time values The provision for preserving timezone information in the values of xs:dateTime, xs:date, and xs:time continues to concern us. We believe that a discrepancy of this kind between F/O and XML Schema will hurt users and impede uptake of both specifications. We believe F/O and XML Schema need to align on this, either by F/O changing to the XML Schema value space, or by changing the value space as part of XML Schema 1.1, or by some other mutually agreed upon solution. 1.2. The type anyAtomicType We reiterate our concern over the introduction of anyAtomicType into the type hierarchy. We believe that a discrepancy of this kind between F/O and XML Schema will hurt users and impede uptake of both specifications. We believe F/O and XML Schema need to align on this, either by F/O aligning with XML Schema 1.0 or by XML Schema 1.1 aligning with F/O. 1.3. The type untypedAtomic We reiterate our concern over the introduction of untypedAtomic into the type hierarchy. As with the other discrepancies, we believe alignment of the QT specs and XML Schema is critically important. Section 1.3.2 says xdt:untypedAtomic is used wherever the PSVI has xs:anySimpleType; please note that in the PSVI, this will be the case * when the element or attribute in question was declared as having type anySimpleType * when the attribute in question had no declaration and the schema processor assumed the simple urtype for it in the course of lax validation or error recovery Note that elements will not be assigned the anySimpleType as their type property in the course of lax validation or error recovery; they will have xs:anyType instead. Your use of xdt:untypedAtomic for xs:anySimpleType but not for elements which (a) lack child elements and (b) are assigned to xs:anyType may lead to results which puzzle some of your users; we believe you may wish to consider changing your mapping rules to assign xsd:untypedAtomic to such elements. 1.4. Alignment on strings and URIs The table at the beginning of section 2, Accessors, shows functions which are intended (judging by their names) to return URIs and which return values of type xs:string instead of xs:anyURI. Similarly, various functions which accept URIs as arguments are given signatures using xs:string as the type, which in turn necessitates ad hoc rules of the form "If $collationLiteral is not in the lexical space of xs:anyURI, an error is raised". As you know from our inquiry to you in mid-July, it has been suggested that in XML Schema 1.1 the xs:anyURI type be made a restriction of xs:string. But for now, there appears to be a discrepancy between the use of strings to represent URIs here and the provision of a distinct (and, for typing purposes, disjoint) type in XML Schema 1.0. We need to align on this. 1.5. Whitespace handling and lexical forms In section 5.1, paragraph 4 reads in part: "If the argument to a constructor function is a string literal, the literal must be a valid lexical form for its type ... Whitespace normalization is applied before validation ..." In all the cases which immediately come to mind, if the argument is a valid lexical form for a type, there is no need to perform any whitespace normalization on it. In XML Schema, it is the result of whitespace normalization, not the input to it, which must be a legal lexical form; we believe readers will be less confused if your usage of the terms and ours is consistent. A possible rewording: "If the argument to a constructor function is a string literal, then whitespace normalization is applied as indicated by the whitespace facet for the datatype. The whitespace-normalized string must be a valid lexical form for the type, as specified ..." 1.6. Negative zero In section 6, a note explains that the value space of xs:float and xs:double has been extended vis-à-vis that given by XML Schema, to include a negative zero. The note also explains that the negative zero will "never be obtained from the typed value of a node." We believe this discrepancy is untenable, and we are not clear why it has proven necessary to introduce it. As far as we can tell by examining the specification, the spec mentions different treatment for positive and negative zero only for the functions described in section 6.4 (fn:floor, fn:ceiling, fn:round, and fn:round-half-to-even): in the description of each of these functions it is noted that if a zero is given to the function as an argument, the sign of the zero returned as the value of the function is the same as the sign of the zero passed in as an argument. (The discussion of fn:ceiling mentions other cases when negative zero is returned; the discussion of fn:floor passes over the analogous cases in silence.) Other mentions of the signed zeroes in this specification invariably specify either that something is true both for positive and for negative zero or else that a constructor may return either a positive or a negative zero. Could you explain the motive for introducing this discrepancy with the value space defined in XML Schema? Would it not suffice to observe that IEEE 754 has both positive and negative zeroes, which are treated as different machine representations of the same values in the xs:float and xs:double value spaces, and (optionally) that the prose occasionally mentions these distinct representations of zero in the interests of alignment with IEEE 754, even though formally they are the same value? Is it essential to introduce an incompatibility with XML Schema here instead of treating positive and negative zeroes as one value with two machine representations? 1.7. Totally ordered Booleans We do not believe that it makes sense to impose a user-visible ordering on the Boolean data type. Can you explain the rationale? This is a discrepancy between F/O and XML Schema which must, we believe, be aligned. 2. Other technical issues The comments in this section relate to technical issues other than the use of XML Schema in the F/O specification; the XML Schema WG claims no particular responsibility or expertise on these questions but raises them because they seem to need attention. 2.1. The fn:base-uri property In section 2.5, the first paragraph defines a base-uri property for all node types: "Document, element and processing-instruction nodes have a base-uri property.... The base-uri of all other node types is the empty sequence." The next paragraph begins by explaining what happens "If the accessor is called on a node that does not have a base-uri property ..." If all nodes have the property, how can such a node exist? 2.2. Alignment of references XML Schema and the Functions and Operators spec should refer to the same version of Unicode. At the moment, this appears not to be true. 2.3. Characters and collation units The discussion of collation units in the second note of section 7.3 says that collation decomposes a string "into a sequence of units, each unit consisting of one or more characters", and that various comparison operations are performed on these units. The functions fn:starts-with, fn:ends-with, fn:substring-before, and fn:substring-after are all mentioned as operating on such a segmented string. The list of functions at the beginning of section 7.4, however, describes them as operating on characters, not on the nameless collation units consisting of one or more characters each. This looks like a contradiction. We believe that the general level of confusion is best minimized, and the world becomes a better place, if in XML-related specifications the word character is used always and only for the units of the Universal Character Set defined by Unicode and by ISO 10646. The word should not be used (however great the temptation becomes at times) to denote the culturally specific units of writing systems (e.g. letters, symbols, signs, graphemes, or what have you). We suggest recasting the descriptions in 7.4 to describe the effect of the functions in terms of the collation units, rather than in terms of characters. In order to avoid repeating the phrase "the nameless units of one or more characters into which a collation segments a string for purposes of comparison", you may wish to define the term letter, grapheme, collation unit, or thingy with that meaning. 2.4. Surrogate pairs and Unicode scalar values Section 7.4.6 (like some others) has a note calling attention to the fact that some implementations will represent characters with code points higher than xFFFF by using surrogate pairs. You quite correctly avoid using the term code point for the things which make up the surrogate pair, since in section 7.1 you have defined code point as excluding surrogates. But the term 16-bit values is not defined, as far as we can tell. Also, in Unicode 2 and 3 there are (as far as we have been able to tell) no rules that forbid a double encoding of characters outside the Basic Multilingual Plane (i.e. first representing them within the BMP as surrogate pairs, and then encoding the sequence of BMP items in UTF-8). Even if it is discouraged (and it is indeed outlawed in Unicode 4.0), surrogate pairs might well show up not only in UTF-16 but also in UTF-8, where they will presumably be presented by Unicode-oblivious character libraries not as pairs of 16-bit values but as four-octet sequences whose intepretation in terms of Unicode scalar values requires slightly special rules. Note that the definition of code points given in section 7.1 agrees with the definition of Unicode scalar values in Unicode 4.0 in excluding the surrogate range, but not with Unicode 2.0 (the version cited in your normative references), or Unicode 3, which define a Unicode scalar value as "a number N from 0 to 10FFF[16]", without leaving any gap for the surrogates. 2.5. Definition of whitespace Section 7.4.10 defines the function fn:normalize-space as doing various things to whitespace, but it does not define the term whitespace. It should, since various definitions are possible. The Unicode character database, for example, lists the following Unicode characters as whitespace in the file PropList-3_1_0.txt: * 0009..000D ; White_space # Cc [5] <control>..<control> * 0020 ; White_space # Zs SPACE * 0085 ; White_space # Cc <control> * 00A0 ; White_space # Zs NO-BREAK SPACE * 1680 ; White_space # Zs OGHAM SPACE MARK * 2000..200A ; White_space # Zs [11] EN QUAD..HAIR SPACE * 2028 ; White_space # Zl LINE SEPARATOR * 2029 ; White_space # Zp PARAGRAPH SEPARATOR * 202F ; White_space # Zs NARROW NO-BREAK SPACE * 3000 ; White_space # Zs IDEOGRAPHIC SPACE The XML specification defines a smaller set of characters as whitespace, for purposes of whitespace normalization. So some definition is definitely needed. 2.6. Required normalization functionality Section 7.4.11 requires conforming implementations to support Unicode normalization form NFC. Why is normalization form W3C not also required? 2.7. Case folding Sections 7.4.12 and 7.4.13 define functions for case folding. Since case folding is not consistent across languages and locales, we have grave doubts about the wisdom of this inclusion, and some members of the WG would advise you to drop these functions, which are not and cannot be language- and culture-neutral. There is precedent: the decision to drop case-folding of names from the design of XML resulted from the realization that every case-folding algorithm available, including the use of the Unicode case mapping tables, has an inherent cultural bias. The inclusion of culturally and linguistically biased functions does not contribute to achieving the goal of universal accessibility for the Web. Some members of the XML Schema WG believe your spec should not go forward with these functions in it. If you retain these functions, you should at the very least warn users that * Results may violate user expectations (in Québec, for example, the standard uppercase equivalent of "é" is "É", while in metropolitan France it is more commonly "E"; only one of these is supported by the function as defined). * Many characters of class Ll lack uppercase equivalents in the Unicode case mapping tables (we stopped counting at 150 or so); many characters of class Lu lack lowercase equivalents. * The two functions are not inverses of each other, so that for a string S of upper-case characters, fn:upper-case(fn:lower-case(S)) is not guaranteed to return S, nor is fn:lower-case(fn:upper-case(S)) for a string S of lower-case characters. Latin small letter dotless i (as used in Turkish) is perhaps the most prominent lower-case letter which will not round-trip, as Latin capital letter i with dot above is the most prominent upper-case letter which will not round trip; there are others. You may also wish to make the case mapping depend on the default or a user-specified collation. 2.8. Escaping URIs The rules for escaping URIs should be aligned across all W3C specifications; otherwise, we will drive our users crazy. We think that means that you should reference and implement the algorithm specified in the XML Linking specification ([30]http://www.w3.org/TR/2001/REC-xlink-20010627/#link-locators) and referenced by XML Schema, or the algorithm given in the W3C Character Model specification (which was the same algorithm the last time we looked). In particular, some members of the XML Schema WG were surprised to see that your algorithm escapes the percent sign in some cases but not others; this does not seem to be a feature of the algorithm given by XML Linking and by the Character Model. That said, we believe that you do your readers a good service by listing explicitly the affected characters. By suggesting that you refer to the Linking/CharMod algorithm, we do not mean to suggest that you should make your spec less useful by omitting these lists. (Editorial note: it would perhaps be useful to some readers to have a brief discussion of why the advice given in the last paragraph should be followed; our readers did not understand the rationale for this advice.) 2.9. The binary types Section 12.1.1 says that op:hexBinary-equal returns true if its arguments "are of the same length and contain the same code-points"; similarly in 12.1.2 for op:base64Binary-equal. The term code-point was defined in section 7.1 as denoting integers between 0 and 1114111 (x10FFFF), with a gap in the range where Unicode surrogates occur. It seems to be used here to denote what other specifications refer to as octets (bit strings of length 8). Taking the term code point in the sense of `octet', the definition still does not match our intuitions of what an equality test on binary data must do: it is not enough that each argument contain the same octets; they must contain them in the same order. Suggested rewording: "are identical strings of octets". If you wish to avoid the word octet, "are identical bit strings" might do, although it omits the relatively important fact that the values in question must have 8×n bits for some integer n. 2.10. Minor items 2.10.1. User control of collations Section 7.3 says in part "This specification does not use xml:lang to identify the default collation, in part because collations should be determined by the user of the data, not (normally) the data itself, and because ..." The second reason given is sound. The first (collations should not normally be determined by the data) is often advanced as a principle, but does not seem to all members of the XML Schema WG to be universally true. We are thus grateful for the "(normally)" in the sentence. But in any case, the first reason given here leads to a non-sequitur: it would be a reason not to make xml:lang determine the collation sequence without possibility of user override. But it does not, even on its face, provide a reason not to use xml:lang to identify the default collation. We suggest dropping the first reason; the second suffices. 2.10.2. Section 7.3.1.1 Examples The fourth example in section 7.3.1.1 says that fn:compare('Strassen', 'Straße') "returns 1 if and only if the default collation includes provisions that equate `ss' and the (German) character `ß' (`sharp-s')." Unless we have misunderstood the definition of the function, the return value should also be 1 if the default collation sorts "ß" (sharp s) before "s". Deleting the phrase "and only if" would remove the error. 3. Editorial notes In the course of our work, some editorial points were noted; we list them here for the use of the editors. We do not particularly expect formal responses on these comments. 1. Definition of must. Section 1.1 defines must thus: Conforming documents and processors are required to behave as described; otherwise, they are non-conformant or in error. Is the "or" inclusive or exclusive, or is "in error" intended as a synonym or approximate synonym for "non-conformant"? Possible alternatives: "otherwise, they are non-conformant and in error", "otherwise, they are either non-conformant or else in error", "otherwise, they are non-conformant, i.e. in error". 2. Definition of stable. In section 1.1, the definition of stable says, inter alia:"Some other functions ... have an explicit dependency on the dynamic context". Unless this means that they accept an argument representing the dynamic context, it seems at first glance as if explicit is here used with the meaning `implicit'. Perhaps what is intended is that the documentation will explicitly mention this dependency. Perhaps the best thing to do would be just to drop the explicit; if you really wish to stress the promise of documentation, perhaps read "Some other functions ...have a depencency on the dynamic context ... These functions are said to be contextual. [INS: Contextual functions are always identified as such in their descriptions. :INS] " 3. The term back up. The phrase back up appears to be used several times as a technical term (e.g. last paragraph of 1.7). What does it mean? 4. The term QName. Some readers (including some members of the XML Schema WG) are likely to find it disorienting for the term QName to be used here as a synonym for expanded name or universal name, and not with the same meaning QName has in the XML Namespaces Recommendation. We recognize, however, that what is returned is precisely a member of what XML Schema 1.0 defines as the value space of the xs:QName type, so that the use of the term xs:QName to denote (for example) the return type of the accessor fn:node-name is not only unexceptionable but necessary for consistency. We don't have a good solution for you here; we only note the difficulty. Perhaps a note calling the reader's attention to the issue would be in order (similar to the note on this topic in the Data Model spec). Some members of the WG suggest that this spec, like the Data Model, should prefer the term expanded QName where possible, to stress that what is referred to is the pair in the value space, not the colonized Name in the lexical space. 5. No parameters and the empty list of parameters: On first reading, the signatures of fn:string and fn:error suggest an ambiguity to some readers: the call fn:error() appears to match both the first and the second signatures. Members of the WG who have studied XQuery more thoroughly assure the rest of us that there is no ambiguity, so our purpose in making this comment is merely to call your attention to an editorial problem: it might be useful to explain to the reader why the dual signatures showing no arguments and optional arguments are not in fact ambiguous. 6. Section 2.3, first note: the word this seems to need an antecedent; it is not clear to this reader, at least, what that antecedent is. (It's also not clear what problem with blanks in fragment identifiers is being adverted to.) 7. Raising errors: Section 3 para 1 reads in part: "The occurrence of that phrase [sc. `an error is raised'] implicitly causes the invocation of the fn:error function ..." This formulation seems to involve a horrible clash of contexts: the phrase "an error is raised" occurs in this document, and it occurs continuously from the time of publication until the document ceases to exist (if documents can ever cease to exist), while the error, one expects, ought to be raised in a software system which implements the spec, and should probably not be raised continuously from now until the spec ceases to exist, if only because it would make it hard for users to get work done. For the occurrence of a phrase in the spec to cause the raising of an error in conforming software seems to involve a rather unusual kind of action at a distance. To speak a bit more seriously: perhaps the relevant part of the paragraph could be recast, perhaps along these lines: "the phrase `an error is raised' is used to describe the behavior of conforming processors in certain situations. When such situations arise in a running system, a conforming implementation of this specification must invoke the fn:error function defined in this section." This is not perfect, but we hope you get the idea. 8. Type promotion in multiple or single steps: Section 6.2 says "As far as possible, the promotions should be done in a single step. Specifically, when a decimal is promoted to a double, it must not be converted to a float and then to a double, as this risks loss of precision." [Emphasis added.] These two sentences appear to contradict each other: is the rule about single-step conversions required of conforming implementations ("must"), or recommended without being required ("should")? 9. Code points: The note in section 7.1 identifies code points as Unicode scalar values (which are in turn integers), but uses the notation #x0000 and #x10FFFF to refer to the minimum and maximum values. It's not terribly confusing in context, but strictly speaking, this notation is defined in the XML specification as denoting characters, not integers. I believe conventional representations for hexadecimal numbers would write these values as 0, 0H, x0, or 0x, and correspondingly 10FFFFH, x10FFFF, or 10FFFFx; there may be other hexadecimal representations you will prefer. The Unicode specification writes 10FFFF[16]. 10. v and w: Section 7.3 says "`uve' and `uwe' are considered equivalent in some European languages"; this is unexpected. Are you sure? Which languages? 11. Section 7.4 para 1: for "function" read "functions". Here and elsewhere, we believe that sentences like "Several of these functions use a collation" would do better if "a collation" were replaced with a plural: "Several of these functions use a collation." Unless, of course, all of these functions always use the same collation. 12. Section 7.4.6.1, final example: forgive this observation if it's clueless, but since there does not seem to be any addition operator in the example (did we miss it?), it's not immediately obvious what -INF + INF has to do with the interpretation of the example. 13. Section 7.4.15, fn:string-pad: this seems an unfortunate choice of names for a function which does not (despite its name) pad a string with blanks or some other padding character(s), but which simply replicates or copies the string multiple times. Could it be renamed without excessive heartburn? 14. Section 7.4.16, fn:escape-uri: It would help minimize confusion if the lists of characters which are or are not escaped gave the character names as well as the characters themselves in quotation marks. (In the paper copy used by one member of our review task force, this bit of the spec was almost impossible to make out without a magnifying glass.) 15. Section 7.5.3, fn:replace: The description of the function seemed unclear: The function returns the xs:string that is obtained by replacing all non-overlapping substrings of $input that match the given $pattern with an occurrence of the $replacement string. Replacing all occurrences of the pattern with an occurrence of the replacement string seems to suggest an n for 1 exchange. For "all" read "each". In the following paragraph, one occurrence of $input is not marked as an identifier, one is. References 1. http://www.w3.org/ 2. http://www.w3.org/Architecture/ 3. http://www.w3.org/XML/Group 4. http://www.w3.org/XML/Group/Schemas 5. http://www.w3.org/Member/Eventscal.html 6. http://www.w3.org/Member/#confidential 7. http://www.w3.org/XML/Group/2003/07/xmlschema-fo-comments.html#d0e65 8. http://www.w3.org/XML/Group/2003/07/xmlschema-fo-comments.html#d0e70 9. http://www.w3.org/XML/Group/2003/07/xmlschema-fo-comments.html#d0e86 10. http://www.w3.org/XML/Group/2003/07/xmlschema-fo-comments.html#d0e98 11. http://www.w3.org/XML/Group/2003/07/xmlschema-fo-comments.html#d0e145 12. http://www.w3.org/XML/Group/2003/07/xmlschema-fo-comments.html#d0e178 13. http://www.w3.org/XML/Group/2003/07/xmlschema-fo-comments.html#d0e191 14. http://www.w3.org/XML/Group/2003/07/xmlschema-fo-comments.html#d0e238 15. http://www.w3.org/XML/Group/2003/07/xmlschema-fo-comments.html#d0e246 16. http://www.w3.org/XML/Group/2003/07/xmlschema-fo-comments.html#d0e251 17. http://www.w3.org/XML/Group/2003/07/xmlschema-fo-comments.html#d0e278 18. http://www.w3.org/XML/Group/2003/07/xmlschema-fo-comments.html#d0e283 19. http://www.w3.org/XML/Group/2003/07/xmlschema-fo-comments.html#d0e327 20. http://www.w3.org/XML/Group/2003/07/xmlschema-fo-comments.html#d0e350 21. http://www.w3.org/XML/Group/2003/07/xmlschema-fo-comments.html#d0e392 22. http://www.w3.org/XML/Group/2003/07/xmlschema-fo-comments.html#d0e399 23. http://www.w3.org/XML/Group/2003/07/xmlschema-fo-comments.html#d0e444 24. http://www.w3.org/XML/Group/2003/07/xmlschema-fo-comments.html#d0e463 25. http://www.w3.org/XML/Group/2003/07/xmlschema-fo-comments.html#d0e513 26. http://www.w3.org/XML/Group/2003/07/xmlschema-fo-comments.html#d0e516 27. http://www.w3.org/XML/Group/2003/07/xmlschema-fo-comments.html#d0e544 28. http://www.w3.org/XML/Group/2003/07/xmlschema-fo-comments.html#d0e577 29. http://www.w3.org/XML/Group/2003/07/xmlschema-query-notes.html 30. http://www.w3.org/TR/2001/REC-xlink-20010627/#link-locators
Received on Friday, 1 August 2003 22:55:50 UTC