XQuery feedback from Michael Rys on 2001-02-24 (www-xml-query-comments@w3.org from February 2001)

From: Michael Rys <mrys@microsoft.com>
Date: Sat, 24 Feb 2001 13:38:53 -0800
To: "W3C XML Query WG (E-mail) (E-mail)" <w3c-xml-query-wg@w3.org>
Cc: "'www-xml-query-comments@w3.org'" <www-xml-query-comments@w3.org>
Message-ID: <EC67B042372C27429014D4FB06AC9FAF0247834D@red-msg-29.redmond.corp.microsoft.com>
The following contains feedback on the XQuery document (based on the Jan 5th
draft) by our prototype developers (I apologize for the delay in
forwarding). I tried to cross-check with the WD release to make sure that I
do not repeat fixed issues and with the issues list at
http://www.w3.org/XML/Group/2001/02/issues.xml (which is not consistent with
the document's own issues list!). Issues that are in the document but not at
the separate address above are repeated on purpose.

In addition, we gave feedback on the normative grammar directly to the
syntax editors; some of it appears to have been included already in the
latest grammar drafts. Once the working group decides on the final
approach (the XQuery element constructor syntax or the XML embedded
grammars), we will be happy to provide an LALR(1) grammar for consideration
for the normative grammar description part.

My apologies for the formatting, but it is XML transformed into HTML and the
XML is somewhat normalized and unordered, so I provide the text based on the
HTML. The feedback is ordered according to sections. My comments and remarks
are tagged with <MR>.

Best regards
Michael

Issues grouped by section
2 The XQuery Language
Issue #2: multiline comments ( Section 2: The XQuery Language, paragraph 6;
normal priority, syntax) 
	XQuery should be enhanced to allow multiline comments. If the
standard doesn't do this, vendors are sure to do it anyway, just as T-SQL
adds /*...*/ comments to SQL. 
2.1 Path Expressions
Issue #3: duplicate removal ( Section 2.1: Path Expressions, paragraph 2;
critical, semantics) 
<MR>Make sure that all operations in XQuery's path expression language that
need to eliminate duplicates from a list define which duplicate is removed.
A potential candidate is the union operator |.</MR>
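To make the question concrete, here is a minimal Python sketch (not XQuery) of one possible rule: a union that keeps the first occurrence of each node, compared by identity. The names Node and union_keep_first are hypothetical and used only for illustration; the draft does not specify this behavior.

```python
class Node:
    """Hypothetical stand-in for a data model node; identity is object identity."""
    def __init__(self, name):
        self.name = name

def union_keep_first(left, right):
    """Concatenate two node lists and drop later occurrences of the same node."""
    seen = set()
    result = []
    for node in list(left) + list(right):
        if id(node) not in seen:      # compare by identity, not by value
            seen.add(id(node))
            result.append(node)
    return result

a, b, c = Node("a"), Node("b"), Node("c")
merged = union_keep_first([b, a], [a, c, b])
print([n.name for n in merged])
```

With this rule the result order is b, a, c. Keeping the last occurrence instead would give a, c, b, which is why the choice of surviving duplicate needs to be defined.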
Issue #5: consider character RANGEs ( Section 2.1: Path Expressions,
paragraph 9; normal priority, semantics) 
	When I worked with Pavel Curtis (formerly at PARC/Xerox) on a
language project (MOO) , we found that it was also extremely useful to allow
character ranges, like RANGE 'a' TO 'z'. In general, the RANGE operator
should work anywhere a list of values is required. For example, the
expression FOR $i IN RANGE 1 TO 5 RETURN <number>$i</number> should be a
valid XQuery. 
	<MR>This is actually a feature request for a more general RANGE
operator that generates a list of values in the given range. The character
range of course needs to be collation sensitive and thus needs a reference
to a collation order when used to generate the range list.</MR>
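A rough Python sketch of such a general range generator follows. The function name value_range is hypothetical, and enumerating characters by Unicode code point is an assumption standing in for a real collation order.

```python
def value_range(start, stop):
    """Return the inclusive list of values from start to stop.

    Integers count up by one; single characters are enumerated by Unicode
    code point, which is an assumption standing in for a real collation.
    """
    if isinstance(start, int) and isinstance(stop, int):
        return list(range(start, stop + 1))
    if (isinstance(start, str) and isinstance(stop, str)
            and len(start) == 1 and len(stop) == 1):
        return [chr(cp) for cp in range(ord(start), ord(stop) + 1)]
    raise TypeError("RANGE endpoints must both be integers or both be single characters")

# Counterpart of: FOR $i IN RANGE 1 TO 5 RETURN <number>$i</number>
numbers = ["<number>%d</number>" % i for i in value_range(1, 5)]
letters = value_range("a", "e")
```

A collation-aware version would need an extra parameter naming the collation, since code point order is only one of many possible orderings.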
Issue #6: better RANGE syntax ( Section 2.1: Path Expressions, paragraph 9;
high priority, syntax) 
	XPath has such a compact syntax, that the currently proposed RANGE
operator syntax is extremely awkward in comparison (or do you intend to
introduce arbitrary XQuery expressions into XPath predicates?). I propose
using the following alternative syntax, using the character sequence ... as
a range operator: RANGE a TO b becomes a...b This syntax is more XPath-like.
	You should also consider introducing a special lexical sequence or
function for the length of the enclosing node set, e.g., chapter[2...$] or
else chapter[2...length()]
	<MR>Original proposal used .. which would overload parent operator
..</MR>
Issue #7: improve description of dereference ( Section 2.1: Path
Expressions, paragraph 11; low priority, editorial) 
	This description of the dereference operator is pretty confusing.
For example, is the name following a dereference operator an element name
(exactly the same as an XPath name test) or a type name? (Both explanations
are given <MR>* stands for any #type# should probably be #name#</MR>.)
Improve the exposition. What happens in a query like 
	FOR $node IN (//procedure UNION
		FOR $p IN //procedure[1], $e IN //* AFTER ($p//incision)[1]
			BEFORE ($p//incision)[2]
		RETURN shallow($e)
		)
	RETURN $node/idref->procedure
	(modified from example Q16) -- can the dereference operator result
in shallow node copies, or only the original nodes (presumably both)? In
general, we can create XML that is no longer valid (in the sense that
multiple elements share the same id value) -- what happens to dereference
then? 
Issue #8: computed references ( Section 2.1: Path Expressions, paragraph 11;
normal priority, semantics) 
	The dereference operator applies only to path expressions. However,
in general, one will want to be able to compute a reference and then
dereference it. This functionality needs to be added to XQuery. E.g.,
concat('E', '3')->emp/@mgr
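The following Python sketch illustrates the idea of dereferencing a computed value. The sample document, the attribute name id, and the helper deref are all assumptions made for the example; they are not taken from the draft.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; "id" plays the role of an ID-typed attribute.
doc = ET.fromstring(
    '<staff>'
    '<emp id="E1" mgr="E3"/>'
    '<emp id="E3" mgr="E9"/>'
    '</staff>'
)

def deref(root, ref, name, id_attr="id"):
    """Return all elements called `name` whose ID attribute equals `ref`."""
    return [e for e in root.iter(name) if e.get(id_attr) == ref]

computed = "E" + str(3)                 # plays the role of concat('E', '3')
targets = deref(doc, computed, "emp")
managers = [e.get("mgr") for e in targets]
```

Returning a list rather than a single element keeps the case of several elements sharing one id value visible instead of hiding it.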
Issue #9: default namespace is poorly defined ( Section 2.1: Path
Expressions, paragraph 17; high priority, syntax) 
	The syntax NAMESPACE DEFAULT = "uri" precludes the use of a prefix
named DEFAULT (without putting the name DEFAULT in single-quotes). This
would be avoided if you change the syntax slightly to DEFAULT NAMESPACE =
"uri"
	<MR>We proposed that change for the grammar and it seems to be added
to the last grammar versions</MR>
Issue #36: no local namespace decls ( Section 2.1: Path Expressions,
paragraph 16; critical, semantics) 
	The namespace decls are kind of global. But in XML, namespace decls
are local to a particular element. XQuery needs to allow us to declare
namespaces on an element and then override those namespace decls on
subelements. 
2.2 Element Constructors
Issue #10: quote usage is awkward ( Section 2.2: Element Constructors,
paragraph 8; critical, syntax) 
	Changing the usage of quotes (as used in XPath, XSLT, and XML) is
going to cause a lot of user confusion and other problems. For one thing, it
means that no one can leverage any existing parsing code they have, because
the quoting rules have changed. But more importantly, it will cause problems
for tools that auto-generate XQuery expressions. Suppose a tool wants to
auto-generate a query from an existing chunk of XML. Instead of being able
to copy that chunk of XML into XQuery, the chunk has to be specially
serialized using the XQuery quoting rules (but only if the chunk contains an
XQuery keyword).
	Also, note that as XQuery goes through future versions, the keyword
set may expand (e.g., you may add an UPDATE keyword). These quoting rules
will break backwards compatibility with older queries that use the new
keywords as unquoted identifiers.
	Some alternatives: 
*	(recommended) Introduce unambiguous leading characters for all
identifiers. 
			The grammar is almost like this already -- for
example, in the expression <FOR it's already clear that this is an element
constructor, because of the leading < symbol. Similarly for variable names,
which always begin with $. If you add an @ symbol in front of attribute
names in element constructors, then I think the only remaining cases are
names in path expressions (which is a problem XPath has already -- for
example, the XPath and/or/@and) and prefix and function names. You need to
use the same approach for path expressions that XPath uses, to be consistent
with XPath. I think it would be fine to require that prefix and function
names (which are always local to the query, and not data-dependent) cannot
collide with keyword names.
*	Reserve keywords without possibility for use as names. 
			This alternative is unacceptable, but we mention it
because so many programming languages use it (including Java, C, C++).
*	Introduce sufficient token lookahead to disambiguate identifiers
from keywords. 
			This solution requires sitting down with the grammar
and figuring out what the rules should be. It may complicate the grammar
description, but a solution (if one exists) is unlikely to require more than
two or three tokens of lookahead.
<MR>This is an issue that would disappear with the XML based syntax under
discussion. I left it in since no decision has yet been made.</MR>
Issue #11: no computed attribute constructor ( Section 2.2: Element
Constructors, paragraph 3; critical, syntax) 
	There appears to be no computed attribute constructor. That is, if
the variable $a contains the name of my attribute, I cannot build the
element <foo $a="value"/>. 
	<MR>We have discussions on this but I could not find it in the
issues list yet</MR>
Issue #12: literal string representation different from both XPath and XML (
Section 2.2: Element Constructors, paragraph 7; normal priority, syntax) 
	Note that the syntax for escaped quote characters in string literals
differs from the XPath grammar (XPath does not allow escape characters - how
broken is that?!) and the XML grammar (which uses entities). This may be
problematic for some scenarios (e.g., automatic XQuery generation). 
Issue #37: well-formedness constraints ( Section 2.2: Element Constructors,
paragraph 1; critical, semantics) 
	XML syntax is mixed into the XQuery syntax. How well-formed must it
be? For example, for <author>, must all the characters comply with the
XML 1.0 standard? And must the end tag match the start tag? 
	<MR>This may be addressed by the XML based grammar. Otherwise, we
need to mention the constraints on what can be in the close tag based on the
open tag.</MR>
Issue #38: validity constraints ( Section 2.2: Element Constructors,
paragraph 1; critical, semantics) 
	Are the results of expressions validated? If a schema type is
associated with an element, but the element does not satisfy the schema
constraints, does an error occur? What about scalar values that do not match
pattern facets? Etc., etc. 
	<MR>We have discussions on this on the mailing lists, but I could
not find it in the issues list yet</MR>
Issue #39: non-element ctors ( Section 2.2: Element Constructors, paragraph
5; normal priority, semantics) 
	In example Q8, why can't the construction of comment and processing
instruction be the same as the element constructor? Instead of using
comment("Houston, we have a problem"), it could be simply <!-- Houston, we
have a problem -->. 
	<MR>We have discussions on this on the mailing lists, but I could
not find it in the issues list yet</MR>
2.3 FLWR Expressions
Issue #13: tuple order of FOR ( Section 2.3: FLWR Expressions, paragraph 5;
normal priority, semantics) 
	The description of FLWR expressions says that "The tuples generated
by the FOR/LET sequence have an order that is determined by the order of
their bound elements in the input document, with the first bound variable
taking precedence, followed by the second bound variable, and so on.". But
of course, there may be many input documents, not just one. Also, there may
not be a document at all. Also, it is important to point out to the reader
that this ordering is different from XPath (which uses reverse document
order for the reverse axes). Also, you should point out that unlike path
expressions (which remove duplicate nodes), the XQuery FOR $a IN path, $b IN
PATH generates a cross-product of path with itself (not a single iteration
through path). 
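As a small illustration of the cross-product point, here is a Python sketch in which the list path is a hypothetical stand-in for the result of a path expression in document order.

```python
from itertools import product

path = ["n1", "n2", "n3"]   # hypothetical result of a path expression, in document order

# FOR $a IN path, $b IN path: every pairing, with the first variable varying slowest.
tuples = list(product(path, path))
```

Three nodes yield nine tuples, starting (n1, n1), (n1, n2), (n1, n3), (n2, n1), rather than a single pass over the three nodes.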
Issue #14: result of RETURN ( Section 2.3: FLWR Expressions, paragraph 7;
critical, semantics) 
	The result type of a RETURN expression does not seem to be limited
to "nodes, ordered forests of nodes, or primitive values" as described. Some
examples: 
*	list of primitives: FOR $a IN document("zoo.xml") LET $b :=
path_selecting_a_list_of_scalar_types RETURN $b 
*	empty result: FOR $a IN document("zoo.xml") RETURN () 
*	list of both primitives and nodes: FOR $a IN
document("zoo.xml") RETURN 3,<foo/> 
*	unordered forest of nodes: FOR $a IN
distinct(document("zoo.xml")) RETURN <foo>$a</foo>
	If there are semantic constraints on RETURN that exceed the existing
syntactical constraints, then these need to be clearly defined. We also need
to know which constraints can be determined at analysis-time, and which can
only be determined at execution-time (e.g., an empty result). 
Issue #15: result of an unordered RETURN ( Section 2.3: FLWR Expressions,
paragraph 7; critical, semantics) 
	I'm especially concerned about the last example given for issue 14:
FOR $a IN distinct(document("zoo.xml")) RETURN <foo>$a</foo>
	How are the results of an unordered RETURN supposed to be serialized
out (or otherwise ordered later)? Is this expression supposed to be illegal
unless I apply an explicit sort or group-by clause? Note that even if the
FOR loop is unordered, the RETURN result might still be unambiguously
ordered (if it is independent of the FOR). This is a mess.
<MR>I left out the issue that relates to xquery-unordered-collections
(distinct should only remove duplicates and a toset operation should do what
distinct does now)</MR> 
Issue #17: expressions as literal content ( Section 2.3: FLWR Expressions,
paragraph 17; high priority, syntax) 
	This syntax suffers from the problem that it is not immediately
clear to the user which characters will be literally echoed into the result,
and which are operators in the syntax and will be interpreted. Consider
example Q13, or an even simpler version of it: 
	LET $a := avg(//book/price)
	FOR $b in /book
	RETURN <diff>$b/price - $a</diff>
	Presumably (the spec does not explain the results for most of the
examples) this is supposed to return a result that looks like 
	<diff>1.50</diff><diff>-3.24</diff>
	But the user might have expected the result 
	<diff>7.50 - 6.00</diff><diff>2.76 - 6.00</diff>
	I suppose this alternate result is supported through the quoting
rules 
	LET $a := avg(//book/price)
	FOR $b in /book
	RETURN <diff>$b/price "-" $a</diff>
	but the user cannot determine by simple inspection that the quotes
were needed (that the hyphen would be interpreted instead of echoed). 
	It would be much easier both for language parsers and users if
interpreted expressions were syntactically distinguished from literal
values. For example, 
	LET $a := avg(//book/price)
	FOR $b in /book
	RETURN <diff>{ $b/price - $a }</diff>
	vs. 
	LET $a := avg(//book/price)
	FOR $b in /book
	RETURN <diff>{ $b/price } -  { $a }</diff>
	(I don't necessarily advocate the use of curly braces for this
purpose; I just picked a random punctuation character to illustrate the
concept.) 
	<MR>This is an issue that would probably disappear with the XML
based syntax under discussion. I left it in since no decision has yet been
made.</MR>
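The effect of delimiting interpreted expressions can be sketched in Python. The render helper, the variable names price and avg, and the two-decimal formatting are assumptions for illustration only; the expressions inside the braces are Python, not XQuery.

```python
import re

def render(template, env):
    """Replace each { expr } with its evaluated value; echo everything else."""
    def evaluate(match):
        # eval is used only to keep the sketch short; env supplies the variables.
        value = eval(match.group(1).strip(), {"__builtins__": {}}, env)
        return "%.2f" % value
    return re.sub(r"\{([^{}]*)\}", evaluate, template)

avg = 6.00
results = []
for price in (7.50, 2.76):
    env = {"price": price, "avg": avg}
    results.append(render("<diff>{ price - avg }</diff>", env))
    results.append(render("<diff>{ price } - { avg }</diff>", env))
print("\n".join(results))
```

Text outside the delimiters is always echoed and text inside is always interpreted, so a reader can tell by inspection whether the hyphen is an operator.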
Issue #19: SORTBY semantics not well-defined ( Section 2.3: FLWR
Expressions, paragraph 20; critical, semantics) 
	The semantics of SORTBY are undefined. Collation order? Data type?
How do the three sorts SORT BY ($b/price), SORT BY ($b/price/text()), and
SORT BY (number($b/price)) differ? How does schema information affect (or
not affect) a sort? What if the key set is empty? What does an expression
that mixes types in the sort key, like 
	FOR $h IN //holding
	RETURN <holding>$h/title</holding>
	SORT BY ( IF $h/@type="journal" THEN $h/editor ELSE number($h/price) )
	(modified from Q18) return? 
	<MR>We have discussions on this but I could not find it in the
issues list yet</MR>
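Why the data type of the sort key matters can be seen in a short Python sketch. The sample values are invented, and Python's comparison rules stand in for whatever XQuery eventually defines.

```python
prices = ["10", "9.5", "100"]          # hypothetical text content of price elements

as_text = sorted(prices)               # string comparison, code point order
as_number = sorted(prices, key=float)  # numeric comparison

mixed = ["Smith", 12.5]                # an editor name and a numeric price
try:
    sorted(mixed)
    mixed_sortable = True
except TypeError:
    mixed_sortable = False
```

The same three keys come out in two different orders depending on the type used for comparison, and Python refuses to order the mixed list at all. XQuery needs to state which of these behaviors applies.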
Issue #40: non-element ctors ( Section 2.3: FLWR Expressions, paragraph 10;
normal priority, semantics) 
	In example Q10, the explanation of duplicate values for distinct is
still too vague. Does attribute order matter? Encodings? Maybe we can use
the canonical XML spec to distinguish two element contents. For example,
consider the XML 
	<e a1="1" a2="2">
		<child></child>
	</e>
	<e a2="2" a1="1"><child/></e>
	<MR>I would like to see this as a request for clarification.
Attribute order does not matter according to Infoset thus our datamodel does
not provide for that either.</MR>
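One way to explore the canonical XML suggestion is Python's xml.etree.ElementTree.canonicalize (available from Python 3.8), which implements C14N 2.0. This is only a sketch of a possible comparison rule, not a statement of what distinct does in the draft.

```python
import xml.etree.ElementTree as ET

first = '<e a1="1" a2="2">\n\t<child></child>\n</e>'
second = '<e a2="2" a1="1"><child/></e>'

# Canonical XML sorts attributes and writes empty elements as start/end pairs.
# strip_text=True also drops the whitespace-only text around <child>.
same_ignoring_whitespace = (
    ET.canonicalize(first, strip_text=True) == ET.canonicalize(second, strip_text=True)
)
same_with_whitespace = ET.canonicalize(first) == ET.canonicalize(second)
```

Attribute order and the empty-element form disappear under canonicalization, but the whitespace between the tags still distinguishes the two elements unless it is stripped explicitly.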
Issue #41: text() vs. data() ( Section 2.3: FLWR Expressions, paragraph 21;
normal priority, syntax) 
	In Q15, text() is used. Shouldn't this change to data() for the same
reasons it's data() in the algebra spec? 
	<MR>This is issue 48, but I could not find it in the separate issues
list yet</MR>
2.4 Operators in Expressions
Issue #20: document order definition should be at the beginning ( Section
2.4: Operators in Expressions, paragraph 2; low priority, editorial) 
	The explanation of ordinal position belongs at the beginning of this
document, not in section 2.4. 
Issue #21: definition of data model instance and global ordering ( Section
2.4: Operators in Expressions, paragraph 2; critical, semantics) 
	In these specs, the phrase "data model instance" is hopelessly
confused with the phrase "XML document or fragment". The data model spec
currently says that a data model instance is a possibly unordered collection
of zero or more XML documents and fragments, and the XQuery spec says that
only one data model instance is the input to an XQuery. Thus, a data
model instance might have no global ordering. Also, even if the top-level
nodes of the data model are ordered, we know that a query fragment can
result in a non-ordered list of nodes. So how does global ordering work
then?
	This issue really needs to be resolved and cleared up
once-and-for-all. If we cannot describe consistently the data model and its
ordering (or lack thereof), then how can we hope to define a query language
over it?
	Until this issue is resolved, the BEFORE/AFTER semantics are not
well-defined. For example, 
	FOR $a IN document("one.xml"), $b IN document("two.xml")
	WHERE $a/title BEFORE $b/title
	RETURN $a
Issue #22: BEFORE/AFTER vs. XPath preceding/following ( Section 2.4:
Operators in Expressions, paragraph 2; high priority, semantics) 
	How do the XQuery keywords BEFORE and AFTER differ from the XPath
axes preceding and following? If they don't differ, then these keywords
should be removed. If they do differ, then we need examples. 
Issue #23: shallow semantics not well-defined ( Section 2.4: Operators in
Expressions, paragraph 2; normal priority, semantics) 
	Does shallow() copy text, comment, or p-i content of an element?
What is the global order of the resulting node? Since attributes (which may
be ID-typed) are copied, does this new node have the same or a different
identity from the original node? 
	<MR>The author of this remark seems confused between node identity
and the impact of ID-types on node ID. Thus I leave this remark in the issue
to make sure that the difference will be clear.</MR>
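The distinction can be illustrated with a Python sketch in which object identity stands in for node identity. The shallow helper below is hypothetical, and its choice to drop text content is an assumption, since that is exactly what the issue asks the spec to settle.

```python
import xml.etree.ElementTree as ET

def shallow(elem):
    """Copy the element's name and attributes, but none of its content."""
    return ET.Element(elem.tag, dict(elem.attrib))

original = ET.fromstring('<incision id="i1">scalpel<depth>2</depth></incision>')
copy = shallow(original)

same_id_value = copy.get("id") == original.get("id")   # ID attribute value is copied
same_node = copy is original                           # node identity is not
```

The copy carries the same id attribute value as the original while remaining a distinct node, so the two notions must not be conflated.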
2.8 Datatypes
Issue #16: comma is problematic and unnecessary ( Section 2.8: Datatypes,
paragraph 5; high priority, syntax) 
	The use of commas to separate element constructors seems both
unnecessary and problematic. How can I put a comma into the text content of
an element, like <comma>,</comma>? Must I really quote it, as in
<comma>","</comma>? Remove the comma from this grammar.
	I understand the desire to construct lists. However, lists of
elements are already clear (<a/><b/> is a list of two elements). Since lists
of lists are not allowed (i.e., there is no need to distinguish between a
list of three elements <a/><b/><c/> vs. a list of two elements followed by
one element <a/><b/>,<c/>) there is no added value in writing $a,$b,$c
instead of just $a $b $c (even when the variables are bound to primitive
values). This is also consistent with the way the values will be serialized
out (e.g., idrefs/nmtokens are space-separated lists, not comma-separated
lists). Also, there is no value in using square brackets to construct lists
-- parentheses will work just as well (with no conflict with their use as
grouping operators).
	<MR>This would be solved by the XML based construction grammar</MR>
Issue #25: namespace for builtin XQuery data types ( Section 2.8: Datatypes,
paragraph 4; high priority, semantics) 
	Built-in XQuery data types (like ELEMENT, ATTRIBUTE, and LIST)
should not be represented with keywords, but instead as types in a reserved
XQuery namespace. This is extensible as well as consistent and compatible
with the use of XML Schema types. And there are probably a half-dozen more
reasons to prefer this approach to keywords. 
2.9 User-Defined Functions
Issue #26: list arguments to functions ( Section 2.9: User-Defined
Functions, paragraph 10; critical, semantics) 
	Although I very much want set semantics in XQuery, the fourth
rule for function resolution needs work. Does this rule work only with
single-argument functions? What about a two-argument function? Consider this
query fragment: 
	FOR $c IN document("customers.xml")
	LET $orders := $c//Order
	RETURN concat("O", $orders/@OrderID)
	Also, what happens if sets are passed as both arguments? 
	Don't forget that set-based semantics conflicts strongly with XPath
1.0. Using set-semantics for XPath functions means that you create a new
meaning for path expressions that is different from the XPath 1.0 spec.
Experience with SQL XML shows that this new semantics needs careful
definition -- it is not enough to wave hands and say that a list-typed
argument to a function expecting a scalar results in a list of that function
applied to each scalar. The interactions of this rule with the rest of the
XPath language (like types -- XPath has no scalar list type, only nodeset)
must be explored and defined.
	<MR>See my proposal at
http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2001Feb/0271.html </MR>
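The ambiguity with two list-valued arguments can be sketched in Python. The helpers apply_zipped and apply_crossed and the sample values are hypothetical; they show two plausible readings, neither of which is taken from the draft.

```python
from itertools import product

def concat(a, b):
    return a + b

def apply_zipped(f, xs, ys):
    """Pair arguments position by position."""
    return [f(x, y) for x, y in zip(xs, ys)]

def apply_crossed(f, xs, ys):
    """Apply f to every combination of arguments."""
    return [f(x, y) for x, y in product(xs, ys)]

order_ids = ["17", "42"]                      # hypothetical @OrderID values
one_list = apply_crossed(concat, ["O"], order_ids)
zipped = apply_zipped(concat, ["A", "B"], order_ids)
crossed = apply_crossed(concat, ["A", "B"], order_ids)
```

With a single list argument both readings agree, but with two lists they produce results of different length, so the rule has to pick one.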
Issue #27: coercion rules for function return types ( Section 2.9:
User-Defined Functions, paragraph 14; critical, semantics) 
	The rules for function resolution partially sketch out how type
coercions work on function arguments. However, the spec does not define how
type coercions work on function return values. Consider several possible
variations of this function: 
	FUNCTION coerce() RETURNS xsd:integer
	{
		RETURN 1.0
		-- RETURN "1"
		-- RETURN [ 1 ]
		-- etc.
	}
	Are these legal XQueries, and if so, how do they work? 
Issue #28: connected() misses some nodes ( Section 2.9: User-Defined
Functions, paragraph 17 ; low priority, code example/sample) 
	In example Q23, the connected() function counts nodes connected
through IDREF(s) attributes, but not IDREF(s) text content (e.g., if the
element is <e>12345</e>). Also, if this function is supposed to work in
general, then it should probably use descendant-or-self instead of just
child (that is, $e//* instead of $e/*). 
Issue #47: wrong idref instance in Q23 ( Section 2.9: User-Defined
Functions, paragraph 17; high priority, code example/sample) 
	In example Q23, the example uses a number as a ID/IDREF value which
is not a valid instance. 
2.10 User-Defined Datatypes
Issue #29: schema information should be easily queryable ( Section 2.10:
User-Defined Datatypes, paragraph 5; normal priority, code example/sample) 
	In example Q26, the namespace uri is repeated in the query and the
schema. Schema information such as target-namespace should be easily exposed
to the query engine. Of course, the user could do something like 
	SCHEMA "myschema.xsd"
	NAMESPACE xsd = "http://www.w3.org/2000/10/XMLSchema"
	DEFAULT NAMESPACE = document("myschema.xsd")/xsd:schema/@targetNamespace/text()
	but why should the user have to do that? Instead, I should be able
to do something like: 
	SCHEMA $myschema = "myschema.xsd"
	DEFAULT NAMESPACE = target-namespace($myschema)
	If you don't define a library of such functions for XSD
manipulation, every user will end up writing their own anyway. 
2.11 Operations on Datatypes
Issue #30: type names need to be computable ( Section 2.11: Operations on
Datatypes, paragraph 2; high priority, type system) 
	The type argument to INSTANCEOF should be computable (e.g., $expr
INSTANCEOF $type). In general, types should be dynamically computable and
not limited to compile-time only. 
Issue #31: type operations are incomplete ( Section 2.11: Operations on
Datatypes, paragraph 2 <http://www.w3.org/XML/Group/2001/01/xquery.html>;
critical, type system) 
	The type operations described in this section do not capture all of
the type relationships expressible in XSD. Presumably this section is
pending the work on MSL, but in any case, there is more to types than just
subtyping and coercions. 
Issue #32: CAST syntax should be a function ( Section 2.11: Operations on
Datatypes, paragraph 2; high priority, syntax) 
	To be consistent with XPath, type casting should be done through
functions (e.g., number(), string(), etc.). A function syntax would also be
consistent with the scalar type constructors (e.g., date("2001-01-30") ). If
necessary, add a generic cast() function that takes two arguments - the
expression to be cast, and the type to which it should be cast. 
<MR>This is another proposal for resolution of issue
[xquery-cast-expression]. See also recent discussions on TREAT and
CAST.</MR>
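A generic two-argument cast function might look like the following Python sketch. The CASTS table, the type names chosen, and the error handling are assumptions for illustration, not a proposal for the actual function library.

```python
from datetime import date
from decimal import Decimal

# Hypothetical mapping from type names to converter functions.
CASTS = {
    "xsd:integer": int,
    "xsd:decimal": Decimal,
    "xsd:string": str,
    "xsd:date": date.fromisoformat,
}

def cast(value, type_name):
    """Generic two-argument cast: the value, then the name of the target type."""
    try:
        converter = CASTS[type_name]
    except KeyError:
        raise ValueError("unknown type: %s" % type_name)
    return converter(value)

as_int = cast("42", "xsd:integer")
as_date = cast("2001-01-30", "xsd:date")
```

Because the target type is an ordinary argument, it can be computed at run time, which also fits the request in issue #30 for computable type names.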
3 Querying Relational Data
Issue #48: Remove the section on relational data representation in XML (
Section 3: Querying Relational Data; high priority, editorial) 
<MR>This section should be removed and examples on grouping and joins should
be given without basing it on a (IMO) strange relational-XML mapping.</MR>
B XQuery Grammar
	<MR>Generally, there should be only a BNF based LALR(1) normative
grammar. The language specific grammar should disappear from the next
working draft</MR>
E XQuery Semantics <http://www.w3.org/XML/Group/2001/01/xquery.html>
Issue #35: RANGE cannot map to algebra ( Section E: XQuery Semantics,
paragraph 1 <http://www.w3.org/XML/Group/2001/01/xquery.html>; critical,
mapping to algebra) 
	There is no corresponding algebra operator for RANGE
Issue #42: contains() has no algebra equivalent ( Section E: XQuery
Semantics, paragraph 1 <http://www.w3.org/XML/Group/2001/01/xquery.html>;
critical, mapping to algebra) 
	How is the function contains() mapped to the algebra? 
Issue #43: FILTER has no algebra equivalent ( Section E: XQuery Semantics,
paragraph 1 <http://www.w3.org/XML/Group/2001/01/xquery.html>; critical,
mapping to algebra) 
	How is the filter operator mapped to the algebra? 
Issue #46: algebra operations not exposed in XQuery ( Section E: XQuery
Semantics, paragraph 1 <http://www.w3.org/XML/Group/2001/01/xquery.html>;
critical, mapping to algebra) 
	What are the XQuery equivalents for the algebra concepts bag,
bagtolist, and index()?
Received on Saturday, 24 February 2001 16:39:34 UTC