F&O comments on 6.4

I went through the mail archives since October and removed any feedback
I had that was previously raised and answered (one item below was raised
but not answered in the archives, so I left it in).


***
*** Questions/comments on 6.4.17:
***

- What is the result of replace("xxy", "x.", "z") ?  The spec says
"non-overlapping substrings", so I assume this does not result in "zz",
but does it result in "xz" or "zy" ?  This should be made clear.

- What is the result of replace("xxx", "x(xx)|(x)xx", "y$1") ?  Is it
"yxx" or "yx" ?  Perhaps a simpler example is replace("xx", "(x)|x",
"$1").  Does it result in "", "x", or "xx"?

- So, an error is raised for replace("xxx", ".*?", "") because the
reluctant quantifier causes .* to match the "shortest possible
substring", which in this case is the empty string?  If so, I think it's
worth mentioning that reluctant quantifiers can cause patterns that
would otherwise succeed to raise an error.  If this was not intended,
then the definitions need reworking.

- Is an error raised only if the entire pattern matches the zero-length
string?  What about captured substrings, like replace("xxx", "()x*",
"$1") or replace("xxx", "(^).*($)", "$1$2")?  Are these allowed
(resulting in the empty string?) or are they errors?

- If the replacement pattern is invalid, is it an error?  (This is not
stated.)  For example, replace("x", "(x)", "$").  What if the
replacement pattern refers to a non-existent match, such as replace("x",
"(x)", "$5") ?

- If $ must be escaped as \$, then clearly \ must also be escaped
(probably as \\).  Otherwise, it would be impossible to insert a
backslash followed by a captured substring.  For example,
replace($anything, "(.*)", "\$1") needs to be
replace($anything, "(.*)", "\\$1").

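For comparison only, here is a minimal sketch of how one Perl-style
engine answers the questions above.  I am assuming Python's re module
purely for illustration; the draft does not reference it, it writes
group references as \1 rather than $1, and the helper name
matches_empty is hypothetical.  None of this says what replace() should
do; it only shows one plausible set of answers.

```python
import re

def matches_empty(pattern):
    """Hypothetical helper: True if the whole pattern can match ""."""
    return re.fullmatch(pattern, "") is not None

# Non-overlapping, leftmost matching: "xx" is consumed first, leaving "y".
non_overlapping = re.sub(r"x.", "z", "xxy")            # "zy"

# Alternation: in this engine the first alternative that matches wins.
alt_long = re.sub(r"x(xx)|(x)xx", r"y\1", "xxx")       # "yxx"
alt_short = re.sub(r"(x)|x", r"\1", "xx")              # "xx"

# Which of the patterns above can match the zero-length string?
candidates = (r".*?", r"()x*", r"(^).*($)", r"x.")
empty_capable = [p for p in candidates if matches_empty(p)]

# Empty captured groups are accepted here and simply expand to "".
empty_groups = re.sub(r"(^).*($)", r"\1\2", "xxx")     # ""

# A reference to a non-existent group is rejected with re.error.
try:
    re.sub(r"(x)", r"\5", "x")
    bad_reference = "accepted"
except re.error:
    bad_reference = "rejected"

# A literal backslash before a group must itself be escaped: \\ then \1.
# count=1 keeps the trailing empty match of (.*) out of the picture.
escaped = re.sub(r"(.*)", r"\\\1", "abc", count=1)     # backslash + "abc"
```

In this engine every pattern in the list except "x." can match the
empty string, which is exactly the set the draft would have to rule on.
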
***
*** Questions/comments on 6.4.16.1:
***

- What's considered a "newline character" for the purposes of matching
^, $, and . ?  \r? \n? \r\n? (which isn't a character, but a sequence)

- The additional meta-characters change what is considered a "normal
character" in the regular expression.  So in addition to modifying the
XML Schema quantifier production (4), you also need to modify the Char
production (10).

I note in passing that the XML Schema spec appendix F contains two
errors:  The definition of metacharacter omits the vertical bar | (which
is properly accounted for in the Char production), while the Char
production omits the curly brace metacharacters { and } (which are
properly accounted for in the metacharacter definition).  Oops.

Furthermore, the XML Schema regexp grammar allows for expressions like
"|" and "()|()".  This is possibly an error.  (Both branches to the
choice are allowed to be empty, because branch ::= piece*.  Similarly,
parentheses can wrap the empty string.)
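
Whether such expressions should be legal is a design decision, but it
may help to see that a comparable grammar does accept them.  The sketch
below assumes Python's re module (not the XML Schema regexp dialect) as
a stand-in; both expressions compile and match only the empty string.

```python
import re

# Both patterns consist solely of empty branches and/or empty groups.
patterns = ["|", "()|()"]

# re.compile raises re.error on a syntax error; neither pattern triggers it.
compiled = [re.compile(p) for p in patterns]

# Each compiled pattern matches the empty string and nothing longer.
matches_empty_string = [c.fullmatch("") is not None for c in compiled]
matches_x = [c.fullmatch("x") is not None for c in compiled]
```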

- Because the XML Schema grammar for regexps is flawed, and you're using
only a small part of it unmodified anyway, it's probably best to
completely define your (corrected, modified) regexp grammar here.

-  "The effect of [reluctant quantifiers] is that the regular expression
matches the shortest possible substring (consistent with the match as a
whole succeeding)."  I think this parenthetical statement should not be
parenthetical, because it significantly affects the behavior of the
reluctant quantifier.
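
To make the reluctant-quantifier and newline points concrete, here is a
small sketch, again assuming Python's re module as one example engine
rather than anything the draft specifies.  It shows how much the
"consistent with the match as a whole succeeding" clause changes the
result, and which characters this particular engine treats as a newline.

```python
import re

# Reluctant quantifier alone: the shortest possible match is the empty string.
shortest = re.match(r".*?", "xxx").group()            # ""

# Same quantifier, but the match as a whole must reach the end of the input,
# so the "shortest possible" substring is forced to be all of "xxx".
forced = re.fullmatch(r".*?", "xxx").group()          # "xxx"

# Newline handling in this engine: only "\n" counts for ^, $ and ".",
# so the "\r" of a "\r\n" pair stays attached to the first line.
lines = re.findall(r"^.*$", "a\r\nb", re.MULTILINE)   # ["a\r", "b"]
```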


***
*** Questions/comments on 6.4.19:
***

- I suppose you know that the escaping rules differ for each URI part?
Section 2.4.2 of RFC 2396 might be illuminating.  I'm not sure
escape-uri() is useful as-is.  It should probably be something more
along the lines of construct-uri(part1 string?, part2 string?, ...).

- Consider adding functions for XML entitization/de-entitization
(suggested: entitize(string?) string? and unentitize(string?) string?).
I suppose I can cobble together the same functionality by going through
a dummy element constructor, but I think this functionality is more
central to XQuery than URI escaping.
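
Both points can be illustrated with existing library functions.  The
sketch below assumes Python's urllib.parse.quote and
xml.sax.saxutils.escape/unescape as stand-ins; they are not the proposed
construct-uri(), entitize(), or unentitize(), just the closest
equivalents I know of.  The same string escapes differently depending on
which URI part it is destined for.

```python
from urllib.parse import quote
from xml.sax.saxutils import escape, unescape

raw = "a/b c?d"

# As a path, "/" is a delimiter and stays; as a single path segment or a
# query value it is data and must be escaped as well.
as_path = quote(raw, safe="/")       # "a/b%20c%3Fd"
as_segment = quote(raw, safe="")     # "a%2Fb%20c%3Fd"

# Entitize / unentitize round trip using the library equivalents.
text = "a < b & c"
entitized = escape(text)             # "a &lt; b &amp; c"
restored = unescape(entitized)       # "a < b & c"
```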

Received on Monday, 9 December 2002 10:53:46 UTC