RE: F&O comments on 6.4 from Kay, Michael on 2002-12-09 (public-qt-comments@w3.org from December 2002)

From: Kay, Michael <Michael.Kay@softwareag.com>
Date: Mon, 9 Dec 2002 18:57:47 +0100
To: xquery@attbi.com, public-qt-comments@w3.org
Message-ID: <DFF2AC9E3583D511A21F0008C7E621060453DE88@daemsg02.software-ag.de>
> 
> ***
> *** Questions/comments on 6.4.17:
> ***
> 
> - What is the result of replace("xxy", "x.", "z") ?  The spec 
> says "non-overlapping substrings", so I assume this does not 
> result in "zz", but does it result in "xz" or "zy" ?  This 
> should be made clear.

Thanks, yes, the spec should make it clear that if the pattern does match
two overlapping substrings, the one that starts first is chosen.
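
For comparison only, here is how one existing engine resolves this case. The sketch uses Python's re module, which is an assumption made for illustration and is not the regex dialect the F&O draft defines.

```python
import re

# Python's re.sub scans left to right and replaces non-overlapping matches.
# "x." first matches "xx" at position 0; the remaining "y" cannot match.
result = re.sub(r"x.", "z", "xxy")
print(result)  # zy
```

This agrees with the rule that the substring starting first is the one chosen.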
> 
> - What is the result of replace("xxx", "x(xx)|(x)xx", "y$1") 
> ?  Is it "yxx" or "yx" ?  Perhaps a simpler example is 
> replace("xx", "(x)|x", "$1").  Does it result in "", "x", or "xx"?

There's a meta-question here: how are we going to provide a definitive
specification of the semantics of our regular expressions? Are we going to
try to include all the rules ourselves, or refer to some external
authority? The difficulty here is that it is very hard to find a
definitive specification for regular expressions that we can simply refer
to. I see that Java refers to Mastering Regular Expressions, Jeffrey E. F.
Friedl, O'Reilly and Associates, 1997. The Perl specification seems
hopelessly informal, and full of statements like "this feature is
experimental".

I imagine the answer here is that when several alternatives match, the one
that counts is the first.
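
As a data point rather than a definition, a backtracking engine with ordered alternation behaves as follows. The sketch assumes Python's re module, which writes group references as \1 instead of $1; the patterns themselves are unchanged.

```python
import re

# The first alternative, x(xx), matches at position 0, so group 1 is "xx".
first = re.sub(r"x(xx)|(x)xx", r"y\1", "xxx")

# Each "x" is matched by the first alternative, (x), so group 1 is "x" both times.
second = re.sub(r"(x)|x", r"\1", "xx")

print(first)   # yxx
print(second)  # xx
```

Other engines (for example POSIX leftmost-longest matchers) need not agree, which is why the spec has to say which rule applies.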
> 
> - So, an error is raised for replace("xxx", ".*?", "") 
> because the reluctant quantifier causes .* to match the 
> "shortest possible substring" which in this case is the empty 
> string?  If so, I think it's worth mentioning that the 
> reluctant quantifiers can cause patterns that would normally 
> succeed to error.  If this was not intended, then the 
> definitions need reworking.

Yes, it may be worth including an example like this.
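
A minimal sketch of the underlying behaviour, assuming Python's re module as a stand-in engine. Python does not raise an error for zero-length matches, so the sketch only shows what the reluctant quantifier matches, not the error the draft describes.

```python
import re

# With nothing after it in the pattern, the reluctant form is satisfied
# by the empty string, while the greedy form takes the whole input.
reluctant = re.search(r".*?", "xxx")
greedy = re.search(r".*", "xxx")

print(repr(reluctant.group()))  # ''
print(repr(greedy.group()))     # 'xxx'
```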
> 
> - Is an error raised only if the entire pattern matches the 
> zero-length string?  What about captured substrings, like 
> replace("xxx", "()x*", "$1") or
> replace("xxx", "(^).*($)", "$1$2")?  Are these 
> allowed (resulting in the empty string?) or are they errors?

I can't see any strong reason why a zero-length captured substring shouldn't
be allowed, but perhaps I'm missing something.
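
To make the two cases concrete, here is a sketch using Python's re module (an assumption for illustration, with \1 in place of $1). In both cases the pattern as a whole matches a non-empty string while the captured groups are zero-length.

```python
import re

m1 = re.search(r"()x*", "xxx")
m2 = re.search(r"(^).*($)", "xxx")

# Whole match is "xxx"; every captured group is the empty string.
print(repr(m1.group(0)), repr(m1.group(1)))
print(repr(m2.group(0)), repr(m2.group(1)), repr(m2.group(2)))

# Expanding the replacement templates against these matches yields "".
out1 = m1.expand(r"\1")
out2 = m2.expand(r"\1\2")
print(repr(out1), repr(out2))
```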
> 
> - If the replacement pattern is invalid, is it an error?  (This is not
> stated.)  For example, replace("x", "(x)", "$").  What if the 
> replacement pattern refers to a non-existent match, such as 
> replace("x", "(x)", "$5") ?

I think we should state that these are both dynamic errors.
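
For what it is worth, at least one existing engine already treats a reference to a non-existent group as a run-time error. The sketch below assumes Python's re module; try_replace is a hypothetical helper written for this example only.

```python
import re

def try_replace(text, pattern, replacement):
    """Return the replaced string, or an error message if the engine rejects the call."""
    try:
        return re.sub(pattern, replacement, text)
    except re.error as exc:
        return f"error: {exc}"

# Group 1 exists, so this succeeds.
ok = try_replace("x", r"(x)", r"\1")

# Group 5 does not exist in the pattern, so Python raises re.error.
bad = try_replace("x", r"(x)", r"\5")

print(ok)
print(bad)
```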
> 
> - If $ must be escaped as \$, then clearly \ must also be 
> escaped (probably \\).  Otherwise, it would be impossible to 
> insert a backslash followed by a captured substring.  For 
> example, replace($anything, "(.*)", "\$1") needs to be 
> replace($anything, "(.*)", "\\$1")

Yes, it would seem so.
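
The same need arises in other replacement syntaxes. As a sketch, assuming Python's re module, where the group reference is \1 and a literal backslash in the replacement must itself be written as \\:

```python
import re

# r"\\\1" reads as: an escaped backslash, then a reference to group 1.
# count=1 limits the call to the first match, so the zero-length match
# that (.*) can also make at the end of the input is not replaced.
result = re.sub(r"(.*)", r"\\\1", "abc", count=1)
print(result)  # \abc
```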
> 
> ***
> *** Questions/comments on 6.4.16.1:
> ***
> 
> - What's considered a "newline character" for the purpose of 
> ^$. matching?  \r? \n? \r\n? (which isn't a character, but a sequence)

I think x0A (line feed) only. Users have to try quite hard to get any other
sequence through the XML parser.
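
An x0A-only rule would not be unusual. As an illustration, and assuming Python's re module rather than our own dialect, line anchors and the dot treat only line feed as a newline there:

```python
import re

text = "a\nb\rc"

# In MULTILINE mode ^ and $ match only around "\n"; "\r" is an ordinary
# character, so "b\rc" is a single line that [a-z]+ cannot span.
lines = re.findall(r"^[a-z]+$", text, flags=re.MULTILINE)
print(lines)  # ['a']

# "." matches carriage return but not line feed.
dot_matches_cr = re.fullmatch(r".", "\r") is not None
dot_matches_lf = re.fullmatch(r".", "\n") is not None
print(dot_matches_cr, dot_matches_lf)  # True False
```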
> 
> -  The additional meta-characters change what is considered a 
> "normal character" in the regular expression.  So in addition 
> to modifying the XML Schema quantifier production (4), you 
> also want to modify the Char production (10).

Noted.
> 
> I note in passing that the XML Schema spec appendix F contains two
> errors:  The definition of metacharacter omits the vertical 
> bar | (which is properly accounted for in the Char 
> production), while the Char production omits the curly brace 
> metacharacters { and } (which are properly accounted for in 
> the metacharacter definition).  Oops.
> 
> Furthermore, the XML Schema regexp grammar allows for 
> expressions like "|" and "()|()".  This is possibly an error. 
>  (Both branches to the choice are allowed to be empty, 
> because branch ::= piece*.  Similarly, parentheses can wrap 
> the empty string.)
> 
> - Because the XML Schema grammar for regexps is flawed, and 
> you're using only a small part of it unmodified anyway, it's 
> probably best to completely define your (corrected, modified) 
> regexp grammar here.

I'm reluctant to do this. Looking at the specs for regular expressions in
Perl, Java, and Schema shows how difficult it is to do it well; I think we
would be rather arrogant to assume we can do it better if we do it all
ourselves, and I would rather avoid the risk of creating accidental
differences from the XML Schema specification.
> 
> -  "The effect of [reluctant quantifiers] is that the regular 
> expression matches the shortest possible substring 
> (consistent with the match as a whole succeeding)."  I think 
> this parenthetical statement should not be parenthetical, 
> because it significantly affects the behavior of the 
> reluctant quantifier.

Reasonable comment, though I think it's editorial.
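
The effect of that qualification is easy to see in an existing engine. A sketch, again assuming Python's re module purely for illustration:

```python
import re

# On its own, the reluctant quantifier stops after a single "x".
unanchored = re.search(r"x+?", "xxx").group()

# With "$" after it, the match as a whole can only succeed if the
# quantifier extends to the end of the input.
anchored = re.search(r"x+?$", "xxx").group()

print(unanchored)  # x
print(anchored)    # xxx
```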
> 
> 
> ***
> *** Questions/comments on 6.4.19
> ***
> 
> - I suppose you know that the escaping rules differ for each 
> URI part? Section 2.4.2 of RFC 2396 might be illuminating.  
> I'm not sure
> escape-uri() is useful as-is.  Should probably be something 
> more along the lines of construct-uri(part1 string?, part2 
> string?, ...).

Yes, URI escaping is a highly fraught subject. But I think that the function
we have defined here will meet the common use cases.
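
The point about escaping differing by URI part can be illustrated with a sketch. It assumes Python's urllib.parse.quote, which is not the escape-uri() function under discussion; the safe argument stands in for the per-part choice of which characters to leave alone.

```python
from urllib.parse import quote

segment = "a/b c"

# Treated as a path, "/" is a delimiter and is left unescaped.
as_path = quote(segment, safe="/")

# Treated as a single opaque value, "/" must be escaped as well.
as_value = quote(segment, safe="")

print(as_path)   # a/b%20c
print(as_value)  # a%2Fb%20c
```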
> 
> - Consider adding functions for XML entitization/de-entitization
> (suggested: entitize(string?) string? and unentitize(string?) 
> string?). I suppose I can cobble together the same 
> functionality by going through a dummy element constructor, 
> but I think this functionality is more central to XQuery than 
> URI escaping.

Sorry, what is "XML entitization" supposed to do?

Michael Kay
Received on Monday, 9 December 2002 12:57:55 UTC