Re: ACTION A-645-07: schema for serialization parameters

On Jun 24, 2016, at 6:40 AM, Abel Braaksma wrote:

> I'm sending this with HTML layout in the hope it is better readable.
>  
> Currently, with this latest version, I tested a few  scenarios, primarily with method/@value, here are the -- sometimes surprising -- results:

Thank you very much for these test cases.

>  
> Red: either incorrectly valid or incorrectly invalid
> Orange: correct results but for incorrect reasons
> Green: correctly valid or correctly invalid
>  
> Note: Red and orange can also mean that the result is unexpected but the validators pass the XSD correctly, in that case I added an "*", which means it is likely a bug in the XSD.
>  
> 0) xml:id
> MSXML.NET: chokes on xml:id, so I removed that (should we include that? Some WG's do in their schemas, I believe)

I think you are saying that MSXML.NET chokes on the xml:id attributes
included in the test document I prepared.  Is that correct?

If it does, then I think it's in error.  I'll spare everyone the
details, because on closer examination I suspect you mean that
MSXML.NET chokes on some but not all of the xml:id attributes; a
copy/paste error led to the ID "json-i-v-EQ1" occurring on three
elements. 

Can you clarify? Thanks.

Of course, we might wish not just to allow xml:id but to 
include the schema for the XML namespace; I'm agnostic on
that.

>  
> 1) spaces in EQName URI
> value="Q{ http://example.com/nss/foo }bar"
> LibXML: valid
> MSXML4: invalid (for incorrect reasons)
> MSXML.NET: invalid
> Saxon-EE: valid

This is the value on elements method-v-EQ2 and json-v-EQ2, correct?
(I put the IDs into the test so that it would be easier to discuss
specific values, without having to do a character-by-character
comparison between literals.)

My reply seems to have lost the formatting, so I should specify that
here you label the "invalid" results correct.

Why do you believe this is (or should be) invalid?  I read the grammar
of XPath 3.1 as saying it should be valid.  The relevant productions
are[1]

[117] URIQualifiedName ::= BracedURILiteral NCName /* ws: explicit */
[118] BracedURILiteral ::= "Q" "{" [^{}]* "}" /* ws: explicit */

[1] http://www.w3.org/XML/Group/qtspecs/specifications/xquery-31/html/xpath-31-diff.html#prod-xpath31-URIQualifiedName

The "ws: explicit" rule says that "the EBNF notation explicitly
notates, with S or otherwise, where whitespace characters are
allowed."  Here, the rule says that any characters other than "{" and
"}" are allowed within the braces; I think that includes the blanks
(U+0020 in this case, but the same logic applies to other whitespace
characters) found in this literal.

I think allowing whitespace within the braces is probably a
mistake, since it's not allowed in URIs or IRIs, but I was trying to
match the grammar, not improve it.
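
As a minimal sketch of that reading, productions [117] and [118] can be
transcribed almost literally into an XSD pattern.  The type name below is
hypothetical, and this is not claimed to be the pattern in the schema
under discussion; it also relies on character class subtraction, which
MSXML4 reportedly does not handle.

```xml
<!-- Hypothetical type name; a literal transcription of [117] and [118]. -->
<xs:simpleType name="URIQualifiedName-sketch">
  <xs:restriction base="xs:token">
    <!-- "Q" "{" [^{}]* "}" followed by an NCName (a Name without ":") -->
    <xs:pattern value="Q\{[^\{\}]*\}[\i-[:]][\c-[:]]*"/>
  </xs:restriction>
</xs:simpleType>
```

Under this pattern a value with blanks inside the braces, such as the one
on method-v-EQ2, matches, while a value with a second "{" or "}" does not.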


>  
> 2) invalid EQName, extra "}"
> value="Q{http://example.com/nss/foo}}bar"
> LibXML: valid
> MSXML4: invalid (for incorrect reasons)
> MSXML.NET: valid
> Saxon-EE: valid

I agree that this value is not allowed by the grammar in XPath
and should be made invalid by the schema.

> 3) spaces within URI
> value="Q{http://e xample.com/nss/foo}bar"
> LibXML: valid
> MSXML4: invalid (for incorrect reasons)
> MSXML.NET: valid
> Saxon-EE: valid

I believe (but have not checked within the last several years)
that the current specs for URIs and IRIs do not allow whitespace
within either.  So I agree that in principle this should probably
be disallowed.

But it's currently allowed by the XPath spec, unless I have
missed something, so I did not try to make the schema 
disallow it.

>  
> 4) invalid EQName, double starting {{
> value="Q{{http://example.com/nss/foo}bar"
> LibXML: valid
> MSXML4: invalid (for incorrect reasons)
> MSXML.NET: valid
> Saxon-EE: valid

As for 2 above.

>  
> 5) invalid NCName part, wrong start-char
> value="Q{http://example.com/nss/foo}-bar"
> LibXML: invalid
> MSXML4: valid
> MSXML.NET: invalid
> Saxon-EE: invalid

Agreed that this is and should be invalid.

>  
> 6) url-escaped URI (should be allowed)
> value="Q{http://e%20xample.com/nss/foo}bar"
> LibXML: valid
> MSXML4: valid
> MSXML.NET: valid
> Saxon-EE: valid

Agreed that this is now and should be valid.

>  
> 7) missing NCName part
> value="Q{http://example.com/nss/foo}"
> LibXML: invalid
> MSXML4: invalid
> MSXML.NET: invalid
> Saxon-EE: invalid

Agreed that this is and should be invalid.

>  
> 8) no-namespace EQName (in "method", this should only be variants of "Q{}html", i.e. the allowed defaults)
> value="Q{}html"
> LibXML: invalid*
> MSXML4: invalid*
> MSXML.NET: invalid*
> Saxon-EE: invalid*

I do not believe the spec intends for this to be valid.  Perhaps I'm
wrong; I will have to reread the text.

Perhaps it should.

If it should be valid, is this a small enough change to
make at this point?  Or is it too late?

>  
> 9) no-namespace EQName with spaces
> value="Q{  }html"
> LibXML: invalid* (for wrong reasons, missing enum)
> MSXML4: invalid* (for wrong reasons, missing enum)
> MSXML.NET: invalid* (for wrong reasons, missing enum)
> Saxon-EE: invalid* (for wrong reasons, missing enum)

As for 8 and 1.

>  
> 10) empty value for method
> value=""
> LibXML: invalid
> MSXML4: invalid
> MSXML.NET: invalid
> Saxon-EE: invalid

I agree that this is and should be invalid.

>  
> Findings:
> - MSXML4 chokes on subtracting regexes, i.e. "[\c-[:]]", fixing that by adding a hierarchy resolve it for MSXML4

Others will know better than I; is MSXML 4 currently in wide use?

Actually, I suppose my instinct is to try to make the schema work
with it, even if it's not known to be currently in wide use.  Your
sketches show a reasonably simple way.  

> - The current expression "Q\{(.*)\}" can be made stricter to disallow whitespace and curlies, or remove allowed whitespace by deriving by restriction from a base type

Agreed as to the curly braces.  Unless we change the grammar of XPath,
I am not persuaded as to the whitespace.
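
One way to express that position without character class subtraction,
sketched here as an adaptation of the two-step definitions quoted below,
is to exclude braces from the Q{...} part but leave whitespace alone.
The patterns are an assumption for illustration and have not been run
against the four validators.

```xml
<xs:simpleType name="EQName-Base">
  <xs:restriction base="xs:token">
    <!-- no "{" or "}" inside Q{...}; whitespace is still permitted;
         the local part contains no colon -->
    <xs:pattern value="Q\{[^\{\}]*\}[^:]+"/>
  </xs:restriction>
</xs:simpleType>

<xs:simpleType name="EQName">
  <xs:restriction base="output:EQName-Base">
    <!-- the local part must be a Name -->
    <xs:pattern value="Q\{[^\{\}]*\}\i\c*"/>
  </xs:restriction>
</xs:simpleType>
```

Because pattern facets on different derivation steps are ANDed, the local
part has to be a Name with no colon in it, which is to say an NCName.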

> - the no-namespace EQNames that are allowed in method-type and json-node-output-method-type should be added

I'll have to think about this; I see a certain logic to it, but I don't see 
that logic in the spec.  Unless I am mistaken, this would require textual
changes to the serialization spec as well as to the schema.

> - should we include xml.xsd as we do in several other scenarios, to allow xml:id etc?

Agnostic. 

>  
> Proposal
> I propose a few minor changes that validate in all scenarios above and fixes a few bugs in the XSD:
>  
> I experimented with these two definitions (the base type is needed to remove the subtracting regex):
>  
>   <xs:simpleType name="EQName-Base">
>     <xs:restriction base="xs:token">
>       <xs:pattern value="Q\{\S*\}[^:]+"/>      
>       <xs:whiteSpace value="collapse"/>        
>     </xs:restriction>
>   </xs:simpleType>
>   
>   <xs:simpleType name="EQName">
>     <xs:restriction base="output:EQName-Base">
>       <xs:pattern value="Q\{[^\s\{\}]*\}[\i][\c]*"/>      
>       <xs:whiteSpace value="collapse"/>        
>     </xs:restriction>
>   </xs:simpleType>
>  
>  
> And
>  
>   <xs:simpleType name="EQName-Base">
>     <xs:restriction base="xs:token">
>       <xs:pattern value="Q\{[^\s\{\}]*\}[^:]+"/>      
>       <xs:whiteSpace value="collapse"/>        
>     </xs:restriction>
>   </xs:simpleType>
>   
>   <xs:simpleType name="EQName">
>     <xs:restriction base="output:EQName-Base">
>       <xs:pattern value="Q\{.*\}[\i][\c]*"/>      
>       <xs:whiteSpace value="collapse"/>        
>     </xs:restriction>
>   </xs:simpleType>

I might be inclined to make EQName-Base anonymous, but I
realize that some people find nesting in such contexts confusing
(and I'd develop the definitions using a named form, anonymizing
it at the end only because it's not a type that corresponds to
anything outside the schema; it's *just* an artefact of our attempt
to work around the bug in MSXML 4).
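
For illustration, the first pair of definitions quoted above might look
like this with the base type made anonymous.  This is only a sketch: the
patterns are copied from the quoted proposal, and whether MSXML 4 accepts
the nested form has not been checked.

```xml
<xs:simpleType name="EQName">
  <xs:restriction>
    <!-- anonymous base type, named EQName-Base in the quoted proposal -->
    <xs:simpleType>
      <xs:restriction base="xs:token">
        <xs:pattern value="Q\{\S*\}[^:]+"/>
        <xs:whiteSpace value="collapse"/>
      </xs:restriction>
    </xs:simpleType>
    <xs:pattern value="Q\{[^\s\{\}]*\}[\i][\c]*"/>
    <xs:whiteSpace value="collapse"/>
  </xs:restriction>
</xs:simpleType>
```

The two patterns still sit on separate derivation steps, so both must be
satisfied, just as in the named form.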

>  
> I think they are equivalent and should both cancel out spaces and contained "{" and "}", but the first correctly disallows the curlies according to all four validators, the second incorrectly disallows them.

I'm not sure I follow.

> Both correctly refuse spaces. It's off-topic, but I wonder whether I made a mistake here or whether you'd agree these are indeed equivalent (the base-type is never directly used).

They look equivalent to me, but I haven't tried to prove it.

> If we do not want this change (split EQName in base and derived) then we (only) lose compatibility with MSXML4. Other validators support subtracting regexes.

Michael

-- 
****************************************************************
* C. M. Sperberg-McQueen, Black Mesa Technologies LLC
* http://www.blackmesatech.com 
* http://cmsmcq.com/mib                 
* http://balisage.net
****************************************************************

Received on Saturday, 25 June 2016 02:11:57 UTC