
XSV fails on negative regexp (eg mimetype of MPEG7)

From: Heiko Studt <studt@fmi.uni-passau.de>
Date: Tue, 11 Sep 2007 12:03:43 +0200 (CEST)
To: xmlschema-dev@w3.org
Message-Id: <20070911100343.E100DCBCF0@tom.rz.uni-passau.de>

Hi,

XSV seems to fail on 'negated' regular expressions (character class
subtraction) in patterns, although XSD allows them. This breaks support
for the MPEG-7 mimeType (urn:mpeg:mpeg7:schema:2004).
Although the XSV documentation notes that some parts of RegExp are not
supported, this particular gap is not documented on its project page.


The failing pattern follows (copied out of MPEG7 V2):
---
<simpleType name="mimeType">
  <restriction base="string">
    <whiteSpace value="collapse"/>
    <pattern
      value='[&#x21;-&#x7f;-[\(\)&lt;&gt;@,;:\\"/\[\]\?=]]+/[&#x21;-&#x7f;-[\(\)&lt;&gt;@,;:\\"/\[\]\?=]]+'/>
  </restriction>
</simpleType>
---


With the pattern changed to the following, validation seems to work
again, but after sleeping on it I am not 100% sure whether it is
semantically the same. MIME is defined in RFC 2045 (section 5.1), but I
don't see any special handling of &#x21; ("!") there.
---
<simpleType name="mimeType">
  <restriction base="string">
    <whiteSpace value="collapse"/>
    <pattern
      value='(&#x21;|[^&#x7f;\(\)&lt;&gt;@,;:\\"/\[\]\?=])+/(&#x21;|[^&#x7f;\(\)&lt;&gt;@,;:\\"/\[\]\?=])+'/>
  </restriction>
</simpleType>
---
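
One way to check whether the two patterns agree is to try both on a few
sample strings. The following is only a rough sketch in Python, not XSV
or any schema validator. Python's re module has no character class
subtraction, so the [A-[B]] form is emulated here with a negative
lookahead, and XSD's implicit anchoring is approximated with fullmatch.
The names TSPECIALS, ORIGINAL, REWRITTEN and accepts are made up for
this illustration.

```python
import re

# tspecials from RFC 2045, written as the body of a character class.
TSPECIALS = r'()<>@,;:\\"/\[\]?='

# Assumption: [A-[B]] is emulated as "next char is not in B, then match A",
# because Python's re does not support XSD character class subtraction.
TOKEN_SUBTRACTION = rf'(?:(?![{TSPECIALS}])[\x21-\x7f])+'

# The rewritten form: "!" or any character outside the negated class.
TOKEN_NEGATED = rf'(?:!|[^\x7f{TSPECIALS}])+'

ORIGINAL = re.compile(f'{TOKEN_SUBTRACTION}/{TOKEN_SUBTRACTION}')
REWRITTEN = re.compile(f'{TOKEN_NEGATED}/{TOKEN_NEGATED}')


def accepts(pattern: re.Pattern, value: str) -> bool:
    """XSD patterns must match the whole value, hence fullmatch."""
    return pattern.fullmatch(value) is not None


samples = ["text/plain", "te xt/plain", "text/pl\x7fain", "text/\u00e9"]
for sample in samples:
    print(repr(sample), accepts(ORIGINAL, sample), accepts(REWRITTEN, sample))
```

Under these assumptions the two forms do not look equivalent: the
negated class also admits SPACE, control and non-ASCII characters, while
the original range admits &#x7f; (DEL), which the rewrite excludes.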


RFC 2045 MIME (Part 1):
---
5.1.  Syntax of the Content-Type Header Field

   In the Augmented BNF notation of RFC 822, a Content-Type header field
   value is defined as follows:

     content := "Content-Type" ":" type "/" subtype
                *(";" parameter)
                ; Matching of media type and subtype
                ; is ALWAYS case-insensitive.

     type := discrete-type / composite-type

     discrete-type := "text" / "image" / "audio" / "video" /
                      "application" / extension-token

     composite-type := "message" / "multipart" / extension-token

     extension-token := ietf-token / x-token

     ietf-token := <An extension token defined by a
                    standards-track RFC and registered
                    with IANA.>

     x-token := <The two characters "X-" or "x-" followed, with
                 no intervening white space, by any token>

     subtype := extension-token / iana-token

     iana-token := <A publicly-defined extension token. Tokens
                    of this form must be registered with IANA
                    as specified in RFC 2048.>

     parameter := attribute "=" value

     attribute := token
                  ; Matching of attributes
                  ; is ALWAYS case-insensitive.

     value := token / quoted-string

     token := 1*<any (US-ASCII) CHAR except SPACE, CTLs,
                 or tspecials>

     tspecials :=  "(" / ")" / "<" / ">" / "@" /
                   "," / ";" / ":" / "\" / <">
                   "/" / "[" / "]" / "?" / "="
                   ; Must be in quoted-string,
                   ; to use within parameter values
---
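
For reference, the token rule quoted above can be spelled out as a set
of allowed characters. This is a minimal sketch in Python, assuming the
RFC 822 definitions that RFC 2045 builds on (CHAR is US-ASCII 0-127,
CTLs are 0-31 plus DEL at 127). The names TSPECIALS, TOKEN_CHARS and
is_token are hypothetical.

```python
# tspecials as listed in RFC 2045, section 5.1.
TSPECIALS = set('()<>@,;:\\"/[]?=')

# token characters: US-ASCII CHAR (0-127) except SPACE (32), CTLs (0-31
# and DEL, 127) and tspecials. That leaves 0x21..0x7e minus tspecials.
TOKEN_CHARS = {chr(code) for code in range(0x21, 0x7F)} - TSPECIALS


def is_token(value: str) -> bool:
    """Return True if value is a non-empty token in the RFC 2045 sense."""
    return len(value) > 0 and all(ch in TOKEN_CHARS for ch in value)


print(is_token("text"), is_token("x-custom"), is_token("a b"), is_token("a/b"))
```

Read this way, "!" (&#x21;) needs no special handling: it is simply the
first character after SPACE and therefore an ordinary token character.
The sketch also suggests that DEL (&#x7f;) is a CTL and thus not a token
character, although the MPEG-7 range ends at &#x7f;.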


Going by the example at http://www.w3.org/TR/xmlschema-2/#rf-pattern,
the "-" syntax might act as a negation (although that seems unlikely
given http://www.w3.org/TR/xmlschema-2/#charcter-classes).
---
<simpleType name='better-us-zipcode'>
  <restriction base='string'>
    <pattern value='[0-9]{5}(-[0-9]{4})?'/>
  </restriction>
</simpleType>
---


A simple fix for this part (though I can't tell whether charClassSub
will still work afterwards) might be to replace every -[ with [^ when it
is preceded by [ or (. This will not solve the MPEG-7 issue completely.
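
To see what such a textual replacement would do, here is a small Python
sketch. It is an assumption about how the replacement could be written,
not code from XSV. naive_rewrite is a hypothetical helper, and Python's
re stands in for the XSD regex engine, which is adequate only for the
simple zip code pattern.

```python
import re


def naive_rewrite(pattern: str) -> str:
    """Replace "-[" with "[^" wherever it directly follows "[" or "("."""
    return re.sub(r'(?<=[\[(])-\[', '[^', pattern)


# Patterns as they appear in the schema sources (entities left undecoded).
ZIP = r'[0-9]{5}(-[0-9]{4})?'
MPEG7_TOKEN = r'[&#x21;-&#x7f;-[\(\)&lt;&gt;@,;:\\"/\[\]\?=]]+'

rewritten_zip = naive_rewrite(ZIP)
rewritten_mpeg7 = naive_rewrite(MPEG7_TOKEN)

print(rewritten_zip)
print(rewritten_mpeg7 == MPEG7_TOKEN)
print(re.fullmatch(ZIP, "12345-6789") is not None)
print(re.fullmatch(rewritten_zip, "12345-6789") is not None)
```

In the zip code pattern the hyphen sits outside any character class, so
it is a literal "-" rather than a subtraction. The replacement would
turn it into a negated class and the pattern would then reject
"12345-6789". The MPEG-7 token is left untouched because its "-["
follows the end of the range, not "[" or "(".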


Is any of this correct, already known, or perhaps even solved somewhere?


-- 
Best regards,
Hopefully I have clearly written down everything needed.
Heiko Studt <studt@fmi.uni-passau.de>
Received on Tuesday, 11 September 2007 11:38:09 UTC
