Re: XML Schema Standard 1.0 - W3C - Regular Expressions - Patterns

On 15 Dec 2006, at 05:20 , CARCHESIO GIOVANNI wrote:

[I notice that your mail was sent in December, although owing to human
error it only appeared on the comments list the other day.  Since
the human error was mine, I can only apologize.]

 > I’m currently working on XML schemas to be used for a financial
 > application.

 > There is a specific requirement aiming to provide a data type where:
 >
 > 1) all strings must be length at least 1 and at most 35;
 > 2) the sequence “//” is not allowed within the first 16 characters
 >    of each string;
 > 3) the sequence “//” is allowed within the substring [17..35] of
 >    each string.

Thank you; this is an interesting problem in the construction of
regular expressions.  The language you describe is clearly regular,
but conventional regular expressions, including those of XML Schema,
are a little awkward in handling it, and the regular expression is
likely to be rather complex.

It is, by contrast, easy to state with assertions of the kind
supported by Schematron, or by XML Schema 1.1, and if simplicity of
expression is your primary goal, you may wish to say

   <assert test="not(contains(substring(.,1,16),'//'))"/>
   <assert test="length(.) > 0"/>
   <assert test="length(.) &lt; 36"/>

and be done with it.

But it's important to point out that the language you describe can in
fact be described using the pattern feature of XML Schema, in at least
two different ways.

For simplicity, I am going to work first on a simpler problem: define
a regular expression accepting strings of 'a' and 'b', of length
between 1 and 8, in which the string 'aa' must not appear within the
first 4 characters.  The string 'aa' may occur in positions 4 and 5,
since that's not withint the first 4 characters.  (You don't specify
whether '//' can appear in positions 16 and 17 of your strings; I am
assuming that it may.  Otherwise, modify the example below to suit.)

Once this toy problem is solved, it will be simple to extend to your
problem ('a' becomes '/', 'b' becomes '[^/]', and 4 and 8 become 16
and 35, respectively.

1 Brute force

First, consider a brute-force approach.  We construct a finite-state
automaton, with states s0, s1a, s1b, s2a, s2b, s3a, s3b, s4a, s4b,
s5x, s6x, s7x, s8x.  State s0 is the start state; every state except
s0 is an accept state.  From any state with the number n in it,
transitions go only to states with the number n+1 in it.  From an 'a'
state, transitions go only to the next 'b' or 'x' state; from a 'b'
state, they go to any kind of state ('a', 'b', or 'x').  If this ASCII
art is legible, it may make the pattern clear.  I can't draw arrows,
so know that / always points up and \ always points down and -- always
points right.  Each arrow pointing to an 'a' state should be labeled
'a', each arrow pointing to a 'b' state should be labeled 'b', each
arrow pointing to an 'x' state should be labeled '[ab]'.

     s1a  s2a  s3a  s4a
    /   \/   \/   \/   \
   s    /\   /\   /\   s5x--s6x--s7x--s8x
    \  /  \ /  \ /  \ /
     s1b--s2b--s3b--s4b

[If the diagram doesn't line up, change to a monospace font!]

 From this finite state automaton, we can construct a regular
expression in the usual way, by eliminating one state at a time and
combining the labels on adjacent arcs appropriately.

For example, after eliminating states s5x, s6x, and s7x, we have

                           [ab]{0,4}
     s1a  s2a  s3a  s4a--------------
    /   \/   \/   \/                 \
   s    /\   /\   /\                 s8x
    \  /  \ /  \ /  \                /
     s1b--s2b--s3b--s4b--------------
                           [ab]{0,4}

After eliminating several more states, we have

        (b((a(b[ab]{0,4})?)?|(b[ab]{0,5})?))?
     s1a---------------------------------------
    /                                          \
   s                                           s8x
    \                                          /
     s1b---------------------------------------
        (a(b[ab]{0,5})?)?|(b((a(b[ab]{0,4})?)?)|((b[ab]{0,5})?))?

And after eliminating the remaing states, we have the following regex
for the language (I've added white space for legibility, but if you
try this in a schema document remember that you need to get rid
of the whitespace first).

       (a(b((a(b[ab]{0,4})?)?|(b[ab]{0,5})?))?)?
       |(b(a(b[ab]{0,5})?)?|(b((a(b[ab]{0,4})?)?)|((b[ab]{0,5})?))?)?

The same method can be applied to your problem, but I leave that as an
exercise for the reader, because I don't have the patience to work it
out by hand, and if I start to try to automate the process I'll have a
lot of fun but I won't get anything else done for several days.  If
you try this approach, notice that the regular expression you get may
vary depending on which states you eliminate first.  You may wish to
use a regular expression / FSA toolkit to allow you to describe the
FSA and have the software generate the regex for you.  (The biggest
problem with this is that the regex languages supported by the tools I
know, like Grail and FSM, are not the same as that supported by XML
Schema, so you'll need to translate from one regex notation to
another.)


2 Logical conjunction

Another approach may feel less cumbersome.  The toy language we are
defining can be described as the set of strings consisting of a and b,
where:

   1 the string is between 1 and 16 characters long, AND
   2 the string 'aa' does not appear in positions 1-2, AND
   3 the string 'aa' does not appear in positions 2-3, AND
   4 the string 'aa' does not appear in positions 3-4.

Conditions 2-4 can be rephrased:

   1 the string is between 1 and 16 characters long, AND
   2 either string.1 is 'b' or string.2 is 'b',
     or the string has length < 2, AND
   3 either string.2 is 'b' or string.3 is 'b',
     or the string has length < 3, AND
   4 either string.3 is 'b' or string.4 is 'b'.
     or the string has length < 4.

Each of these conditions is easy to capture with a regular expression:

   1 [ab]{1,16}
   2 b[ab]* | [ab]b[ab]* | [ab]
   3 [ab]b[ab]* | [ab]{2}b[ab]* | [ab]{0,2}
   4 [ab]{2}b[ab]* | [ab]{3}b[ab]* | [ab]{0,3}

And we can express the conjunction of these regular expressions by
defining a chain of derivations, imposing each regex in a different
step.  (Patterns imposed on different steps are ANDed together.)

Applying this pattern to your specific problem, one way to define a
type which captures your requirements as I understand them is the
following.  We define a chain of types, t01 (which accepts any string
that doesn't have '//' beginning in position 1), t02 (which restricts
t01 by requring that the string also not have '//' beginning in
position 2), through t15 (which adds a constraint forbidding '//'
beginning in position 15), until we reach the type we care about,
which restricts t15 by adding the constraint that the string
must be between 1 and 35 characters long.

<xsd:simpleType name="t01">
   <xsd:restriction base="xsd:string">
    <xsd:pattern value="[^/].*|.{1}[^/].*"/>
   </xsd:restriction>
</xsd:simpleType>
<xsd:simpleType name="t02">
   <xsd:restriction base="this:t01">
    <xsd:pattern value=".{1}[^/].*|.{2}[^/].*|.{0,1}"/>
   </xsd:restriction>
</xsd:simpleType>
<xsd:simpleType name="t03">
   <xsd:restriction base="this:t02">
    <xsd:pattern value=".{2}[^/].*|.{3}[^/].*|.{0,2}"/>
   </xsd:restriction>
</xsd:simpleType>
<xsd:simpleType name="t04">
   <xsd:restriction base="this:t03">
    <xsd:pattern value=".{3}[^/].*|.{4}[^/].*|.{0,3}"/>
   </xsd:restriction>
</xsd:simpleType>
<xsd:simpleType name="t05">
   <xsd:restriction base="this:t04">
    <xsd:pattern value=".{4}[^/].*|.{5}[^/].*|.{0,4}"/>
   </xsd:restriction>
</xsd:simpleType>
<xsd:simpleType name="t06">
   <xsd:restriction base="this:t05">
    <xsd:pattern value=".{5}[^/].*|.{6}[^/].*|.{0,5}"/>
   </xsd:restriction>
</xsd:simpleType>
<xsd:simpleType name="t07">
   <xsd:restriction base="this:t06">
    <xsd:pattern value=".{6}[^/].*|.{7}[^/].*|.{0,6}"/>
   </xsd:restriction>
</xsd:simpleType>
<xsd:simpleType name="t08">
   <xsd:restriction base="this:t07">
    <xsd:pattern value=".{7}[^/].*|.{8}[^/].*|.{0,7}"/>
   </xsd:restriction>
</xsd:simpleType>
<xsd:simpleType name="t09">
   <xsd:restriction base="this:t08">
    <xsd:pattern value=".{8}[^/].*|.{9}[^/].*|.{0,8}"/>
   </xsd:restriction>
</xsd:simpleType>
<xsd:simpleType name="t10">
   <xsd:restriction base="this:t09">
    <xsd:pattern value=".{9}[^/].*|.{10}[^/].*|.{0,9}"/>
   </xsd:restriction>
</xsd:simpleType>
<xsd:simpleType name="t11">
   <xsd:restriction base="this:t10">
    <xsd:pattern value=".{10}[^/].*|.{11}[^/].*|.{0,10}"/>
   </xsd:restriction>
</xsd:simpleType>
<xsd:simpleType name="t12">
   <xsd:restriction base="this:t11">
    <xsd:pattern value=".{11}[^/].*|.{12}[^/].*|.{0,11}"/>
   </xsd:restriction>
</xsd:simpleType>
<xsd:simpleType name="t13">
   <xsd:restriction base="this:t12">
    <xsd:pattern value=".{12}[^/].*|.{13}[^/].*|.{0,12}"/>
   </xsd:restriction>
</xsd:simpleType>
<xsd:simpleType name="t14">
   <xsd:restriction base="this:t13">
    <xsd:pattern value=".{13}[^/].*|.{14}[^/].*|.{0,13}"/>
   </xsd:restriction>
</xsd:simpleType>
<xsd:simpleType name="t15">
   <xsd:restriction base="this:t14">
    <xsd:pattern value=".{14}[^/].*|.{15}[^/].*|.{0,14}"/>
   </xsd:restriction>
</xsd:simpleType>
<xsd:simpleType name="t0">
   <xsd:restriction base="this:t15">
    <xsd:pattern value=".{1,35}"/>
   </xsd:restriction>
</xsd:simpleType>

The solution just given has the drawback that it litters the namespace
with a lot of types which only partially capture the requirements and
which should never be used for elements or attributes.  We can clean
that up a bit by making all but but the one type we care about
anonymous.

<xsd:simpleType name="T">
   <xsd:restriction>
    <xsd:simpleType><xsd:restriction>
    <xsd:simpleType><xsd:restriction>
    <xsd:simpleType><xsd:restriction>
    <xsd:simpleType><xsd:restriction>
    <xsd:simpleType><xsd:restriction>
    <xsd:simpleType><xsd:restriction>
    <xsd:simpleType><xsd:restriction>
    <xsd:simpleType><xsd:restriction>
    <xsd:simpleType><xsd:restriction>
    <xsd:simpleType><xsd:restriction>
    <xsd:simpleType><xsd:restriction>
    <xsd:simpleType><xsd:restriction>
    <xsd:simpleType><xsd:restriction>
    <xsd:simpleType><xsd:restriction>
    <xsd:simpleType>
     <xsd:restriction base="xsd:string">
      <xsd:pattern value="[^/].*|.{1}[^/].*"/>
     </xsd:restriction>
    </xsd:simpleType>
      <xsd:pattern value=".{1}[^/].*|.{2}[^/].*|.{0,1}"/>
     </xsd:restriction>
    </xsd:simpleType>
      <xsd:pattern value=".{2}[^/].*|.{3}[^/].*|.{0,2}"/>
     </xsd:restriction>
    </xsd:simpleType>
      <xsd:pattern value=".{3}[^/].*|.{4}[^/].*|.{0,3}"/>
     </xsd:restriction>
    </xsd:simpleType>
      <xsd:pattern value=".{4}[^/].*|.{5}[^/].*|.{0,4}"/>
     </xsd:restriction>
    </xsd:simpleType>
      <xsd:pattern value=".{5}[^/].*|.{6}[^/].*|.{0,5}"/>
     </xsd:restriction>
    </xsd:simpleType>
      <xsd:pattern value=".{6}[^/].*|.{7}[^/].*|.{0,6}"/>
     </xsd:restriction>
    </xsd:simpleType>
      <xsd:pattern value=".{7}[^/].*|.{8}[^/].*|.{0,7}"/>
     </xsd:restriction>
    </xsd:simpleType>
      <xsd:pattern value=".{8}[^/].*|.{9}[^/].*|.{0,8}"/>
     </xsd:restriction>
    </xsd:simpleType>
      <xsd:pattern value=".{9}[^/].*|.{10}[^/].*|.{0,9}"/>
     </xsd:restriction>
    </xsd:simpleType>
      <xsd:pattern value=".{10}[^/].*|.{11}[^/].*|.{0,10}"/>
     </xsd:restriction>
    </xsd:simpleType>
      <xsd:pattern value=".{11}[^/].*|.{12}[^/].*|.{0,11}"/>
     </xsd:restriction>
    </xsd:simpleType>
      <xsd:pattern value=".{12}[^/].*|.{13}[^/].*|.{0,12}"/>
     </xsd:restriction>
    </xsd:simpleType>
      <xsd:pattern value=".{13}[^/].*|.{14}[^/].*|.{0,13}"/>
     </xsd:restriction>
    </xsd:simpleType>
      <xsd:pattern value=".{14}[^/].*|.{15}[^/].*|.{0,14}"/>
     </xsd:restriction>
    </xsd:simpleType>
    <xsd:pattern value=".{1,35}"/>
   </xsd:restriction>
</xsd:simpleType>

(It rather makes one wish the XML Schema WG had decided to make
multiple xsd:pattern elements in the same restriction step be ANDed
together instead of being ORed together, doesn't it?  We could have
saved sixty lines or so in these definitions.)

As far as I can tell, either of these types (t0 or T) satisfies the
requirements you outlined.  The following elements are accepted:

<my:T>teststring</my:T>
<my:T>123456789.123456789.123456789.12345</my:T>
<my:T>123456789.123456//themorethemerrier</my:T>
<my:T>123456789.123456//themore////</my:T>

and the following are rejected as not matching the required
patterns.

<my:T>123456789.123456//themorethemerrier-but-this-is-too-long</my:T>
<my:T>1//456789.123456//themerrier////</my:T>
<my:T>123456///.123456//themore////</my:T>
<my:T>12345//89.123456//themerrier////</my:T>
<my:T>123456789.//3456//themore////</my:T>
<my:T>123456789.12345//6themerrier////</my:T>

Thank you for the problem; I've enjoyed working out these solutions.
I hope they help.

--C. M. Sperberg-McQueen

Received on Thursday, 8 February 2007 23:24:57 UTC