- From: C. M. Sperberg-McQueen <cmsmcq@acm.org>
- Date: Thu, 8 Feb 2007 16:25:06 -0700
- To: CARCHESIO GIOVANNI <Giovanni.carchesio@bancaditalia.it>
- Cc: "C. M. Sperberg-McQueen" <cmsmcq@acm.org>, <www-xml-schema-comments@w3.org>
On 15 Dec 2006, at 05:20 , CARCHESIO GIOVANNI wrote: [I notice that your mail was sent in December, although owing to human error it only appeared on the comments list the other day. Since the human error was mine, I can only apologize.] > I’m currently working on XML schemas to be used for a financial > application. > There is a specific requirement aiming to provide a data type where: > > 1) all strings must be length at least 1 and at most 35; > 2) the sequence “//” is not allowed within the first 16 characters > of each string; > 3) the sequence “//” is allowed within the substring [17..35] of > each string. Thank you; this is an interesting problem in the construction of regular expressions. The language you describe is clearly regular, but conventional regular expressions, including those of XML Schema, are a little awkward in handling it, and the regular expression is likely to be rather complex. It is, by contrast, easy to state with assertions of the kind supported by Schematron, or by XML Schema 1.1, and if simplicity of expression is your primary goal, you may wish to say <assert test="not(contains(substring(.,1,16),'//'))"/> <assert test="length(.) > 0"/> <assert test="length(.) < 36"/> and be done with it. But it's important to point out that the language you describe can in fact be described using the pattern feature of XML Schema, in at least two different ways. For simplicity, I am going to work first on a simpler problem: define a regular expression accepting strings of 'a' and 'b', of length between 1 and 8, in which the string 'aa' must not appear within the first 4 characters. The string 'aa' may occur in positions 4 and 5, since that's not withint the first 4 characters. (You don't specify whether '//' can appear in positions 16 and 17 of your strings; I am assuming that it may. Otherwise, modify the example below to suit.) Once this toy problem is solved, it will be simple to extend to your problem ('a' becomes '/', 'b' becomes '[^/]', and 4 and 8 become 16 and 35, respectively. 1 Brute force First, consider a brute-force approach. We construct a finite-state automaton, with states s0, s1a, s1b, s2a, s2b, s3a, s3b, s4a, s4b, s5x, s6x, s7x, s8x. State s0 is the start state; every state except s0 is an accept state. From any state with the number n in it, transitions go only to states with the number n+1 in it. From an 'a' state, transitions go only to the next 'b' or 'x' state; from a 'b' state, they go to any kind of state ('a', 'b', or 'x'). If this ASCII art is legible, it may make the pattern clear. I can't draw arrows, so know that / always points up and \ always points down and -- always points right. Each arrow pointing to an 'a' state should be labeled 'a', each arrow pointing to a 'b' state should be labeled 'b', each arrow pointing to an 'x' state should be labeled '[ab]'. s1a s2a s3a s4a / \/ \/ \/ \ s /\ /\ /\ s5x--s6x--s7x--s8x \ / \ / \ / \ / s1b--s2b--s3b--s4b [If the diagram doesn't line up, change to a monospace font!] From this finite state automaton, we can construct a regular expression in the usual way, by eliminating one state at a time and combining the labels on adjacent arcs appropriately. For example, after eliminating states s5x, s6x, and s7x, we have [ab]{0,4} s1a s2a s3a s4a-------------- / \/ \/ \/ \ s /\ /\ /\ s8x \ / \ / \ / \ / s1b--s2b--s3b--s4b-------------- [ab]{0,4} After eliminating several more states, we have (b((a(b[ab]{0,4})?)?|(b[ab]{0,5})?))? s1a--------------------------------------- / \ s s8x \ / s1b--------------------------------------- (a(b[ab]{0,5})?)?|(b((a(b[ab]{0,4})?)?)|((b[ab]{0,5})?))? And after eliminating the remaing states, we have the following regex for the language (I've added white space for legibility, but if you try this in a schema document remember that you need to get rid of the whitespace first). (a(b((a(b[ab]{0,4})?)?|(b[ab]{0,5})?))?)? |(b(a(b[ab]{0,5})?)?|(b((a(b[ab]{0,4})?)?)|((b[ab]{0,5})?))?)? The same method can be applied to your problem, but I leave that as an exercise for the reader, because I don't have the patience to work it out by hand, and if I start to try to automate the process I'll have a lot of fun but I won't get anything else done for several days. If you try this approach, notice that the regular expression you get may vary depending on which states you eliminate first. You may wish to use a regular expression / FSA toolkit to allow you to describe the FSA and have the software generate the regex for you. (The biggest problem with this is that the regex languages supported by the tools I know, like Grail and FSM, are not the same as that supported by XML Schema, so you'll need to translate from one regex notation to another.) 2 Logical conjunction Another approach may feel less cumbersome. The toy language we are defining can be described as the set of strings consisting of a and b, where: 1 the string is between 1 and 16 characters long, AND 2 the string 'aa' does not appear in positions 1-2, AND 3 the string 'aa' does not appear in positions 2-3, AND 4 the string 'aa' does not appear in positions 3-4. Conditions 2-4 can be rephrased: 1 the string is between 1 and 16 characters long, AND 2 either string.1 is 'b' or string.2 is 'b', or the string has length < 2, AND 3 either string.2 is 'b' or string.3 is 'b', or the string has length < 3, AND 4 either string.3 is 'b' or string.4 is 'b'. or the string has length < 4. Each of these conditions is easy to capture with a regular expression: 1 [ab]{1,16} 2 b[ab]* | [ab]b[ab]* | [ab] 3 [ab]b[ab]* | [ab]{2}b[ab]* | [ab]{0,2} 4 [ab]{2}b[ab]* | [ab]{3}b[ab]* | [ab]{0,3} And we can express the conjunction of these regular expressions by defining a chain of derivations, imposing each regex in a different step. (Patterns imposed on different steps are ANDed together.) Applying this pattern to your specific problem, one way to define a type which captures your requirements as I understand them is the following. We define a chain of types, t01 (which accepts any string that doesn't have '//' beginning in position 1), t02 (which restricts t01 by requring that the string also not have '//' beginning in position 2), through t15 (which adds a constraint forbidding '//' beginning in position 15), until we reach the type we care about, which restricts t15 by adding the constraint that the string must be between 1 and 35 characters long. <xsd:simpleType name="t01"> <xsd:restriction base="xsd:string"> <xsd:pattern value="[^/].*|.{1}[^/].*"/> </xsd:restriction> </xsd:simpleType> <xsd:simpleType name="t02"> <xsd:restriction base="this:t01"> <xsd:pattern value=".{1}[^/].*|.{2}[^/].*|.{0,1}"/> </xsd:restriction> </xsd:simpleType> <xsd:simpleType name="t03"> <xsd:restriction base="this:t02"> <xsd:pattern value=".{2}[^/].*|.{3}[^/].*|.{0,2}"/> </xsd:restriction> </xsd:simpleType> <xsd:simpleType name="t04"> <xsd:restriction base="this:t03"> <xsd:pattern value=".{3}[^/].*|.{4}[^/].*|.{0,3}"/> </xsd:restriction> </xsd:simpleType> <xsd:simpleType name="t05"> <xsd:restriction base="this:t04"> <xsd:pattern value=".{4}[^/].*|.{5}[^/].*|.{0,4}"/> </xsd:restriction> </xsd:simpleType> <xsd:simpleType name="t06"> <xsd:restriction base="this:t05"> <xsd:pattern value=".{5}[^/].*|.{6}[^/].*|.{0,5}"/> </xsd:restriction> </xsd:simpleType> <xsd:simpleType name="t07"> <xsd:restriction base="this:t06"> <xsd:pattern value=".{6}[^/].*|.{7}[^/].*|.{0,6}"/> </xsd:restriction> </xsd:simpleType> <xsd:simpleType name="t08"> <xsd:restriction base="this:t07"> <xsd:pattern value=".{7}[^/].*|.{8}[^/].*|.{0,7}"/> </xsd:restriction> </xsd:simpleType> <xsd:simpleType name="t09"> <xsd:restriction base="this:t08"> <xsd:pattern value=".{8}[^/].*|.{9}[^/].*|.{0,8}"/> </xsd:restriction> </xsd:simpleType> <xsd:simpleType name="t10"> <xsd:restriction base="this:t09"> <xsd:pattern value=".{9}[^/].*|.{10}[^/].*|.{0,9}"/> </xsd:restriction> </xsd:simpleType> <xsd:simpleType name="t11"> <xsd:restriction base="this:t10"> <xsd:pattern value=".{10}[^/].*|.{11}[^/].*|.{0,10}"/> </xsd:restriction> </xsd:simpleType> <xsd:simpleType name="t12"> <xsd:restriction base="this:t11"> <xsd:pattern value=".{11}[^/].*|.{12}[^/].*|.{0,11}"/> </xsd:restriction> </xsd:simpleType> <xsd:simpleType name="t13"> <xsd:restriction base="this:t12"> <xsd:pattern value=".{12}[^/].*|.{13}[^/].*|.{0,12}"/> </xsd:restriction> </xsd:simpleType> <xsd:simpleType name="t14"> <xsd:restriction base="this:t13"> <xsd:pattern value=".{13}[^/].*|.{14}[^/].*|.{0,13}"/> </xsd:restriction> </xsd:simpleType> <xsd:simpleType name="t15"> <xsd:restriction base="this:t14"> <xsd:pattern value=".{14}[^/].*|.{15}[^/].*|.{0,14}"/> </xsd:restriction> </xsd:simpleType> <xsd:simpleType name="t0"> <xsd:restriction base="this:t15"> <xsd:pattern value=".{1,35}"/> </xsd:restriction> </xsd:simpleType> The solution just given has the drawback that it litters the namespace with a lot of types which only partially capture the requirements and which should never be used for elements or attributes. We can clean that up a bit by making all but but the one type we care about anonymous. <xsd:simpleType name="T"> <xsd:restriction> <xsd:simpleType><xsd:restriction> <xsd:simpleType><xsd:restriction> <xsd:simpleType><xsd:restriction> <xsd:simpleType><xsd:restriction> <xsd:simpleType><xsd:restriction> <xsd:simpleType><xsd:restriction> <xsd:simpleType><xsd:restriction> <xsd:simpleType><xsd:restriction> <xsd:simpleType><xsd:restriction> <xsd:simpleType><xsd:restriction> <xsd:simpleType><xsd:restriction> <xsd:simpleType><xsd:restriction> <xsd:simpleType><xsd:restriction> <xsd:simpleType><xsd:restriction> <xsd:simpleType> <xsd:restriction base="xsd:string"> <xsd:pattern value="[^/].*|.{1}[^/].*"/> </xsd:restriction> </xsd:simpleType> <xsd:pattern value=".{1}[^/].*|.{2}[^/].*|.{0,1}"/> </xsd:restriction> </xsd:simpleType> <xsd:pattern value=".{2}[^/].*|.{3}[^/].*|.{0,2}"/> </xsd:restriction> </xsd:simpleType> <xsd:pattern value=".{3}[^/].*|.{4}[^/].*|.{0,3}"/> </xsd:restriction> </xsd:simpleType> <xsd:pattern value=".{4}[^/].*|.{5}[^/].*|.{0,4}"/> </xsd:restriction> </xsd:simpleType> <xsd:pattern value=".{5}[^/].*|.{6}[^/].*|.{0,5}"/> </xsd:restriction> </xsd:simpleType> <xsd:pattern value=".{6}[^/].*|.{7}[^/].*|.{0,6}"/> </xsd:restriction> </xsd:simpleType> <xsd:pattern value=".{7}[^/].*|.{8}[^/].*|.{0,7}"/> </xsd:restriction> </xsd:simpleType> <xsd:pattern value=".{8}[^/].*|.{9}[^/].*|.{0,8}"/> </xsd:restriction> </xsd:simpleType> <xsd:pattern value=".{9}[^/].*|.{10}[^/].*|.{0,9}"/> </xsd:restriction> </xsd:simpleType> <xsd:pattern value=".{10}[^/].*|.{11}[^/].*|.{0,10}"/> </xsd:restriction> </xsd:simpleType> <xsd:pattern value=".{11}[^/].*|.{12}[^/].*|.{0,11}"/> </xsd:restriction> </xsd:simpleType> <xsd:pattern value=".{12}[^/].*|.{13}[^/].*|.{0,12}"/> </xsd:restriction> </xsd:simpleType> <xsd:pattern value=".{13}[^/].*|.{14}[^/].*|.{0,13}"/> </xsd:restriction> </xsd:simpleType> <xsd:pattern value=".{14}[^/].*|.{15}[^/].*|.{0,14}"/> </xsd:restriction> </xsd:simpleType> <xsd:pattern value=".{1,35}"/> </xsd:restriction> </xsd:simpleType> (It rather makes one wish the XML Schema WG had decided to make multiple xsd:pattern elements in the same restriction step be ANDed together instead of being ORed together, doesn't it? We could have saved sixty lines or so in these definitions.) As far as I can tell, either of these types (t0 or T) satisfies the requirements you outlined. The following elements are accepted: <my:T>teststring</my:T> <my:T>123456789.123456789.123456789.12345</my:T> <my:T>123456789.123456//themorethemerrier</my:T> <my:T>123456789.123456//themore////</my:T> and the following are rejected as not matching the required patterns. <my:T>123456789.123456//themorethemerrier-but-this-is-too-long</my:T> <my:T>1//456789.123456//themerrier////</my:T> <my:T>123456///.123456//themore////</my:T> <my:T>12345//89.123456//themerrier////</my:T> <my:T>123456789.//3456//themore////</my:T> <my:T>123456789.12345//6themerrier////</my:T> Thank you for the problem; I've enjoyed working out these solutions. I hope they help. --C. M. Sperberg-McQueen
Received on Thursday, 8 February 2007 23:24:57 UTC