Extreme Markup Languages |
Keywords: XML Schema; UPA; subsumption; automata; regular expressions
AbstractWe describe an extension of regular expresions using arbitrary all-groups and numeric exponents (as in W3C XML Schema). We are able to provide polynomial-time algorithms for testing UPA and content model subsumption for these extended regular expressions. The algorithm handles numeric exponents (as opposed to the Kleene operators of traditional regular expressions), but comparisons of exponents are exponential in the depth of exponent nesting. The algorithm for UPA testing with all groups in USCM is polynomial in the size of the expression, but uses push-down automata - much more complex machinery than used here for that purpose, and current known algorithms for subsumption involve unrolling the numeric exponents. This is exponential in the size of the exponents - in other words large exponent values are much more onerous than small ones. As XML Schema uses numeric exponents this is a significant issue for validators. We also show how to extend these algorithms to handle XML Schema's wildcards and substitution groups. |
While XML Schema WXSprovides many facilities not available with DTDs, it also provides some features that are problematic for developing validators. At the same time, there are desirable features which it has not provided. In the former category are:
<element name="foo" type="bar" minOccurs="10" maxOccurs="20"/> |
Finally, there is empirical evidence, demonstrated by the number of inconsistencies among validators, that the initial specification is not completely clear on any number of points.
It is imperative for XML Schema to determine if there are reasonable algorithms for determining these questions. Otherwise the specification has, essentially, unusable features.
This paper describes algorithms for calculating whether a content model obeys UPA and whether one content model correctly restricts another based on the model built. The restriction algorithm has an exponential component where two chains of nested numeric exponents need to be compared, but it is only exponential in the length of the chains, rather than the size of the exponent values.
We will demonstrate our algorithms on an extension of traditional regular expressions. In the remainder of the paper we wil explain the extended syntax and define a few important properties of the terminals in these expressions. Then we will provide the UPA algorithm. This works for simple terminals. Next we will show how to extend this for substitution groups and wildcards as found in XML Schema. Next, we examine the subsumption issue, first for expressions with &-groups, and then for expressions with numeric exponents.
We start by having a set of terminals, which we will represent using a - z. These correspond to element names in XML Schema.
A schema regular expression is defined by the following grammar:
In order to determine if a regular expression follow UPA, we will need the following five properties of a particle:
As an example, consider the following regular expression:
((a | b),(((c | d) & (e, f{0,1}){2,5}){6,6} | (g & h{0,5})))
In order to better discuss it, let's first subscript every particle in the expression:
((a0 | b1)2,(((c3 | d4)5 & (e6, f7{0,1}8)9{2,5}10)11{6,6}12 | (g13 & h14{0,5}15)16)17)18
In this we have the entire expression being opaque, but particles 8 and 15 are not. Because the entire expression is a sequence, and the first particle in it (2) is opaque, the first set of the expression is {a, b}. The follow set of 2 - coincidentally the first set of 17 - is the first set of the rest of the sequence. Since it is a choice, it is the union of the first sets of the branches (particles 12 and 16). Both branches are &-groups, so they atain return the union of their first sets, or {c3, d4, e6, g13, h14}. Only some particles have non-empty confusion sets. Particle 9 has {f7} in its confusion set, as f7 is in follow(e6). Because of the optionality, confusion(10) has both e6 and f7 in it, as does confusion(12).
Ultimately a regular expression is for matching strings of terminals to determine membership in a language. This is done by matching terminals in the input against successive terminal particles in the expression. A regular expression has the UPA if, for any input string, at any point there is only particle in the regular expression which could be used to match the next item in the input. The regular expression (a, b) has UPA because it will only match an 'a' followed by a 'b'. The regular expression (a{0,1}, a) does not because when the first 'a' shows up, it could match either the first a? particle or the second a particle. Determining if a regular expression has UPA is not difficult if we use standard regular expressions as described above, but numeric exponents complicate the picture. In the first case we need only check that there are no two different particles with the same terminal either in the first set of the expression or in the follow set of any particle in the expression. (As an example of why first sets alone are insufficient, consider (a, a{0,1}, (b | a))). In the latter case, though, we have examples like:
In the first case, we have a UPA violation because the fourth match of a0 can be followed by either a fifth repetition or a1. The follow function rightly puts both in follow(a0). However, in the second case there is no ambiguity (8 repetitions of a0 followed by one a1. But in the third case we again have an ambiguity. On the eighth repetition, a0 can be followed by either b1 or b2 - but if we make example two work, then b2 is not in follow(a0).
In the fourth case we have a UPA violation between d3 and d4. Therefore we'd want the follow set of the &-group to be in the follow set of the individual branches. However, if we do that, then we'd also need to make sure that the first sets of the other branches are not also included, because otherwise c2 would have both b1 and b3 in its follow set in the fifth example. But that would then miss the UPA violation in the sixth example between b1 and b3.
Because of the issues with the two problematic constructs of &-groups and numeric exponents of the form {m,m} we need to delve further into aspects of UPA violations. There are two sources of UPA violations involving a particle that are particularly germane to our analysis.
Given a particle, there are a limited number of reasons that it can cause a UPA violation. Take the particle, P, in the following:
(α, β{0,1}, P, γ) or (α, (β | P), γ) or (α, (β & P), γ)
A violation can only result from a confusion among terminals. Either:
Suppose we have a particle, P, and P obeys UPA. Under what circumstances does P{m,n} not have UPA? At a minimum, if P{m,n} does not have UPA, then (P, P) does not have it either, although the reverse is not necessarily true (because of repeated particles). But if P obeys UPA, and (P, P) does not, then there must be some particle in P whose follow set would include the first set of the second P, and that follow set already includes a terminal in first(P). By definition, that follow set is included in follow(P). This is invariant with the number of repetitions of P.
Therefore we can test for whether P{m,n} obeys UPA by intersecting first(P) and confusion(P).
Likewise for (α1 & ... & αn). Assuming each αi obeys UPA, either ∪ni = 1first(αi) contains the same terminal with different subscripts or there are αj and αk, j not = k, where (αj, αk) violates UPA. Once again, this means there is some terminal in both confusion(αj) and first(αk). However, because a branch in an &-group never follows itself, it is not an error if (αj, αj) violates UPA because that would never occur in a single instance of the group - it would only happen if the &-group has an exponent, and then confusion(αj), being a subset of the confusion set of the group as a whole, would be compared with the first set of the group as a whole, exposing the issue. So we only need compare confusion(αj) with the first sets of the other branches.
We therefore end up with the following algorithm:
UPA(α) => if α = athen
if bi ∈ follow(a)and bj ∈ follow(a) -> i=j then true
else if α = β{m,n} then UPA(β) /\ (first(β) ∩ confusion(β) = {})
else if (α = (β1,...,βn) or α = (β1|...|βn)) and ∩ni = 1 first(βi) = {} then /\ni = 1UPA(βi)
else if α = (β1&...& βn) and ∩ni = 1 first(βi) = {} then /\ni=1(UPA(βi) /\ (confusion(βi) ∩ (∪j not = ifirst(βj)) = {})
else false.
As with UPA testing, new features like numeric exponents and &-groups add to the complexity of testing a subsumption relationship between two regular expressions, where by subsumption we mean all the strings accepted by the subsmed expression are also valid in the subsuming expressing, but not necessarily the reverse.
Without these features, given two regular expressions, α and β, that obey UPA, it is possible to determine whether α subsumes β in a straightforward way. If we consider the set of strings accepted by α L(α) and the set of strings accepted by β, L(β) as subsets of the set of all strings, then L(β) subset L(α) iff L(β) ∩ ~L(α) = {}. In other words, every string in L(β) should also be in L(α). ~L(α) is the set of strings not in L(α), so there should be no string that's both in ~L(α) and L(β). This can be determined by constructing an automaton from each regular expression, inverting the final states of the automaton from α, and constructing the cross-product automaton, an automaton whose states are pairs of states, one from each source automaton. This can be done in time quadratic with the two input automata, as there is only one state which has both start states, and from any state pair and for any terminal, there can be at most one successor state pair. Since we've inverted the accepting states in the one automaton, the subsumption holds if there is no state pair in the resulting automaton where both members of the pair is an accepting state from their respective automata.
We will approach this issue in two stages. We will first extend the above algorithm to handle &-groups and then extend that to additionally handle numeric exponents.
The straightforward subsumption story is complicated by &-groups because, even with UPA, at any point there may be more than one terminal that can match from the other machine. Consider the following pair:
Clearly, the second expression is subsumed by the first. The a is optional, (b, c, d) is one of the six possible paths through the & group, and b is the final terminal to be matched. However, the c and d branches of the &-group both have two transitions to a b, so the naive algorithm fails. A naive fix would be to break apart the group, listing all the alternatives:
(a{0,1}, ((d , c , b) | (d , b , c) | (c , d , b) | (c , b , d)(b , c , d)(b , d , c)), b)
However, this is exponential (actually O(n!)) in the size of the orignal expression, although the resulting expression still obeys UPA.
Any solution not expanding the content model needs to keep track of which branches of the &-group have been traversed at any point and only move from the &-group when all the required branches have been matched. Such a solution would also need to keep track of nested &-group - unlike choice, &-groups are commutative but not associative; i.e. ((a & b) & (c & d)) is not the same as ((a & c) & (b & d)), as the second allows (a, c, b, c), but the first requires that a be adjacent to b. This would imply the use of a stack.
To summarize, we need a means of indicating all the branches of an &-group such that we can keep track of which branches have been traversed and which have not. Further, this should not require rewriting the content model - either in advance or during execution of the algorithm - as that would return us to exponential behavior.
We will handle each &-group as a bit-vector of the length of the number of branches. We will divide this vector in two parts - the upper part corresponds to the opaque branches and the lower part to the transparent branches. For example, suppose we have the &-group (a? & b & c &d?) and want to see if it subsumes the sequence (b, a, c). We have two opaque branches and two transparent ones. We rewrite this as (b3 & c2 & d1? &a0?), corresponding to the bit-vector 1111, or 15, for all four branches, or 12, for just the opaque branches. We assign each branch to the corresponding bit. Let us call these the max and min values for the group. On entering the &-group we start with a bit-vector of 0000 - no branches completed. We cannot "leave" the group unless we've matched at least the required branches, so we'd need a bit vector of at least 1100. At each branch check, the bit vector, as a binary value, must increase so we know we've finished a new branch. Transitions corresponding to choices where the appropriate bit is 0 are active. Those wehre the bit is 1 are inactive.
To show the process, we first match b with b3 and xor the bit-vector with 23, where the power is the number of the matched branch. This gives a bit vector or 1000. Next we match the a with a0 and the bit vector becomes 1001. Finally we match the c with c2, giving a bit-vector of 1101. As 1100 ≤ 1101 ≤ 1111, the subsumption holds.
On the other hand, if we try to check (b, a, d), our final bit-vector is 1011. Since 1011 < 1100, the subsumption doesn't hold. Finally, if we try to check (b, b, d), we'd check the first b and get 1000. Checking the second b, we'd xor 1000 with 1000 with a result of 0000. As 0000 < 1000, we'd know we've tried to match the same branch twice and the subsumption fails.
At this point we need to describe how to build an automaton for an extended regular expression. In addition to the usual transitions, we need to keep track of the following when traversing a terminal:
Note that there may be more two transitions from a state with the same terminal. However (due to UPA) one will return to the &-group and the other will leave it. Each has an associated test condition, and only one can succeed.
Each transition, t, in our automaton therefore has properties:
Consider the following expression: (0a1, ((b02 & c13){0,1}0 & ((d4, e5) | e6)1), e7), where the subscripts are for identifying the states we'll produce for the automaton and the superscripts are for the branches of the &-groups. By the discussion above, we start at state 0 and produce the following transitions:
When comparing two automata we are building a new automaton that is a subset of the product of the two input automata. Let us call the automaton for the initial content model B (for base) and the one we are testing as R (for restriction). At each point, there will be some set of possible transitions from a current state in the automaton for R for which corresponding transitions must exist in B. There are two sets of important subcases:
There is also an important asymmetry between B and R for determining valid subsumptions. In particular, if some state in R is matched to a state in B by the algorithm, and one transition from the state in R goes to an &-group, then one of the following must be true:
As mentioned this is very asymmetric - if all transitions from R are simple;, then it doesn't matter whether transitions in B are simple or &-group transitions. If the transition in R matches a simple transition in B, then we proceed as usual. If it matches an &-group transition in B, then the successor states in R must eventually match all the branches in B, but they only need to match one ordering of them. As a result, when traversing states for an &-group it is not necessary to check all sequences.
This provides enough information to generate an algorithm. Because we need to check all alternatives for each state, we need to explicitly manage the &-group stacks for both B and R. As with the traditional algorithm, we invert the states in the base, create a machine from the cross-product of the states of the input machines, and determine that there are no states in the resulting machine that includes accepting states from both input machines.
As a simple example, we test whether the expression above subsumes (a, ((d, e) | e), e). If we show it as above, (0a1, ((d2, e3) | e4), e5) we can see that the transitions are:
We start at state 0 for both machines. Since we invert the accepting states of the base machine, its accepting states are 0 - 6, but not 7.
Numeric exponents add significant complexity to subsumption testing. In particular, the nesting of numeric exponents, both in B and R, potentially require testing a large number of alternatives to ensure for each path through R there is at least one path through B. And yet, in many cases this is not necessary because of overlap in the areas covered by the exponents.
We will first give a couple of example of exponents and valid and invalid content models. From that, we will be able to extract some important aspects of the problem.
For example, consider the following: (a{4,5}{2,3}). Here we have a sequence of either 4 or 5 repetitions of a, repeated 2 or 3 times. This means the following are legal:
More generally, regarding subsumption, consider two content models, (ρ{m0,,n0}...{mi,ni}) and (β{m'0,n'0}...{m'j,n'j}). We wish to establish if the latter subsumes the former. Clearly we must first establish if β subsumes ρ. Then we must determine that all possible ranges of ρ are also possible ranges of β. We can enumerate the ranges by choosing all the different m and n for each subscript - from (m0*m1*...*mi, n0*m1*...*mi) to (m0*n1*...*ni, n0*n1*...*ni). The maximum number of distinct ranges of ρ is 2i and of β is 2j. If β subsumes ρ, then we only need to determine that each range of ρ is contained within a range of β. We will use ranges(n) to refer to the ranges of the first n values in a list of exponents.
The more complex case is when the scoping is not so clear. For example, if B is (a | b){4,12} and R is (a{2,3},b{5,7}), then the iterations of a are insuficient without the additional iterations of b. Also consider B as (((a | b){3,5},c{0,1}){6,9},d) and R as ((a,b){20,25},c,d). Here, the repetitions of (a,b) in R must account for both the inner and outer exponents in B.
We can consider each terminal and each transition in B as having a nesting depth, depending on how many exponents are crossed from source to sink. Before a transition may be taken, all the "obligations" at higher levels must be fulfilled. For example, in B above, the d is at level 0 (not in the scope of any exponent) while a, b, and c are at level two. Therefore the transition from a to d is at level 0, and the exponent at levels 1 and 2 must have been satisfied before that transition could be crossed. At the other end, there is a transition from a to b at level 2, so that is counted against the exponent at that level ({3,5}), as well as a transition at level 1, so the number of iterations of a's and b's is either between 3 and 5, or over 18.
As with automata for &-group, we will need a stack for each exponent and some extra commands to attach to each transition. Each entry on the stack has an exponent and a counter to track the number of iterations completed. The counter has a pair to keep track of min and max iterations from the other automaton and a single number to track the number of matches from the innermost particle. In this case there are three kinds of command:
For example, the expression (0((a1 | b2){10,11}2,c3{0,1}2){6,9}1,d4) translates to the following transitions:
The algorithm follows:
We'll now give an example of this algorithm in action. We'll use the expression ((a, b){40,43}, c, d), which translates to the following automaton:
First, we flip accepting states in B. As the only accepting state was 4, we now have 4 as the only non-accepting state. We then start with state 0 from both machines. For each, there is one transition across a. We push ({40, 43}, {0, 0}) on the R stack and both ({6, 9}, {0, 0}) and ({10, 11}, {0, 0}) on the B stack, then increment B, so the B stack is [({10, 11}, {1, 1}),({6, 9}, {0, 0})]. We are in states 1R and 1B. We add ((0, 0), (1, 1), a) to the output. Now the only transitions are across b in both. We execute (1, 2, b, [increment]) in R and then B. The R stack is now [({40, 43}, {1, 1})] and the B stack is [({10, 11}, {2, 2}),({6, 9}, {0, 0})]. We add ((1, 1), (2, 2), b) to the output.
There are now two possibilities in R, either transition 2 or 3 above. If we traverse the a path, we bring ourselves back to state 1 in both machines, but we add ((2, 2), (1, 1), a) to the output.
If we traverse the c, we find ourselves in the most complex part of the algorithm. The R transition is 4, above. The B transition is 5. Top of the R stack is ({40, 43}, {1, 1}), top of the B stack is ({10, 11}, {2, 2}). As 40 * 2 > 10 * 1, and 1 < 2 in check(2, 1), we take the first branch. We set newk to 70 and newl to 76. The only values for newm and newn are 10 and 11. As 6 ≤ 7 ≤ 7 ≤ 9, we set the B stack to [({0, 2}, {0, 0})]. We add ((2, 2), (3, 3) c) to the output and continue. We perform the last command from the B transition, and increment, and we pop the top of the R stack. The R stack is now empty and the B stack is [({0, 2}, {1, 1})]. From here there is one transition in R, across d, and one corresponding one in B. In R we end up in state 4 with an empty stack. In B, we check the first entry on the stack. As 0 ≤ 1 ≤ 1 ≤ 2, we pass and pop the stack. So we add ((3, 3), (4, 4), d) to the output. Checking the output, we see there is no state with an accepting state from both input machines, so we conclude that B subsumes R.
The algorithms as described work on terminals. However, XSD content models generally use non-terminals, such as references to substitution groups and namespace wildcards, which resolve to sets of qualified names. It is the qualified names that are the actual terminals. In this section we will outline how the algorithms given above can be extended to cover the XSD case.
We can see how XSD "terminals" reference sets in the following examples.
Clearly, in determining UPA or subsumption we will need to consider these sets and their intersections.
While namespaces are flat, substitution groups form a tree - if R is in the substitution group of B, then the substitution group of R is a subset of the substitution group of B. Each node of this tree is a set; the structure is determined by declarations in the schema. From our perspective, we needn't worry about the particulars.
For UPA, the issue is no longer whether the same terminal appears twice, but whether any two sets named contain a common value. Since every non-empty set has a non-empty intersection with itself, we can update the UPA algorithm so that whenever two set names appear in a first or follow set together, we check that they have empty intersection. That way there can never be an occasion where the same qualified name appears in two transitions from the same state.
For subsumption testing, when we compare two transitions, such as (qR,q'R,ρ,[...]) from R and (qB,q'B,β,[...]) from B, it is not necessary that ρ and β be the same terminal, but only that ρ be a subset of β. That ensures that any qualified name matched by the transition in R is also matched by the transition in R.
It is an unfortunate side effect of this approach that UPA testing, in particular, needs to be done for every combination of element declarations a content model appears in, rather than just once. This limits the reusability of schema constructs. It also allows some very unexpected constructs; for example, a wildcard can be a restriction of a substitution group if it just so happens that all the elements defined in the namespace are also in the substitution group. There are ways to address this issue, but they are beyond the scope of this paper.
We've shown algorithms for UPA and subsumption testing for an extension to regular expression covering numerice exponents and general all-groups. Numeric exponents are a new feature added by XSD and all-groups were a feature of SGML dropped by XML 1.0 to simplify development. Both of these are important features where previously published algorithms were very expensive. While we've concentrated on the traditional notion of terminals, we have shown how XSD's wildcards and substitution groups can be handled.
The run time performance of these algorithms is very reasonable compared to alternatives. For example, naive algorithms that would unroll the numeric exponents and then determinise them in some for to test UPA would run in doubly exponential time - firstly, unrolling the exponents is O(n * ed), where n is the size of a group, e is the size of the exponents, and d is their nesting depth. Given minimal and maximal values, the resulting automaton is highly non-deterministic, even where the original had UPA (for example (a{0,10}{0, 10}) expands to 100 (102) optional a's). Any attempt to determinise this using the traditional subset construction method AHU is again exponential in the size of the input (as there may be an exponential blowup in states), giving O(2n * ed ), which is fairly large. Any subset determination on this result is huge compared to the input.
In our case, the performance of the UPA algorithm remains quadratic with the input - only slightly worse than the algorithm on just kleene algorithms. The performance of the subsumption algorithm is only exponential in the depth of the nesting of exponents, or O(n*2d). This is comparatively reasonable.
[RNG] Relax NG SpecificationJames Clark and MURATA Makoto, Oasis, 2001
[MUR95] Murata Makoto, "Forest-Regular Languages and Tree-Regular Languages". Technical Report. Fuji Xerox, Japan, 1995.
[MUR99] Murata Makoto, "Hedge Automata: A Formal Model for XML Schemata", 1999.
[CHID00] Boris Chidlovskii, "Using Regular Tree Automata as XML Schemas", Proceedings of IEEE Advances in Digital Libraries 2000 Washington, USA, 2000, eds., J. Hoppenbrouwers, et al.
[TATA] Hubert Comon, et al, Tree Automata Techniques and Applications, 1998
[WXS] Henry Thompson, et al., XML Schema Part One: Structures, World Wide Web Consortium, May 2001, http://www.w3.org/TR/xmlschema-1/
[FD] Allen Brown, et al., XML Schema: Formal Description, World Wide Web Consortium, Sept. 2001, http://www.w3.org/TR/xmlschema-formal/
[USCM] Andreas Neumann, Unambiguity of SGML Content Models - Pushdown Automata Revisited 3rd Int. Conf. on the Developments in Language Theory (DLT'97), Juli 1997, Thessaloniki, Griechenland, in: Symeon Bozapalidis, Hrsg., Proceedings of the 3rd International Conference Developments in Language Theory, S. 507-518, Aristotle University of Thessaloniki, 1997
[AHU] Alfred Aho et al., The Design and Analysis of Computer Algorithms, Addison-Wesley Pub., 1974