efficient content model validation and restriction checking

Hi all,

I try to implement a validating XML parser based on the XSD 
recommendation of April 5, 2012 using Scala. Studying the specification 
I found that the condition for a complex content model to be a valid 
restriction of a base content model requires that all sequences accepted 
by the restricted content model are also accepted by the base content model.

This requirement turned out to be rather hard to implement. I am not a 
language theory specialist. Therefore I spent some time googling around 
and found interesting research papers about the language inclusion 
problem. Yet, most of them where not directly applicable because of 
specialities of XML schema (occurence constraints, wildcards). Finally I 
found a link to an article by Thompson & Tobin 
(http://www.ltg.ed.ac.uk/~ht/XML_Europe_2003.html) where some algorithms 
are described in sufficient detail that are meant exactly for the means 
of XML schema content model validation.

After I had implemented these algorithms it turned out, that they were 
very inefficient for certain schemas. In particular the "XML Schema Test 
Collection" contains the following schema 
(msData/particles/particlesIe003.xsd):

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema" 
targetNamespace="http://xsdtesting" xmlns:x="http://xsdtesting" 
elementFormDefault="qualified">
         <xsd:complexType name="base">
                 <xsd:choice>
                         <xsd:element name="e1" minOccurs="0" 
maxOccurs="unbounded"/>
                         <xsd:element name="e2" minOccurs="0" 
maxOccurs="unbounded"/>
                 </xsd:choice>
         </xsd:complexType>
         <xsd:complexType name="testing">
                 <xsd:complexContent>
                         <xsd:restriction base="x:base">
                                 <xsd:choice>
                                         <xsd:element name="e1" 
minOccurs="1" maxOccurs="9999999"/>
                                         <xsd:element name="e2" 
minOccurs="1" maxOccurs="9999999"/>
                                 </xsd:choice>
                         </xsd:restriction>
                 </xsd:complexContent>
         </xsd:complexType>
         <xsd:element name="doc" type="x:testing"/>
</xsd:schema>

It is impractical to validate this schema using the algorithms presented 
in the mentioned article. The constructed state machines have millions 
of states and take hours to be analyzed.

Can anyone give me some help, how content model validation can be 
implemented for such cases? Should I fall back to the specification 
given in XSD 1.0?

Thanks for your attention & regards!

Stefan

Received on Saturday, 10 January 2015 21:32:39 UTC