Re: efficient content model validation and restriction checking from Michael Kay on 2015-01-11 (xmlschema-dev@w3.org from January 2015)

From: Michael Kay <mike@saxonica.com>
Date: Sun, 11 Jan 2015 16:13:16 +0000
To: Stefan Wachter <stefan.wachter@gmx.de>
Cc: xmlschema-dev@w3.org
Message-Id: <715E6B9F-0D53-4A02-A80C-551933835FA6@saxonica.com>
There's a follow-up paper by Henry Thompson

http://www.inf.ed.ac.uk/publications/report/0856.html

which describes how finite state machines can be augmented with counters to reduce the time and space needed to handle large finite occurrence limits. Unfortunately the scheme only works for certain cases - but it does handle the most common cases that arise in practice, including your example.

I've looked at how some regex engines tackle the problem of large finite occurrence constraints, and it seems they don't build an FSA at all, they just interpret the regex as written with a crude backtracking approach to handle ambiguities. This approach can be used for XSD too, but it makes the "valid restriction" constraint a lot tougher to check.


Michael Kay
Saxonica
mike@saxonica.com
+44 (0) 118 946 5893




On 10 Jan 2015, at 21:32, Stefan Wachter <stefan.wachter@gmx.de> wrote:

> Hi all,
> 
> I try to implement a validating XML parser based on the XSD recommendation of April 5, 2012 using Scala. Studying the specification I found that the condition for a complex content model to be a valid restriction of a base content model requires that all sequences accepted by the restricted content model are also accepted by the base content model.
> 
> This requirement turned out to be rather hard to implement. I am not a language theory specialist. Therefore I spent some time googling around and found interesting research papers about the language inclusion problem. Yet, most of them where not directly applicable because of specialities of XML schema (occurence constraints, wildcards). Finally I found a link to an article by Thompson & Tobin (http://www.ltg.ed.ac.uk/~ht/XML_Europe_2003.html) where some algorithms are described in sufficient detail that are meant exactly for the means of XML schema content model validation.
> 
> After I had implemented these algorithms it turned out, that they were very inefficient for certain schemas. In particular the "XML Schema Test Collection" contains the following schema (msData/particles/particlesIe003.xsd):
> 
> <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema" targetNamespace="http://xsdtesting" xmlns:x="http://xsdtesting" elementFormDefault="qualified">
>        <xsd:complexType name="base">
>                <xsd:choice>
>                        <xsd:element name="e1" minOccurs="0" maxOccurs="unbounded"/>
>                        <xsd:element name="e2" minOccurs="0" maxOccurs="unbounded"/>
>                </xsd:choice>
>        </xsd:complexType>
>        <xsd:complexType name="testing">
>                <xsd:complexContent>
>                        <xsd:restriction base="x:base">
>                                <xsd:choice>
>                                        <xsd:element name="e1" minOccurs="1" maxOccurs="9999999"/>
>                                        <xsd:element name="e2" minOccurs="1" maxOccurs="9999999"/>
>                                </xsd:choice>
>                        </xsd:restriction>
>                </xsd:complexContent>
>        </xsd:complexType>
>        <xsd:element name="doc" type="x:testing"/>
> </xsd:schema>
> 
> It is impractical to validate this schema using the algorithms presented in the mentioned article. The constructed state machines have millions of states and take hours to be analyzed.
> 
> Can anyone give me some help, how content model validation can be implemented for such cases? Should I fall back to the specification given in XSD 1.0?
> 
> Thanks for your attention & regards!
> 
> Stefan
> 
> 
>
Received on Sunday, 11 January 2015 16:13:54 UTC