Re: efficient content model validation and restriction checking from Stefan Wachter on 2015-01-12 (xmlschema-dev@w3.org from January 2015)

From: Stefan Wachter <stefan.wachter@gmx.de>
Date: Mon, 12 Jan 2015 21:17:41 +0100
To: xmlschema-dev@w3.org
Message-ID: <54B42BE5.5050007@gmx.de>
Many thanks for that information. I will definitely try to implement the 
algorithm described by Mr. Thompson. He already was so kind to provide 
me the mentioned article. In addition I will implement the runtime check 
for the undecidable cases.

Is there a specific reason why you decided that Saxon does not support 
the "runtime-check" solution?

Best regards

Stefan

Am 12.01.2015 um 10:11 schrieb Michael Kay:
> It's also worth pointing out that XSD 1.1 provides a "get out of jail card" on this one. In 3.4.6.3:
>
> It is ·implementation-defined· whether a processor (a) always detects violations of clause 2.4.2 by examination of the schema in isolation, (b) detects them only when some element information item in the input document is valid against T but not against T.{base type definition}, or (c) sometimes detects such violations by examination of the schema in isolation and sometimes not. In the latter case, the circumstances in which the processor does one or the other are ·implementation-dependent·.
>
> What this means is that if you can't prove statically that R is a valid restriction of B, or if doing the proof is too expensive, then an alternative strategy is to validate instances against both R and B, and report the schema to be invalid only if and when you encounter an instance that is valid against R but invalid against B. That's not very friendly, because a schema might be in use for years before it is discovered to be invalid, but it might be better than the strategy which Saxon adopts, which is to simply report some schemas as too complex to handle.
>
>
> Michael Kay
> Saxonica
> mike@saxonica.com
> +44 (0) 118 946 5893
>
>
>
>
> On 11 Jan 2015, at 16:13, Michael Kay <mike@saxonica.com> wrote:
>
>> There's a follow-up paper by Henry Thompson
>>
>> http://www.inf.ed.ac.uk/publications/report/0856.html
>>
>> which describes how finite state machines can be augmented with counters to reduce the time and space needed to handle large finite occurrence limits. Unfortunately the scheme only works for certain cases - but it does handle the most common cases that arise in practice, including your example.
>>
>> I've looked at how some regex engines tackle the problem of large finite occurrence constraints, and it seems they don't build an FSA at all, they just interpret the regex as written with a crude backtracking approach to handle ambiguities. This approach can be used for XSD too, but it makes the "valid restriction" constraint a lot tougher to check.
>>
>>
>> Michael Kay
>> Saxonica
>> mike@saxonica.com
>> +44 (0) 118 946 5893
>>
>>
>>
>>
>> On 10 Jan 2015, at 21:32, Stefan Wachter <stefan.wachter@gmx.de> wrote:
>>
>>> Hi all,
>>>
>>> I try to implement a validating XML parser based on the XSD recommendation of April 5, 2012 using Scala. Studying the specification I found that the condition for a complex content model to be a valid restriction of a base content model requires that all sequences accepted by the restricted content model are also accepted by the base content model.
>>>
>>> This requirement turned out to be rather hard to implement. I am not a language theory specialist. Therefore I spent some time googling around and found interesting research papers about the language inclusion problem. Yet, most of them where not directly applicable because of specialities of XML schema (occurence constraints, wildcards). Finally I found a link to an article by Thompson & Tobin (http://www.ltg.ed.ac.uk/~ht/XML_Europe_2003.html) where some algorithms are described in sufficient detail that are meant exactly for the means of XML schema content model validation.
>>>
>>> After I had implemented these algorithms it turned out, that they were very inefficient for certain schemas. In particular the "XML Schema Test Collection" contains the following schema (msData/particles/particlesIe003.xsd):
>>>
>>> <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema" targetNamespace="http://xsdtesting" xmlns:x="http://xsdtesting" elementFormDefault="qualified">
>>>        <xsd:complexType name="base">
>>>                <xsd:choice>
>>>                        <xsd:element name="e1" minOccurs="0" maxOccurs="unbounded"/>
>>>                        <xsd:element name="e2" minOccurs="0" maxOccurs="unbounded"/>
>>>                </xsd:choice>
>>>        </xsd:complexType>
>>>        <xsd:complexType name="testing">
>>>                <xsd:complexContent>
>>>                        <xsd:restriction base="x:base">
>>>                                <xsd:choice>
>>>                                        <xsd:element name="e1" minOccurs="1" maxOccurs="9999999"/>
>>>                                        <xsd:element name="e2" minOccurs="1" maxOccurs="9999999"/>
>>>                                </xsd:choice>
>>>                        </xsd:restriction>
>>>                </xsd:complexContent>
>>>        </xsd:complexType>
>>>        <xsd:element name="doc" type="x:testing"/>
>>> </xsd:schema>
>>>
>>> It is impractical to validate this schema using the algorithms presented in the mentioned article. The constructed state machines have millions of states and take hours to be analyzed.
>>>
>>> Can anyone give me some help, how content model validation can be implemented for such cases? Should I fall back to the specification given in XSD 1.0?
>>>
>>> Thanks for your attention & regards!
>>>
>>> Stefan
>>>
>>>
>>>
>
Received on Monday, 12 January 2015 20:18:16 UTC