Re: Optimizing a schema. from Michael Kay on 2010-07-22 (xmlschema-dev@w3.org from July 2010)

From: Michael Kay <mike@saxonica.com>
Date: Thu, 22 Jul 2010 23:00:27 +0100
To: Casey Jordan <casey.jordan@jorsek.com>
CC: xmlschema-dev@w3.org
Message-ID: <4C48BF7B.4090102@saxonica.com>

Converting from a grammar to a FSA using XSLT looks ambitious. But the 
question is, whatever language the code is written in, what is the 
algorithm you are using? A classic algorithm is Thompson and Tobin's 
adaptation of the standard Aho/Ullman algorithm, which T&T published at 
XML Europe 2003 - which you will find here:

http://www.ltg.ed.ac.uk/~ht/XML_Europe_2003.html

I believe that if you use this algorithm, the process of determinizing 
the FSA will automatically remove redundant transitions caused by excess 
verbosity in the original grammar. The treatment of occurrence 
indicators is less than optimal, but that's a different question. I 
think the answer to your question is that you don't necessarily need to 
do any work in simplifying the grammar in advance, provided that you 
remove redundancy from the FSA after generating it.

Michael Kay
Saxonica

On 21/07/2010 20:50, Casey Jordan wrote:
> Hi folks,
>
> I have a need to optimize an xml schema into its simplest DFA form. 
> Currently I have a number of xslt stylesheets that I wrote which 
> essentially flattens a schema and converts the whole thing to a state 
> automaton in JSON to be used in a client side validator I wrote.
>
> However, the schemas I am working with right now (From the DITA open 
> toolkit) when flattened have a ton of unnecessary nested structures, 
> for instance things like this pop up:
>
> <xs:sequence>
> <xs:choice>
> <xs:sequence>
> <xs:choice>
> <xs:sequence>
> <xs:sequence>
> <xs:element/>
> </xs:sequence>
> <xs:sequence>
> <xs:element/>
> </xs:sequence>
> </xs:sequence>
> </xs:choice>
> </xs:sequence>
> </xs:choice>
> </xs:sequence>
>
> The validator works great on smaller schemas, but as its written in 
> javascript these huge unnecessary structures slow it down a lot.
>
> I am assuming that other schema processors simplify things like this 
> and before I wrote my own wanted to check with the community.
>
> Cheers,
>
> Casey
> -- 
> --
> Casey Jordan
> Jorsek Software LLC.
> "CaseyDJordan" on LinkedIn, Twitter & Facebook
> Cell (585) 348 7399
> Office (585) 239 6060
> Jorsek.com
>
>
> This message is intended only for the use of the Addressee(s) and may
> contain information that is privileged, confidential, and/or exempt from
> disclosure under applicable law.  If you are not the intended recipient,
> please be advised that any disclosure  copying, distribution, or use of
> the information contained herein is prohibited.  If you have received
> this communication in error, please destroy all copies of the message,
> whether in electronic or hard copy format, as well as attachments, and
> immediately contact the sender by replying to this e-mail or by phone.
> Thank you.

Received on Thursday, 22 July 2010 22:00:59 UTC