Re: Optimizing a schema. from Casey Jordan on 2010-07-22 (xmlschema-dev@w3.org from July 2010)

From: Casey Jordan <casey.jordan@jorsek.com>
Date: Thu, 22 Jul 2010 18:07:38 -0400
To: Michael Kay <mike@saxonica.com>
Cc: xmlschema-dev@w3.org
Message-ID: <AANLkTilLe4R_PrCQwod7XRQiLJNlFWO7OrwPM8MLs3io@mail.gmail.com>
Michael,

Thats a great reference, thanks. You are right this is looking quite
ambitious to do in xslt, thats why I was hoping I could use saxon to help
take the brunt of the load in converting the schema to an FSA.

Next I need to convert my current validation algorithm, which is recursive,
into an iterative one. If you have any ideas on that I would much appreciate
the advice. The end goal being to provide access to a PSVI which can be
queried to determine where new elements can be inserted safely into the
structure.

Thanks,

Casey

On Thu, Jul 22, 2010 at 6:00 PM, Michael Kay <mike@saxonica.com> wrote:

> Converting from a grammar to a FSA using XSLT looks ambitious. But the
> question is, whatever language the code is written in, what is the algorithm
> you are using? A classic algorithm is Thompson and Tobin's adaptation of the
> standard Aho/Ullman algorithm, which T&T published at XML Europe 2003 -
> which you will find here:
>
> http://www.ltg.ed.ac.uk/~ht/XML_Europe_2003.html<http://www.ltg.ed.ac.uk/%7Eht/XML_Europe_2003.html>
>
> I believe that if you use this algorithm, the process of determinizing the
> FSA will automatically remove redundant transitions caused by excess
> verbosity in the original grammar. The treatment of occurrence indicators is
> less than optimal, but that's a different question. I think the answer to
> your question is that you don't necessarily need to do any work in
> simplifying the grammar in advance, provided that you remove redundancy from
> the FSA after generating it.
>
> Michael Kay
> Saxonica
>
>
> On 21/07/2010 20:50, Casey Jordan wrote:
>
>> Hi folks,
>>
>> I have a need to optimize an xml schema into its simplest DFA form.
>> Currently I have a number of xslt stylesheets that I wrote which essentially
>> flattens a schema and converts the whole thing to a state automaton in JSON
>> to be used in a client side validator I wrote.
>>
>> However, the schemas I am working with right now (From the DITA open
>> toolkit) when flattened have a ton of unnecessary nested structures, for
>> instance things like this pop up:
>>
>> <xs:sequence>
>> <xs:choice>
>> <xs:sequence>
>> <xs:choice>
>> <xs:sequence>
>> <xs:sequence>
>> <xs:element/>
>> </xs:sequence>
>> <xs:sequence>
>> <xs:element/>
>> </xs:sequence>
>> </xs:sequence>
>> </xs:choice>
>> </xs:sequence>
>> </xs:choice>
>> </xs:sequence>
>>
>> The validator works great on smaller schemas, but as its written in
>> javascript these huge unnecessary structures slow it down a lot.
>>
>> I am assuming that other schema processors simplify things like this and
>> before I wrote my own wanted to check with the community.
>>
>> Cheers,
>>
>> Casey
>> --
>> --
>> Casey Jordan
>> Jorsek Software LLC.
>> "CaseyDJordan" on LinkedIn, Twitter & Facebook
>> Cell (585) 348 7399
>> Office (585) 239 6060
>> Jorsek.com
>>
>>
>> This message is intended only for the use of the Addressee(s) and may
>> contain information that is privileged, confidential, and/or exempt from
>> disclosure under applicable law.  If you are not the intended recipient,
>> please be advised that any disclosure  copying, distribution, or use of
>> the information contained herein is prohibited.  If you have received
>> this communication in error, please destroy all copies of the message,
>> whether in electronic or hard copy format, as well as attachments, and
>> immediately contact the sender by replying to this e-mail or by phone.
>> Thank you.
>>
>
>


-- 
--
Casey Jordan
Jorsek Software LLC.
"CaseyDJordan" on LinkedIn, Twitter & Facebook
Cell (585) 348 7399
Office (585) 239 6060
Jorsek.com


This message is intended only for the use of the Addressee(s) and may
contain information that is privileged, confidential, and/or exempt from
disclosure under applicable law.  If you are not the intended recipient,
please be advised that any disclosure  copying, distribution, or use of
the information contained herein is prohibited.  If you have received
this communication in error, please destroy all copies of the message,
whether in electronic or hard copy format, as well as attachments, and
immediately contact the sender by replying to this e-mail or by phone.
Thank you.
Received on Thursday, 22 July 2010 22:10:27 UTC