Re: Looking for help. Implementing a XML validation engine in JavaScript. from Casey Jordan on 2010-08-10 (xmlschema-dev@w3.org from August 2010)

From: Casey Jordan <casey.jordan@jorsek.com>
Date: Tue, 10 Aug 2010 13:16:08 -0400
To: Michael Kay <mike@saxonica.com>
Cc: "Cheney, Edward A SSG RES USAR USARC" <austin.cheney@us.army.mil>, xmlschema-dev@w3.org, www-dom@w3.org
Message-ID: <AANLkTi=uk6wj2aTtA5e48GKoNPhzwQVqCPajGDLq1Y1f@mail.gmail.com>
I should have noted the following:


   - Since this is for an in-browser wysiwyg xml editor, I am only doing
   validations of sub-trees when the change from editing. So this makes it much
   more efficient.
   - The document is validated on the server before entering the client
   environment.
   - I am assuming well-formedness and my validation will be done in the
   DOM. (The user cannot edit the raw xml and caused invalid xml form)
   - I am not worried about simple type validation, as that can be done via
   regular expressions.

*
*
*Edward:* JavaScript can most definitely handle this problem Especially if
the conversions to FSA are done on the server and cached. You also lost me
with the namespace stuff, but it sounds interesting.

*Micheal:* I agree.

*Peter:*

This is very interesting, it seems I am working from the opposite
direction.

Here are the steps I take in my current algorithm:

1.) I use XQuery and XSLT to completely normalize the schema on the server.
This resolves all imports, includes, groups, extensions, restrictions,
substitutionGroups etc.. all done with respect to namespaces. So essentially
I turn the schema into a bunch of element declarations with complexType
pattern definition for each, which is a regular grammer (from what I
understand)

2.) This gets sent to the server and my javascript reads and parses it into
memory.

3.) When validating I step through the complexType pattern as a tree and use
a greedy matching algorithm against the children of an element to be
validated. Along the way I check for required min occuences, if I reach the
end of the pattern and still have input elements that are not matched I know
I have unexpected elements. This works pretty well but theoretically it has
problems.

Here is the main problem, the greedy algorithm I developed does not work for
patterns like this:

<sequence>
    <element ref="e1"/>
    <choice minOccurs="2">
        <element ref="e2" maxOccurs="2"/>
        <element ref="e3"/>
    </choice>
    <element ref="e4"/>
</sequence>

This can be seen as a regular expression e1( e2{1,2} | e3 ){2,2} e4. My
greedy algorithm cannot validate this correctly.

Assuming input like <e1/><e2/><e2/><e4/>

My algorithm will match all the way up to the <e4/> in the first pass
through the choice: ( e2{1,2} | e3 ){2,2} because it does not realize that
both <e2/> elements should be treated individually as matches for the
choice, and not a single match for the choice. Validation then
fails because it is expecting one more of <e2/> or <e3/>. There is not an
easy way to build this into my current algorithm.

For the moment this seems to be "good enough", I rarely see real world
schemas that do this, but I need to have a game plan to actually fix this in
the future.

Cheers,

Casey

On Tue, Aug 10, 2010 at 5:23 AM, Michael Kay <mike@saxonica.com> wrote:

> On 10/08/2010 02:53, Cheney, Edward A SSG RES USAR USARC wrote:
>
>> Casey,
>>
>> I have attempted to write a Lint engine for HTML to imply conformance to
>> an XHTML type syntax with rigid structure requirements to throw errors for
>> violations that conflict with either semantics or accessibility.  The
>> problems I ran into are:
>>
>> 1)  JavaScript is too slow for any sort of validation engine where
>> structure is stressed as well as syntax.  JavaScript is a high level
>> interpreted language, and so it about as slow as it gets with regards to
>> programming.
>>
>>
>
> I would have said this a couple of years ago, but for client-side
> processing, I don't think this is true nowadays. The typical desktop machine
> (not mobile devices, perhaps) has enough processor cycles going spare to
> tolerate a great deal of inefficiency. Delays from the server and the
> network tend to overshadow any delays from client-side processing. Also, I
> think Javascript engines are getting steadily better.
>
> And it does seem that if you want to implement stuff in the browser,
> writing it in Javascript is often the only choice (writing in XSLT 1.0 is
> sometimes an alternative, but probably not here!). So writing an XSD
> validator in Javascript certainly doesn't seem an unreasonable idea.
>
> I'm not sure what you mean by "structure is stressed as well as syntax".
> Validating the structure of an instance document using an FSA is actually
> extremely fast. What takes time is building the FSA (so it would be a good
> idea to try and avoid doing that every time). The other significant cost in
> validation is validating non-string fields against simple type definitions -
> for example, dates and times - the code for that is thoroughly
> uninteresting, but because it's looking at each character, the cost can be
> quite high.
>
>
>  2) Logically, the natural inclination of JavaScript towards Lambda
>> closures in theory should be helpful enough to solve the prior problem, but
>> that is not so in practice.  This is line of thought is further enforced by
>> the mere fact that XML namespace inheritance appears to likewise follow a
>> Lambda model.  Unfortunately, this is not a benefit enough to over come the
>> processing inefficiencies described in the prior point.  Additionally,
>> Lambda models exist because they are helpful, however that do appear to
>> directly represent strong potential towards the covert channels better
>> described in systems architectures.  The covert channels are not so much a
>> concern here for security violations, but rather for logic collision with
>> regard to reuse of logic components where closure is manifest in that reuse,
>> but scoped above the parts being reused.
>>
>>
>>
> This sounds fascinating, but I have absolutely no idea what you are talking
> about. "Namespace inheritance follows a lambda model"? You seem to be seeing
> connections that I've never seen. Could you elucidate? And I don't
> understand your references to covert channels either. You seem to be packing
> an argument that deserves 10 pages into one paragraph, and it's left me
> completely lost.
>
> Michael Kay
> Saxonica
>



-- 
--
Casey Jordan
Jorsek Software LLC.
"CaseyDJordan" on LinkedIn, Twitter & Facebook
Cell (585) 348 7399
Office (585) 239 6060
Jorsek.com


This message is intended only for the use of the Addressee(s) and may
contain information that is privileged, confidential, and/or exempt from
disclosure under applicable law.  If you are not the intended recipient,
please be advised that any disclosure  copying, distribution, or use of
the information contained herein is prohibited.  If you have received
this communication in error, please destroy all copies of the message,
whether in electronic or hard copy format, as well as attachments, and
immediately contact the sender by replying to this e-mail or by phone.
Thank you.
Received on Tuesday, 10 August 2010 17:16:51 UTC