
RE: error recovery

From: David Lee <David.Lee@marklogic.com>
Date: Sat, 18 Feb 2012 10:08:32 -0800
To: Noah Mendelsohn <nrm@arcanedomain.com>, Norman Walsh <ndw@nwalsh.com>
CC: W3C XML-ER Community Group <public-xml-er@w3.org>
Message-ID: <EB42045A1F00224E93B82E949EC6675E16ADAEE2F6@EXCHG-BE.marklogic.com>
Question: If this *can't* be done in a streaming processor, what does that mean?
Does it mean the input must be fully read in order to "fix" it? In what format? A degenerate case is pure text with no markup. Then the non-streamed format is an array of bytes/chars. Is that any better than a stream of bytes?

I agree that trying to enforce streaming in a spec is very difficult and typically over-optimizing, considering that the definition of streaming is still up for debate.
But let's consider the reverse. Do we enforce a particular view of the raw data that is NOT a 'stream of characters' (ignoring encoding for the moment)?
That is, what is the data model of the un-fixed data, if not a stream of chars? I think that needs to be decided first.
What is the abstract data model of the input data? From there, streaming or not should be irrelevant or unnecessary to spec.
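
To make the point concrete, here is a minimal sketch, assuming the input model is simply "a sequence of characters". The function name close_open_tags and the recovery rule it applies (append end tags for elements still open at end of input) are hypothetical illustrations, not anything the group has agreed on. It handles only bare <name> and </name> tags (no attributes, comments, CDATA, or input that ends in the middle of a tag).

```python
from typing import Iterable, Iterator, List


def close_open_tags(chars: Iterable[str]) -> Iterator[str]:
    """Pass characters through unchanged, then append end tags for any
    elements still open at end of input (hypothetical toy rule)."""
    open_names: List[str] = []  # stack of currently open element names
    tag: List[str] = []         # characters of the tag being read
    in_tag = False

    for ch in chars:
        yield ch
        if in_tag:
            if ch == ">":
                name = "".join(tag)
                if name.startswith("/"):
                    # End tag: pop only if it matches the innermost open element.
                    if open_names and open_names[-1] == name[1:]:
                        open_names.pop()
                elif name:
                    open_names.append(name)
                in_tag = False
                tag = []
            else:
                tag.append(ch)
        elif ch == "<":
            in_tag = True

    # End of input: close whatever is still open, innermost first.
    while open_names:
        yield "</" + open_names.pop() + ">"


def as_stream(text: str) -> Iterator[str]:
    """Deliver text one character at a time, as a stream would."""
    for ch in text:
        yield ch


buffered = "".join(close_open_tags("<a><b>text"))
streamed = "".join(close_open_tags(as_stream("<a><b>text")))
assert buffered == streamed == "<a><b>text</b></a>"
```

The rule is written once against the abstract sequence, and whether the caller hands it a fully read string or a character-at-a-time generator makes no difference to the result. A rule that needed backtracking would have to buffer internally, but that is a property of the rule, not of the input model.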

David Lee
Lead Engineer
MarkLogic Corporation
Phone: +1 650-287-2531
Cell:  +1 812-630-7622

> -----Original Message-----
> From: Noah Mendelsohn [mailto:nrm@arcanedomain.com]
> Sent: Saturday, February 18, 2012 12:37 PM
> To: Norman Walsh
> Cc: W3C XML-ER Community Group
> Subject: Re: error recovery
> On 2/18/2012 7:32 AM, Norman Walsh wrote:
> > I'm coming around to the view expressed by Noah and David (and others)
> > that we'd be better off casting this as a new set of parsing rules for
> > interpreting some sequences of characters that resemble XML but are
> > not well-formed in a way that deterministically produces a tree.
> > I think when the process finishes, and we have a tree (if we have a
> > tree), it will be possible (for a human) to look back and say, we got
> > this tree by correcting these errors in these ways.
> Yes, I think that's generally where the focus should be. As I said in my
> earlier note, I think it's worth giving a bit of thought to whether it will
> be easy or hard to put reasonably tight bounds on identifying the subtrees
> that correspond to non-wellformed input. I also think we should demonstrate
> that the mapping can be implemented efficiently in a streaming processor
> for those who need streaming (though, in certain cases, there may be a
> tradeoff between streamability and the care taken in mapping
> non-wellformed input, as doing the latter well might involve backtracking).
> I don't think we should standardize the APIs that expose either the tree or
> error identifications, and I don't think we should standardize the
> characteristics of processors themselves.
> Noah

Received on Wednesday, 22 February 2012 12:55:53 UTC
