
Re: error recovery

From: Noah Mendelsohn <nrm@arcanedomain.com>
Date: Sun, 19 Feb 2012 12:56:17 -0500
Message-ID: <4F4137C1.6070805@arcanedomain.com>
To: David Lee <David.Lee@marklogic.com>
CC: Norman Walsh <ndw@nwalsh.com>, W3C XML-ER Community Group <public-xml-er@w3.org>

On 2/18/2012 1:08 PM, David Lee wrote:
> Question: If this *can't* be done in a streaming processor, what does that mean?
> Does it mean the input must be fully read in order to "fix" it?

It typically means that in order to parse later content you may have to 
either go back and revisit earlier content (possibly quite far back in a 
large document), or else hold onto large amounts of state retained from 
earlier in the document as you proceed to parse the rest. Also, it tends to 
mean that you cannot report content to a consuming application more or less 
as you go. Typically, in an XML parser, you need to retain things like the 
stack of open element names, the in-scope prefixes, and entity definitions, 
but not much else. So, one can argue that in that sense, XML tends to 
stream pretty well.
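
To make the "not much else" point concrete, here is a minimal sketch in 
Python using the standard xml.sax module. The names StackTracker and 
measure are made up for illustration. It only tracks the stack of open 
element names (not prefixes or entities), and shows that the retained 
state grows with nesting depth rather than with document size.

```python
import xml.sax


class StackTracker(xml.sax.ContentHandler):
    """Hypothetical handler: keeps only the stack of open element names."""

    def __init__(self):
        super().__init__()
        self.open_elements = []   # the state retained while parsing
        self.max_depth = 0        # largest the stack ever got
        self.elements_seen = 0    # how many elements were reported

    def startElement(self, name, attrs):
        self.open_elements.append(name)
        self.elements_seen += 1
        self.max_depth = max(self.max_depth, len(self.open_elements))

    def endElement(self, name):
        self.open_elements.pop()


def measure(doc):
    """Parse doc and return (elements reported, maximum stack depth)."""
    tracker = StackTracker()
    xml.sax.parseString(doc.encode("utf-8"), tracker)
    return tracker.elements_seen, tracker.max_depth


if __name__ == "__main__":
    doc = "<a>" + "<x>some bytes</x>" * 10000 + "</a>"
    print(measure(doc))
```

For a document with ten thousand sibling <x> elements under one root, the 
stack never holds more than two names.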

There are other languages for which a correct parse involves revisiting or 
retaining a lot more than that, and those languages might be viewed as 
streaming less well.

Strictly speaking, XML breaks the second criterion for streaming above. 
Consider the following simple document:

   <a>
   <x>some bytes</x>
   <x>some bytes</x>
   ...repeat the <x>'s 1 million times
   </b>

In principle, the only thing an XML parser should say about this is that 
it's not well formed, because the <a> does not match the </b>. In practice, 
SAX streaming parsers regularly fudge on this, cheerfully reporting the <x> 
elements before discovering at the end that it was all a mistake (and we 
really don't know if those <x>'s were good data, or whether the document 
was in fact mangled earlier). So, in that sense, XML is not a streaming 
format anyway.
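
The fudging is easy to observe. The following is a rough sketch, assuming 
Python's standard xml.sax module (which is backed by expat) as the 
streaming parser. CountingHandler and parse_mismatched are hypothetical 
names. It builds a document whose <a> is closed by </b>, counts the <x> 
elements the parser reports, and captures the error that only arrives at 
the end.

```python
import xml.sax


class CountingHandler(xml.sax.ContentHandler):
    """Hypothetical handler: counts <x> elements as they are reported."""

    def __init__(self):
        super().__init__()
        self.x_seen = 0

    def startElement(self, name, attrs):
        if name == "x":
            self.x_seen += 1


def parse_mismatched(n):
    """Parse <a>, n copies of <x>some bytes</x>, then a mismatched </b>.

    Returns (number of <x> elements reported, the parse error or None).
    """
    doc = "<a>" + "<x>some bytes</x>" * n + "</b>"
    handler = CountingHandler()
    error = None
    try:
        xml.sax.parseString(doc.encode("utf-8"), handler)
    except xml.sax.SAXParseException as exc:
        # The well-formedness error surfaces only when </b> is reached.
        error = exc
    return handler.x_seen, error


if __name__ == "__main__":
    seen, error = parse_mismatched(1000)
    print(seen, "elements reported before the error:", error)
```

By the time the error is raised, the consuming application has already 
been handed every <x> element, which is the sense in which the second 
criterion is broken.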

I presume our goal here is for XML-ER to be not much worse than XML in its 
streaming characteristics.

Received on Sunday, 19 February 2012 17:56:45 UTC
