Re: Draft - Fixup or Full XML Parser from Noah Mendelsohn on 2012-02-21 (public-xml-er@w3.org from February 2012)

From: Noah Mendelsohn <nrm@arcanedomain.com>
Date: Tue, 21 Feb 2012 18:28:35 -0500
To: Norman Walsh <ndw@nwalsh.com>
CC: W3C XML-ER Community Group <public-xml-er@w3.org>
Message-ID: <4F4428A3.8090604@arcanedomain.com>

I think there's an important difference between the way the mapping from 
input XML-ER to tree is documented in the spec, and how any particular 
implementation optimizes that. I certainly agree that nothing in the spec 
should >require< an implementation to produce or serialize a well-formed 
string of characters as an intermediate step.

I continue to think that documenting the transformation from input to 
output in a declarative manner is preferable, for all the reasons set out 
in [1]; it's easier to process and generate tooling automatically, easier 
to generate test cases automatically, etc. Of course, the degree to which a 
declarative exposition is practical depends in part on the mappings from 
input to output, including so-called "fix ups", that we want to support.

One way, though not necessarily the best way, to document such mappings 
would be at the source level. For example one could easily imagine a start 
tag mapping that would operate at the point that other cleanup had been 
done (e.g. poorly nested end tags and missing ">" characters unscrambled), 
and that would map unquoted attributes to some quoted equivalent. I think 
even Perl- or Ruby-grade regexp stuff is up to doing that.
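
As a rough illustration of that kind of start tag mapping, here is a sketch 
in Python rather than Perl or Ruby. The names START_TAG, ATTR and 
quote_attributes are made up for this example. It assumes the earlier 
cleanup has already happened, that the input contains no comments or CDATA 
sections, and that bare values contain no quotes, whitespace or "<" and ">" 
characters.

```python
import re

# One start tag: "<", a name, then quoted strings or other non-bracket text, then ">".
START_TAG = re.compile(r"""<[A-Za-z_][^\s<>/]*(?:"[^"]*"|'[^']*'|[^<>"'])*>""")

# One attribute inside a start tag: " name=" followed by a quoted or a bare value.
# A bare value ends at whitespace or at the tag's closing ">" or "/>".
ATTR = re.compile(r"""(\s+[^\s=<>"'/]+\s*=\s*)("[^"]*"|'[^']*'|[^\s"'<>]+?(?=\s|/?>))""")


def _quote(match):
    """Leave quoted values alone; wrap bare values in double quotes."""
    prefix, value = match.group(1), match.group(2)
    if value[0] in "\"'":
        return match.group(0)
    return prefix + '"' + value + '"'


def quote_attributes(source):
    """Source-to-source mapping: rewrite attributes only inside start tags."""
    return START_TAG.sub(lambda tag: ATTR.sub(_quote, tag.group(0)), source)


print(quote_attributes('<a href=x.html class="c">hi</a>'))
```

The mapping stays at the source level: characters in, characters out, with 
no tree involved.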

If we take that route, then the mappings we would document would be from 
non-well-formed to well-formed source. The rest of the tree building would 
follow from existing specs, with the nice result that your choice of 
Infoset, XPath-DM or whatever would fall out for free.
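
To make the "fall out for free" point concrete, here is a sketch in Python 
under some loud assumptions. The fix-up shown (escaping bare ampersands) is 
just a stand-in picked for illustration, not something the group has agreed 
on, and the names BARE_AMP, escape_bare_ampersands and build_tree are 
hypothetical. It also assumes any named entity references in the input are 
ones the downstream parser already knows. Once the source is well-formed, an 
ordinary XML parser (here the standard library's ElementTree) does the rest.

```python
import re
import xml.etree.ElementTree as ET

# An "&" that does not begin a syntactically valid entity or character reference.
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z_][\w.\-]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def escape_bare_ampersands(source):
    """Illustrative fix-up: map a bare "&" to "&amp;" at the source level."""
    return BARE_AMP.sub("&amp;", source)


def build_tree(source, fixups):
    """Apply each source-to-source mapping in order, then parse as ordinary XML."""
    for fixup in fixups:
        source = fixup(source)
    return ET.fromstring(source)


tree = build_tree("<p>fish & chips &lt; steak</p>", [escape_bare_ampersands])
print(tree.tag, repr(tree.text))
```

Swapping ET.fromstring for a different XML parser would give a different 
data model from the same documented mappings.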

As I say, I would not expect implementations to actually produce the 
well-formed source or any other intermediate mapping; rather, they would 
implement an optimized path from input source to output API.

Still, declarative exposition is better when possible, and documenting some 
mappings at the source level does have some advantages. I don't think we 
should rule it out as an option.

BTW: I think that one of the reasons HTML5 found an algorithmic exposition 
more practical was the need to support asynchronous scripting that operates 
in parallel with the parse(s). We don't have that requirement for XML-ER, I 
don't think (or if we do, we should state it explicitly). With XML-ER, all 
we've said we need is a mapping from each (potentially not-well-formed) 
input to a corresponding result tree. I think much or all of that can and 
probably should be set out declaratively.

Noah

[1] http://www.w3.org/2001/tag/doc/leastPower.html

On 2/21/2012 5:03 PM, Norman Walsh wrote:
> David Lee<David.Lee@marklogic.com>  writes:
>> Norm, what's your opinion on the use case of using an ER parser as a
>> front-end to an existing parser.
>> To me that seems the simplest and most useful case. (although almost
>> certainly not the most *efficient*).
>
> It seems to me that by the time the ER parser has figured out how to do
> the fixup, it could just generate the tree more easily than turning
> it back into characters for a second parser to read.
>
>                                          Be seeing you,
>                                            norm
>

Received on Tuesday, 21 February 2012 23:29:08 UTC