Re: Draft from David Carlisle on 2012-02-21 (public-xml-er@w3.org from February 2012)

From: David Carlisle <davidc@nag.co.uk>
Date: Tue, 21 Feb 2012 10:29:35 +0000
To: public-xml-er@w3.org
Message-ID: <4F43720F.1030509@nag.co.uk>

On 21/02/2012 02:17, Noah Mendelsohn wrote:
>
>
> On 2/20/2012 8:22 PM, David Carlisle wrote:
>> I agree that the input shouldn't be described as "XML" but it
>> needn't purport to be XML either. If I choose to parse
>> "<foo>a</bar>" with this parser I don't need to (or get the
>> document to) purport that it is XML, I just want an XML-compatible
>> result so I can bash it with XSLT (typically).
>
> Are you sure you want to do that with your example?

Yes, I think so; or more exactly, I don't want the requirements drafted
in a way that prevents us from deciding we want that.

> It's really not clear what a user intended here. Most likely XML-ER
> will produce some tree out of this input, but if the author intended
> anything like what we know as XML, the results of any fixup have at
> least a 50/50 chance of not being "correct" (did the user mean a
> "foo" element, a "bar" element, or something else?).


I don't think treating it as fixup really works. If I _choose_ to
use this parser rather than an XML one then I'm asserting that I want to
get a DOM (or XDM or whatever) tree out of some input. I don't intend to
edit or in any way fix the input so that it becomes XML. I'd view it the
same way as using a SAX parser for GEDCOM (from Michael's XSLT book) or
CSV or JSON. You parse the input, which needn't look like XML at all, get
an XML-compatible parse tree, and then subsequent applications work with
it as if it were XML. Asking at what point in the file the JSON input
wasn't well-formed XML isn't very useful. I know I'm overstating the
point, as xml-er "looks like" xml and has the requirement that if the
input happens to _be_ xml then parsing with xml-er or xml should give the
same result, but I think comparing it to using a non-xml parser is a more
useful idiom than comparing it to a syntactic fixup followed by an xml
parse.
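
A minimal sketch of that idiom in Python, using JSON as the non-XML
input. The json_to_tree helper and its mapping from JSON keys to element
names are hypothetical, chosen only for illustration, and the sketch
assumes every key is a valid XML name:

```python
import json
import xml.etree.ElementTree as ET


def json_to_tree(value, tag="item"):
    """Build an element tree from a parsed JSON value (hypothetical mapping)."""
    elem = ET.Element(tag)
    if isinstance(value, dict):
        # Each key becomes a child element named after the key.
        for key, child in value.items():
            elem.append(json_to_tree(child, tag=key))
    elif isinstance(value, list):
        # Array members become repeated <item> children.
        for child in value:
            elem.append(json_to_tree(child, tag="item"))
    else:
        # Scalars become text content (a simplification).
        elem.text = str(value)
    return elem


# The input is JSON, not XML; it is parsed, never "fixed up" into XML syntax.
source = '{"name": "foo", "children": ["a", "b"]}'
root = json_to_tree(json.loads(source), tag="root")

# Downstream code works with the result as it would with any XML tree.
names = [e.text for e in root.iter("item")]
serialized = ET.tostring(root, encoding="unicode")
```

Nothing downstream needs to know, or ask, where the JSON text stopped
being well-formed XML.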

>
> Of course, once the XML-ER spec is written, there will be some
> answer. Let's say the answer it gives is to assume that the </bar>
> was meant to be a </foo>.

No, it won't say that </bar> was meant to be </foo> any more than the XML
1.0 spec says that <foo  a   =  'b'   > was meant to be <foo a="b">.
The grammar of XML 1.0 just results in those two things being
equivalent; it doesn't need to make a judgement about which is more
correct, or say that one is changed into the other. An xml-er parser
(might) make <foo a=b> have the same result, but again we don't need to
use language that implies that <foo a=b> is "fixed" in any way.
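
The XML 1.0 equivalence can be checked with any conforming parser. A
small sketch using Python's xml.etree.ElementTree; the start tags are
written as empty-element tags here only so that each string is a complete
document:

```python
import xml.etree.ElementTree as ET

# Two spellings of the same start tag, made self-closing for the example.
spaced = ET.fromstring("<foo  a   =  'b'   />")
plain = ET.fromstring('<foo a="b"/>')

# The parser reports the same element name and attributes for both;
# neither spelling is treated as a corrected form of the other.
same = (spaced.tag, spaced.attrib) == (plain.tag, plain.attrib)
```

Here `same` is True: the two inputs simply map to the same tree.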


> OK, do we really want to tell users to
> write <foo>a</bar> as a first class way of getting a <foo> element?
>
> I don't think so. I think we want to distinguish content that is
> correct or preferred from that which is tolerated. For the moment, I
> would assume that the "correct" content is well-formed XML.

We can agree that some things are flagged as parse errors for xml-er (and
my mismatched end tag would be so flagged in Anne's draft), but the list
of things that are not flagged might end up being very long, so in the
end the assumption that an xml-er document that does not generate a parse
error will be well-formed XML will be (or might be) far from the truth.

> We might loosen that a bit to include some additional constructs like
> unquoted attributes, or perhaps names that use other than XML name
> characters. In general, though, I think we do want to identify a
> class of correct input, and I think that will be very close in
> spirit, if not necessarily in all details, to XML.
>
> Noah
>
>
If we define things such that every xml-er document that does not
generate a parse error is well-formed xml then you can mechanically pass
such a document (unparsed) into an xml pipeline. If there are _any_
cases where this does not hold then you cannot, and so I don't
personally feel it is particularly useful to know that if there are no
xml-er parse errors it is "almost" xml. I think, as Shane said, we should
just define xml-er parsing in a way that makes sense in that context and
then just see at the end how far it differs from XML given
non-well-formed input.
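
To make the pipeline point concrete, here is a rough sketch of the check
an XML pipeline effectively applies to unparsed input. The is_well_formed
helper is a hypothetical name, and it uses Python's standard XML parser,
not any xml-er implementation:

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if a standard XML parser accepts the text unchanged."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Inputs taken from this thread: one well-formed, two that are not.
samples = ["<foo>a</foo>", "<foo>a</bar>", "<foo a=b/>"]
results = {s: is_well_formed(s) for s in samples}
```

Only the first sample passes. If xml-er accepted either of the others
without a parse error, "no xml-er parse errors" would no longer guarantee
that the unparsed text can be handed to an XML pipeline.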

David

Received on Tuesday, 21 February 2012 10:30:00 UTC