Re: tag name state from David Carlisle on 2012-03-05 (public-xml-er@w3.org from March 2012)

From: David Carlisle <davidc@nag.co.uk>
Date: Mon, 05 Mar 2012 16:42:56 +0000
To: public-xml-er@w3.org
Message-ID: <4F54ED10.90801@nag.co.uk>

On 05/03/2012 09:37, George Cristian Bina wrote:
> Hi,
>
> I think that it will be easier to get a first form finalized if we
> focus on the browsers usecase, and that means mainly getting from
> not well-formed XML to a DOM.
>

Hmm, over the weekend I experimented with what you'd need to do to the
current draft to change the tag name state to check XML names, as
suggested at the start of this thread. The result is attached. It proved
a useful exercise (to me at least :-) whether or not the group decides
to go this way, as it forced me to review the current draft states
rather more carefully.

As the attached isn't tested by code it's almost certainly wrong in
parts, but I attach it as some may find the comparison useful. (If
anyone does think it a route worth exploring we should probably check it
into source control, but that's probably premature at this stage.)

Comparing the main approaches, there are some differences as to how to
tokenise <foo<bar (as two tokens or as one with a weird name). That
difference could be made anyway and relates mainly to how closely we
want to stick to html5. The main difference is that this version stops
scanning for an element name when it gets to a non-Name character.
This implies a cost during tokenisation as (a) the xml-er system has to
have a list (or specification of the code ranges) of the Name Characters
and (b) it has to check the input stream against them.
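
As a rough illustration of costs (a) and (b), here is a minimal sketch in
Python. It is not taken from the attached draft: the function names
(is_name_start_char, is_name_char, scan_tag_name) are hypothetical, and
the only thing assumed to be normative is the set of code ranges, which
follow the NameStartChar and NameChar productions of XML 1.0 (Fifth
Edition).

```python
# (a) The table of Name characters, as inclusive code point ranges.
# NameStartChar production of XML 1.0 (Fifth Edition).
NAME_START_RANGES = [
    (0x3A, 0x3A), (0x41, 0x5A), (0x5F, 0x5F), (0x61, 0x7A),
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF), (0x370, 0x37D),
    (0x37F, 0x1FFF), (0x200C, 0x200D), (0x2070, 0x218F),
    (0x2C00, 0x2FEF), (0x3001, 0xD7FF), (0xF900, 0xFDCF),
    (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]
# Additional characters NameChar allows after the first one:
# "-" and ".", digits, #xB7, combining marks, #x203F-#x2040.
NAME_EXTRA_RANGES = [
    (0x2D, 0x2E), (0x30, 0x39), (0xB7, 0xB7),
    (0x300, 0x36F), (0x203F, 0x2040),
]


def _in_ranges(ch, ranges):
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ranges)


def is_name_start_char(ch):
    return _in_ranges(ch, NAME_START_RANGES)


def is_name_char(ch):
    return is_name_start_char(ch) or _in_ranges(ch, NAME_EXTRA_RANGES)


# (b) Checking the input stream against the table.
def scan_tag_name(text, pos):
    """Scan an XML Name starting at text[pos].

    Returns (name, new_pos). The name is empty and pos is unchanged
    if text[pos] cannot start a Name. Scanning stops at the first
    non-Name character, which is left for the caller to handle.
    """
    if pos >= len(text) or not is_name_start_char(text[pos]):
        return "", pos
    end = pos + 1
    while end < len(text) and is_name_char(text[end]):
        end += 1
    return text[pos:end], end
```

With this sketch, scanning "<foo<bar" from just after the first "<" stops
at the second "<", so the caller sees the name "foo" and can decide how
to treat what follows.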

I still think that XML parsing shows that neither of these costs is
prohibitive, and if we were to insist that an xml-er system had a way to
serialise its tree to well-formed XML, the same costs re-appear, just
in a different (admittedly less used) place.

In this version it only checks for Name rather than NCName ":" NCName,
i.e. XML rules rather than XML Namespaces rules. So, as Mohamed just
pointed out, George's example would be OK with this; however, presumably
a similar example could be constructed in which the name was not
well-formed XML at all.
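
To make the Name versus NCName ":" NCName distinction concrete, here is a
small sketch in Python. The helper names (is_name, is_ncname, is_qname)
and the sample strings are hypothetical and are not George's example; the
character classes are assumed to follow the XML 1.0 (Fifth Edition) Name
production, and the QName check follows Namespaces in XML, where a QName
is either an NCName or NCName ":" NCName.

```python
import re

# Character classes for the XML 1.0 (Fifth Edition) Name production.
_START = (
    ":A-Z_a-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D"
    "\u037F-\u1FFF\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF"
    "\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD\U00010000-\U000EFFFF"
)
# NameChar adds ".", digits, #xB7, combining marks, #x203F-#x2040
# and "-" (kept last so it is read as a literal hyphen).
_REST = _START + ".0-9\u00B7\u0300-\u036F\u203F-\u2040-"

_NAME_RE = re.compile(f"[{_START}][{_REST}]*")


def is_name(s):
    """XML rules: the whole string matches the Name production."""
    return _NAME_RE.fullmatch(s) is not None


def is_ncname(s):
    """An NCName is a Name that contains no colon."""
    return is_name(s) and ":" not in s


def is_qname(s):
    """XML Namespaces rules: NCName, or NCName ":" NCName."""
    parts = s.split(":")
    return len(parts) <= 2 and all(is_ncname(p) for p in parts)


if __name__ == "__main__":
    for sample in ["a:b", "a:b:c", "1a"]:
        print(sample, is_name(sample), is_qname(sample))
```

Here "a:b:c" passes the Name check but not the namespace-aware one, while
"1a" fails both, which is the kind of name that is not well-formed XML
at all.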

_If_ the consensus is that we should just target DOM, I think that's a
shame, as it severely restricts the ability to position xml-er as an
"error recovery xml parser": it wouldn't be usable in most places xml
parsers are used. However, it would be usable on the web, and as such I
would think that just doing what html5 does as far as tokenisation goes
would gain a lot more relevance. So I'd argue that in that case we stick
very closely to Anne's current draft.

David


Attachments

text/html attachment: Overview.dpc.html

Received on Monday, 5 March 2012 16:43:28 UTC