tag name state from David Carlisle on 2012-02-29 (public-xml-er@w3.org from February 2012)

From: David Carlisle <davidc@nag.co.uk>
Date: Wed, 29 Feb 2012 13:34:35 +0000
To: "public-xml-er@w3.org Community Group" <public-xml-er@w3.org>
Message-ID: <4F4E296B.8020401@nag.co.uk>

Jeni said

> And to be specific, my suggestion is that when in the Tag name state
> [2], if the next character is < then this is a Parse Error, and the
> parser emits the current token and reprocesses the current input
> character (<) in the data state.

_If_ we are going to differ from HTML5 at this point I think I would go
further. We have, I think, a hard requirement that any tree has a
serialisation as namespace well formed XML. If we tokenise a start tag
at this point whose name isn't a legal XML name, then inevitably there
will have to be some arbitrary character mangling, leading to names such as

oneU00003CtwoU00003CthreeU00003C
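
To make the mangling concrete, here is a rough sketch of one way such a name could arise. It assumes the escaping convention is "U" followed by six uppercase hex digits of the code point, which is what the name above suggests; the function name coerce_name is made up for illustration and is not taken from any spec or library.

```python
# XML 1.0 (5th ed.) NameStartChar ranges, including ":" (0x3A).
NAME_START = [
    (0x3A, 0x3A), (0x41, 0x5A), (0x5F, 0x5F), (0x61, 0x7A),
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF), (0x370, 0x37D),
    (0x37F, 0x1FFF), (0x200C, 0x200D), (0x2070, 0x218F),
    (0x2C00, 0x2FEF), (0x3001, 0xD7FF), (0xF900, 0xFDCF),
    (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]
# Extra characters allowed after the first one (NameChar minus NameStartChar).
NAME_EXTRA = [(0x2D, 0x2E), (0x30, 0x39), (0xB7, 0xB7),
              (0x300, 0x36F), (0x203F, 0x2040)]


def _in_ranges(ch, ranges):
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ranges)


def coerce_name(name):
    """Hypothetical helper: escape every character that is not legal at
    its position in an XML Name as 'U' + six uppercase hex digits."""
    out = []
    for i, ch in enumerate(name):
        legal = _in_ranges(ch, NAME_START) or (
            i > 0 and _in_ranges(ch, NAME_EXTRA))
        out.append(ch if legal else "U%06X" % ord(ch))
    return "".join(out)


# A tag name that swallowed three "<" characters:
print(coerce_name("one<two<three<"))
```

The result is a legal XML name, but it bears little relation to anything the author of the input intended.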

How would it work if we split up tag name state into a series of states 
so the only characters accepted are

name start
optional name characters (excluding ":")
optional
   :
   name start
   optional name characters (excluding ":")

i.e. only namespace well formed names are accepted,

using the XML 1.1 / XML 1.0 fifth edition definitions of name start and
name characters.

In each of these states, if a non-name character is seen it is put back 
and reprocessed in data state. If that happens on the first character, 
the < is put back as data and no tag is tokenised at all.
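
The states above could be sketched roughly as follows. This is only an illustration under some assumptions of mine: the name start class is taken to exclude ":" (otherwise the result would not be namespace well formed), a ":" that is not followed by a name start character is left unconsumed rather than ending up in the name, and the function name scan_tag_name is invented for the example. It only scans the name; what the tokeniser does with white space, "/" or ">" afterwards is not covered.

```python
# XML 1.0 (5th ed.) NameStartChar with ":" removed (assumption: needed
# so that every accepted name is namespace well formed).
NCNAME_START = [
    (0x41, 0x5A), (0x5F, 0x5F), (0x61, 0x7A),
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF), (0x370, 0x37D),
    (0x37F, 0x1FFF), (0x200C, 0x200D), (0x2070, 0x218F),
    (0x2C00, 0x2FEF), (0x3001, 0xD7FF), (0xF900, 0xFDCF),
    (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]
# Additional name characters: "-", ".", digits, and the combining ranges.
NCNAME_EXTRA = [(0x2D, 0x2E), (0x30, 0x39), (0xB7, 0xB7),
                (0x300, 0x36F), (0x203F, 0x2040)]


def _in_ranges(ch, ranges):
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ranges)


def is_name_start(ch):
    return _in_ranges(ch, NCNAME_START)


def is_name_char(ch):
    return is_name_start(ch) or _in_ranges(ch, NCNAME_EXTRA)


def scan_tag_name(text, pos):
    """Scan a tag name starting at text[pos], the character after "<".

    Returns (name, end). name is None when the first character is not a
    name start character, meaning no tag is tokenised and the "<" is
    plain data. end is the index of the first character to reprocess.
    """
    n = len(text)
    i = pos
    # State 1: name start.
    if i >= n or not is_name_start(text[i]):
        return None, pos
    i += 1
    # State 2: optional name characters, colon excluded.
    while i < n and is_name_char(text[i]):
        i += 1
    # States 3-5: optional ":" + name start + name characters.
    # Assumption: the colon is only consumed if a name start follows.
    if i + 1 < n and text[i] == ":" and is_name_start(text[i + 1]):
        i += 2
        while i < n and is_name_char(text[i]):
            i += 1
    return text[pos:i], i


print(scan_tag_name("<one<two<three<", 1))  # stops at the second "<"
print(scan_tag_name("<a:b:c>", 1))          # second colon is not accepted
print(scan_tag_name("<<x", 1))              # no tag tokenised at all
```

With this shape the tokeniser never produces a name that needs mangling, at the cost of treating more input as character data.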

And the same for attribute names, of course.

David



Received on Wednesday, 29 February 2012 13:35:07 UTC