- From: James Clark <jjc@jclark.com>
- Date: Wed, 5 Dec 2012 15:37:06 +0700
- To: David Carlisle <davidc@nag.co.uk>
- Cc: public-microxml@w3.org
- Message-ID: <CANz3_EYETKwUBSc6mA5Aj8BNm8HWBX_ftRNsZtriDqRMGbrwZQ@mail.gmail.com>
On Tue, Nov 27, 2012 at 7:53 PM, David Carlisle <davidc@nag.co.uk> wrote:

> If the "error recovery" produces a data model that can not be queried or
> constrained by conforming (micro-)xml tools then the recovery aspect is
> a bit of a false promise.

I don't think it's nearly as bad as "cannot be queried or constrained".

To make this concrete, let's suppose the recovery process produces a tree in which some element names contain $, but that your schema or query language syntax does not allow you to write an element name containing $. This means you cannot query conveniently for an element containing $, nor can you conveniently constrain the tree to contain such an element name. But why should you want to?

The objective is not to allow you to design vocabularies that contain illegal characters. You can still constrain the tree (which may contain illegal element names) to match a vocabulary, provided the vocabulary uses legal names. Similarly for querying.

> That is, I think that the tokenisation stage should guarantee to return
> valid name tokens, either by adding a fixup stage there,

If the tokenization stage delivers the names as is, then a separate layer can do any fixup it wants. If the tokenization stage mangles the names, then it is hard for any other stage to do fixup (because fixup will necessarily lose information).

Apart from names, the other data model issue is with data characters. It would be easy to specify that the parser replace any illegal character in data by 0xFFFD, but I don't think this would be very useful. Better for the parser to leave the illegal characters as is, and for a separate layer to deal with illegal characters in the way that's appropriate to the context.

> or (more
> likely?) by redefining valid name tokens in the data model to be
> unconstrained arbitrary strings.

I still prefer this option.
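
To illustrate the layering argued for above, here is a minimal Python sketch. It assumes a tree of (name, children) tuples as the tokenizer's output, an ASCII-only approximation of a legal name, and the XML 1.0 Char production as a stand-in for "legal data character". None of these choices come from the MicroXML draft, and the function names are hypothetical.

```python
import re

# Assumption: ASCII-only approximation of a legal name, for brevity.
LEGAL_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*")


def is_legal_name(name):
    """True if the name fits the (approximate) legal-name pattern."""
    return LEGAL_NAME.fullmatch(name) is not None


def is_legal_char(ch):
    """True if ch is allowed by the XML 1.0 Char production (a stand-in)."""
    cp = ord(ch)
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)


def find_by_name(node, wanted):
    """Yield every element named `wanted`, even below illegally named ones."""
    name, children = node
    if name == wanted:
        yield node
    for child in children:
        if not isinstance(child, str):
            yield from find_by_name(child, wanted)


def replace_illegal_chars(text):
    """One possible fixup policy, applied in a layer above the parser."""
    return "".join(ch if is_legal_char(ch) else "\ufffd" for ch in text)


# The tokenizer hands over names and data exactly as found in the input.
tree = ("doc", [("pri$ce", ["12\x01"]), ("title", ["MicroXML"])])

# Querying by a legal name still works on a tree containing illegal names.
titles = list(find_by_name(tree, "title"))

# A later layer can choose replacement; another could reject or strip instead.
cleaned = replace_illegal_chars("12\x01")
```

Because the tree keeps "pri$ce" and the U+0001 character intact, each consumer remains free to pick its own policy. Had the parser substituted U+FFFD itself, the original input could no longer be recovered.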

> If any xml character data is allowed as an element name in the data
> model then xpath can still query those elements with something like
> *[name()='.....'] but if the data model is opened up even further to
> allow any Unicode string as a name, a suitably extended xpath-like query
> language would probably need some extended quote mechanism to refer to
> non-xml characters like U+0001

I think such an extended quote mechanism is a good idea. CSS already has it: CSS escapes can be used in identifiers, which allows identifiers to contain arbitrary Unicode code points.

James
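
For reference, a CSS-style escape is a backslash followed by one to six hexadecimal digits and an optional terminating space. The Python sketch below shows how a query language might borrow that idea for names. The function names are hypothetical, and it does not implement the full CSS serialization rules (for example, it does not special-case U+0000).

```python
import re


def css_style_escape(name):
    """Escape every character except ASCII letters, digits and underscore.

    A leading digit is escaped too, since an identifier cannot start with one.
    The trailing space ends each escape so a following hex digit is not
    swallowed.
    """
    out = []
    for i, ch in enumerate(name):
        plain = ch.isascii() and (ch.isalnum() or ch == "_")
        if plain and not (i == 0 and ch.isdigit()):
            out.append(ch)
        else:
            out.append("\\%x " % ord(ch))
    return "".join(out)


def css_style_unescape(text):
    """Reverse css_style_escape: decode backslash-hex escapes."""
    return re.sub(r"\\([0-9a-fA-F]{1,6}) ?",
                  lambda m: chr(int(m.group(1), 16)),
                  text)


escaped = css_style_escape("a\x01b")       # backslash, "1", space between a and b
round_trip = css_style_unescape(css_style_escape("pri$ce"))
```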
Received on Wednesday, 5 December 2012 08:44:26 UTC