Re: Error recovery from James Clark on 2012-12-05 (public-microxml@w3.org from December 2012)

From: James Clark <jjc@jclark.com>
Date: Wed, 5 Dec 2012 15:37:06 +0700
To: David Carlisle <davidc@nag.co.uk>
Cc: public-microxml@w3.org
Message-ID: <CANz3_EYETKwUBSc6mA5Aj8BNm8HWBX_ftRNsZtriDqRMGbrwZQ@mail.gmail.com>

On Tue, Nov 27, 2012 at 7:53 PM, David Carlisle <davidc@nag.co.uk> wrote:

>
> If the "error recovery" produces a data model that can not be queried or
> constrained by conforming (micro-)xml tools then the recovery aspect is
> a bit of a false promise.
>

I don't think it's nearly as bad as "cannot be queried or constrained".  To
make this concrete, let's suppose the recovery process produces a tree in
which some element names contain $, but that your schema or query language
syntax does not allow you to write an element name containing $.  This
means you cannot query conveniently for an element containing $, nor can
you conveniently constrain the tree to contain such an element name.   But
why should you want to?  The objective is not to allow you to design
vocabularies that contain illegal characters.  You can still constrain the
tree (which may contain illegal element names) to match a vocabulary,
provided the vocabulary uses legal names. Similarly for querying.
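
As a rough illustration of that point, here is a minimal Python sketch. The
tuple-based tree, the VOCABULARY set and the unknown_names function are all
made up for the example and are not taken from any MicroXML tool. A
vocabulary that uses only legal names is enough to flag an element whose
recovered name contains $:

```python
# Hypothetical data model: an element is (name, attributes, children);
# children are either elements (tuples) or strings.
VOCABULARY = {"doc", "p", "em"}  # legal names only (assumed for illustration)

def unknown_names(element, vocabulary=VOCABULARY):
    """Return the element names in the tree that the vocabulary does not allow."""
    name, _attrs, children = element
    found = [] if name in vocabulary else [name]
    for child in children:
        if isinstance(child, tuple):
            found.extend(unknown_names(child, vocabulary))
    return found

# A tree as error recovery might deliver it, with one illegal name "p$".
tree = ("doc", {}, [("p", {}, ["ok"]), ("p$", {}, ["recovered"])])
print(unknown_names(tree))  # ['p$']
```

The schema never has to spell the name "p$"; it only has to say what is
allowed.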

> That is, I think that the tokenisation stage should guarantee to return
> valid name tokens, either by adding a fixup stage there,

If the tokenization stage delivers the names as is, then a separate layer
can do any fixup it wants.  If the tokenization stage mangles the names,
then it is hard for any other stage to do fixup (because fixup will
necessarily lose information).
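
A small Python sketch of why mangling loses information. The name rule, the
mangle function and the tokenize_names function are assumptions for
illustration only; the rule is deliberately simplified and is not the
MicroXML name grammar:

```python
import re

# Assumed, simplified notion of a legal name: an ASCII letter or '_' first,
# then letters, digits, '_', '-' or '.'. NOT the real MicroXML name grammar.
ILLEGAL = re.compile(r"[^A-Za-z0-9_.-]")

def mangle(name):
    """Lossy fixup: replace each illegal character with '_'."""
    fixed = ILLEGAL.sub("_", name)
    if not fixed or not re.match(r"[A-Za-z_]", fixed):
        fixed = "_" + fixed
    return fixed

def tokenize_names(raw_names):
    """Stand-in for a tokenization stage that delivers names as is."""
    return list(raw_names)

names = tokenize_names(["a$b", "a#b"])
print(names)                       # ['a$b', 'a#b']  (still distinguishable)
print([mangle(n) for n in names])  # ['a_b', 'a_b']  (no longer distinguishable)
```

If mangle ran inside the tokenizer, a later layer could not tell the two
names apart; run as a separate layer, it stays optional and replaceable.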

Apart from names, the other data model issue is with data characters.  It
would be easy to specify that the parser replace any illegal character in
data by U+FFFD, but I don't think this would be very useful.  Better for
the parser to leave the illegal characters as is, and for a separate layer
to deal with illegal characters in the way that's appropriate to the
context.
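
For example, a separate layer might look roughly like the following Python
sketch. The is_illegal test (C0 controls other than tab, LF and CR) and both
function names are assumptions chosen for illustration; the real set of
illegal characters is for the spec to define:

```python
# Assumption for illustration: treat C0 controls other than tab, LF and CR
# as the "illegal" characters. The real rules belong to the spec.
def is_illegal(ch):
    return ord(ch) < 0x20 and ch not in "\t\n\r"

def for_display(text):
    """One context: substitute U+FFFD so the text can be shown."""
    return "".join("\ufffd" if is_illegal(c) else c for c in text)

def for_strict_consumer(text):
    """Another context: refuse the data rather than silently change it."""
    bad = [i for i, c in enumerate(text) if is_illegal(c)]
    if bad:
        raise ValueError(f"illegal character at offsets {bad}")
    return text

data = "a\x01b"  # as delivered by the parser, unchanged
shown = for_display(data)
print(len(shown), hex(ord(shown[1])))
```

Both policies are possible only because the parser left the data alone.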

> or (more
> likely?) by redefining valid name tokens in the data model to be
> unconstrained arbitrary strings.
>

I still prefer this option.

>
> If any xml character data is allowed as an element name in the data
> model then xpath can still query those elements with something like
> *[name()='.....'] but if the data model is opened up even further to
> allow any Unicode string as a name, a suitably extended xpath-like query
> language would probably need some extended quote mechanism to refer to
> non-xml characters like U+0001

I think such an extended quote mechanism is a good idea.  CSS already has
it: CSS escapes can be used in identifiers, which allows identifiers to
contain arbitrary Unicode code points.
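
A rough Python sketch of the CSS-style mechanism: a backslash, the code point
in hex, and a terminating space. The css_escape function is written for this
example and is simplified; it does not cover every corner case of CSS
identifier serialization:

```python
def css_escape(name):
    """Write name as a CSS identifier, using hex escapes where needed.

    Simplified sketch: it ignores some corner cases of CSS identifier
    serialization (for example a leading '-' followed by a digit).
    """
    out = []
    for i, ch in enumerate(name):
        # ASCII letters, digits, '_' and '-' and all non-ASCII pass through.
        plain = ch.isascii() and (ch.isalnum() or ch in "_-") or ord(ch) >= 0x80
        if plain and not (i == 0 and ch.isdigit()):
            out.append(ch)
        else:
            # Hex escape; the trailing space terminates the escape.
            out.append(f"\\{ord(ch):x} ")
    return "".join(out)

print(css_escape("a$b"))     # a\24 b
print(css_escape("x\x01y"))  # x\1 y
```

An xpath-like language could adopt the same kind of escape inside name tests
or string literals to refer to a character such as U+0001.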

James

Received on Wednesday, 5 December 2012 08:44:26 UTC