Re: the UPA-constraint and danish word division from C. M. Sperberg-McQueen on 2006-09-16 (xmlschema-dev@w3.org from September 2006)

From: C. M. Sperberg-McQueen <cmsmcq@acm.org>
Date: Fri, 15 Sep 2006 21:34:11 -0600
To: Marie Bilde Rasmussen <mariebilderas@gmail.com>
Cc: "C. M. Sperberg-McQueen" <cmsmcq@acm.org>, xmlschema-dev@w3.org
Message-Id: <ED5D4F56-5BD0-403B-94A6-3BD524390D88@acm.org>

On 15 Sep 2006, at 14:54 , Marie Bilde Rasmussen wrote:

 > Hello everybody.
 > I can't represent the grammar that I need in a W3C schema
 > without violating the UPA-constraint. ...

 > This is my grammar expressed as an EBNF:

[names reduced to initials, for brevity -MSM]

 >    ( h, (w, (((h, b?) | (b, h?))? w)+ ))
 >  |     ((w, (((h, b?) | (b, h?))? w)+ ), h?)
 > ...

 > I can see, that my EBNF-representation violates the
 > UPA-constraint in the sense that it is not unambiguous which
 > branch in the grammar tree is to be used, when a hyphen is
 > encountered immediately following a wordpart in the input data.

 From a first visual examination I think the problem is solely
with the second branch of the outer 'or'.  The first branch looks
fine.  (Software I've consulted confirms this.) But within the
second branch, you are exactly right: once the sub-expression

     (((h, b?) | (b, h?))? w)+

has been satisfied once, the next hyphen could match either the
one at the beginning of the expression, or the one after the
expression.
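
The conflict described above can be reproduced mechanically.  What
follows is a minimal sketch, not the check any particular validator
performs: it numbers each token occurrence in a content model,
computes which occurrences may follow which (the usual position or
Glushkov construction), and reports any point where two different
occurrences with the same name could match next.  The helper names
(sym, seq, alt, opt, plus, upa_conflicts) are invented for this
illustration, and the sketch assumes only named elements with the
?, + and * occurrence indicators, so it ignores wildcards and
numeric minOccurs/maxOccurs.

```python
from itertools import count

# Constructors for a content-model tree (hypothetical helper names).
def sym(name): return ("sym", name)
def seq(*parts): return ("seq", parts)
def alt(*parts): return ("alt", parts)
def opt(e): return ("opt", e)
def plus(e): return ("plus", e)

def analyse(node, counter, labels, follow):
    """Return (nullable, first, last) for node; fill labels and follow."""
    kind = node[0]
    if kind == "sym":
        pos = next(counter)          # each occurrence gets its own number
        labels[pos] = node[1]
        follow[pos] = set()
        return False, {pos}, {pos}
    if kind == "seq":
        nullable, first, last = True, set(), set()
        for part in node[1]:
            n, f, l = analyse(part, counter, labels, follow)
            for p in last:           # whatever ended so far may continue here
                follow[p] |= f
            first = first | f if nullable else first
            last = last | l if n else l
            nullable = nullable and n
        return nullable, first, last
    if kind == "alt":
        nullable, first, last = False, set(), set()
        for part in node[1]:
            n, f, l = analyse(part, counter, labels, follow)
            nullable = nullable or n
            first |= f
            last |= l
        return nullable, first, last
    # "opt", "plus" and "star" all wrap a single sub-expression
    n, f, l = analyse(node[1], counter, labels, follow)
    if kind in ("plus", "star"):
        for p in l:                  # the repetition may start over
            follow[p] |= f
    return (n or kind != "plus"), f, l

def upa_conflicts(model):
    """List (after_position, name, position_a, position_b) conflicts.

    after_position 0 stands for the start of the content model.
    """
    labels, follow = {}, {}
    _, first, _ = analyse(model, count(1), labels, follow)
    conflicts = []
    for src, targets in [(0, first)] + sorted(follow.items()):
        seen = {}
        for pos in sorted(targets):
            name = labels[pos]
            if name in seen:
                conflicts.append((src, name, seen[name], pos))
            else:
                seen[name] = pos
    return conflicts

# ((h, b?) | (b, h?))?  and  (separator?, w)+  from the grammar above
SEP = opt(alt(seq(sym("h"), opt(sym("b"))), seq(sym("b"), opt(sym("h")))))
TAIL = plus(seq(SEP, sym("w")))

first_branch = seq(sym("h"), seq(sym("w"), TAIL))
second_branch = seq(seq(sym("w"), TAIL), opt(sym("h")))

print(upa_conflicts(first_branch))   # []
print(upa_conflicts(second_branch))  # [(6, 'h', 2, 7)]
```

In the second branch the tokens are numbered w1, h2, b3, b4, h5, w6,
h7 from left to right, so the single reported conflict says that
after w6 a hyphen could be either h2 (inside the repetition) or h7
(the trailing optional hyphen).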

 > Can anybody help me reformulating this rule or tell me why this
 > isn't possible without violating the UPA-constraint. If so, I
 > would be very grateful :o)

I thought for a moment that I could solve this by putting
hyphen first in the repetition, and writing something like

   ((h, (b?, w)?) | ((b, h?)?, w))+

but that, of course, also violates UPA: the w following the
hyphen needs to be optional, in case the hyphen is the final
hyphen of the word, and that means a w following an h can match
either of the two w tokens in the content model.

Working with the grammar a bit has made me believe that your
content model is, for purposes of this discussion, analogous to
the chess-game problem: using b and w for black moves and white
moves, write a content model for a chess game.  One obvious
solution is ((w, b)*, w?), but it violates UPA, and if I am
correctly informed so does every regular expression for this
language.  (At least, the chess game problem is often cited as a
well-known case of a regular language without a deterministic
regular expression.)  In your problem, the interaction between
hyphens and blanks complicates things a bit, and the fact that
hyphens are optional between word parts also complicates things,
but when hyphens are used, the pattern is an alternation of
wordpart and separator which can end after either part of the
alternation.
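
For comparison, here is a small sketch of the chess-game language
recognised directly as a two-state automaton rather than as a
content model.  It is meant only to illustrate that the language
itself is easy to recognise deterministically; the difficulty lies
in writing it as a deterministic expression.  The function name
is_chess_game is invented for this example, and moves are assumed
to be given as the strings "w" and "b".

```python
def is_chess_game(moves):
    """Accept exactly the sequences described by ((w, b)*, w?).

    The only state kept is which colour must move next.  Both states
    are accepting, because a game may stop after either colour.
    """
    expected = "w"                   # white moves first
    for move in moves:
        if move != expected:
            return False             # wrong colour, or an unknown token
        expected = "b" if expected == "w" else "w"
    return True

print(is_chess_game(["w", "b", "w"]))  # True
print(is_chess_game(["w", "w"]))       # False
```

At every step exactly one token is acceptable, so no lookahead is
needed; it is the expression ((w, b)*, w?) that forces a choice
between two w tokens, not the language.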

I can see three approaches to your problem in practice:

(1) decide that a word-final hyphen is a special kind of hyphen,
and give it a different element name.  Then your grammar rule
becomes

     ((h, (w, (((h, b?) | (b, h?))?, w)+ ))
   |     ((w, (((h, b?) | (b, h?))?, w)+ ), hf?))

and there is no UPA violation.

(2) define an XSD rule that comes as close as you can to
restricting the data without violating UPA, and use
Schematron to supply the additional check.  Your rule
might be:

     ((h, (w, (((h, b?) | (b, h?))?, w)+ ))
     |    (w,  ((h, b?) | (b, h?))?, w, (h | b | w)*)
     )

and Schematron rules can check that

   each b is followed on the right either by a wordpart
     or by a hyphen and then a wordpart
   each hyphen is followed on the right by
     (a) a wordpart, or (b) a blank and then a
     wordpart, or (c) nothing
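
The two checks above are stated in prose; the sketch below spells
out the same logic in Python over a flat list of child element
names, as one way to test the rules before writing the actual
Schematron assertions.  It is not Schematron, and the function name
check_separators and the single-letter names "w", "h" and "b" are
assumptions carried over from the abbreviated grammar.

```python
def check_separators(children):
    """Return (index, name) for every b or h that breaks the rules.

    Rule for b: followed by a wordpart, or by a hyphen and then a wordpart.
    Rule for h: followed by a wordpart, or by a blank and then a wordpart,
                or by nothing at all (word-final hyphen).
    """
    children = list(children)
    problems = []
    for i, name in enumerate(children):
        rest = children[i + 1:i + 3]     # at most the next two siblings
        if name == "b":
            ok = rest[:1] == ["w"] or rest == ["h", "w"]
        elif name == "h":
            ok = rest == [] or rest[:1] == ["w"] or rest == ["b", "w"]
        else:
            continue                     # wordparts are not constrained here
        if not ok:
            problems.append((i, name))
    return problems

print(check_separators(["w", "h", "w", "h"]))  # []
print(check_separators(["w", "h", "h", "w"]))  # [(1, 'h')]
print(check_separators(["w", "b", "h"]))       # [(1, 'b')]
```

These checks only cover what the looser content model lets through;
the content model itself still enforces the opening wordparts.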

(3) (It kills me to say this) Use Relax NG, which does not have
the UPA rule.

Or speak to your schema vendor about providing a mode of
operation which does not check the UPA rule -- I am reliably
informed that at least one widely deployed schema validator has
such a mode, which is turned on using a switch the vendor tells
you about only when you ask.

And in any case, raise an issue with the XML Schema Working Group
making sure they know that the UPA rule is causing problems for
you.  (Some of my colleagues on the WG are tired of hearing me
tell them this, and will be glad to hear it in a different voice.
I suspect also that some of them don't really believe me, but
they may be more apt to believe a user who is actually paying
someone for schema-aware software.  Paying customers are always
worth listening to.)

Ideally, issues are best raised by entering a bug report into
the Bugzilla bug-tracking system -- instructions are at
http://www.w3.org/XML/2006/01/public-bugzilla (let me know if
anything in them is unclear -- they haven't been well
debugged).  Or if that is too cumbersome, send email to
www-xml-schema-comments@w3.org

Thank you!

--C. M. Sperberg-McQueen
   Staff contact, W3C XML Schema Working Group
Received on Saturday, 16 September 2006 03:38:24 UTC