[Bug 3754] UPA-constraint causing principal problems in document authoring

http://www.w3.org/Bugs/Public/show_bug.cgi?id=3754

           Summary: UPA-constraint causing principal problems in document
                    authoring
           Product: XML Schema
           Version: 1.0/1.1 both
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Structures: XSD Part 1
        AssignedTo: cmsmcq@w3.org
        ReportedBy: mariebilderas@gmail.com
         QAContact: www-xml-schema-comments@w3.org


I have real-life problems with the UPA (Unique Particle Attribution)
constraint. I use W3C XML schemas for document authoring in an intensively
schema-aware XML editing system, and I spend a lot of effort designing the
schemas to support the authoring process.

Let me first say that I am very satisfied with the choice of W3C XML Schema as
schema language: I find the typing and other features of the standard very
useful.

I do computational lexicography at a large Danish publishing house
(Gyldendal Publishers). Gyldendal is the largest, market-leading dictionary
publisher in Denmark. The texts/data we produce are dictionary data.

The distinction between so-called document-oriented and so-called data-oriented
XML doesn’t really fit dictionary entries very well; they can be seen as both
types. Our lexicographers/authors create native XML directly. Among other
schema design goals, I need the schemas to tell the lexicographer exactly which
operations (e.g. element insertion/deletion/renaming) are valid at a given
structural position while he is editing a dictionary entry in the XML
environment. This is why I have a fundamental problem with the UPA constraint.

I found my latest example during my attempt to represent the correct word
division of Danish dictionary lemmas (also known as keywords, headwords, ...).
My grammar looks like this:

( hyphen, ( wordpart, ( ( ( hyphen, blank? ) | ( blank, hyphen? ) )?, wordpart )+
) ) |
( ( wordpart, ( ( ( hyphen, blank? ) | ( blank, hyphen? ) )?, wordpart )+ ),
hyphen? )

This can be restated in prose as:
- A word division of a lemma consists of at least two word parts.
- Between two word parts there may occur one hyphen (at most), one blank (at
  most), both, or neither. If both occur, they can come in either order.
- A lemma may have an initial hyphen (then the part of speech of the lemma is
  "suffix") or a final hyphen (then its POS is "prefix"). The lemma cannot
  have both an initial and a final hyphen. Most lemmas have neither.
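
The grammar and the prose rules above can be sketched as a small Python
check. This is only an illustration, not part of any schema: the one-letter
encoding (w = wordpart, h = hyphen, b = blank) and the names SEP, STRICT and
is_valid are assumptions made for the example.

```python
import re

# Assumed one-letter encoding of the child elements (illustration only):
#   w = <wordpart>, h = <hyphen>, b = <blank>
# Optional separator between two word parts: hyphen, blank, or both in
# either order.
SEP = r"(?:hb?|bh?)?"

# Branch 1: initial hyphen, then at least two word parts.
# Branch 2: at least two word parts, then an optional final hyphen.
STRICT = re.compile(rf"hw(?:{SEP}w)+|w(?:{SEP}w)+h?")


def is_valid(tokens: str) -> bool:
    """Return True if the encoded child sequence matches the grammar."""
    return STRICT.fullmatch(tokens) is not None


if __name__ == "__main__":
    for sample in ["ww", "whw", "wbhw", "hww", "wwh", "hwwh", "w", "whhw"]:
        print(sample, is_valid(sample))
```

A regular-expression engine may backtrack, so it accepts this grammar without
difficulty; the UPA constraint is what rules the same content model out in
W3C XML Schema.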

The <hyphen> and <blank> elements represent orthographic hyphens and blanks in
the lemma. They are NOT the representation of the legal division points.

The second branch of the outermost “or” violates the UPA constraint, because
when a hyphen follows a word part it cannot be determined, without looking
ahead, whether it is a hyphen between two word parts or the final (trailing)
hyphen of the lemma.
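
The ambiguity can be made concrete with a small Python sketch. It is an
illustration only; the one-letter encoding (w = wordpart, h = hyphen,
b = blank) and the names BRANCH2 and last_hyphen_is_final are assumptions
made for the example. Two inputs share the prefix "wwh", yet the hyphen in
that prefix is attributed to different particles depending on what follows.

```python
import re

# Assumed encoding: w = <wordpart>, h = <hyphen>, b = <blank>.
SEP = r"(?:hb?|bh?)?"

# Second branch of the grammar only; the trailing "hyphen?" particle is
# captured in a named group so we can see whether the match used it.
BRANCH2 = re.compile(rf"w(?:{SEP}w)+(?P<final>h)?")


def last_hyphen_is_final(tokens: str):
    """True if the match used the trailing hyphen particle, False if it
    did not, None if the sequence does not match the branch at all."""
    m = BRANCH2.fullmatch(tokens)
    if m is None:
        return None
    return m.group("final") is not None


if __name__ == "__main__":
    # Same prefix "wwh"; the role of its hyphen differs.
    print(last_hyphen_is_final("wwh"))   # hyphen is the final hyphen
    print(last_hyphen_is_final("wwhw"))  # hyphen sits between word parts
```

A processor that reads the children left to right sees the same three
elements in both cases and cannot attribute the hyphen to one particle until
it has seen the next element, or the end of the lemma.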

Michael Sperberg-McQueen and Xan Gregg both proposed that I formulate a less
strict grammar and then run a Schematron process on top of it, after the
initial schema validation has been done. The Schematron process is then
supposed to check the additional rules that could not be expressed in the W3C
XML Schema language because of the UPA constraint. These suggestions were made
on the xmlschema-dev mailing list. Michael also suggested that, alternatively,
I could rename the final hyphen to get rid of the ambiguity (i.e. “ambiguity”
only when seen top-down!)
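
The two-step approach can be sketched in Python as follows. This is a rough
illustration, not the actual proposal: the loose grammar shown here is my own
assumption of what a deterministic relaxation might look like, plain Python
stands in for Schematron, and the names LOOSE and validate are hypothetical.
The encoding is again w = wordpart, h = hyphen, b = blank.

```python
import re

# Step 1 (assumed loose grammar): optional initial hyphen, one word part,
# then any mix of hyphens, blanks and word parts. Every token has exactly
# one particle it can match, so no lookahead is needed.
LOOSE = re.compile(r"h?w[hbw]*")


def validate(tokens: str) -> list:
    """Return a list of rule violations; an empty list means valid."""
    if LOOSE.fullmatch(tokens) is None:
        return ["rejected by the loose grammar"]
    errors = []
    # Step 2: the rules the loose grammar cannot express.
    if tokens.count("w") < 2:
        errors.append("fewer than two word parts")
    trailing = tokens[tokens.rfind("w") + 1:]
    if trailing not in ("", "h"):
        errors.append("only a single final hyphen may follow the last word part")
    if tokens.startswith("h") and trailing == "h":
        errors.append("both an initial and a final hyphen")
    # Runs of hyphens/blanks that sit between two word parts.
    for gap in re.findall(r"(?<=w)[hb]+(?=w)", tokens):
        if gap.count("h") > 1 or gap.count("b") > 1:
            errors.append("more than one hyphen or blank between word parts")
    return errors


if __name__ == "__main__":
    for sample in ["whw", "hwwh", "whhw", "wwb"]:
        print(sample, validate(sample))
```

Note that in this sketch the violations only surface in the second step,
after the loose grammar has already accepted the sequence.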

These are nice and very clever solutions. But I would like to take a polemic
position on this question, and this is why I now raise the issue with the
XML Schema WG.

For (human) document authoring purposes, it is of the greatest importance that
the author feels confident that the underlying schema tells him exactly what
he is allowed to do and what possibilities he has. Running a post-editing
process only to find out that an element you inserted (because the
schema-aware software proposed this very operation!) is actually invalid
would likely weaken your confidence in the schema as a true implementation of
the editorial principles that govern the type of text you work with.

Furthermore, the renaming strategy might seem neat to the designer and the data
consumer (e.g. a data processing engineer). On the other hand, it would blur
the otherwise precise terminology of the grammar. In other words: why claim
that a rose is not a rose is not a rose?

If W3C XML Schema is (also) intended to be used in document authoring
processes, I would like to ask the WG to reconsider the UPA constraint in this
light.

Thank you!
Marie Bilde Rasmussen,
Copenhagen, Denmark

Received on Tuesday, 19 September 2006 21:18:34 UTC