- From: <bugzilla@wiggum.w3.org>
- Date: Tue, 19 Sep 2006 21:18:31 +0000
- To: www-xml-schema-comments@w3.org
- CC:
http://www.w3.org/Bugs/Public/show_bug.cgi?id=3754

Summary: UPA-constraint causing principal problems in document authoring
Product: XML Schema
Version: 1.0/1.1 both
Platform: All
OS/Version: All
Status: NEW
Severity: enhancement
Priority: P2
Component: Structures: XSD Part 1
AssignedTo: cmsmcq@w3.org
ReportedBy: mariebilderas@gmail.com
QAContact: www-xml-schema-comments@w3.org

I have real-life problems with the UPA (Unique Particle Attribution) restriction. I use W3C XML schemas for document authoring in an intensively schema-aware XML editing system, and I put a lot of effort into designing the schemas to support the authoring process.

Let me state at the outset that I am very satisfied with the choice of W3C XML Schema as schema language; I find the typing and other features of the standard very useful.

I do computational lexicography at a large Danish publishing house (Gyldendal Publishers). Gyldendal is the largest, market-leading dictionary publisher in Denmark. The texts/data we produce are dictionary data. The distinction between so-called document-oriented and so-called data-oriented XML doesn’t really fit dictionary entries very well; they can be seen as both types. Our lexicographers/authors create native XML directly.

Among other schema design goals, I need the schemas to tell the lexicographer exactly which operations (e.g. element insertion/deletion/renaming) are valid from a given structural position while he is editing a dictionary entry in the XML environment. This is why I have a fundamental problem with the UPA restriction.

I found my latest example while trying to represent the correct word division of Danish dictionary lemmas (also known as keywords, headwords, ...). My grammar looks like this:

  ( hyphen, ( wordpart, ( ( ( hyphen, blank? ) | ( blank, hyphen? ) )?, wordpart )+ ) )
  |
  ( ( wordpart, ( ( ( hyphen, blank? ) | ( blank, hyphen? ) )?, wordpart )+ ), hyphen?
  )

This can be reformulated in prose as:

- A word division of a lemma consists of at least two word parts.
- Between two word parts there may occur at most one hyphen, at most one blank, both, or neither. If both occur, they can come in either order.
- A lemma may have an initial hyphen (then the part of speech of the lemma is "suffix") or a final hyphen (then its POS is "prefix"). A lemma cannot have both an initial and a final hyphen. Most lemmas have neither.

The <hyphen> and <blank> elements represent orthographic hyphens and blanks in the lemma. They are NOT the representation of the legal division points.

The second branch of the outermost “or” violates the UPA constraint, because when a hyphen follows a word part it cannot be determined, without looking ahead, whether it is a hyphen between two word parts or the final (trailing) hyphen of the lemma.

Michael Sperberg-McQueen and Xan Gregg both proposed, on the xmlschema-dev mailing list, that I formulate a less strict grammar and then run a Schematron process on top of it after the initial schema validation. The Schematron process would then check the additional rules that cannot be expressed in the W3C XML Schema language because of the UPA restriction. Michael also suggested that, alternatively, I could rename the final hyphen to get rid of the ambiguity (i.e. “ambiguity” only when seen top-down!).

These are nice and very clever solutions. But I would like to take a polemical position on this question, and this is why I now raise the issue with the XML Schema WG.

For (human) document authoring purposes, it is of the greatest importance that the author feels confident that the underlying schema tells him exactly what he is allowed to do and what possibilities he has. Running a post-editing process to find out that the insertion you made of some element (because the schema-aware software proposed this very operation to you!)
is actually invalid could weaken your confidence in the schema as a true implementation of the editorial principles that govern the type of text you work with.

Furthermore, the renaming strategy might seem neat to the designer and the data consumer (e.g. a data processing engineer), but it would blur the otherwise precise terminology of the grammar. In other words: why claim that a rose is not a rose is not a rose?

If W3C XML Schema is (also) intended to be used in document authoring processes, I would like to ask the WG to reconsider the UPA restriction in this light.

Thank you!

Marie Bilde Rasmussen, Copenhagen, Denmark
Received on Tuesday, 19 September 2006 21:18:34 UTC