Re: I used XML Schema in a recent project, and encountered a few things that, from Jim Showalter on 2005-08-17 (www-xml-schema-comments@w3.org from July to September 2005)

From: Jim Showalter <jim@jimandlisa.com>
Date: 16 Aug 2005 20:25:08 -0600
To: "C. M. Sperberg-McQueen" <cmsmcq@acm.org>
Cc: "W3C XML Schema Comments list" <www-xml-schema-comments@w3.org>
Message-id: <053701c5a2ad$bb406a50$040a11ac@JIMHOMEPC2>
> On Tue, 2005-07-26 at 07:46, Jim Showalter wrote:
>
>> [I used XML Schema in a recent project and encountered a few things
>> that,] if improved, would make it a lot more powerful.
>
>> My application has a complicated configuration file that has to be
>> carefully checked by the application before proceeding. I wrote an
>> XML Schema for the config file, in order to get XmlValidatingReader
>> to do some of the checking so I wouldn't have to code it
>> myself. Using all of the capabilities of XML Schema that I could, I
>> was able to reduce the amount of code I had to write from 560 lines
>> to 220 lines. But with a few more general capabilities in XML
>> Schema, I could have eliminated all of my checking code.
>
> Many thanks for your comments.  I hope that eventually the Working
> Group as a whole will respond; in the meantime, here are some
> reactions from one WG member speaking only for himself.
>
>> Here is what I found missing:
>
>> 1) I needed a "contiguous" restriction. Yes, I can create sequences,
>> and yes, I can create unique and key restrictions, but there was no
>> way to say that there must not be gaps. For example, I could have a
>> number ranging from 7 to 20, and I could establish a key on that
>> number, which made sure that, say, the number 8 wasn't used twice,
>> but I couldn't enforce that the numbers had to be sequential (7, 8,
>> 9, etc.). It sure would have been useful to be able to say no gaps
>> are allowed.
>
> Hmm.  Sounds like an interesting constraint.  But it's not clear to me
> how best to enable it in a general and declarative way.  So I'll ask
> you in response: can you give a bit more information about the
> application that requires this constraint?  (Column mapping in tables,
> if we're talking about the same application as in 2 -- but is
> it always a constraint that all columns in the input be used?)

The specific problem I am solving has to do with reading in a document 
that contains 1-N tables, where the tables aren't necessarily in the 
same format, and producing from them a single output table. This makes 
it easy for downstream content-production utilities, because they only 
have to deal with a single table. The different formats of tables each 
have a table kind, which is one of the columns of metadata in the 
output table. Other metadata includes a table number, which allows us 
to handle nested tables (we "flatten" the tree of nested tables into 
the single output table), and footnotes referenced from the bodies of 
the tables. Some of the columns in the input tables are not propagated 
to the output tables. Also, some columns might be in common between 
multiple table formats, so instead of replicating those duplicated 
columns, we want to be able to merge them into a single column. So the 
mapping of input columns to output columns obeys these constraints:

1) An input column can be suppressed (not mapped to any output 
column).

2) If an input column is not suppressed, then it must be mapped to an 
output column that is defined.

3) All output columns have to have at least one input column mapped to 
them.

4) The range of output columns must start at 1 and be contiguous (if 
it weren't, then I would not be able to generate the output--I wouldn't 
have a column I'm supposed to have).

I was able to express #1 and #2 in existing XML Schema, but not #3 or #4.
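
As an illustration only, the four constraints above can be sketched as a 
check run after schema validation. This is a Python sketch, not the code 
from the application described here; the function name 
`check_column_mapping` and the representation of the mapping (a dict from 
input column name to output column number, with None meaning 
"suppressed") are assumptions made for the example.

```python
from typing import Dict, List, Optional, Set


def check_column_mapping(
    mapping: Dict[str, Optional[int]],
    defined_outputs: Set[int],
) -> List[str]:
    """Return the violations of constraints 1-4 (empty list if valid).

    mapping: input column name -> output column number, or None if the
             input column is suppressed (constraint 1 allows this).
    defined_outputs: the output column numbers declared in the config.
    """
    errors: List[str] = []

    # Constraint 2: a non-suppressed input must map to a defined output.
    for name, target in mapping.items():
        if target is not None and target not in defined_outputs:
            errors.append(f"input '{name}' maps to undefined output {target}")

    # Constraint 3: every output needs at least one input mapped to it.
    used = {t for t in mapping.values() if t is not None}
    for out in sorted(defined_outputs - used):
        errors.append(f"output {out} has no input mapped to it")

    # Constraint 4: outputs must be exactly 1..N, with no gaps.
    n = len(defined_outputs)
    if defined_outputs != set(range(1, n + 1)):
        errors.append(
            f"output columns must be exactly 1..{n}; got {sorted(defined_outputs)}"
        )

    return errors
```

Constraints 3 and 4 are the two that could not be expressed in the 
schema, so in practice only those two loops would remain as hand-written 
code.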

> Or to go at the problem a different way: is there a general class of
> problems that the contiguity constraint you describe seems to you to
> be a particular instance of?

The example above is one instance of a need for contiguity. My hope is 
for XML Schema to continue to evolve to be more and more powerful, so 
as to cut down on the amount of junk coding that has to be done by 
programmers. Of course, that has to be balanced against putting 
esoteric support in the standard that almost nobody needs. For 
example, I want contiguity, but somebody else might want to be able to 
say that the range should only consist of even numbers, or odd 
numbers, or Fibonacci numbers, or whatever. Coming up with a way to 
express arbitrary range constraints isn't feasible (it becomes a 
programming language). But I think "no gaps" would be useful enough 
that it should be supported (but then, you would expect me to think 
that!).

>> 2) I needed a way to say that, if a number was used, then there had
>> to be a keyref to that number. Why? Because my program is mapping
>> one set of numbers to another (actually, they're columns in
>> tables--I'm mapping input columns to output columns, with no gaps,
>> and I need to make sure that every output column is mapped to by at
>> least one input column). I couldn't have the numbers 1, 2, 3, and 4,
>> and another set of numbers 1, 2, 3, with 4 left out. A general
>> notion of ref counts would be really useful. It could have min and
>> max ref counts, which would allow all kinds of flexible uses. A ref
>> count with a min of 1 and max of 1 would mean that every key must
>> have exactly one keyref to it. A ref count with a min of 1 and no
>> max would mean that every key must have at least one keyref to it
>> (which was the semantics I needed). A ref count with a min of 0
>> would mean that a key didn't have to be referred to, and so forth.
>
> I was about to say you can do this, but realize the method I had in
> mind doesn't quite do the trick. In the special case of wanting
> exactly ONE such keyref, it's possible to enforce the rule by defining
> a new pair of identity constraints, in which the old key and keyref
> become the new keyref and key.

I hadn't heard of that approach. I'll file that away in case it comes 
up. In my particular case, I can map N input columns to 1 output 
column, so this approach won't work, as you pointed out.
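
To make the "exactly one" special case concrete, here is a rough Python 
sketch of what the original keyref plus the inverted key/keyref pair 
would enforce together. The function name and the use of plain integer 
lists for the key and keyref values are assumptions for illustration; a 
real validator works on the selected nodes, not on lists.

```python
from typing import List, Sequence


def check_exactly_one_reference(
    keys: Sequence[int], keyrefs: Sequence[int]
) -> List[str]:
    """Emulate a keyref plus its inverted key/keyref pair."""
    key_set, ref_set = set(keys), set(keyrefs)
    errors: List[str] = []

    # Original keyref: every reference must match an existing key.
    errors += [f"reference {r} matches no key" for r in sorted(ref_set - key_set)]

    # Inverted pair, new key: the reference values must themselves be unique.
    if len(keyrefs) != len(ref_set):
        errors.append("some key is referenced more than once")

    # Inverted pair, new keyref: every key must match some reference value.
    errors += [f"key {k} is never referenced" for k in sorted(key_set - ref_set)]

    return errors
```

The uniqueness check in the middle is the one that rules out the N-to-1 
mapping of input columns to an output column.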

>  But I don't know a good way to require
> at least one reference, except to supply, as you suggest, reference
> count information as part of the PSVI.  That would at least make it
> easier for the app when using a validator that exposes that particular
> property.
>
> Here, too, I wonder whether there is a more general class of
> constraints of which this is one instance.

I don't have a general class of constraint to offer as an example. 
Just my earlier example about mapping input columns to output columns. 
I think ref-count-constraints could be made generally useful, and 
others would find uses for them in their programs as well.
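
A minimal sketch of the min/max ref-count semantics, written in Python 
purely for illustration. XML Schema has no such facility; the function 
name `check_ref_counts` and its parameters `min_refs` and `max_refs` are 
hypothetical, with `max_refs=None` standing for "no max".

```python
from collections import Counter
from typing import Iterable, List, Optional


def check_ref_counts(
    keys: Iterable[int],
    keyrefs: Iterable[int],
    min_refs: int = 0,
    max_refs: Optional[int] = None,
) -> List[str]:
    """Check that each key is referenced between min_refs and max_refs times.

    Dangling references are not reported here; an ordinary keyref
    constraint already covers them.
    """
    counts = Counter(keyrefs)  # missing keys count as 0
    errors: List[str] = []
    for key in keys:
        n = counts[key]
        if n < min_refs:
            errors.append(f"key {key} has {n} reference(s), fewer than {min_refs}")
        elif max_refs is not None and n > max_refs:
            errors.append(f"key {key} has {n} reference(s), more than {max_refs}")
    return errors
```

With `min_refs=1` and no max, this gives the "every output column is 
mapped to by at least one input column" rule; with `min_refs=1, 
max_refs=1` it gives the exactly-one case.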

>> 3) I needed a way to say that the max value for some attribute could
>> not exceed the value of some other attribute. Generalized, it would
>> be really useful to be able to have basic expressions for
>> comparison (equal, not equal, greater than, less than, greater than
>> or equal, less than or equal) of arbitrary fields in the schema.
>
> Requiring that the values of two attributes stand in some defined
> relation to each other is a frequently desired constraint, which in WG
> discussions we label 'co-occurrence constraints'.

I'll remember that term so I can refer to the concept more easily from 
now on.

> Some members of the Working Group were already convinced, during the
> development of XML Schema 1.0, that such constraints were necessary
> and natural, just like table-level CHECK clauses in SQL, which can
> express constraints on the values of two or more columns in each
> record.  By the analogy with SQL, some of us concluded (I did, anyway)
> that the correct way to design such a facility was to use a simple
> query language (in SQL, CHECK clauses use the syntax of the WHERE
> clause), and that we should therefore delay adding such a feature
> until such time as XQuery and XPath 2.0 should be completed.

I agree--it would be natural to do this with a query language.
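
As a rough sketch of what a query-based check looks like in application 
code today, the following Python uses the limited XPath support in the 
standard `xml.etree.ElementTree` module to select elements and compare 
two of their attributes. The element and attribute names (`table`, 
`used`, `limit`) and the function name are hypothetical, and the 
attribute values are assumed to be integers.

```python
import operator
import xml.etree.ElementTree as ET
from typing import List

# Comparison names -> functions, covering the six relations in item 3.
COMPARATORS = {
    "eq": operator.eq, "ne": operator.ne,
    "lt": operator.lt, "le": operator.le,
    "gt": operator.gt, "ge": operator.ge,
}


def check_attribute_relation(
    root: ET.Element, path: str, left: str, op: str, right: str
) -> List[str]:
    """For each element matched by path, require int(@left) <op> int(@right)."""
    compare = COMPARATORS[op]
    errors: List[str] = []
    for elem in root.iterfind(path):
        a, b = elem.get(left), elem.get(right)
        if a is None or b is None:
            continue  # attribute presence is left to the schema itself
        if not compare(int(a), int(b)):
            errors.append(f"<{elem.tag}>: {left}={a} is not {op} {right}={b}")
    return errors


# Hypothetical config fragment: 'used' must not exceed 'limit'.
doc = ET.fromstring(
    '<config><table used="3" limit="5"/><table used="7" limit="5"/></config>'
)
print(check_attribute_relation(doc, ".//table", "used", "le", "limit"))
```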

> In the meantime, of course, others have used XPath 1.0 for such
> purposes, and developed Schematron on that basis.
>
>> 4) It would be really nice to be able to specify error messages for
>> error conditions in the XML Schema. I am currently relying on the
>> error messages from the XML reader, but they tend to be pretty
>> cryptic. Previously I had written my own messages, which were
>> application-specific and quite informative. For example:
>
>>      theLogger.ConfigFileError(
>>          "Specified config file contains output column heading base name '" +
>>          outputHeadingBaseName +
>>          "' with forbidden characters (only a-z, A-Z, and 0-9 are allowed).");
>
>> whereas now my application outputs messages like:
>
>> The 'output-heading-base-name' attribute has an invalid value
>> according to its data type. An error occurred at
>> file:///C:/Documents and Settings/<filename goes here>, (21, 48).
>
>> I would like to be able to hook my messages into the schema to
>> override the default messages.
>
> Hmm.  At first glance, this seems like a question of the interface to
> the validator you are using; at the worst, you ought to be able to
> intercept its diagnostics and substitute your own.  On that line of
> reasoning, there's no need for a change in the schema language, just
> an improvement in software interfaces.

I can hook into the validation event handler in Microsoft's 
XmlValidatingReader to output my own error messages. I guess what I 
was thinking was that it is more natural to define schema-related 
error messages in the schema itself, rather than in the code that 
validates against the schema. The code doing the validation would 
defer to the defined error messages. Default messages would be defined 
in the root schema, and could be overridden locally if desired.
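
One way this could look, sketched in Python under several assumptions: 
the messages are carried in the existing xs:annotation/xs:appinfo 
mechanism, inside a `msg:error` element in a made-up namespace 
(`urn:example:schema-messages`), and the application looks them up by 
attribute name. The `table-kind` attribute and the `message_for` 
function are likewise hypothetical, and no validator is known to 
interpret such annotations on its own.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Prefix map for lookups; the msg namespace is hypothetical.
NS = {
    "xs": "http://www.w3.org/2001/XMLSchema",
    "msg": "urn:example:schema-messages",
}

SCHEMA = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:msg="urn:example:schema-messages">
  <xs:annotation>
    <xs:appinfo><msg:error>The config file is invalid.</msg:error></xs:appinfo>
  </xs:annotation>
  <xs:attribute name="output-heading-base-name" type="xs:string">
    <xs:annotation>
      <xs:appinfo><msg:error>Output column heading base name contains forbidden characters (only a-z, A-Z, and 0-9 are allowed).</msg:error></xs:appinfo>
    </xs:annotation>
  </xs:attribute>
  <xs:attribute name="table-kind" type="xs:string"/>
</xs:schema>
"""


def message_for(schema_root: ET.Element, attribute_name: str) -> Optional[str]:
    """Return the local message for an attribute, else the root default."""
    # Local override: a message attached to the attribute declaration.
    for decl in schema_root.iterfind("xs:attribute", NS):
        if decl.get("name") == attribute_name:
            local = decl.find("xs:annotation/xs:appinfo/msg:error", NS)
            if local is not None:
                return local.text
    # Default: the message attached to the schema element itself.
    default = schema_root.find("xs:annotation/xs:appinfo/msg:error", NS)
    return default.text if default is not None else None


root = ET.fromstring(SCHEMA)
print(message_for(root, "output-heading-base-name"))
print(message_for(root, "table-kind"))
```

The validation event handler would then call something like 
`message_for` instead of printing the reader's own diagnostic.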

> Of course, it might be convenient to have generic diagnostics as
> part of schema annotation, so that violations of particular validity
> constraints could be trapped and associated with a particular
> error message.  Offhand, it seems likely that one might want a
> WHERE clause to say under what circumstances a particular message
> is appropriate, which seems to tie this in with co-occurrence
> constraints.  It's also a potential use case for an explicit
> fallback mechanism analogous to the one in XSLT -- an xsd:fallback
> element in a declaration could be associated with a message supplied
> by the schema author.

This sounds like a nice approach.

> Wearing a vocabulary designer's hat, this looks cool to me; I don't
> know how implementors of schema-aware software will like it.
>
> Thanks again for the comments.
>
> --Michael Sperberg-McQueen, World Wide Web Consortium
>
>
Received on Wednesday, 17 August 2005 02:29:08 UTC