Subject: Re: LC-16 (LC-132): Allow arbitrary order with occurrence > 1

From: C. M. Sperberg-McQueen <cmsmcq@acm.org>
Date: Mon, 16 Oct 2000 08:51:24 -0600
To: "Martin J. Duerst" <duerst@w3.org>
Cc: "Martin Gudgin" <marting@develop.com>, "Schema Comments" <www-xml-schema-comments@w3.org>, "Dan Rupe" <Dan_Rupe@go.com>
Message-Id: <4.3.2.7.1.20001015191543.02002938@espanola.com>

At 2000-10-15 03:34, Martin J. Duerst wrote:

>>A bit vector is one way (I believe a fairly common one) of implementing
>>the and-connector; it is, however, not the only way.
>
>What are the others? Straightforward finite state machines don't
>do the job, as I explained in the message to Henry.

Straightforward finite state machines have the disadvantage that
in large and-groups they grow very rapidly in size.  This does not
mean they cannot be used, or have never been used, in production
systems.  And they certainly do "do the job" in any sense I think
salient here: they calculate the correct answer in finite time.
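
To make the two strategies concrete, here is a minimal sketch in Python,
assuming an and-group in which every member must occur exactly once. The
function names and the representation of the children as a list of element
names are illustrative assumptions, not taken from any particular parser.

```python
def matches_all_group(children, members):
    """Bit-vector check: every member occurs exactly once, in any order."""
    # Assign one bit per member of the and-group.
    bit = {name: 1 << i for i, name in enumerate(members)}
    seen = 0
    for child in children:
        mask = bit.get(child)
        if mask is None or seen & mask:
            return False  # unknown element, or a member seen twice
        seen |= mask
    # Accept only if every member's bit has been set.
    return seen == (1 << len(members)) - 1


def naive_fsm_state_count(n):
    """States in the straightforward automaton for an n-member and-group:
    one state per subset of members already seen (error state not counted)."""
    return 2 ** n


print(matches_all_group(["c", "a", "b"], ["a", "b", "c"]))
print([naive_fsm_state_count(n) for n in (5, 10, 20)])
```

Both approaches decide the same language; the automaton simply needs a
number of states that doubles with each member added to the group, while
the bit vector needs one bit per member.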

>>Could
>>you give a concrete use case for allowing an arbitrary sequence of
>>a, b, c, and d elements where (a) the sequence of the elements is
>>significant,
>
>Did you want to write 'insignificant'? That's what both the current
>all groups and my proposal are about.

I think not.  If the sequence of child elements has no significance,
and they are not all optional, then the order of children might as
well be (and usually should be) fixed.

In a content model like (a,b,c,d,e) there are no inferences to be
drawn from the fact that instance documents have elements in a
particular order.  In a content model like (a & b & c & d & e),
the order of elements in the instance is subject to the control
of the user and may be used to convey information.  If there is no
information to be conveyed, then (a,b,c,d,e) would do as well,
and in most editors somewhat better.  Use cases where
the information conveyed by the sequence in the instance would be
meaningful to the application would be far more persuasive evidence
of the need for an & connector with the qualities being described,
than use cases where the information is not meaningful.
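
The difference between the two content models can be put in numbers. The
following Python sketch, with hypothetical helper names, enumerates the
child sequences each model admits for five distinct elements and reports
how much information the choice of ordering could carry.

```python
from itertools import permutations
from math import log2

members = ["a", "b", "c", "d", "e"]


def sequence_instances(names):
    """(a,b,c,d,e): exactly one admissible order."""
    return [tuple(names)]


def all_group_instances(names):
    """(a & b & c & d & e): every permutation is admissible."""
    return list(permutations(names))


for label, instances in (
    ("sequence", sequence_instances(members)),
    ("and-group", all_group_instances(members)),
):
    # log2 of the number of admissible orders = bits the order can encode
    print(label, len(instances), round(log2(len(instances)), 2))
```

The sequence admits one order and so carries no information in its
ordering; the and-group admits 120 orders, a little under 7 bits that the
author of the instance is free to use or to waste.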

>>(b) each element must occur some distinct number of times
>>(a one to four times, b exactly once, c ten to thirty times, and d
>>exactly three times)?  I have no trouble imagining users who say that
>>is what they want; I am having trouble imagining a case where they
>>are right.
>
>The very general case is probably extremely rare. But the
>'unbounded' case for some of the elements is not that rare.
>This is extremely similar to the other places where occurrence
>indicators are used: 0, 1, and unbounded are the most frequent
>cases, any other actual numbers are quite rare.

If a, b, and c must each happen one or more times, and no significance
is to be attached to their order, then a content model like
(a+, b+, c+) captures all the constraints.  If they must each
occur zero or more times, (a*, b*, c*) or (a|b|c)* captures all
the constraints (the second requires a note saying that the
sequence is not significant).  I have not seen anything to suggest
that (a+ & b+ & c+) fills an actual need.
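
A small Python sketch of the argument, assuming single-letter element
names so that a list of children can be joined into a string and tested
with an ordinary regular expression. The reading of (a+ & b+ & c+) as
"each of a, b, c at least once, freely interleaved" is my assumption about
the proposal under discussion, and all helper names are hypothetical.

```python
import re

SEQ_PLUS = re.compile(r"a+b+c+")     # (a+, b+, c+)
SEQ_STAR = re.compile(r"a*b*c*")     # (a*, b*, c*)
CHOICE_STAR = re.compile(r"[abc]*")  # (a|b|c)*


def matches(model, children):
    """True if the whole sequence of children satisfies the content model."""
    return model.fullmatch("".join(children)) is not None


def all_group_plus(children):
    """Assumed reading of (a+ & b+ & c+): a, b, c each at least once, any order."""
    return set(children) == {"a", "b", "c"}


def canonical(children):
    """Reorder children into the fixed order a..., b..., c... (counts unchanged)."""
    return sorted(children)


doc = list("cabca")
print(all_group_plus(doc))                # accepted by the interleaved model
print(matches(SEQ_PLUS, doc))             # rejected as written
print(matches(SEQ_PLUS, canonical(doc)))  # accepted once put in fixed order
print(matches(CHOICE_STAR, doc))          # accepted, order unconstrained
```

Any sequence the interleaved model accepts can be sorted into one that
(a+, b+, c+) accepts, with the same number of each element; if the order
carries no meaning, nothing is lost in the sorting.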


>In another comment (in the context of character encodings and Unicode),
>you have said that conversion back to legacy systems isn't that
>important because we want things to move on. Do you see a difference
>between non-Unicode systems and non-XML systems in that respect?

No.  The point of the parallelism with the SGML &-connector is not
conversion (although conversion between XML and SGML systems does
occupy a lot of attention in production systems, according to people
I talk to), but preservation of the current relationship between
XML and SGML as far as possible.

>You are arguing here that the increasing difficulty of writing
>the regular expressions corresponds to the increasing rarity and
>undesirability of the patterns.

Well, no.  I am arguing that the pattern you describe as
clumsy and error-prone is neither clumsy nor error-prone.

>... This is just the core of my all group
>proposal: allow people to write things down the way they
>think about it, and let the machines do the rest of the work;
>they are much better at it.

I have a higher opinion of the ingenuity of people than you
seem to:  no matter what formalism is used, there will be
languages people can describe easily and briefly with words
which the formalism either cannot describe or can describe
only with some difficulty.  Seeking to "allow people to write
things down the way they think about it" amounts to asking for
artificial intelligence: the ability to define formal languages
in natural language alone, without formalisms.

Left to one's own devices, one might well wish to leave the
all-group and numeric exponents out of the language, because
they map so poorly to standard grammatical formalisms and
parser-generation techniques.  The WG agreed to allow both,
to support certain fairly simple cases (numeric exponents for
EDI, all-groups for dumping relations), and voted against
those who felt that these changes were the thin end of a wedge
that could eventually destroy the basic conceptual model of
document grammars.

You are doing a good job of persuading me that the alarmists
were right, and that the WG might have done better to take a
firmer grammar-based line.

>But I also very much understand that regular expressions are not
>everybody's speciality, and I think that many people who will want
>to use XML Schema won't be experts in regular expressions, and
>shouldn't have to try to become experts.

Becoming expert in a tool is only important for those who wish to
use the tool well.  One doesn't have to become an expert in
regular expressions to use XML Schema or DTDs -- only to
use them expertly.

Michael Sperberg-McQueen
Received on Monday, 16 October 2000 10:52:53 UTC