Problem with DTDs and schemas losing structure from Hugh Field-Richards on 1999-12-03 (www-xml-schema-comments@w3.org from October to December 1999)

From: Hugh Field-Richards <hsfr@hydra.dra.hmg.gb>
Date: Fri, 3 Dec 1999 15:51:23 +0000
To: www-xml-schema-comments@w3.org
Message-Id: <v02130500b46d9391ab40@[192.5.29.220]>

Hi

I recently posted an enquiry about DTDs and RDF on the RDF
comment newsgroup. I thought I would expand on the problem
that I see with XML DTDs and schemas and put it to
the XML schema newsgroup. I am sorry if
   1. this posting is a little overlong
   2. what I am saying has been solved
   3. in places I am stating the obvious, and
   4. this is not the right newsgroup for the posting

The reason that I posted the original comment was a
preliminary enquiry on how containers (and other similarly
general purpose entities) work within XML. The problem
surfaced originally during my investigation of RDF. In fact
the problem is more fundamental, but the RDF container
is a good focus.

I believe that it is impossible to produce DTDs that define
unambiguous structural content when RDF container tags are
used. This leads to a problem in defining the overall structure of
a set of meta-data: any structural element can be
contained, syntactically correctly, within this type of element,
regardless of whether it is semantically valid to do so.

As a result, with the current XML specifications we can
construct well-formed documents, and we can construct
syntactically valid documents, but we are completely unable
to construct semantically valid (i.e. meaningful) documents
using these constructions.

I would like someone to show how the XML schema approach
can solve this problem. Wherever I say DTD below, I suspect that
XML schema could be substituted. I will stick
with DTD for the moment.

All XML documents (and those of other tagged systems based on SGML)
are inherently tree structured. Every tagged component other than
the root is wholly enclosed by another tag. This is reflected in
the DTD, which specifies in each content description what a tag
may contain, and, I believe, schemas do not help us any more in
this regard.

Each level's content is wholly defined, and has a relevance
in the structure within which it sits. When a tag is used it
has an explicit locality because of its existence within a
single namespace. If we introduce a general purpose structure,
such as <rdf:li>, which is defined outside our local context,
then we have the problem of how does that new structure
associate itself with its enclosing tag?

Note that each list item within the list structure can contain
any further structure; it is this that makes it general purpose.
Any structure that appears within this list item loses the
structural context that the list item itself sits within. In other
words there is no inheritance through the list structure. When
we use this structure in several places it is impossible to
impose any context through the list item on any enclosed
structures. Thus it is impossible to have any explicit locality
by means of the namespace alone.

It is worth putting this another way: consider an array in a common
programming language such as Pascal. An array (an ordered list
of items) is declared and used as follows:

        type
            seq = array[0..100] of integer;

        var
            list : seq;

        list[ 0 ] := 23;
        list[ 1 ] := 56;

We can also say

        type
            seq1 = array[0..100] of char;

        var
            anotherList : seq1;

        anotherList[ 0 ] := 'a';
        anotherList[ 1 ] := 'b';

The key point here is that while we have a common syntactic
construction (type) for making ordered lists, we also have another
mechanism (var) for providing semantic constructions. The question
is how we do this for our original problem.

What we need is an implicit locality by means of position within
the structure. Thus the list structure inherits a context from the
enclosing structure, and the structure below the list would inherit
that context from the list structure.
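
To make the idea of inherited context concrete, here is a minimal
sketch in Python. It is only an illustration under stated
assumptions, not something DTDs or schemas provide: the rule table
ITEM_CONTENT_BY_CONTEXT and the function misplaced_items are
hypothetical, and the element names (tel, email, number, addr) are
the ones used in the DTD example further down. The context is taken
from the nearest enclosing element that is not an RDF container and
is passed down through the list structure.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
LI = f"{{{RDF}}}li"
CONTAINER_TAGS = {f"{{{RDF}}}seq", LI}

# Hypothetical rule table: what a list item may hold, keyed by the nearest
# enclosing element that is not itself an RDF container.
ITEM_CONTENT_BY_CONTEXT = {"tel": {"number"}, "email": {"addr"}}


def misplaced_items(elem, context=None):
    """Return (context, child tag) pairs for list-item content that the
    inherited context does not allow."""
    errors = []
    if elem.tag not in CONTAINER_TAGS:
        # A non-container element establishes the context for what follows.
        context = elem.tag
    for child in elem:
        if elem.tag == LI:
            allowed = ITEM_CONTENT_BY_CONTEXT.get(context, set())
            if child.tag not in allowed:
                errors.append((context, child.tag))
        # Containers pass the inherited context down unchanged.
        errors.extend(misplaced_items(child, context))
    return errors


good = ET.fromstring(
    f'<tel xmlns:rdf="{RDF}">'
    "<rdf:seq><rdf:li><number>1234</number></rdf:li></rdf:seq>"
    "</tel>"
)
bad = ET.fromstring(
    f'<tel xmlns:rdf="{RDF}">'
    "<rdf:seq><rdf:li><addr>hsfr@hydra.dra.hmg.gb</addr></rdf:li></rdf:seq>"
    "</tel>"
)

print(misplaced_items(good))  # []
print(misplaced_items(bad))   # [('tel', 'addr')]
```

The point of the sketch is that the check for a list item needs a
second input, the context, which a content model keyed only by the
element name does not have.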

Finally here is a simple example that I believe illustrates all this.
Consider a simple DTD scrap

        <!ELEMENT rdf:seq ( rdf:li+ ) >
        <!ELEMENT rdf:li ( number | addr ) >    <!-- I could write ANY here -->

        <!ELEMENT person ( tel, email ) >
        <!ELEMENT tel ( rdf:seq ) >
        <!ELEMENT number ( #PCDATA ) >
        <!ELEMENT email ( rdf:seq ) >
        <!ELEMENT addr ( #PCDATA ) >

We would be able to write

        <tel>
                <rdf:seq>
                        <rdf:li>
                                <number>1234</number>
                        </rdf:li>
                        <rdf:li>
                                <addr>hsfr@hydra.dra.hmg.gb</addr>
                        </rdf:li>
                </rdf:seq>
        </tel>

Any validating parser based on the above DTD would mark this as both
well-formed and valid. But it is clearly not what we intended.
The current content model for the RDF container is effectively
allowing any element to appear. As Goldfarb says: "An element type
that has an ANY [or equivalent: hsfr] content specification is
completely unstructured."
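
The effect can be reproduced with a small Python sketch. It is a
deliberate simplification and not a DTD validator: the table
ALLOWED_CHILDREN and the function context_free_ok are hypothetical
names, and the check looks only at which child names each element
may contain, ignoring order and cardinality. What it shares with the
DTD is that the rule for rdf:li is looked up by element name alone,
wherever the element appears.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SEQ = f"{{{RDF}}}seq"
LI = f"{{{RDF}}}li"

# Context-free rules in the spirit of the DTD scrap: each element name maps
# to the set of child names it may contain, regardless of where it sits.
ALLOWED_CHILDREN = {
    "tel": {SEQ},
    "email": {SEQ},
    SEQ: {LI},
    LI: {"number", "addr"},
    "number": set(),
    "addr": set(),
}


def context_free_ok(elem):
    """Accept an element if its name is known and every child is permitted
    by the rule for that name, applied recursively."""
    allowed = ALLOWED_CHILDREN.get(elem.tag)
    if allowed is None:
        return False
    return all(
        child.tag in allowed and context_free_ok(child) for child in elem
    )


# An e-mail address filed under <tel>: not what was intended.
doc = ET.fromstring(
    f'<tel xmlns:rdf="{RDF}">'
    "<rdf:seq><rdf:li><addr>hsfr@hydra.dra.hmg.gb</addr></rdf:li></rdf:seq>"
    "</tel>"
)

print(context_free_ok(doc))  # True
```

Because the rule for rdf:li cannot see that it is inside tel, the
misplaced addr is accepted.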

I would appreciate your comments on this, and on where I have gone
wrong. Not being able to have a semantically valid document
is a problem for us. Unless it is solved I believe it
will be difficult to build any form of context-sensitive editor,
which is very important to us when meta-data is entered by
unskilled personnel.

TIA

Hugh F-R

-------------------------------------------------------------------
Dr Hugh S. Field-Richards
Defence Evaluation and Research Agency,
St Andrew's Road, Malvern, Worcs, WR14 3PS, UK
Tel: ++1684 895075   Fax: ++1684 896113  Email: hsfr@hydra.dra.hmg.gb

The views expressed above are entirely those of the writer and do not
represent the views, policy or understanding of any other person or
official body.
-------------------------------------------------------------------
Received on Monday, 6 December 1999 05:33:34 UTC