Re: Spans, and an RFI from Peter Murray-Rust on 1997-04-06 (w3c-sgml-wg@w3.org from April 1997)

From: Peter Murray-Rust <Peter@ursus.demon.co.uk>
Date: Sun, 06 Apr 1997 17:37:06 GMT
To: w3c-sgml-wg@w3.org
Message-Id: <5450@ursus.demon.co.uk>

I discovered with pleasure that a new version of XML-LINK has been
mounted - I hope it's OK to discuss it, even tho' not announced?  And I 
assume that this is the version that's being printed up for the W3C.
I think it looks good and any comments below are minor criticisms.

[As a RL outsider, I'd like to congratulate the ERB on the work that 
they have put in and the speed and harmony with which the drafts have 
been created.  By comparison the chemical community has spent 10 years
discussing the name of element 104 without resolution.]


In message <2.2.32.19970402201845.00caf3ec@pop> "Steven J. DeRose" writes:
> At 10:47 AM 03/11/97 -0400, you wrote:
> 
> >Request for information: what TEI xptr implementations are there?  And
> >what do they implement?  I'm going to be *very* reluctant to
> >vote for anything, no matter how cool & peachy-keen, that nobody
> >is actually using.

I have coded most of the XML-LINK draft on TEI pointers (I need another
hour for -ve numbers in PRECEDING).  I notice that the draft has 
substantially changed (no complaints, I've read the pre-amble:-).  

Thanks for changing the tables in 3.2, 3.3, etc.  Much clearer now.

5.2 list item 3 'XML ID attribute'.  This term is only meaningful for
a document which is valid or at least has an ATTLIST for the given
element.  There is a (natural) assumption in this section that TEI
searches will be extended primarily to valid documents, but I'd like to
argue for WF documents as well.  I think it's quite likely that TEI
searches will be used on fragments because they are a very powerful
method of rationalising partially structured data (indeed I am including
the approach through my code).  The point is that *people
who don't know anything about SGML* will have no idea of the special
significance of ID.  They may well create documents with attributes named
IDs which do not fulfil the uniqueness criterion (or the naming 
conventions).  Since XML-LINK (but not XML-LANG) puts special emphasis
on the attribute *name* ID as well as its type, this should be highlighted
as a reserved word in XML-LANG.  [This is not a trivial point - if a 
data object is referred to as an ID, then it's natural for a beginner to 
use that as an attribute name].
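
To make the point concrete, here is a minimal sketch (Python, standard
library only) of what an application would have to check for itself when
a well-formed document carries no DTD.  The function name, the sample
document and the simplified ASCII-only name pattern are all hypothetical
illustrations and are not taken from the draft:

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Simplified, ASCII-only approximation of an XML name (an assumption made
# for this sketch; the real Name production allows many more characters).
NAME_RE = re.compile(r"^[A-Za-z_:][A-Za-z0-9._:-]*$")


def check_id_attributes(xml_text, attr_name="ID"):
    """Return (duplicates, bad_names) for attributes called attr_name.

    Without an ATTLIST the parser sees only an ordinary attribute that
    happens to be named ID, so uniqueness and name syntax are unchecked.
    """
    root = ET.fromstring(xml_text)
    values = [el.get(attr_name) for el in root.iter()
              if el.get(attr_name) is not None]
    counts = Counter(values)
    duplicates = sorted(v for v, n in counts.items() if n > 1)
    bad_names = sorted(v for v in counts if not NAME_RE.match(v))
    return duplicates, bad_names


# Hypothetical well-formed fragment written by someone unaware of ID rules.
SAMPLE = """<MOLECULES>
  <MOL ID="m1">water</MOL>
  <MOL ID="m1">methane</MOL>
  <MOL ID="104">element 104</MOL>
</MOLECULES>"""

if __name__ == "__main__":
    duplicates, bad_names = check_id_attributes(SAMPLE)
    print("duplicate values:", duplicates)
    print("values that are not names:", bad_names)
```

The fragment parses without complaint as a well-formed document, yet a
pointer that selects by ID would find one value used twice and one value
that is not a legal name.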


There are areas where the draft is unclear to someone who doesn't
come from the extended TEI community and since the idea is that the 
draft is self contained, here are some:

5.3 Spans, etc.  I suspect there are some well-known semantics to this
word which are unknown to me and not in the draft.   list item 2 refers
to the TEI 'FROM' and 'TO' attributes, but I don't have a pointer to 
the TEI spec on this.  This should be made more self-contained.

5.3.1 I am confused here.  The result of evaluating a location term is
always an element (i.e. it is either a single element or a properly nested
tree of elements).  However, for the ".." operator the result is "all of
the text" from the first location (or the start of the element) to the end
of the [text] or the element selected by the second series.

For *one* location term I think this is clear - either an element with
GI or a (complete) chunk of *CDATA (but not including other elements).
For *two* terms, there seems to be no requirement that they start and
end at the same level of the hierarchy.  Thus in the example in 5.3.3
("a sentence (A) with no embedded child elements", etc.) any contiguous
subset (?span) makes sense.  But what if B has children and the second
location term stops 'somewhere in the middle of B'?  Is this allowed?

I appreciate that this makes most sense if the document is viewed as an 
event stream and that any span represents a 'chunk of marked up text'.
But if the document has a complex structure then starting at one point
in the tree and ending at a higher or lower level may be nonsense.
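
The event-stream reading can be sketched as follows (Python, standard
library only).  This is only an illustration of the question being asked,
not the draft's definition of a span; the helper names, the event tuples
and the sample document are assumptions made for the sketch:

```python
import xml.etree.ElementTree as ET


def events(xml_text):
    """Flatten a document into ('start', tag), ('text', s), ('end', tag)."""
    out = []

    def walk(el):
        out.append(("start", el.tag))
        if el.text:
            out.append(("text", el.text))
        for child in el:
            walk(child)
            if child.tail:
                out.append(("text", child.tail))
        out.append(("end", el.tag))

    walk(ET.fromstring(xml_text))
    return out


def span_is_balanced(span):
    """True if every element opened in the span is also closed in it."""
    depth = 0
    for kind, _ in span:
        if kind == "start":
            depth += 1
        elif kind == "end":
            depth -= 1
            if depth < 0:      # closes an element opened before the span
                return False
    return depth == 0          # nothing left open at the end of the span


# Hypothetical document: A has no children, B has a child C.
DOC = "<P><A>one</A><B>two <C>three</C> four</B></P>"

if __name__ == "__main__":
    ev = events(DOC)
    # From the start of A to the end of B: same level, a sensible chunk.
    print(span_is_balanced(ev[1:11]))
    # From the start of A to the end of C, i.e. the middle of B.
    print(span_is_balanced(ev[1:9]))
```

The first span is a run of complete sibling elements.  The second stops
inside B and leaves it open, which is exactly the case where it is not
obvious what element, or tree of elements, the span is meant to denote.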

I am also not quite clear what the word 'text' means.  Is it synonymous 
with *CDATA?  [I would prefer a term that made it clear that numbers,
etc. were allowed here].  The example uses the phrase 'third span of
character data', for example.

I notice that PATTERN and FOREIGN have disappeared.  Presumably there is
no bar to applications using them - the constraint is that publicly 
visible XML addresses don't contain them, I assume.  Since I believe 
that almost everyone is going to want applications to carry out 
PATTERN-based searches, it would be useful to have a generally agreed 
convention for the syntax (even if the regexes were different).

	P.



-- 
Peter Murray-Rust, domestic net connection
Virtual School of Molecular Sciences
http://www.vsms.nottingham.ac.uk/
Received on Sunday, 6 April 1997 13:17:44 UTC