- From: Michael Sperberg-McQueen <U35395@UICVM.UIC.EDU>
- Date: Tue, 10 Jun 97 19:00:56 CDT
- To: W3C SGML Working Group <w3c-sgml-wg@w3.org>

Some time ago, James Clark asked why the full array of tree-walking keywords from the TEI extended-pointer notation was needed for XML linking, and I was asked by the ERB to provide some account of why I thought they were desirable.

I should note first that strictly speaking the 'functionality' of extended pointers -- pointing at SGML elements or spans within SGML documents -- is completely covered by the DESCENDANT keyword, with a single numeric parameter. DESCENDANT 42878 refers to the 42,878th element in the document (the one with the 42,878th start-tag), and if we limit ourselves to pointing at elements or pointing at spans that begin and end with elements, everything else is syntactic sugar. Byte offset is also sufficient, as far as that goes -- in fact, byte offset is more powerful than DESCENDANT because it handles sub-element addressing.

So no argument for *any* of the keywords can show that any of the keywords adds new functionality; the TEI keywords that do, namely PATTERN, TOKEN, STR, SPACE, FOREIGN, have all already been omitted.

On the other hand, no one is arguing for byte offset. Why? It's (a) fragile, (b) opaque, and (c) given to silent failures. We want something better than byte offset or DESCENDANT with one argument (which suffers from the same problems), because we want pointers that

 - are easy to understand / comprehensible / expressive
 - survive changes to the target document as well as possible
 - show signs of failure when they fail

All of the keywords in question (ID, CHILD, DESCENDANT, ANCESTOR, PREVIOUS, NEXT, PRECEDING, FOLLOWING) serve substantially these same purposes.

When using natural language to point at texts, we routinely use typed and untyped counts: the third paragraph of the second chapter, or page 232.
Allowing restriction of the tree-traversal keywords by means of GI and attribute-value pairs as well as counting means that the pointer is clearer (the second quotation in German, or the second quotation after the beginning of page 232, is clearer than "the 82305th sentence of the text") and more robust: a byte offset is broken if even one byte is added or deleted before the target; a pointer to the second quotation after the beginning of page 232 is broken only if a Q element is added between the page boundary and the current second Q -- if the pointer restricts by language as well, only a new quotation in German will break the pointer.

No pointer is guaranteed safe forever when it's pointing at a changing document. But restriction by GI and attribute value, and being able to traverse the tree in any desired direction, from any point chosen for its relative stability (page 232 is *not* a good choice in this regard!), provides much more robust pointers than counters without facilities for such restrictions. Counters by themselves are hopelessly fragile.

Why so many ways to walk the tree, though? Well, primarily because in general the best approach to pointing into changing read-only material appears to be to point at some stable point and work from there. The stable point may be the end of a section, or the beginning; it may be an element with an ID. There is no good way to guarantee that the fixed point will always be above the target resource in the tree, so CHILD and DESCENDANT aren't enough; you need ANCESTOR. It's not plausible to assume that the fixed point is always to the left of the target -- sometimes it will be to the right, so if you have NEXT, you need PREVIOUS.

Some of the keywords are conservative (CHILD, NEXT, PREVIOUS) and are useful for catching typos (a pointer to NEXT(82) is an error if there aren't 82 younger siblings of the current location source); I suspect that for stable documents it might be plausible to work exclusively with these and ANCESTOR.

But if I want to point at the material on page 23 in a transcription of Lachmann's edition which is currently being prepared, the most reliable way to do it is to say "find the first <PB N='23' ED='La'> and go till you reach the next <PB ED='La'>" -- it's more convenient, as well as more robust, not to have to know exactly how many generations down in the tree these milestone elements are.

A project I am working on now is preparing documents in a data capture DTD and will translate them automatically into an archival DTD later; at that time the nesting level of virtually everything in the documents will change, since the whole point of the data capture DTD is to have fewer levels and exploit some regularities in the data. I think it's a strong argument *for* the TEI keywords that

  <xptr xml-link='simple' href="descendant(1,pb,n,23,ed,La)..ditto()following(1,pb,ed,La)"/>

will survive that transition, whereas the obvious choice using a smaller set of keywords

  <xptr xml-link="simple" href="child(2)(1)(1)(3)..root()child(2)(1)(2)(2)"/>

is almost certain to break; if we add GIs to the parameters we will hear about the breakage, but it will still break.

To make extended pointers robust, you need keywords that are *not* limited to single-generation jumps or sibling searches: you need DESCENDANT, PRECEDING, and FOLLOWING.

(Note to those who don't want to have to look this up: DESCENDANT finds elements at any level of containment; PRECEDING and FOLLOWING search left and right across the entire tree -- unlike NEXT and PREVIOUS they are *not* limited to the siblings of their location source.
Yes, I know, there's no way to remember which is PREVIOUS and which is PRECEDING -- better names would be welcome.)

Both functions are needed: PREVIOUS for the case of pointing into a relatively stable document where you *want* the more restricted method of counting, to trap errors, and PRECEDING for being able to cross container boundaries.

I had hoped to produce more concrete examples, but I am running late and have to curtail this message. So I'll leave it with the page-reference example: find, at any depth, the first PB with N=23 and ED=La, then take everything up to the next PB from that edition. Note that the next PB is not guaranteed to have N=24: page 24 could easily be blank and not marked with a PB. How do you do this reliably in a target document subject to change without DESCENDANT, DITTO, and FOLLOWING?

I hope this helps clarify why I want a full set of tree-traversal keywords. Maybe James is right, and this is sort of a poor-folk's query language. If so, I suspect it's smaller and easier to implement than any of the Rich Folk's query languages, and to a really surprising degree it does get the job done.

-C. M. Sperberg-McQueen
 ACH / ACL / ALLC Text Encoding Initiative
 University of Illinois at Chicago
 tei@uic.edu

All opinions expressed in this note (except those I have quoted from other authors) are mine. They are not necessarily those of the Text Encoding Initiative, its executive committee or other participants, its sponsors, or its funders. Ditto for the Model Editions Partnership and the University of Illinois, and the XML Editorial Review Board. Anyone who says otherwise is itching for a fight.

Received on Tuesday, 10 June 1997 20:50:21 UTC