- From: Steven J. DeRose <sjd@ebt.com>
- Date: Tue, 14 Mar 1995 15:48:15 -0500
- To: www-html@www10.w3.org
At 1:35 AM 3/14/95 +0500, Joe English wrote:
>This is not much of an issue for HTML documents on the Web,
>since they tend to be small and are rendered as a single unit
>anyway. It's not like a browser is going to display the book of
>Leviticus and have to worry about a marked region starting in Exodus
>and ending in Deuteronomy.

On the contrary, that is *exactly* the problem. I do have Leviticus on a web site, and although my server is kind enough to break it into net-size chunks if/when asked, I sure do have to know whether there is some long-distance thing in effect, otherwise we can't know to send whatever start-tag caused it when sending a smaller piece.

>> Likewise, one cannot easily build a stack-based
>> formatter, e.g. that keys styles off the list of element types in one's
>> ancestry.
>
>This is only partly true, and irrelevant besides.
>If the browser is going to include this functionality --
>highlighting regions that may cross element boundaries --
>it can't use ancestor-driven style resolution in any case,
>regardless of how the regions are identified.

Your critique is incorrect. Existence proof: open a dynatext book, since dynatext does in fact use "ancestor-driven style resolution" for SGML. It quite happily supports "highlighting regions that may cross element boundaries" -- just do a drag-select or a phrase search and watch.

One reason the point Dave cites is relevant, is that highlighting can reasonably be construed as a different animal from style resolution. In actual practice, this has many advantages.

>As far as efficiency goes, the Tk text widget is quite efficient,
>and it doesn't use any hierarchical information at all;
>all formatting attributes are specified with discontiguous,
>potentially overlapping tagged regions.
>
>And lastly, you *can* use a single-pass parser with a stack-based
>formatter to keep track of marked spans.

Precisely my point: you must do O(n), not O(lg n). Is that not unfortunate?
If you only want to solve tiny cases, of course it doesn't matter how you do it. But if you want a system that will last, you have to think more about scalability. If Tk isn't using any hierarchical information at all, then its format control is a lot more limited than it need be.

>> An editor is in even worse shape. There is no way to validate
>> that such pairs even match, because "matching" is not a generic notion --
>> it has to be custom-built for each kind of pair.
>
>Any SGML parser can do the ID/IDREF validation, and HyTime reftype
>constraints can do (most of) the rest, if it's that important.

Sorry, but ID/IDREF can't do this. SGML cannot validate that your empty elements come in pairs. You can make one end have an ID and one an IDREF, but SGML cannot guarantee there is any particular *number* of IDREFs that point to an ID (in this case, that number would have to be exactly 1). Nor can SGML guarantee that the start of a span precedes the end. HyTime reftypes don't help for this case -- all they do is let you guarantee the element type of the IDREF's target (and actually any element whose content satisfies the content model for the specified GI is accepted, even if the GI is different).

This is really important, since without such checking it is very hard for authors to balance their markers. This is especially true when they start cutting and pasting, since it is easy to pick up a scope that contains one such end, and move it unknowingly with bizarre effects.

Out-of-line markers are pretty easy. I'll use Xanadu tumbler notation for the treewalk instead of HyTime, for brevity, but you get the same effect with treeloc followed by a leaf-level dataloc in HyTime, or via the corresponding TEI structure:

   <marked from="1.5.2.4" to="1.5.3.1">

Could hardly be simpler. Permit the first component of the address to be an ID, and you've got a pretty robust system.
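
To make the out-of-line idea concrete, here is a minimal Python sketch of how a client might resolve dotted tree addresses such as "1.5.2.4" and check that a span's start does not follow its end. It is an illustration only: the `Node` class and the `parse_address`, `resolve` and `span_is_ordered` helpers are hypothetical names, and it assumes that the first component "1" selects the root and that every later component is a 1-based child number. The message above does not spell out those details.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """Hypothetical parsed-document node: a name plus ordered children."""
    name: str
    children: List["Node"] = field(default_factory=list)


def parse_address(addr: str) -> Tuple[int, ...]:
    """Turn a dotted address like '1.5.2.4' into a tuple of integers."""
    return tuple(int(part) for part in addr.split("."))


def resolve(root: Node, addr: str) -> Node:
    """Walk from the root, one child number per address component.

    Assumption: the leading component must be 1 (the root itself) and
    the remaining components are 1-based child positions.
    """
    path = parse_address(addr)
    if not path or path[0] != 1:
        raise ValueError("address must start at the root (component 1)")
    node = root
    for child_number in path[1:]:
        if not 1 <= child_number <= len(node.children):
            raise IndexError("no child %d under %s" % (child_number, node.name))
        node = node.children[child_number - 1]
    return node


def span_is_ordered(start: str, end: str) -> bool:
    """True if 'start' does not come after 'end' in document order.

    Tuple comparison is lexicographic, and a prefix sorts before its
    extensions, which matches pre-order (document) order for tree paths.
    """
    return parse_address(start) <= parse_address(end)
```

Because both ends of the span live in one attribute pair, the "does the start precede the end" check becomes a simple comparison, which is exactly the kind of validation that paired empty elements do not give you for free.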
The advantages of out-of-line specs include:

* For cases like search-hit highlighting, you don't have to change the document content itself -- the meta-information stays separate. For example, a client could choose to discard them during 'save'.
* You can point even into things that can't contain IDs, such as CDATA elements.
* You can use the same mechanism to point into, out of, and between not just sgml or html docs, but graphics and other media.
* You can even annotate or otherwise link read-only data.

A note received after I started this reply asked if anyone knew what the HyTime syntax is for tree-path locators, byte offsets from tree locations, etc. It's easy, though one must be *very* careful about defining byte offsets whether in HyTime or anything else -- it is not at all easy to count across element bounds, because it introduces an interaction between the parser (which actually knows offsets of things in the source), and the higher-level application (which I hope only knows about the structures the parser found!).

But at any rate, one method of doing this in HyTime is to chain a nameloc that points to some element with an ID, to a treeloc that walks down a level at a time by child number, to a dataloc that expresses the byte offset into the leaf of choice. For example:

   <nameloc id=n1 nametype=element> <nmlist> sec37 </></>
   <treeloc id=t1 locsrc=n1> 1 6 3</>
   <dataloc id=d1 locsrc=t1> 10 5</>

This 'location ladder' would be referenced via ID "d1", and it points to 5 characters beginning at the 10th character of the 3rd child of the 6th child of the subtree with ID sec37. The initial "1" in the treeloc must be there in case ID sec37 points (possibly indirectly) to a forest of nodes, not a tree: in that case you'd use it to specify which tree of the forest. For typical cases just put in "1" and don't worry.
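
The three rungs of the ladder can be sketched in Python as follows. This is not a HyTime engine, just an illustration of the resolution order described above; the `Node` class, the `ids` table and the `resolve_ladder` function are hypothetical, and the sketch assumes the final node is a leaf whose character data is held in `text` and that all positions are 1-based.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """Hypothetical node: leaves carry text, inner nodes carry children."""
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def resolve_ladder(ids: Dict[str, List[Node]], name: str,
                   tree_path: List[int], start: int, length: int) -> str:
    """Resolve a nameloc -> treeloc -> dataloc chain to a string.

    ids       -- maps an ID to the forest (list of trees) it designates
    name      -- the ID named in the nameloc, e.g. 'sec37'
    tree_path -- treeloc numbers, e.g. [1, 6, 3]; the first picks a tree
                 of the forest, the rest are 1-based child positions
    start     -- 1-based position of the first character (dataloc)
    length    -- number of characters to take (dataloc)
    """
    forest = ids[name]                       # rung 1: nameloc
    node = forest[tree_path[0] - 1]          # rung 2: pick the tree ...
    for child_number in tree_path[1:]:       # ... then walk down by child
        node = node.children[child_number - 1]
    begin = start - 1                        # rung 3: dataloc into the leaf
    return node.text[begin:begin + length]


# Usage: a section whose 6th child has a 3rd child holding the alphabet.
leaf = Node(text="abcdefghijklmnopqrstuvwxyz")
sixth = Node(children=[Node(), Node(), leaf])
sec37 = Node(children=[Node(), Node(), Node(), Node(), Node(), sixth])
print(resolve_ladder({"sec37": [sec37]}, "sec37", [1, 6, 3], 10, 5))  # jklmn
```

Note that the sketch sidesteps the hard part flagged above: it counts characters inside a single leaf only, so the question of how to count across element boundaries never arises.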
The Text Encoding Initiative Guidelines (available online at various sites) give another syntax worth considering, described with full BNF and semantic definitions in section 14.2. The equivalent ladder would be:

   <xptr target="ID (sec37) CHILD (6) (3) STR (10 14)">

Both of these syntaxes are highly powerful and flexible, and are already formally standardized, proven, and available for adoption. Let's just use one.

<shameless-plug>
Both syntaxes are discussed in much more detail in Steven J. DeRose and David G. Durand: *Making HyperMedia Work: A User's Guide to HyTime*. Boston: Kluwer Academic Publishers, 1994. ISBN 0-7923-9432-1.
</shameless-plug>

Steve
Received on Tuesday, 14 March 1995 15:36:20 UTC