Stupid NET Tricks

I'm interested in what value there is in requiring un-minimized
end tags when OMITTAG is not allowed.  Requiring redundant
information, and letting humans maintain it, will guarantee
errors.

Someone has suggested that it makes XML processors a little easier
to write since they won't need a stack.  I don't see what kind of
useful processing you can do with structured data while not
keeping track of its context in the structure.  You'll want a
stack with or without named end tags.
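
A minimal sketch of that point in Python, under stated assumptions:
the function name `contexts` and the simplified tag pattern are
hypothetical, chosen only for illustration, and nothing here is taken
from an actual XML processor.  It reports each run of character data
together with its element path.  The stack is what supplies the path;
the name in an end tag is only ever compared against the top of it.

```python
import re

# Simplified tag pattern for illustration only: no comments, no marked
# sections, and no ">" inside attribute values.
TAG = re.compile(r"<(/?)([A-Za-z][\w.-]*)[^>]*>")

def contexts(text):
    """Return (element path, data) pairs for each non-blank run of data."""
    stack, out, pos = [], [], 0
    for m in TAG.finditer(text):
        data = text[pos:m.start()].strip()
        if data:
            out.append(("/".join(stack), data))
        pos = m.end()
        is_end, gi = m.group(1), m.group(2)
        if not is_end:
            stack.append(gi)      # start tag: enter the element
        elif stack and stack[-1] == gi:
            stack.pop()           # end tag: the name tells us nothing new
        else:
            raise ValueError("end tag </%s> does not match open element" % gi)
    tail = text[pos:].strip()
    if tail:
        out.append(("/".join(stack), tail))
    return out
```

The only thing the end-tag name buys here is the error branch: it
lets the processor notice that the redundant information has gone
stale.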

I decided to play around with NET to see if I could come up with
an SGML tagging scheme which eliminates named end tags while (in
my opinion) improving legibility and programmer-friendliness.  I
apologize in advance to anyone whose sensibilities are offended by
this kludgery.

We can do this by making some adjustments to the concrete syntax.
Define NET to be the right-hand member of a commonly recognized
character pair, for instance "]", ">", "}" or ")".
We'll define some other delimiters while we're at it, taking care
to match the number of left and right hand characters in the
delimiter strings.  Since we're all learning Scheme these days,
I'll use parentheses in this example. SHORTREF must be disabled,
along with OMITTAG and CONCUR.
 
   NET   ")"
   STAGO "(("
   ETAGO "(/"
   TAGC  ")/)"    <!-- needs two close parens, but must be
                       lexically distinct from two NETs -->

This results in a syntax for which any text editor with paren
matching capability can be used to assist in navigating an
instance.  Some of them, e.g. Emacs lisp-mode, can be used for
pretty-printing as well.  My feeling is that it also helps make
the element structure lexically evident:

 ((gi attr="val") Here is the content.)

 ((again) We can still use un-minimized end tags. (/again)/)

 
    [ Thanks and apologies to Arjun Ray for this idea. ]

Here's a common tag-souper's mistake:
 <p> foo <bold> stuff <ital> more stuff</bold> what's this?</ital></p>

Compared to:
 ((p) foo ((bold) stuff ((ital) more stuff) what's this?))
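
To see why the overlap cannot even be written down in the paren
syntax, here is a rough Python sketch that translates it back to
reference-syntax tags.  The function name `to_reference_syntax` is
hypothetical, the delimiters are the ones defined above, and the
sketch makes two simplifying assumptions: every ")" in content is
treated as NET, and any close delimiter ends the innermost open
element.

```python
import re

# Assumed delimiters: STAGO "((", NET ")", ETAGO "(/", TAGC ")/)".
TOKEN = re.compile(
    r'\(\((?P<gi>[A-Za-z][\w.-]*)(?P<attrs>(?:\s+[\w.-]+="[^"]*")*)\s*'
    r'(?:\)/\)|\))'                             # start tag, TAGC or NET
    r'|\(/(?P<egi>[A-Za-z][\w.-]*)\s*\)/\)'     # un-minimized end tag
    r'|(?P<net>\))'                             # NET in content
)

def to_reference_syntax(text):
    """Translate the paren syntax to <gi>...</gi>, using a stack of open GIs."""
    out, stack, pos = [], [], 0
    for m in TOKEN.finditer(text):
        out.append(text[pos:m.start()])         # character data passes through
        pos = m.end()
        if m.group("gi"):
            stack.append(m.group("gi"))
            out.append("<" + m.group("gi") + m.group("attrs") + ">")
            continue
        if not stack:
            raise ValueError("close delimiter with no open element")
        gi = stack.pop()                        # NET names nothing; the stack does
        if m.group("egi") and m.group("egi") != gi:
            raise ValueError("end tag (/%s does not match %s" % (m.group("egi"), gi))
        out.append("</" + gi + ">")
    out.append(text[pos:])
    if stack:
        raise ValueError("unclosed element: " + stack[-1])
    return "".join(out)
```

Fed the paren version of the tag-soup line, the sketch can only
produce properly nested output, with "what's this?" landing inside
bold rather than in a dangling ital.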
 

An interesting by-product of this delimiter scheme is an alternate
version of Huitfeldt's and DeRose's Trapeze Act solution to
recognizing EMPTY elements in the absence of a DTD.  We move the
requirement to use a NET-enabling start tag from EMPTY elements to
all non-empty elements.  That way, NET can be used to terminate
every element.  This has the same advantages and disadvantages as
the Trapeze Act.  I believe it also has the same effect as Charles
Goldfarb's proposed [ S | E ]TAGC revision to 8879.

Here is the example from Michael Sperberg-McQueen's summary of the
EMPTY element problem. "blort" has declared content EMPTY.

 ((p) foo ((blort)/) bar )  
    <!-- use of TAGC tells XML processor that blort is empty --> 
 
 ((p) foo ((noblort) bar ))  
                  <!-- <p> foo <noblort> bar </noblort></p> --> 
 
 ((p) foo ((noblort)) bar )  
                  <!-- <p> foo <noblort></noblort> bar </p> -->

 ((p) foo ((blort) bar))
        <!-- This should be an error.
             It tries to put data in an element declared EMPTY
             <p> foo <blort> bar </blort></p>  -->

 ((p) foo ((blort)) bar) 
                     <!-- So should this -->

 ((p) foo ((noblort)/) bar)
         <!-- Hmmm, XML would think this is empty, while an
              un-XML-fettered SGML parser would think " bar " is
              noblort's content.  An XML validator with a DTD
              should report this as an error  -->
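
The cases above boil down to comparing the delimiter that closes each
start tag with what the DTD declares.  A small Python sketch of that
check follows; the function name `check_empty` and the
`declared_empty` set (standing in for the DTD) are hypothetical, and
the sketch assumes the delimiters defined earlier.

```python
import re

# A start tag is STAGO "((" + GI + optional attributes, closed by either
# TAGC ")/)" (element is empty) or NET ")" (element has content).
START = re.compile(
    r'\(\((?P<gi>[A-Za-z][\w.-]*)(?:\s+[\w.-]+="[^"]*")*\s*(?P<close>\)/\)|\))'
)

def check_empty(text, declared_empty):
    """Return (gi, problem) pairs where the start tag disagrees with the DTD."""
    problems = []
    for m in START.finditer(text):
        gi = m.group("gi")
        tagged_empty = m.group("close") == ")/)"
        if tagged_empty and gi not in declared_empty:
            problems.append((gi, "closed with TAGC but not declared EMPTY"))
        elif not tagged_empty and gi in declared_empty:
            problems.append((gi, "declared EMPTY but closed with NET"))
    return problems
```

Without a DTD, only the `tagged_empty` test is available, which is
all a DTD-less processor needs in order to know whether to expect
content.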


While I won't be shocked to learn that I've overlooked an
obvious flaw, I do hope the above encourages thought in two areas: 

1) Are un-minimized end tags really helpful in light of all the
   other proposed restrictions for XML?

2) If we're going to standardize on a concrete syntax, is the
   reference concrete syntax the best possible?

Thanks for your patience.

-Bill

-- 
William D. Lindsey
blindsey@bdmtech.com
+1 (303) 672-8954

Received on Saturday, 14 September 1996 20:29:07 UTC