ERB: decision and conundrum

More on addressing.  On March 15, the ERB agreed that:

1. Contrary to our decision of last time, we will support subelement
   addressing by a simple search operator.  We will make it clear that
   bit-for-bit matching without respect to words or tokens is compliant
   behavior; if implementations wish to compete on the basis of 
   case-folding or other fancy search optimization, that's fine.

2. Locators shall consist of a URL, optionally followed by a '#'.

3. The '#' may be followed by the string "<tei>", in which case the
   remainder of the locator is to be treated as a TEI extended pointer.
   Michael Sperberg-McQueen has an action item to figure out the required
   changes to TEI xptr syntax to fit them into a URL.
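
To make the shape of a locator concrete, here is a minimal sketch in Python of splitting one into its parts under points 2 and 3.  The names split_locator and Locator are hypothetical, and the sketch treats the text after "<tei>" as an opaque string, since the changes needed to fit TEI xptr syntax into a URL are still an open action item.

```python
from typing import NamedTuple, Optional

TEI_MARKER = "<tei>"  # literal marker from point 3


class Locator(NamedTuple):
    """Hypothetical container for the parts of a locator."""
    url: str
    kind: str                 # "none", "tei", or "bare"
    fragment: Optional[str]   # text after '#', minus any "<tei>" marker


def split_locator(locator: str) -> Locator:
    # Everything before the first '#' is the URL.
    url, sep, rest = locator.partition("#")
    if not sep:
        return Locator(url, "none", None)
    if rest.startswith(TEI_MARKER):
        # The remainder is a TEI extended pointer; left uninterpreted here.
        return Locator(url, "tei", rest[len(TEI_MARKER):])
    # A bare string after '#'; its meaning is not settled by points 2 and 3.
    return Locator(url, "bare", rest)
```

For example, split_locator("http://foo.bar.com/baz.html#sec1.2") yields kind "bare" and fragment "sec1.2"; what a bare fragment should mean is a separate question that this sketch does not answer.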

Note: with respect to our previous concerns on internationalization, 
we investigated and it appears that both Netscape and MSIE are trying
to do the right thing; while there remain bugs in this area, our policy
seems to be reasonable.

On another subject, we agonized further over the fact that current
implementations of '#' in URLs always fetch the whole document and
then navigate to the fragment in the client.  For SGML, this is
probably often unreasonable.  Too bad - this behavior is not 
carved in stone; early implementations that stupidly try to fetch
the entire OED or Physician's Desk Reference, just to pull out a 
fragment, will not succeed in the marketplace.


4. If the '#' is followed only by a string, then.... what?  This should
   be an IDREF, right?  Maybe.  And if it is, how do you know how to find
   ID attributes in an XML document out at the far end of a URL?  Can you
   be sure of finding the appropriate declaration in the internal
   DTD subset?  Can you be sure of finding the external subset?

On the Web, in the URL "http://foo.bar.com/baz.html#sec1.2", the 
"sec1.2" should correspond to a <A NAME='sec1.2'>.  It is not, in the
HTML DTD, an ID attribute.  They want to use more characters than SGML
ID allows, and they don't want to enforce uniqueness.  If there is more
than one matching NAME=, few browsers will do anything reasonable, but
it's not an error.  In fact, the semantics of #-fragments in HTML are
easily expressed in a simple TEI xptr query saying "find the first
A element whose NAME attribute has the value whatever".  We could
duplicate that in XML, but it feels limiting.  We could duplicate it
but, in the linking element, provide other attributes to say what 
the element type and attribute name you're trying to match are.  But
then you're duplicating something you could do with a "#<tei>" string.
Or, we could say that it *is* an IDREF, and by default look for an
attribute named 'ID' with the indicated value, and also, if it's
possible, look in the internal subset or the whole DTD to find out 
what attributes are IDs.  This would be weaker than HTML in the
allowed values (SGML NAME) and requirement for only one match.  Big

What we want is to have a simple behavior that makes sense, specified
simply.  No surprise that it's hard to be simple.  Input and inspiration
from the WG are solicited.

Cheers, Tim Bray
tbray@textuality.com http://www.textuality.com/ +1-604-708-9592