Re: ERB: decision and conundrum from Peter Murray-Rust on 1997-03-17 (w3c-sgml-wg@w3.org from March 1997)

From: Peter Murray-Rust <Peter@ursus.demon.co.uk>
Date: Mon, 17 Mar 1997 08:27:28 GMT
To: w3c-sgml-wg@w3.org
Message-Id: <4764@ursus.demon.co.uk>
In message <3.0.32.19970316185821.0099a750@pop.intergate.bc.ca> Tim Bray writes:
> More on addressing.  On March 15, the ERB agreed that:
> 
> 1. Contrary to our decision of last time, we will support subelement
>    addressing by a simple search operator.  We will make it clear that

This sounds like what I need.  Is 'subelement addressing' a clear term?
I take it to mean finding a substring or position in a chunk of #PCDATA, by
some means either to be further described or left to the application?
Does this also apply to mixed content, or is one expected to have
navigated down to the raw #PCDATA first?

>    bit-for-bit matching without respect to words or tokens is compliant
>    behavior; if implementations wish to compete on the basis of 
>    case-folding or other fancy search optimization, that's fine.

I like the idea of regular expressions.  Is this a 'fancy search'?
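
To make the distinction concrete, here is a minimal sketch in Python of
the three kinds of search over a chunk of #PCDATA: exact (bit-for-bit)
matching, case-folded matching, and a regular expression.  The function
names and the sample string are illustrative assumptions, not anything
the ERB has specified.

```python
import re

def find_exact(text, needle):
    """Exact substring match; returns the offset or -1."""
    return text.find(needle)

def find_folded(text, needle):
    """Case-insensitive match; returns the offset or -1.
    Offsets are only reliable when lowercasing preserves length (true for ASCII)."""
    return text.lower().find(needle.lower())

def find_regex(text, pattern):
    """Offset of the first regular-expression match, or -1."""
    match = re.search(pattern, text)
    return match.start() if match else -1

pcdata = "Benzene and benzene derivatives"  # hypothetical #PCDATA content
print(find_exact(pcdata, "benzene"))         # 12
print(find_folded(pcdata, "benzene"))        # 0
print(find_regex(pcdata, r"[Bb]enz\w+"))     # 0
```

Only the first function corresponds to the baseline behaviour quoted
above; the other two are the sort of thing an implementation might offer
on top.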

> 
[...]
> 
> CONUNDRUM: 
> 
> 4. If the '#' is followed only by a string, then.... what?  This should
>    be an IDREF, right?  Maybe.  And if it is, how do you know how to find
>    ID attributes in an XML document out at the far end of a URL?  Can you
>    be sure of finding the appropriate declaration in the internal
>    DTD subset?  Can you be sure of finding the external subset?
> 
> On the Web, in the URL "http://foo.bar.com/baz.html#sec1.2", the 
> "sec1.2" should correspond to a <A NAME='sec1.2'.  It is not, in the
> HTML DTD, an ID attribute.  They want to use more characters than SGML
> ID allows, and they don't want to enforce uniqueness.  If there is more
> than one matching NAME=, few browsers will do anything reasonable, but

WF XML documents may contain more than one element with the same ID
value (uniqueness is a VC, not a WFC).  Therefore we need to say
something about it, if only to say it's undefined.

I think that there may be marketing problems with ID rather than NAME.
I started CML with proper SGML ID/IDREF addressing and changed to
HREF/NAME.  I think my reasons were:
	- the semantics of HREF/NAME were commonly applied (although 
		fuzzy).  If you are going to use IDs you will need to 
		educate the webhackers (like me) as to why this is better.
	- since many of my docs were manually edited I frequently ended up 
		with duplicate IDs or unresolved IDREFs.  This gave (correctly)
		zillions of errors in sgmls.  I just felt it was easier to
		use NAME which had no validity constraints (for me).
Whether we like it or not, a very large number of people will come to
XML from HTML.  A functioning HREF/NAME mechanism is an attractive thing
to give them.  

> it's not an error.  In fact, the semantics of #-fragments in HTML are
> easily expressed in a simple TEI xptr query saying "find the first
> A element whose NAME attribute has the value whatever".  We could

That is what I would expect as 'reasonable behaviour' for multiple
identical IDs or NAMEs.
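
A minimal sketch of that "first match wins" behaviour, assuming Python's
standard xml.etree.ElementTree and a made-up two-anchor document; the
function name and its parameters are illustrative only.

```python
import xml.etree.ElementTree as ET

def resolve_fragment(xml_text, fragment, attr="NAME", tag=None):
    """Return the first element in document order whose `attr` equals
    `fragment`, or None.  `tag` optionally restricts the element type."""
    root = ET.fromstring(xml_text)
    for elem in root.iter(tag):  # iter() walks the tree in document order
        if elem.get(attr) == fragment:
            return elem
    return None

# Hypothetical document with two anchors sharing one NAME.
doc = '<DOC><A NAME="sec1.2">first</A><P><A NAME="sec1.2">second</A></P></DOC>'
hit = resolve_fragment(doc, "sec1.2", tag="A")
print(hit.text)  # first
```

A duplicate is simply never reached, so no error arises; whether that
should be the specified behaviour is the open question.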

> duplicate that in XML, but it feels limiting.  We could duplicate it
> but, in the linking element, provide other attributes to say what 
> the element type and attribute name you're trying to match are.  But
> then you're duplicating something you could do with a "#<tei>" string.
> Or, we could say that it *is* an IDREF, and by default look for an
> attribute named 'ID' with the indicated value, and also, if it's

Is the intention that in XML there is a default recommendation that all
IDs have a name 'ID'?  (It would certainly help in WF documents which may
not otherwise have a means of identifying IDs).
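
As a sketch of what such a default would buy a DTD-less processor: the
Python below assumes, purely by convention, that any attribute named
'ID' is an identifier, and indexes a well-formed document on it.  The
sample markup and function name are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def index_ids(xml_text, id_attr="ID"):
    """Map each value of `id_attr` to the list of elements carrying it.
    No DTD is consulted; the attribute name alone marks an ID."""
    index = defaultdict(list)
    for elem in ET.fromstring(xml_text).iter():
        value = elem.get(id_attr)
        if value is not None:
            index[value].append(elem)
    return dict(index)

# Hypothetical well-formed document with a duplicated ID value.
doc = '<MOL ID="m1"><ATOM ID="a1"/><ATOM ID="a1"/><ATOM ID="a2"/></MOL>'
index = index_ids(doc)
duplicates = sorted(v for v, elems in index.items() if len(elems) > 1)
print(duplicates)  # ['a1']
```

Keeping a list per value means duplicates are visible to the
application, which can then take the first, report an error, or do
whatever the spec ends up saying.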

> possible, look in the internal subset or the whole DTD to find out 
> what attributes are IDs.  This would be weaker than HTML in the
> allowed values (SGML NAME) and requirement for only one match.  Big
> deal?
> 
> What we want is to have a simple behavior that makes sense, specified
> simply.  No surprise that it's hard to be simple.  Input and inspiration
> from the WG are solicited.

In summary, if there are behaviours in HTML that we like and can work with,
it makes sense to carry them over, even if we have to clarify the semantics.
Hope this webhacker's view is useful.

Is it useful at this stage for the pointer to carry information about the
MIME type of the target document?  I would find this very useful myself
because it allows downloading a search engine of the appropriate type.
Thus:

<A HREF="http://www.venus.co.uk/omf/cml/mydoc.cml#(3 MOL)"
   MIME="text/xml" METHOD="http://www.ch.ic.ac.uk/omf/pmr.chemime.ChemTree">

This would tell the application that the remote file was expected to be
of MIME type text/xml.  The METHOD attribute names the classes used to
search the document.  If the remote document is not text/xml, then there
may be an error
or some content negotiation.  You may disagree with this philosophy
(that the calling document rather than the called document defines the
MIME type), but I suspect we shall use it in our community in some cases.
[I daren't say what 'MIME' types we use - they might not be text/xml].
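
A rough sketch of how a client might act on such a link, in Python.  The
MIME and METHOD attributes are only the proposal above, not part of any
spec, and the helper names are invented for illustration.

```python
from html.parser import HTMLParser

class LinkAttrs(HTMLParser):
    """Collect the attributes of the first <A> start tag seen."""
    def __init__(self):
        super().__init__()
        self.attrs = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a" and not self.attrs:
            self.attrs = dict(attrs)

def parse_link(markup):
    """Split a link into URL, fragment, declared MIME type and METHOD."""
    parser = LinkAttrs()
    parser.feed(markup)
    parser.close()
    url, _, fragment = parser.attrs.get("href", "").partition("#")
    return {"url": url, "fragment": fragment,
            "mime": parser.attrs.get("mime"),
            "method": parser.attrs.get("method")}

def type_matches(declared, served):
    """Compare the declared type with a served Content-Type value,
    ignoring case and any parameters such as charset."""
    return declared.lower() == served.split(";")[0].strip().lower()

link = parse_link('<A HREF="http://www.venus.co.uk/omf/cml/mydoc.cml#(3 MOL)"\n'
                  '   MIME="text/xml" '
                  'METHOD="http://www.ch.ic.ac.uk/omf/pmr.chemime.ChemTree">')
print(link["fragment"])                                       # (3 MOL)
print(type_matches(link["mime"], "text/xml; charset=utf-8"))  # True
```

On a mismatch the client could raise an error or attempt content
negotiation, as suggested above; the sketch deliberately stops short of
fetching anything.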

	P.


-- 
Peter Murray-Rust, domestic net connection
Virtual School of Molecular Sciences
http://www.vsms.nottingham.ac.uk/
Received on Monday, 17 March 1997 04:29:21 UTC