Re: addressing into char content with xml-link from Peter Murray-Rust on 1997-04-11 (w3c-sgml-wg@w3.org from April 1997)

From: Peter Murray-Rust <Peter@ursus.demon.co.uk>
Date: Fri, 11 Apr 1997 10:25:29 GMT
To: w3c-sgml-wg@w3.org
Message-Id: <5611@ursus.demon.co.uk>
In message <3.0.32.19970410164725.006b6e50@pophost.arbortext.com> Paul Grosso writes:
> In discussions with others over that last couple days, I've come
> to the conclusion we should consider added to xml-link the capability
> to address into data character content (aka dataloc).
> 
This is an extremely important area for me and will - I think - be critical
if XML is to be used for non-textual data (as well as textual data).  In this
message I make a simple proposal which is easy to implement and I hope is in 
the spirit of XML.

First, however, if XML does *NOT* specify addressing into data character
content one of the following may happen:
	- people will simply walk away from XML.  (This will happen in my own 
		discipline because much of our dataloc is inherently structured).
	- people will write PIs, develop new Elements or whatever simply to
		address this problem.  This is guaranteed to be out of XML's
		ontrol and therefore a mess.
	- people will make up new (illegal) syntax.  This will be even worse.

One difficulty seems to be that we cannot at this stage devise a single robust
mechanism for this process.  Even for textual pattern matching there will be 
great variety in the approaches including case-sensitivity and the variety of 
regular expressions used.  

Two non-numeric examples (both from the JUMBO demo):
(a) Stock prices are held as a whitespace-separated array of numbers (i.e.
record-ends are not differentiated from spaces or tabs and all are folded to
a single space by the CML processor).  This array is robust and could have
10000 elements.  It's unnecessary overkill to have to encapsulate every number
in tags.  We need a mechanism for addressing individual points or ranges
of points.
(b) Proteins are a linear sequence of subunits and can be internally addressed.
It's important to be able to send an XML document that says: '*these* are the
amino acids which are mutated in HIV protease and give rise to resistance'.
Alternatively we may wish to address 3-D regions of space "this is where the
antibody binds".

In addition we have discipline-specific search mechanism which are robust and
which we'd like to be able to use for addressing.  To find all the molecules
which contain a benzene ring we need to write something like:
	ROOT DESCENDANT (ALL MOL) (...) (SMILESSEARCH) (c1ccccc1)

It seems clear to me that we shall have to live with a multitude of mechanisms 
for adressing dataloc and therefore we should welcome this and provide 
something where this can be done as robustly and extensibly as possible
without feature creep.

<PROPOSAL>
I suggest that we recall the FOREIGN keyword to allow non-XML based searches.
</PROPOSAL>

The syntax as I recall is that it is:
(FOREIGN)(ANYTHING)(YOU)(LIKE)

The immediate advantage is that it will allow authors of documents to write
*something* that is not syntactically invalid.  (At present there is nothing
they can do.)  A subsidiary proposal is that where addressing mechanisms are
likely to be common or agreed within a subcommunity, that XML provides a way 
of identifying these mechanisms.

<PROPOSAL>
That the token immediately after FOREIGN _may_ be an identifier (e.g. an FPI)
identifying the precise mechanism to be used.
</PROPOSAL>

If, for example, we need to identify a regex for searching, we could precisely
define the regex used.  This could point to a document, or even to a catalog
containing various implementations.  As an example 

caesar.xml#(ROOT,DESCENDANT,(ALL,LINE)(FOREIGN)(-//POSIX//REGEX)("the*")

(the details are illustrative - I don't know if POSIX has an FPI)

If we collect xml-friendly resources under XML-DEV, then we might wish to 
develop an identifying scheme which allows them to be located.  In the fullness
of time, they might even get formal XML FPIs.

In summary, the advantage of this is that it does not committ the implementers
to anything more than identifying the word FOREIGN.  An XML-engine is under
no obligation to provide a mechanism beyond that.  If a small community
such as moleculat science wishes to develop something like:
DESCENDANT,(ALL,MOL)(FOREIGN)("-//Chemical Markup Language//SEARCH SMILES//")
(c1ccccc1)
they will be able to make it interoperate without bothering XML.  If the wider
community wishes to search text for components, then XML can provide formal
robust pointers to a set of methods which most implementers would be 
prepared to provide.  In cases where these use algorithms it may be possible to
refer to (say) Java classes, thus providing guaranteed interoperability.

	P.

-- 
Peter Murray-Rust, domestic net connection
Virtual School of Molecular Sciences
http://www.vsms.nottingham.ac.uk/
Received on Friday, 11 April 1997 06:44:01 UTC