- From: Peter Murray-Rust <Peter@ursus.demon.co.uk>
- Date: Fri, 11 Apr 1997 10:25:29 GMT
- To: w3c-sgml-wg@w3.org
In message <3.0.32.19970410164725.006b6e50@pophost.arbortext.com> Paul Grosso writes: > In discussions with others over that last couple days, I've come > to the conclusion we should consider added to xml-link the capability > to address into data character content (aka dataloc). > This is an extremely important area for me and will - I think - be critical if XML is to be used for non-textual data (as well as textual data). In this message I make a simple proposal which is easy to implement and I hope is in the spirit of XML. First, however, if XML does *NOT* specify addressing into data character content one of the following may happen: - people will simply walk away from XML. (This will happen in my own discipline because much of our dataloc is inherently structured). - people will write PIs, develop new Elements or whatever simply to address this problem. This is guaranteed to be out of XML's ontrol and therefore a mess. - people will make up new (illegal) syntax. This will be even worse. One difficulty seems to be that we cannot at this stage devise a single robust mechanism for this process. Even for textual pattern matching there will be great variety in the approaches including case-sensitivity and the variety of regular expressions used. Two non-numeric examples (both from the JUMBO demo): (a) Stock prices are held as a whitespace-separated array of numbers (i.e. record-ends are not differentiated from spaces or tabs and all are folded to a single space by the CML processor). This array is robust and could have 10000 elements. It's unnecessary overkill to have to encapsulate every number in tags. We need a mechanism for addressing individual points or ranges of points. (b) Proteins are a linear sequence of subunits and can be internally addressed. It's important to be able to send an XML document that says: '*these* are the amino acids which are mutated in HIV protease and give rise to resistance'. Alternatively we may wish to address 3-D regions of space "this is where the antibody binds". In addition we have discipline-specific search mechanism which are robust and which we'd like to be able to use for addressing. To find all the molecules which contain a benzene ring we need to write something like: ROOT DESCENDANT (ALL MOL) (...) (SMILESSEARCH) (c1ccccc1) It seems clear to me that we shall have to live with a multitude of mechanisms for adressing dataloc and therefore we should welcome this and provide something where this can be done as robustly and extensibly as possible without feature creep. <PROPOSAL> I suggest that we recall the FOREIGN keyword to allow non-XML based searches. </PROPOSAL> The syntax as I recall is that it is: (FOREIGN)(ANYTHING)(YOU)(LIKE) The immediate advantage is that it will allow authors of documents to write *something* that is not syntactically invalid. (At present there is nothing they can do.) A subsidiary proposal is that where addressing mechanisms are likely to be common or agreed within a subcommunity, that XML provides a way of identifying these mechanisms. <PROPOSAL> That the token immediately after FOREIGN _may_ be an identifier (e.g. an FPI) identifying the precise mechanism to be used. </PROPOSAL> If, for example, we need to identify a regex for searching, we could precisely define the regex used. This could point to a document, or even to a catalog containing various implementations. As an example caesar.xml#(ROOT,DESCENDANT,(ALL,LINE)(FOREIGN)(-//POSIX//REGEX)("the*") (the details are illustrative - I don't know if POSIX has an FPI) If we collect xml-friendly resources under XML-DEV, then we might wish to develop an identifying scheme which allows them to be located. In the fullness of time, they might even get formal XML FPIs. In summary, the advantage of this is that it does not committ the implementers to anything more than identifying the word FOREIGN. An XML-engine is under no obligation to provide a mechanism beyond that. If a small community such as moleculat science wishes to develop something like: DESCENDANT,(ALL,MOL)(FOREIGN)("-//Chemical Markup Language//SEARCH SMILES//") (c1ccccc1) they will be able to make it interoperate without bothering XML. If the wider community wishes to search text for components, then XML can provide formal robust pointers to a set of methods which most implementers would be prepared to provide. In cases where these use algorithms it may be possible to refer to (say) Java classes, thus providing guaranteed interoperability. P. -- Peter Murray-Rust, domestic net connection Virtual School of Molecular Sciences http://www.vsms.nottingham.ac.uk/
Received on Friday, 11 April 1997 06:44:01 UTC