
RE: Surface vs. Abstract Syntax, was: RE: What do the ontologists want

From: pat hayes <phayes@ai.uwf.edu>
Date: Sat, 19 May 2001 13:03:02 -0500
Message-Id: <v0421013eb72c5a50558b@[205.160.76.183]>
To: "Jonathan Borden" <jborden@mediaone.net>
Cc: www-rdf-logic@w3.org
> > >complaining about XML's verbosity is directly along the lines of
> > >complaining that LISP uses too many parens.
> >
>
>pat hayes wrote:
>
> > I disagree. The point is that parens are informationally quite dense.
> > The only way of indicating applicative structure in fewer symbols
> > would be some form of Polish notation, and that can only be used for
> > fixed-arity (usually binary) operators.
>
>Minimizing tokens was apparently not the most significant criterion for
>development of the XML syntax. Interestingly, SGML, the forerunner of
>XML, provided facilities that would, for example, allow the s-expression
>syntax to be declared as a valid SGML application, and I recall seeing
>just that at some point.

Ah, what an opportunity missed. Never mind....

>All these issues were argued out and in the end a compromise was reached.
>Like all compromises, not everyone is happy with each feature. What _has_
>happened is that a large number of people have found XML easy to work with
>and adequate for their particular needs, even if not the most efficient
>representation for their particular application. I can assure you that these
>issues have been beaten to an infinite set of deaths on the SGML/XML lists.

I'm sure.

>I can't argue with any of these points except to say: XML parsers have been
>spread like viruses to all reaches of the digital world. XML applications
>can run on platforms ranging from cell phones to mainframes. Perhaps if the
>LISP community could have resisted the temptation to fragment, and had
>provided ubiquitous open source software

LISP was open source before there was any proprietary software. But I 
agree that there is no way to put the clock back.

>, we might be using parens rather
>than angle brackets, but that is a different battle to fight. I am more
>concerned with developing a semantic "infrastructure" because I believe
>there are some fundamental issues that need to be addressed*
>
>* of course if you insist on using parens, I can provide a piece of software
>(or tell you how to write one) that will parse s-expressions as XML.

Oh, I know. I can write one myself. (Of course there is no reason to 
suppose that the XML that mine produces will look anything much like 
the XML that yours produces, but that's life.)
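[Ed.: the point about incompatible renderings can be made concrete. The sketch below, in Python, is purely illustrative -- neither encoding is anyone's actual converter -- showing two well-formed but incompatible XML renderings of the same s-expression.]

```python
# Two plausible but incompatible ways to render an s-expression as XML.
# Both are "correct"; they simply disagree, which is the point at issue.

def parse_sexpr(s):
    """Parse a simple s-expression into nested Python lists."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    def read(pos):
        if tokens[pos] == "(":
            items, pos = [], pos + 1
            while tokens[pos] != ")":
                item, pos = read(pos)
                items.append(item)
            return items, pos + 1           # skip the closing ")"
        return tokens[pos], pos + 1
    return read(0)[0]

def to_xml_head(tree):
    """Encoding A: the first atom of a list becomes the element name."""
    if isinstance(tree, str):
        return tree
    head, *rest = tree
    return "<%s>%s</%s>" % (head, " ".join(to_xml_head(t) for t in rest), head)

def to_xml_generic(tree):
    """Encoding B: every node becomes a generic <list> or <atom> element."""
    if isinstance(tree, str):
        return "<atom>%s</atom>" % tree
    return "<list>%s</list>" % "".join(to_xml_generic(t) for t in tree)

expr = parse_sexpr("(plus 1 (times 2 3))")
print(to_xml_head(expr))     # <plus>1 <times>2 3</times></plus>
print(to_xml_generic(expr))  # generic <list>/<atom> encoding of the same tree
```

An engine built for encoding A will find nothing it recognizes in encoding B, even though both carry exactly the same tree.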

> >
> > >perhaps the greatest benefit of XML is that its surface syntax directly
> > >represents its abstract syntax,
> >
> > So does LISP. In fact, so do almost all mathematical and formal notations.
> >
> > > and for someone familiar with XML, this
> > >means that one can look at a document, even in the absence of a
> > >schema, and
> > >get a pretty good idea of its structure.
> >
> > This is true of any explicit syntax. You can look at a page of
> > mathematics and do that, even if you don't know the math very well.
> >
> > I think that what makes XML so longwinded is not that its surface
> > syntax *represents* its abstract syntax, but that it explicitly
> > *describes* it, which is like writing English by prefixing (and
> > postfixing!) every word and phrase by a label describing its
> > syntactic category.
>
>This is a good analogy. Suppose we could transport ourselves 1 million
>years into the future (and suppose that the world still uses Unicode).
>If you were assigned the task of translating documents written in some
>completely unknown language into English, I dare say that you would
>_greatly_ appreciate it if each word were marked up in such fashion.

Well, we are having a purely academic discussion here, of course, but
I don't see that it would be much help unless I also knew what the
mark-up labellings were supposed to mean. And if I didn't know how to
distinguish mark-up from text, it would be an active hindrance. If
the Rosetta stone texts had been marked up using some conventions we
no longer knew, they would probably be completely undecipherable.

>A criterion for XML was not succinctness. A criterion was the ability
>to archive information for very long-term use.
>
> > This seems to me to be based on a
> > misunderstanding of the very nature of syntax.
>
>Nah, the point of 'well-formedness' as opposed to 'schema-valid' is
>that documents are better stand-alone, and you don't need to continually
>refer to a schema, which can get misplaced. You can parse math because
>of the schema built into your head.

That is called knowing the language, yes. To read a marked-up text 
you need to be able to read TWO languages. I don't see that making it 
more 'stand-alone'.
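[Ed.: the well-formedness distinction being argued here is a real one, and easy to demonstrate: a parser recovers the tree of any well-formed document with no schema at all, and rejects crossed tags just as cheaply -- yet the tag vocabulary still means nothing to it, which is Pat's "two languages" point. A minimal sketch in Python; the tag names are deliberately invented.]

```python
# Well-formedness vs. schema-validity: the parser checks only that tags
# nest and close properly; it needs no schema. What the tags *mean* is
# a second language the reader must already know.
import xml.etree.ElementTree as ET

doc = "<zorp><blek>42</blek></zorp>"       # well-formed, vocabulary unknown
tree = ET.fromstring(doc)                  # parses fine, no schema needed
print(tree.tag, tree.find("blek").text)    # structure recovered: zorp 42

bad = "<zorp><blek>42</zorp></blek>"       # tags cross: not well-formed
try:
    ET.fromstring(bad)
except ET.ParseError:
    print("rejected without any schema")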

>Consider this very practical, real-world example:
>
>"
>Dear Dr. Smith;
>
>I have had the great pleasure of seeing your wonderful 42 year old patient
>Rev. John Roberts III, who presents with back pain radiating down the right
>leg for 2 weeks. He has weakness in the left gastrocnemius. An MRI
>demonstrates a Left L5-S1 disc herniation. I recommend surgery.
>
>Best Regards,
>
>Jonathan Borden, M.D.
>"
>
>Marked up, it contains new tokens but no new information to me:
>
><office.note>
>Dear <referring.md><person.name><title>Dr.</title>
><family.name>Smith</family.name></person.name></referring.md>;
>
>I have had the great pleasure of seeing your wonderful <patient.age>42
>year</patient.age> old patient <patient><person.name><prefix>Rev.</prefix>
><given>John</given> <family>Roberts</family>
><suffix>III</suffix></person.name></patient>, who presents with
><chief.complaint>back pain radiating down the <laterality>right</laterality>
>leg for <duration>2 weeks</duration></chief.complaint>. He has
><physical.exam>weakness in the left gastrocnemius</physical.exam>. An MRI
>demonstrates a Left L5-S1 disc herniation. I recommend
><procedure><coded.value type="cpt"
>code="63030">surgery</coded.value></procedure>.
>
>Best Regards,
>
><attending.surgeon><person.name>Jonathan Borden,
>M.D.</person.name></attending.surgeon>
></office.note>
>
>
>The point is that successive applications on a processing chain can add
>information and apply transformations akin to "knowledge sources".
>
> > Languages (almost all
> > of them, natural and artificial) work by *displaying* their syntactic
> > structure, not by *describing* it.
>
>The above XML still displays on a browser as text, despite being heavily
>marked up.
>
> > If you do both, you pretty much
> > guarantee to be using more symbols than you need to be using to
> > convey the same information.
>
>yep. but these symbols are meaningful to applications that process the
>information.

OK, that is clearly the key point. Thanks for your example. In this 
kind of application, I can see the utility.  Never mind future 
Rosetta stones: the point is to attach labels to pieces of existing 
text to enable some engine to isolate the parts of the text that are 
useful to it and ignore the rest. This makes perfectly good sense, I 
agree, and is obviously useful and important in the Web world, but it 
doesn't live up to the hype. It does not make texts self-describing, 
and it does not make them more comprehensible in the long term. The 
thing using the markup needs to know what the markup labels indicate, 
for example. (The use of the </label> notation still seems 
mind-numbingly daft to me, but I guess it is too late to change that 
now.)
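[Ed.: the kind of engine Pat concedes the utility of -- one that isolates the labelled parts it understands and ignores the rest -- can be sketched in a few lines. The note fragment follows Jonathan's example; the engine itself is hypothetical, and it illustrates Pat's caveat too: it only works because "coded.value" is hard-wired into it.]

```python
# A sketch of such an engine: it knows only the <coded.value> label and
# skips everything else in the note. It must already know what that
# label indicates; the markup is not self-describing to it.
import xml.etree.ElementTree as ET

note = """<office.note>I recommend <procedure><coded.value type="cpt"
code="63030">surgery</coded.value></procedure>.</office.note>"""

root = ET.fromstring(note)
codes = [(cv.get("type"), cv.get("code"), cv.text)
         for cv in root.iter("coded.value")]
print(codes)  # [('cpt', '63030', 'surgery')]
```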

But that kind of utility - text markup - is one thing, and inventing 
an entirely distinct formal language is another. I can see very 
little utility in having a new formalism rendered into XML syntactic 
form, and a great deal of inutility. There is the pragmatic utility
that is supposed to arise from the ubiquity of XML parsers and so
on, though I think this argument is often overstated:

> > >XML naturally represents trees and somewhat naturally handles
> > >maps.
> >
> > OK, I agree that such savings are very handy when the work has
> > already been done.
>
>
>That's really the entire point: we don't need to constantly reinvent
>browsers, transform engines (e.g. XSLT), database glue, query
>languages, etc.

Again, this kind of argument has been made many times in favor of
'standard' formats of one kind or another, and I am cynical about
them. People will write new engines, come what may, and if something
better than XML is available, they may switch to it; some of them
already are. This kind of off-the-shelf inertia gives stuff a
lifetime of one or two product cycles, which for software is about
2-3 years these days. And in any case, the lifetime of the
information is often only a few years. Things change, software stops
working on newer platforms, etc. Usually, if something is worth doing
other than badly, the restriction to a bad legacy format isn't worth
the harm it causes. (Even if it pays off for a while, the longer the
change is put off, the nastier it gets to make it (the current state
of the FBI might be an example, but there are plenty throughout
government agencies). So the best way to proceed is to stay flexible
and be ready to change when circumstances require it, rather than to
invest in the final answer and feel that one's problems are finally
solved. There are always enough managers out there who want the
final answer, however, to listen to the people who think they have
it.)

>IMHO RDF is really too young to consider itself a legacy, especially
>when taking on a task as monumental as becoming a "semantic platform"
>for the "semantic web". RDF needs to adapt, and adapt well.

Amen, but tell that to the W3C marines.

Pat Hayes

---------------------------------------------------------------------
IHMC					(850)434 8903   home
40 South Alcaniz St.			(850)202 4416   office
Pensacola,  FL 32501			(850)202 4440   fax
phayes@ai.uwf.edu 
http://www.coginst.uwf.edu/~phayes
Received on Saturday, 19 May 2001 14:03:05 UTC
