Re: DTD of natural language

	Richard: ... There _has_ been a bit of work on automatically learning DTDs 
	from examples of marked-up documents, and there is a _lot_ of work going on 
	trying to learn approximate natural language grammars from corpora.
	
	Peter: But are corpora based on XML (I know they could, but are they 
	actually implemented this way?)
	
Some are.  (The LT group at Edinburgh have built a rather good kit of
"Normalised SGML" and XML tools precisely for this purpose.)
Some aren't.  The Wall Street Journal collection, for example.

But what of it?  XML is even more restrictive than SGML.  If you want to
represent the structure of an English sentence in XML, you have to make
heavy use of attributes.  (Rather like RDF.)  But DTDs (and XML Schemas)
cannot express contextual constraints on attribute values; if attribute
A of element type E may _ever_ have value V, then it may _always_ have
value V.  (Well, in XML that's true.  In SGML there's a technical
exception which isn't useful in this context.)  Programs that learn DTDs
learn the grammar of which element types can appear where and perhaps what
attributes they may have, but they do not learn what DTDs cannot ever
express.
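
To make the point about attribute values concrete, here is a minimal sketch in Python.  It is only an illustration: the element names (s, np, vp), the "num" attribute and the ATTLIST table are hypothetical, and the table merely imitates what an <!ATTLIST ...> declaration can say; it is not a real DTD validator.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for <!ATTLIST ...> declarations: allowed values are
# keyed by (element type, attribute name) only, with no notion of context.
ATTLIST = {
    ("np", "num"): {"sg", "pl"},
    ("vp", "num"): {"sg", "pl"},
}

def dtd_style_ok(root):
    """Check each attribute value against its enumerated type, DTD-fashion."""
    for elem in root.iter():
        for name, value in elem.attrib.items():
            if value not in ATTLIST.get((elem.tag, name), set()):
                return False
    return True

def agreement_ok(root):
    """A contextual constraint: within each <s>, np and vp must share num."""
    for s in root.iter("s"):
        np, vp = s.find("np"), s.find("vp")
        if np is not None and vp is not None and np.get("num") != vp.get("num"):
            return False
    return True

good = ET.fromstring('<s><np num="sg">the dog</np><vp num="sg">barks</vp></s>')
bad = ET.fromstring('<s><np num="sg">the dog</np><vp num="pl">bark</vp></s>')
```

Both documents pass dtd_style_ok, because "pl" is always a legal value of num on vp; only the separate agreement_ok check, which looks at the sibling np, rejects the second one.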

SGML and XML just aren't *useful* for expressing the structure of English
or any other natural language.  Lisp data values, yes.  Prolog data values,
yes.  Either would be considerably more compact than XML.  You might raise
RDF as a counter-example, but RDF is inexpressibly clumsy and bulky, and
the structural constraints could not be expressed as a DTD.
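
A rough way to see the difference in bulk, sketched in Python: the same toy parse tree written out once as a Lisp-style S-expression and once as XML elements.  The tree, the category labels and both serialisers are made up for illustration, and no escaping of special characters is attempted.

```python
# A toy parse tree as nested tuples: (label, child, child, ...).
# Leaves are plain strings.  The sentence and labels are illustrative only.
TREE = ("S",
        ("NP", ("Det", "the"), ("N", "dog")),
        ("VP", ("V", "barks")))

def to_sexpr(node):
    """Render the tree as a Lisp-style S-expression."""
    if isinstance(node, str):
        return node
    label, *children = node
    return "(" + " ".join([label] + [to_sexpr(c) for c in children]) + ")"

def to_xml(node):
    """Render the same tree with one XML element per node."""
    if isinstance(node, str):
        return node
    label, *children = node
    return "<%s>%s</%s>" % (label, "".join(to_xml(c) for c in children), label)

sexpr = to_sexpr(TREE)  # (S (NP (Det the) (N dog)) (VP (V barks)))
xml = to_xml(TREE)      # <S><NP><Det>the</Det><N>dog</N></NP><VP><V>barks</V></VP></S>
```

Every label appears twice in the XML form and once in the S-expression, so the gap widens as trees get deeper or as category names get longer.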

	Richard: ... some natural languages cannot be described by context-free 
	grammars, and SGML is not even as powerful as general context-free grammars.
	
	Peter: some natural languages, or all of them? Which natural languages can and 
	which ones can not? Can you point me to more info regarding this?
	
I have heard of *proofs* for three languages (one a German dialect, one
another Germanic language, and one an African language) but am not up to
date.  The people who invented GPSG (Generalised Phrase Structure Grammar)
pointed out that Chomsky's original proof, based on English, was flawed.
GPSG takes a context-free core language and applies transformations to
the grammar rather than the sentences.  The grammars one ends up with are
context free, but huge.  Parsing using a standard context-free parser
would be grossly inefficient, so special techniques are used.  From
memory, the name of the person at SRI who proved that there _were_ languages
that couldn't be handled at all by context-free grammars was Hans Uszkoreit
(sp?) but I could be wrong about that.

Received on Tuesday, 7 November 2000 22:11:35 UTC