- From: Richard A. O'Keefe <ok@atlas.otago.ac.nz>
- Date: Wed, 8 Nov 2000 16:11:15 +1300 (NZDT)
- To: html-tidy@w3.org, johnspeter@hotmail.com, ok@atlas.otago.ac.nz
Richard: ... There _has_ been a bit of work on automatically learning DTDs from examples of marked-up documents, and there is a _lot_ of work going on trying to learn approximate natural language grammars from corpora.

Peter: But are corpora based on XML (I know they could, but are they actually implemented this way)?

Some are. (The LT group at Edinburgh have built a rather good kit of "Normalised SGML" and XML tools precisely for this purpose.) Some aren't. The Wall Street Journal collection, for example. But what of it?

XML is even more restrictive than SGML. If you want to represent the structure of an English sentence in XML, you have to make heavy use of attributes. (Rather like RDF.) But DTDs (and XML Schemas) cannot express contextual constraints on attribute values; if attribute A of element type E may _ever_ have value V, then it may _always_ have value V. (Well, in XML that's true. In SGML there's a technical exception which isn't useful in this context.) Programs that learn DTDs learn the grammar of which element types can appear where and perhaps what attributes they may have, but they do not learn what DTDs cannot ever express.

SGML and XML just aren't *useful* for expressing the structure of English or any other natural language. Lisp data values, yes. Prolog data values, yes. Either would be considerably more compact than XML. You might raise RDF as a counter-example, but RDF is inexpressibly clumsy and bulky, and the structural constraints could not be expressed as a DTD.

Richard: ... some natural languages cannot be described by context-free grammars, and SGML is not even as powerful as general context-free grammars.

Peter: some natural languages, or all of them? Which natural languages can and which ones cannot? Can you point me to more info regarding this?

I have heard of *proofs* for three languages (one a German dialect, one another Germanic language, and one an African language) but am not up to date.
The people who invented GPSG (Generalised Phrase Structure Grammar) pointed out that Chomsky's original proof, based on English, was flawed. GPSG takes a context-free core language and applies transformations to the grammar rather than the sentences. The grammars one ends up with are context free, but huge. Parsing using a standard context-free parser would be grossly inefficient, so special techniques are used. From memory, the name of the person at SRI who proved that there _were_ languages that couldn't be handled at all by context-free grammars was Hans Uszkoreit (sp?) but I could be wrong about that.
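To make "couldn't be handled at all by context-free grammars" a little more concrete, here is a minimal sketch in Python. It assumes the usual textbook abstraction of crossing dependencies, strings of the form a^n b^m c^n d^m, which is not taken from any particular proof mentioned above. The function name is_cross_serial is made up for this illustration. The check is trivial for a general program, but no context-free grammar generates exactly this set, because the a/c match and the b/d match cross each other instead of nesting.

```python
import re

# Hypothetical helper for illustration only: recognises strings of the
# form a^n b^m c^n d^m with n, m >= 1 (a standard non-context-free pattern).
_RUNS = re.compile(r"(a+)(b+)(c+)(d+)")


def is_cross_serial(s: str) -> bool:
    """Return True if s is a^n b^m c^n d^m for some n, m >= 1."""
    match = _RUNS.fullmatch(s)
    if match is None:
        # Wrong alphabet, wrong order of runs, or a missing run.
        return False
    a, b, c, d = (len(group) for group in match.groups())
    # The two dependencies cross: a pairs with c, and b pairs with d.
    return a == c and b == d
```

For instance, is_cross_serial("aabccd") holds (two a's pair with two c's, one b with one d), while "aabbcd" is rejected because the counts do not pair up.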
Received on Tuesday, 7 November 2000 22:11:35 UTC