- From: Peter Johns <johnspeter@hotmail.com>
- Date: Mon, 06 Nov 2000 22:17:33 EST
- To: ok@atlas.otago.ac.nz, html-tidy@w3.org
Hi Richard, Richard: ... There _has_ been a bit of work on automatically learning DTDs from examples of marked-up documents, and there is a _lot_ of work going on trying to learn approximate natural language grammars from corpora. Peter: But are corpora based on XML (I know they could, but are they actually implemented this way?) Richard: ... some natural languages cannot be described by context-free grammars, and SGML is not even as powerful as general context-free grammars. Peter: some natural languages, or all othem? Which natural languages can and which ones can not? Can you point me to more info regarding this? --------------------------------------------------------- >From: "Richard A. O'Keefe" <ok@atlas.otago.ac.nz> >To: html-tidy@w3.org, johnspeter@hotmail.com >Subject: Re: DTD of natural language >Date: Tue, 7 Nov 2000 10:47:05 +1300 (NZDT) > >Peter Johns <johnspeter@hotmail.com> wrote: > I hear you laughing out loud, but do you know about research in > this topic? I mean, like building the DTD of the whole English > Language, or at least a good attempt at it? > >The English language as such does not contain tags. In full SGML there >is a feature called "short references" that you can use to do clever >things with punctuation. Basically, you can say "if you see this string >inside this element, turn it into a reference to that entity", and of >course that entity can expand to pretty much anything. > >For example, you could have > >+---------------------------------------------------------------+ >| | >|"Now is the time", the Walrus said, "to talk of many things." | >|But the shoes and ships and sealing-wax had to wait. | >| | >+---------------------------------------------------------------+ > >automatically mapped to > ><P><S><Q><W/Now/ <W/is/ <W/the/ <W/time/</Q><K/,/ ><W/the/ <W/Walrus/ <W/said/<K/,/ <Q><W/to/ <W/talk/ ><W/of/ <W/many/ <W/things/</Q><K/./</S> ><S><W/But/ <W/the/ <W/shoes/ <W/and/ <W/ships/ ><W/and/ <W/sealing-wax/ <W/had/ <W/to/ <W/wait<K/./</S></P> > >(Mind you, handling '."' is a bit tricky.) > >Anything finer grained than that would have to be sensitive to >properties of the words. As it happens, it _is_ "technically >possible to give name characters delimiter roles", so in principle >you could map words to arbitrary SGML forms. > >However, there are some pervasive features of natural languages that >SGML does rather badly with: > - lexical ambiguity. Does "lie" mean "recline" or "deceive"? > It is common for different senses of a word to have different > grammatical properties. > - long-distance dependencies (such as agreement). > - extraction and free word or phrase order >SGML DTDs are simply the wrong _kind_ of grammar to conveniently >describe natural languages. It was finally proven about 15 years >ago that some natural languages cannot be described by context- >free grammars, and SGML is not even as powerful as general >context-free grammars. > >That's a pity; there are ways in which SGML could be made context- >sensitive without making it harder to parse. > >There _has_ been a bit of work on automatically learning DTDs from >examples of marked-up documents, and there is a _lot_ of work going >on trying to learn approximate natural language grammars from >corpora. _________________________________________________________________________ Get Your Private, Free E-mail from MSN Hotmail at http://www.hotmail.com. Share information about yourself, create your own public profile at http://profiles.msn.com.
Received on Monday, 6 November 2000 22:18:08 UTC