W3C home > Mailing lists > Public > html-tidy@w3.org > October to December 2000

Re: DTD of natural language

From: Peter Johns <johnspeter@hotmail.com>
Date: Mon, 06 Nov 2000 22:17:33 EST
To: ok@atlas.otago.ac.nz, html-tidy@w3.org
Message-ID: <F174V6uPDVniSnRMWuL00000038@hotmail.com>
Hi Richard,

Richard: ... There _has_ been a bit of work on automatically learning DTDs 
from examples of marked-up documents, and there is a _lot_ of work going on 
trying to learn approximate natural language grammars from corpora.

Peter: But are corpora based on XML (I know they could, but are they 
actually implemented this way?)

Richard: ... some natural languages cannot be described by context-free 
grammars, and SGML is not even as powerful as general context-free grammars.

Peter: some natural languages, or all othem? Which natural languages can and 
which ones can not? Can you point me to more info regarding this?

---------------------------------------------------------


>From: "Richard A. O'Keefe" <ok@atlas.otago.ac.nz>
>To: html-tidy@w3.org, johnspeter@hotmail.com
>Subject: Re:  DTD of natural language
>Date: Tue, 7 Nov 2000 10:47:05 +1300 (NZDT)
>
>Peter Johns <johnspeter@hotmail.com> wrote:
>	I hear you laughing out loud, but do you know about research in
>	this topic?  I mean, like building the DTD of the whole English
>	Language, or at least a good attempt at it?
>
>The English language as such does not contain tags.  In full SGML there
>is a feature called "short references" that you can use to do clever
>things with punctuation.  Basically, you can say "if you see this string
>inside this element, turn it into a reference to that entity", and of
>course that entity can expand to pretty much anything.
>
>For example, you could have
>
>+---------------------------------------------------------------+
>|                                                               |
>|"Now is the time", the Walrus said, "to talk of many things."  |
>|But the shoes and ships and sealing-wax had to wait.           |
>|                                                               |
>+---------------------------------------------------------------+
>
>automatically mapped to
>
><P><S><Q><W/Now/ <W/is/ <W/the/ <W/time/</Q><K/,/
><W/the/ <W/Walrus/ <W/said/<K/,/ <Q><W/to/ <W/talk/
><W/of/ <W/many/ <W/things/</Q><K/./</S>
><S><W/But/ <W/the/ <W/shoes/ <W/and/ <W/ships/
><W/and/ <W/sealing-wax/ <W/had/ <W/to/ <W/wait<K/./</S></P>
>
>(Mind you, handling '."' is a bit tricky.)
>
>Anything finer grained than that would have to be sensitive to
>properties of the words.  As it happens, it _is_ "technically
>possible to give name characters delimiter roles", so in principle
>you could map words to arbitrary SGML forms.
>
>However, there are some pervasive features of natural languages that
>SGML does rather badly with:
>  - lexical ambiguity.  Does "lie" mean "recline" or "deceive"?
>    It is common for different senses of a word to have different
>    grammatical properties.
>  - long-distance dependencies (such as agreement).
>  - extraction and free word or phrase order
>SGML DTDs are simply the wrong _kind_ of grammar to conveniently
>describe natural languages.  It was finally proven about 15 years
>ago that some natural languages cannot be described by context-
>free grammars, and SGML is not even as powerful as general
>context-free grammars.
>
>That's a pity; there are ways in which SGML could be made context-
>sensitive without making it harder to parse.
>
>There _has_ been a bit of work on automatically learning DTDs from
>examples of marked-up documents, and there is a _lot_ of work going
>on trying to learn approximate natural language grammars from
>corpora.

_________________________________________________________________________
Get Your Private, Free E-mail from MSN Hotmail at http://www.hotmail.com.

Share information about yourself, create your own public profile at 
http://profiles.msn.com.
Received on Monday, 6 November 2000 22:18:08 GMT

This archive was generated by hypermail 2.2.0+W3C-0.50 : Tuesday, 3 April 2012 06:13:44 GMT