Re: DTD of natural language

Peter Johns <johnspeter@hotmail.com> wrote:
	I hear you laughing out loud, but do you know about research in
	this topic?  I mean, like building the DTD of the whole English
	Language, or at least a good attempt at it?
	
The English language as such does not contain tags.  In full SGML there
is a feature called "short references" that you can use to do clever
things with punctuation.  Basically, you can say "if you see this string
inside this element, turn it into a reference to that entity", and of
course that entity can expand to pretty much anything.

For example, you could have

+---------------------------------------------------------------+
|                                                               |
|"Now is the time", the Walrus said, "to talk of many things."  |
|But the shoes and ships and sealing-wax had to wait.           |
|                                                               |
+---------------------------------------------------------------+

automatically mapped to

<P><S><Q><W/Now/ <W/is/ <W/the/ <W/time/</Q><K/,/
<W/the/ <W/Walrus/ <W/said/<K/,/ <Q><W/to/ <W/talk/
<W/of/ <W/many/ <W/things/</Q><K/./</S>
<S><W/But/ <W/the/ <W/shoes/ <W/and/ <W/ships/
<W/and/ <W/sealing-wax/ <W/had/ <W/to/ <W/wait/<K/./</S></P>

(Mind you, handling '."' is a bit tricky.)
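
To make the mapping concrete, here is a rough Python sketch that produces
the same markup from the plain text.  It is NOT an SGML short reference
map, only an emulation of its effect; the function name mark_up, the
token pattern, and the rule for '."' (close the quotation first, then
emit the full stop, then close the sentence) are my own assumptions
for illustration.

```python
import re

# Words (letters with internal hyphens or apostrophes), punctuation, or a quote mark.
TOKEN = re.compile(r"[A-Za-z]+(?:[-'][A-Za-z]+)*|[.,;:!?]|\"")
SENTENCE_END = {".", "!", "?"}


def mark_up(text):
    """Wrap words in <W/.../, punctuation in <K/.../, quotations in <Q>, sentences in <S>."""
    tokens = TOKEN.findall(text)
    out = ["<P>"]
    in_sentence = False
    in_quote = False
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if not in_sentence:
            out.append("<S>")
            in_sentence = True
        is_word = tok[0].isalpha()
        opens_quote = tok == '"' and not in_quote
        # Separate a word or an opening quote from what precedes it,
        # except directly after an opening tag.
        if (is_word or opens_quote) and out[-1] not in ("<S>", "<Q>"):
            out.append(" ")
        if is_word:
            out.append(f"<W/{tok}/")
        elif tok == '"':
            out.append("</Q>" if in_quote else "<Q>")
            in_quote = not in_quote
        else:
            # The tricky '."' case: move the full stop outside the quotation.
            if tok in SENTENCE_END and in_quote and tokens[i + 1:i + 2] == ['"']:
                out.append("</Q>")
                in_quote = False
                i += 1  # skip the closing quote mark we just handled
            out.append(f"<K/{tok}/")
            if tok in SENTENCE_END and not in_quote:
                out.append("</S>")
                in_sentence = False
        i += 1
    # Close anything the text left open.
    if in_quote:
        out.append("</Q>")
    if in_sentence:
        out.append("</S>")
    out.append("</P>")
    return "".join(out)


if __name__ == "__main__":
    sample = ('"Now is the time", the Walrus said, "to talk of many things."\n'
              "But the shoes and ships and sealing-wax had to wait.")
    print(mark_up(sample))
```

Apart from line breaks, this should give the markup shown above for the
Walrus text.  Note that it only looks at punctuation and spelling, which
is exactly the limitation the next paragraph is about.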

Anything finer grained than that would have to be sensitive to
properties of the words.  As it happens, it _is_ "technically
possible to give name characters delimiter roles", so in principle
you could map words to arbitrary SGML forms.

However, there are some pervasive features of natural languages that
SGML does rather badly with:
 - lexical ambiguity.  Does "lie" mean "recline" or "deceive"?
   It is common for different senses of a word to have different
   grammatical properties.
 - long-distance dependencies (such as agreement).
 - extraction and free word or phrase order.
SGML DTDs are simply the wrong _kind_ of grammar to conveniently
describe natural languages.  It was finally proven about 15 years
ago that some natural languages cannot be described by context-
free grammars, and SGML is not even as powerful as general
context-free grammars.

That's a pity; there are ways in which SGML could be made context-
sensitive without making it harder to parse.

There _has_ been a bit of work on automatically learning DTDs from
examples of marked-up documents, and there is a _lot_ of work going
on trying to learn approximate natural language grammars from
corpora.

Received on Monday, 6 November 2000 16:47:45 UTC