- From: Richard A. O'Keefe <ok@atlas.otago.ac.nz>
- Date: Tue, 7 Nov 2000 10:47:05 +1300 (NZDT)
- To: html-tidy@w3.org, johnspeter@hotmail.com
Peter Johns <johnspeter@hotmail.com> wrote: I hear you laughing out loud, but do you know about research in this topic? I mean, like building the DTD of the whole English Language, or at least a good attempt at it? The English language as such does not contain tags. In full SGML there is a feature called "short references" that you can use to do clever things with punctuation. Basically, you can say "if you see this string inside this element, turn it into a reference to that entity", and of course that entity can expand to pretty much anything. For example, you could have +---------------------------------------------------------------+ | | |"Now is the time", the Walrus said, "to talk of many things." | |But the shoes and ships and sealing-wax had to wait. | | | +---------------------------------------------------------------+ automatically mapped to <P><S><Q><W/Now/ <W/is/ <W/the/ <W/time/</Q><K/,/ <W/the/ <W/Walrus/ <W/said/<K/,/ <Q><W/to/ <W/talk/ <W/of/ <W/many/ <W/things/</Q><K/./</S> <S><W/But/ <W/the/ <W/shoes/ <W/and/ <W/ships/ <W/and/ <W/sealing-wax/ <W/had/ <W/to/ <W/wait<K/./</S></P> (Mind you, handling '."' is a bit tricky.) Anything finer grained than that would have to be sensitive to properties of the words. As it happens, it _is_ "technically possible to give name characters delimiter roles", so in principle you could map words to arbitrary SGML forms. However, there are some pervasive features of natural languages that SGML does rather badly with: - lexical ambiguity. Does "lie" mean "recline" or "deceive"? It is common for different senses of a word to have different grammatical properties. - long-distance dependencies (such as agreement). - extraction and free word or phrase order SGML DTDs are simply the wrong _kind_ of grammar to conveniently describe natural languages. It was finally proven about 15 years ago that some natural languages cannot be described by context- free grammars, and SGML is not even as powerful as general context-free grammars. That's a pity; there are ways in which SGML could be made context- sensitive without making it harder to parse. There _has_ been a bit of work on automatically learning DTDs from examples of marked-up documents, and there is a _lot_ of work going on trying to learn approximate natural language grammars from corpora.
Received on Monday, 6 November 2000 16:47:45 UTC