- From: Kim Liu <KLIU@us.oracle.com>
- Date: 14 Aug 96 11:04:58 -0700
- To: www-lib@w3.org, msa@hemuli.tte.vtt.fi
Yeah, I basically agree with you. However, I think the idea of an SGML tokenizer is not feasible in the general sense. In SGML, there are places where the tokenizer is supposed to recognize all the markup and entities, yet inside an <XMP> only the end tag </XMP> is recognized. This has to do with the CDATA, RCDATA, PCDATA, and mixed-content declarations in SGML. So theoretically, you cannot build an SGML tokenizer without specifying a particular DTD, and that's why the parsing and lexical parts have to go together.

Ideally, we should have an SGML parser that takes a DTD (and data, of course) as input and outputs a tree of SGML elements. What we are talking about here is essentially an SGML tokenizer for the common HTML DTD. Given this, the only tricky part is to make sure we recognize only the end tag inside an <XMP>, <LISTING>, or <PLAINTEXT>. So we still need the SGML_MIXED and SGML_LITERAL distinction in the tokenizer, and that's exactly what I have here: I stripped out the ad-hoc tag-matching part in SGML.c and put a more SGML-style tag-matching engine downstream.

BTW, I do notice that tags like LI, DD, DT, etc. should be SGML_MIXED instead of SGML_EMPTY. Could someone confirm that this is a typo?

The Amaya parsing module seems to be very useful since it conforms to the concept of a stream. When will it be available to the public?

-Kim
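
To make the content-model point concrete, here is a minimal, self-contained sketch (hypothetical code, not the real SGML.c; SGMLContent, TagInfo, eqci() and tokenize() are invented names for illustration) of a tokenizer that switches between SGML_MIXED and SGML_LITERAL handling: inside a LITERAL element such as <XMP>, everything is character data until the matching end tag.

/*
 * Hypothetical sketch only -- not the real SGML.c.  Shows why the
 * tokenizer must know each element's content model before it can
 * decide whether '<' starts markup or is just data.
 */
#include <stdio.h>
#include <ctype.h>

typedef enum { SGML_EMPTY, SGML_MIXED, SGML_LITERAL } SGMLContent;

typedef struct { const char *name; SGMLContent content; } TagInfo;

/* Tiny stand-in for the element table a DTD would supply. */
static const TagInfo tag_table[] = {
    { "XMP",     SGML_LITERAL },
    { "LISTING", SGML_LITERAL },
    { "P",       SGML_MIXED   },
    { "LI",      SGML_MIXED   },   /* MIXED rather than EMPTY, as argued above */
    { "BR",      SGML_EMPTY   },
};

/* Case-insensitive string equality. */
static int eqci(const char *a, const char *b)
{
    while (*a && *b && tolower((unsigned char)*a) == tolower((unsigned char)*b))
        a++, b++;
    return *a == '\0' && *b == '\0';
}

static const TagInfo *lookup(const char *name)
{
    size_t i;
    for (i = 0; i < sizeof tag_table / sizeof tag_table[0]; i++)
        if (eqci(tag_table[i].name, name))
            return &tag_table[i];
    return 0;
}

/* Print STAG/ETAG/DATA tokens.  While "literal" is set, the only markup
 * recognized is the matching end tag; any other '<' is plain data. */
static void tokenize(const char *s)
{
    const char *literal = 0;                    /* LITERAL element we must close */

    while (*s) {
        if (*s == '<') {
            char name[32];
            int end = (s[1] == '/');
            const char *p = s + (end ? 2 : 1);
            size_t n = 0;

            while (isalnum((unsigned char)*p) && n < sizeof name - 1)
                name[n++] = *p++;
            name[n] = '\0';

            if (literal && !(end && eqci(name, literal))) {
                printf("DATA '%c'\n", *s++);    /* '<' is just data here */
                continue;
            }

            printf("%s <%s%s>\n", end ? "ETAG" : "STAG", end ? "/" : "", name);

            if (!end) {
                const TagInfo *t = lookup(name);
                if (t && t->content == SGML_LITERAL)
                    literal = t->name;          /* enter literal mode */
            } else {
                literal = 0;                    /* leave literal mode */
            }

            while (*p && *p != '>')             /* skip rest of the tag */
                p++;
            s = *p ? p + 1 : p;
        } else {
            printf("DATA '%c'\n", *s++);
        }
    }
}

int main(void)
{
    /* <B> and </B> inside <XMP> come out as DATA, not as tags. */
    tokenize("<P>hi<XMP>x < y <B>bold</B></XMP><BR>");
    return 0;
}

Running this on the string in main() reports <B> and </B> inside the <XMP> as plain character data, which is the behaviour the SGML_LITERAL entries are meant to capture; swapping tag_table for the element list of a real HTML DTD is what would make the same loop DTD-driven.
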
Received on Wednesday, 14 August 1996 14:32:29 UTC