Re: WWWLIB Parser, is there going to be any update on it?
Yeah, I basically agree with you. However, I don't think a fully general SGML
tokenizer is feasible. In SGML, there are places where the tokenizer is
supposed to recognize all markup and entity references, while inside an <XMP>
only the end tag </XMP> is recognized. This has to do with the CDATA, RCDATA,
PCDATA, and mixed-content distinctions in SGML. So, theoretically, you cannot
build an SGML tokenizer without fixing a particular DTD, and that's why the
lexical and parsing stages have to go together. Ideally, we should have an
SGML parser that takes a DTD (and data, of course) as input and outputs a
tree of SGML elements.
What we are talking about here is essentially an SGML tokenizer for the
common HTML DTD. Given that, the only tricky part is to make sure we
recognize nothing but the end tag inside <XMP>, <LISTING>, and <PLAINTEXT>.
So we still need the SGML_MIXED and SGML_LITERAL distinction in the
tokenizer, and that's exactly what I have here: I stripped out the ad-hoc
tag-matching part in SGML.c and put a more SGML-style tag-matching engine
downstream.
BTW, I do notice that tags like LI, DD, DT, etc. are declared SGML_EMPTY
when they should be SGML_MIXED. Could someone confirm that this is a typo?
The Amaya parsing module seems very useful, since it fits the stream model.
When will it be made available to the public?