Re: WWWLIB Parser, is there going to be any update on it?



Yeah, I basically agree with you. However, I don't think a general-purpose
SGML tokenizer is feasible. In some parts of an SGML document the tokenizer
is supposed to recognize all markup and entities, but inside an <XMP> only
the end tag </XMP> is recognized. This comes from the different content
types in SGML (CDATA, RCDATA, PCDATA, mixed content). So theoretically, you
cannot build an SGML tokenizer without specifying a particular DTD, and
that's why the lexical and parsing parts have to go together. Ideally, we
should have an SGML parser that takes a DTD (and data, of course) as input
and outputs a tree of SGML elements.
 
What we are talking about here is essentially an SGML tokenizer for the
common HTML DTD. Given this, the only tricky part is to make sure we
recognize only the end tag inside <XMP>, <LISTING>, and <PLAINTEXT>. So we
still need the SGML_MIXED and SGML_LITERAL distinction in the tokenizer, and
that's exactly what I have here. I stripped the ad-hoc tag-matching part out
of SGML.c and put a more SGML-style tag-matching engine further down the
stream.
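
To make the literal-mode idea concrete, here is a minimal sketch in C. It is
not the code from SGML.c: the names tokenize, Mode, MODE_MIXED, MODE_LITERAL
and the S()/E()/T() output format are all made up for illustration, and
attributes, entities, and comments are ignored. It only shows the one rule
discussed above: once a literal element is open, everything is data until
the matching end tag.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical content modes, loosely modelled on the SGML_MIXED /
   SGML_LITERAL distinction; not the actual libwww definitions. */
typedef enum { MODE_MIXED, MODE_LITERAL } Mode;

/* Case-insensitive comparison of the first n bytes of a and b. */
static int same_name(const char *a, const char *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        if (tolower((unsigned char)a[i]) != tolower((unsigned char)b[i]))
            return 0;
    return 1;
}

/* Elements whose content is treated as literal text in this sketch. */
static int is_literal_element(const char *name, size_t n) {
    static const char *const names[] = { "XMP", "LISTING", "PLAINTEXT" };
    for (size_t i = 0; i < 3; i++)
        if (strlen(names[i]) == n && same_name(name, names[i], n))
            return 1;
    return 0;
}

/* Append one token to out as KIND(text): S = start tag, E = end tag,
   T = character data. Output is truncated if out is too small. */
static void emit(char *out, size_t outsz, char kind, const char *s, size_t n) {
    size_t used = strlen(out);
    if (used < outsz)
        snprintf(out + used, outsz - used, "%c(%.*s)", kind, (int)n, s);
}

/* Tokenize in into a flat token listing written to out. */
void tokenize(const char *in, char *out, size_t outsz) {
    Mode mode = MODE_MIXED;
    char lit_name[16] = "";       /* name of the open literal element */
    size_t lit_len = 0;
    const char *p = in, *text = in;

    if (outsz == 0) return;
    out[0] = '\0';
    while (*p) {
        if (*p != '<') { p++; continue; }
        int end = (p[1] == '/');
        const char *name = p + 1 + end;
        size_t n = 0;
        while (isalnum((unsigned char)name[n])) n++;
        const char *gt = strchr(name + n, '>');
        int is_tag = (n > 0 && gt != NULL);
        if (mode == MODE_LITERAL)
            /* In literal mode only the matching end tag counts as markup. */
            is_tag = is_tag && end && n == lit_len
                     && same_name(name, lit_name, n);
        if (!is_tag) { p++; continue; }       /* this '<' is just data */
        if (p > text) emit(out, outsz, 'T', text, (size_t)(p - text));
        emit(out, outsz, end ? 'E' : 'S', name, n);
        if (mode == MODE_LITERAL) {
            mode = MODE_MIXED;
        } else if (!end && is_literal_element(name, n)) {
            mode = MODE_LITERAL;              /* n <= 9 here, fits lit_name */
            memcpy(lit_name, name, n);
            lit_len = n;
        }
        p = text = gt + 1;
    }
    if (p > text) emit(out, outsz, 'T', text, (size_t)(p - text));
}
```

With this sketch, the <B> and </B> inside an <XMP> come out as part of a
single T(...) token instead of as tags. The mode switch is keyed on a
hard-coded list of element names, which is the "tokenizer for one particular
DTD" compromise described above, not a general SGML solution.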
 
BTW, I noticed that tags like LI, DD, DT, etc. are declared SGML_EMPTY, but
I think they should be SGML_MIXED. Could someone confirm that this is a typo?
 
The Amaya parsing module seems very useful, since it conforms to the concept
of a stream. When will it be available to the public?
 
-Kim