- From: Chris Lovett <chris@walkaboutsoft.com>
- Date: Fri, 19 Jan 1996 23:13:43 -0800
- To: www-html@w3.org
Hi.

I've been playing with Daniel Connolly's FLEX grammar for SGML, which he posted in his paper "A Lexical Analyzer for HTML and Basic SGML" at http://www.w3.org/pub/WWW/MarkUp/SGML/sgml-lex/sgml-lex.html. I've been using it in a C++ FLEX skeleton created by version 2.5.1 of the FLEX app for the Macintosh, written by Jeff Laing of Tristero Computer Systems, and compiling it with CodeWarrior from Metrowerks. Both the FLEX tool and the lexical analyzer work quite nicely, and the analyzer seems to parse successfully everything I've thrown at it so far - including all the tricky stuff in Dan's paper.

Dan thought it would be a good idea for me to let you all know about a small bug I found in his FLEX grammar, to do with identifying auxiliary processing instructions of the form "<?xxx>". The symptom is that the lexer identifies most processing instructions as SGML_DATA, which is incorrect. Note: Netscape versions 1.1 and 2.0b4 both display SGML_PI tags as data - try it out yourself.

To identify SGML_PI tokens, the LEX spec contains the following rule:

    {PIO}[^>]*{PIC} { ... }

where PIO is "<?" and PIC is ">". The problem comes from another rule, the one for identifying SGML_DATA:

    ([^<&]|(<[^<&a-zA-Z!->])|(&[^<&#a-zA-Z]))+|. { ... }

The middle part of this, "(<[^<&a-zA-Z!->])", says that any string beginning with "<" followed by any character other than those in "<&a-zA-Z!->" is SGML_DATA. Since "?" is not in that list, "<?" is identified as SGML_DATA. Adding "?" to the list of excluded characters fixes the problem.

I've also extended Dan's grammar to include special support for the HTML tags "<xmp>" and "<plaintext>" (which was rather painful), and I have this hooked up to an HTML yacc parser, which seems to work nicely.

PS: If anyone has an HTML 3.0 yacc grammar I would be very interested.

Chris Lovett - Walkabout Software
chris@walkaboutsoft.com / www.walkaboutsoft.com
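
For anyone who wants to try the effect of the fix without a FLEX toolchain, here is a minimal sketch in Python rather than FLEX. It makes several assumptions: the two rules are transcribed into Python's `re` syntax exactly as they appear in this message, the SGML_PI rule is assumed to come before the SGML_DATA rule, and the helper `first_token` is a hypothetical stand-in that imitates FLEX's "longest match wins, ties go to the earlier rule" behaviour for just these two rules. It is not Dan's lexer, only an illustration of why "<?" was swallowed as data.

```python
import re

# SGML_PI rule: {PIO}[^>]*{PIC} with PIO = "<?" and PIC = ">".
PI = re.compile(r"<\?[^>]*>")

# SGML_DATA rule as quoted in the message (character classes copied verbatim).
DATA_ORIGINAL = re.compile(r"(?:[^<&]|<[^<&a-zA-Z!->]|&[^<&#a-zA-Z])+|.")

# Same rule with "?" added to the characters that may not follow "<".
DATA_FIXED = re.compile(r"(?:[^<&]|<[^<&a-zA-Z!->?]|&[^<&#a-zA-Z])+|.")


def first_token(text, data_rule):
    """Pick the first token the way FLEX would for these two rules:
    the longest match wins, and a tie goes to the rule listed first.
    The rule order (PI before DATA) is an assumption."""
    rules = [("SGML_PI", PI), ("SGML_DATA", data_rule)]
    best_name, best_lexeme = None, ""
    for name, pattern in rules:
        m = pattern.match(text)
        # Strict ">" keeps the earlier rule when two matches have equal length.
        if m and len(m.group()) > len(best_lexeme):
            best_name, best_lexeme = name, m.group()
    return best_name, best_lexeme


if __name__ == "__main__":
    sample = "<?xxx>hello"
    print(first_token(sample, DATA_ORIGINAL))
    print(first_token(sample, DATA_FIXED))
```

With the original rule the data pattern can start at "<?" and run on past the closing ">", so it produces a longer match than the SGML_PI rule. With "?" excluded, the data pattern cannot start at "<?" at all, and the SGML_PI rule is the only one that matches the whole instruction.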
Received on Saturday, 20 January 1996 02:11:33 UTC