Re: Daniel Connolly's SGML Lex Specification

Hi.  I've been playing with Daniel Connolly's FLEX grammar for SGML which
he posted in his paper:
"A Lexical Analyzer for HTML and Basic SGML"
at:
"http://www.w3.org/pub/WWW/MarkUp/SGML/sgml-lex/sgml-lex.html".

I've been using it in a C++ FLEX skeleton created by version 2.5.1 of the
FLEX app for the Macintosh written by Jeff Laing of Tristero Computer
Systems.  I've been compiling it with CodeWarrior from Metrowerks.  Both
the FLEX tool and the lexical analyzer work quite nicely, and the analyzer
has successfully parsed everything I've thrown at it so far, including
all the tricky stuff in Dan's paper.

Dan thought it would be a good idea for me to let you all know about a nit
of a bug I found in his FLEX grammar to do with identifying auxiliary
processing instructions of the form "<?xxx>".  The symptom is that the
parser identifies most processing instructions as SGML_DATA, which is
incorrect.  Note: NETSCAPE versions 1.1 and 2.0b4 both display SGML_PI tags
as data - try it out yourself.

To identify SGML_PI tokens, the LEX spec contains the following statement:

{PIO}[^>]*{PIC} { ... }

where PIO is "<?" and PIC is ">".  The problem comes from another statement
for identifying SGML_DATA which is:

([^<&]|(<[^<&a-zA-Z!->])|(&[^<&#a-zA-Z]))+|.  { ... }

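The middle part of this, "(<[^<&a-zA-Z!->])", says that any string beginning
with "<" followed by any character other than those listed in "<&a-zA-Z!->"
is SGML_DATA.  "?" is not in that list, so "<?" is accepted as the start of
SGML_DATA, and since lex prefers the longest match, the data rule typically
wins whenever ordinary text follows the processing instruction.  Adding "?"
to the list of excluded characters fixes the problem.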
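
For anyone who wants to see the effect without building the scanner, here
is a rough sketch in Python rather than FLEX.  It is not Dan's code: the
tokenize() helper and the constant names are made up for illustration, the
two patterns are transcribed from the rules quoted above into Python "re"
syntax, and listing the SGML_PI rule before the SGML_DATA rule is an
assumption.  The loop imitates lex by taking the longest match and breaking
ties in favour of the earlier rule.

```python
import re

# Patterns transcribed from the two lex rules quoted above (PIO = "<?", PIC = ">").
PI = r"<\?[^>]*>"
DATA_ORIGINAL = r"(?:[^<&]|<[^<&a-zA-Z!->]|&[^<&#a-zA-Z])+|."
# Same rule with "?" added to the characters that may not follow "<".
DATA_FIXED = r"(?:[^<&]|<[^<&a-zA-Z!->?]|&[^<&#a-zA-Z])+|."


def tokenize(text, data_pattern):
    """Hypothetical helper: split text into (token_name, lexeme) pairs."""
    # Assumed rule order: SGML_PI first, SGML_DATA second.
    rules = [
        ("SGML_PI", re.compile(PI, re.DOTALL)),
        ("SGML_DATA", re.compile(data_pattern, re.DOTALL)),
    ]
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_end = None, pos
        for name, rule in rules:
            match = rule.match(text, pos)
            # Longest match wins; on a tie the earlier rule is kept.
            if match and match.end() > best_end:
                best_name, best_end = name, match.end()
        tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens


if __name__ == "__main__":
    sample = "<?xxx>hello"
    print(tokenize(sample, DATA_ORIGINAL))
    print(tokenize(sample, DATA_FIXED))
```

With the original data rule the whole of "<?xxx>hello" comes back as a
single SGML_DATA token; with "?" excluded, "<?xxx>" is reported as SGML_PI
and only "hello" is left as SGML_DATA.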

I've also extended Dan's grammar to include special support for the HTML
tags "<xmp>" and "<plaintext>" (which was rather painful) and I have this
hooked up to an HTML yacc parser which seems to work nicely.

PS: If anyone has an HTML 3.0 yacc grammar I would be very interested.

Chris Lovett - Walkabout Software
chris@walkaboutsoft.com / www.walkaboutsoft.com

Received on Saturday, 20 January 1996 02:11:33 UTC