anyone got a YACC HTML grammar?

Hello everyone,

A quick question: does anyone have a decent YACC grammar for HTML (any recent
version)?  If at all possible, I'd prefer not to get into SGML and DTD parsing
as they seem substantially over-complex.

I'm particularly interested in a bare YACC grammar (rules only, no semantic
actions) to which I can add suitable abstract syntax tree construction calls,
so that the tree fully represents the structure of a single HTML file.

This is for a final-year undergraduate project I'm supervising, looking at
building a set of flexible tools to do cross-document checking and updating
for a group of Web pages.  The sort of things I have in mind:

	-	"set a corporate style on all these pages"
		(e.g. add a body background attribute, add a logo, add a button bar)

	-	"locate all misplaced headers"
		(e.g. an h4 straight under an h2: where's the h3?)

	-	(re)number all headers.

	-	add link names to all headers.

	-	make a table of contents indexing all headers.

	-	build an index of all links for later checking.
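
To make the "misplaced headers" check above concrete, here is a minimal
sketch in Python rather than YACC. It assumes the standard library's
html.parser module is an acceptable stand-in for a real grammar; the names
HeaderCollector and misplaced_headers are made up for illustration. It flags
any header that is more than one level deeper than the header before it.

```python
from html.parser import HTMLParser


class HeaderCollector(HTMLParser):
    """Record (level, line number) for every h1..h6 start tag."""

    def __init__(self):
        super().__init__()
        self.headers = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names before calling this method.
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            line, _column = self.getpos()
            self.headers.append((int(tag[1]), line))


def misplaced_headers(html_text):
    """Return (line, previous_level, level) for each skipped header level."""
    collector = HeaderCollector()
    collector.feed(html_text)
    collector.close()

    problems = []
    previous = None  # assumption: the first header is never flagged
    for level, line in collector.headers:
        if previous is not None and level > previous + 1:
            problems.append((line, previous, level))
        previous = level
    return problems


if __name__ == "__main__":
    page = "<h1>A</h1>\n<h2>B</h2>\n<h4>C</h4>\n"
    for line, prev, level in misplaced_headers(page):
        print("line %d: h%d directly under h%d" % (line, level, prev))
```

The same collected list of headers would also be enough to drive the
numbering and table-of-contents items.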

We'd like to start with a good HTML parser, build parse trees, store them and
then investigate tree-rewriting and tree-printing routines to build up a
reusable toolkit, perhaps with tools that can be glued together like Unix
pipelines.
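
As a rough illustration of the parse / rewrite / print idea, the sketch
below builds a simple tree, applies rewriting stages in sequence, and prints
the result. It is written in Python and assumes the standard html.parser
module in place of a YACC grammar. Node, TreeBuilder, number_headers,
set_body_background and pipeline are hypothetical names, and the tree is
lossy: comments and doctype declarations are dropped.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: a small, incomplete set of elements that take no end tag.
VOID = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = list(attrs or [])
        self.children = []  # each child is a Node or a str of text


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore strays.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].children.append(data)


def parse(text):
    builder = TreeBuilder()
    builder.feed(text)
    builder.close()
    return builder.root


def walk(node):
    """Yield every element node in document order."""
    yield node
    for child in node.children:
        if isinstance(child, Node):
            yield from walk(child)


def number_headers(root):
    """Rewrite stage: prefix each h1..h6 with a dotted section number."""
    counters = [0] * 6
    for node in list(walk(root)):
        if len(node.tag) == 2 and node.tag[0] == "h" and node.tag[1] in "123456":
            level = int(node.tag[1])
            counters[level - 1] += 1
            counters[level:] = [0] * (6 - level)  # reset deeper levels
            label = ".".join(str(n) for n in counters[:level])
            node.children.insert(0, label + " ")
    return root


def set_body_background(root, url):
    """Rewrite stage: set the background attribute on every body element."""
    for node in walk(root):
        if node.tag == "body":
            node.attrs = [(k, v) for k, v in node.attrs if k != "background"]
            node.attrs.append(("background", url))
    return root


def serialize(node):
    """Print stage: turn the tree back into HTML text."""
    if isinstance(node, str):
        return escape(node, quote=False)
    inner = "".join(serialize(child) for child in node.children)
    if node.tag == "#document":
        return inner
    attrs = "".join(' {}="{}"'.format(k, escape(v or "", quote=True))
                    for k, v in node.attrs)
    if node.tag in VOID:
        return "<{}{}>".format(node.tag, attrs)
    return "<{0}{1}>{2}</{0}>".format(node.tag, attrs, inner)


def pipeline(text, *stages):
    """Parse once, apply each tree-to-tree stage in order, then print."""
    tree = parse(text)
    for stage in stages:
        tree = stage(tree)
    return serialize(tree)
```

Stages could then be chained much like a shell pipeline, for example
pipeline(page, number_headers, lambda t: set_body_background(t, "bg.gif")).
Because each stage takes a tree and returns a tree, new tools can be added
without touching the parser or the printer.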

cheers
duncan

Received on Thursday, 20 November 1997 19:43:17 UTC