Comments on "A Lexical Analyser for HTML and Basic SGML"

David Megginson
Wed, 24 Jan 1996 07:15:59 -0500

Date: Wed, 24 Jan 1996 07:15:59 -0500
Subject: Comments on "A Lexical Analyser for HTML and Basic SGML"
From: David Megginson <>

I have just read Dan Connolly's report, and I think that it is quite
well done.  He is absolutely right to take a pragmatic and tolerant
approach to parsing SGML-based documents, and I think that his
analyser will provide information at exactly the level of abstraction
required by parser designers.  I would, however, like to make a few
suggestions:

1) It seems unnecessary to ban the DTD subset altogether, since this
 is the logical place to declare entities.  Why not allow the subset,
 but limit its contents to <!ENTITY...> and <!NOTATION...> declarations?
 I realise that there is a danger in allowing authors to define
 parameter entities in the subset, since those can affect the structure
 of the DTD, but browsers are free to ignore such fiddling.  As a
 compromise, you could allow only internal entities or (optionally)
 external data entities with the URL as the system identifier.

 In fact, it would not even be necessary to return any of the DTD
 subset information directly to the caller -- instead, you could simply
 store it and feed it out when queried
 (i.e., lookup_general_entity("foo");).
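
 As a rough illustration of the store-and-query idea in point 1, here
 is a minimal Python sketch.  Only the name lookup_general_entity comes
 from the text above; the EntityTable class, the read_subset and
 lookup_parameter_entity names, and the simplified regular expression
 are assumptions made for the example, and only internal entities with
 quoted literal values are handled.

```python
import re

# Hypothetical pattern: matches only internal entity declarations such as
# <!ENTITY foo "bar"> or <!ENTITY % draft "IGNORE">; nothing else is handled.
ENTITY_DECL = re.compile(
    r'<!ENTITY\s+(%\s+)?([A-Za-z][A-Za-z0-9.-]*)\s+(?:"([^"]*)"|\'([^\']*)\')\s*>'
)


class EntityTable:
    """Stores entity declarations from the DTD subset for later lookup."""

    def __init__(self):
        self.general = {}
        self.parameter = {}

    def read_subset(self, subset_text):
        # Record each declaration silently; no event reaches the caller.
        for match in ENTITY_DECL.finditer(subset_text):
            is_parameter, name = match.group(1), match.group(2)
            value = match.group(3) if match.group(3) is not None else match.group(4)
            table = self.parameter if is_parameter else self.general
            # Keep the first declaration of a name and ignore later ones.
            table.setdefault(name, value)

    def lookup_general_entity(self, name):
        # Returns None when the entity was never declared.
        return self.general.get(name)

    def lookup_parameter_entity(self, name):
        return self.parameter.get(name)
```

 A caller would create one EntityTable per document, pass the text of
 the DTD subset to read_subset, and then query it whenever an entity
 reference turns up in the content.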

2) Marked sections might be too useful to leave out: all you need to
 do is return events for the start and the end of a marked section, and
 keep track yourself of the special cases when you are parsing CDATA,
 RCDATA, or an ignored section.  If you allow parameter entities to be
 declared in the DTD subset, then you will be able to look them up
 yourself and decide on the type of the marked section.
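
 The event scheme in point 2 could look something like the following
 Python sketch.  The function names, the event names ("marked_start",
 "data", "marked_end"), and the plain dictionary of parameter entities
 are all assumptions for illustration.  Nested marked sections are not
 handled, and an undeclared parameter entity is simply treated as empty.

```python
import re

# Opening delimiter of a marked section: "<![", a status spec, then "[".
MS_OPEN = re.compile(r'<!\[\s*([^\[]*?)\s*\[')
MS_CLOSE = ']]>'

# When several status keywords apply, SGML gives them this priority.
PRIORITY = ["IGNORE", "CDATA", "RCDATA", "INCLUDE"]


def resolve_status(spec, parameter_entities):
    """Expand %name; references in the spec and pick the effective keyword."""
    keywords = []
    for token in spec.split():
        if token.startswith('%'):
            name = token[1:].rstrip(';')
            # Assumption of this sketch: undeclared entities expand to nothing.
            keywords.extend(parameter_entities.get(name, "").upper().split())
        else:
            keywords.append(token.upper())
    for keyword in PRIORITY:
        if keyword in keywords:
            return keyword
    return "INCLUDE"  # no keyword at all means INCLUDE


def scan(text, parameter_entities):
    """Yield (event, payload) tuples for data and marked sections."""
    pos = 0
    while pos < len(text):
        match = MS_OPEN.search(text, pos)
        if match is None:
            break
        if match.start() > pos:
            yield ("data", text[pos:match.start()])
        status = resolve_status(match.group(1), parameter_entities)
        end = text.find(MS_CLOSE, match.end())
        if end == -1:
            # Unterminated section: treat the rest of the text as its body.
            body_end = next_pos = len(text)
        else:
            body_end, next_pos = end, end + len(MS_CLOSE)
        yield ("marked_start", status)
        if status != "IGNORE":
            # A real analyser would switch off markup recognition here
            # for CDATA and RCDATA; this sketch passes the body through.
            yield ("data", text[match.end():body_end])
        yield ("marked_end", status)
        pos = next_pos
    if pos < len(text):
        yield ("data", text[pos:])
```

 A browser that does not care about marked sections can drop the
 "marked_start" and "marked_end" events and still see the right data,
 since ignored content never reaches it.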

If the analyser handles most of this internally (storing entity
values, etc.), these suggestions would increase the complexity of the
user interface only _very_ slightly, by introducing functions to
look up the values and types of entities and by introducing events for
the beginning and end of marked sections (the browser could simply
ignore these, since the analyser will know how to do the right thing
with their contents).  Some browsers might even want to allow users to
set their own parameter entities (they could do so at the beginning of
the parse) for certain standard types of marked sections:

    <p>Here is a picture of Newt's butt <img src="newtbutt.gif"></p>
    <p>Here are the lyrics to all of the songs from

(I'd be more concerned with protecting my daughters from the second

  <!ENTITY % inanity "IGNORE">
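
One way a browser might let users set such entities at the beginning
of the parse is sketched below in Python.  It relies on the SGML rule
that the first declaration of an entity wins, so values installed
before the document is read take precedence over the document's own.
The function name and the data shapes are assumptions for illustration.

```python
def effective_parameter_entities(user_presets, document_declarations):
    """Merge entity values so that user presets take precedence.

    user_presets is a dict of name -> value set before the parse;
    document_declarations is a list of (name, value) pairs in the
    order they appear in the document's DTD subset.
    """
    merged = dict(user_presets)
    for name, value in document_declarations:
        # First declaration wins, so a preset is never overwritten.
        merged.setdefault(name, value)
    return merged
```

With a preset of {"inanity": "IGNORE"}, a document that declares
inanity as "INCLUDE" would still have those sections ignored.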


David Megginson
Department of English, University of Ottawa
Ottawa, Ontario, CANADA  K1N 6N5
Phone: (613) 562-5800 ext.1203
FAX: (613) 562-5990