Re: Comments on "A Lexical Analyser for HTML and Basic SGML"

Daniel W. Connolly (
Wed, 24 Jan 1996 11:36:32 -0500

In-Reply-To: Your message of "Wed, 24 Jan 1996 07:15:59 EST."

In message <>, David Megginson writes:
> I think that his
>analyser will provide information at exactly the level of abstraction
>required by parser designers.

I've had some feedback on the API. It's going to change a little bit, but
not much.

Mostly, it needs to provide lossless parsing. Case folding is optional,
and whitespace trimming will become optional. This allows for what
I call "structured stream editing" -- changing all the links in
a document, for example (without changing anything else).
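
To make "structured stream editing" concrete, here is a minimal
Python sketch. It is not the analyzer from the report: the names
TOKEN, HREF, tokenize and rewrite_links are made up for illustration,
and the token pattern is a deliberate simplification that ignores
comments, declarations and ">" inside quoted attribute values. The
only point is the lossless property: every input character lands in
exactly one token, so joining the tokens gives back the input, and an
edit touches only the tokens it means to touch.

```python
import re

# Hypothetical token pattern: a start tag, an end tag, a run of data,
# or a stray "<". Every input character falls into exactly one token.
TOKEN = re.compile(r"<[A-Za-z][^>]*>|</[A-Za-z][^>]*>|[^<]+|<")

# An HREF attribute with a double-quoted value, matched in any case.
HREF = re.compile(r'(\bhref\s*=\s*")([^"]*)(")', re.IGNORECASE)


def tokenize(text):
    """Split text into tokens whose concatenation equals the input."""
    return TOKEN.findall(text)


def rewrite_links(text, old_prefix, new_prefix):
    """Change link targets in start tags; copy everything else verbatim."""
    def fix(match):
        url = match.group(2)
        if url.startswith(old_prefix):
            url = new_prefix + url[len(old_prefix):]
        # Groups 1 and 3 are re-emitted untouched, so the attribute
        # name's case and the whitespace around "=" are preserved.
        return match.group(1) + url + match.group(3)

    out = []
    for tok in tokenize(text):
        is_start_tag = len(tok) > 1 and tok[0] == "<" and tok[1] != "/"
        out.append(HREF.sub(fix, tok) if is_start_tag else tok)
    return "".join(out)
```

Because no case folding or whitespace trimming happens along the way,
"".join(tokenize(doc)) == doc holds for any input string, and
rewrite_links changes only the matching HREF values.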

>1) It seems unnecessary to ban the DTD subset altogether,

On the other hand, it's unnecessary to support a DTD subset in the
document entity, since it can always be accommodated in a separate
entity. This, combined with the "Keep it simple" principle (aka
Occam's razor) and the fact that the deployed base of HTML user agents
doesn't grok this today, convinced me that this is not an issue to tackle
just yet.

> since this
> is the logical place to declare entities.  Why not allow the subset,
> but limit its contents to <!ENTITY..> and <!NOTATION...> declarations?

Note that this is in the "future work" section. I just updated that section,
and a few related sections. Have a look:

$Date: 1996/01/24 16:34:03 $ 

Marked Sections

Support for marked sections is an integral part of a strategy for
interoperability among HTML user agents supporting different HTML
dialects[HTMLDIALECT]. It has other valuable applications, and it is
a straightforward addition to the lexical analyzer in this report.
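
As a rough illustration of what such an addition involves, the
following Python sketch resolves marked sections in a string. It is
an assumption-laden toy, not the flex specification from the report:
the names MARKED and resolve_marked_sections are hypothetical, only
the literal keywords INCLUDE and IGNORE are recognized, and nesting
and parameter entity references in the keyword position (the usual
way a dialect switch is expressed) are left out.

```python
import re

# Hypothetical sketch: literal INCLUDE/IGNORE keywords only, no nesting,
# no parameter entity references in the keyword position.
MARKED = re.compile(r"<!\[\s*(INCLUDE|IGNORE)\s*\[(.*?)\]\]>", re.DOTALL)


def resolve_marked_sections(text):
    """Keep the body of INCLUDE sections and drop IGNORE sections."""
    def pick(match):
        keyword, body = match.group(1), match.group(2)
        return body if keyword == "INCLUDE" else ""

    return MARKED.sub(pick, text)
```

A user agent that understands a given dialect would treat the
corresponding sections as INCLUDE, and one that does not would treat
them as IGNORE; text outside marked sections passes through unchanged.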


Support for character encodings and coded character sets other than
ASCII is a requirement for production use. Support for the X Windows
compound text encoding (related to ISO-2022) and the UTF-8 or perhaps
UCS-2 encoding of Unicode (ISO-10646), with extensibility for other
character encodings, seems most desirable.

Internal declaration subset support

Internal declaration subsets are not expected to become a part of
HTML. But the technology in this report is applicable to other SGML
applications, and internal declaration subsets are a straightforward
addition to this lexical analyzer. Relevant mechanisms include:

	General entity declarations with URIs as system identifiers

	General entity declarations as "macros"

	Parameter entity declarations for "switches" and "hooks"
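
The "macros" mechanism in the list above can be sketched in a few
lines of Python. This is only an illustration under stated
assumptions: the names ENTITY_DECL, ENTITY_REF, read_entities and
expand are hypothetical, only general entities with double-quoted
literal text are handled, and external identifiers and parameter
entities are skipped.

```python
import re

# Hypothetical sketch: general entities with double-quoted literals only.
# Parameter entity declarations ("<!ENTITY % name ...>") do not match,
# because "%" cannot start the name.
ENTITY_DECL = re.compile(r'<!ENTITY\s+([A-Za-z][A-Za-z0-9.-]*)\s+"([^"]*)"\s*>')
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")


def read_entities(subset):
    """Map each declared general entity name to its replacement text."""
    entities = {}
    for name, value in ENTITY_DECL.findall(subset):
        # In SGML the first declaration of an entity is the binding one.
        entities.setdefault(name, value)
    return entities


def expand(text, entities):
    """Replace known entity references; leave unknown ones untouched."""
    return ENTITY_REF.sub(
        lambda m: entities.get(m.group(1), m.group(0)), text
    )
```

References to names that were never declared are copied through
as-is, so a later stage can still report or resolve them.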


>2) Marked sections might be too useful to leave out.


The only reason that they're not in there yet is that I want
to concentrate on the bugs in existing HTML parsers before I start
adding new stuff.

Again, see "future work."