HTML/SGML parsing (re: sgml-lex)

Jim Taylor
Wed, 10 Jul 1996 18:19:14 -0800

I urge anyone writing a browser or any sort of HTML parser to read Dan
Connolly's excellent "A Lexical Analyzer for HTML and Basic SGML" (a
work in progress at (If you haven't read it
since June 1996, go read it again -- it has improved substantially
from earlier versions.)

As recommended in the document, I'm addressing my comments to this
group with "sgml-lex" in the subject.

It's unclear to me exactly why the tag <abc_def> is an error but <abc
def> (a tag with a value) is not. I'm sure this is from my limited
knowledge of SGML, but since I'm probably a fair representative of
the intended audience of this document, perhaps this should be
clarified.

The question is, what separates tag names from attribute
specifications? Does SGML explicitly state that attribute
specifications must be delimited by whitespace, or can any unexpected
character act as delimiter? 
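
To make the question concrete, here is a rough Python sketch of how I
picture the scan. It assumes the SGML reference concrete syntax: names
start with a letter and continue with letters, digits, "." or "-", and
the separators are SPACE, RE, RS, and TAB (as SEPCHAR). The function
name split_tag_name and its three labels are my own invention for
illustration, not anything taken from Dan's lexer.

```python
import string

# Assumption: SGML reference concrete syntax.
NAME_START = set(string.ascii_letters)
NAME_CHARS = NAME_START | set(string.digits) | {".", "-"}
# SPACE, RE (carriage return), RS (line feed), SEPCHAR (tab)
SEPARATORS = {" ", "\r", "\n", "\t"}


def split_tag_name(tag_body):
    """Split the text between '<' and '>' into (name, kind, rest).

    kind says what ended the name: 'end' (nothing follows),
    'separator' (a separator character), or 'other' (anything else).
    """
    if not tag_body or tag_body[0] not in NAME_START:
        return "", "other", tag_body
    i = 1
    while i < len(tag_body) and tag_body[i] in NAME_CHARS:
        i += 1
    name, rest = tag_body[:i], tag_body[i:]
    if not rest:
        kind = "end"
    elif rest[0] in SEPARATORS:
        kind = "separator"
    else:
        kind = "other"
    return name, kind, rest


# Hypothetical tag bodies: a space, an underscore, and a line break.
for body in ("abc def", "abc_def", "abc\ndef"):
    print(repr(body), split_tag_name(body))
```

Under these assumptions the scan of abc_def stops at "_", which is
neither a name character nor a separator, while a space and a line
break both count as separators. Whether a parser must treat the first
case as an error is exactly what I'm asking.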

And what exactly is whitespace? SPACE, RE, RS, and SEPCHAR only? Does
that mean
is ok? If not, this could be dangerous if an editor or other process
hard wraps HTML at spaces. 

The flex input file seems to indicate that spaces but not other
whitespace can come after the attribute name and the =. Is this part
of SGML syntax?

Dan, it might be helpful to add a short discussion of whitespace in
the context of SGML and HTML, including where it's
appropriate/necessary and where it's not.

Jim "The Frog" Taylor, Director of Information Technology
Videodiscovery, Inc. - Multimedia Education for Science and Math
Seattle, WA, 206-285-5400