HTML/SGML parsing (re: sgml-lex)

Jim Taylor (JHTaylor@videodiscovery.com)
Wed, 10 Jul 1996 18:19:14 -0800


Message-Id: <s1e3f52e.009@videodiscovery.com>
Date: Wed, 10 Jul 1996 18:19:14 -0800
From: Jim Taylor <JHTaylor@videodiscovery.com>
To: connolly@beach.w3.org
Cc: www-html@w3.org
Subject: HTML/SGML parsing (re: sgml-lex)

I urge anyone writing a browser or any sort of HTML parser to read Dan
Connolly's excellent "A Lexical Analyzer for HTML and Basic SGML" (a
work in progress at
http://www.w3.org/pub/WWW/TR/WD-sgml-lex). (If you haven't read it
since June 1996, go read it again -- it has improved substantially
from earlier versions.)

As recommended in the document, I'm addressing my comments to this
group with "sgml-lex" in the subject.

It's unclear to me exactly why the tag <abc_def> is an error but
<abc def> (a tag with a value) is not. I'm sure this stems from my
limited knowledge of SGML, but since I'm probably a fair
representative of the intended audience of this document, perhaps
this should be clarified.

The question is, what separates tag names from attribute
specifications? Does SGML explicitly state that attribute
specifications must be delimited by whitespace, or can any unexpected
character act as delimiter? 

And what exactly is whitespace? SPACE, RE, RS, and SEPCHAR only? Does
that mean
<abc
def>
is ok? If not, this could be dangerous if an editor or other process
hard wraps HTML at spaces. 
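
To make the question concrete, here is a minimal Python sketch of how
I imagine a parser might split a start-tag into a name and the rest.
It rests on two assumptions of mine, not on anything stated in the
sgml-lex draft: that the separator characters are SPACE, RE (13),
RS (10), and TAB as the only SEPCHAR, and that names consist of
letters, digits, "." and "-" with no underscore. The function name
split_start_tag is hypothetical.

```python
import re

# Assumption: separators are SPACE (32), RE (13), RS (10), and
# SEPCHAR = TAB (9), as in the reference concrete syntax.
S_CHARS = " \r\n\t"

# Assumption: a name starts with a letter and continues with letters,
# digits, "." or "-". Underscore is deliberately absent.
NAME = re.compile(r"[A-Za-z][A-Za-z0-9.-]*")


def split_start_tag(tag):
    """Split '<name rest>' into (name, rest).

    Raises ValueError if the name is followed by a character that is
    neither a name character nor a separator.
    """
    if not (tag.startswith("<") and tag.endswith(">")):
        raise ValueError("not a start-tag: %r" % tag)
    body = tag[1:-1]
    m = NAME.match(body)
    if m is None:
        raise ValueError("no tag name in %r" % tag)
    name, rest = m.group(0), body[m.end():]
    if rest and rest[0] not in S_CHARS:
        raise ValueError("unexpected %r after name %r" % (rest[0], name))
    return name, rest.strip(S_CHARS)
```

Under these assumptions "<abc def>" and the two-line form above both
split into the name "abc" and the remainder "def", while "<abc_def>"
is rejected because "_" is neither a name character nor a separator.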

The flex input file seems to indicate that spaces but not other
whitespace can come after the attribute name and the =. Is this part
of SGML syntax?
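
For comparison, this is a rough Python sketch of the more permissive
reading, in which any separator (not just a space) may surround the
"=". Whether SGML actually allows this is exactly what I'm asking;
the pattern is my own illustration and is not taken from the flex
file, and parse_attr is a hypothetical name.

```python
import re

# Assumption: separators are SPACE, RE (13), RS (10), and TAB.
S = r"[ \r\n\t]*"
NAME = r"[A-Za-z][A-Za-z0-9.-]*"
# A value is double-quoted, single-quoted, or a bare run of name characters.
VALUE = r'"[^"]*"' + "|" + r"'[^']*'" + "|" + r"[A-Za-z0-9.-]+"
ATTR = re.compile("(" + NAME + ")" + S + "=" + S + "(" + VALUE + ")")


def parse_attr(text):
    """Parse one 'name = value' specification into (name, value)."""
    m = ATTR.fullmatch(text)
    if m is None:
        raise ValueError("not an attribute specification: %r" % text)
    name, value = m.group(1), m.group(2)
    if value[0] in "\"'":
        value = value[1:-1]  # drop the surrounding quotes
    return name, value
```

With this pattern, 'href = "x.html"' and the same text broken across
lines at the "=" parse identically. A lexer that accepts only spaces
there would reject the second form.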

Dan, it might be helpful to add a short discussion of whitespace in
the context of SGML and HTML, including where it's
appropriate/necessary and where it's not.

______________________________________________
Jim "The Frog" Taylor, Director of Information Technology
<mailto:jhtaylor@videodiscovery.com>
Videodiscovery, Inc. - Multimedia Education for Science and Math
Seattle, WA, 206-285-5400 <http://www.videodiscovery.com/vdyweb>