Re: sgml-lex: White space in tags?

Sunil Mishra (smishra@cc.gatech.edu)
Thu, 19 Sep 1996 17:30:25 -0400 (EDT)


Date: Thu, 19 Sep 1996 17:30:25 -0400 (EDT)
Message-Id: <199609192130.RAA18855@cleon.cc.gatech.edu>
From: Sunil Mishra <smishra@cc.gatech.edu>
To: www-html@w3.org
In-reply-to: <199609192106.OAB22538@toad.com> (message from John Gilmore on
Subject: Re: sgml-lex: White space in tags?

\\ A question came up at my site about whether white space is acceptable
\\ in tags, and I was unable to figure out from the stuff I could find at
\\ the W3.org web site whether this is valid or not.
\\ 
\\ It's extremely unfortunate that HTML is based on a proprietary spec
\\ that we can't distribute online.  I hope W3C is trying to remedy this
\\ situation.  How much money would it take to pry loose the SGML spec
\\ from ISO for public distribution without restriction?  I can attempt
\\ to provide or raise this money, if they have a price.  If they refuse
\\ to permit public use at any price, I think the HTML community should
\\ duplicate the work (to the extent that we need it) and separate from
\\ the SGML community.

I believe SGML is an ISO standard (ISO 8879), and there is nothing
proprietary about it. I found more information about correctly parsing
SGML than I could handle, so much so that I had to give up on it and fall
back on the flex spec at W3C while writing a parser.

\\ I tried reading the HTML lexical analyzer to answer the question, but
\\ it uses features of flex that I've never seen before and don't
\\ understand.
\\ 
\\ Here's the specific issue:
\\ 
\\     When doing HTML anchors (links), the closing ">" on the <A HREF...> 
\\     element needs to be in contact with the rest of it:
\\ 
\\     <A HREF="/pub/join/index.html">Join EFF today</A>!
\\ 
\\     not:
\\ 
\\     <A HREF="/pub/join/index.html"
\\     >Join EFF today</A>!
\\ 
\\     Netscape is smart enough to parse the 2nd example, but many other 
\\     browsers aren't.

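The rule is that < cannot be followed directly by whitespace: the element
name has to come immediately after it. Other than that there are no
restrictions on whitespace between the name, the attributes, and the
closing >, so a line break before the > is fine. Any parser that can't
handle this is, well, broken.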
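
To make that rule concrete, here is a minimal sketch in Python. It is not
the W3C flex spec, only a simplified approximation of a start tag; the
pattern and the name is_start_tag are made up for this illustration, and
it ignores comments, declarations, entities, and SGML shorthand.

```python
import re

# Hypothetical, simplified start-tag pattern (an assumption for
# illustration, not the W3C flex specification).
START_TAG = re.compile(
    r"""<                        # tag open delimiter
        [A-Za-z][A-Za-z0-9.-]*   # element name must follow "<" immediately
        (?:
            \s+[A-Za-z][A-Za-z0-9.-]*                 # attribute name
            (?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+))?   # optional value
        )*
        \s*                      # white space, including newlines, is allowed here
        >                        # tag close delimiter
    """,
    re.VERBOSE,
)


def is_start_tag(text: str) -> bool:
    """Return True if the whole string is one start tag under this sketch."""
    return START_TAG.fullmatch(text) is not None


if __name__ == "__main__":
    samples = [
        '<A HREF="/pub/join/index.html">',    # closing ">" on the same line
        '<A HREF="/pub/join/index.html"\n>',  # closing ">" on the next line
        '< A HREF="/pub/join/index.html">',   # white space right after "<"
    ]
    for sample in samples:
        print(repr(sample), is_start_tag(sample))
```

Under this sketch the first two samples are accepted and the third is
rejected, which matches the two forms in the question.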

\\ I think this is incorrect; I hope the spec allows arbitrary white-space
\\ inside the < ... > delimiters.  But, it's sad but true, I can't find
\\ a spec for this.
\\ 
\\ Besides answering the question, can someone on this list put the
\\ answer where other people can find it?  It would be nice if a
\\ human-readable and definitive lexical standard for HTML was available,
\\ and w3.org seems like a good place to put it.

A definitive HTML parser would be good to have. It's not entirely clear
which SGML constructs are valid HTML and which are not, whether or not
they are implemented. The lexer is easy enough to understand, but its
output then has to be fed into the parser, which is not documented at all.
This is what I would like to know more about.

The W3C has standard libraries available that make this task somewhat
easier, but when parsing HTML for less standard tasks (or in other
languages) a library is simply not enough.

Sunil